
Appl Intell
DOI 10.1007/s10489-014-0535-z

Real-time 3D human pose recovery from a single depth
image using principal direction analysis
Dong-Luong Dinh · Myeong-Jun Lim ·
Nguyen Duc Thang · Sungyoung Lee · Tae-Seong Kim

© Springer Science+Business Media New York 2014

Abstract In this paper, we present a novel approach to
recover a 3D human pose in real-time from a single depth
image using principal direction analysis (PDA). Human
body parts are first recognized from a human depth silhouette via trained random forests (RFs). PDA is applied to each
recognized body part, which is presented as a set of points in
3D, to estimate its principal direction. Finally, a 3D human
pose is recovered by mapping the principal direction to each
body part of a 3D synthetic human model. We perform both
quantitative and qualitative evaluations of our proposed 3D
human pose recovering methodology. We show that our proposed approach has a low average reconstruction error of
7.07 degrees for four key joint angles and performs more
reliably on a sequence of unconstrained poses than conventional methods. In addition, our methodology runs at a
speed of 20 FPS on a standard PC, indicating that our system is suitable for real-time applications. Our 3D pose recovery methodology is applicable to applications ranging from human computer interactions to human activity recognition.

Keywords 3D human pose recovery · Depth image · Body part recognition · Principal direction analysis

D.-L. Dinh · S. Lee
Department of Computer Engineering, Kyung Hee University,
1 Seocheon-dong, Giheung-gu, Yongin-si, Gyeonggi-do, Republic of Korea

M.-J. Lim · T.-S. Kim
Department of Biomedical Engineering, Kyung Hee University,
1 Seocheon-dong, Giheung-gu, Yongin-si, Gyeonggi-do, Republic of Korea

N. D. Thang
Department of Biomedical Engineering, International University, Ho Chi Minh City, Vietnam

1 Introduction
Recovering 3D human body poses from a sequence of
images in real-time is a challenging computer vision problem. Potential applications of this methodology in daily
life include entertainment games, surveillance, sports science, health care technology, human computer interactions,
motion tracking, and human activity recognition [14]. In
conventional systems, human body poses are reconstructed
by solving inverse kinematics using the motion information
of optical markers attached to human body parts that are
tracked by multiple cameras. These marker-based systems
are capable of recovering accurate human body poses, but
are not suitable for real-life applications because sensors
have to be attached, multiple cameras need to be installed,
expensive equipment is required, and the set-up is complicated [16]. In contrast to marker-based approaches, some

recent studies have focused on markerless-based methods
that can be utilized in daily applications. These markerless
systems are typically based on a single RGB image [19] or
multi-view RGB images [18, 22, 28].
Recently, with the introduction of depth imaging devices, depth images, in which each pixel provides distance information in 3D, can be readily obtained, and human depth silhouettes can be derived from them. Therefore, 3D human pose recovery from a single depth image or silhouette, without optical markers or multi-view RGB images, has



become an active research topic in computer vision. Novel
human pose estimation methodologies have been developed
based on depth information [4]. In [20, 21], depth data was
used to build a graph-based representation of a human body
silhouette from which the geodesic distance map of body
parts was computed to find primary landmarks such as the
head, hands, and feet. A 3D human pose was reconstructed
by fitting a human skeleton model to the landmarks. In [2],
using the information of primary landmarks as features of
each pose, the best matching pose was found from the set of
poses coded in the hierarchical tree structure. In [7, 8, 15],
depth data was presented as 3D surface meshes and then a
set of geodesic feature points such as the head, hands, and
feet were used to track poses. These approaches are generally based on alternative representations of the human depth
silhouette and the detection of body parts.
Another 3D body pose recovery approach utilizes a learning methodology to recognize each body part [10, 12].
Based on the recognized body parts, the corresponding

3D pose is reconstructed. In [26], the authors developed
a new algorithm based on expectation maximization (EM)
with two-step iterations: namely, body part labeling (E-step)
and model fitting (M-step). The depth silhouette and the
estimated 3D human body model of this method were represented by a cloud of points in 3D and a set of ellipsoids,
respectively. Each 3D point of the cloud was assigned and
then fitted to one corresponding ellipsoid. This process was
iterated by minimizing the discrepancies between the model
and depth silhouette. However, this algorithm is too slow
for real-time applications due to the high computational
costs of labeling. In [23], a new human pose recognition
approach based on body parts from a single depth image
was developed. Human body parts were inferred from the
depth image based on per-pixel classification via randomized decision trees trained using a large database (DB)
of synthetic depth images. This allowed efficient identification of human body parts in real-time. This system was
able to recognize up to 31 body parts from a single human
depth silhouette. To model 3D human poses, the authors
then applied the mean-shift algorithm [6] to the recognized
human body parts to estimate body joint positions. Human
body poses were recovered from these joint points. However, joint position estimation via the mean-shift algorithm generally suffers from the following limitations: (i) the positions of the estimated joints depend on the shape and size of the subject, (ii) the estimates concentrate on the surface of the body parts, whereas the true joints lie inside these parts, and (iii) the method requires input parameter values such as the window size.
To overcome the limitations of existing studies [23,
26] and develop a more robust 3D human pose recovery methodology, we propose a novel algorithm recovering
3D human poses in real-time based on principal direction

analysis (PDA) of recognized human body parts from a
series of depth images. Human body parts in the depth silhouette are first recognized via trained random forests (RFs)

with our own synthetic training DB similar to the method
described in [11]. Using PDA, principal direction vectors
are estimated from the recognized body parts. Then, the
directional vectors are mapped to each body part of the
3D human body model to recover the 3D human pose. In
our work, instead of using a simple human skeleton model
without constraints on the configuration of joints for 3D
human pose recovery as done in [23], we develop a more
sophisticated model that uses a kinematic chain with predefined degrees of freedom (DOF) for each joint which allows
only feasible body movements and recovery of the 3D
human pose.
The rest of the paper is organized as follows. In Section 2,
we describe our overall system. In Sections 3, 4, and 5
we introduce our proposed methodology including synthetic DB creation, RFs for pixel-based classification, body
parts recognition, PDA, and 3D human pose representation. Experimental results and comparisons with existing
methodologies [23, 26] are presented in Section 6. Conclusions and a discussion of our findings are provided in
Section 7.
2 Our methodology
Our goal in this study is to recover a 3D human pose from
a single human depth silhouette. Figure 1 shows the key
steps involved in our proposed 3D human pose recovering
methodology. In the first step, a single depth image is captured by a depth camera. The human depth silhouette is
then extracted by removing the background. In the second
step, human body parts of the silhouette are recognized via
trained RFs. In the third step, the principal directions of the
recognized body parts are estimated by PDA. Finally, these
directions are mapped to the 3D synthetic human model,
resulting in recovery of a 3D human body pose.
3 Body parts recognition
As mentioned above, to recognize body parts from a human depth silhouette, we utilize RFs as in [3, 11, 23]. This
learning-based approach requires a training DB; we therefore created our own synthetic training DB [11]. More
details are provided in the following sub-sections.
3.1 A synthetic DB of depth maps and corresponding body part-labeled maps
To create the training DB, we created synthetic human body
models using 3Ds Max, a commercial 3D graphic package



Fig. 1 Key processing steps of our proposed system. These steps consist of taking the depth image, removing background, labeling body parts,
applying PDA to the body parts, and finally recovering a 3D human pose

[1]. The human body model consists of a total of 31 body
parts [11]. To create various poses, motion information from
Carnegie Mellon University (CMU)’s motion DB [5] was
mapped to the model. Finally, a depth silhouette and its corresponding body part-labeled map were saved in the DB.
Our DB comprised 20,000 depth maps and corresponding
body part labeled maps. A set of samples of the human
body model, a map of the labeled body parts, and the map’s
corresponding depth silhouette are shown in Fig. 2. Images
in the DB had a size of 320 × 240 with 16-bit depth
values.
3.2 Depth feature extraction
We computed depth features based on differences between
neighboring pixel pairs. Depth features, f , were extracted
from a pixel x of the depth silhouette as described in
[13, 23]:


$$f_\theta(I, x) = d_I\!\left(x + \frac{o_1}{d_I(x)}\right) - d_I\!\left(x + \frac{o_2}{d_I(x)}\right) \qquad (1)$$

where dI (x) is the depth value at pixel x in image I ,
and parameters θ = (o1 , o2 ) describe offsets o1 and o2
from pixel x. The maximum offset value of o1 , o2 pairs
was 60 pixels, corresponding to 3 meters, which was the
distance of the subject to the camera. Normalization of the offsets by 1/dI(x) ensured that the features were distance invariant.
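As an illustration, a minimal sketch of the depth-difference feature of (1) is given below; the function name, the background constant, and the offset convention are our own assumptions rather than the authors' implementation.

```python
def depth_feature(depth, x, o1, o2, background=1e6):
    """Depth-difference feature f_theta(I, x) of Eq. (1).

    depth : 2D array of depth values (background pixels hold a large constant)
    x     : (row, col) pixel of interest
    o1,o2 : 2D pixel offsets; each is scaled by 1/depth(x) so that the
            feature is invariant to the subject's distance from the camera
    """
    d_x = float(depth[x])
    def probe(offset):
        r = int(round(x[0] + offset[0] / d_x))
        c = int(round(x[1] + offset[1] / d_x))
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return float(depth[r, c])
        return background  # probes falling outside the image read as background
    return probe(o1) - probe(o2)
```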
3.3 RFs for body parts labeling
To create trained RFs, we used an ensemble of five decision
trees. The maximum depth of the trees was 20. Each tree in
the RFs was trained with different pixels sampled randomly
from the synthetic depth silhouettes and their corresponding
body part indices. A subset of 2,000 training sample pixels
was drawn randomly from each synthetic depth silhouette in
the DB. For each sampled pixel, 2,000 candidate features were computed using (1). At each splitting node
in the tree, a subset of 50 candidate features was considered.
For pixel classification, the candidate features were first computed for each pixel x of a tested depth silhouette. For each
tree, starting from the root node, if the value of the splitting


Fig. 2 a A 3D graphic human body model used in a training DB generation, b a body part-labeled model, and c a depth silhouette in the synthetic
DB



function was less than the threshold of the node, x went to the left child and otherwise to the right child; this traversal was repeated for every tree built in the RFs [3, 9]. The optimal threshold for splitting the
node was determined by maximizing the information gain
in the training process. The probability distribution over 31
human body parts was computed at the leaf nodes in each
tree. Final decision to label each depth pixel for a specific
body part was based on the voting result of all trees in the
RFs.
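The training and per-pixel classification described above can be sketched with an off-the-shelf random forest; this is a hedged illustration using scikit-learn rather than the authors' implementation, and depth_feature and the offset pairs are the hypothetical helpers introduced in Section 3.2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_OFFSETS = 2000                                               # candidate features per pixel
rng = np.random.default_rng(0)
offset_pairs = rng.uniform(-60, 60, size=(N_OFFSETS, 2, 2))   # (o1, o2) pairs, in pixels

def pixel_features(depth, x):
    """Candidate feature vector of one pixel, computed with Eq. (1)."""
    return np.array([depth_feature(depth, x, o1, o2) for o1, o2 in offset_pairs])

# Paper's settings: five trees, maximum depth 20, 50 candidate features per split.
forest = RandomForestClassifier(n_estimators=5, max_depth=20, max_features=50)
# forest.fit(X_train, y_train)      # X_train: sampled pixel features, y_train: part labels 1..31
# label = forest.predict([pixel_features(depth, (row, col))])  # per-pixel body-part label
```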

4 3D human pose recovery from recognized body parts
4.1 Joint position proposal based on the mean shift
In [23], to recover a 3D human pose, a 3D human skeleton model of joints was used where the joints were fitted
from recognized body parts using a mean-shift algorithm
with a weighted Gaussian kernel. The mean shift algorithm is a nonparametric density estimation technique used to find the nearest mode of a point sample distribution [27]. This
technique is commonly used in image segmentation and
object tracking fields of computer vision studies [6, 24].
Given $n$ data points $x_i$, $i = 1, \ldots, n$, in a $d$-dimensional space $R^d$, the multivariate density obtained with a kernel $K(x)$, window radius $h$, and weight function $w$ is

$$\hat{f}_h(x) = \frac{1}{nh^d} \sum_{i=1}^{n} w_i\, K\!\left(\frac{x - x_i}{h}\right). \qquad (2)$$

The sample mean with kernel $G$ ($G(x) = K'(x)$) at point $x$ is defined as

$$m_h(x) = \frac{\sum_{i=1}^{n} x_i\, w_i\, G\!\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^{n} w_i\, G\!\left(\frac{x - x_i}{h}\right)}. \qquad (3)$$

The difference between mh (x) and x is called the mean
shift. The mean shift vector always points toward the direction of maximum increase of the density. Therefore, the mean shift procedure is guaranteed to converge to a point where the gradient of the defined
density function approaches zero. The mean shift algorithm process is illustrated in Fig. 3. Starting on the
data point in cyan, the mean shift procedure is performed to find the stationary point in red of the density
function. In [23], to optimize parameters and improve the
efficiency of the mean shift, the size of window h was
replaced by bc , which is a learned per-part bandwidth,
whereas the weight wic was obtained from the probability

distribution of each pixel in class C and the given depth
dI (xi ) as follows:
$$w_{ic} = P(c \mid I, x_i)\, d_I(x_i)^2. \qquad (4)$$

To reconstruct and visualize an estimated 3D human
pose using the mean shift, a skeleton model is represented by joint points estimated from the recognized body
parts. This is a simple model and the mean shift used to
estimate joint points has some limitations: the optimal window size is difficult to find, and an inappropriate window size can cause modes to merge; the position of estimated
joints depends on the shape and size of the recognized
body parts and is only computed on the surface of body
parts, whereas joints are positioned inside of the parts. To
overcome the limitations of this approach, we propose a
PDA algorithm, which we describe in the following section
(Section 4.2).
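For reference, the weighted mean-shift update of (3) with a Gaussian kernel can be sketched as below; the window size h (or the learned per-part bandwidth bc) and the per-point weights of (4) are assumed to be given, and the function name is our own.

```python
import numpy as np

def weighted_mean_shift(points, weights, start, h, n_iter=100, tol=1e-4):
    """Iterate the weighted sample mean of Eq. (3) until it converges
    to a local mode of the density in Eq. (2)."""
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((points - x) ** 2, axis=1) / h ** 2
        g = weights * np.exp(-0.5 * d2)                  # Gaussian kernel responses
        x_new = (g[:, None] * points).sum(axis=0) / g.sum()
        if np.linalg.norm(x_new - x) < tol:              # mean shift vector ~ 0
            break
        x = x_new
    return x
```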
4.2 Principal direction analysis of recognized body parts

In this section, instead of estimating joint positions from
the recognized body parts as in [23], we focus on analyzing principal directions of human body parts including
the torso, upper arms, lower arms, upper legs, and lower
legs, which can be grouped together as two, three, or four
successive recognized body parts. We denote the human
body parts as $\{P^1, P^2, \ldots, P^M\}$, where $M$ is the number of human body parts (in our work $M = 9$). Each human body part is represented as a 3D point cloud $P^m$ consisting of $n$ 3D points, $P^m = \{x_i\}_{i=1}^{n}$, where the value of $n$ changes with the size of the body part. The 3D point clouds $\{P^m\}_{m=1}^{M}$ are used by the PDA algorithm to determine the principal direction vectors $V_d^1, V_d^2, \ldots, V_d^M$. More details of PDA are provided in the following sub-sections.
4.2.1 Outlier removal
Recognized body parts, represented as a cloud of points,
contain some outliers and mislabeled points. These points
can hinder PDA, resulting in inaccurate directional vectors
of human body parts. For this reason, before applying PDA,
we devised a technique to select only points of interest from
the labeled point cloud. To select these points from the
cloud, we determined the weight value of all points in the
selected cloud using a logistic function and the Mahalanobis
distance.
The logistic function of a population w can be written
as
$$w(t_i) = \frac{L}{1 + e^{\alpha(t_i - t_0)}} \qquad (5)$$



Fig. 3 Mean shift iteration process to find the centroid of a cloud

where t0 denotes a rough threshold value that is defined
based on the size of the cloud of points, α a constant
value, and L the limiting value of the output (in our
case L = 1). Here, t0 and α are chosen based on the
shape and size of each human body part. ti is the Mahalanobis distance computed at the i th point in the cloud as
follows:

$$t_i = \sqrt{(x_i - \mu)^T S^{-1} (x_i - \mu)} \qquad (6)$$

where $x_i$ is the $i$th point in the cloud, $\mu$ is the mean vector of the cloud, and $S$ is the covariance matrix of the cloud, which is computed as

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T. \qquad (7)$$

Our proposed approach is illustrated in Fig. 4. Selected points subject to PDA that are used to determine the direction vector are shown in red. Points in green are regarded as outliers. The size or population of the region containing the points is controlled by the threshold parameter t0, while the parameter α is used to control the weight value of points in the cloud. This means that if we assume that the weight value of the function w ∈ [0, 1], then the weight of points in red near the centroid of the cloud is approximately 1, while the weight of the points in green far from the centroid is approximately 0. The weight of points around the threshold value t0 is approximately 0.5.

4.2.2 PDA

Here we describe how we estimate the directional vector
Vdm from the point cloud P m . First, to reduce the influence
of outliers on estimating the principal direction of the point
cloud, a weight vector is computed using a logistic function
and the threshold parameter as described in (5). The weight
value of each point in the cloud is determined based on
its corresponding value of Mahalanobis distance as in (6).
Then, based on a statistical approach and the known weight
vector, the mean vector and covariance matrix of PDA are
calculated as follows:
$$\mu^* = \frac{\sum_{i=1}^{n} w(t_i)^2\, x_i}{\sum_{i=1}^{n} w(t_i)^2}, \qquad (8)$$

$$S^* = \frac{\sum_{i=1}^{n} w(t_i)^2\, (x_i - \mu^*)(x_i - \mu^*)^T}{\sum_{i=1}^{n} w(t_i)^2 - 1}. \qquad (9)$$




Fig. 4 a Logistic function with t0 = 2 and α = 2, α = 4. b Effect of parameters t0 and α on the threshold value of 3D point clouds to eliminate
outliers (t0 = 2, α = 4)

Finally, to estimate the principal direction vector Vdm from
the cloud P m , the following equation is solved:
$$V_d^m = \arg\max_{\{E_k\}_{k=1}^{3}} E_k^T S^* E_k \qquad (10)$$

where the $E_k$ are the eigenvectors of $S^*$, and the maximizing eigenvector, which corresponds to the largest eigenvalue of $S^*$, is taken as $V_d^m$.

Algorithm 1 Principal Direction Analysis (PDA)
Input: A 3D point cloud P m
Output: A principal direction vector Vdm
Method:
Step 1. Find the mean vector μ and the covariance
matrix S of the point cloud P m , as in (7).
Step 2. Compute the Mahalanobis distance of all points
in the cloud P m with its mean vector μ and
covariance matrix S as described in (6).
Step 3. Assign the weight value for all points in the
cloud P m using a logistic function and the vector of determined Mahalanobis distances, as in
(5).
Step 4. Compute the PDA mean vector μ∗ and PDA
covariance matrix S ∗ of the point cloud P m
using the assigned weight value of each point
as in (8) and (9).

Step 5. Find the eigenvector corresponding to the
largest eigenvalue of the covariance matrix S∗, as in (10). This eigenvector is the determined principal direction vector Vdm.

We apply PDA to estimate the principal direction vectors of human body parts on their corresponding 3D point
clouds. Note that a 3D point cloud P m is represented as
an n × 3 matrix, where n denotes the number of 3D points in the cloud P m and each 3D point consists of three coordinates in the x, y, and z dimensions. To determine the
principal direction vector Vdm of the cloud P m , PDA starts
with a covariance matrix S determined from the matrix
[P m ]n×3 and a mean vector [μ]1×3 as in (7). By using
the mean vector and covariance matrix, the Mahalanobis
distance vector of all points in cloud P m is computed by
(6). The results of (5) return a weight vector corresponding to all points in this cloud. Based on the weight vector,
a PDA covariance matrix S ∗ and mean vector μ∗ are determined by (8) and (9). Finally, a principal direction vector
Vdm of the cloud P m is estimated by (10). Details of the
PDA algorithm are presented in Algorithm 1. The performance of PDA on point clouds with and without outlier
removal is illustrated in Fig. 5. The results of the estimated principal directions are shown as blue lines on the
clouds.
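A compact sketch of Algorithm 1 is given below; it assumes the point cloud is an n × 3 NumPy array, L = 1 in (5), and that t0 and α are supplied per body part (the default values here are illustrative only, not the authors' settings).

```python
import numpy as np

def principal_direction(cloud, t0=2.0, alpha=4.0):
    """Principal direction of one body-part point cloud (Algorithm 1)."""
    # Step 1: mean vector and covariance matrix of the cloud, Eq. (7)
    mu = cloud.mean(axis=0)
    diff = cloud - mu
    S = diff.T @ diff / len(cloud)

    # Step 2: Mahalanobis distance of every point, Eq. (6)
    t = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff))

    # Step 3: logistic weights, Eq. (5); outliers (large t) get weights near 0
    w = 1.0 / (1.0 + np.exp(alpha * (t - t0)))

    # Step 4: weighted PDA mean and covariance, Eqs. (8) and (9)
    w2 = w ** 2
    mu_star = (w2[:, None] * cloud).sum(axis=0) / w2.sum()
    d_star = cloud - mu_star
    S_star = (w2[:, None] * d_star).T @ d_star / (w2.sum() - 1.0)

    # Step 5: eigenvector of S* with the largest eigenvalue, Eq. (10)
    eigvals, eigvecs = np.linalg.eigh(S_star)
    return eigvecs[:, np.argmax(eigvals)]   # principal direction vector V_d^m
```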

5 3D human pose representation
To represent a recovered 3D human pose, we utilized a 3D
synthetic human model created by a set of super-quadrics.
The joints of the model were connected with a kinematic
chain and parameterized with rotational angles at each joint
[25, 26]. Our 3D synthetic human body model is defined in
4-D projective space as

$$m_e(X) = X^T V_\theta^T Q^T D\, Q\, V_\theta X - 2 = 0 \qquad (11)$$



Fig. 5 Comparison results of PDA (a) without outlier removal and (b) with outlier removal. The resultant principal directions are blue lines
superimposed on the point clouds. Two sets of 3D point clouds indicate an upper arm part (left, cyan) and a lower arm part (right, green) with
some outliers

where X denotes the coordinates of a 3D point on the surface of the super-quadric, D is a diagonal matrix containing the size of the super-quadric, Q locates the center of the super-quadric in the local coordinate system, and Vθ is a matrix containing the relative kinematic parameters computed from the directional vectors Vd. Our model comprises 10 human body parts (head, torso, left and right upper and lower arms, left and right upper and lower legs) and 9 joints (two knees, two hips, two elbows, two shoulders, and one neck). There are a total of 24 DOFs (two DOFs at each joint plus six free transformations from the global coordinate system to the local coordinate system at the hip), as shown in Fig. 6. In Fig. 6a, the dashed lines and arrows superimposed on the model show the results of PDA, while the corresponding recovered 3D human pose and the 3D model of super-quadrics are shown in Fig. 6b.
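Although the exact mapping from principal directions to the kinematic parameters Vθ is model specific, the joint angles evaluated in Section 6 can be illustrated by the angle between the principal directions of two adjacent body parts; the sketch below is our own simplification, not necessarily the authors' exact procedure.

```python
import numpy as np

def joint_angle_deg(v_upper, v_lower):
    """Angle (degrees) between the principal directions of two adjacent
    body parts, e.g. upper arm vs. lower arm for an elbow joint."""
    u = v_upper / np.linalg.norm(v_upper)
    v = v_lower / np.linalg.norm(v_lower)
    cos_t = np.clip(np.dot(u, v), -1.0, 1.0)   # guard against rounding errors
    return np.degrees(np.arccos(cos_t))
```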

6 Experimental results
In this section, we evaluated our proposed methodology through quantitative and qualitative assessments using synthetic and real data, as well as through comparison with previous approaches [23, 26].
6.1 Experimental set-ups
To assess our approach quantitatively, we utilized synthetic depth silhouettes and ground-truth information extracted
from synthetic 3D body poses. For each synthetic 3D human
pose, we measured the joint angles of four major joints, namely the left and right elbows and knees, from the 3D human body model and saved these as the ground truth. Then, body parts were recognized from the corresponding depth silhouette via trained RFs, principal directions were estimated by PDA, and these directions were mapped onto the 3D human body model, resulting in recovery of the 3D human pose. Finally, we derived the same joint angles from the recovered 3D pose and compared them to the ground truth. For qualitative assessment with real data, we utilized depth silhouettes captured by a depth camera [17]; visual inspection between the recovered 3D human poses and the corresponding RGB images was performed.
Pose recovery was performed using a standard desktop PC
with an Intel Pentium Core i5, 3.4 GHz CPU, and 8 GB
RAM.
6.2 Experimental results with synthetic data

Fig. 6 3D synthetic human model. a Orientation model and b 3D
model with super-quadrics shapes

We performed a quantitative evaluation using a series of 500
depth silhouettes containing various unconstrained movements. Evaluation results for synthetic poses are shown in



Figs. 7 and 8. Each plot in Fig. 8 corresponds to a joint angle
estimated by PDA. Solid and dashed lines indicate the PDA
estimated and the corresponding ground truth joint angles,

respectively.
We computed the average reconstruction error between
the estimated joint angles and ground truth joint angles
as

$$\Delta\theta = \frac{\sum_{i=1}^{n_f} \left| \theta_i^{\mathrm{est}} - \theta_i^{\mathrm{grd}} \right|}{n_f}, \qquad (12)$$

where $n_f$ is the number of frames, $i$ is the frame index, $\theta_i^{\mathrm{grd}}$ is the ground-truth angle, and $\theta_i^{\mathrm{est}}$ is the estimated
angle. To assess the reconstruction errors, we evaluated
the four different sequences of swimming, boxing, cleaning, and dancing activities. Each sequence contained 100
frames. The average errors at the four considered joint
angles of the second experiment are given in Table 1.
The average reconstruction error of the four different

sequences at the four considered joint angles was 7.07
degrees.
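The error metric of (12) amounts to the mean absolute difference between estimated and ground-truth joint angles over all frames, as in the short sketch below (array names are illustrative).

```python
import numpy as np

def avg_reconstruction_error(theta_est, theta_grd):
    """Average reconstruction error of Eq. (12), in degrees."""
    theta_est = np.asarray(theta_est, dtype=float)   # estimated angles per frame
    theta_grd = np.asarray(theta_grd, dtype=float)   # ground-truth angles per frame
    return np.abs(theta_est - theta_grd).mean()
```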

Fig. 7 Sample results of our proposed 3D human pose estimation based on our synthetic data. The 1st and 3rd rows show the synthetic depth
map. The 2nd and 4th rows show the estimated 3D human poses



Fig. 8 Comparison of the ground-truth and the estimated joint angles in synthetic data: a joint angle of left elbow, b joint angle of right elbow, c
joint angle of left knee, and d joint angle of right knee

6.3 Experimental results with real data
To evaluate real data, we asked three subjects to perform unconstrained movements. Two experiments were
performed. In the first experiment, we examined prin-

cipal direction estimation using PDA for one subject.
Figure 9 shows the results; the principal directions are
shown as lines superimposed on the subject’s poses. In
the second experiment, we assessed movements of the
elbows and knees with arbitrary poses (some simple and

Table 1 Average reconstruction error of evaluated joint angles (degrees) based on analysis of 100 frames of each activity
Activities                           Left elbow   Right elbow   Left knee   Right knee
Swimming                                5.11          5.12          8.34        8.67
Boxing                                  6.78          6.57          8.12        9.24
Cleaning                                5.45          5.62          7.56        7.67
Dancing                                 5.42          5.19          8.86        9.34
Average reconstruction error (°)        5.69          5.63          8.22        8.73



Fig. 9 Sample results of PDA. Blue lines indicate the directions of the four human body parts of the upper arms and legs. Red lines indicate the
directions of the lower arms and legs

complex poses). The experimental results for arm and leg
movements of the first subject are shown in Fig. 10. The
2nd and 3rd rows show 3D human poses reconstructed from
the front and side views, respectively. Because ground truth
joint angles are not available for the real data, we only per-

formed qualitative assessments by visual inspection of the
results of the 2nd and 3rd rows and RGB images in the
1st row. Figure 11 shows the qualitative assessments of two
other subjects with different body sizes and shapes from the
first subject and each other.

Fig. 10 Sample results of our proposed 3D human pose estimation for four different arm and leg movements. The 1st row shows RGB images of
four different poses, while the 2nd and 3rd rows show the estimated 3D human pose results from the front and side views respectively



Fig. 11 Sample results of our proposed 3D human pose estimation for four different poses of differently-shaped subjects. RGB images are shown
in the 1st row and estimated 3D human pose from two different subjects are shown in the 2nd row


6.4 Comparisons to the conventional methods
We evaluated the performance of our proposed methodology
by comparing it with that of the conventional methods [23, 26].
For comparison to the mean shift method [23], we
implemented the real-time human pose recognition system
as described in [23]. We used our synthetic DB to train
RFs in this system. We evaluated the mean shift method
through quantitative and qualitative assessments using synthetic and real data. Quantitative assessments of the same
tested synthetic data obtained using different methods are
presented in Table 2; our approach resulted in an average reconstruction error of 7.07 degrees at the four considered joint angles, compared to 9.79 degrees
for the mean shift method. We also performed a qualitative assessment of the same real data. The recovered 3D
human poses are represented on the same 3D synthetic

pose model shown in Fig. 12. As can be seen in Fig. 12,
our proposed methodology significantly improved accuracy compared with pose reconstruction based on the mean
shift method. In particular, our proposed method was more
robust than the mean shift method in poses involving
overlapped or intersected human body parts. In addition,
our system utilized a 3D synthetic human model created
by a set of super-quadrics connected with the defined
DOF at each joint for recovering 3D human poses. Our
system can therefore be used in real-time for practical
applications.
For comparison to the EM method [26], we used average
reconstruction errors computed from the four experiments
as given in [26]. The average reconstruction errors of the
EM method for the left-right elbows and knees were 7.50,

7.63, 8.03, and 13.81 degrees compared to 5.69, 5.63, 8.22,
and 8.73 degrees using our proposed method, as shown in
Table 2.

Table 2 Comparison of the average reconstruction error (degrees)
Evaluated angles           Left elbow   Right elbow   Left knee   Right knee   Average error of the four joints
Method proposed in [26]       7.50          7.61          8.03       13.81                 9.24
Method proposed in [23]       9.24          9.41         10.15       10.34                 9.79
Our proposed method           5.69          5.63          8.22        8.73                 7.07



Fig. 12 Comparison of our approach versus that outlined in [23] for four different poses. The 1st row shows RGB images, the 2nd row shows
depth silhouettes, the 3rd row shows the results obtained from the mean shift algorithm and the 4th row shows the results obtained using our
proposed PDA algorithm

7 Conclusions and discussion
We have developed a novel method to recover a 3D human
pose from a single depth silhouette. Our method estimates
principal direction vectors from the recognized body parts
using PDA, which is novel and effective in recovering 3D
human poses. Quantitative assessments revealed that our
method had an average reconstruction error of only 7.07
degrees for four key joint angles, whereas the mean shift
and the EM methods had an average reconstruction error
of 9.79 degrees and 9.24 degrees, respectively. Our algorithm runs at a speed of 20 FPS on a standard PC, which
indicates that our system is suitable for real-time human
activity recognition and human computer interaction applications such as those for personal life-care and health-care

services. By analyzing real data, we demonstrated that our
system performs reliably on sequences consisting of unconstrained movements of subjects with different appearances and shapes.
Acknowledgments This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA2013-(H0301-13-2001)). This work was also supported by the Industrial Strategic Technology Development Program (10035348, Development of a Cognitive Planning and Learning Model for Mobile Platforms) funded by the Ministry of Knowledge Economy (MKE, Korea).

References
1. Autodesk 3Ds MAX, 2012
2. Baak A, Müller M, Bharaj G, Seidel HP, Theobalt C (2011) Data-driven approach for real-time full body pose reconstruction from a depth camera. In: Proceedings of the 2011 international conference on computer vision, pp 1092–1099
3. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
4. Chen L, Wei H, Ferryman J (2013) A survey of human motion analysis using depth imagery. Pattern Recognit Lett 34(15):1995–2006
5. CMU motion capture database
6. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
7. Ganapathi V, Plagemann C, Koller D, Thrun S (2010) Real time motion capture using a single time-of-flight camera. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 755–762
8. Ganapathi V, Plagemann C, Koller D, Thrun S (2012) Real-time human pose tracking from range data. In: Proceedings of the 12th European conference on computer vision, pp 738–751
9. Hastie T, Tibshirani R, Friedman J (2008) The elements of statistical learning. Springer, New York
10. Holt B, Ong EJ, Bowden R (2013) Accurate static pose estimation combining direct regression and geodesic extrema. In: IEEE international conference and workshops on automatic face and gesture recognition
11. Jalal A, Sharif N, Kim JT, Kim TS (2013) Human activity recognition via recognized body parts of human depth silhouettes for residents monitoring services at smart home. J Indoor Built Environ 22:271–279
12. Jiu M, Wolf C, Taylor G, Baskurt A (2013) Human body part estimation from depth images via spatially-constrained deep learning. Pattern Recognit Lett
13. Lepetit V, Lagger P, Fua P (2005) Randomized trees for real-time keypoint recognition. In: IEEE computer society conference on computer vision and pattern recognition, pp 775–781
14. Moeslund TB, Hilton A, Krüger V (2006) A survey of advances in vision-based human motion capture and analysis. Comp Vision Image Underst 104(2):90–126
15. Plagemann C, Ganapathi V, Koller D, Thrun S (2010) Real-time identification and localization of body parts from depth images. In: IEEE international conference on robotics and automation (ICRA), pp 3108–3113
16. Poppe R (2007) Vision-based human motion analysis: an overview. Comp Vision Image Underst 108(1–2):4–18
17. PrimeSense Ltd.
18. Rosenhahn B, Kersting UG, Smith AW, Gurney JK, Brox T, Klette R (2005) A system for marker-less human motion estimation. Lect Notes Comput Sci 3663:230–237
19. Rosenhahn B, Schmaltz C, Brox T, Weickert J, Cremers D, Seidel HP (2008) Markerless motion capture of man-machine interaction. In: IEEE conference on computer vision and pattern recognition, pp 23–28
20. Schwarz LA, Mkhitaryan A, Mateus D, Navab N (2011) Estimating human 3d pose from time-of-flight images based on geodesic distances and optical flow. In: IEEE conference on automatic face and gesture recognition, pp 700–706
21. Schwarz LA, Mkhitaryan A, Mateus D, Navab N (2012) Human skeleton tracking from depth data using geodesic distances and optical flow. J Image Vision Comput 30(3):217–226
22. Shen J, Yang W, Liao Q (2013) Part template: 3d representation for multiview human pose estimation. Pattern Recognit 46(7):1920–1932
23. Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, pp 1297–1304
24. Shuang Z, Yu-ping Q, Hao D, Gang J (2012) Analyzing of mean-shift algorithm in extended target tracking technology. Lect Notes Electr Eng 144:161–166
25. Sundaresan A, Chellappa R (2008) Model-driven segmentation of articulating humans in laplacian eigenspace. IEEE Trans Pattern Anal Mach Intell 30(10):1771–1785
26. Thang ND, Kim TS, Lee YK, Lee SY (2011) Estimation of 3-d human body posture via co-registration of 3-d human model and sequential stereo information. Appl Intell 35(2):163–177
27. Vilaplana V, Marques F (2008) Region-based mean shift tracking: application to face tracking. In: IEEE international conference on image processing, pp 2712–2715
28. Yuan ZH, Lu T (2013) Incremental 3d reconstruction using bayesian learning. Appl Intell 39(4):761–771

Dong-Luong Dinh received his B.S. and M.S. degrees in Information
and Communication Technology from Ha Noi University of Science
and Technology, Vietnam, in 2001 and 2009 respectively. In 2001,
he was a Lecturer with the Department of Information Technology,
Nha Trang University, Vietnam. He is currently working toward his
Ph.D. degree in the Department of Computer Engineering, Kyung
Hee University, South Korea. His research interests include computer
vision, machine learning, human computer interaction, human motion
analysis and estimation.

Myeong-Jun Lim received his B.S. degree in Biomedical Engineering from Kyung Hee University, South Korea. He is currently working
toward his M.S. degree in the Department of Biomedical Engineering
at Kyung Hee University, Republic of Korea. His research interests
include image processing, pattern recognition, artificial intelligence,
and computer vision.



Nguyen Duc Thang received his B.E. degree in Computer Engineering from Posts and Telecommunications Institute of Technology,
Vietnam in 2005 and Ph.D. degree from the Department of Computer

Engineering at Kyung Hee University, South Korea in 2011. He is
currently a lecturer of the Department of Biomedical Engineering at
International University, National University Ho Chi Minh, Vietnam.
His research interests include biomedical image and signal processing,
computer vision, and machine learning.

Sungyoung Lee received his B.S. from Korea University, Seoul, South
Korea in 1978. He received his M.S. and Ph.D. degrees in Computer Science from Illinois Institute of Technology (IIT), Chicago, Illinois,
USA in 1987 and 1991 respectively. He has been a professor in the
Department of Computer Engineering, Kyung Hee University, Korea
since 1993. He is a founding director of the Ubiquitous Computing
Laboratory, and has been a director of the Neo Medicinal ubiquitous-Life Care Information Technology Research Center, Kyung
Hee University since 2006. Before joining Kyung Hee University, he
was an assistant professor in the Department of Computer Science,
Governors State University, Illinois, USA from 1992 to 1993. His current research focuses on Ubiquitous Computing, Cloud Computing,
Intelligent Computing, Context-Aware Computing, WSN, Embedded Realtime and Cyber-Physical Systems, and eHealth. He has
authored/coauthored more than 405 technical articles (130 of which
are published in archival journals). He is a member of the ACM and
IEEE.

Tae-Seong Kim received the B.S. degree in Biomedical Engineering from the University of Southern California (USC) in 1991, M.S.
degrees in Biomedical and Electrical Engineering from USC in 1993
and 1998 respectively, and Ph.D. in Biomedical Engineering from
USC in 1999. After his postdoctoral work in cognitive sciences at
the University of California, Irvine in 2000, he joined the Alfred E.
Mann Institute for Biomedical Engineering and Dept. of Biomedical
Engineering at USC as a Research Scientist and Research Assistant
Professor. In 2004, he moved to Kyung Hee University in Korea where
he is currently Professor in the Biomedical Engineering Department.

His research interests have spanned various areas of biomedical imaging including Magnetic Resonance Imaging (MRI), functional MRI,
E/MEG imaging, DT-MRI, transmission ultrasonic CT, and Magnetic Resonance Electrical Impedance Imaging. Lately, he has started
research work in proactive computing at the u-Lifecare Research Center where he serves as Vice Director. Dr. Kim has published more than
90 peer reviewed papers and 150 proceedings, 6 international book
chapters, and holds 10 international and domestic patents. He was a member of the IEEE from 2005 to 2013, is a member of KOSOMBE and Tau Beta Pi, and is listed in Who's Who in the World '09–'13 and Who's Who in Science and Engineering '11–'12.


