
2015 Seventh International Conference on Knowledge and Systems Engineering

Uniform Detection in Social Image Streams
Nguyen Quang Manh, Nguyen Duc Tuan, Dinh Viet Sang, Huynh Thi Thanh Binh
Hanoi University of Science and Technology

Nguyen Thi Thuy
Faculty of Information Technology, Vietnam National University of Agriculture


Abstract—Social media mining on the Internet has become an emerging research topic. The problem is challenging because of the massive amount of content from various sources, especially the images uploaded by users. In recent years, dictionary learning based image classification has been widely studied and has achieved significant success. In this paper, we propose a framework for the automatic detection of uniforms of interest in image streams from social networks. The system is composed of a powerful feature extraction module based on dense SIFT features and a state-of-the-art discriminative dictionary learning approach. Besides that, a parallel implementation of feature extraction is deployed to make the system run in real time. An extensive set of experiments has been conducted on four real-life datasets. The experimental results show that the detection rate reaches up to 100% on some datasets. We also obtain real-time performance, processing the image stream at about 40 images per second. The framework can be applied to emerging applications such as uniform detection, automated image tagging, content-based image retrieval, and online advertisement based on image content.

I. INTRODUCTION

In recent years, we have witnessed a rapid boom of social networking and online image sharing websites such as Facebook, Flickr, and Photobucket that allow users to upload their personal photos to the web. The abundant amount of information generated by social networks has stimulated people to discover useful patterns hidden in social media data. Besides textual information such as statuses, comments, and tags, social image data is a rich source to be exploited. Mining such a huge volume of social images on the web has become an important emerging research topic. Many methods [5], [12], [14], [21], [24] have been proposed for mining knowledge from social images based on both visual and textual content. Up to now, social images have been mined for various purposes such as mining geographic information, detecting hot events in society, or finding social groups.

Recently, managing the content of social images has attracted attention from companies and organizations seeking to protect their business. It has been observed that social network users can unintentionally upload images containing information that is unfavorable to the business of a company. The problem is serious when this kind of information reveals business secrets or damages the company's reputation. In practice, such adverse images are often photos of company staff taken during working hours while they are wearing their uniforms.

Therefore, detecting uniforms of interest in social images is a good way to monitor internal information related to the business of a company. To the best of our knowledge, the problem of detecting uniforms (clothes) in social images for the purpose of business monitoring has not been investigated. In this paper, we propose an efficient framework for detecting and filtering images containing a uniform of interest out of social image streams. The image streams are continuously retrieved from social networks using a crawling tool.

An important application of social image mining is automated image tagging, which automatically assigns semantic keywords to a new image based on a pre-learned object recognition model. A framework for uniform detection can be used to assign keywords such as company names and locations to images. Mining personal photos can also help to learn the preferences of individual users, which can be used to build a recommendation system or an advertising service. A framework for uniform detection can be integrated into such a system. For instance, when a user views an image containing the uniform of some company, the user is likely interested in the company's products; therefore, an advertisement for the company or a suggestion to buy some of the company's products can be pushed to the user.

Uniform detection is a particular case of image categorization; more specifically, it is a two-class classification problem. A general framework for image categorization usually contains two steps: feature extraction and classification. After feature extraction, images are represented by features that are then categorized by a pre-learned classifier. Since the data may arrive very fast from the social image stream, a good framework for uniform detection should work fast enough to keep up with the speed of image crawling while maintaining high detection accuracy.

Our main contributions are: (1) a combination of the powerful dense SIFT descriptor with a state-of-the-art dictionary learning strategy for an efficient image feature representation; (2) an efficient system for uniform detection in social image streams as a new application of social image mining for business monitoring; and (3) four datasets of uniform images that are made available to the research community.

The rest of the paper is organized as follows. In Section 2 we briefly summarize related work on social image mining and image classification. In Section 3 we describe our proposed framework for uniform detection. Experiments and evaluation on four real-life uniform datasets are presented in Section 4. Section 5 concludes the paper with a discussion of future work.
II. RELATED WORK

Social image mining: With a rich source of information, social images can be mined in many ways for various applications. In [6], Crandall et al. proposed a method to organize 35 million images collected from Flickr, combining content analysis based on textual and visual data with structural analysis based on geospatial data. In [14], Luo et al. combined satellite information with visual content to recognize the photo-taking environment. Chen et al. [5] exploited the metadata of Flickr photos, including time, location, and user-defined tags, to analyze the distribution of photos and automatically detect hot events. In [24], Yu et al. proposed a method to automatically suggest photo groups based on users' personal image collections.

Automated image tagging, which is an important part of many applications, is a challenging task in machine learning. The conventional approach to this problem is to train a classification model, e.g., an SVM [8], on hand-labeled training data with a set of predefined keywords. Another promising approach is the search-based paradigm, which relies on fast indexing and search techniques such as hashing functions [17] or scene matching [16]. Recently, Wu et al. [21] investigated a new approach to automated image tagging by learning an effective distance metric based on both visual and textual content.

Image classification with dictionary learning: In many image classification systems, low-level descriptors such as HOG [7] and SIFT [13] are first extracted at local interest points. These descriptors are then often coded into higher dimensions using sparse coding or vector quantization. Sparse coding is an effective approach that has been successfully applied to various problems in computer vision, including image classification [10], [18], [19], [22], [26]. Many state-of-the-art methods in sparse coding based classification employ the dictionary learning (DL) approach, where learning an overcomplete dictionary plays a critical role in the success of such methods. A desired dictionary should faithfully represent the query images while supporting the discrimination of image classes. Based on KSVD [1], Zhang and Li [26] proposed a joint learning algorithm named discriminative KSVD (DKSVD), which incorporates the classification error into the objective function of KSVD. Jiang et al. [10] proposed a new label consistency constraint and combined it with the reconstruction and classification errors to form a new algorithm called LC-KSVD. In addition to enforcing discriminative constraints on the dictionary, Yang et al. [23] proposed a method called FDDL that uses the Fisher criterion to make the sparse codes more discriminative.

Typically, in most existing DL methods the standard l0- or l1-norm sparsity constraint is imposed on the representation coefficients. However, these regularizations usually lead to time-consuming training and testing phases. Recently, some discriminative DL methods [9], [25] that exploit an l2-norm constraint have been proposed. Compared with conventional DL methods, the l2-norm based DL approach not only reduces the time complexity but also leads to very competitive results in image classification tasks. Therefore, the l2-norm based DL approach is a good choice for our framework, which must achieve real-time performance while maintaining a high detection rate.

III. OUR PROPOSED FRAMEWORK

Fig. 1: Our framework for uniform detection (a) with the Feature Extraction module (b)

The flowchart of our framework for uniform image detection in social image streams is shown in Fig. 1 and described in detail in Fig. 2. The problem of uniform detection is formulated as a classification task, wherein each image from the social data stream is classified into one of two classes: uniform or non-uniform. To train the system, we build a training dataset by crawling image data from social networks. For classification, we first perform feature extraction on each image using dense SIFT. The feature vector is then encoded using sparse coding and spatial pooling. Finally, it is categorized by a pre-learned classifier. We employ the state-of-the-art Projective Dictionary Pair Learning (DPL) model to train our classifier.

Fig. 2: The architecture of our framework based on sparse coding

A. Feature extraction

Feature extraction is the first step in the system and plays an important role in the success of a framework for image classification. Many state-of-the-art methods for image classification first extract low-level image descriptors and then transform them into mid-level features that give a richer representation. Based on this, we design a scheme for feature extraction that consists of three steps: local descriptor extraction, sparse coding, and spatial pooling (Fig. 1b).

1) Local descriptors: In [15], Oliva et al. proposed representing a scene image by GIST descriptors. GIST gives a brief, first-glance summary of a scene while ignoring the details of the objects in it, and is suitable mainly for outdoor images. In [20], Centrist descriptors, which combine both local and global information in the image, were proposed. A drawback of Centrist descriptors is that they are not invariant to rotation. In [13], Lowe proposed SIFT descriptors, which are invariant to many common image deformations such as changes in position, scale, rotation, affine transformation, and illumination. Hence, the SIFT descriptor has been widely used in many computer vision tasks, including image classification.

A development of the SIFT descriptor is dense SIFT (DSIFT) [3]. Instead of computing SIFT descriptors at sparse interest points, DSIFT is computed over a dense grid in the image (Fig. 3). Since more information is captured, DSIFT often leads to better results in image classification tasks. In our framework we use DSIFT for the low-level image representation.

Fig. 3: DSIFT descriptor

2) Sparse coding: Sparse coding has been widely studied and applied to many problems in image processing, and it has achieved very impressive results in various image classification tasks. In our framework, sparse coding is used to encode the DSIFT descriptors extracted from each image into an over-complete representation by learning a dictionary D on a given training set of L images.

Let X_l = [x_1, x_2, ..., x_{n_l}] ∈ R^{m×n_l} be the matrix of DSIFT descriptors associated with the l-th image from the training set, where x_i ∈ R^{m×1} is a column vector representing the i-th DSIFT descriptor, m = 128 is the size of each DSIFT descriptor, and n_l is the number of DSIFT descriptors extracted from the l-th image. Let X = [X_1, X_2, ..., X_L] ∈ R^{m×N} be the matrix of all DSIFT descriptors extracted from the training set, where N = n_1 + n_2 + ... + n_L.

Suppose that we need to learn an overcomplete dictionary D = [d_1, d_2, ..., d_p] ∈ R^{m×p} that consists of p atoms, where p > m. Each DSIFT descriptor is then represented as a sparse linear combination of a few atoms from the dictionary D.

An efficient dictionary learning method called KSVD was introduced in [1]. The goal of KSVD is to minimize the reconstruction error with respect to a given sparsity level:

\[
\min_{D,A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|a_i\|_0 \le T, \tag{1}
\]

where A = [a_1, a_2, ..., a_N] ∈ R^{p×N} is the matrix of sparse codes and T is the sparsity level.

Theoretical results in [4] show that it is easier to recover the underlying sparse codes of the data matrix X if large mutual incoherence between atoms is maintained during the dictionary learning phase. Inspired by this idea, an improved version of KSVD called MI-KSVD was introduced in [2]. The objective function of MI-KSVD is defined as follows:

\[
\min_{D,A} \|X - DA\|_F^2 + \mu \sum_{i=1}^{p} \sum_{\substack{j=1 \\ j \ne i}}^{p} |d_i^T d_j| \quad \text{s.t.} \quad \forall i,\ \|d_i\|_2 = 1 \ \text{and} \ \forall j,\ \|a_j\|_0 \le T, \tag{2}
\]

where μ ≥ 0 is a hyperparameter.

In our framework, we use MI-KSVD to encode low-level image descriptors into a higher-dimensional sparse representation. After learning the dictionary D, we can use it to encode the low-level descriptor matrix X_new ∈ R^{m×n} of a new query image into a matrix A_new ∈ R^{p×n} of sparse vectors (Fig. 2), where n is the number of DSIFT descriptors extracted from the query image.
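The sketch below illustrates only the encoding step of this stage: given an already-learned dictionary D, it computes sparse codes with orthogonal matching pursuit via scikit-learn. This is an assumption made for illustration; the paper learns D with MI-KSVD, whose mutual-incoherence penalty and atom-update loop are not shown here.

```python
# Sketch of the encoding step: given a learned dictionary D (m x p), encode
# each DSIFT descriptor with at most T nonzero coefficients via OMP.
# Learning D itself (KSVD / MI-KSVD) alternates this sparse-coding step with
# atom-by-atom dictionary updates, which are omitted in this sketch.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sparse_encode(X, D, T=5):
    """X: (m, n) descriptors, D: (m, p) dictionary -> A: (p, n) sparse codes."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=T, fit_intercept=False)
    omp.fit(D, X)          # solves min ||X - DA||_F^2 s.t. ||a_i||_0 <= T
    return omp.coef_.T     # scikit-learn returns (n, p); transpose to (p, n)

m, p, n = 128, 200, 1000               # sizes as in the experiments (Table II)
D = np.random.randn(m, p)
D /= np.linalg.norm(D, axis=0)         # unit-norm atoms, as required by Eq. (2)
A = sparse_encode(np.random.randn(m, n), D, T=5)
```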

3) Spatial pooling: To obtain mid-level image features with a better representation, the sparse codes of the low-level descriptors are often pooled together. Spatial pooling yields translation-invariant features by reducing the resolution of the feature maps. Spatial pyramid matching (SPM) [11] is a pooling scheme that has been widely used in recent years. In SPM, the sparse codes of the DSIFT descriptors are pooled at three scale levels over the image space: 4×4, 2×2, and 1×1. Fig. 4 illustrates the SPM pooling scheme used in our framework.

Another important role of spatial pooling is that it normalizes the dimension of the feature vectors in the dataset. Because social images vary widely in size, the number of DSIFT descriptors per image also varies. With SPM pooling, the features extracted from an image are converted into a feature vector of a fixed size consisting of (4 × 4 + 2 × 2 + 1 × 1) × p = 21p entries.

In general, there are two common pooling strategies: average pooling and max pooling. Average pooling takes the average of the sparse codes of the DSIFT descriptors over a certain region, while max pooling takes the maximum of each entry. In practice, max pooling often outperforms average pooling in various tasks. Hence, in our framework we use the max pooling strategy in the SPM pooling layer (Fig. 4).

Fig. 4: Step-by-step spatial pooling to obtain the feature vector of an image
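The following is a minimal sketch of the SPM max pooling step, assuming the sparse codes A and the grid coordinates of each descriptor are available from the previous stages; pooling over absolute values is a common convention for signed sparse codes and is an assumption here.

```python
# A hedged sketch of SPM max pooling: sparse codes A (p x n) for an image,
# together with each descriptor's pixel position, are max-pooled over 4x4,
# 2x2 and 1x1 partitions, giving a (16 + 4 + 1) * p = 21p feature vector.
import numpy as np

def spm_max_pool(A, xs, ys, width, height, levels=(4, 2, 1)):
    """A: (p, n) sparse codes; xs, ys: integer descriptor coordinates."""
    p, n = A.shape
    pooled = []
    for g in levels:                       # partition the image into g x g cells
        for i in range(g):
            for j in range(g):
                in_cell = ((xs * g // width) == j) & ((ys * g // height) == i)
                if in_cell.any():
                    # abs-max pooling, a common choice for signed codes
                    pooled.append(np.abs(A[:, in_cell]).max(axis=1))
                else:
                    pooled.append(np.zeros(p))  # empty cell -> zero vector
    return np.concatenate(pooled)               # length 21p

# z = spm_max_pool(A, xs, ys, img_w, img_h)   # feature vector z of one image
```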

4) Parallel implementation of feature extraction: Our framework consists of two sub-systems: an image crawler that downloads photos from the social networks, and the uniform detection system itself. The first sub-system works continuously, downloading images from social networks and saving them into a folder. After a period of time (Δt), or once the number of accumulated images is large enough, the second sub-system starts working. Since our feature extraction involves a learning process to obtain informative features, it is time consuming. The feature extraction process can be sped up by a parallel implementation: the available memory is used to determine the number of threads to allocate, each of which performs feature extraction for one image.
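Below is a minimal sketch of the parallel stage, assuming one worker per CPU core; the paper instead sizes the worker pool from the available memory, and extract_features here is a placeholder standing in for the full per-image pipeline sketched above.

```python
# A minimal sketch of parallel feature extraction with a process pool.
# Assumption: one worker per CPU core stands in for the paper's
# memory-based thread allocation policy.
from multiprocessing import Pool
import os
import numpy as np

def extract_features(image_path):
    """Per-image pipeline: dense SIFT -> sparse coding -> SPM max pooling,
    composing the sketches above; a dummy vector stands in for brevity."""
    p = 200                     # dictionary size, as in Table II
    return np.zeros(21 * p)     # placeholder for the real 21p-dim feature

if __name__ == "__main__":
    batch = ["img_000.jpg", "img_001.jpg"]  # images crawled in the last Δt
    with Pool(processes=os.cpu_count()) as pool:
        feature_vectors = pool.map(extract_features, batch)
```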

B. Classification

In recent years, dictionary learning approaches to classification problems have received special attention from the computer vision community. Among dictionary learning methods, the recently proposed Projective Dictionary Pair Learning (DPL) is known as one of the state-of-the-art models. Unlike earlier dictionary learning methods such as DKSVD [26], LC-KSVD [10], and FDDL [23], which try to learn a single dictionary for representation, DPL [9] jointly learns two separate dictionaries: a synthesis dictionary for representation power and an analysis dictionary for classification power. DPL is a reasonable choice for our framework because it greatly reduces the time complexity while maintaining high accuracy.

1) Training the DPL classifier: The essence of training DPL is to learn a synthesis dictionary Q = [Q_1, Q_2, ..., Q_K] and an analysis dictionary P = [P_1, P_2, ..., P_K] to reconstruct the feature matrix Z = [Z_1, Z_2, ..., Z_K] extracted from the training set, where K is the number of classes. In the uniform detection problem, we have only K = 2 classes: uniform and non-uniform. The discriminative power of DPL is given by the pairs of class-specific subdictionaries {Q_k ∈ R^{(21p)×t}, P_k ∈ R^{t×(21p)}}, k = 1, 2, ..., K, where t is the number of atoms in each subdictionary. The subdictionary P_k projects the data from the other classes into a nearly null space, while Q_k tries to optimally reconstruct the feature matrix Z_k from its projective code matrix P_k Z_k. The objective function of DPL is defined as follows:

\[
\{Q^*, P^*\} = \arg\min_{Q,P} \sum_{k=1}^{K} \|Z_k - Q_k P_k Z_k\|_F^2 + \lambda \|P_k \bar{Z}_k\|_F^2 \quad \text{s.t.} \quad \|q_i\|_2^2 \le 1, \tag{3}
\]

where q_i is the i-th atom of Q, Z_k is the training feature matrix of class k, and Z̄_k is the feature matrix of the other classes.

Since the problem in (3) is non-convex, the authors in [9] relaxed it to the following problem by introducing a variable matrix U:

\[
\{P^*, Q^*, U^*\} = \arg\min_{P,Q,U} \sum_{k=1}^{K} \left( \|Z_k - Q_k U_k\|_F^2 + \tau \|P_k Z_k - U_k\|_F^2 + \lambda \|P_k \bar{Z}_k\|_F^2 \right) \quad \text{s.t.} \quad \|q_i\|_2^2 \le 1. \tag{4}
\]

The optimization problem in (4) can be solved by iterating the following two steps.

Step 1: Fix Q and P, update U:

\[
U^* = \arg\min_{U} \sum_{k=1}^{K} \left( \|Z_k - Q_k U_k\|_F^2 + \tau \|P_k Z_k - U_k\|_F^2 \right). \tag{5}
\]

This is a quadratic optimization problem with the following closed-form solution:

\[
U_k^* = (Q_k^T Q_k + \tau I)^{-1} (\tau P_k Z_k + Q_k^T Z_k). \tag{6}
\]

Step 2: Fix U, update P and Q:

\[
P^* = \arg\min_{P} \sum_{k=1}^{K} \left( \tau \|P_k Z_k - U_k\|_F^2 + \lambda \|P_k \bar{Z}_k\|_F^2 \right), \tag{7}
\]

\[
Q^* = \arg\min_{Q} \sum_{k=1}^{K} \|Z_k - Q_k U_k\|_F^2 \quad \text{s.t.} \quad \|q_i\|_2^2 \le 1. \tag{8}
\]

As in [9], the closed-form solution for P can be obtained as follows:

\[
P_k^* = \tau U_k Z_k^T (\tau Z_k Z_k^T + \lambda \bar{Z}_k \bar{Z}_k^T + \gamma I)^{-1}, \tag{9}
\]

where γ = 10^{-4} is a small number.

The problem in (8) can be relaxed by introducing a variable S:

\[
\min_{Q,S} \sum_{k=1}^{K} \|Z_k - Q_k U_k\|_F^2 \quad \text{s.t.} \quad Q = S,\ \|s_i\|_2^2 \le 1. \tag{10}
\]

The optimal solution of (10) can be obtained with the ADMM algorithm:

\[
\begin{aligned}
Q^{(r+1)} &= \arg\min_{Q} \sum_{k=1}^{K} \left( \|Z_k - Q_k U_k\|_F^2 + \rho \|Q_k - S_k^{(r)} + T_k^{(r)}\|_F^2 \right), \\
S^{(r+1)} &= \arg\min_{S} \sum_{k=1}^{K} \rho \|Q_k^{(r+1)} - S_k + T_k^{(r)}\|_F^2 \quad \text{s.t.} \quad \|s_i\|_2^2 \le 1, \\
T^{(r+1)} &= T^{(r)} + Q^{(r+1)} - S^{(r+1)}, \quad \text{update } \rho \text{ if appropriate.}
\end{aligned}
\]

2) Classification scheme of DPL: Given the feature vector z of a query image, DPL predicts the class label of the image based on the class-specific reconstruction error:

\[
\text{identify}(z) = \arg\min_{k} \|z - Q_k P_k z\|_2. \tag{11}
\]
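To make the training loop concrete, the sketch below implements the closed-form updates of Eqs. (6) and (9) and the classification rule of Eq. (11). The Q-update here is a plain least-squares step with column renormalization rather than the exact ADMM solution of Eq. (10), so this is an illustrative approximation of DPL [9], not the authors' implementation.

```python
# Illustrative DPL sketch: alternating closed-form updates of U (Eq. 6) and
# P (Eq. 9), with a simplified least-squares Q-update in place of the ADMM
# solution of Eq. (10), followed by classification via Eq. (11).
import numpy as np

def dpl_train(Z_list, t=30, tau=0.01, lam=0.001, gamma=1e-4, iters=10):
    K, d = len(Z_list), Z_list[0].shape[0]   # d = 21p in the paper's notation
    rng = np.random.default_rng(0)
    Q = [rng.standard_normal((d, t)) for _ in range(K)]
    P = [rng.standard_normal((t, d)) for _ in range(K)]
    for _ in range(iters):
        for k, Zk in enumerate(Z_list):
            Zbar = np.hstack([Z for j, Z in enumerate(Z_list) if j != k])
            # Eq. (6): closed-form update of the coding matrix U_k
            Uk = np.linalg.solve(Q[k].T @ Q[k] + tau * np.eye(t),
                                 tau * P[k] @ Zk + Q[k].T @ Zk)
            # Eq. (9): closed-form update of the analysis subdictionary P_k
            P[k] = tau * Uk @ Zk.T @ np.linalg.inv(
                tau * Zk @ Zk.T + lam * Zbar @ Zbar.T + gamma * np.eye(d))
            # Simplified Q_k update: least squares + column renormalization,
            # a stand-in for the ADMM solution of Eq. (10)
            Qk = Zk @ np.linalg.pinv(Uk)
            Q[k] = Qk / np.maximum(np.linalg.norm(Qk, axis=0), 1.0)
    return Q, P

def dpl_classify(z, Q, P):
    # Eq. (11): choose the class with the smallest reconstruction error
    errs = [np.linalg.norm(z - Qk @ (Pk @ z)) for Qk, Pk in zip(Q, P)]
    return int(np.argmin(errs))

# Toy two-class example (uniform vs. non-uniform), with small synthetic sizes:
Z_list = [np.random.randn(100, 40), np.random.randn(100, 40)]
Q, P = dpl_train(Z_list, t=8, iters=5)
print(dpl_classify(np.random.randn(100), Q, P))
```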

IV. EXPERIMENT

A. Data sets

To the best of our knowledge, there is no publicly available dataset for the problem of uniform (clothes) detection. Thus, we built our own datasets to evaluate the proposed framework. Each dataset consists of positive and negative images. An image is called positive if it contains the uniform of interest, and negative otherwise. These datasets will be made available for research purposes.

TABLE I: The details of the collected datasets

No.  Name of Dataset  # Training images  # Test images
1    Lawson           190                245
2    Argentina        195                250
3    Brazil           195                250
4    Barcelona        327                394

To collect negative samples, we used a crawling module based on the Facebook Graph model with the provided APIs to download images from public albums on many Facebook fan pages. By doing so, we do not violate copyright or an individual's privacy. The collected images are very diverse, since they are crawled from various categories such as landscapes, street scenes, picnics, and selfies. Moreover, the collected images vary in size, color, and brightness, and an image may contain one or more objects. All of these characteristics make the datasets challenging for classification tasks.

To collect positive samples, we submitted relevant keywords to the Google Search Engine, which returns many links to images associated with each query. From the returned links we can easily get positive images containing the uniform of interest. The uniform images come in various sizes and appear in different real-life scenes under different viewing angles, which makes the uniform detection problem challenging.

We have downloaded four datasets of uniforms for our experiments. Each dataset contains positive images that are relevant to a particular uniform. The Lawson dataset has a positive image set of the uniform of the Lawson Company; Fig. 5 illustrates the diversity of the images that we collected from the social network. The other datasets are related to the uniforms of some famous football clubs.

Fig. 5: Some samples from the Lawson dataset showing the diversity of the collected image data

B. Experiment setup

Our experiments were conducted using the Matlab programming language on computers with the following specifications: Intel Xeon E5-2650 v2 eight-core processor, 2.6 GHz, 8.0 GT/s, 20 MB cache, 32 GB RAM, running the 64-bit Ubuntu 14.04 operating system.

Each dataset is split into a training set and a test set. This is done several times to evaluate the performance: in each split we use the training set to train our framework and then evaluate its accuracy on the test set. Table I shows the numbers of training and test samples in our datasets. The overall detection accuracy is calculated by averaging the detection rates over the different data splits.

The regularization parameters of our framework, namely T, τ, λ, and γ, are tuned by cross-validation, while the others are set manually. Table II shows the values of all parameters used in the experiment for each dataset, where dsf step is the SIFT step used for computing DSIFT and the other parameters are explained above.

TABLE II: Parameters used in our experiments

Name of Dataset  dsf step  p    t   T  τ     λ       γ
Lawson           5         200  30  5  0.01  0.001   0.001
Argentina        5         200  30  5  0.01  0.001   0.001
Brazil           5         200  30  5  0.05  0.0005  0.00001
Barcelona        5         200  30  5  0.05  0.005   0.001

C. Result and Evaluation

Table III shows the performance of our proposed framework in terms of detection accuracy on each dataset, where

\[
\text{accuracy} = \frac{\#\text{true positives} + \#\text{true negatives}}{\#\text{samples}}. \tag{12}
\]

On average it takes about 0.025 seconds to process an image, which makes our framework suitable for the real-time detection of uniforms in online social image streams.

TABLE III: The detection accuracy of our framework on the four datasets

No.  Name of Dataset  Accuracy (%)
1    Lawson           100.0
2    Argentina        97.6
3    Brazil           94.0
4    Barcelona        97.0


Other performance measures of our framework, including precision, recall, and the F1 score, are shown in Table IV. Precision and recall are defined as follows:

\[
\text{precision} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false positives}}, \tag{13}
\]

\[
\text{recall} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false negatives}}. \tag{14}
\]

The F1 score is calculated from precision and recall:

\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{15}
\]

TABLE IV: Evaluation of our framework in terms of Precision, Recall, and F1 score

No.  Name of Dataset  Precision (%)  Recall (%)  F1 (%)
1    Lawson           100            100         100
2    Argentina        96.07          98.00       97.03
3    Brazil           92.93          92.00       92.46
4    Barcelona        96.50          95.17       95.83
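The following sketch computes all four measures of Eqs. (12)-(15) from binary ground-truth and prediction vectors (1 = uniform, 0 = non-uniform).

```python
# Evaluation metrics of Eqs. (12)-(15) for the two-class uniform detector.
import numpy as np

def evaluate(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)                  # Eq. (12)
    precision = tp / (tp + fp)                          # Eq. (13)
    recall = tp / (tp + fn)                             # Eq. (14)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (15)
    return accuracy, precision, recall, f1

# Example: evaluate(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1]))
```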


As one can see, we obtained a detection rate of up to 100% on the Lawson dataset. For the other datasets, the system correctly detects most of the images that contain at least one uniform. Some missed detections occur because the uniform appears as a small object in the image, or because the image contains objects that are similar to the uniform of interest. Overall, the system gives satisfactory performance on the problem of uniform detection from social images.

V. CONCLUSION

We have proposed an efficient framework for uniform detection from images in social networks. The framework integrates the powerful DSIFT feature descriptor with a state-of-the-art dictionary learning method for efficient image representation. We employ a max pooling layer to enhance the features for classification, and a parallel implementation has been deployed for fast feature extraction. Experiments have been conducted on various datasets to demonstrate the effectiveness of our framework for uniform detection. The framework can be employed in emerging applications such as business monitoring, automated image tagging, and content-based image advertising.

ACKNOWLEDGMENT

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.01-2011.17.

REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.
[2] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR 2013, pages 660-667. IEEE, 2013.
[3] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In ICCV 2007, pages 1-8. IEEE, 2007.

[4] E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969-985, 2007.
[5] L. Chen and A. Roy. Event detection from Flickr data through wavelet-based spatial analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 523-532. ACM, 2009.
[6] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web, pages 761-770. ACM, 2009.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR 2005, volume 1, pages 886-893. IEEE, 2005.
[8] J. Fan, Y. Gao, and H. Luo. Multi-level annotation of natural scenes using dominant image components and semantic concepts. In Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 540-547. ACM, 2004.
[9] S. Gu, L. Zhang, W. Zuo, and X. Feng. Projective dictionary pair learning for pattern classification. In Advances in Neural Information Processing Systems, pages 793-801, 2014.
[10] Z. Jiang, Z. Lin, and L. S. Davis. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2651-2664, 2013.
[11] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR 2006, volume 2, pages 2169-2178. IEEE, 2006.
[12] Z. Liu. A survey on social image mining. In Intelligent Computing and Information Science, pages 662-667. Springer, 2011.
[13] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV 1999, volume 2, pages 1150-1157. IEEE, 1999.
[14] J. Luo, J. Yu, D. Joshi, and W. Hao. Event recognition: Viewing the world with a third eye. In Proceedings of the 16th ACM International Conference on Multimedia, pages 1071-1080. ACM, 2008.
[15] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145-175, 2001.
[16] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR 2008, pages 1-8. IEEE, 2008.
[17] X.-J. Wang, L. Zhang, F. Jing, and W.-Y. Ma. AnnoSearch: Image auto-annotation by search. In CVPR 2006, volume 2, pages 1483-1490. IEEE, 2006.
[18] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031-1044, 2010.
[19] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210-227, 2009.
[20] J. Wu and J. M. Rehg. CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1489-1501, 2011.
[21] P. Wu, S. C.-H. Hoi, P. Zhao, and Y. He. Mining social images with distance metric learning for automated image tagging. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 197-206. ACM, 2011.
[22] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR 2009, pages 1794-1801. IEEE, 2009.
[23] M. Yang, D. Zhang, and X. Feng. Fisher discrimination dictionary learning for sparse representation. In ICCV 2011, pages 543-550. IEEE, 2011.
[24] J. Yu, X. Jin, J. Han, and J. Luo. Mining personal image collection for social group suggestion. In ICDM Workshops 2009, pages 202-207. IEEE, 2009.
[25] D. Zhang, M. Yang, and X. Feng. Sparse representation or collaborative representation: Which helps face recognition? In ICCV 2011, pages 471-478. IEEE, 2011.
[26] Q. Zhang and B. Li. Discriminative K-SVD for dictionary learning in face recognition. In CVPR 2010, pages 2691-2698. IEEE, 2010.