Predicting the Popularity of Social Curation

Kieu Thanh Binh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Assoc. Prof. Pham Bao Son

A thesis submitted in fulfillment of the requirements
for the degree of
Master of Science in Computer Science
December 2015



ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge
it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree
or diploma at University of Engineering and Technology (UET/Coltech) or any other
educational institution, except where due acknowledgement is made in the thesis. Any
contribution made to the research by others, with whom I have worked at UET/Coltech
or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual
content of this thesis is the product of my own work, except to the extent that assistance
from others in the project’s design and conception or in style, presentation and linguistic
expression is acknowledged.’

Hanoi, December 30th, 2015
Signed ........................................................................





ABSTRACT
The amount and variety of social media content, such as status updates, images, movies,
and music, are increasing rapidly. Accordingly, social curation services are emerging
as a new way to connect, select, and organize information on a massive scale. One
noticeable feature of social curation services is that they are loosely supervised: the
content that users create in the service is manually collected, selected, and maintained.
A large proportion of this content is arbitrarily created by inexperienced
users. In this thesis, we look into social curation, in particular the Storify website,
the most popular social curation service for creating stories from content in various
domains such as Twitter, Flickr, and YouTube. We implemented a machine learning
method with feature extraction to filter this content and to predict the popularity
of social curation data.
Publication:
Binh Thanh Kieu, Son Bao Pham and Ryutaro Ichise. Predicting the Popularity of Social
Curation. In Proceedings of the 6th International Conference on Knowledge and Systems Engineering (KSE 2014), pp. 413-424, Springer.



ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest gratitude to my supervisor, Assoc. Prof. Pham Bao Son, for his patient guidance and continuous support
throughout the years. He always appears when I need help, and responds to queries
so helpfully and promptly.
I would like to specially thank Prof. Ryutaro Ichise and his colleagues for their help
through my time at Ichise Laboratory, NII.
I sincerely acknowledge the Vietnam National University, Hanoi, the Toshiba Foundation Scholarship, and especially Assoc. Prof. Pham Bao Son for financially supporting
my master’s study.
Finally, this thesis would not have been possible without the support and love of
my mother and my father. Thank you!




Table of Contents
1 Introduction
1.1 Social Curation
1.2 Predicting the Popularity
1.3 Thesis Organisation
2 Literature review
2.1 Social Curation
2.1.1 Definition
2.1.2 Social Curation Service
2.2 Storify
2.3 Related Work
3 Predicting the Popularity of Social Curation
3.1 Problem Formulation
3.1.1 Regression
3.1.2 Classification
3.2 Feature Extraction
3.2.1 Curator features
3.2.2 Curation features
3.2.3 Text features
3.2.4 Regression and classification model
4 Experimental Results
4.1 The Experimental Dataset
4.2 Results
4.2.1 Regression
4.2.2 Classification
4.2.3 T-test Evaluation
5 Conclusion


List of Figures
2.1 Content Creators
2.2 Content Curators
2.3 Example of a Storify list


List of Tables
2.1 Statistics of curated domains
2.2 Element types
2.3 Storify action statistics
4.1 Mean Square Errors (MSE) of view count regression by SVR
4.2 Prediction accuracy
4.3 Accuracy of 10 tests


List of Abbreviations
BoW   Bag-of-Words
ML    Machine Learning
NLP   Natural Language Processing
SNS   Social Networking Service
SVM   Support Vector Machine


Chapter 1
Introduction
Along with the rapid growth of the Internet, social networks are increasingly attracting users, young people in particular. Therefore, the study of social networks
is getting more and more attention. Social network services such as Facebook, Myspace, and Twitter have become viable sources of information for many online users.
These websites are increasingly used for communicating breaking news, sharing eyewitness accounts, and organizing groups of people. At the most basic level, a curation
service offers the ability to manually collect, select, and maintain this social media
information. This is very different from other social information sources, and we can
utilize this characteristic for efficient content mining.

1.1 Social Curation

The emergence of Web 2.0 and online social networking services, such as Digg,
YouTube, Facebook, and Twitter, has changed how users generate and consume online
content. For example, YouTube, well known for its fast-growing user-generated
content, reports that 100 hours of video are uploaded every minute, according to YouTube
Statistics. Online social networking services, augmented with multimedia
content support, sharing, and commenting on other users’ content, constitute a significant
part of the web experience of Internet users. The question is: how do users
find interesting content? Or, how does certain content rise in popularity? If we can
answer these questions, we can predict which content is most likely to become popular
and filter out the rest. Moreover, once unpopular content that gets little attention
is filtered out, the remaining good content can be used to build an automatic system
for curating social content.

1.2 Predicting the Popularity

Predicting the popularity of content is, however, a difficult task for many reasons.
First, the effects of external phenomena (e.g., media, natural, and geopolitical) are
difficult to incorporate into models (Lee et al., 2010). Second, cascades
of information are difficult to forecast (Cha et al., 2009). Finally, the underlying
contexts, such as locality, relevance to users, resonance, and impact, are not easy to
decipher (Bandari et al., 2012).
Design is changing into an experience-oriented discipline; consequently, designers
need appropriate tools and methods to incorporate experiential aspects into their
designs. A story is a crafted experience, and storytelling is the craft. Therefore,
understanding the structural strategies behind storytelling and learning how to incorporate
them into a design process is relevant for designers when they want to
envision, discuss, and influence user experiences. In this thesis, I introduce the Storify
website and a method for analyzing it. Storify is a multi-modal tool that provides design
teams with an experiential approach towards designing interactive products by
incorporating dramaturgical techniques from film and sequential art.

1.3 Thesis Organisation


The rest of this thesis is organized as follows. Chapter 2 explains
the social curation service, our target data source, and the details of the dataset,
and reviews related work. Chapter 3 is devoted
to the formulation of predicting view counts of a curation list. Chapter 4
describes the experiments and the evaluation of our results. Chapter 5 concludes
the thesis with a discussion of future work.


Chapter 2
Literature review
2.1 Social Curation

2.1.1 Definition

The word “curate” is defined as selecting, organizing, and looking after the items
in a collection or exhibition. The word is derived from the Latin root “curare” (“to
cure”), which means “to care”. Curation involves assembling, managing, and presenting
some type of collection. For example, curators of art galleries and museums
research, select, and acquire pieces for their institutions’ collections and oversee
interpretation, displays, and exhibits. Social curation is the collaborative oversight of
collections of web content organized around types of content, as on Pinterest (a
site for sharing and organizing images) and Storify (a site for collecting and publishing
stories). Social media has given rise to two new figures, or at least two new
names for some of the most common behaviors in this environment: content curators
and content creators. Content is something to be expressed through some
medium, such as speech, writing, or any of various arts, for self-expression, distribution,
marketing, or publication. We first give a brief explanation of content creation
and content curation.
On the one hand, content creation is the contribution of information to any media,
and most especially to digital media, for an end user or audience in specific
contexts. Typical forms of content creation include maintaining and updating web
sites, blogging, photography, videography, online commentary, the maintenance of
social media accounts, and the editing and distribution of digital media. A Pew survey
described content creation as the creation of "the material people contribute to
the online world" (Horrigan, 2004). On the other hand, content curation is not a
new phenomenon. Museums and galleries have curators to select items for collection
and display. Content curation is the process of collecting, organizing, and displaying
information relevant to a particular topic or area of interest.

Statistics suggest that the vast majority of users passively consume content:
they never create or share. A small portion act as content curators, filtering
out the best content from others. An even smaller minority are content creators, who
produce new original content that is shared by content curators and consumed by users.
On the Internet, and thanks to social media, these proportions are changing: more people
create, and almost all, or at least a very large share, share content. Still, these
behaviors remain clearly visible. In defining
our own behavior on social media we can choose one profile or the other, but at the
professional level it is worth understanding the differences and benefits of each.

Creating original, high-quality content on a topic at a steady pace is one of the most
arduous tasks, but it often brings great rewards: we attract interested people and build
a community around us, and we obtain recommendations from other users, which
act as votes of confidence in the social economy. Content curators, on the other hand,
need little creative work or composition but a great deal of monitoring
and information processing. As many examples show, they can also become highly relevant
if they manage to build a community around the information of interest that they share
daily on a topic. The content curator is the social network user profile that usually
reads a lot and shares a lot: with the Share option on Facebook, by republishing on their
own wall the posts of others that they find most interesting, by retweeting the best
of what they read on Twitter, and likewise on Pinterest, LinkedIn, Instagram, Tumblr, and
so on. These users first read a lot, then select with an initial filter, and finally share
the best selection. Hence the title of curator: they take care over the content they
finally share, helping to combat the current infoxication. Thus, if we carefully choose
the best content curators to follow for the topics we want to be informed about, we
consume better and more interesting information in less time.


2.1.2 Social Curation Service

Social networks are spaces for dialog and conversation that have grown into ubiquitous information exchanges. Youth today refer to social networks, aggregators, and
mobile apps for most of their information instead of singling out specific media for
news, politics, personal communication, and leisure. In turn, social networks have
provided new functions that help users curate information in meaningful and productive ways. Social curation involves aggregating, organizing, and sharing the content
created by others to add context, narrative, and meaning. Artists, changemakers, and
organizations use social curation to showcase the full range of conversations around
a topic, add more nuance to their own original content, and crowdsource content
from their community members. The rise of social curation can be attributed to
three broad trends.
• Firstly, people are creating a constant stream of social media content, including
updates, location check-ins, blog posts, photos, and videos.
• Secondly, people are using their social networks to filter relevant content by
following others who share similar interests.
• Thirdly, social media platforms are also curating content by giving curation
tools to users (YouTube playlists, Flickr galleries, Amazon lists, Foodspotting
guides), using editors and volunteers (YouTube Politics, Tumblr Tags), or using algorithms (YouTube Trends, Autogenerated YouTube channels, LinkedIn
Today).
One notable trend in Social Networking Service (SNS)-related research is agglomerating
multiple information sources or services to obtain a deeper understanding of
social media content. For example, Mejova and Srinivasan employ a domain adaptation technique
for sentiment analysis in three different social media streams: weblogs, review articles,
and tweets on Twitter (Mejova and Srinivasan, 2012). Hu et al. extend a topic model
(Blei et al., 2003) to associate tweets with real events in order to
discover topical segmentation in an event (Hu et al., 2012). Kulshrestha et al. studied
the impact of offline geolocations on online social network activities and participants
(Kulshrestha et al., 2012). However, the first two studies focus on a single modality,
namely text-based datasets. In this thesis, we employ the social curation service as a
complementary information source for the automatic understanding and mining of content
in social networks. This is closer to (Kulshrestha et al., 2012) in the sense that the
information source is cross-modal: a social network structure with offline geographical
information in their case, and social curation lists associated with stories in ours.

Figure 2.1: Content Creators
To the best of our knowledge, only a few studies deal with social
curation services, such as the work by Duh et al. (Duh et al., 2012). That paper analyzed
curation lists consisting of Twitter messages (tweets). The authors also studied the
objectives and topics of curation lists, and reported that there are many styles and
usages among social curation services. The difference from our work is again in the
modality: their focus was unimodal, mainly on text messages (i.e., tweets). In our work,
we extract various kinds of information (features) from a curation list to understand
and evaluate the quality of the data by predicting its popularity.
To be more specific, users involved in a social curation service are classified into
three types, as shown in Figure 2.2 (Duh et al., 2012). First, content creators generate social
media content (or simply, content) that is posted to social networking services.
Formats and domains of the content are diverse: text messages like tweets, photos
taken by mobile phones, weblogs, movies, and so on. Second, curators collect and
evaluate this posted content, and re-organize it to form compound content (called
a curation, a summary, or a curation list) based on the opinions, perspectives, and
interests of the curators. Usually, a curation list is created by one user. However,
some curation lists are generated through the interaction of multiple curators. Third,
content consumers enjoy, share, and consume social media content created by content
creators, as well as content expressed by the curation lists. Note that a user can be a
content creator, curator, and content consumer at the same time.

Figure 2.2: Content Curators
As a result, a number of niche social curation platforms have emerged to enable
people to curate different types of content, including links, photos, sounds, and
videos. We should emphasize that each curation list is a kind of loosely supervised
but organized social dataset. This means that social media items in the same curation
list are expected to share the same context to a certain degree: a curation list is
manually generated to fully convey one idea to the consumer. This is a very distinct
characteristic compared to other social media that are unorganized in many cases.


2.2 Storify

Storify is the best-known website for telling stories by curating
social media. Storify was launched in September 2010, and accounts were
invitation-only until April 2011. The site is now open to everyone, and users only
need a Twitter account. Storify provides a function to filter out poor content and
unreliable sources. If social media changes or misinterprets context, Storify can help
curators put it back together again (Fincham, 2011). Storify allows curators to embed
dynamic images, text, tweets, and even Facebook status updates, and then knit
these all together with background and context provided by the storyteller. It is an
engaging way for us to learn how to work out what is true and what is speculation.
We have also found that using Twitter has taught us how to look for sources
and news, and Storify has helped teach us how to think about and write context and
narrative. Each story is a curation list that shares some characteristics: it is manually
collected (bundling a collection of content from diverse sources), manually selected
(re-organizing the content to give one’s own perspective), and manually maintained
(publishing the resulting story for consumers).

Figure 2.3: Example of a Storify list

The Storify data is in the form of lists of Twitter messages. An example of a list
is shown in Figure 2.3. A list of tweets corresponds to what we call a story, which
represents a manually filtered and organized bundle. Lists in Storify draw on Twitter
as their source. A list may be created individually in private or collaboratively in
public, as determined by the initial curator. In the Storify curation interface, the
curator begins the list curation process by looking through his or her Twitter timeline
(tweets from users that he or she follows), or by directly searching tweets via relevant
words or hashtags. The curator can drag and drop these tweets into a list, reorder
them freely, and also add annotations such as a list header and in-place comments.


Table 2.1: Statistics of curated domains
Domain      Number of Elements   Proportion
Twitter              8,514,006        75.5%
Storify              1,206,794        10.7%
YouTube                190,611         1.7%
Facebook               169,361         1.5%
Instagram              155,762         1.4%
Flickr                 127,192         1.2%
Others                 920,089           8%

Table 2.2: Element types
Types      Number   Proportion
Quote   7,715,616        68.4%
Text    1,195,625        10.6%
Image   1,436,673        12.7%
Video     206,265         1.8%
Link      732,096         6.5%

We first provide some statistics to give a feel for the curation data. We collected
all the data from 2010 to April 2013, which amounted to 63,419 users and
352,540 stories. This corresponds to a total of 11,283,815 elements from various
domains. Table 2.1 describes the domains used in the stories. Twitter is
the largest source with more than 75% of the elements, and Flickr is the smallest
named source with 1.2% of the elements. The statistics of the element types are shown in
Table 2.2. The five types of elements in stories are quote, text, image, video, and
link. Because Storify users use a huge number of tweets, quote elements account
for a large share of nearly 70%. Media content such as images and
videos makes up approximately 15%. Text elements are written by the curator to
add more information, explain, or link elements. The Storify API provides the four
main actions shown in Table 2.3. The Storify website allows users to comment on
each element or on a story as a whole. However, the average numbers of comments,
element comments, and similar actions are quite small. Therefore, approaches that rely
on user comments and actions (Ahmed et al., 2013) are not suitable for this dataset.
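The summary figures quoted above can be re-derived from the raw counts. The following is a minimal sketch, not part of the original analysis: the counts are copied from Table 2.1 and Table 2.3, and the variable and function names are illustrative assumptions.

```python
# Element counts per source domain, copied from Table 2.1.
domain_counts = {
    "Twitter": 8_514_006,
    "Storify": 1_206_794,
    "YouTube": 190_611,
    "Facebook": 169_361,
    "Instagram": 155_762,
    "Flickr": 127_192,
    "Others": 920_089,
}

# Story and view totals, copied from the text and Table 2.3.
TOTAL_STORIES = 352_540
TOTAL_VIEWS = 642_666_347


def proportions(counts):
    """Return each domain's share of all elements, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


total_elements = sum(domain_counts.values())
shares = proportions(domain_counts)
views_per_story = TOTAL_VIEWS / TOTAL_STORIES

for name, share in shares.items():
    print(f"{name:<10} {share:5.1f}%")
print(f"total elements: {total_elements:,}")
print(f"average views per story: {views_per_story:.0f}")
```

The domain counts sum to the 11,283,815 elements reported in the text, which is a quick consistency check on the reconstructed table.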


Table 2.3: Storify action statistics
Action                  Number   Average
Views              642,666,347   1823 per story
Comments                21,306   0.06 per story
Element comments        21,133   0.002 per element
Likes                  206,265   0.12 per story

2.3 Related Work

Several studies have investigated social curation as a new source for data mining.
Pinterest is the most popular website for sharing images and video, and the third
most popular social network in the US behind Facebook and Twitter. The website
is built around the activity of collecting digital images and videos and pinning them
to a pin board. Each pin is essentially a visual bookmark, and the pin boards are
thematic collections of the bookmarks, where context is added to the collected
information. Hall and Zarro described some of the user actions on Pinterest and created
a dataset to find the pin content of Pinterest users across a wide variety of subject
areas (Hall and Zarro, 2012). Besides sites that only curate images or video, other sites
curate status updates, comments, and news sources to write blogs and stories. Storyful is a social
media news agency established in 2010 with the aim of filtering news, or newsworthy
content, from the vast quantities of noisy data on social networks such as Twitter
and YouTube. Storyful invests considerable time in the manual curation of content
on these networks. This sounds more or less like the same goal as Storify’s, but there
is one important difference: Storyful aims to deliver content to news organizations,
whereas Storify is more of a tool for journalists. It allows journalists to use its
template to write stories that include relevant tweets and Facebook posts without losing
the original formatting or links. Journalists can create interactive stories with clear
links to original pictures or tweets. Greene et al. proposed a variety of criteria for
generating user list recommendations based on content analysis, network analysis,
and the “crowdsourcing” of existing user lists (Greene et al., 2012). In addition, the
Togetter website is a rapidly growing social curation website in Japan. Togetter
averaged more than 4 million user-views per month in 2011. The Togetter curation
data mainly exist in the form of lists of Twitter messages. Ishiguro et al. used
Togetter data for the automatic understanding and mining of images (Ishiguro et al.,
2012), and a system was built (Duh et al., 2012) that suggests new tweets to increase
the curator’s productivity and breadth of perspective. Our research examines another
social curation website, Storify. The structure of a Storify list is quite similar
to that of a Togetter list. The only difference is the language: the common language
of Togetter is Japanese, whereas that of Storify is English. However, we are interested
in another aspect, namely the quality of the curation lists made by users.
The problem of predicting the popularity of online content is to estimate how much
attention the content will ultimately receive. Research shows that user attention is
allocated in a rather asymmetric way, with most content getting only a few views and
downloads, whereas a few items receive a significant amount of user attention; thus,
filtering this content will save viewers much time. There are different ways to quantify
the attention that content receives. Many researchers use the number of views as
the popularity of online content, for example on YouTube (Szabo and
Huberman, 2010), Vimeo (Ahmed et al., 2013), and Flickr (van Zwol et al., 2010).
Alternatively, popularity is represented
by users’ actions, such as the number of user votes on Digg (Jamali and Rangwala, 2009)
or the number of retweets on Twitter (Hong et al., 2011). Moreover, others formulate
the problem as predicting the change in the number of views that content receives over time.
Predicting the popularity of news articles is a complex and difficult task and
different prediction methods and strategies have been proposed in several recent
studies (Szabo and Huberman, 2010) (Tsagkias et al., 2010). The common point
of all these methods is that they focus on predicting the exact attention that an
article will generate in the near future. First, some researchers have studied features
that describe the underlying social network of the users and contents that can be
leveraged to predict popularity (Hogg and Lerman, 2012) (Jamali and Rangwala,
2009) (Lerman and Hogg, 2010) (Tsagkias et al., 2010). The authors in (Kim et al.,
2011) (Lee et al., 2010) (Lee et al., 2012) (Tsagkias et al., 2010) studied features
that take into account the comments found in blogs to predict popularity. However,
few other works forecast a value for the actual popularity of individual content. Lee
et al. used survival analysis to evaluate the probability that a given content receives
more than some x number of hits (Lee et al., 2010) (Lee et al., 2012). Hong et al.
developed a coarse multi-class classifier-based approach to determine whether given
Twitter hashtags are retweeted x ≤ (0; 100; 10000; ∞) times (Hong et al., 2011).
Similarly, Lakkaraju and Ajmera used support vector machines (SVMs) to predict
whether a given content falls into a group that attracts x ≤ (10%; 25%; 50%; 75%;
100%) of the attention in a system (Lakkaraju and Ajmera, 2011), while Jamali and
Rangwala predicted the popularity of content by using an entropy measure (Jamali
and Rangwala, 2009). Finally, Szabo and Huberman presented a linear regression
model based on the number of views (Szabo and Huberman, 2010); this method was
later used to build popularity predictors by applying regression to different feature
spaces (Bandari et al., 2012) (Hogg and Lerman, 2012) (Lerman and Hogg, 2010)
(Tsagkias et al., 2010).
In this work, the popularity of Social Curation is shown by the number of views
that the content will receive in the near future. We propose three groups for categorizing the popularity level of Social Curation. We build a predictor based on a
machine learning method, SVM, with feature selection to classify into these groups.


Chapter 3
Predicting the Popularity of Social Curation

3.1 Problem Formulation

3.1.1 Regression

Formally, we predict the view count $y_i$ of content $i$ from information about the content,
$x_i$. This is a typical regression problem: we try to minimize the error between
the predicted view count $\hat{y}_i$ and the true view count $y_i$ by adjusting an unknown
parameter $w$ that governs the regression function $\hat{y}_i = f(x_i; w)$.
Given content and social curation lists, we extract several features $x_i$ and predict
a view count for each content item. Social curation lists contain many kinds of information
that are useful for predicting view counts.
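One common way to make "minimizing the error" concrete is a mean squared error objective over the training items. The formula below is only an illustrative assumption: the text does not state the training loss, and the symbol $N$ (the number of training items) is introduced here. The list of tables does indicate that regression quality is reported as MSE, which is what this expression measures; a support vector regressor would typically be trained with an $\epsilon$-insensitive loss instead.

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - f(x_i; w) \bigr)^2
```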

3.1.2 Classification

Similar to ordinary content, the popularity of social curation is defined by the
number of user views. We predict how many views stories will receive in the
near future. However, it is difficult to predict the exact amount of attention, and people
are mostly interested in whether content is popular; thus, instead of predicting the exact
number, we cast the task as a multi-class classification problem that predicts the
popularity level that a curation list will reach after three months, based on the number of
views. Although our system cannot predict the exact amount of attention, it
helps users distinguish popular content from unpopular content.
We divide the number of views into three classes: class 1 (not popular),
with fewer than 10 views; class 2 (less popular), with
between 10 and 1000 views; and class 3 (very popular), with more
than 1000 views.
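The three classes can be written as a small labelling function. This is a sketch under one assumption: the text does not say whether the boundaries 10 and 1000 are inclusive, so both are placed in class 2 here. The function name is illustrative.

```python
def popularity_class(view_count: int) -> int:
    """Map a view count to one of the three popularity classes.

    Assumption: "between 10 and 1000" is read as inclusive on both ends.
    """
    if view_count < 0:
        raise ValueError("view_count must be non-negative")
    if view_count < 10:
        return 1  # not popular
    if view_count <= 1000:
        return 2  # less popular
    return 3      # very popular


if __name__ == "__main__":
    for views in (0, 9, 10, 1000, 1001):
        print(views, "-> class", popularity_class(views))
```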
We used an SVM to classify these classes. LibSVM (Chang and Lin, 2011) with a
radial basis function (RBF) kernel and default parameters was used, together with the
feature selection tool (wei Chen, 2005), to optimize the result. We extracted three types
of features, namely curator features, curation features, and text features. Curator
features are features of the users who collect and organize elements from some domains
and create curation lists. Curation features are features related to the content of the
curation lists. Text features cover all the text content of the curation lists.
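For readers unfamiliar with the RBF kernel, the sketch below shows the similarity function it computes between two feature vectors, K(x, z) = exp(-gamma * ||x - z||^2). It is a plain-Python illustration, not the LibSVM implementation. The fallback gamma = 1 / (number of features) mirrors the default documented for LibSVM's command-line tools, and treating it as the setting used in this work is an assumption.

```python
import math
from typing import Optional, Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float],
               gamma: Optional[float] = None) -> float:
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    if gamma is None:
        # Assumed default: 1 / number of features.
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical vectors
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # vectors further apart
```

Because the kernel depends on raw distances, features on very different scales (for example follower counts versus binary language flags) are usually rescaled before training; whether that was done here is not stated.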

3.2 Feature Extraction

Social curation lists contain many kinds of information that are useful for classification.
For example, if the curation list includes many Twitter elements, the view
count of the list is expected to increase; or, if the elements match the context of
the curation list, the list will attract much more attention.
In this study, as the social curation lists include a large number of Twitter messages,
we used features that are applicable for predicting the number of retweets and
microblogging popularity. We divided the features into the three distinct sets mentioned
above: curator features (which are related to the author of the story), curation
features (which encompass various statistics of the content in the story), and text
features.

3.2.1 Curator features

The following are the five curator features:
(i) The number of users who follow the curator of the content
(ii) The number of users who the curator of the content follows
(iii) The number of stories written by the curator
(iv) The user’s language (English or not)
(v) When the curator of the content started using Storify
These features were selected from the content creator features proposed by Ishiguro
et al. (Ishiguro et al., 2012). We implemented these features as our baseline system.
The number of followers and friends has been consistently shown to be a good
indicator of retweetability, whereas the number of stories has not been found to have
a significant impact (Suh et al., 2010). Our prior analysis also showed that stories
written in English are more likely to be viewed, so we used a binary feature indicating
if the user’s language is English. The date when a curator started using Storify
reflects their experience. Normally, long-time users have more experience and produce
more popular stories than new users do. We are not aware of any prior work
that analyzes the effect of language or date on content popularity.
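The five curator features can be assembled into a numeric vector as sketched below. Everything about the data layout is an assumption made for illustration: the `Curator` record, its field names, the language-code check, and the choice to encode feature (v) as the number of days between the join date and a reference date are hypothetical and do not come from the Storify API or the original implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Curator:
    """Hypothetical record holding the raw curator attributes."""
    followers: int   # (i) users who follow the curator
    following: int   # (ii) users the curator follows
    stories: int     # (iii) stories written by the curator
    language: str    # (iv) language code, e.g. "en"
    joined: date     # (v) date the curator started using Storify


def curator_features(curator: Curator, reference: date) -> List[float]:
    """Build the five-dimensional curator feature vector."""
    is_english = 1.0 if curator.language.lower().startswith("en") else 0.0
    days_active = float((reference - curator.joined).days)
    return [
        float(curator.followers),
        float(curator.following),
        float(curator.stories),
        is_english,
        days_active,
    ]


if __name__ == "__main__":
    example = Curator(120, 80, 15, "en", date(2011, 4, 1))
    print(curator_features(example, reference=date(2013, 4, 1)))
```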

3.2.2 Curation features

The following are the seven curation features:
(i) The number of hashtags
(ii) The number of versions
(iii) The number of embeds
(iv) The story’s language (English or not)
(v) The ratio of popular tweet elements to total elements (a tweet is popular if it has
more than 100 retweets)
(vi) The ratio of popular image and video elements to total elements (an image or video
is popular if it has more than 1000 views)
(vii) The total number of elements
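Features (v) to (vii) in the list above are simple counts and ratios over the elements of a story, and can be sketched as follows. The `Element` record and its field names are hypothetical, and treating elements of type "quote" as tweets is an assumption based on the description of Table 2.2; the thresholds 100 and 1000 are the ones given in the list.

```python
from dataclasses import dataclass
from typing import List, Tuple

RETWEET_THRESHOLD = 100   # a tweet is popular above this many retweets
VIEW_THRESHOLD = 1000     # an image or video is popular above this many views


@dataclass
class Element:
    """Hypothetical story element; `kind` follows the types in Table 2.2."""
    kind: str          # "quote", "text", "image", "video" or "link"
    retweets: int = 0  # only meaningful for tweets
    views: int = 0     # only meaningful for images and videos


def popularity_ratios(elements: List[Element]) -> Tuple[float, float, int]:
    """Return features (v), (vi) and (vii) for one story."""
    total = len(elements)
    if total == 0:
        return 0.0, 0.0, 0
    popular_tweets = sum(
        1 for e in elements
        if e.kind == "quote" and e.retweets > RETWEET_THRESHOLD
    )
    popular_media = sum(
        1 for e in elements
        if e.kind in ("image", "video") and e.views > VIEW_THRESHOLD
    )
    return popular_tweets / total, popular_media / total, total


if __name__ == "__main__":
    story = [
        Element("quote", retweets=150),
        Element("quote", retweets=20),
        Element("image", views=5000),
        Element("text"),
    ]
    print(popularity_ratios(story))
```

Returning zeros for an empty story avoids a division by zero; how empty stories were handled in the original experiments is not stated.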

