Big data technologies and applications


Borko Furht · Flavio Villanustre

Big Data Technologies and Applications

Borko Furht
Department of Computer and Electrical Engineering and Computer Science
Florida Atlantic University
Boca Raton, FL
USA

Flavio Villanustre
LexisNexis Risk Solutions
Alpharetta, GA
USA

ISBN 978-3-319-44548-9
ISBN 978-3-319-44550-2 (eBook)
DOI 10.1007/978-3-319-44550-2

Library of Congress Control Number: 2016948809
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book covers the leading edge of big data systems, architectures, and applications.
Big data computing refers to capturing, managing, analyzing, and understanding data at
volumes and rates that push the frontiers of current technologies. The challenge of big
data computing is to provide hardware architectures and related software systems and
techniques capable of transforming ultra-large data into valuable knowledge. Big data and
data-intensive computing demand a fundamentally different set of principles than mainstream
computing. Big data applications are typically well suited to large-scale parallelism over
the data and also require an extremely high degree of fault tolerance, reliability, and
availability. In addition, most big data applications require relatively fast response. The
objective of this book is to introduce the basic concepts of big data computing and then to
describe the total solution to big data problems developed by LexisNexis Risk Solutions.
This book consists of three parts comprising 15 chapters. Part I, Big Data Technologies,
includes chapters that introduce big data concepts and techniques, big data analytics and
related platforms, and visualization and deep learning techniques for big data. Part II,
LexisNexis Risk Solution to Big Data, focuses on specific technologies and techniques
developed at LexisNexis to solve critical problems that use big data analytics. It covers the
open source High Performance Computing Cluster (HPCC Systems®) platform and its
architecture, as well as the parallel data languages ECL and KEL, developed to solve big
data problems effectively. Part III, Big Data Applications, describes various data-intensive
applications solved on HPCC Systems. These include cyber security and social network
analytics, with case studies of insurance fraud, prescription drug fraud, and Medicaid
fraud, among others. Other HPCC Systems applications described include Ebola spread
modeling using big data analytics, and unsupervised learning and image classification.
With the dramatic growth of data-intensive computing and systems and big data analytics,
this book can be the definitive resource for persons working in this field as researchers,
scientists, programmers, engineers, and users. This book is intended for a wide variety of
people, including academicians, designers, developers, educators, engineers, practitioners,
researchers, and graduate students. This book can also be beneficial for business managers,
entrepreneurs, and investors.
The main features of this book can be summarized as follows:
1. This book describes and evaluates the current state of the art in the field of big
data and data-intensive computing.
2. This book focuses on the LexisNexis platform and its solutions to big data.
3. This book describes real-life solutions to big data analytics.
Borko Furht, Boca Raton, FL, USA
Flavio Villanustre, Alpharetta, GA, USA
2016

Acknowledgments

We would like to thank a number of contributors to this book. The LexisNexis
contributors include David Bayliss, Gavin Halliday, Anthony M. Middleton, Edin
Muharemagic, Jesse Shaw, Bob Foreman, Arjuna Chala, and Flavio Villanustre.
The Florida Atlantic University contributors include Ankur Agarwal, Taghi
Khoshgoftaar, DingDing Wang, Maryam M. Najafabadi, Abhishek Jain, Karl
Weiss, Naeem Seliya, Randall Wald, and Borko Furht. The other contributors
include I. Itauma, M.S. Aslan, and X.W. Chen from Wayne State University;
Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao, and Athanasios V. Vasilakos
from Lulea University of Technology in Sweden; and Ekaterina Olshannikova,
Aleksandr Ometov, Yevgeni Koucheryavy, and Thomas Olsson from Tampere
University of Technology in Finland.
Without their expertise and effort, this book would never have come to fruition.
The Springer editors and staff also deserve our sincere recognition for their support
throughout the project.

Contents

Part I Big Data Technologies

1 Introduction to Big Data
Borko Furht and Flavio Villanustre
Concept of Big Data
Big Data Workflow
Big Data Technologies
Big Data Layered Architecture
Big Data Software
Splunk
LexisNexis’ High-Performance Computer Cluster (HPCC)
Big Data Analytics Techniques
Clustering Algorithms for Big Data
Big Data Growth
Big Data Industries
Challenges and Opportunities with Big Data
References

2 Big Data Analytics
Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao and Athanasios V. Vasilakos
Introduction
Data Analytics
Data Input
Data Analysis
Output the Result
Summary
Big Data Analytics
Big Data Input
Big Data Analysis Frameworks and Platforms
Researches in Frameworks and Platforms
Comparison Between the Frameworks/Platforms of Big Data
Big Data Analysis Algorithms
Mining Algorithms for Specific Problem
Machine Learning for Big Data Mining
Output the Result of Big Data Analysis
Summary of Process of Big Data Analytics
The Open Issues
Platform and Framework Perspective
Input and Output Ratio of Platform
Communication Between Systems
Bottlenecks on Data Analytics System
Security Issues
Data Mining Perspective
Data Mining Algorithm for Map-Reduce Solution
Noise, Outliers, Incomplete and Inconsistent Data
Bottlenecks on Data Mining Algorithm
Privacy Issues
Conclusions
References

3 Transfer Learning Techniques
Karl Weiss, Taghi M. Khoshgoftaar and DingDing Wang
Introduction
Definitions of Transfer Learning
Homogeneous Transfer Learning
Instance-Based Transfer Learning
Asymmetric Feature-Based Transfer Learning
Symmetric Feature-Based Transfer Learning
Parameter-Based Transfer Learning
Relational-Based Transfer Learning
Hybrid-Based (Instance and Parameter) Transfer Learning
Discussion of Homogeneous Transfer Learning
Heterogeneous Transfer Learning
Symmetric Feature-Based Transfer Learning
Asymmetric Feature-Based Transfer Learning
Improvements to Heterogeneous Solutions
Experiment Results
Discussion of Heterogeneous Solutions
Negative Transfer
Transfer Learning Applications
Conclusion and Discussion
Appendix
References

4 Visualizing Big Data
Ekaterina Olshannikova, Aleksandr Ometov, Yevgeni Koucheryavy and Thomas Olsson
Introduction
Big Data: An Overview
Big Data Processing Methods
Big Data Challenges
Visualization Methods
Integration with Augmented and Virtual Reality
Future Research Agenda and Data Visualization Challenges
Conclusion
References

5 Deep Learning Techniques in Big Data Analytics
Maryam M. Najafabadi, Flavio Villanustre, Taghi M. Khoshgoftaar, Naeem Seliya, Randall Wald and Edin Muharemagic
Introduction
Deep Learning in Data Mining and Machine Learning
Big Data Analytics
Applications of Deep Learning in Big Data Analytics
Semantic Indexing
Discriminative Tasks and Semantic Tagging
Deep Learning Challenges in Big Data Analytics
Incremental Learning for Non-stationary Data
High-Dimensional Data
Large-Scale Models
Future Work on Deep Learning in Big Data Analytics
Conclusion
References

Part II LexisNexis Risk Solution to Big Data

6 The HPCC/ECL Platform for Big Data
Anthony M. Middleton, David Alan Bayliss, Gavin Halliday, Arjuna Chala and Borko Furht
Introduction
Data-Intensive Computing Applications
Data-Parallelism
The “Big Data” Problem
Data-Intensive Computing Platforms
Cluster Configurations
Common Platform Characteristics
HPCC Platform
HPCC System Architecture
HPCC Thor System Cluster
HPCC Roxie System Cluster
ECL Programming Language
ECL Features and Capabilities
ECL Compilation, Optimization, and Execution
ECL Development Tools and User Interfaces
ECL Advantages and Key Benefits
HPCC High Reliability and High Availability Features
Conclusion
References

7 Scalable Automated Linking Technology for Big Data Computing
Anthony M. Middleton, David Bayliss and Bob Foreman
Introduction
SALT—Basic Concepts
SALT Process
Specification File Language
SALT—Applications
Data Profiling
Data Hygiene
Data Source Consistency Checking
Delta File Comparison
Data Ingest
Record Linkage—Process
Record Matching Field Weight Computation
Generating Specificities
Internal Linking
External Linking
Base File Searching
Remote Linking
Attribute Files
Summary and Conclusions
References

8 Aggregated Data Analysis in HPCC Systems
David Bayliss
Introduction
The RDBMS Paradigm
The Reality of SQL
Normalizing an Abnormal World
A Data Centric Approach
Data Analysis
Case Study: Fuzzy Matching
Case Study: Non-obvious Relationship Discovery
Conclusion

9 Models for Big Data
David Bayliss
Structured Data
Text (and HTML)
Semi-structured Data
Bridging the Gap—The Key-Value Pair
XML—Structured Text
RDF
Data Model Summary
Data Abstraction—An Alternative Approach
Structured Data
Text
Semi-structured Data
Key-Value Pairs
XML
RDF
Model Flexibility in Practice
Conclusion

10 Data Intensive Supercomputing Solutions
Anthony M. Middleton
Introduction
Data-Intensive Computing Applications
Data-Parallelism
The “Data Gap”
Characteristics of Data-Intensive Computing Systems
Processing Approach
Common Characteristics
Grid Computing
Data-Intensive System Architectures
Google MapReduce
Hadoop
LexisNexis HPCC
Programming Language ECL
Hadoop Versus HPCC Comparison
Terabyte Sort Benchmark
Pig Versus ECL
Architecture Comparison
Conclusion
References

11 Graph Processing with Massive Datasets: A Kel Primer
David Bayliss and Flavio Villanustre
Introduction
Motivation
Background
The Open Source HPCC Systems Platform Architecture
KEL—Knowledge Engineering Language for Graph Problems
KEL—A Primer
Proposed Solution
Data Primitives with Graph Primitive Extensions
Generated Code and Graph Libraries
KEL Compiler
KEL Language—Principles
KEL Language—Syntax
KEL—The Summary
KEL Present and Future
References

Part III Big Data Applications

12 HPCC Systems for Cyber Security Analytics
Flavio Villanustre and Mauricio Renzi
The Advanced Persistent Threat
LexisNexis HPCC Systems for Deep Forensic Analysis
Pre-computed Analytics for Cyber Security
The Benefits of Pre-computed Analytics
Deep Forensics Analysis
Conclusion

13 Social Network Analytics: Hidden and Complex Fraud Schemes
Flavio Villanustre and Borko Furht
Introduction
Case Study: Insurance Fraud
Case Study: Fraud in Prescription Drugs
Case Study: Fraud in Medicaid
Case Study: Network Traffic Analysis
Case Study: Property Transaction Risk

14 Modeling Ebola Spread and Using HPCC/KEL System
Jesse Shaw, Flavio Villanustre, Borko Furht, Ankur Agarwal and Abhishek Jain
Introduction
Survey of Ebola Modeling Techniques
Basic Reproduction Number (R0)
Case Fatality Rate (CFR)
SIR Model
Improved SIR (ISIR) Model
SIS Model
SEIZ Model
Agent-Based Model
A Contact Tracing Model
Spatiotemporal Spread of 2014 Outbreak of Ebola Virus Disease
Quarantine Model
Global Epidemic and Mobility Model
Other Critical Issues in Ebola Study
Delays in Outbreak Detection
Lack of Public Health Infrastructure
Health Worker Infections
Misinformation Propagation in Social Media
Risk Score Approach in Modeling and Predicting Ebola Spread
Beyond Compartmental Modeling
Physical and Social Graphs
Graph Knowledge Extraction
Graph Propagation
Mobile Applications Related to Ebola Virus Disease
ITU Ebola—Info—Sharing
Ebola Prevention App
Ebola Guidelines
About Ebola
Stop Ebola WHO Official
HealthMap
#ISurvivedEbola
Ebola Report Center
What is Ebola
Ebola
Stop Ebola
Virus Tracker
Ebola Virus News Alert
Sierra Leone Ebola Trends
The Virus Ebola
MSF Guidance
Novarum Reader
Work Done by Government
Innovative Mobile Application for Ebola Spread
Registering a New User
Login the Application
Basic Information
Geofencing
Web Service Through ECL
Conclusion
References

15 Unsupervised Learning and Image Classification in High Performance Computing Cluster
I. Itauma, M.S. Aslan, X.W. Chen and Flavio Villanustre
Introduction
Background and Advantages of HPCC Systems®
Contributions
Methods
Image Reading in HPCC Systems Platform
Feature Learning
Feature Extraction
Classification
Experiments and Results
Discussion
Conclusion
References

About the Authors

Borko Furht is a professor in the Department of Computer and Electrical Engineering and
Computer Science at Florida Atlantic University (FAU) in Boca Raton, Florida. He is also
the director of the NSF Industry/University Cooperative Research Center for Advanced
Knowledge Enablement. Before joining FAU, he was a vice president of research and a senior
director of development at Modcomp (Ft. Lauderdale), a computer company of Daimler Benz,
Germany; a professor at the University of Miami in Coral Gables, Florida; and a senior
researcher at the Institute Boris Kidric-Vinca, Yugoslavia. Professor Furht received his
Ph.D. degree in electrical and computer engineering from the University of
Belgrade. His current research is in multimedia systems, multimedia big data and its
applications, 3-D video and image systems, wireless multimedia, and Internet and
cloud computing. He is presently the Principal Investigator and Co-PI of several
projects sponsored by NSF and various high-tech companies. He is the author of
numerous books and articles in the areas of multimedia, data-intensive applications,
computer architecture, real-time computing, and operating systems. He is a founder
and editor-in-chief of two journals: Journal of Big Data and Journal of
Multimedia Tools and Applications. He has received several technical and publishing
awards and has been a consultant for many high-tech companies including
IBM, Hewlett-Packard, Adobe, Xerox, General Electric, JPL, NASA, Honeywell,
and RCA. He has also served as a consultant to various colleges and universities.
He has given many invited talks, keynote lectures, seminars, and tutorials. He has
served on the boards of directors of several high-tech companies.
Dr. Flavio Villanustre leads HPCC Systems® and is also VP, Technology for
LexisNexis Risk Solutions®. In this position, he is responsible for information and
physical security, overall platform strategy, and new product development. He is
also involved in a number of projects involving Big Data integration, analytics, and
Business Intelligence. Previously, he was the director
of Infrastructure for Seisint. Prior to 2001, he served in
a variety of roles at different companies, in areas including
infrastructure, information security, and information
technology. In addition, he has been involved
with the open source community for over 15 years
through multiple initiatives. Some of these include
founding the first Linux User Group in Buenos Aires
(BALUG) in 1994, releasing several pieces of software
under different open source licenses, and evangelizing
open source to different audiences through conferences, training, and education. Prior to his technology
career, he was a neurosurgeon.

Part I

Big Data Technologies

Chapter 1

Introduction to Big Data
Borko Furht and Flavio Villanustre

Concept of Big Data
In this chapter we present the basic terms and concepts in Big Data computing. Big
data is a large and complex collection of data sets that is difficult to process using
on-hand database management tools and traditional data processing applications.

Big Data topics include the following activities:
• Capture
• Storage
• Search
• Sharing
• Transfer
• Analysis
• Visualization

Big Data can also be defined using three Vs: Volume, Velocity, and Variety.
Volume refers to the size of the data, from terabytes (TB) to petabytes (PB), and to
related big data structures including records, transactions, files, and tables. Data
volumes are expected to grow 50 times by 2020.
Velocity refers to the ways of transferring big data, including batch, near real time,
real time, and streams. Velocity also includes the time and latency characteristics of data
handling. The data can be analyzed, processed, stored, and managed at a fast rate,
or with a lag time between events.
Variety refers to the different formats of data, including structured,
unstructured, and semi-structured data, and combinations of these. The data can take
the form of documents, emails, text messages, audio, images, video,
graphics data, and others.
In addition to these three main characteristics of big data, there are two additional features: Value and Veracity [1]. Value refers to the benefits obtained by
the user from the big data. Veracity refers to the quality of big data.
© Springer International Publishing Switzerland 2016
B. Furht and F. Villanustre, Big Data Technologies and Applications,
DOI 10.1007/978-3-319-44550-2_1


Table 1.1 Comparison between traditional and big data (adopted from [2])

                      Traditional data     Big data
Volume                In GBs               TBs and PBs
Data generation rate  Per hour; per day    More rapid
Data structure        Structured           Semi-structured or unstructured
Data source           Centralized          Fully distributed
Data integration      Easy                 Difficult
Data store            RDBMS                HDFS, NoSQL
Data access           Interactive          Batch or near real-time

Sources of big data can be classified into: (1) various transactions, (2) enterprise
data, (3) public data, (4) social media, and (5) sensor data. Table 1.1 illustrates the
difference between traditional data and big data.

Big Data Workflow
The big data workflow consists of the following steps, as illustrated in Fig. 1.1.
These steps are defined as:
Collection: gathering structured, unstructured and semi-structured data from multiple sources
Ingestion: loading vast amounts of data onto a single data store
Discovery and Cleansing: understanding format and content; cleaning up and formatting
Integration: linking, entity extraction, entity resolution, indexing and data fusion
Analysis: intelligence, statistics, predictive and text analytics, machine learning
Delivery: querying, visualization, and real-time delivery with enterprise-class availability
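
To make the order of the steps concrete, the following is a minimal sketch in Python that chains the six workflow stages on a handful of in-memory records. It is an illustration only: the function names, the record fields, and the sample data are hypothetical and are not part of any platform described in this book. A real system would run each stage on distributed storage and compute.

```python
from collections import Counter

# Hypothetical sketch of the six workflow steps; names and data are made up.

def collect():
    # Collection: records arrive from several sources in uneven shape
    return [
        {"source": "crm", "name": " Alice ", "city": "Boca Raton"},
        {"source": "web", "name": "alice", "city": None},
        {"source": "sensor", "name": "Bob", "city": "Alpharetta"},
    ]

def ingest(records):
    # Ingestion: load everything into a single store (here, one list)
    return list(records)

def cleanse(store):
    # Discovery and cleansing: normalize names and fill missing values
    return [
        {**r, "name": r["name"].strip().lower(), "city": r["city"] or "unknown"}
        for r in store
    ]

def integrate(store):
    # Integration: link records that refer to the same entity (by name)
    entities = {}
    for r in store:
        entities.setdefault(r["name"], []).append(r)
    return entities

def analyze(entities):
    # Analysis: a simple statistic, the number of records per entity
    return Counter({name: len(rs) for name, rs in entities.items()})

def deliver(stats):
    # Delivery: answer a query, most frequent entities first
    return stats.most_common()

result = deliver(analyze(integrate(cleanse(ingest(collect())))))
print(result)
```

Each function consumes only the output of the previous one, which mirrors the left-to-right flow of the workflow.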

Fig. 1.1 Big data workflow


Big Data Technologies
Big Data technologies are a new generation of technologies and architectures
designed to economically extract value from very large volumes of a wide variety
of data by enabling high-velocity capture, discovery, and analysis. Big Data
technologies include:
• Massively Parallel Processing (MPP)
• Data mining tools and techniques
• Distributed file systems and databases
• Cloud computing platforms
• Scalable storage systems

Big Data Layered Architecture
As proposed in [2], a big data system can be represented using a layered architecture, as shown in Fig. 1.2. The big data layered architecture consists of three
layers: (1) infrastructure layer, (2) computing layer, and (3) application layer.
The infrastructure layer consists of a pool of computing and storage resources,
including cloud computing infrastructure. These resources must meet big data demands in
terms of maximizing system utilization and satisfying storage requirements.
The computing layer is a middleware layer and includes various big data tools
for data integration, data management, and the programming model.
The application layer uses the interfaces provided by the programming models to
implement various data analysis functions, including statistical analysis, clustering,
classification, data mining, and others, and to build various big data applications.

Fig. 1.2 Layered architecture of big data (adopted from [2])


Big Data Software
Hadoop (Apache Foundation)
Hadoop is an open source software framework for storage and large-scale data processing on clusters of computers. It is used for processing, storing and analyzing large
amounts of distributed unstructured data. Hadoop consists of two components:
HDFS, a distributed file system, and MapReduce, a programming framework. In the MapReduce programming component, a large task is divided into two
phases: Map and Reduce, as shown in Fig. 1.3. The Map phase divides the large
task into smaller pieces and dispatches each small piece onto one active node in the
cluster. The Reduce phase collects the results from the Map phase and processes the
results to get the final result. More details can be found in [3].
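
The two phases can be illustrated with a small word-count sketch in plain Python. It is not Hadoop code and does not use the Hadoop API: the functions `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names chosen for this illustration, and everything runs in a single process, whereas on a cluster each Map call would be dispatched to a separate node.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one piece of the input
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    # Synchronization: group the intermediate values by key
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values collected for one key
    return key, sum(values)

def map_reduce(chunks):
    # On a real cluster each map_phase call would run on a different node
    mapped = [map_phase(chunk) for chunk in chunks]
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())

# Made-up input, already split into three pieces
chunks = ["big data big clusters", "data parallel processing", "big data"]
counts = map_reduce(chunks)
print(counts)
```

The intermediate grouping step corresponds to the synchronization stage between Map and Reduce in Fig. 1.3.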

Splunk
Splunk captures, indexes and correlates real-time data in a searchable repository from
which it can generate graphs, reports, alerts, dashboards and visualizations.

Fig. 1.3 MapReduce framework (Storage feeds parallel Map tasks; a synchronization step aggregates the intermediate results; Reduce tasks produce the Final Results)

LexisNexis’ High-Performance Computing Cluster (HPCC)
The HPCC system and its software are developed by LexisNexis Risk Solutions.
Its software architecture, shown in Fig. 1.4, is implemented on computing clusters and
provides data-parallel processing for Big Data applications. It includes ECL, a
data-centric programming language for parallel data processing. Part II
of the book focuses on the details of the HPCC system, and Part III describes various
HPCC applications.

Fig. 1.4 The architecture of the HPCC system

Big Data Analytics Techniques
We classify big data analytics into the following five categories [4]:
• Text analytics
• Audio analytics
• Video analytics
• Social media analytics
• Predictive analytics

Text analytics, or text mining, refers to the process of analyzing unstructured
text to extract relevant information. Text analytics techniques use statistical analysis, computational linguistics, and machine learning. Typical applications include
extracting textual information from social network feeds, emails, blogs, online
forums, survey responses, and news.
Audio analytics, or speech analytics, techniques are used to analyze and extract
information from unstructured audio data. Typical applications of audio analytics
are found in customer call centers and healthcare companies.
Video analytics, or video content analysis, deals with analyzing and extracting
meaningful information from video streams. Video analytics can be used in various
video surveillance applications.
Social media analytics includes the analysis of structured and unstructured data
from various social media sources including Facebook, LinkedIn, Twitter,
YouTube, Instagram, Wikipedia, and others.


Predictive analytics includes techniques for predicting future outcomes based
on past and current data. Popular predictive analytics techniques include neural networks (NNs),
support vector machines (SVMs), decision trees, linear and logistic regression, association rules, and
scorecards.
More details about big data analytics techniques can be found in [2, 4], as well as
in the chapter in this book on “Big Data Analytics.”
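
As a small illustration of predicting a future outcome from past data, the sketch below fits a straight line by ordinary least squares, the simplest form of the linear regression mentioned above, and extrapolates one step ahead. The data are made up for the example, and the helper name `fit_line` is hypothetical; production work would normally rely on a statistics or machine learning library.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up past observations: a value recorded in months 1 to 5
months = [1, 2, 3, 4, 5]
values = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, values)

# Predict the value for the next, not yet observed, month
forecast = slope * 6 + intercept
print(slope, intercept, forecast)
```

The same pattern, fitting a model on past data and then applying it to new inputs, underlies the other techniques in the list, although their models are more expressive than a straight line.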

Clustering Algorithms for Big Data
Clustering algorithms are developed to analyze large volumes of data with the main
objective of categorizing data into clusters based on specific metrics. An excellent
survey of clustering algorithms for big data is presented in [5]. The authors proposed the categorization of the clustering algorithms into the following five
categories:
• Partitioning-based algorithms
• Hierarchical-based algorithms
• Density-based algorithms
• Grid-based algorithms, and
• Model-based clustering algorithms.
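
To show what a partitioning-based algorithm does, here is a minimal k-means sketch in plain Python. k-means is used only as a familiar representative of this category; it is not one of the algorithms the survey in [5] singles out for big data, and the sample points and starting centroids are made up. The sketch assumes Python 3.8 or later for `math.dist`.

```python
import math

def k_means(points, centroids, iterations=10):
    # Repeatedly assign points to the nearest centroid, then move each
    # centroid to the mean of the points assigned to it.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # Coordinate-wise mean of the cluster members
                new_centroids.append(
                    tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                # Keep the old centroid if no point was assigned to it
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

# Made-up two-dimensional data with two visible groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Every point belongs to exactly one cluster, which is what distinguishes this family from fuzzy or model-based methods, where membership is a matter of degree or probability.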

The clustering algorithms were evaluated for big data applications with respect
to the three Vs defined earlier; the results of the evaluation are given in [5]. Based on
this evaluation, the authors proposed candidate clustering algorithms for big data that meet the
criteria relating to the three Vs.
In the case of clustering algorithms, Volume refers to the ability of a clustering
algorithm to deal with a large amount of data. Variety refers to the ability of a
clustering algorithm to handle different types of data, and Velocity refers to the
speed of a clustering algorithm on big data. In [5] the authors selected the following
five clustering algorithms as the most appropriate for big data:
• Fuzzy C-Means (FCM) clustering algorithm
• The BIRCH clustering algorithm
• The DENCLUE clustering algorithm
• Optimal Grid (OPTIGRID) clustering algorithm, and
• Expectation-Maximization (EM) clustering algorithm.

The authors also performed an experimental evaluation of these algorithms on real data
[5].


Big Data Growth
Figure 1.5 shows the forecast of big data growth by Reuter (2012). According to this
forecast, there are currently fewer than 10 zettabytes of data, and by 2020 there will be more
than 30 zettabytes of data, with the big data market growing 45 % annually.

Big Data Industries
Media and entertainment applications include digital recording, production, and
media delivery. They also include the collection of large amounts of rich content and
user viewing behaviors.
Healthcare applications include electronic medical records and images, public
health monitoring programs, and long-term epidemiological research programs.
Life science applications include low-cost gene sequencing that generates tens of
terabytes of information that must be analyzed for genetic variations.
Video surveillance applications include the analysis of big data received from cameras
and recording systems.
Applications in transportation, logistics, retail, utilities and telecommunications
include sensor data generated from GPS transceivers, RFID tag readers, smart

Fig. 1.5 Big data growth (Source Reuter 2012)
