

BIG DATA WITH
HADOOP MAPREDUCE
A Classroom Approach




Rathinaraja Jeyaraj
Ganeshkumar Pugalendhi
Anand Paul


Apple Academic Press Inc.
4164 Lakeshore Road
Burlington ON L7L 1A4
Canada

Apple Academic Press, Inc.
1265 Goldenrod Circle NE
Palm Bay, Florida 32905
USA

© 2021 by Apple Academic Press, Inc.
Exclusive worldwide distribution by CRC Press, a member of Taylor & Francis Group
No claim to original U.S. Government works
International Standard Book Number-13: 978-1-77188-834-9 (Hardcover)
International Standard Book Number-13: 978-0-42932-173-3 (eBook)


All rights reserved. No part of this work may be reprinted or reproduced or utilized in any form or by
any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and
recording, or in any information storage or retrieval system, without permission in writing from the
publisher or its distributor, except in the case of brief excerpts or quotations for use in reviews or critical
articles.
This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission and sources are indicated. Copyright for individual articles remains with the
authors as indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the authors, editors, and the publisher cannot assume responsibility
for the validity of all materials or the consequences of their use. The authors, editors, and the publisher
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged, please write and let us know so we may rectify in any future reprint.
Trademark Notice: Registered trademarks of products or corporate names are used only for explanation
and identification without intent to infringe.
Library and Archives Canada Cataloguing in Publication
Title: Big data with Hadoop MapReduce : a classroom approach / Rathinaraja Jeyaraj,
Ganeshkumar Pugalendhi, Anand Paul.
Names: Jeyaraj, Rathinaraja, author. | Pugalendhi, Ganeshkumar, author. | Paul, Anand, author.
Description: Includes bibliographical references and index.
Identifiers: Canadiana (print) 20200185195 | Canadiana (ebook) 20200185241 | ISBN 9781771888349
(hardcover) | ISBN 9780429321733 (electronic bk.)
Subjects: LCSH: Apache Hadoop. | LCSH: MapReduce (Computer file) | LCSH: Big data. |
LCSH: File organization (Computer science)
Classification: LCC QA76.9.D5 .J49 2020 | DDC 005.74—dc23

CIP data on file with US Library of Congress

Apple Academic Press also publishes its books in a variety of electronic formats. Some content that appears
in print may not be available in electronic format. For information about Apple Academic Press products,
visit our website at www.appleacademicpress.com and the CRC Press website at www.crcpress.com



About the Authors
Rathinaraja Jeyaraj
Post-Doctoral Researcher, University of Macau, Macau
Rathinaraja Jeyaraj obtained his PhD from the National Institute of Technology
Karnataka, India. He recently worked as a visiting researcher at the Connected
Computing and Media Processing Lab, Kyungpook National University, South
Korea, supervised by Prof. Anand Paul. His research interests include
big data processing tools, cloud computing, IoT, and machine learning. He
completed his BTech and MTech at Anna University, Tamil Nadu, India.
He has also earned an MBA in Information Systems and Management at
Bharathiar University, Coimbatore, India.
Ganeshkumar Pugalendhi, PhD
Assistant Professor, Department of Information Technology,
Anna University Regional Campus, Coimbatore, India
Ganeshkumar Pugalendhi, PhD, is an Assistant Professor in the Department of
Information Technology, Anna University Regional Campus, Coimbatore, India.
He received his BTech from the University of Madras, his MS (by research)
and PhD degrees from Anna University, India, and did his postdoctoral work
at Kyungpook National University, South Korea. He is the recipient of a
Student Scientist Award from the TNSCST, India; best paper awards from the
IEEE, the IET, and the Korean Institute of Industrial and Systems Engineers;
travel grants from Indian Government funding agencies such as DST-SERB (as a
Young Scientist), DBT, and CSIR; and a workshop grant from DBT.
He has visited many countries (Singapore, South Korea,
USA, Serbia, Japan, and France) for research interaction and collaboration. He is the resource person for delivering technical talks and seminars
sponsored by Indian Government Organizations like UGC, AICTE, TEQIP,
ICMR, DST and others. His research works are published in well reputed
Scopus/SCIE/SCI journals and renowned top conferences. He has written
two research-oriented textbooks: Data Classification Using Soft Computing
and Soft Computing for Microarray Data Analysis. He is a Track Chair for the
Human Computer Interface Track in ACM SAC (Symposium on Applied
Computing) for 2016 in Italy, 2017 in Morocco, 2018 in France, and 2019 in
Cyprus. He is a Guest Editor for a Taylor & Francis journal and an Inderscience
journal in 2017, a Hindawi journal in 2018, and the MDPI Journal of Sensor and
Actuator Networks in 2019. His citation count and h-index are (260, 8), (218, 7),
and (117, 6) in Google Scholar, Scopus, and Publons, respectively, as of 2020.
His research interests are in Data Analytics and Machine Learning.
Anand Paul, PhD
Associate Professor, School of Computer Science and Engineering,
Kyungpook National University, South Korea
Anand Paul, PhD, is currently working in the School of Computer Science and
Engineering at Kyungpook National University, South Korea, as Associate
Professor. He earned his PhD in Electrical Engineering from the National
Cheng Kung University, Taiwan, R.O.C. His research interests include big
data analytics, IoT, and machine learning. He has done extensive work in
big data and IoT-based smart cities. He was a delegate representing South
Korea for the M2M focus group in 2010–2012 and has been an IEEE senior
member since 2015. He is serving as associate editor for the journals IEEE
Access, IET Wireless Sensor Systems, ACM Applied Computing Reviews,
Cyber Physical Systems (Taylor & Francis), Human Behaviour and Emerging
Technology (Wiley), and the Journal of Platform Technology. He has also
guest edited various international journals. He is the track chair for smart
human computer interaction with the Association for Computing Machinery
Symposium on Applied Computing 2014–2019, and general chair for the 8th
International Conference on Orange Technology (ICOT 2020). He is also an
MPEG delegate representing South Korea.


A Message from Kaniyan
From Purananuru written in Tamil

English Translation by Reverend G.U. Pope (in 1906)
To us all towns are one, all men our kin.
Life’s good comes not from others’ gift, nor ill
Man’s pains and pains’ relief are from within.
Death’s no new thing; nor do our bosoms thrill
When Joyous life seems like a luscious draught.
When grieved, we patient suffer; for, we deem
This much-praised life of ours a fragile raft
Borne down the waters of some mountain stream
That o’er huge boulders roaring seeks the plain
Tho’ storms with lightnings’ flash from darken’d skies
Descend, the raft goes on as fates ordain.
Thus have we seen in visions of the wise!
(Puram: 192)
—Kaniyan Pungundran
Kaniyan Pungundran was an influential Tamil philosopher from the Sangam
age. His name, Kaniyan, implies that he was an astronomer, as it derives from
a Tamil word for mathematics. He was born and brought up in Mahibalanpatti,
a village panchayat in the Thiruppatur taluk of Sivaganga district in the state
of Tamil Nadu, India. He composed poems in the Sangam anthologies
Purananuru and Natrinai.




Contents
Abbreviations .................................................................................................. xi
Preface ........................................................................................................... xv
Dedication and Acknowledgment ................................................................ xvii
Introduction .................................................................................................. xix
1. Big Data ........................................................................................................1
2. Hadoop Framework ....................................................................................47
3. Hadoop 1.2.1 Installation .........................................................................113
4. Hadoop Ecosystem ...................................................................................153
5. Hadoop 2.7.0 .............................................................................................167
6. Hadoop 2.7.0 Installation .........................................................................197
7. Data Science ..............................................................................................357
APPENDIX A: Public Datasets .........................................................................371
APPENDIX B: MapReduce Exercise ...............................................................375
APPENDIX C: Case Study: Application Development for NYSE Dataset ...383
Web References ...................................................................................................391
Index ....................................................................................................................393



Abbreviations
ACID         Atomicity Consistency Isolation Durability
AsM          Applications Manager
AWS          Amazon Web Services
BR           Block Report
BSP          Bulk Synchronous Processing
CDH          Cloudera Distribution for Hadoop
CSP          Cloud Service Providers
DAG          Directed Acyclic Graph
DAS          Direct Attached Storage
DFS          Distributed File System
DM           Data Mining
DN           Data Node
DRF          Dominant Resource Fairness
DSS          Decision Support System
DWH          Data WareHouse
EMR          Elastic MapReduce
ETL          Extract Transform Load
FIFO         First In First Out
GFS          Google File System
GMR          Google MapReduce
GPU          Graphics Processing Unit
HA           High Availability
HB           HeartBeat
HDD          Hard Disk Drive
HDFS         Hadoop Distributed File System
HDT          Hadoop Development Tools
HDP          Hadoop Data Platform
HPC          High Performance Computing
HPDA         High Performance Data Analytics
HT           Hyper Threading
IDC          International Digital Corporation
IO           InputOutput
IS           Input Split
JHS          Job History Server
JN           Journal Node
JT           Job Tracker
JVM          Java Virtual Machine
LAN          Local Area Network
LHC          Large Hadron Collider
LSST         Large Synoptic Survey Telescope
MR           MapReduce
MRAppMaster  MR Application Master
MPI          Message Passing Interface
NAS          Network-Attached Storage
NCDC         National Climatic Data Centre
NFS          Network File System
NIC          Network Interface Card
NLP          Natural Language Processing
NM           Node Manager
NN           Name Node
NN HA        Name Node High Availability
NSF          National Science Foundation
NYSE         New York Stock Exchange
OLAP         OnLine Analytical Processing
P-P          Point to Point
QA           Question Answering
QJM          Quorum Journal Manager
RAID         Redundant Array of Inexpensive Disks
RDBMS        Relational Database Management System
RF           Replication Factor
RM           Resource Manager
RPC          Remote Procedure Call
RR           Record Reader
RW           Record Writer
SDSS         Sloan Digital Sky Survey
SAN          Storage Area Network
SNA          Social Network Analysis
SNN          Secondary Name Node
SPOF         Single Point of Failure
SPS          Stream Processing System
SSD          Solid State Disk
STONITH      Shoot the Other Node In the Head
TFLOPS       Tera Floating Operation Per Second
TT           Task Tracker
VM           Virtual Machine
WORM         Write Once and Read Many
WUI          Web User Interface
YARN         Yet Another Resource Negotiator
ZK           ZooKeeper
ZKFC         ZooKeeper Failover Controller



Preface
“We aim to make our readers visualize and learn big data and Hadoop Map
Reduce from scratch.”
There is a lot of content on big data and Hadoop MapReduce available on
the Internet (online lectures, websites), and excellent books are available for
intermediate-level users to master Hadoop MapReduce. But are they helpful for
beginners and non-computer-science students to understand the basics of big
data and Hadoop cluster setup, and to easily write MapReduce jobs? It takes
much time to read them or watch the lectures. Hadoop aspirants (once upon
a time, including me) find it difficult to select the right sources to begin
with. Moreover, the basic terminologies of big data and distributed computing,
and the inner working of Hadoop MapReduce and the Hadoop Distributed File
System, are not presented in a simple way, which makes the audience reluctant
to pursue them.
This motivated us to share our experience in the form of a book
to bridge the gap between inexperienced aspirants and Hadoop. We have
framed this book to provide an understanding of big data and MapReduce by
visualizing the basic terminologies and concepts with many illustrations and
worked-out examples. This book will significantly minimize the time spent
on the Internet exploring big data, the inner working of MapReduce, and single
node/multi-node installation on physical/virtual machines.
This book covers almost all the necessary information on Hadoop
MapReduce for the online certification exam. We mainly target students,
research scholars, university professors, and big data practitioners who
wish to save time while learning. Upon completing this book, it will
be easy for readers to start with other big data processing tools such as Spark,
Storm, etc., as we provide a firm grip on the basics. Ultimately, our readers
will be able to:
+ understand what big data is and the factors that influence it.
+ understand the inner working of MapReduce, which is essential for
  certification exams.
+ set up Hadoop clusters with 100s of physical/virtual machines.
+ create a virtual machine in AWS and set up Hadoop MapReduce.


+ write MapReduce jobs with Eclipse in a simple way.
+ understand other big data processing tools and their applications.
+ understand various job positions in data science.

We believe that, regardless of domain and expertise level in Hadoop
MapReduce, many will use our book as a basic manual. We provide some
sample MapReduce jobs (with the dataset) to practice simultaneously while
reading our text. Please note that it is not necessary to be an expert, but you
must have some minimal knowledge of working in Ubuntu, Java, and Eclipse
to set up a cluster and write MapReduce jobs.
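As a first taste of the programming model that the later chapters cover with the real Hadoop API, the word-count idea can be previewed in plain Java: a map step that emits (word, 1) pairs and a reduce step that sums the counts per word. This is an illustrative, single-machine sketch with class and method names of our own choosing, not Hadoop's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A plain-Java preview of the MapReduce word-count flow.
// The map phase emits (word, 1) pairs; the reduce phase sums the
// counts per word. Hadoop performs the same two steps, but
// distributed across many machines.
public class WordCountPreview {

    // Map phase: split each input line into words and emit (word, 1).
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data with hadoop", "hadoop mapreduce");
        Map<String, Integer> counts = reduce(map(input));
        System.out.println(counts); // {big=1, data=1, hadoop=2, mapreduce=1, with=1}
    }
}
```

Chapter 3 walks through the same word count as a real MapReduce job submitted to a Hadoop cluster.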
Please contact us by mail if you have any queries. We will be happy to
help you to get through the book.


Dedication and Acknowledgment
This is an excellent opportunity for me to thank Prof. V.S. Ananthanarayana
(my research supervisor, Deputy Director, National Institute of Technology
Karnataka), Dr. Ganeshkumar Pugalendhi (my post-graduate supervisor in
Anna University, Coimbatore), and Dr. Anand Paul (my research mentor
in Kyungpook National University, South Korea) for being my constant
motivation. I sincerely extend my gratitude to Prof. V.S. Ananthanarayana,
for the freedom he provided to set my goals and pursue in my style without
any restriction. It would not be an exaggeration to thank Dr. Ganeshkumar
Pugalendhi and Dr. Anand Paul for their significant contribution in shaping
and organizing the contents of this book more simply. It would not have been
possible to bring my experience out as a book without their help and support. I am so
much grateful to them. I want to thank Mr. Sai Natarajan (Director, Duratech
Solutions, Coimbatore), Dr. Karthik Narayanan (Assistant Professor, Karunya
University, Tamil Nadu), Mr. Benjamin Santhosh Raj (Data-Centre Engineer,
Wipro, Chennai), Dr. Sathishkumar (Assistant Professor, SNS College of

Technology, Coimbatore), Mr. Elayaraja Jeyaraj (Data-Centre Engineer, CGI,
Bangalore), Mr. Rajkumar for spending their time to carry out the technical
review, and Ms. Felicit Beneta, MA, MPhil, for language correction. I thank
them for contributing helpful suggestions and improvements to my drafts. I
also thank all who contributed regardless of the quantum of work directly/
indirectly. It is impossible to invest the massive time a book demands without
family support. I am indebted to my parents, Mrs. Radha Ambigai
Jeyaraj and Mr. Jeyaraj Rathinasamy, for my whole life. I am so much grateful
to my brothers, Mr. Sivaraja Jeyaraj, and Mr. Elayaraja Jeyaraj, for supporting
me financially without any expectation. Finally, I should mention my source
of inspiration right from my graduate studies until research degree, my wife,
Dr. Sujiya Rathinaraja, who consistently gave mental support all through the
tough journey. Infinite thanks to her for keeping my life green and lovable.
—Rathinaraja Jeyaraj



Introduction
This book covers the basic terminologies and concepts of big data, distributed
computing, and the inner working of MapReduce. We have emphasized Hadoop
v2 over Hadoop v1 to reflect today's trend.
Chapter 1 discusses the factors that gave rise to big data and why decision
making from digital data is essential. We have compared and contrasted
the importance of horizontal scalability over vertical scalability for big data
processing. The history of Hadoop and its features are presented along with
different big data processing frameworks.
Chapter 2 is built on Hadoop v1 to elaborate on the inner working of the
MapReduce execution cycle, which is very important to implement scalable
algorithms in MapReduce. We have given examples to understand the MapReduce execution sequence step-by-step. Finally, MapReduce weaknesses and
solutions are mentioned at the end of the chapter.

Chapter 3 completely covers single node and multi-node implementation
step-by-step with a basic wordcount MapReduce job. Some Hadoop administrative commands are given to practice with Hadoop tools.
Chapter 4 briefly introduces a set of big data processing tools in the Hadoop
ecosystem for various purposes. Once you are done with the Hadoop Distributed
File System and MapReduce, you are ready to get your hands dirty with
other tools based on your project requirements. We have given many web
links to download various big datasets for practice.
Chapter 5 takes you into Hadoop v2 by introducing YARN. You will find
it easier if you have already read Chapter 2. Therefore, we strongly
recommend spending some time on Hadoop v1, which will help you
understand why Hadoop v2 is necessary. Moreover, the Hadoop cluster and
MapReduce job configurations are discussed in detail.
Chapter 6 is a significant portion of our book. It explains Hadoop v2,
single node/multi-node installation on physical/virtual machines, running
a MapReduce job in Eclipse itself (you need not set up a real Hadoop cluster to
frequently test your algorithm), properties used to tune a MapReduce cluster
and job, the art of writing MapReduce jobs, NN high availability, Hadoop
Distributed File System federation, meta-data in NN, and finally creating a
Hadoop cluster in the Amazon cloud. You will find this chapter especially
helpful if you wish to write many MapReduce jobs for different concepts.
Chapter 7 briefly describes data science and some big data problems in text
analytics, audio analytics, video analytics, graph processing, etc. Finally, we
have mentioned different job positions and their requirements in the big data
industry.
The appendices include various dataset links and examples to work out. We
have also included a case study on the NYSE dataset for a complete experience
of MapReduce.



CHAPTER 1

Big Data
A journey of a thousand miles begins with a single step.
—Lao Tzu
INTRODUCTION
Big data has dramatically changed the way businesses, management, and
research sectors work. It is considered an emerging fourth scientific paradigm
called “data science.” Let us have a quick review of the emergence of
science over the centuries.
Empirical Science – The proof of concept is based on experience and
verifiable evidence rather than pure theory or logic.
Theoretical Science – The proof of concept is derived theoretically
(Newton’s laws, Kepler’s laws, etc.) rather than by conducting experiments,
since for many complex problems creating evidence is difficult. Deriving
results over thousands of pages by hand was also infeasible.
Computational Science – Deriving equations over thousands of pages for
problems like weather prediction, protein structure evaluation, genome
analysis, solving puzzles and games, or human-computer interaction such
as conversation typically took a huge amount of time. The application of
specialized computer systems to solve such problems is called computational
science. As part of this, a mathematical model is developed and programmed
to be fed into the computer along with the input. This deals with
calculation-intensive tasks (which are not humanly possible to calculate in a
short time).
Data Science – Deals with data-intensive (massive data) computing. Data
science aims to deal with big data analytics comprehensively to discover
unknown, hidden patterns/trends/associations/relationships or any other useful,
understandable, and actionable information (insight/knowledge) that leads
to decision making.




1.1 BIG DATA
New technologies, devices, and social applications exponentially increase the
volume of digital data every year. The digital data created up to 2003 amounted
to 4000 million GB, which would fill an entire football ground if piled up on
disks. The same quantity was created every two days in 2011, and every 10
minutes in 2013. This continues to proliferate. Data is meaningful and useful
only when processed. “Big data refers to a collection of datasets that are huge,
or flow fast enough, or contain diverse types of data, or any combination of
these, such that they outpace our traditional storage (RDBMS), computing, and
algorithmic ability to store, process, analyze, and understand them in a
cost-effective way.” How big is “big data”? In simple terms, any amount of
data that is beyond the storage capacity, computing power, and algorithmic
ability of a machine is called big data. For example:

• A 10 GB high-definition video could be big data for a smartphone but
  not for a high-end desktop.
• Rendering video from 100 GB of 3D graphics data could be big data
  for a laptop/desktop but not for a high-end server.

A decade back, size was the first, and at times the only, dimension
that indicated big data. Therefore, we might tend to conclude as follows:

Big (huge) + data (volume + velocity + variety) → huge data volume +
huge data velocity + huge data variety

However, volume is only one of the factors that choke system capability;
the other factors can individually hold the neck of a computer. Even
though the equation above is true, volume, velocity, and variety need not be
combined to say a dataset is big data. Any one of the factors (volume,
velocity, or variety) is enough to say a field is facing big data problems if
it chokes system capability. From the definition, “big data” emerged not
only from the storage capacity (volume) point of view, but also from the
“processing capability and algorithm ability” of a machine, because hardware
processing capability and algorithm ability determine how much data a
computer can process in a specified amount of time. Therefore, some
definitions focus on what data is, while others focus on what data does.
Some interesting facts on big data
The International Digital Corporation (IDC) is a market research firm that
monitors and measures the data created worldwide. It reports that:
• every year, the data created almost doubles.
• over 16 ZB was created in 2016.
• over 163 ZB will be created by 2020.
• 90% of today’s digital data was created in the last couple of years, of
  which 95% is in semi/unstructured form and merely less than 5% is in
  structured form.


1.1.1 BIG DATA SOURCES
Anything capable of producing digital data contributes to data accumulation.
However, the way data is generated has changed completely over the last
40 years. For example,
before 1980 – devices were generating data.
1980–2000 – employees generated data as end users.
since 2000 – people started contributing data via social applications,
e-mails, etc.
after 2005 – every hardware, software, and application generated log data.
It is hard to find any activity that does not generate data on the Internet.
We are letting somebody else watch us and monitor our activities over the
Internet. Figure 1.1 [1] illustrates what happened every 60 seconds in the
digital world in 2017 across Internet-based companies.

• YouTube users upload 400 hours of new video and watch 700,000
  hours of videos.
• 3.8 million searches are done on Google.
• Over 243,000 images are uploaded, and 70,000 hours of video are
  watched on Facebook.
• 350,000 tweets are generated on Twitter.
• Over 65,000 images are uploaded on Instagram.
• More than 210,000 snaps are sent on Snapchat.
• 120 new users join LinkedIn.
• 156 million e-mails are exchanged.
• 29 million messages, over 175,000 video messages, and 1 million
  images are processed in WhatsApp every day.
• Videos of 87,000 hours are watched on Netflix.
• Over 25,000 posts are shared on Tumblr.
• 500,000 applications are downloaded.

• Over 80 new domains are registered.
• A minimum of 1,000,000 swipes and 18,000 matches are made on
  Tinder.
• Over 5,500 check-ins happen on Foursquare.
• More than 800,000 files are uploaded to Dropbox.
• Over 2,000,000 minutes of calls are made on Skype.


FIGURE 1.1 What is happening in every 60 seconds? (Reprinted from Ref. [1] with
permission.)

Let us discuss how big data impacts major domains such as business,
public administration, and scientific research.
Big data in business (E-commerce)
The volume of business data worldwide doubles every 1.2 years. E-commerce
companies such as Walmart, Flipkart, Amazon, eBay, etc. generate millions
of transactions per day. For instance, across 6000 Walmart stores worldwide,
• 7.5 million RFID entries were made in 2005.
• 30 billion RFID entries were recorded in 2012.
• RFID entries exceeded 100 billion in 2015.
• 300 million transactions happen per day nowadays.

