BIG DATA & HADOOP
Learn by Example
by
Mayank Bhushan
FIRST EDITION 2020
Copyright © BPB Publications, INDIA
ISBN: 978-93-8655-199-3
All Rights Reserved. No part of this publication can be stored in
a retrieval system or reproduced in any form or by any means
without the prior written permission of the publishers.
LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The Author and Publisher of this book have tried their best to
ensure that the programmes, procedures and functions described
in the book are correct. However, the author and the publishers
make no warranty of any kind, expressed or implied, with regard
to these programmes or the documentation contained in the book.
The author and publisher shall not be liable in any event of any
damages, incidental or consequential, in connection with, or
arising out of the furnishing, performance or use of these
programmes, procedures and functions. Product names mentioned
are used for identification purposes only and may be trademarks
of their respective companies.
All trademarks referred to in the book are acknowledged as
properties of their respective owners.
Distributors:
BPB PUBLICATIONS
20, Ansari Road, Darya Ganj
New Delhi-110002
Ph: 23254990/23254991
BPB BOOK CENTRE
376 Old Lajpat Rai Market,
Delhi-110006
Ph: 23861747
MICRO MEDIA
Shop No. 5, Mahendra Chambers,
150 DN Rd. Next to Capital Cinema,
V.T. (C.S.T.) Station, MUMBAI-400 001
Ph: 22078296/22078297
DECCAN AGENCIES
4-3-329, Bank Street,
Hyderabad-500195
Ph: 24756967/24756400
Published by Manish Jain for BPB Publications, 20, Ansari Road,
Darya Ganj, New Delhi-110002 and Printed by Repro India Pvt Ltd,
Mumbai
Dedicated To
My beloved Family
Mrs. Neelam Sharma/Mr. Gopal Krishna Sharma
Mrs. Aarti/Mr. Shashank
Mrs. Apoorva
Most loving-Anjika
Preface
I am confident that the present work will come as a relief to
students looking for a comprehensive text that explains difficult
concepts in layman's language, offers a variety of practical
approaches and conceptual problems along with systematically
worked-out solutions, and covers the syllabus prescribed at various
levels in universities.
This book promises to be a very good starting point for beginners
and an asset to advanced users too.
This book is written as per the syllabi and learning patterns of
various universities, and its aim is to follow a “learning with
example” approach. Difficult concepts of Big Data and Hadoop are
presented in an easy, practical way so that students are able to
understand them efficiently. The book also provides screenshots of
the practical procedures, which students should find helpful.
It is said, “To err is human, to forgive divine.” In this light, I
hope that the shortcomings of the book will be forgiven. At the
same time, I am open to constructive criticism and suggestions
for further improvement. All intelligent suggestions are welcome,
and I will try my best to incorporate such valuable suggestions
in subsequent editions of this book.
23rd March 2018
Mayank Bhushan
Acknowledgement
I would like to express my gratitude to all those who provided
support, talked things over, read, wrote, offered comments, allowed
me to quote their remarks and assisted in the editing,
proofreading and design.
I have relied on many people to guide me, directly and indirectly,
in writing this book. I am very thankful to the Hadoop community,
from which I have learned through continuous effort, and I also
owe a debt of gratitude to ABES Engineering College for providing
me all the facilities of the Big Data-Hadoop lab.
There is always a sense of gratitude that one expresses to others
for the helpful services they render during difficult phases of life
and in achieving the goals one has set. It is impossible to thank
everyone individually, but I make a humble effort here to thank
some of them. At the outset, I am thankful to the Almighty, who
constantly and invisibly guides everybody and has helped me stay
on the right path.
I am very much thankful to Prof. (Dr.) Shailesh Tiwari, H.O.D.
(CSE), ABES Engineering College, Ghaziabad (U.P.), for guiding and
supporting me. He is the main source of inspiration for me. I
would also like to thank Dr. Munesh Chandra Trivedi, Dean
(REC-Azamgarh), Dr. Pratibha Singh (Prof., ABES Engineering
College), and Dr. Shaswati Banerjea, Asst. Prof. (MNNIT Allahabad),
who have always supported me. Without their help, this book
would not have been possible. I am also indebted to my dearest
friend and colleague Mr. Omesh Kumar, who guided me technically
through every problem.
I extend my thanks to all my gurus, friends, and colleagues who
helped and kept me motivated while writing this text. Special
thanks to:
Dr. K.K. Mishra, MNNIT Allahabad
Dr. Mayank Pandey, MNNIT Allahabad
Dr. Shashank Srivastava, MNNIT Allahabad
Mr. Nitin Shukla, MNNIT Allahabad
Mr. Suraj Deb Barma, Govt. Polytechnic College, Agartala
Dr. A.L.N. Rao, GL Bajaj, Greater Noida
Mr. Ankit Yadav, Mr. Desh Deepak Pathak, ABES EC Ghaziabad
Dr. Sumit Yadav, IP University.
Mr. Aatif Jamshed, Galgotia College, Greater Noida
I also thank the publisher and the whole staff at BPB
Publications, especially Mr. Manish Jain, for bringing this text into
a nice presentable form. Finally, I want to thank everyone who has
directly or indirectly contributed to completing this work.
Mayank Bhushan
Table of Contents
Chapter 1: Big Data-Introduction and Demand
1.1 Big Data
1.1.1 Characteristics of Big Data
1.1.2 Why Big Data
1.2 Hadoop
1.2.1 History of Hadoop
1.2.2 Name of Hadoop
1.2.3 Hadoop Ecosystem
1.3 Convergence of Key Trends
1.3.1 Convergence of Big Data into Business
1.3.2 Big data Vs other techniques
1.4 Unstructured Data
1.5 Industry examples of Big data
1.5.1 Use of Big data-Hadoop at Yahoo
1.5.2 In RackSpace for log processing
1.5.3 Hadoop at Facebook
1.6 Usages of Big Data
1.6.1 Web analytics
1.6.2 Big Data and marketing
1.6.3 Big data and fraud
1.6.4 Risk management in Big Data with Credit card
1.6.5 Big data and algorithm trading
1.6.6 Big data in Healthcare
Chapter 2: NoSQL Data Management
2.1 Introduction to NoSQL database
2.1.1 Terminology used in NoSQL and RDBMS
2.1.2 Database use in NoSQL
2.2 SQL Vs NoSQL
2.2.1 Denormalization
2.2.2 Data distribution
2.2.3 Data Durability
2.3 Consistency in NoSQL
2.3.1 ACID Vs BASE
2.3.2 Relaxing Consistency
2.4 Hbase
2.4.1 Installation
2.4.2 History
2.4.3 Hbase Data Structure
2.4.4 Physical Storage
2.4.5 Components
2.4.6 Hbase Shell Commands
2.4.7 The different usages of scan command
2.4.8 Terminologies
2.4.8.1 Version Stamp
2.4.8.2 Region
2.4.8.3 Locking
2.5 MapReduce
2.5.1 MapReduce Architecture
2.5.2 MapReduce datatype
2.5.3 File input format
2.5.4 Java MapReduce
2.6 Partitioner and Combiner
2.6.1 Example in MapReduce
2.6.2 Situation for Partitioner and Combiner
2.6.3 Use of combiner
2.7 Composing MapReduce Calculations
Chapter 3: Basics of Hadoop
3.1 Data Format
3.2 Analysing data with Hadoop
3.3 Scale-in Vs Scale-out
3.3.1 Number of reducers used
3.3.2 Driver class with no reducer
3.4 Hadoop Streaming
3.4.1 Streaming in Ruby
3.4.2 Streaming in Python
3.4.3 Streaming in Java
3.5 Hadoop pipes
3.6 Design of HDFS
3.6.1 Very large
3.6.2 Streaming data access
3.6.3 Commodity Hardware
3.6.4 Low-latency data access
3.6.5 Lots of small files
3.6.6 Arbitrary file modifications
3.7 HDFS Concept
3.7.1 Blocks
3.7.2 Namenodes and Datanodes
3.7.3 HDFS group
3.7.4 All time availability
3.8 Hadoop File Systems
3.9 Java Interface
3.9.1 HTTP
3.9.2 C
3.9.3 FUSE (File System in Userspace)
3.9.4 Reading data using Java interface (URL)
3.9.5 Reading data using Java interface (File System API)
3.10 Data Flow
3.10.1 File Read
3.10.2 File write
3.10.3 Coherency Model
3.10.4 Cluster Balance
3.10.5 Hadoop Archive
3.11 Hadoop I/O
3.11.1 Data Integrity
3.11.2 Local File System
3.12 Compression
3.12.1 Codecs
3.12.2 Compression and Input Splits
3.12.3 Map output
3.13 Serialization
3.14 Avro file based data structure
3.14.1 Data type and schemas
3.14.2 Serialization and deserialization
3.14.3 Avro MapReduce
Chapter 4: Hadoop Installation (Step by Step)
4.1 Introduction
4.1.1 On VMware
4.1.2 Oracle Virtual Box
4.2 On Ubuntu 16.04
4.3 Fully Distributed Mode
Chapter 5: MapReduce Applications
5.1 Understanding of MapReduce
5.2 Traditional Way
5.3 MapReduce Workflow
5.3.1 Map Side
5.3.2 Reduce Side
5.4 Unit Test with MRUnit
5.4.1 Testing Mapper Class
5.4.2 Testing Reducer Class
5.4.3 Testing Driver Class of Program
5.4.4 Test output of program
5.5 Test Data and Local Data Check
5.5.1 Debugging MapReduce Job
5.5.2 Job Control
5.6 Anatomy of MapReduce Job
5.6.1 Anatomy of File Write
5.6.2 Anatomy of File Read
5.6.3 Replica Management
5.7 MapReduce Job Run
5.7.1 Classic MapReduce (MapReduce 1)
5.7.2 MapReduce2 (YARN)
5.7.3 Failure in MapReduce1
5.7.4 Failure in YARN
5.8 Job Scheduling
5.9 Shuffle and Sort
5.9.1 Map Side
5.9.2 Reduce Side
5.10 Task Execution
5.10.1 Task JVM
5.10.2 Skipping Bad Records
5.11 MapReduce Types
5.11.1 Input type
5.11.2 Output type
Chapter 6: Hadoop Related Tools-I (Hbase & Cassandra)
6.1 Installation of Hbase
6.2 Conceptual Architecture
6.2.1 Regions
6.2.2 Locking
6.3 Implementation
6.4 HBase Vs RDBMS
6.5 HBase Client
6.6 HBase Examples and Commands
6.6.1 Inserting data by using HBase shell
6.6.2 Updating data by using HBase shell
6.6.3 Reading data by using HBase shell
6.6.4 Reading a given column
6.6.5 Delete specific cell in table
6.6.6 Delete all cells in a table
6.6.7 Scanning using HBase shell
6.6.8 Count
6.6.9 Truncate
6.6.10 Disable table
6.6.11 Verification
6.6.12 Alter table
6.6.13 Scope operator for alter table
6.6.14 Deleting column family
6.6.15 Existence of table
6.6.16 Dropping a table
6.6.17 Drop all tables
6.7 HBase using Java APIs
6.7.1 Creating table
6.7.2 List of the tables in HBase
6.7.3 Disable a table
6.7.4 Add column family
6.7.5 Deleting column family
6.7.6 Verifying existence of table
6.7.7 Deleting table
6.7.8 Stopping HBase
6.8 Praxis
6.8.1 Versions
6.8.2 HDFS
6.8.3 Schema design
6.9 Cassandra
6.9.1 CAP Theorem
6.9.2 Characteristics of Cassandra
6.9.3 Installing Cassandra
6.9.4 Basic CLI commands
6.10 Cassandra Data Model
6.10.1 Super Column family
6.10.2 Clusters
6.10.3 Keyspaces
6.10.4 Column families
6.10.5 Super columns
6.11 Cassandra Examples
6.11.1 Creating a keyspace
6.11.2 Alter Keyspace
6.11.3 Dropping a keyspace
6.11.4 Create table
6.11.5 Primary key
6.11.6 Alter table
6.11.7 Truncate table
6.11.8 Executing batch
6.11.9 Delete entire row
6.11.10 Describe
6.12 Cassandra Client
6.12.1 Thrift
6.12.2 Avro
6.12.3 Hector
6.12.4 Chirper
6.12.5 Pelops
6.13 Hadoop Integration
6.14 Use Cases
6.14.1 eBay
6.14.2 HULU
Chapter 7: Hadoop Related Tools-II (PigLatin & HiveQL)
7.1 PigLatin
7.2 Installation
7.3 Execution Type
7.3.1 Local mode
7.3.2 MapReduce mode
7.4 Platform for Running Pig Programs
7.4.1 Script
7.4.2 Grunt
7.4.3 Embedded
7.5 Grunt
7.5.1 Example
7.5.2 Commands in grunt
7.6 Pig Data Model
7.6.1 Scalar
7.6.2 Complex
7.7 PigLatin
7.7.1 Input & Output
7.7.2 Store
7.7.3 Relational Operations
7.7.4 User Defined Functions
7.8 Developing and Testing PigLatin Script
7.8.1 Dump operator
7.8.2 Describe operator
7.8.3 Explanation operator
7.8.4 Illustration operator
7.9 Hive
7.9.1 Installation of Hive
7.9.2 Hive Architecture
7.9.3 Hive Services
7.10 Data Type and File Format
7.11 Comparison of HiveQL with Traditional Database
7.12 HiveQL
7.12.1 Data Definition Language
7.12.2 Data Manipulation Language
7.12.3 Example for practice
Chapter 8: Practical & Research based Topics
8.1 Data Analysis with Twitter
8.1.1 Using flume
8.1.2 Data Extraction using Java
8.1.3 Data Extraction using Python
8.2 Use of Bloom Filter in MapReduce
8.2.1 Function of Bloom filter
8.2.2 Working of bloom filter
8.2.3 Application of Bloom filter
8.2.4 Implementation of Bloom filter in MapReduce
8.3 Amazon Web Service
8.3.1 AWS
8.3.2 Setting AWS
8.3.3 Setting up Hadoop on EC2
8.4 Document Archived from NY Times
8.5 Data Mining in Mobiles
8.6 Hadoop Diagnosis
8.6.1 System's health
8.6.2 Setting permission
8.6.3 Managing quotas
8.6.4 Enabling trash
8.6.5 Removing datanode
Appendix: Hadoop Commands
Chapter wise Questions
Previous Year Question Paper
CHAPTER 1
Big Data-Introduction and Demand
“…Data is useless without the skill to analyse it.”
-Jeanne Harris, senior executive at Accenture Institute for High
Performance
“Taking a hunch you have about the world and pursuing it in a
structured, mathematical way to understand something new about
the world.”
-Hilary Mason, American data scientist and founder of the
technology start-up Fast Forward Labs
1.1 Big Data
In today's scenario, we are all surrounded by a bulk of data. We
humans are ourselves an example of big data sources, as we are
surrounded by devices and are generating data every minute.
“I spend most of my time assuming the world is not ready for the
technology revolution that will be happening to them soon,”
-Eric Schmidt, Executive Chairman, Google
As a matter of fact, if we compare the present situation with the
past, we find that we now create as much information in just two
days as we did from the dawn of civilization up to 2003. That
means we are creating about five exabytes of data every two days.
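To get a feel for that scale, here is a rough back-of-the-envelope
calculation (assuming the decimal convention of 1 EB = 10^18
bytes; the five-exabyte figure itself is only an approximation):
5 EB = 5 × 10^18 bytes
2 days = 2 × 24 × 3,600 s = 172,800 s
(5 × 10^18 bytes) / 172,800 s ≈ 2.9 × 10^13 bytes/s ≈ 29 TB per second
In other words, the world is producing new data at a rate of
roughly tens of terabytes every second.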
The real problem is the user-generated data that users produce
continuously. At the time of data analysis, we face challenges in
storing and analysing all this data.
“The real issue is user-generated content,”
-Schmidt
Much of this helps Google analyse the data and sell data analytics
to companies that require it. We produce data simply through our
mobiles, as we are already logged in from the moment we buy the
device:
Maps: they collect data about our travels.
Apps: they gather information about our mood swings and record
the activities in which we are involved most of the time.
E-commerce sites: they collect information about our requirements
and show us whatever we are likely to buy.
Emails: they produce data about our requirements depending on
the conversation, as all conversations are generally filtered through
the companies that own the mailing services.
During the past few decades, technologies like remote sensing,
geographical information systems, and global positioning systems
have remodelled the way the distribution of the human population
across the world is mapped. In this scenario, we need to map
population data to meaningful surveys, which is performed by big
companies. Even so, spatially precise changes across scales of
days, weeks, or months, or maybe year to year, are difficult to
assess, and this limits the application of human population maps
in situations in which timely data is needed, such as disasters,
conflicts, or epidemics. With information being collected on a daily
basis by mobile network providers across the planet, the prospect
of being able to map up-to-date and ever-changing human
population distributions over comparatively short intervals exists,
paving the way for brand-new applications and a near real-time
understanding of the patterns and processes in human geography.
Some of the facts related to exponential data production are:
Currently, over 2 billion people worldwide are connected to the
Internet, and over 5 billion individuals own mobile phones. By
2020, 50 billion devices are expected to be connected to the
Internet, and at that point data production is predicted to be 44
times greater than it was in 2009.
In 2012, 2.5 quintillion bytes of data were generated daily, and
90% of the data existing worldwide at that time had been created
in the previous two years.
Facebook alone stores, accesses, and analyses 30+ PB of
user-generated data.
In 2008, Google was processing 20,000 TB of data daily.
Walmart processes over 1 million customer transactions every
hour, generating an estimated 2.5 PB of data.
More than 5 billion people worldwide call, text, tweet, and browse
on mobile devices.
The number of e-mail accounts created worldwide is expected to
increase from 3.3 billion in 2012 to over 4.3 billion by late 2016,
an average annual growth rate of 6% over those four years. In
2012, a total of 89 billion e-mails were sent and received daily,
and this