Hadoop Real-World Solutions
Cookbook Second Edition

Table of Contents
Hadoop Real-World Solutions Cookbook Second Edition
Credits
About the Author
Acknowledgements
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why Subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started with Hadoop 2.X
Introduction
Installing a single-node Hadoop Cluster
Getting ready
How to do it...

How it works...
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
There's more
Installing a multi-node Hadoop cluster
Getting ready
How to do it...

How it works...
Adding new nodes to existing Hadoop clusters
Getting ready
How to do it...
How it works...
Executing the balancer command for uniform data distribution
Getting ready
How to do it...
How it works...
There's more...
Entering and exiting from the safe mode in a Hadoop cluster
How to do it...
How it works...
Decommissioning DataNodes
Getting ready
How to do it...
How it works...
Performing benchmarking on a Hadoop cluster
Getting ready
How to do it...
TestDFSIO

NNBench
MRBench
How it works...
2. Exploring HDFS
Introduction
Loading data from a local machine to HDFS
Getting ready
How to do it...
How it works...
Exporting HDFS data to a local machine
Getting ready
How to do it...
How it works...
Changing the replication factor of an existing file in HDFS
Getting ready

How to do it...
How it works...
Setting the HDFS block size for all the files in a cluster
Getting ready
How to do it...
How it works...
Setting the HDFS block size for a specific file in a cluster
Getting ready
How to do it...
How it works...
Enabling transparent encryption for HDFS
Getting ready
How to do it...

How it works...
Importing data from another Hadoop cluster
Getting ready
How to do it...
How it works...
Recycling deleted data from trash to HDFS
Getting ready
How to do it...
How it works...
Saving compressed data in HDFS
Getting ready
How to do it...
How it works...
3. Mastering Map Reduce Programs
Introduction
Writing the Map Reduce program in Java to analyze web log data
Getting ready
How to do it...
How it works...
Executing the Map Reduce program in a Hadoop cluster
Getting ready
How to do it
How it works...

Adding support for a new writable data type in Hadoop
Getting ready
How to do it...
How it works...
Implementing a user-defined counter in a Map Reduce program

Getting ready
How to do it...
How it works...
Map Reduce program to find the top X
Getting ready
How to do it...
How it works
Map Reduce program to find distinct values
Getting ready
How to do it
How it works...
Map Reduce program to partition data using a custom partitioner
Getting ready
How to do it...
How it works...
Writing Map Reduce results to multiple output files
Getting ready
How to do it...
How it works...
Performing Reduce side Joins using Map Reduce
Getting ready
How to do it
How it works...
Unit testing the Map Reduce code using MRUnit
Getting ready
How to do it...
How it works...
4. Data Analysis Using Hive, Pig, and Hbase
Introduction
Storing and processing Hive data in a sequential file format

Getting ready

How to do it...
How it works...
Storing and processing Hive data in the RC file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the ORC file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the Parquet file format
Getting ready
How to do it...
How it works...
Performing FILTER By queries in Pig
Getting ready
How to do it...
How it works...
Performing Group By queries in Pig
Getting ready
How to do it...
How it works...
Performing Order By queries in Pig
Getting ready
How to do it...
How it works...
Performing JOINS in Pig

Getting ready
How to do it...
How it works
Replicated Joins
Skewed Joins
Merge Joins
Writing a user-defined function in Pig
Getting ready
How to do it...

How it works...
There's more...
Analyzing web log data using Pig
Getting ready
How to do it...
How it works...
Performing the Hbase operation in CLI
Getting ready
How to do it
How it works...
Performing Hbase operations in Java
Getting ready
How to do it
How it works...
Executing the MapReduce programming with an Hbase Table
Getting ready
How to do it
How it works
5. Advanced Data Analysis Using Hive

Introduction
Processing JSON data in Hive using JSON SerDe
Getting ready
How to do it...
How it works...
Processing XML data in Hive using XML SerDe
Getting ready
How to do it...
How it works
Processing Hive data in the Avro format
Getting ready
How to do it...
How it works...
Writing a user-defined function in Hive
Getting ready
How to do it
How it works...

Performing table joins in Hive
Getting ready
How to do it...
Left outer join
Right outer join
Full outer join
Left semi join
How it works...
Executing map side joins in Hive
Getting ready
How to do it...

How it works...
Performing context Ngram in Hive
Getting ready
How to do it...
How it works...
Call Data Record Analytics using Hive
Getting ready
How to do it...
How it works...
Twitter sentiment analysis using Hive
Getting ready
How to do it...
How it works
Implementing Change Data Capture using Hive
Getting ready
How to do it
How it works
Multiple table inserting using Hive
Getting ready
How to do it
How it works
6. Data Import/Export Using Sqoop and Flume
Introduction
Importing data from RDBMS to HDFS using Sqoop
Getting ready

How to do it...
How it works...
Exporting data from HDFS to RDBMS

Getting ready
How to do it...
How it works...
Using query operator in Sqoop import
Getting ready
How to do it...
How it works...
Importing data using Sqoop in compressed format
Getting ready
How to do it...
How it works...
Performing Atomic export using Sqoop
Getting ready
How to do it...
How it works...
Importing data into Hive tables using Sqoop
Getting ready
How to do it...
How it works...
Importing data into HDFS from Mainframes
Getting ready
How to do it...
How it works...
Incremental import using Sqoop
Getting ready
How to do it...
How it works...
Creating and executing Sqoop job
Getting ready
How to do it...

How it works...
Importing data from RDBMS to Hbase using Sqoop
Getting ready

How to do it...
How it works...
Importing Twitter data into HDFS using Flume
Getting ready
How to do it...
How it works
Importing data from Kafka into HDFS using Flume
Getting ready
How to do it...
How it works
Importing web logs data into HDFS using Flume
Getting ready
How to do it...
How it works...
7. Automation of Hadoop Tasks Using Oozie
Introduction
Implementing a Sqoop action job using Oozie
Getting ready
How to do it...
How it works
Implementing a Map Reduce action job using Oozie
Getting ready
How to do it...
How it works...
Implementing a Java action job using Oozie

Getting ready
How to do it
How it works
Implementing a Hive action job using Oozie
Getting ready
How to do it...
How it works...
Implementing a Pig action job using Oozie
Getting ready
How to do it...
How it works

Implementing an e-mail action job using Oozie
Getting ready
How to do it...
How it works...
Executing parallel jobs using Oozie (fork)
Getting ready
How to do it...
How it works...
Scheduling a job in Oozie
Getting ready
How to do it...
How it works...
8. Machine Learning and Predictive Analytics Using Mahout and R
Introduction
Setting up the Mahout development environment
Getting ready
How to do it...

How it works...
Creating an item-based recommendation engine using Mahout
Getting ready
How to do it...
How it works...
Creating a user-based recommendation engine using Mahout
Getting ready
How to do it...
How it works...
Predictive analytics on Bank Data using Mahout
Getting ready
How to do it...
How it works...
Text data clustering using K-Means using Mahout
Getting ready
How to do it...
How it works...
Population Data Analytics using R
Getting ready

How to do it...
How it works...
Twitter Sentiment Analytics using R
Getting ready
How to do it...
How it works...
Performing Predictive Analytics using R
Getting ready
How to do it...

How it works...
9. Integration with Apache Spark
Introduction
Running Spark standalone
Getting ready
How to do it...
How it works...
Running Spark on YARN
Getting ready
How to do it...
How it works...
Performing Olympics Athletes analytics using the Spark Shell
Getting ready
How to do it...
How it works...
Creating Twitter trending topics using Spark Streaming
Getting ready
How to do it...
How it works...
Twitter trending topics using Spark streaming
Getting ready
How to do it...
How it works...
Analyzing Parquet files using Spark
Getting ready
How to do it...
How it works...

Analyzing JSON data using Spark

Getting ready
How to do it...
How it works...
Processing graphs using GraphX
Getting ready
How to do it...
How it works...
Conducting predictive analytics using Spark MLlib
Getting ready
How to do it...
How it works...
10. Hadoop Use Cases
Introduction
Call Data Record analytics
Getting ready
How to do it...
Problem Statement
Solution
How it works...
Web log analytics
Getting ready
How to do it...
Problem statement
Solution
How it works...
Sensitive data masking and encryption using Hadoop
Getting ready
How to do it...
Problem statement
Solution

How it works...
Index

Hadoop Real-World Solutions
Cookbook Second Edition
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained in
this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by
this book.
Packt Publishing has endeavored to provide trademark information about all
of the companies and products mentioned in this book by the appropriate use
of capitals. However, Packt Publishing cannot guarantee the accuracy of this
information.
First published: February 2013
Second edition: March 2016
Production reference: 1220316
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-550-6
www.packtpub.com

Credits
Authors
Tanmay Deshpande
Jonathan R. Owens
Jon Lentz
Brian Femiano
Reviewer
Shashwat Shriparv
Commissioning Editor
Akram Hussain
Acquisition Editor
Manish Nainani
Content Development Editor
Sumeet Sawant
Technical Editor
Gebin George
Copy Editor
Sonia Cheema
Project Coordinator

Shweta H Birwatkar

Proofreader
Safis Editing
Indexer
Tejal Daruwale Soni
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph

About the Author
Tanmay Deshpande is a Hadoop and big data evangelist. He's interested in a
wide range of technologies, such as Apache Spark, Hadoop, Hive, Pig,
NoSQL databases, Mahout, Sqoop, Java, cloud computing, and so on. He has
vast experience in application development in various domains, such as
finance, telecoms, manufacturing, security, and retail. He enjoys solving
machine-learning problems and spends his time reading anything that he can
get his hands on. He has a great interest in open source technologies and
promotes them through his lectures. He has been invited to various computer
science colleges to conduct brainstorming sessions with students on the latest
technologies. Through his innovative thinking and dynamic leadership, he
has successfully completed various projects. Tanmay is currently working
with Schlumberger as the lead developer of big data. Before Schlumberger,
Tanmay worked with Lumiata, Symantec, and Infosys.
He currently blogs at .

Acknowledgements
This is my fourth book, and I can't thank the Almighty enough, without
whom this wouldn't have been possible. I would like to take this opportunity to
thank my wife, Sneha, my parents, Avinash and Manisha Deshpande, and my
brother, Sakalya Deshpande, for being with me through thick and thin.
Without you, I am nothing!
I would like to take this opportunity to thank my colleagues, friends, and
family for appreciating my work and making it a grand success so far. I'm
truly blessed to have each one of you in my life.
I am thankful to the authors of the first edition of this book, Jonathan R.
Owens, Brian Femiano, and Jon Lentz, for setting the stage for me, and I hope
this effort lives up to the expectations you had set in the first edition. I am
also thankful to each person in Packt Publishing who has worked to make this
book happen! You guys are family to me!
Above all, I am thankful to my readers for their love, appreciation, and
criticism, and I assure you that I have tried to give you my best. Hope you
enjoy this book! Happy learning!

About the Reviewer
Shashwat Shriparv has 6+ years of experience in the IT industry, and 4+ years in big data
technologies. He holds a master's degree in computer applications. He has
experience in technologies such as Hadoop, HBase, Hive, Pig, Flume, Sqoop,
Mongo, Cassandra, Java, C#, Linux, Scripting, PHP, C++, C, Web
technologies, and various real-life use cases in big data technologies as a
developer and administrator.
He has worked with companies such as CDAC, Genilok, HCL, and
UIDAI (Aadhaar); he is currently working with CenturyLink Cognilytics. He
is the author of Learning HBase, Packt Publishing, and a reviewer of the Pig Design
Patterns book, Packt Publishing.
I want to acknowledge everyone I know.

www.PacktPub.com

eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.PacktPub.com and as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical
articles, sign up for a range of free newsletters and receive exclusive
discounts and offers on Packt books and eBooks.

Do you need instant solutions to your IT questions? PacktLib is Packt's
online digital book library. Here, you can search, access, and read Packt's
entire library of books.

Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
