HDInsight Essentials
Second Edition

Learn how to build and deploy a modern big data
architecture to empower your business

Rajesh Nadipalli

BIRMINGHAM - MUMBAI


HDInsight Essentials
Second Edition
Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2013
Second edition: January 2015

Production reference: 1200115

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-942-9
www.packtpub.com


Credits

Author
Rajesh Nadipalli

Reviewers
Simon Elliston Ball
Anindita Basak
Rami Vemula

Commissioning Editor
Taron Pereira

Acquisition Editor
Owen Roberts

Content Development Editor
Rohit Kumar Singh

Technical Editors
Madhuri Das
Taabish Khan

Copy Editor
Rashmi Sawant

Project Coordinator
Mary Alex

Proofreaders
Ting Baker
Ameesha Green

Indexer
Rekha Nair

Production Coordinator
Melwyn D'sa

Cover Work
Melwyn D'sa

About the Author
Rajesh Nadipalli currently manages software architecture and delivery of Zaloni's
Bedrock Data Management Platform, which enables customers to quickly and easily
realize true Hadoop-based Enterprise Data Lakes. Rajesh is also an instructor and a
content provider for Hadoop training, including Hadoop development, Hive, Pig,
and HBase. In his previous role as a senior solutions architect, he evaluated big data
goals for his clients, recommended a target state architecture, and conducted proof
of concepts and production implementation. His clients include Verizon, American
Express, NetApp, Cisco, EMC, and UnitedHealth Group.

Prior to Zaloni, Rajesh worked for Cisco Systems for 12 years and held a technical
leadership position. His key focus areas have been data management, enterprise
architecture, business intelligence, data warehousing, and Extract Transform Load
(ETL). He has demonstrated success by delivering scalable data management and BI
solutions that empower businesses to make informed decisions.
Rajesh authored the first version of the book HDInsight Essentials, Packt Publishing,
released in September 2013, the first book in print for HDInsight, providing data
architects, developers, and managers with an introduction to the new Hadoop
distribution from Microsoft.
He has over 18 years of IT experience. He holds an MBA from North Carolina State
University and a BSc degree in Electronics and Electrical from the University of
Mumbai, India.
I would like to thank my family for their unconditional love,
support, and patience during the entire process.
To my friends and coworkers at Zaloni, thank you for inspiring
and encouraging me.
And finally a shout-out to all the folks at Packt Publishing for being
really professional.


About the Reviewers
Simon Elliston Ball is a solutions engineer at Hortonworks, where he helps a
wide range of companies get the best out of Hadoop. Before that, he was the head of
big data at Red Gate, creating tools to make HDInsight and Hadoop easier to work
with. He has also spoken extensively on big data and NoSQL at conferences around
the world.

Anindita Basak works as a big data cloud consultant and a big data Hadoop
trainer and is highly enthusiastic about Microsoft Azure and HDInsight along with
the Hadoop open source ecosystem. She works as a specialist for Fortune 500 brands,
including cloud and big data based companies in the US. She has been playing with
Hadoop on Azure since its incubation phase.
Previously, she worked as a module lead for the Alten group and as a senior system
analyst at Sonata Software Limited, India, in the Azure Professional Direct Delivery
group of Microsoft. She worked as a senior software engineer on implementation
and migration of various enterprise applications on the Azure cloud in healthcare,
retail, and financial domains. She started her journey with Microsoft Azure in the
Microsoft Cloud Integration Engineering (CIE) team and worked as a support
engineer in Microsoft India (R&D) Pvt. Ltd.
With more than 6 years of experience in the Microsoft .NET technology stack,
she is solely focused on big data cloud and data science. As a Most Valued Blogger,
she loves to share her technical experience and expertise through her blog.
You can find more about her on her LinkedIn page and you can follow her
at @imcuteani on Twitter.
She recently worked as a technical reviewer for the books HDInsight Essentials
and Microsoft Tabular Modeling Cookbook, both by Packt Publishing. She is currently
working on Hadoop Essentials, also by Packt Publishing.
I would like to thank my mom and dad, Anjana and Ajit Basak, and
my affectionate brother, Aditya. Without their support, I could not
have reached my goal.


Rami Vemula is a technology consultant who loves to provide scalable software
solutions for complex business problems through modern day web technologies and
cloud infrastructure. His primary focus is on Microsoft technologies, which include
ASP.Net MVC/WebAPI, jQuery, C#, SQL Server, and Azure. He currently works for
a reputed multinational consulting firm as a consultant, where he leads and supports
a team of talented developers. As a part of his work, he architects, develops, and
maintains technical solutions to various clients with Microsoft technologies. He is
also a Microsoft Certified ASP.Net and Azure Developer.
He has been a Microsoft MVP since 2011 and an active trainer. He conducts online
training on Microsoft web stack technologies. In his free time, he enjoys exploring
different technical questions on StackOverflow and then contributes prospective
solutions through custom written code snippets. He loves to share his technical
experience and expertise through his blog.
He holds a Master's Degree in Electrical Engineering from California State
University, Long Beach, USA. He is married and lives with his wife and
parents in Hyderabad, India.
I would like to thank my parents, Ramanaiah and RajaKumari;
my wife, Sneha; and the rest of my family and friends for their
patience and support throughout my life and helping me achieve
all the wonderful milestones and accomplishments. Their consistent
encouragement and guidance gave me the strength to overcome all
the hurdles and kept me moving forward.


www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us
for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt books
and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.


Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.

Instant updates on new Packt books
Get notified! Find out when new books are published by following @PacktEnterprise
on Twitter or the Packt Enterprise Facebook page.

Table of Contents

Preface
Chapter 1: Hadoop and HDInsight in a Heartbeat
    Data is everywhere
    Business value of big data
    Hadoop concepts
    Brief history of Hadoop
    Core components
    Hadoop cluster layout
    HDFS overview
    Writing a file to HDFS
    Reading a file from HDFS
    HDFS basic commands
    YARN overview
    YARN application life cycle
    YARN workloads
    Hadoop distributions
    HDInsight overview
    HDInsight and Hadoop relationship
    Hadoop on Windows deployment options
    Microsoft Azure HDInsight Service
    HDInsight Emulator
    Hortonworks Data Platform (HDP) for Windows
    Summary
Chapter 2: Enterprise Data Lake using HDInsight
    Enterprise Data Warehouse architecture
    Source systems
    Data warehouse
    Storage
    Processing
    User access
    Provisioning and monitoring
    Data governance and security
    Pain points of EDW
    The next generation Hadoop-based Enterprise data architecture
    Source systems
    Data Lake
    Storage
    Processing
    User access
    Provisioning and monitoring
    Data governance, security, and metadata
    Journey to your Data Lake dream
    Ingestion and organization
    Transformation (rules driven)
    Access, analyze, and report
    Tools and technology for Hadoop ecosystem
    Use case powered by Microsoft HDInsight
    Problem statement
    Solution
    Source systems
    Storage
    Processing
    User access
    Benefits
    Summary
Chapter 3: HDInsight Service on Azure
    Registering for an Azure account
    Azure storage
    Provisioning an HDInsight cluster
    Cluster topology
    Provisioning using Azure PowerShell
    Creating a storage container
    Provisioning a new HDInsight cluster
    HDInsight management dashboard
    Dashboard
    Monitor
    Configuration
    Exploring clusters using the remote desktop
    Running a sample MapReduce
    Deleting the cluster
    HDInsight Emulator for the development
    Installing HDInsight Emulator
    Installation verification
    Using HDInsight Emulator
    Summary
Chapter 4: Administering Your HDInsight Cluster
    Monitoring cluster health
    Name Node status
    The Name Node Overview page
    Datanode Status
    Utilities and logs
    Hadoop Service Availability
    YARN Application Status
    Azure storage management
    Configuring your storage account
    Monitoring your storage account
    Managing access keys
    Deleting your storage account
    Azure PowerShell
    Access Azure Blob storage using Azure PowerShell
    Summary
Chapter 5: Ingest and Organize Data Lake
    End-to-end Data Lake solution
    Ingesting to Data Lake using HDFS command
    Connecting to a Hadoop client
    Getting your files on the local storage
    Transferring to HDFS
    Loading data to Azure Blob storage using Azure PowerShell
    Loading files to Data Lake using GUI tools
    Storage access keys
    Storage tools
    CloudXplorer
    Key benefits
    Registering your storage account
    Uploading files to your Blob storage
    Using Sqoop to move data from RDBMS to Data Lake
    Key benefits
    Two modes of using Sqoop
    Using Sqoop to import data (SQL to Hadoop)
    Organizing your Data Lake in HDFS
    Managing file metadata using HCatalog
    Key benefits
    Using HCatalog Command Line to create tables
    Summary
Chapter 6: Transform Data in the Data Lake
    Transformation overview
    Tools for transforming data in Data Lake
    HCatalog
    Persisting HCatalog metastore in a SQL database
    Apache Hive
    Hive architecture
    Starting Hive in HDInsight
    Basic Hive commands
    Apache Pig
    Pig architecture
    Starting Pig in HDInsight node
    Basic Pig commands
    Pig or Hive
    MapReduce
    The mapper code
    The reducer code
    The driver code
    Executing MapReduce on HDInsight
    Azure PowerShell for execution of Hadoop jobs
    Transformation for the OTP project
    Cleaning data using Pig
    Executing Pig script
    Registering a refined and aggregate table using Hive
    Executing Hive script
    Reviewing results
    Other tools used for transformation
    Oozie
    Spark
    Summary
Chapter 7: Analyze and Report from Data Lake
    Data access overview
    Analysis using Excel and Microsoft Hive ODBC driver
    Prerequisites
    Step 1 – installing the Microsoft Hive ODBC driver
    Step 2 – creating Hive ODBC Data Source
    Step 3 – importing data to Excel
    Analysis using Excel Power Query
    Prerequisites
    Step 1 – installing the Microsoft Power Query for Excel
    Step 2 – importing Azure Blob storage data into Excel
    Step 3 – analyzing data using Excel
    Other BI features in Excel
    PowerPivot
    Power View and Power Map
    Step 1 – importing Azure Blob storage data into Excel
    Step 2 – launch map view
    Step 3 – configure the map
    Power BI Catalog
    Ad hoc analysis using Hive
    Other alternatives for analysis
    RHadoop
    Apache Giraph
    Apache Mahout
    Azure Machine Learning
    Summary
Chapter 8: HDInsight 3.1 New Features
    HBase
    HBase positioning in Data Lake and use cases
    Provisioning HDInsight HBase cluster
    Creating a sample HBase schema
    Designing the airline on-time performance table
    Connecting to HBase using the HBase shell
    Creating an HBase table
    Loading data to the HBase table
    Querying data from the HBase table
    HBase additional information
    Storm
    Storm positioning in Data Lake
    Storm key concepts
    Provisioning HDInsight Storm cluster
    Running a sample Storm topology
    Connecting to Storm using Storm shell
    Running the Storm Wordcount topology
    Monitoring status of the Wordcount topology
    Additional information on Storm
    Apache Tez
    Summary
Chapter 9: Strategy for a Successful Data Lake Implementation
    Challenges on building a production Data Lake
    The success path for a production Data Lake
    Identifying the big data problem
    Proof of technology for Data Lake
    Form a Data Lake Center of Excellence
    Executive sponsors
    Data Lake consumers
    Development
    Operations and infrastructure
    Architectural considerations
    Extensible and modular
    Metadata-driven solution
    Integration strategy
    Security
    Online resources
    Summary
Index


Preface
We live in a connected digital era and we are witnessing unprecedented growth
of data. Organizations that are able to analyze big data are demonstrating significant
return on investment by detecting fraud, improving operations, and reducing analysis
time with a scale-out architecture such as Hadoop. Azure HDInsight is an
enterprise-ready distribution of Hadoop hosted in the cloud and provides
advanced integration with Excel and .NET without the need to buy or maintain
physical hardware.
This book is your guide to building a modern data architecture using HDInsight
to enable your organization to gain insights from various sources, including
smart-connected devices, databases, and social media. This book will take you
through a journey of building the next generation Enterprise Data Lake that
consists of ingestion, transformation, and analysis of big data with a specific
use case that can apply to almost any organization.
This book has working code that developers can leverage and extend in order
to fit their use cases with additional references for self-learning.

What this book covers

Chapter 1, Hadoop and HDInsight in a Heartbeat, covers the business value and the
reason behind the big data hype. It provides a primer on Apache Hadoop, core
concepts with HDFS, YARN, and the Hadoop 2.x ecosystem. Next, it discusses
the Microsoft HDInsight platform, its key benefits, and deployment options.
Chapter 2, Enterprise Data Lake using HDInsight, covers the pain points of the current
Enterprise Data Warehouse and provides a path for an enterprise Data Lake based
on the Hadoop platform. Additionally, it explains a use case built on the Azure
HDInsight service.



Chapter 3, HDInsight Service on Azure, walks you through the steps for provisioning
Azure HDInsight. Next, it explains how to explore, monitor, and delete the cluster
using the Azure management portal. It then provides tools for developers to verify
the cluster using a sample program and to develop locally using HDInsight Emulator.
Chapter 4, Administering Your HDInsight Cluster, covers steps to administer the
HDInsight cluster using remote desktop connection to the head node of the cluster.
It includes management of Azure Blob storage and introduces you to the Azure
scripting environment known as Azure PowerShell.
Chapter 5, Ingest and Organize Data Lake, introduces you to an end-to-end Data Lake
solution with a near real-life-sized project and then focuses on various options to
ingest data into an HDInsight cluster, including HDFS commands, Azure PowerShell,
CloudXplorer, and Sqoop. Next, it provides details on how to organize data using
Apache HCatalog. This chapter uses a sample airline project to explain
the various concepts.
Chapter 6, Transform Data in the Data Lake, provides you with various options to
transform data, including MapReduce, Hive, and Pig. Additionally, it discusses
Oozie and Spark, which are also commonly used for transformation. Throughout
the chapter, you will be guided with a detailed code for the sample airline project.
Chapter 7, Analyze and Report from Data Lake, provides you with details on how to
access and analyze data from the sample airline project using the Excel Hive ODBC
driver, Excel Power Query, PowerPivot, and Power Map. Additionally, it discusses
RHadoop, Giraph, and Mahout as alternatives to analyze data in the cluster.
Chapter 8, HDInsight 3.1 New Features, provides you with new features that are
added to the evolving HDInsight platform with sample use cases for HBase, Tez,
and Storm.
Chapter 9, Strategy for a Successful Data Lake Implementation, covers the key challenges

for building a production Data Lake and provides guidance on the success path for
a sustainable Data Lake. This chapter provides recommendations on architecture,
organization, and links to online resources.

What you need for this book
For this book, the following are the prerequisites:

• To build an HDInsight cluster using the Azure cloud service, you will need
an Azure account and a laptop with Windows Remote Desktop software to
connect to the cluster (a minimal provisioning sketch follows this list)

• For Excel-based exercises, you will need Office 2013/Excel 2013/Office 365
ProPlus/Office 2010 Professional Plus
• For HDInsight Emulator, which is suited for local development, you will
need a Windows laptop with one of these operating systems: Windows
7 Service Pack 1/Windows Server 2008 R2 Service Pack 1/Windows 8/
Windows Server 2012.
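
To set expectations of what lies ahead, the following is a minimal Azure PowerShell sketch of provisioning a cluster once these prerequisites are in place. Chapter 3 walks through this step by step; the snippet assumes the classic Azure PowerShell module that was current for HDInsight 3.x, and the publish-settings path, storage account, container, and cluster name are placeholders rather than values you must use:

# Authenticate with the publish-settings file downloaded from the Azure portal
Import-AzurePublishSettingsFile "C:\azure\my-subscription.publishsettings"

# Storage account and Blob container that will back the cluster (assumed to already exist)
$storageAccount = "mystorageaccount"
$container      = "hdindcontainer"
$storageKey     = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary

# Provision a four-node HDInsight cluster; you are prompted for the admin credentials
New-AzureHDInsightCluster -Name "hdind" `
    -Location "East US" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName $container `
    -ClusterSizeInNodes 4 `
    -Credential (Get-Credential)

When the cmdlet completes, the cluster is reachable at hdind.azurehdinsight.net, and it can be removed with Remove-AzureHDInsightCluster when you no longer need it so that you are not billed for idle nodes.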

Who this book is for

This book is designed for data architects, developers, managers, and business
users who want to modernize their data architectures leveraging the HDInsight
distribution of Hadoop. It guides you through the business values of big data, the
main points of current EDW (Enterprise Data Warehouse), steps for building the
next generation Data Lake, and development tools with real life examples.

The book explains the journey to a Data Lake with a modular approach for ingesting,
transforming, and reporting on a Data Lake leveraging HDInsight platform and
Excel for powerful analysis and reporting.

Conventions

In this book, you will find a number of text styles that distinguish between different
kinds of information. Here are some examples of these styles and an explanation of
their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"I have selected hdind and the complete URL is hdind.azurehdinsight.net."
Any command-line input or output is written as follows:
# Import PublishSettingsFile that was saved from last step
Import-AzurePublishSettingsFile "C:\Users\Administrator\Downloads\Pay-As-You-Go-Free Trial-11-21-2014-credentials.publishsettings"

New terms and important words are shown in bold. Words that you see on the
screen, for example, in menus or dialog boxes, appear in the text like this: "You can
select the desired configuration Two Head Nodes on an Extra Large (A4) instance
included or Two Head Nodes on a Large (A3) instance included."


Warnings or important notes appear in a box like this.

Tips and tricks appear like this.


Reader feedback

Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or disliked. Reader feedback is important for us as it helps
us develop titles that you will really get the most out of.
To send us general feedback, simply send us an e-mail, and mention
the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things
to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at
http://www.packtpub.com for all the Packt Publishing books you have purchased.
If you purchased this book elsewhere, you can register on the Packt Publishing
website to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you could report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting
http://www.packtpub.com/submit-errata, selecting your book, clicking on the
Errata Submission Form link, and entering the details of your errata.


Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata
section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/
content/support and enter the name of the book in the search field. The required
information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously.
If you come across any illegal copies of our works in any form on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.
Please contact us with a link to the suspected
pirated material.
We appreciate your help in protecting our authors and our ability to bring you
valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us,
and we will do our best to address the problem.



Hadoop and HDInsight
in a Heartbeat
This chapter will provide an overview of Apache Hadoop and Microsoft's big data
strategy, where Microsoft HDInsight plays an important role. We will cover the
following topics:
• The era of big data
• Hadoop concepts
• Hadoop distributions
• HDInsight overview
• Hadoop on Windows deployment options

Data is everywhere

We live in a digital era and are always connected with friends and family using
social media and smartphones. In 2014, every second over 5,700 tweets were sent
and 800 links were shared using Facebook, and the digital universe was growing
by about 1.7 MB per minute for every person on Earth (source: IDC 2014 report). This amount
of data sharing and storing is unprecedented and is contributing to what is known
as big data.



The following infographic shows you the details of our current use of the top social
media sites:
Other contributors to big data are the smart connected devices such as smartphones,
appliances, cars, sensors, and pretty much everything that we use today and is
connected to the Internet. These devices, which will soon be in trillions, continuously
collect data and communicate with each other about their environment to make
intelligent decisions and help us live better. This digitization of the world has added
to the exponential growth of big data.
The following figure depicts the trend analysis done by Microsoft Azure, which
shows the evolution of big data and the "internet of things". In the period 1980 to 1990,
IT systems such as ERM/CRM primarily generated data in a well-structured format
with volume in GBs. In the period between 1990 and 2000, the Web and mobile
applications emerged and data volumes increased to terabytes. After the
year 2000, social networking sites, Wikis, blogs, and smart devices emerged and
now we are dealing with petabytes of data. The section in blue highlights the big
data era that includes social media, sensors, and images, where Volume, Velocity,
and Variety are the norms. One related key trend is the price of hardware, which
dropped from $190/GB in 1980 to $0.07/GB in 2010. This has been a key enabler
in big data adoption.

According to the 2014 IDC digital universe report, the growth trend will continue
and double in size every two years. In 2013, about 4.4 zettabytes were created and in
2020 the forecast is 44 zettabytes, which is 44 trillion gigabytes (source:
http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm).

Source: Microsoft TechEd North America 2014, From Zero to Data Insights from HDInsight on Microsoft Azure

Business value of big data


While we generated 4.4 zettabytes of data in 2013, only five percent of it was actually
analyzed, and this is the real opportunity of big data. The IDC report forecasts that by
2020, we will analyze over 35 percent of generated data by making smarter sensors
and devices. This data will drive new consumer and business behavior that will
create trillions of dollars in opportunity for IT vendors and organizations analyzing
this data.
Let's look at some real use cases that have benefited from big data:
• IT systems in all major banks are constantly monitoring fraudulent activities
and alerting customers within milliseconds. These systems apply complex
business rules and analyze historical data, geography, type of vendor, and
other parameters based on the customer to get accurate results.


• Commercial drones are transforming agriculture by analyzing real-time
aerial images and identifying the problem areas. These drones are cheaper
and more efficient than satellite imagery, as they fly under the clouds and
can take images anytime. They identify irrigation issues related to water,
pests, or fungal infections, which thereby increases the crop productivity and
quality. These drones are equipped with technology to capture high quality
images every second and transfer them to a cloud hosted big data system for
further processing. (You can refer to />featuredstory/526491/agricultural-drones/.)
• Developers of the blockbuster Halo 4 game were tasked to analyze player
preferences and support an online tournament in the cloud. The game
attracted over 4 million players in its first five days after the launch. The
development team also had to design a solution that kept track of the leader
board for the global Halo 4 Infinity Challenge, which was open to all players.

The development team chose the Azure HDInsight service to analyze the
massive amounts of unstructured data in a distributed manner. The results
from HDInsight were reported using Microsoft SQL Server PowerPivot and
SharePoint, and business was extremely happy with the response times for
their queries, which were a few hours or less (source: http://www.microsoft.com/
casestudies/Windows-Azure/343-Industries/343-Industries-Gets-New-User-Insights-from-Big-Data-in-the-Cloud/710000002102).

Hadoop concepts

Apache Hadoop is the leading open source big data platform that can store and
analyze massive amounts of structured and unstructured data efficiently and can
be hosted on low cost commodity hardware. There are other technologies that
complement Hadoop under the big data umbrella, such as MongoDB, a document-oriented
NoSQL database; Cassandra, a wide-column NoSQL database; and VoltDB, an in-memory database.
This section describes Apache Hadoop core concepts and its ecosystem.

Brief history of Hadoop

Doug Cutting created Hadoop; he named it after his kid's stuffed yellow elephant
and it has no real meaning. In 2004, the initial version of Hadoop was launched as
Nutch Distributed Filesystem (NDFS). In February 2006, Apache Hadoop project
was officially started as a standalone development for MapReduce and HDFS. By
2008, Yahoo adopted Hadoop as the engine of its Web search with a cluster size
of around 10,000. In the same year, 2008, Hadoop graduated to a top-level Apache
project, confirming its success. In 2012, Hadoop 2.x was launched with YARN,
enabling Hadoop to take on various types of workloads.