Hadoop Beginner's Guide

Learn how to crunch big data to extract meaning from the
data avalanche

Garry Turkington

BIRMINGHAM - MUMBAI

Hadoop Beginner's Guide
Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly or
indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2013



Production Reference: 1150213

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-730-0
www.packtpub.com

Cover Image by Asher Wishkerman

Credits

Author
Garry Turkington

Reviewers
David Gruzman
Muthusamy Manigandan
Vidyasagar N V

Acquisition Editor
Robin de Jongh

Lead Technical Editor
Azharuddin Sheikh

Technical Editors
Ankita Meshram
Varun Pius Rodrigues

Copy Editors
Brandt D'Mello
Aditya Nair
Laxmi Subramanian
Ruta Waghmare

Project Coordinator
Leena Purkait

Proofreader
Maria Gould

Indexer
Hemangini Bari

Production Coordinator
Nitesh Thakur

Cover Work
Nitesh Thakur

About the Author
Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams building systems that process Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and USA.
He has BSc and PhD degrees in Computer Science from the Queen's University of Belfast in Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology in the USA.
I would like to thank my wife Lea for her support and encouragement—not to mention her patience—throughout the writing of this book, and my daughter, Maya, whose spirit and curiosity are more of an inspiration than she could ever imagine.

About the Reviewers
David Gruzman is a Hadoop and big data architect with more than 18 years of hands-on experience, specializing in the design and implementation of scalable, high-performance distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He is an adept of Agile methodology and strongly believes that a daily coding routine makes good software architects. He is interested in solving challenging problems related to real-time analytics and the application of machine learning algorithms to big data sets.
He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area of big data. Visit their site at www.bigdatacraft.com. David can be contacted at david@bigdatacraft.com. More detailed information about his skills and experience can be found online.

Muthusamy Manigandan is a systems architect for a startup. Prior to this, he was a Staff Engineer at VMware and a Principal Engineer with Oracle. Mani has been programming for the past 14 years on large-scale distributed-computing applications. His areas of interest are machine learning and algorithms.

Vidyasagar N V has been interested in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later, he went to the prestigious Institute of Technology, Banaras Hindu University, for his B.Tech. He has been working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages, as well as with flat files, indexed files, hierarchical databases, network databases, relational databases, NoSQL databases, Hadoop, and related technologies. Currently, he is working as a Senior Developer at Collective Inc., developing big data-based structured data extraction techniques from the Web and local information. He enjoys producing high-quality software and web-based solutions and designing secure and scalable data systems.
I would like to thank the Almighty; my parents, Mr. N Srinivasa Rao and Mrs. Latha Rao; and my family, who supported and backed me throughout my life. I would also like to thank my friends for being good friends, and all those people willing to donate their time, effort, and expertise by participating in open source software projects. Thank you, Packt Publishing, for selecting me as one of the technical reviewers for this wonderful book. It is my honor to be a part of it.

www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related
to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.


Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?
‹‹ Fully searchable across every book published by Packt
‹‹ Copy and paste, print and bookmark content
‹‹ On demand and accessible via web browser

Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine entirely free books. Simply use your login credentials for
immediate access.

Table of Contents

Preface

Chapter 1: What It's All About
  Big data processing
  The value of data
  Historically for the few and not the many
  Classic data processing systems
  Limiting factors
  A different approach
  All roads lead to scale-out
  Share nothing
  Expect failure
  Smart software, dumb hardware
  Move processing, not data
  Build applications, not infrastructure
  Hadoop
  Thanks, Google
  Thanks, Doug
  Thanks, Yahoo
  Parts of Hadoop
  Common building blocks
  HDFS
  MapReduce
  Better together
  Common architecture
  What it is and isn't good for
  Cloud computing with Amazon Web Services
  Too many clouds
  A third way
  Different types of costs
  AWS – infrastructure on demand from Amazon
  Elastic Compute Cloud (EC2)
  Simple Storage Service (S3)
  Elastic MapReduce (EMR)
  What this book covers
  A dual approach
  Summary

Chapter 2: Getting Hadoop Up and Running
  Hadoop on a local Ubuntu host
  Other operating systems
  Time for action – checking the prerequisites
  Setting up Hadoop
  A note on versions
  Time for action – downloading Hadoop
  Time for action – setting up SSH
  Configuring and running Hadoop
  Time for action – using Hadoop to calculate Pi
  Three modes
  Time for action – configuring the pseudo-distributed mode
  Configuring the base directory and formatting the filesystem
  Time for action – changing the base HDFS directory
  Time for action – formatting the NameNode
  Starting and using Hadoop
  Time for action – starting Hadoop
  Time for action – using HDFS
  Time for action – WordCount, the Hello World of MapReduce
  Monitoring Hadoop from the browser
  The HDFS web UI
  Using Elastic MapReduce
  Setting up an account on Amazon Web Services
  Creating an AWS account
  Signing up for the necessary services
  Time for action – WordCount in EMR using the management console
  Other ways of using EMR
  AWS credentials
  The EMR command-line tools
  The AWS ecosystem
  Comparison of local versus EMR Hadoop
  Summary

Chapter 3: Understanding MapReduce
  Key/value pairs
  What it means
  Why key/value data?
  Some real-world examples
  MapReduce as a series of key/value transformations
  The Hadoop Java API for MapReduce
  The 0.20 MapReduce Java API
  The Mapper class
  The Reducer class
  The Driver class
  Writing MapReduce programs
  Time for action – setting up the classpath
  Time for action – implementing WordCount
  Time for action – building a JAR file
  Time for action – running WordCount on a local Hadoop cluster
  Time for action – running WordCount on EMR
  The pre-0.20 Java MapReduce API
  Hadoop-provided mapper and reducer implementations
  Time for action – WordCount the easy way
  Walking through a run of WordCount
  Startup
  Splitting the input
  Task assignment
  Task startup
  Ongoing JobTracker monitoring
  Mapper input
  Mapper execution
  Mapper output and reduce input
  Partitioning
  The optional partition function
  Reducer input
  Reducer execution
  Reducer output
  Shutdown
  That's all there is to it!
  Apart from the combiner…maybe
  Why have a combiner?
  Time for action – WordCount with a combiner
  When you can use the reducer as the combiner
  Time for action – fixing WordCount to work with a combiner
  Reuse is your friend
  Hadoop-specific data types
  The Writable and WritableComparable interfaces
  Introducing the wrapper classes
  Primitive wrapper classes
  Array wrapper classes
  Map wrapper classes
  Time for action – using the Writable wrapper classes
  Other wrapper classes
  Making your own
  Input/output
  Files, splits, and records
  InputFormat and RecordReader
  Hadoop-provided InputFormat
  Hadoop-provided RecordReader
  Output formats and RecordWriter
  Hadoop-provided OutputFormat
  Don't forget Sequence files
  Summary

Chapter 4: Developing MapReduce Programs
  Using languages other than Java with Hadoop
  How Hadoop Streaming works
  Why to use Hadoop Streaming
  Time for action – WordCount using Streaming
  Differences in jobs when using Streaming
  Analyzing a large dataset
  Getting the UFO sighting dataset
  Getting a feel for the dataset
  Time for action – summarizing the UFO data
  Examining UFO shapes
  Time for action – summarizing the shape data
  Time for action – correlating sighting duration to UFO shape
  Using Streaming scripts outside Hadoop
  Time for action – performing the shape/time analysis from the command line
  Java shape and location analysis
  Time for action – using ChainMapper for field validation/analysis
  Too many abbreviations
  Using the Distributed Cache
  Time for action – using the Distributed Cache to improve location output
  Counters, status, and other output
  Time for action – creating counters, task states, and writing log output
  Too much information!
  Summary

Chapter 5: Advanced MapReduce Techniques
  Simple, advanced, and in-between
  Joins
  When this is a bad idea
  Map-side versus reduce-side joins
  Matching account and sales information
  Time for action – reduce-side joins using MultipleInputs
  DataJoinMapper and TaggedMapperOutput
  Implementing map-side joins
  Using the Distributed Cache
  Pruning data to fit in the cache
  Using a data representation instead of raw data
  Using multiple mappers
  To join or not to join...
  Graph algorithms
  Graph 101
  Graphs and MapReduce – a match made somewhere
  Representing a graph
  Time for action – representing the graph
  Overview of the algorithm
  The mapper
  The reducer
  Iterative application
  Time for action – creating the source code
  Time for action – the first run
  Time for action – the second run
  Time for action – the third run
  Time for action – the fourth and last run
  Running multiple jobs
  Final thoughts on graphs
  Using language-independent data structures
  Candidate technologies
  Introducing Avro
  Time for action – getting and installing Avro
  Avro and schemas
  Time for action – defining the schema
  Time for action – creating the source Avro data with Ruby
  Time for action – consuming the Avro data with Java
  Using Avro within MapReduce
  Time for action – generating shape summaries in MapReduce
  Time for action – examining the output data with Ruby
  Time for action – examining the output data with Java
  Going further with Avro
  Summary

Chapter 6: When Things Break
  Failure
  Embrace failure
  Or at least don't fear it
  Don't try this at home
  Types of failure
  Hadoop node failure
  The dfsadmin command
  Cluster setup, test files, and block sizes
  Fault tolerance and Elastic MapReduce
  Time for action – killing a DataNode process
  NameNode and DataNode communication
  Time for action – the replication factor in action
  Time for action – intentionally causing missing blocks
  When data may be lost
  Block corruption
  Time for action – killing a TaskTracker process
  Comparing the DataNode and TaskTracker failures
  Permanent failure
  Killing the cluster masters
  Time for action – killing the JobTracker
  Starting a replacement JobTracker
  Time for action – killing the NameNode process
  Starting a replacement NameNode
  The role of the NameNode in more detail
  File systems, files, blocks, and nodes
  The single most important piece of data in the cluster – fsimage
  DataNode startup
  Safe mode
  SecondaryNameNode
  So what to do when the NameNode process has a critical failure?
  BackupNode/CheckpointNode and NameNode HA
  Hardware failure
  Host failure
  Host corruption
  The risk of correlated failures
  Task failure due to software
  Failure of slow running tasks
  Time for action – causing task failure
  Hadoop's handling of slow-running tasks
  Speculative execution
  Hadoop's handling of failing tasks
  Task failure due to data
  Handling dirty data through code
  Using Hadoop's skip mode
  Time for action – handling dirty data by using skip mode
  To skip or not to skip...
  Summary

Chapter 7: Keeping Things Running
  A note on EMR
  Hadoop configuration properties
  Default values
  Time for action – browsing default properties
  Additional property elements
  Default storage location
  Where to set properties
  Setting up a cluster
  How many hosts?
  Calculating usable space on a node
  Location of the master nodes
  Sizing hardware
  Processor / memory / storage ratio
  EMR as a prototyping platform
  Special node requirements
  Storage types
  Commodity versus enterprise class storage
  Single disk versus RAID
  Finding the balance
  Network storage
  Hadoop networking configuration
  How blocks are placed
  Rack awareness
  Time for action – examining the default rack configuration
  Time for action – adding a rack awareness script
  What is commodity hardware anyway?
  Cluster access control
  The Hadoop security model
  Time for action – demonstrating the default security
  User identity
  More granular access control
  Working around the security model via physical access control
  Managing the NameNode
  Configuring multiple locations for the fsimage class
  Time for action – adding an additional fsimage location
  Where to write the fsimage copies
  Swapping to another NameNode host
  Having things ready before disaster strikes
  Time for action – swapping to a new NameNode host
  Don't celebrate quite yet!
  What about MapReduce?
  Managing HDFS
  Where to write data
  Using balancer
  When to rebalance
  MapReduce management
  Command line job management
  Job priorities and scheduling
  Time for action – changing job priorities and killing a job
  Alternative schedulers
  Capacity Scheduler
  Fair Scheduler
  Enabling alternative schedulers
  When to use alternative schedulers
  Scaling
  Adding capacity to a local Hadoop cluster
  Adding capacity to an EMR job flow
  Expanding a running job flow
  Summary

Chapter 8: A Relational View on Data with Hive
  Overview of Hive
  Why use Hive?
  Thanks, Facebook!
  Setting up Hive
  Prerequisites
  Getting Hive
  Time for action – installing Hive
  Using Hive
  Time for action – creating a table for the UFO data
  Time for action – inserting the UFO data
  Validating the data
  Time for action – validating the table
  Time for action – redefining the table with the correct column separator
  Hive tables – real or not?
  Time for action – creating a table from an existing file
  Time for action – performing a join
  Hive and SQL views
  Time for action – using views
  Handling dirty data in Hive
  Time for action – exporting query output
  Partitioning the table
  Time for action – making a partitioned UFO sighting table
  Bucketing, clustering, and sorting... oh my!
  User Defined Function
  Time for action – adding a new User Defined Function (UDF)
  To preprocess or not to preprocess...
  Hive versus Pig
  What we didn't cover
  Hive on Amazon Web Services
  Time for action – running UFO analysis on EMR
  Using interactive job flows for development
  Integration with other AWS products
  Summary

Chapter 9: Working with Relational Databases
  Common data paths
  Hadoop as an archive store
  Hadoop as a preprocessing step
  Hadoop as a data input tool
  The serpent eats its own tail
  Setting up MySQL
  Time for action – installing and setting up MySQL
  Did it have to be so hard?
  Time for action – configuring MySQL to allow remote connections
  Don't do this in production!
  Time for action – setting up the employee database
  Be careful with data file access rights
  Getting data into Hadoop
  Using MySQL tools and manual import
  Accessing the database from the mapper
  A better way – introducing Sqoop
  Time for action – downloading and configuring Sqoop
  Sqoop and Hadoop versions
  Sqoop and HDFS
  Time for action – exporting data from MySQL to HDFS
  Sqoop's architecture
  Importing data into Hive using Sqoop
  Time for action – exporting data from MySQL into Hive
  Time for action – a more selective import
  Datatype issues
  Time for action – using a type mapping
  Time for action – importing data from a raw query
  Sqoop and Hive partitions
  Field and line terminators
  Getting data out of Hadoop
  Writing data from within the reducer
  Writing SQL import files from the reducer
  A better way – Sqoop again
  Time for action – importing data from Hadoop into MySQL
  Differences between Sqoop imports and exports
  Inserts versus updates
  Sqoop and Hive exports
  Time for action – importing Hive data into MySQL
  Time for action – fixing the mapping and re-running the export
  Other Sqoop features
  AWS considerations
  Considering RDS
  Summary

Chapter 10: Data Collection with Flume
  A note about AWS
  Data data everywhere
  Types of data
  Getting network traffic into Hadoop
  Time for action – getting web server data into Hadoop
  Getting files into Hadoop
  Hidden issues
  Keeping network data on the network
  Hadoop dependencies
  Reliability
  Re-creating the wheel
  A common framework approach
  Introducing Apache Flume
  A note on versioning
  Time for action – installing and configuring Flume
  Using Flume to capture network data
  Time for action – capturing network traffic to a log file
  Time for action – logging to the console
  Writing network data to log files
  Time for action – capturing the output of a command in a flat file
  Logs versus files
  Time for action – capturing a remote file in a local flat file
  Sources, sinks, and channels
  Sources
  Sinks
  Channels
  Or roll your own
  Understanding the Flume configuration files
  It's all about events
  Time for action – writing network traffic onto HDFS
  Time for action – adding timestamps
  To Sqoop or to Flume...
  Time for action – multi level Flume networks
  Time for action – writing to multiple sinks
  Selectors replicating and multiplexing
  Handling sink failure
  Next, the world
  The bigger picture
  Data lifecycle
  Staging data
  Scheduling
  Summary

Chapter 11: Where to Go Next
  What we did and didn't cover in this book
  Upcoming Hadoop changes
  Alternative distributions
  Why alternative distributions?
  Bundling
  Free and commercial extensions
  Choosing a distribution
  Other Apache projects
  HBase
  Oozie
  Whir
  Mahout
  MRUnit
  Other programming abstractions
  Pig
  Cascading
  AWS resources
  HBase on EMR
  SimpleDB
  DynamoDB
  Sources of information
  Source code
  Mailing lists and forums
  LinkedIn groups
  HUGs
  Conferences
  Summary

Appendix: Pop Quiz Answers
  Chapter 3, Understanding MapReduce
  Chapter 7, Keeping Things Running

Index


Preface
This book is here to help you make sense of Hadoop and use it to solve your big data
problems. It's a really exciting time to work with data processing technologies such as
Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of
large corporations and government agencies—is now possible through free open source
software (OSS).
But because of the seeming complexity and pace of change in this area, getting a grip on
the basics can be somewhat intimidating. That's where this book comes in, giving you an
understanding of just what Hadoop is, how it works, and how you can use it to extract
value from your data now.
In addition to an explanation of core Hadoop, we also spend several chapters exploring other technologies that either use Hadoop or integrate with it. Our goal is to give you an understanding not just of what Hadoop is but also of how to use it as a part of your broader technical infrastructure.
A complementary technology is the use of cloud computing, and in particular the offerings from Amazon Web Services. Throughout the book, we will show you how to use these services to host your Hadoop workloads, demonstrating that not only can you process large data volumes, but you also don't actually need to buy any physical hardware to do so.

What this book covers
This book comprises three main parts: Chapters 1 through 5, which cover the core of Hadoop and how it works; Chapters 6 and 7, which cover the more operational aspects of Hadoop; and Chapters 8 through 11, which look at the use of Hadoop alongside other products and technologies.


Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and
cloud computing such important technologies today.
Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local
Hadoop cluster and the running of some demo jobs. For comparison, the same work is also
executed on the hosted Hadoop Amazon service.
Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how
MapReduce jobs are executed and shows how to write applications using the Java API.
Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data
set to demonstrate techniques to help when deciding how to approach the processing and
analysis of a new data source.
Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of
applying MapReduce to problems that don't necessarily seem immediately applicable to the
Hadoop processing model.

Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault tolerance in some detail and sees just how good it is by intentionally causing havoc: killing processes and deliberately using corrupt data.
Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be
of most use for those who need to administer a Hadoop cluster. Along with demonstrating
some best practice, it describes how to prepare for the worst operational disasters so you
can sleep at night.
Chapter 8, A Relational View on Data with Hive, introduces Apache Hive, which allows Hadoop data to be queried with a SQL-like syntax.
Chapter 9, Working with Relational Databases, explores how Hadoop can be integrated with existing databases, and in particular, how to move data from one to the other.
Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather data from multiple sources and deliver it to destinations such as Hadoop.
Chapter 11, Where to Go Next, wraps up the book with an overview of the broader Hadoop ecosystem, highlighting other products and technologies of potential interest. In addition, it gives some ideas on how to get involved with the Hadoop community and where to get help.

What you need for this book
As we discuss the various Hadoop-related software packages used in this book, we will
describe the particular requirements for each chapter. However, you will generally need
somewhere to run your Hadoop cluster.

In the simplest case, a single Linux-based machine will give you a platform to explore almost
all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as
long as you have command-line Linux familiarity, any modern distribution will suffice.
Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration.
Since we also explore Amazon Web Services in this book, you can run all the examples on
EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout
the book. AWS services are usable by anyone, but you will need a credit card to sign up!

Who this book is for
We assume you are reading this book because you want to know more about Hadoop at
a hands-on level; the key audience is those with software development experience but no
prior exposure to Hadoop or similar big data technologies.
For developers who want to know how to write MapReduce applications, we assume you are
comfortable writing Java programs and are familiar with the Unix command-line interface.
We will also show you a few programs in Ruby, but these are usually only to demonstrate
language independence, and you don't need to be a Ruby expert.
For architects and system administrators, the book also provides significant value in
explaining how Hadoop works, its place in the broader architecture, and how it can be
managed operationally. Some of the more involved techniques in Chapter 4, Developing
MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably
of less direct interest to this audience.

Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions on how to complete a procedure or task, we use:

Time for action – heading
1. Action 1
2. Action 2
3. Action 3

Instructions often need some extra explanation so that they make sense, so they are
followed with:

What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:

Pop quiz – heading
These are short multiple-choice questions intended to help you test your own
understanding.

Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you
have learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "You may notice that we used the Unix command
rm to remove the Drush directory rather than the DOS del command."
A block of code is set as follows:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

When we wish to draw your attention to a particular part of a code block, the relevant lines
or items are set in bold:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
