www.allitebooks.com
Hadoop MapReduce v2 Cookbook Second Edition
Table of Contents
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Hadoop v2
Introduction
Hadoop Distributed File System – HDFS
Hadoop YARN
Hadoop MapReduce
Hadoop installation modes
Setting up Hadoop v2 on your local machine
Getting ready
How to do it…
How it works…
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Getting ready
How to do it…
How it works…
There’s more…
See also
Adding a combiner step to the WordCount MapReduce program
How to do it…
How it works…
There’s more…
Setting up HDFS
Getting ready
How to do it…
See also
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Getting ready
How to do it…
How it works…
See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
Getting ready
How to do it…
There’s more…
HDFS command-line file operations
Getting ready
How to do it…
How it works…
There’s more…
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it…
How it works…
There’s more…
Benchmarking HDFS using DFSIO
Getting ready
How to do it…
How it works…
There’s more…
Benchmarking Hadoop MapReduce using TeraSort
Getting ready
How to do it…
How it works…
2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Getting ready
How to do it…
See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
How to do it…
There’s more…
See also
Executing a Pig script using EMR
How to do it…
There’s more…
Starting a Pig interactive session
Executing a Hive script using EMR
How to do it…
There’s more…
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the AWS Command Line Interface
Getting ready
How to do it…
There’s more…
See also
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Getting ready
How to do it…
See also
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
How to do it…
There’s more…
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it…
How it works…
See also
3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Getting ready
How to do it…
How it works…
There’s more…
Shared user Hadoop clusters – using Fair and Capacity schedulers
How to do it…
How it works…
There’s more…
Setting classpath precedence to user-provided JARs
How to do it…
How it works…
Speculative execution of straggling tasks
How to do it…
There’s more…
Unit testing Hadoop MapReduce applications using MRUnit
Getting ready
How to do it…
See also
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Getting ready
How to do it…
See also
Adding a new DataNode
Getting ready
How to do it…
There’s more…
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it…
How it works…
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it…
Setting the HDFS block size
How to do it…
There’s more…
See also
Setting the file replication factor
How to do it…
How it works…
There’s more…
See also
Using the HDFS Java API
How to do it…
How it works…
There’s more…
Configuring the FileSystem object
Retrieving the list of data blocks of a file
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it…
There’s more…
See also
Implementing a custom Hadoop Writable data type
How to do it…
How it works…
There’s more…
See also
Implementing a custom Hadoop key type
How to do it…
How it works…
See also
Emitting data of different value types from a Mapper
How to do it…
How it works…
There’s more…
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it…
How it works…
There’s more…
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it…
How it works…
There’s more…
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it…
How it works…
There’s more…
Writing multiple outputs from a MapReduce computation
How to do it…
How it works…
Using multiple input data types and multiple Mapper implementations in a single MapReduce application
See also
Hadoop intermediate data partitioning
How to do it…
How it works…
There’s more…
TotalOrderPartitioner
KeyFieldBasedPartitioner
Secondary sorting – sorting Reduce input values
How to do it…
How it works…
See also
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it…
How it works…
There’s more…
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
How to do it…
How it works…
There’s more…
See also
Adding dependencies between MapReduce jobs
How to do it…
How it works…
There’s more…
Hadoop counters to report custom metrics
How to do it…
How it works…
5. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it…
How it works…
There’s more…
Performing GROUP BY using MapReduce
Getting ready
How to do it…
How it works…
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it…
How it works…
There’s more…
Plotting the Hadoop MapReduce results using gnuplot
Getting ready
How to do it…
How it works…
There’s more…
Calculating histograms using MapReduce
Getting ready
How to do it…
How it works…
Calculating Scatter plots using MapReduce
Getting ready
How to do it…
How it works…
Parsing a complex dataset with Hadoop
Getting ready
How to do it…
How it works…
There’s more…
Joining two datasets using MapReduce
Getting ready
How to do it…
How it works…
6. Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
How to do it…
See also
Creating databases and tables using Hive CLI
Getting ready
How to do it…
How it works…
There’s more…
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Getting ready
How to do it…
How it works…
There’s more…
Using Apache Tez as the execution engine for Hive
See also
Creating and populating Hive tables and views using Hive query results
Getting ready
How to do it…
Utilizing different storage formats in Hive - storing table data using ORC files
Getting ready
How to do it…
How it works…
Using Hive built-in functions
Getting ready
How to do it…
How it works…
There’s more…
See also
Hive batch mode - using a query file
How to do it…
How it works…
There’s more…
See also
Performing a join with Hive
Getting ready
How to do it…
How it works…
See also
Creating partitioned Hive tables
Getting ready
How to do it…
Writing Hive User-defined Functions (UDF)
Getting ready
How to do it…
How it works…
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
Getting ready
How to do it…
How it works…
HCatalog – writing data to Hive tables from Java MapReduce computations
Getting ready
How to do it…
How it works…
7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Getting ready
How to do it…
How it works…
There’s more…
See also
Joining two datasets using Pig
How to do it…
How it works…
There’s more…
Accessing a Hive table data in Pig using HCatalog
Getting ready
How to do it…
There’s more…
See also
Getting started with Apache HBase
Getting ready
How to do it…
There’s more…
See also
Data random access using Java client APIs
Getting ready
How to do it…
How it works…
Running MapReduce jobs on HBase
Getting ready
How to do it…
How it works…
Using Hive to insert data into HBase tables
Getting ready
How to do it…
See also
Getting started with Apache Mahout
How to do it…
How it works…
There’s more…
Running K-means with Mahout
Getting ready
How to do it…
How it works…
Importing data to HDFS from a relational database using Apache Sqoop
Getting ready
How to do it…
Exporting data from HDFS to a relational database using Apache Sqoop
Getting ready
How to do it…
8. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it…
How it works…
There’s more…
Outputting a random accessible indexed InvertedIndex
See also
Intradomain web crawling using Apache Nutch
Getting ready
How to do it…
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it…
How it works…
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it…
How it works…
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it…
How it works…
See also
Elasticsearch for indexing and searching
Getting ready
How to do it…
How it works…
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it…
How it works…
See also
9. Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
How to do it…
How it works…
There’s more…
Classification using the naïve Bayes classifier
How to do it…
How it works…
Assigning advertisements to keywords using the Adwords balance algorithm
How to do it…
How it works…
There’s more…
10. Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
Getting ready
How to do it…
How it works…
There’s more…
See also
De-duplicating data using Hadoop streaming
Getting ready
How to do it…
How it works…
See also
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Getting ready
How to do it…
How it works…
There’s more…
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it…
How it works…
See also
Clustering text data using Apache Mahout
Getting ready
How to do it…
How it works…
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it…
How it works…
See also
Document classification using Mahout Naive Bayes Classifier
Getting ready
How to do it…
How it works…
See also
Index
Hadoop MapReduce v2 Cookbook Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: January 2013
Second edition: February 2015
Production reference: 1200215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-547-1
www.packtpub.com
Cover image by Jarek Blaminsky
Credits
Authors
Thilina Gunarathne
Srinath Perera
Reviewers
Skanda Bhargav
Randal Scott King
Dmitry Spikhalskiy
Jeroen van Wilgenburg
Shinichi Yamashita
Commissioning Editor
Edward Gordon
Acquisition Editors
Joanne Fitzpatrick
Content Development Editor
Shweta Pant
Technical Editors
Indrajit A. Das
Pankaj Kadam
Copy Editors
Puja Lalwani
Alfida Paiva
Laxmi Subramanian
Project Coordinator
Shipra Chawhan
Proofreaders
Bridget Braund
Maria Gould
Paul Hindle
Bernadette Watkins
Indexer