
Hadoop MapReduce v2 Cookbook: Explore the Hadoop MapReduce v2 ecosystem to gain insights from very large datasets, 2nd Edition


www.allitebooks.com




Hadoop MapReduce v2 Cookbook Second Edition



Table of Contents
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Hadoop v2
Introduction
Hadoop Distributed File System – HDFS
Hadoop YARN
Hadoop MapReduce
Hadoop installation modes
Setting up Hadoop v2 on your local machine
Getting ready
How to do it…
How it works…
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Getting ready
How to do it…
How it works…
There’s more…
See also
Adding a combiner step to the WordCount MapReduce program
How to do it…
How it works…
There’s more…
Setting up HDFS
Getting ready
How to do it…
See also
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Getting ready
How to do it…
How it works…
See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
Getting ready
How to do it…
There’s more…
HDFS command-line file operations
Getting ready
How to do it…
How it works…
There’s more…
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it…
How it works…
There’s more…
Benchmarking HDFS using DFSIO
Getting ready
How to do it…
How it works…
There’s more…
Benchmarking Hadoop MapReduce using TeraSort
Getting ready
How to do it…
How it works…
2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Getting ready
How to do it…
See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
How to do it…
There’s more…
See also
Executing a Pig script using EMR
How to do it…
There’s more…
Starting a Pig interactive session
Executing a Hive script using EMR
How to do it…
There’s more…
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the AWS Command Line Interface
Getting ready
How to do it…
There’s more…
See also
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Getting ready
How to do it…
See also
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
How to do it…
There’s more…
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it…
How it works…
See also
3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Getting ready
How to do it…
How it works…
There’s more…
Shared user Hadoop clusters – using Fair and Capacity schedulers
How to do it…
How it works…
There’s more…
Setting classpath precedence to user-provided JARs
How to do it…
How it works…
Speculative execution of straggling tasks
How to do it…
There’s more…
Unit testing Hadoop MapReduce applications using MRUnit
Getting ready
How to do it…
See also
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Getting ready
How to do it…
See also
Adding a new DataNode
Getting ready
How to do it…
There’s more…
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it…
How it works…
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it…
Setting the HDFS block size
How to do it…
There’s more…
See also
Setting the file replication factor
How to do it…
How it works…
There’s more…
See also
Using the HDFS Java API
How to do it…
How it works…
There’s more…
Configuring the FileSystem object
Retrieving the list of data blocks of a file
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it…
There’s more…
See also
Implementing a custom Hadoop Writable data type
How to do it…
How it works…
There’s more…
See also
Implementing a custom Hadoop key type
How to do it…
How it works…
See also
Emitting data of different value types from a Mapper
How to do it…
How it works…
There’s more…
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it…
How it works…
There’s more…
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it…
How it works…
There’s more…
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it…
How it works…
There’s more…
Writing multiple outputs from a MapReduce computation
How to do it…
How it works…
Using multiple input data types and multiple Mapper implementations in a single MapReduce application
See also
Hadoop intermediate data partitioning
How to do it…
How it works…
There’s more…
TotalOrderPartitioner
KeyFieldBasedPartitioner
Secondary sorting – sorting Reduce input values
How to do it…
How it works…
See also
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it…
How it works…
There’s more…
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
How to do it…
How it works…
There’s more…
See also
Adding dependencies between MapReduce jobs
How to do it…
How it works…
There’s more…
Hadoop counters to report custom metrics
How to do it…
How it works…
5. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it…
How it works…
There’s more…
Performing GROUP BY using MapReduce
Getting ready
How to do it…
How it works…
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it…
How it works…
There’s more…
Plotting the Hadoop MapReduce results using gnuplot
Getting ready
How to do it…
How it works…
There’s more…
Calculating histograms using MapReduce
Getting ready
How to do it…
How it works…
Calculating Scatter plots using MapReduce
Getting ready
How to do it…
How it works…
Parsing a complex dataset with Hadoop
Getting ready
How to do it…
How it works…
There’s more…
Joining two datasets using MapReduce
Getting ready
How to do it…
How it works…
6. Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
How to do it…
See also
Creating databases and tables using Hive CLI
Getting ready
How to do it…
How it works…
There’s more…
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Getting ready
How to do it…
How it works…
There’s more…
Using Apache Tez as the execution engine for Hive
See also
Creating and populating Hive tables and views using Hive query results
Getting ready
How to do it…
Utilizing different storage formats in Hive - storing table data using ORC files
Getting ready
How to do it…
How it works…
Using Hive built-in functions
Getting ready
How to do it…
How it works…
There’s more…
See also
Hive batch mode - using a query file
How to do it…
How it works…
There’s more…
See also
Performing a join with Hive
Getting ready
How to do it…
How it works…
See also
Creating partitioned Hive tables
Getting ready
How to do it…
Writing Hive User-defined Functions (UDF)
Getting ready
How to do it…
How it works…
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
Getting ready
How to do it…
How it works…
HCatalog – writing data to Hive tables from Java MapReduce computations
Getting ready
How to do it…
How it works…
7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Getting ready
How to do it…
How it works…
There’s more…
See also
Joining two datasets using Pig
How to do it…
How it works…
There’s more…
Accessing a Hive table data in Pig using HCatalog
Getting ready
How to do it…
There’s more…
See also
Getting started with Apache HBase
Getting ready
How to do it…
There’s more…
See also
Data random access using Java client APIs
Getting ready
How to do it…
How it works…
Running MapReduce jobs on HBase
Getting ready
How to do it…
How it works…
Using Hive to insert data into HBase tables
Getting ready
How to do it…
See also
Getting started with Apache Mahout
How to do it…
How it works…
There’s more…
Running K-means with Mahout
Getting ready
How to do it…
How it works…
Importing data to HDFS from a relational database using Apache Sqoop
Getting ready
How to do it…
Exporting data from HDFS to a relational database using Apache Sqoop
Getting ready
How to do it…
8. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it…
How it works…
There’s more…
Outputting a random accessible indexed InvertedIndex
See also
Intradomain web crawling using Apache Nutch
Getting ready
How to do it…
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it…
How it works…
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it…
How it works…
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it…
How it works…
See also
Elasticsearch for indexing and searching
Getting ready
How to do it…
How it works…
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it…
How it works…
See also
9. Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
How to do it…
How it works…
There’s more…
Classification using the naïve Bayes classifier
How to do it…
How it works…
Assigning advertisements to keywords using the Adwords balance algorithm
How to do it…
How it works…
There’s more…
10. Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
Getting ready
How to do it…
How it works…
There’s more…
See also
De-duplicating data using Hadoop streaming
Getting ready
How to do it…
How it works…
See also
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Getting ready
How to do it…
How it works…
There’s more…
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it…
How it works…
See also
Clustering text data using Apache Mahout
Getting ready
How to do it…
How it works…
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it…
How it works…
See also
Document classification using Mahout Naive Bayes Classifier
Getting ready
How to do it…
How it works…
See also
Index





Hadoop MapReduce v2 Cookbook Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: January 2013
Second edition: February 2015
Production reference: 1200215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-547-1
www.packtpub.com
Cover image by Jarek Blaminsky



Credits
Authors
Thilina Gunarathne
Srinath Perera
Reviewers
Skanda Bhargav
Randal Scott King
Dmitry Spikhalskiy
Jeroen van Wilgenburg
Shinichi Yamashita
Commissioning Editor
Edward Gordon
Acquisition Editors
Joanne Fitzpatrick
Content Development Editor
Shweta Pant
Technical Editors
Indrajit A. Das
Pankaj Kadam
Copy Editors
Puja Lalwani
Alfida Paiva
Laxmi Subramanian
Project Coordinator
Shipra Chawhan
Proofreaders
Bridget Braund
Maria Gould
Paul Hindle
Bernadette Watkins
Indexer

