Sams Teach Yourself: Big Data Analytics
with Microsoft HDInsight® in 24 Hours
Arshad Ali
Manpreet Singh
800 East 96th Street, Indianapolis, Indiana, 46240 USA
Copyright © 2016 by Pearson Education, Inc.
All rights reserved. No part of this book shall be reproduced, stored in a retrieval system,
or transmitted by any means, electronic, mechanical, photocopying, recording, or
otherwise, without written permission from the publisher. No patent liability is assumed
with respect to the use of the information contained herein. Although every precaution has
been taken in the preparation of this book, the publisher and author assume no
responsibility for errors or omissions. Nor is any liability assumed for damages resulting
from the use of the information contained herein.
ISBN-13: 978-0-672-33727-7
ISBN-10: 0-672-33727-4
Library of Congress Control Number: 2015914167
Printed in the United States of America
First Printing November 2015
Editor-in-Chief: Greg Wiegand
Acquisitions Editor: Joan Murray
Development Editor: Sondra Scott
Managing Editor: Sandra Schroeder
Senior Project Editor: Tonya Simpson
Copy Editor: Krista Hansing Editorial Services, Inc
Senior Indexer: Cheryl Lenser
Proofreader: Anne Goebel
Technical Editors: Shayne Burgess, Ron Abellera
Publishing Coordinator: Cindy Teeter
Cover Designer: Mark Shirar
Compositor: codeMantra
Trademarks
All terms mentioned in this book that are known to be trademarks or service marks have
been appropriately capitalized. Sams Publishing cannot attest to the accuracy of this
information. Use of a term in this book should not be regarded as affecting the validity of
any trademark or service mark.
HDInsight is a registered trademark of Microsoft Corporation.
Warning and Disclaimer
Every effort has been made to make this book as complete and as accurate as possible, but
no warranty or fitness is implied. The information provided is on an “as is” basis. The
authors and the publisher shall have neither liability nor responsibility to any person or
entity with respect to any loss or damages arising from the information contained in this
book.
Contents at a Glance
Introduction
Part I: Understanding Big Data, Hadoop 1.0, and 2.0
HOUR 1 Introduction of Big Data, NoSQL, and Business Value Proposition
HOUR 2 Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
HOUR 3 Hadoop Distributed File System Versions 1.0 and 2.0
HOUR 4 The MapReduce Job Framework and Job Execution Pipeline
HOUR 5 MapReduce—Advanced Concepts and YARN
Part II: Getting Started with HDInsight and Understanding Its Different Components
HOUR 6 Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning
HOUR 7 Exploring Typical Components of HDFS Cluster
HOUR 8 Storing Data in Microsoft Azure Storage Blob
HOUR 9 Working with Microsoft Azure HDInsight Emulator
Part III: Programming MapReduce and HDInsight Script Action
HOUR 10 Programming MapReduce Jobs
HOUR 11 Customizing the HDInsight Cluster with Script Action
Part IV: Querying and Processing Big Data in HDInsight
HOUR 12 Getting Started with Apache Hive and Apache Tez in HDInsight
HOUR 13 Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
HOUR 14 Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1
HOUR 15 Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2
HOUR 16 Integrating HDInsight with SQL Server Integration Services
HOUR 17 Using Pig for Data Processing
HOUR 18 Using Sqoop for Data Movement Between RDBMS and HDInsight
Part V: Managing Workflow and Performing Statistical Computing
HOUR 19 Using Oozie Workflows and Job Orchestration with HDInsight
HOUR 20 Performing Statistical Computing with R
Part VI: Performing Interactive Analytics and Machine Learning
HOUR 21 Performing Big Data Analytics with Spark
HOUR 22 Microsoft Azure Machine Learning
Part VII: Performing Real-time Analytics
HOUR 23 Performing Stream Analytics with Storm
HOUR 24 Introduction to Apache HBase on HDInsight
Index
Table of Contents
Introduction
Part I: Understanding Big Data, Hadoop 1.0, and 2.0
HOUR 1: Introduction of Big Data, NoSQL, and Business Value Proposition
  Types of Analysis
  Types of Data
  Big Data
  Managing Big Data
  NoSQL Systems
  Big Data, NoSQL Systems, and the Business Value Proposition
  Application of Big Data and Big Data Solutions
  Summary
  Q&A
HOUR 2: Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
  What Is Apache Hadoop?
  Architecture of Hadoop and Hadoop Ecosystems
  What’s New in Hadoop 2.0
  Architecture of Hadoop 2.0
  Tools and Technologies Needed with Big Data Analytics
  Major Players and Vendors for Hadoop
  Deployment Options for Microsoft Big Data Solutions
  Summary
  Q&A
HOUR 3: Hadoop Distributed File System Versions 1.0 and 2.0
  Introduction to HDFS
  HDFS Architecture
  Rack Awareness
  WebHDFS
  Accessing and Managing HDFS Data
  What’s New in HDFS 2.0
  Summary
  Q&A
HOUR 4: The MapReduce Job Framework and Job Execution Pipeline
  Introduction to MapReduce
  MapReduce Architecture
  MapReduce Job Execution Flow
  Summary
  Q&A
HOUR 5: MapReduce—Advanced Concepts and YARN
  Distributed Cache
  Hadoop Streaming
  MapReduce Joins
  Bloom Filter
  Performance Improvement
  Handling Failures
  Counter
  YARN
  Uber-Tasking Optimization
  Failures in YARN
  Resource Manager High Availability and Automatic Failover in YARN
  Summary
  Q&A
Part II: Getting Started with HDInsight and Understanding Its Different Components
HOUR 6: Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning
  Introduction to Microsoft Azure
  Understanding HDInsight Service
  Provisioning HDInsight on the Azure Management Portal
  Automating HDInsight Provisioning with PowerShell
  Managing and Monitoring HDInsight Cluster and Job Execution
  Summary
  Q&A
  Exercise
HOUR 7: Exploring Typical Components of HDFS Cluster
  HDFS Cluster Components
  HDInsight Cluster Architecture
  High Availability in HDInsight
  Summary
  Q&A
HOUR 8: Storing Data in Microsoft Azure Storage Blob
  Understanding Storage in Microsoft Azure
  Benefits of Azure Storage Blob over HDFS
  Azure Storage Explorer Tools
  Summary
  Q&A
HOUR 9: Working with Microsoft Azure HDInsight Emulator
  Getting Started with HDInsight Emulator
  Setting Up Microsoft Azure Emulator for Storage
  Summary
  Q&A
Part III: Programming MapReduce and HDInsight Script Action
HOUR 10: Programming MapReduce Jobs
  MapReduce Hello World!
  Analyzing Flight Delays with MapReduce
  Serialization Frameworks for Hadoop
  Hadoop Streaming
  Summary
  Q&A
HOUR 11: Customizing the HDInsight Cluster with Script Action
  Identifying the Need for Cluster Customization
  Developing Script Action
  Consuming Script Action
  Running a Giraph Job on a Customized HDInsight Cluster
  Testing Script Action with HDInsight Emulator
  Summary
  Q&A
Part IV: Querying and Processing Big Data in HDInsight
HOUR 12: Getting Started with Apache Hive and Apache Tez in HDInsight
  Introduction to Apache Hive
  Getting Started with Apache Hive in HDInsight
  Azure HDInsight Tools for Visual Studio
  Programmatically Using the HDInsight .NET SDK
  Introduction to Apache Tez
  Summary
  Q&A
  Exercise
HOUR 13: Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
  Programming with Hive in HDInsight
  Using Tables in Hive
  Serialization and Deserialization
  Data Load Processes for Hive Tables
  Querying Data from Hive Tables
  Indexing in Hive
  Apache Tez in Action
  Apache HCatalog
  Summary
  Q&A
  Exercise
HOUR 14: Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1
  Introduction to Hive ODBC Driver
  Introduction to Microsoft Power BI
  Accessing Hive Data from Microsoft Excel
  Summary
  Q&A
HOUR 15: Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2
  Accessing Hive Data from PowerPivot
  Accessing Hive Data from SQL Server
  Accessing HDInsight Data from Power Query
  Summary
  Q&A
  Exercise
HOUR 16: Integrating HDInsight with SQL Server Integration Services
  The Need for Data Movement
  Introduction to SSIS
  Analyzing On-time Flight Departure with SSIS
  Provisioning HDInsight Cluster
  Summary
  Q&A
HOUR 17: Using Pig for Data Processing
  Introduction to Pig Latin
  Using Pig to Count Cancelled Flights
  Using HCatalog in a Pig Latin Script
  Submitting Pig Jobs with PowerShell
  Summary
  Q&A
HOUR 18: Using Sqoop for Data Movement Between RDBMS and HDInsight
  What Is Sqoop?
  Using Sqoop Import and Export Commands
  Using Sqoop with PowerShell
  Summary
  Q&A
Part V: Managing Workflow and Performing Statistical Computing
HOUR 19: Using Oozie Workflows and Job Orchestration with HDInsight
  Introduction to Oozie
  Determining On-time Flight Departure Percentage with Oozie
  Submitting an Oozie Workflow with HDInsight .NET SDK
  Coordinating Workflows with Oozie
  Oozie Compared to SSIS
  Summary
  Q&A
HOUR 20: Performing Statistical Computing with R
  Introduction to R
  Integrating R with Hadoop
  Enabling R on HDInsight
  Summary
  Q&A
Part VI: Performing Interactive Analytics and Machine Learning
HOUR 21: Performing Big Data Analytics with Spark
  Introduction to Spark
  Spark Programming Model
  Blending SQL Querying with Functional Programs
  Summary
  Q&A
HOUR 22: Microsoft Azure Machine Learning
  History of Traditional Machine Learning
  Introduction to Azure ML
  Azure ML Workspace
  Processes to Build Azure ML Solutions
  Getting Started with Azure ML
  Creating Predictive Models with Azure ML
  Publishing Azure ML Models as Web Services
  Summary
  Q&A
  Exercise
Part VII: Performing Real-time Analytics
HOUR 23: Performing Stream Analytics with Storm
  Introduction to Storm
  Using SCP.NET to Develop Storm Solutions
  Analyzing Speed Limit Violation Incidents with Storm
  Summary
  Q&A
HOUR 24: Introduction to Apache HBase on HDInsight
  Introduction to Apache HBase
  HBase Architecture
  Creating HDInsight Cluster with HBase
  Summary
  Q&A
Index
About the Authors
Arshad Ali has more than 13 years of experience in the computer industry. As a
DB/DW/BI consultant in an end-to-end delivery role, he has been working on several
enterprise-scale data warehousing and analytics projects for enabling and developing
business intelligence and analytic solutions. He specializes in database, data warehousing,
and business intelligence/analytics application design, development, and deployment at
the enterprise level. He frequently works with SQL Server, Microsoft Analytics Platform
System (APS, formerly known as SQL Server Parallel Data Warehouse [PDW]),
HDInsight (Hadoop, Hive, Pig, HBase, and so on), SSIS, SSRS, SSAS, Service Broker,
MDS, DQS, SharePoint, and PPS. In the past, he has also handled performance
optimization for several projects, with significant performance gain.
Arshad is a Microsoft Certified Solutions Expert (MCSE) – SQL Server 2012 Data
Platform, and Microsoft Certified IT Professional (MCITP) in Microsoft SQL Server
2008 – Database Development, Data Administration, and Business Intelligence. He is also
certified on ITIL 2011 foundation.
He has worked in developing applications in VB, ASP, .NET, ASP.NET, and C#. He is a
Microsoft Certified Application Developer (MCAD) and Microsoft Certified Solution
Developer (MCSD) for the .NET platform in Web, Windows, and Enterprise.
Arshad has presented at several technical events and has written more than 200 articles
related to DB, DW, BI, and BA technologies, best practices, processes, and performance
optimization techniques on SQL Server, Hadoop, and related technologies. His articles
have been published on several prominent sites.
On the educational front, Arshad holds a Master in Computer Applications degree and a
Master in Business Administration in IT degree.
Manpreet Singh is a consultant and author with extensive expertise in architecture,
design, and implementation of business intelligence and Big Data analytics solutions. He
is passionate about enabling businesses to derive valuable insights from their data.
Manpreet has been working on Microsoft technologies for more than 8 years, with a
strong focus on Microsoft Business Intelligence Stack, SharePoint BI, and Microsoft’s Big
Data Analytics Platforms (Analytics Platform System and HDInsight). He also specializes
in Mobile Business Intelligence solution development and has helped businesses deliver a
consolidated view of their data to their mobile workforces.
Manpreet has coauthored books and technical articles on Microsoft technologies, focusing
on the development of data analytics and visualization solutions with the Microsoft BI
Stack and SharePoint. He holds a degree in computer science and engineering from Panjab
University, India.
Dedications
Arshad:
To my parents, the late Mrs. and Mr. Md Azal Hussain, who brought me into this beautiful
world and made me the person I am today. Although they couldn’t be here to see this day,
I am sure they must be proud, and all I can say is, “Thanks so much—I love you both.”
And to my beautiful wife, Shazia Arshad Ali, who motivated me to take up the challenge of
writing this book and who supported me throughout this journey.
And to my nephew, Gulfam Hussain, who has been very excited for me to be an author and
has been following up with me on its progress regularly and supporting me, where he
could, in completing this book.
Finally, I would like to dedicate this to my school teacher, Sankar Sarkar, who shaped my
career with his patience and perseverance and has been truly an inspirational source.
Manpreet:
To my parents, my wife, and my daughter. And to my grandfather,
Capt. Jagat Singh, who couldn’t be here to see this day.
Acknowledgments
This book would not have been possible without support from some of our special friends.
First and foremost, we would like to thank Yaswant Vishwakarma, Vijay Korapadi,
Avadhut Kulkarni, Kuldeep Chauhan, Rajeev Gupta, Vivek Adholia, and many others who
have been inspirations and supported us in writing this book, directly or indirectly. Thanks
a lot, guys—we are truly indebted to you all for all your support and the opportunity you
have given us to learn and grow.
We also would like to thank the entire Pearson team, especially Mark Renfrow and Joan
Murray, for taking our proposal from dream to reality. Thanks also to Shayne Burgess and
Ron Abellera for reading the entire draft of the book and providing very helpful feedback
and suggestions.
Thanks once again—you all rock!
Arshad
Manpreet
Introduction
“The information that’s stored in our databases and spreadsheets cannot speak for itself. It
has important stories to tell and only we can give them a voice.” —Stephen Few
Hello, and welcome to the world of Big Data! We are your authors, Arshad Ali and
Manpreet Singh. For us, it’s a good sign that you’re actually reading this introduction (so
few readers of tech books do, in our experience). Perhaps your first question is, “What’s
in it for me?” We are here to give you those details with minimal fuss.
Never has there been a more exciting time in the world of data. We are seeing the
convergence of significant trends that are fundamentally transforming the industry and
ushering in a new era of technological innovation in areas such as social, mobility,
advanced analytics, and machine learning. We are witnessing an explosion of data, with an
entirely new scale and scope to gain insights from. Recent estimates say that the total
amount of digital information in the world is growing tenfold every 5 years. Eighty-five
percent of this data is coming from new data sources (connected devices, sensors,
RFIDs, web blogs, clickstreams, and so on), and up to 80 percent of this data is
unstructured. This presents a huge opportunity for an organization: to tap into this new
data to identify new opportunities and areas for innovation.
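To put the growth estimate quoted above (a tenfold increase every five years) into yearly terms, the short Python sketch below converts it into an implied annual rate. It assumes smooth compound growth, which the estimate itself does not state, and the names annual_factor and projected_volume are purely illustrative.

```python
# Implied yearly growth if data volume multiplies tenfold every five years.
# Assumption: growth compounds smoothly from year to year.
growth_factor_5y = 10.0  # the "10 times every 5 years" estimate
years = 5

# Yearly multiplier: the fifth root of 10.
annual_factor = growth_factor_5y ** (1 / years)
annual_rate_pct = (annual_factor - 1) * 100


def projected_volume(start, elapsed_years):
    """Return the volume after elapsed_years of compound growth."""
    return start * annual_factor ** elapsed_years


print(f"implied annual growth: about {annual_rate_pct:.1f} percent")
print(f"after 10 years: about {projected_volume(1.0, 10):.0f} times the starting volume")
```

Under that assumption, the estimate works out to roughly 58 percent growth per year, or a hundredfold increase over a decade.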
To store and get insight into this humongous volume of different varieties of data, known
as Big Data, an organization needs tools and technologies. Chief among these is Hadoop,
for processing and analyzing this ambient data born outside the traditional data processing
platform. Hadoop is the open source implementation of the MapReduce parallel
computational engine and environment, and it’s used quite widely in processing streams of
data that go well beyond even the largest enterprise data sets in size. Whether it’s sensor,
clickstream, social media, telemetry, location-based, or other data that is generated and
collected in large volumes, Hadoop is often on the scene to process and analyze it.
Analytics has been in use (mostly with organizations’ internal data) for several years now,
but its use with Big Data is yielding tremendous opportunities. Organizations can now
leverage data available externally in different formats to identify new opportunities and
areas of innovation by analyzing patterns, customer responses or behavior, market trends,
competitors’ take, research data from governments or organizations, and more. This
provides an opportunity to not only look back on the past, but also look forward to
understand what might happen in the future, using predictive analytics.
In this book, we examine what constitutes Big Data and demonstrate how organizations
can tap into Big Data using Hadoop. We look at some important tools and technologies in
the Hadoop ecosystem and, more important, check out Microsoft’s partnership with
Hortonworks/Cloudera. The Hadoop distribution for the Windows platform or on the
Microsoft Azure Platform (cloud computing) is an enterprise-ready solution and can be
integrated easily with Microsoft SQL Server, Microsoft Active Directory, and System
Center. This makes it dramatically simpler, easier, more efficient, and more cost-effective
for your organization to capitalize on the opportunity Big Data brings to your business.
Through deep integration with Microsoft Business Intelligence tools (PowerPivot and
Power View) and EDW tools (SQL Server and SQL Server Parallel Data Warehouse),
Microsoft’s Big Data solution also offers customers deep insights into their structured and
unstructured data with the tools they use every day.
This book primarily focuses on the Hadoop (Hadoop 1.* and Hadoop 2.*) distribution for
Azure, Microsoft HDInsight. It provides several advantages over running a Hadoop cluster
on your local infrastructure. In terms of programming MapReduce jobs or Hive or Pig
queries, you will see no differences; the same program will run flawlessly on either of
these two Hadoop distributions (or even on other distributions), or with minimal changes
if you are using cloud platform-specific features. Moreover, integrating Hadoop and cloud
computing significantly lessens the total cost of ownership and delivers quick and easy setup
for the Hadoop cluster. (We demonstrate how to set up a Hadoop cluster on Microsoft
Azure in Hour 6, “Getting Started with HDInsight, Provisioning Your HDInsight Service
Cluster, and Automating HDInsight Cluster Provisioning.”)
Consider some forecasts from notable research analysts or research organizations:
“Big Data is a Big Priority for Customers—49% of top CEOs and CIOs are currently
using Big Data for customer analytics.” —McKinsey & Company, McKinsey Global
Survey Results, Minding Your Digital Business, 2012
“By 2015, 4.4 million IT jobs globally will be created to support Big Data, generating 1.9
million IT jobs in the United States. Only one third of skill sets will be available by that
time.” —Peter Sondergaard, Senior Vice President at Gartner and Global Head of Research
“By 2015, businesses (organizations that are able to take advantage of Big Data) that build
a modern information management system will outperform their peers financially by 20
percent.” —Gartner, Mark Beyer, Information Management in the 21st Century
“By 2020, the amount of digital data produced will exceed 40 zettabytes, which is the
equivalent of 5,200 GB of data for every man, woman, and child on Earth.” —Digital
Universe study
IDC has published an analysis predicting that the market for Big Data will grow to over
$19 billion by 2015. This includes growth in partner services to $6.5 billion in 2015 and
growth in software to $4.6 billion in 2015. This represents 39 percent and 34 percent
compound annual growth rates, respectively.
We hope you enjoy reading this book and gain an understanding of and expertise on Big
Data and Big Data analytics. We especially hope you learn how to leverage Microsoft
HDInsight to exploit its enormous opportunities to take your organization way ahead of
your competitors.
We would love to hear your feedback or suggestions for improvement. Feel free to share
with us (Arshad Ali and Manpreet Singh) so that we can incorporate it into the next release.
Welcome to the world of Big Data and Big Data analytics with Microsoft HDInsight!
Who Should Read This Book
What do you hope to get out of this book? As we wrote this book, we had the following
audiences in mind:
Developers—Developers (especially business intelligence developers) worldwide
are seeing a growing need for practical, step-by-step instruction in processing Big
Data and performing advanced analytics to extract actionable insights. This book
was designed to meet that need. It starts at the ground level and builds from there, to
make you an expert. Here you’ll learn how to build the next generation of apps that
include such capabilities.
Data scientists—As a data scientist, you are already familiar with the processes of
acquiring, transforming, and integrating data into your work and performing
advanced analytics. This book introduces you to modern tools and technologies
(ones that are prominent, inexpensive, flexible, and open source friendly) that you
can apply while acquiring, transforming, and integrating Big Data and performing
advanced analytics.
By the time you complete this book, you’ll be quite comfortable with the latest tools
and technologies.
Business decision makers—Business decision makers around the world, from
many different organizations, are looking to unlock the value of data to gain
actionable insights that enable their businesses to stay ahead of competitors. This
book delves into advanced analytics applications and case studies based on Big Data
tools and technologies, to accelerate your business goals.
Students aspiring to be Big Data analysts—As you are getting ready to transition
from the academic to the corporate world, this book helps you build a foundational
skill set to ace your interviews and successfully deliver Big Data projects in a timely
manner. Chapters were designed to start at the ground level and gradually take you
to an expert level.
Don’t worry if you don’t fit into any of these classifications. Set your sights on learning as
much as you can and having fun in the process, and you’ll do fine!
How This Book Is Organized
This book begins with the premise that you can learn what Big Data is, including the
real-life applications of Big Data and the prominent tools and technologies to use Big Data
solutions to quickly tap into opportunity, by studying the material in 24 1-hour sessions.
You might use your lunch break as your training hour, or you might study for an hour
before you go to bed at night.
Whatever schedule you adopt, these are the hour-by-hour details on how we structured the
content:
Hour 1, “Introduction of Big Data, NoSQL, and Business Value Proposition,”
introduces you to the world of Big Data and explains how an organization that
leverages the power of Big Data analytics can both remain competitive and beat out
its competitors. It explains Big Data in detail, along with its characteristics and the
types of analysis (descriptive, predictive, and prescriptive) an organization does with
Big Data. Finally, it sets out the business value proposition of using Big Data
solutions, along with some real-life examples of Big Data solutions.
This hour also summarizes the NoSQL technologies used to manage and process Big
Data and explains how NoSQL systems differ from traditional database systems
(RDBMS).
In Hour 2, “Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft
Offerings,” you look at managing Big Data with Apache Hadoop. This hour is
rooted in history: It shows how Hadoop evolved from infancy to Hadoop 1.0 and
then Hadoop 2.0, highlighting architectural changes from Hadoop 1.0 to Hadoop
2.0. This hour also focuses on understanding other software and components that
make up the Hadoop ecosystem and looks at the components needed in different
phases of Big Data analytics. Finally, it introduces you to Hadoop vendors, evaluates
their offerings, and analyzes Microsoft’s deployment options for Big Data solutions.
In Hour 3, “Hadoop Distributed File System Versions 1.0 and 2.0,” you learn about
HDFS, its architecture, and how data gets stored. You also look into the processes of
reading from HDFS and writing data to HDFS, as well as internal behavior to ensure
fault tolerance. At the end of the hour, you take a detailed look at HDFS 2.0, which
comes as a part of Hadoop 2.0, to see how it overcomes the limitations of Hadoop
1.0 and provides high-availability and scalability enhancements.
In Hour 4, “The MapReduce Job Framework and Job Execution Pipeline,” you
explore the MapReduce programming paradigm, its architecture, the components of
a MapReduce job, and MapReduce job execution flow.
Hour 5, “MapReduce—Advanced Concepts and YARN,” introduces you to
advanced concepts related to MapReduce (including Hadoop Streaming,
MapReduce joins, distributed caches, failures and how they are handled
transparently, and performance optimization for your MapReduce jobs).
In Hadoop 2.0, YARN ushers in a major architectural change and opens a new
window for scalability, performance, and multitenancy. In this hour, you learn about
the YARN architecture, its components, the YARN job execution pipeline, and how
failures are handled transparently.
In Hour 6, “Getting Started with HDInsight, Provisioning Your HDInsight Service
Cluster, and Automating HDInsight Cluster Provisioning,” you delve into the
HDInsight service. You also walk through a step-by-step process for quickly
provisioning HDInsight or a Hadoop cluster on Microsoft Azure, either interactively
using Azure Management Portal or automatically using PowerShell scripting.
In Hour 7, “Exploring Typical Components of HDFS Cluster,” you explore the
typical components of an HDFS cluster: the name node, secondary name node, and
data nodes. You also learn how HDInsight separates the storage from the cluster and
relies on Azure Storage Blob instead of HDFS as the default file system for storing
data. This hour provides more details on these concepts in the context of the
HDInsight service.
Hour 8, “Storing Data in Microsoft Azure Storage Blob,” shows you how HDInsight
supports both the Hadoop Distributed File System (HDFS) and Azure Storage Blob
for storing user data (although HDInsight relies on Azure Storage Blob as the default
file system instead of HDFS for storing data). This hour explores Azure Storage
Blob in the context of HDInsight and concludes by discussing the impact of blob
storage on performance and data locality.
Hour 9, “Working with Microsoft Azure HDInsight Emulator,” is devoted to
Microsoft’s HDInsight emulator. The HDInsight emulator emulates a single-node cluster
and is well suited to development scenarios and experimentation. This hour focuses
on setting up the HDInsight emulator and executing a MapReduce job to test its
functionality.
Hour 10, “Programming MapReduce Jobs,” expands on the content in earlier hours
and provides examples and techniques for programming MapReduce programs in
Java and C#. It presents a real-life scenario that analyzes flight delays with
MapReduce and concludes with a discussion on serialization options for Hadoop.
Hour 11, “Customizing the HDInsight Cluster with Script Action,” looks at the
HDInsight cluster that comes preinstalled with a number of frequently used
components. It also introduces customization options for the HDInsight cluster and
walks you through the process for installing additional Hadoop ecosystem projects
using a feature called Script Action. In addition, this hour illustrates the steps in
developing and deploying a Script Action.
In Hour 12, “Getting Started with Apache Hive and Apache Tez in HDInsight,” you
learn how you can use Apache Hive. You learn different ways of writing and
executing HiveQL queries in HDInsight and see how Apache Tez significantly
improves overall performance for HiveQL queries.
In Hour 13, “Programming with Apache Hive, Apache Tez in HDInsight, and
Apache HCatalog,” you extend your expertise on Apache Hive and see how you can
leverage it for ad hoc queries and data analysis. You also learn about some of the
important commands you will use in Apache Hive for data loading and querying. At
the end of this hour, you look at Apache HCatalog, which has merged with Apache
Hive, and see how to leverage the Apache Tez execution engine for Hive query
execution to improve the performance of your query.
Hour 14, “Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC
Driver: Part 1,” shows you how to use the Microsoft Hive ODBC driver to connect
and pull data from Hive tables from different Microsoft Business Intelligence
(MSBI) reporting tools, for further analysis and ad hoc reporting.
In Hour 15, “Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC
Driver: Part 2,” you learn to use PowerPivot to create a data model (define
relationships between tables, apply transformations, create calculations, and more)
based on Hive tables and then use Power View and Power Map to visualize the data
from different perspectives with intuitive and interactive visualization options.
In Hour 16, “Integrating HDInsight with SQL Server Integration Services,” you see
how you can use SQL Server Integration Services (SSIS) to build data integration
packages to transfer data between an HDInsight cluster and a relational database
management system (RDBMS) such as SQL Server.
Hour 17, “Using Pig for Data Processing,” explores Pig Latin, a workflow-style
procedural language that makes it easier to specify transformation operations on
data. This hour provides an introduction to Pig for processing Big Data sets and
illustrates the steps in submitting Pig jobs to the HDInsight cluster.
Hour 18, “Using Sqoop for Data Movement Between RDBMS and HDInsight,”
demonstrates how Sqoop facilitates data migration between relational databases and
Hadoop. This hour introduces you to the Sqoop connector for Hadoop and illustrates
its use in data migration between Hadoop and SQL Server/SQL Azure databases.
Hour 19, “Using Oozie Workflows and Job Orchestration with HDInsight,” looks at
data processing solutions that require multiple jobs chained together in a particular
sequence to accomplish a processing task in the form of a conditional workflow. In
this hour, you learn to use Oozie, a workflow development component within the
Hadoop ecosystem.
Hour 20, “Performing Statistical Computing with R,” focuses on the R language,
which is popular among data scientists for analytics and statistical computing. R was
not designed to work with Big Data because it typically works by pulling data that
persists elsewhere into memory. However, recent advancements have made it
possible to leverage R for Big Data analytics. This hour introduces R and looks at
the approaches for enabling R on Hadoop.
Hour 21, “Performing Big Data Analytics with Spark,” introduces Spark, briefly
explores the Spark programming model, and takes a look at Spark integration with
SQL.
In Hour 22, “Microsoft Azure Machine Learning,” you learn about an emerging
technology known as Microsoft Azure Machine Learning (Azure ML). Azure ML is
extremely simple to use and easy to implement so that analysts with various
backgrounds (even non-data scientists) can leverage it for predictive analytics.
In Hour 23, “Performing Stream Analytics with Storm,” you learn about Apache
Storm and explore its use in performing real-time stream analytics.
In Hour 24, “Introduction to Apache HBase on HDInsight,” you learn about Apache
HBase, when to use it, and how you can leverage it with the HDInsight service.
Conventions Used in This Book
In our experience as authors and trainers, we’ve found that many readers and students skip
over this part of the book. Congratulations for reading it! Doing so will pay big dividends
because you’ll understand how and why we formatted this book the way we did.
Try It Yourself
Throughout the book, you’ll find Try It Yourself exercises, which are opportunities for
you to apply what you’re learning right then and there. We believe in knowledge stacking, so
you can expect that later Try It Yourself exercises assume that you know how to do what
you did in previous exercises. Therefore, your best bet is to read each chapter in sequence
and work through every Try It Yourself exercise.