Big data analytics with microsoft HDInsight in 24 hours sams teach yourself

About This E-Book
EPUB is an open, industry-standard format for e-books. However, support for EPUB
and its many features varies across reading devices and applications. Use your device or
app settings to customize the presentation to your liking. Settings that you can customize
often include font, font size, single or double column, landscape or portrait mode, and
figures that you can click or tap to enlarge. For additional information about the settings
and features on your reading device or app, visit the device manufacturer’s Web site.
Many titles include programming code or configuration examples. To optimize the
presentation of these elements, view the e-book in single-column, landscape mode and
adjust the font size to the smallest setting. In addition to presenting code and
configurations in the reflowable text format, we have included images of the code that
mimic the presentation found in the print book; therefore, where the reflowable format
may compromise the presentation of the code listing, you will see a “Click here to view
code image” link. Click the link to view the print-fidelity code image. To return to the
previous page viewed, click the Back button on your device or app.


Sams Teach Yourself: Big Data Analytics
with Microsoft HDInsight® in 24 Hours
Arshad Ali
Manpreet Singh

800 East 96th Street, Indianapolis, Indiana, 46240 USA


Sams Teach Yourself Big Data Analytics with Microsoft HDInsight® in 24 Hours
Copyright © 2016 by Pearson Education, Inc.
All rights reserved. No part of this book shall be reproduced, stored in a retrieval system,
or transmitted by any means, electronic, mechanical, photocopying, recording, or
otherwise, without written permission from the publisher. No patent liability is assumed
with respect to the use of the information contained herein. Although every precaution has
been taken in the preparation of this book, the publisher and author assume no
responsibility for errors or omissions. Nor is any liability assumed for damages resulting
from the use of the information contained herein.
ISBN-13: 978-0-672-33727-7
ISBN-10: 0-672-33727-4
Library of Congress Control Number: 2015914167
Printed in the United States of America
First Printing November 2015
Editor-in-Chief
Greg Wiegand
Acquisitions Editor
Joan Murray
Development Editor
Sondra Scott
Managing Editor
Sandra Schroeder
Senior Project Editor
Tonya Simpson
Copy Editor
Krista Hansing
Editorial Services, Inc.
Senior Indexer
Cheryl Lenser
Proofreader
Anne Goebel
Technical Editors
Shayne Burgess
Ron Abellera
Publishing Coordinator
Cindy Teeter
Cover Designer
Mark Shirar
Compositor
codeMantra
Trademarks
All terms mentioned in this book that are known to be trademarks or service marks have
been appropriately capitalized. Sams Publishing cannot attest to the accuracy of this
information. Use of a term in this book should not be regarded as affecting the validity of
any trademark or service mark.
HDInsight is a registered trademark of Microsoft Corporation.
Warning and Disclaimer
Every effort has been made to make this book as complete and as accurate as possible, but
no warranty or fitness is implied. The information provided is on an “as is” basis. The
authors and the publisher shall have neither liability nor responsibility to any person or
entity with respect to any loss or damages arising from the information contained in this
book.
Special Sales
For information about buying this title in bulk quantities, or for special sales opportunities
(which may include electronic versions; custom cover designs; and content particular to
your business, training goals, marketing focus, or branding interests), please contact our
corporate sales department at or (800) 382-3419.
For government sales inquiries, please contact

For questions about sales outside the U.S., please contact




Contents at a Glance
Introduction
Part I: Understanding Big Data, Hadoop 1.0, and 2.0
HOUR 1 Introduction of Big Data, NoSQL, and Business Value Proposition
2 Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
3 Hadoop Distributed File System Versions 1.0 and 2.0
4 The MapReduce Job Framework and Job Execution Pipeline
5 MapReduce—Advanced Concepts and YARN
Part II: Getting Started with HDInsight and Understanding Its Different Components
HOUR 6 Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster,
and Automating HDInsight Cluster Provisioning
7 Exploring Typical Components of HDFS Cluster
8 Storing Data in Microsoft Azure Storage Blob
9 Working with Microsoft Azure HDInsight Emulator
Part III: Programming MapReduce and HDInsight Script Action
HOUR 10 Programming MapReduce Jobs
11 Customizing the HDInsight Cluster with Script Action
Part IV: Querying and Processing Big Data in HDInsight
HOUR 12 Getting Started with Apache Hive and Apache Tez in HDInsight
13 Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
14 Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1
15 Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2
16 Integrating HDInsight with SQL Server Integration Services
17 Using Pig for Data Processing
18 Using Sqoop for Data Movement Between RDBMS and HDInsight
Part V: Managing Workflow and Performing Statistical Computing
HOUR 19 Using Oozie Workflows and Job Orchestration with HDInsight
20 Performing Statistical Computing with R
Part VI: Performing Interactive Analytics and Machine Learning
HOUR 21 Performing Big Data Analytics with Spark
22 Microsoft Azure Machine Learning
Part VII: Performing Real-time Analytics
HOUR 23 Performing Stream Analytics with Storm
24 Introduction to Apache HBase on HDInsight
Index


Table of Contents
Introduction
Part I: Understanding Big Data, Hadoop 1.0, and 2.0
HOUR 1: Introduction of Big Data, NoSQL, and Business Value Proposition
Types of Analysis
Types of Data
Big Data
Managing Big Data
NoSQL Systems
Big Data, NoSQL Systems, and the Business Value Proposition
Application of Big Data and Big Data Solutions
Summary
Q&A
HOUR 2: Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
What Is Apache Hadoop?
Architecture of Hadoop and Hadoop Ecosystems
What’s New in Hadoop 2.0
Architecture of Hadoop 2.0
Tools and Technologies Needed with Big Data Analytics
Major Players and Vendors for Hadoop
Deployment Options for Microsoft Big Data Solutions
Summary
Q&A
HOUR 3: Hadoop Distributed File System Versions 1.0 and 2.0
Introduction to HDFS
HDFS Architecture
Rack Awareness
WebHDFS
Accessing and Managing HDFS Data
What’s New in HDFS 2.0
Summary
Q&A
HOUR 4: The MapReduce Job Framework and Job Execution Pipeline
Introduction to MapReduce
MapReduce Architecture
MapReduce Job Execution Flow
Summary
Q&A
HOUR 5: MapReduce—Advanced Concepts and YARN
Distributed Cache
Hadoop Streaming
MapReduce Joins
Bloom Filter
Performance Improvement
Handling Failures
Counter
YARN
Uber-Tasking Optimization
Failures in YARN
Resource Manager High Availability and Automatic Failover in YARN
Summary
Q&A
Part II: Getting Started with HDInsight and Understanding Its Different Components
HOUR 6: Getting Started with HDInsight, Provisioning Your HDInsight Service
Cluster, and Automating HDInsight Cluster Provisioning
Introduction to Microsoft Azure
Understanding HDInsight Service
Provisioning HDInsight on the Azure Management Portal
Automating HDInsight Provisioning with PowerShell
Managing and Monitoring HDInsight Cluster and Job Execution
Summary
Q&A
Exercise
HOUR 7: Exploring Typical Components of HDFS Cluster
HDFS Cluster Components
HDInsight Cluster Architecture
High Availability in HDInsight
Summary
Q&A
HOUR 8: Storing Data in Microsoft Azure Storage Blob
Understanding Storage in Microsoft Azure
Benefits of Azure Storage Blob over HDFS
Azure Storage Explorer Tools
Summary
Q&A
HOUR 9: Working with Microsoft Azure HDInsight Emulator
Getting Started with HDInsight Emulator
Setting Up Microsoft Azure Emulator for Storage
Summary
Q&A
Part III: Programming MapReduce and HDInsight Script Action
HOUR 10: Programming MapReduce Jobs
MapReduce Hello World!
Analyzing Flight Delays with MapReduce
Serialization Frameworks for Hadoop
Hadoop Streaming
Summary
Q&A
HOUR 11: Customizing the HDInsight Cluster with Script Action
Identifying the Need for Cluster Customization
Developing Script Action
Consuming Script Action
Running a Giraph Job on a Customized HDInsight Cluster
Testing Script Action with HDInsight Emulator
Summary
Q&A
Part IV: Querying and Processing Big Data in HDInsight
HOUR 12: Getting Started with Apache Hive and Apache Tez in HDInsight
Introduction to Apache Hive
Getting Started with Apache Hive in HDInsight
Azure HDInsight Tools for Visual Studio
Programmatically Using the HDInsight .NET SDK
Introduction to Apache Tez
Summary
Q&A
Exercise
HOUR 13: Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
Programming with Hive in HDInsight
Using Tables in Hive
Serialization and Deserialization
Data Load Processes for Hive Tables
Querying Data from Hive Tables
Indexing in Hive
Apache Tez in Action
Apache HCatalog
Summary
Q&A
Exercise
HOUR 14: Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC
Driver: Part 1
Introduction to Hive ODBC Driver
Introduction to Microsoft Power BI
Accessing Hive Data from Microsoft Excel
Summary
Q&A
HOUR 15: Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC
Driver: Part 2
Accessing Hive Data from PowerPivot
Accessing Hive Data from SQL Server
Accessing HDInsight Data from Power Query
Summary
Q&A
Exercise
HOUR 16: Integrating HDInsight with SQL Server Integration Services
The Need for Data Movement
Introduction to SSIS
Analyzing On-time Flight Departure with SSIS
Provisioning HDInsight Cluster
Summary
Q&A
HOUR 17: Using Pig for Data Processing
Introduction to Pig Latin
Using Pig to Count Cancelled Flights
Using HCatalog in a Pig Latin Script
Submitting Pig Jobs with PowerShell
Summary
Q&A
HOUR 18: Using Sqoop for Data Movement Between RDBMS and HDInsight
What Is Sqoop?
Using Sqoop Import and Export Commands
Using Sqoop with PowerShell
Summary
Q&A
Part V: Managing Workflow and Performing Statistical Computing
HOUR 19: Using Oozie Workflows and Job Orchestration with HDInsight
Introduction to Oozie
Determining On-time Flight Departure Percentage with Oozie
Submitting an Oozie Workflow with HDInsight .NET SDK
Coordinating Workflows with Oozie
Oozie Compared to SSIS
Summary
Q&A
HOUR 20: Performing Statistical Computing with R
Introduction to R
Integrating R with Hadoop
Enabling R on HDInsight
Summary
Q&A
Part VI: Performing Interactive Analytics and Machine Learning
HOUR 21: Performing Big Data Analytics with Spark
Introduction to Spark
Spark Programming Model
Blending SQL Querying with Functional Programs
Summary
Q&A
HOUR 22: Microsoft Azure Machine Learning
History of Traditional Machine Learning
Introduction to Azure ML
Azure ML Workspace
Processes to Build Azure ML Solutions
Getting Started with Azure ML
Creating Predictive Models with Azure ML
Publishing Azure ML Models as Web Services
Summary
Q&A
Exercise
Part VII: Performing Real-time Analytics
HOUR 23: Performing Stream Analytics with Storm
Introduction to Storm
Using SCP.NET to Develop Storm Solutions
Analyzing Speed Limit Violation Incidents with Storm
Summary
Q&A
HOUR 24: Introduction to Apache HBase on HDInsight
Introduction to Apache HBase
HBase Architecture
Creating HDInsight Cluster with HBase
Summary
Q&A
Index


About the Authors
Arshad Ali has more than 13 years of experience in the computer industry. As a
DB/DW/BI consultant in an end-to-end delivery role, he has been working on several
enterprise-scale data warehousing and analytics projects for enabling and developing
business intelligence and analytic solutions. He specializes in database, data warehousing,
and business intelligence/analytics application design, development, and deployment at
the enterprise level. He frequently works with SQL Server, Microsoft Analytics Platform
System (APS, formerly known as SQL Server Parallel Data Warehouse [PDW]),
HDInsight (Hadoop, Hive, Pig, HBase, and so on), SSIS, SSRS, SSAS, Service Broker,
MDS, DQS, SharePoint, and PPS. In the past, he has also handled performance
optimization for several projects, with significant performance gain.
Arshad is a Microsoft Certified Solutions Expert (MCSE)–SQL Server 2012 Data
Platform, and Microsoft Certified IT Professional (MCITP) in Microsoft SQL Server
2008–Database Development, Data Administration, and Business Intelligence. He is also
certified on ITIL 2011 foundation.
He has worked in developing applications in VB, ASP, .NET, ASP.NET, and C#. He is a
Microsoft Certified Application Developer (MCAD) and Microsoft Certified Solution
Developer (MCSD) for the .NET platform in Web, Windows, and Enterprise.
Arshad has presented at several technical events and has written more than 200 articles
related to DB, DW, BI, and BA technologies, best practices, processes, and performance
optimization techniques on SQL Server, Hadoop, and related technologies. His articles
have been published on several prominent sites.
On the educational front, Arshad holds a Master in Computer Applications degree and a
Master in Business Administration in IT degree.
Arshad can be reached at, or visit to
connect with him.
Manpreet Singh is a consultant and author with extensive expertise in architecture,
design, and implementation of business intelligence and Big Data analytics solutions. He
is passionate about enabling businesses to derive valuable insights from their data.
Manpreet has been working on Microsoft technologies for more than 8 years, with a
strong focus on Microsoft Business Intelligence Stack, SharePoint BI, and Microsoft’s Big
Data Analytics Platforms (Analytics Platform System and HDInsight). He also specializes
in Mobile Business Intelligence solution development and has helped businesses deliver a
consolidated view of their data to their mobile workforces.
Manpreet has coauthored books and technical articles on Microsoft technologies, focusing
on the development of data analytics and visualization solutions with the Microsoft BI
Stack and SharePoint. He holds a degree in computer science and engineering from Panjab
University, India.
Manpreet can be reached at



Dedications
Arshad:
To my parents, the late Mrs. and Mr. Md Azal Hussain, who brought me into this beautiful
world and made me the person I am today. Although they couldn’t be here to see this day,
I am sure they must be proud, and all I can say is, “Thanks so much—I love you both.”
And to my beautiful wife, Shazia Arshad Ali, who motivated me to take up the challenge of
writing this book and who supported me throughout this journey.
And to my nephew, Gulfam Hussain, who has been very excited for me to be an author and
has been following up with me on its progress regularly and supporting me, where he
could, in completing this book.
Finally, I would like to dedicate this to my school teacher, Sankar Sarkar, who shaped my
career with his patience and perseverance and has been truly an inspirational source.
Manpreet:
To my parents, my wife, and my daughter. And to my grandfather,
Capt. Jagat Singh, who couldn’t be here to see this day.


Acknowledgments
This book would not have been possible without support from some of our special friends.
First and foremost, we would like to thank Yaswant Vishwakarma, Vijay Korapadi,
Avadhut Kulkarni, Kuldeep Chauhan, Rajeev Gupta, Vivek Adholia, and many others who
have been inspirations and supported us in writing this book, directly or indirectly. Thanks
a lot, guys—we are truly indebted to you all for all your support and the opportunity you
have given us to learn and grow.
We also would like to thank the entire Pearson team, especially Mark Renfrow and Joan
Murray, for taking our proposal from dream to reality. Thanks also to Shayne Burgess and
Ron Abellera for reading the entire draft of the book and providing very helpful feedback
and suggestions.
Thanks once again—you all rock!
Arshad
Manpreet


We Want to Hear from You!
As the reader of this book, you are our most important critic and commentator. We value
your opinion and want to know what we’re doing right, what we could do better, what
areas you’d like to see us publish in, and any other words of wisdom you’re willing to
pass our way.
We welcome your comments. You can email or write to let us know what you did or didn’t
like about this book—as well as what we can do to make our books better.
Please note that we cannot help you with technical problems related to the topic of this
book.
When you write, please be sure to include this book’s title and authors as well as your
name and email address. We will carefully review your comments and share them with the
authors and editors who worked on the book.
Email:
Mail: Sams Publishing
ATTN: Reader Feedback
800 East 96th Street
Indianapolis, IN 46240 USA


Reader Services
Visit our website and register this book at informit.com/register for convenient access to
any updates, downloads, or errata that might be available for this book.


Introduction
“The information that’s stored in our databases and spreadsheets cannot speak for itself. It
has important stories to tell and only we can give them a voice.” —Stephen Few
Hello, and welcome to the world of Big Data! We are your authors, Arshad Ali and
Manpreet Singh. For us, it’s a good sign that you’re actually reading this introduction (so
few readers of tech books do, in our experience). Perhaps your first question is, “What’s
in it for me?” We are here to give you those details with minimal fuss.
Never has there been a more exciting time in the world of data. We are seeing the
convergence of significant trends that are fundamentally transforming the industry and
ushering in a new era of technological innovation in areas such as social, mobility,
advanced analytics, and machine learning. We are witnessing an explosion of data, with an
entirely new scale and scope to gain insights from. Recent estimates say that the total
amount of digital information in the world is increasing 10 times every 5 years. Eighty-five
percent of this data is coming from new data sources (connected devices, sensors,
RFIDs, web blogs, clickstreams, and so on), and up to 80 percent of this data is
unstructured. This presents a huge opportunity for an organization: to tap into this new
data to identify new opportunities and areas for innovation.
To store and get insight into this humongous volume of different varieties of data, known
as Big Data, an organization needs tools and technologies. Chief among these is Hadoop,
for processing and analyzing this ambient data born outside the traditional data processing
platform. Hadoop is the open source implementation of the MapReduce parallel
computational engine and environment, and it’s used quite widely in processing streams of
data that go well beyond even the largest enterprise data sets in size. Whether it’s sensor,
clickstream, social media, telemetry, location based, or other data that is generated and
collected in large volumes, Hadoop is often on the scene to process and analyze it.
Analytics has been in use (mostly with organizations’ internal data) for several years now,
but its use with Big Data is yielding tremendous opportunities. Organizations can now
leverage data available externally in different formats, to identify new opportunities and
areas of innovation by analyzing patterns, customer responses or behavior, market trends,
competitors’ take, research data from governments or organizations, and more. This
provides an opportunity to not only look back on the past, but also look forward to
understand what might happen in the future, using predictive analytics.
In this book, we examine what constitutes Big Data and demonstrate how organizations
can tap into Big Data using Hadoop. We look at some important tools and technologies in
the Hadoop ecosystem and, more important, check out Microsoft’s partnership with
Hortonworks/Cloudera. The Hadoop distribution for the Windows platform or on the
Microsoft Azure Platform (cloud computing) is an enterprise-ready solution and can be
integrated easily with Microsoft SQL Server, Microsoft Active Directory, and System
Center. This makes it dramatically simpler, easier, more efficient, and more cost-effective
for your organization to capitalize on the opportunity Big Data brings to your business.
Through deep integration with Microsoft Business Intelligence tools (PowerPivot and
Power View) and EDW tools (SQL Server and SQL Server Parallel Data Warehouse),
Microsoft’s Big Data solution also offers customers deep insights into their structured and
unstructured data with the tools they use every day.
This book primarily focuses on the Hadoop (Hadoop 1.* and Hadoop 2.*) distribution for
Azure, Microsoft HDInsight. It provides several advantages over running a Hadoop cluster
on your local infrastructure. In terms of programming MapReduce jobs or Hive or Pig
queries, you will see no differences; the same program will run flawlessly on either of
these two Hadoop distributions (or even on other distributions), or with minimal changes,
if you are using cloud platform-specific features. Moreover, integrating Hadoop and cloud
computing significantly lessens the total cost of ownership and delivers quick and easy setup
for the Hadoop cluster. (We demonstrate how to set up a Hadoop cluster on Microsoft
Azure in Hour 6, “Getting Started with HDInsight, Provisioning Your HDInsight Service
Cluster, and Automating HDInsight Cluster Provisioning.”)
Consider some forecasts from notable research analysts or research organizations:
“Big Data is a Big Priority for Customers—49% of top CEOs and CIOs are currently
using Big Data for customer analytics.” —McKinsey & Company, McKinsey Global
Survey Results, Minding Your Digital Business, 2012
“By 2015, 4.4 million IT jobs globally will be created to support Big Data, generating 1.9
million IT jobs in the United States. Only one third of skill sets will be available by that
time.” —Peter Sondergaard, Senior Vice President at Gartner and Global Head of Research
“By 2015, businesses (organizations that are able to take advantage of Big Data) that build
a modern information management system will outperform their peers financially by 20
percent.” —Gartner, Mark Beyer, Information Management in the 21st Century
“By 2020, the amount of digital data produced will exceed 40 zettabytes, which is the
equivalent of 5,200 GB of data for every man, woman, and child on Earth.” —Digital
Universe study
IDC has published an analysis predicting that the market for Big Data will grow to over
$19 billion by 2015. This includes growth in partner services to $6.5 billion in 2015 and
growth in software to $4.6 billion in 2015. This represents 39 percent and 34 percent
compound annual growth rates, respectively.
We hope you enjoy reading this book and gain an understanding of and expertise on Big
Data and Big Data analytics. We especially hope you learn how to leverage Microsoft
HDInsight to exploit its enormous opportunities to take your organization way ahead of
your competitors.
We would love to hear your feedback or suggestions for improvement. Feel free to share
with us (Arshad Ali, , and Manpreet Singh,
) so that we can incorporate it into the next release.
Welcome to the world of Big Data and Big Data analytics with Microsoft HDInsight!

Who Should Read This Book
What do you hope to get out of this book? As we wrote this book, we had the following
audiences in mind:
Developers—Developers (especially business intelligence developers) worldwide
are seeing a growing need for practical, step-by-step instruction in processing Big
Data and performing advanced analytics to extract actionable insights. This book
was designed to meet that need. It starts at the ground level and builds from there, to
make you an expert. Here you’ll learn how to build the next generation of apps that
include such capabilities.
Data scientists—As a data scientist, you are already familiar with the processes of
acquiring, transforming, and integrating data into your work and performing
advanced analytics. This book introduces you to modern tools and technologies
(ones that are prominent, inexpensive, flexible, and open source friendly) that you
can apply while acquiring, transforming, and integrating Big Data and performing
advanced analytics.
By the time you complete this book, you’ll be quite comfortable with the latest tools
and technologies.
Business decision makers—Business decision makers around the world, from
many different organizations, are looking to unlock the value of data to gain
actionable insights that enable their businesses to stay ahead of competitors. This
book delves into advanced analytics applications and case studies based on Big Data
tools and technologies, to accelerate your business goals.
Students aspiring to be Big Data analysts—As you are getting ready to transition
from the academic to the corporate world, this book helps you build a foundational
skill set to ace your interviews and successfully deliver Big Data projects in a timely
manner. Chapters were designed to start at the ground level and gradually take you
to an expert level.
Don’t worry if you don’t fit into any of these classifications. Set your sights on learning as
much as you can and having fun in the process, and you’ll do fine!

How This Book Is Organized
This book begins with the premise that you can learn what Big Data is, including the real-life
applications of Big Data and the prominent tools and technologies to use Big Data
solutions to quickly tap into opportunity, by studying the material in 24 one-hour sessions.
You might use your lunch break as your training hour, or you might study for an hour
before you go to bed at night.
Whatever schedule you adopt, these are the hour-by-hour details on how we structured the
content:
Hour 1, “Introduction of Big Data, NoSQL, and Business Value Proposition,”
introduces you to the world of Big Data and explains how an organization that
leverages the power of Big Data analytics can both remain competitive and beat out
its competitors. It explains Big Data in detail, along with its characteristics and the
types of analysis (descriptive, predictive, and prescriptive) an organization does with
Big Data. Finally, it sets out the business value proposition of using Big Data
solutions, along with some real-life examples of Big Data solutions.
This hour also summarizes the NoSQL technologies used to manage and process Big
Data and explains how NoSQL systems differ from traditional database systems
(RDBMS).
In Hour 2, “Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft
Offerings,” you look at managing Big Data with Apache Hadoop. This hour is
rooted in history: It shows how Hadoop evolved from infancy to Hadoop 1.0 and
then Hadoop 2.0, highlighting architectural changes from Hadoop 1.0 to Hadoop
2.0. This hour also focuses on understanding other software and components that
make up the Hadoop ecosystem and looks at the components needed in different
phases of Big Data analytics. Finally, it introduces you to Hadoop vendors, evaluates
their offerings, and analyzes Microsoft’s deployment options for Big Data solutions.
In Hour 3, “Hadoop Distributed File System Versions 1.0 and 2.0,” you learn about
HDFS, its architecture, and how data gets stored. You also look into the processes of
reading from HDFS and writing data to HDFS, as well as internal behavior to ensure
fault tolerance. At the end of the hour, you take a detailed look at HDFS 2.0, which
comes as a part of Hadoop 2.0, to see how it overcomes the limitations of Hadoop
1.0 and provides high-availability and scalability enhancements.
In Hour 4, “The MapReduce Job Framework and Job Execution Pipeline,” you
explore the MapReduce programming paradigm, its architecture, the components of
a MapReduce job, and MapReduce job execution flow.
Hour 5, “MapReduce—Advanced Concepts and YARN,” introduces you to
advanced concepts related to MapReduce (including MapReduce Streaming,
MapReduce joins, distributed caches, failures and how they are handled
transparently, and performance optimization for your MapReduce jobs).
In Hadoop 2.0, YARN ushers in a major architectural change and opens a new
window for scalability, performance, and multitenancy. In this hour, you learn about
the YARN architecture, its components, the YARN job execution pipeline, and how
failures are handled transparently.
In Hour 6, “Getting Started with HDInsight, Provisioning Your HDInsight Service
Cluster, and Automating HDInsight Cluster Provisioning,” you delve into the
HDInsight service. You also walk through a step-by-step process for quickly
provisioning HDInsight or a Hadoop cluster on Microsoft Azure, either interactively
using Azure Management Portal or automatically using PowerShell scripting.
In Hour 7, “Exploring Typical Components of HDFS Cluster,” you explore the
typical components of an HDFS cluster: the name node, secondary name node, and
data nodes. You also learn how HDInsight separates the storage from the cluster and
relies on Azure Storage Blob instead of HDFS as the default file system for storing
data. This hour provides more details on these concepts in the context of the
HDInsight service.
Hour 8, “Storing Data in Microsoft Azure Storage Blob,” shows you how HDInsight
supports both the Hadoop Distributed File System (HDFS) and Azure Storage Blob
for storing user data (although HDInsight relies on Azure Storage Blob as the default
file system instead of HDFS for storing data). This hour explores Azure Storage
Blob in the context of HDInsight and concludes by discussing the impact of blob
storage on performance and data locality.
Hour 9, “Working with Microsoft Azure HDInsight Emulator,” is devoted to
Microsoft’s HDInsight emulator. HDInsight emulator emulates a single-node cluster
and is well suited to development scenarios and experimentation. This hour focuses
on setting up the HDInsight emulator and executing a MapReduce job to test its
functionality.
Hour 10, “Programming MapReduce Jobs,” expands on the content in earlier hours
and provides examples and techniques for writing MapReduce programs in
Java and C#. It presents a real-life scenario that analyzes flight delays with
MapReduce and concludes with a discussion on serialization options for Hadoop.
Hour 11, “Customizing the HDInsight Cluster with Script Action,” looks at the
HDInsight cluster that comes preinstalled with a number of frequently used
components. It also introduces customization options for the HDInsight cluster and
walks you through the process for installing additional Hadoop ecosystem projects
using a feature called Script Action. In addition, this hour illustrates the steps in
developing and deploying a Script Action.
In Hour 12, “Getting Started with Apache Hive and Apache Tez in HDInsight,” you
learn how you can use Apache Hive. You learn different ways of writing and
executing HiveQL queries in HDInsight and see how Apache Tez significantly
improves overall performance for HiveQL queries.
In Hour 13, “Programming with Apache Hive, Apache Tez in HDInsight, and
Apache HCatalog,” you extend your expertise on Apache Hive and see how you can
leverage it for ad hoc queries and data analysis. You also learn about some of the
important commands you will use in Apache Hive for data loading and querying. At
the end of this hour, you look at Apache HCatalog, which has merged with Apache
Hive, and see how to leverage the Apache Tez execution engine for Hive query
execution to improve the performance of your query.
Hour 14, “Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC
Driver: Part 1,” shows you how to use the Microsoft Hive ODBC driver to connect
and pull data from Hive tables from different Microsoft Business Intelligence
(MSBI) reporting tools, for further analysis and ad hoc reporting.
In Hour 15, “Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC
Driver: Part 2,” you learn to use PowerPivot to create a data model (define
relationships between tables, apply transformations, create calculations, and more)
based on Hive tables and then use Power View and Power Map to visualize the data
from different perspectives with intuitive and interactive visualization options.
In Hour 16, “Integrating HDInsight with SQL Server Integration Services,” you see
how you can use SQL Server Integration Services (SSIS) to build data integration
packages to transfer data between an HDInsight cluster and a relational database
management system (RDBMS) such as SQL Server.
Hour 17, “Using Pig for Data Processing,” explores Pig Latin, a workflow-style
procedural language that makes it easier to specify transformation operations on
data. This hour provides an introduction to Pig for processing Big Data sets and
illustrates the steps in submitting Pig jobs to the HDInsight cluster.
Hour 18, “Using Sqoop for Data Movement Between RDBMS and HDInsight,”
demonstrates how Sqoop facilitates data migration between relational databases and
Hadoop. This hour introduces you to the Sqoop connector for Hadoop and illustrates
its use in data migration between Hadoop and SQL Server/SQL Azure databases.
Hour 19, “Using Oozie Workflows and Job Orchestration with HDInsight,” looks at
data processing solutions that require multiple jobs chained together in a particular
sequence to accomplish a processing task in the form of a conditional workflow. In
this hour, you learn to use Oozie, a workflow development component within the
Hadoop ecosystem.
Hour 20, “Performing Statistical Computing with R,” focuses on the R language,
which is popular among data scientists for analytics and statistical computing. R was
not designed to work with Big Data because it typically works by pulling data that
persists elsewhere into memory. However, recent advancements have made it
possible to leverage R for Big Data analytics. This hour introduces R and looks at
the approaches for enabling R on Hadoop.
Hour 21, “Performing Big Data Analytics with Spark,” introduces Spark, briefly
explores the Spark programming model, and takes a look at Spark integration with
SQL.
In Hour 22, “Microsoft Azure Machine Learning,” you learn about an emerging
technology known as Microsoft Azure Machine Learning (Azure ML). Azure ML is
extremely simple to use and easy to implement so that analysts with various
backgrounds (even non-data scientists) can leverage it for predictive analytics.
In Hour 23, “Performing Stream Analytics with Storm,” you learn about Apache
Storm and explore its use in performing real-time stream analytics.
In Hour 24, “Introduction to Apache HBase on HDInsight,” you learn about Apache
HBase, when to use it, and how you can leverage it with the HDInsight service.

Conventions Used in This Book
In our experience as authors and trainers, we’ve found that many readers and students skip
over this part of the book. Congratulations on reading it! Doing so will pay big dividends
because you’ll understand how and why we formatted this book the way we did.

Try It Yourself
Throughout the book, you’ll find Try It Yourself exercises, which are opportunities for
you to apply what you’re learning right then and there. We believe in knowledge stacking, so
you can expect that later Try It Yourself exercises assume that you know how to do stuff
you did in previous exercises. Therefore, your best bet is to read each chapter in sequence
and work through every Try It Yourself exercise.

