Big Data Essentials
Copyright © 2016 by Anil K. Maheshwari, Ph.D.

By purchasing this book, you agree not to copy or distribute the book by any
means, mechanical or electronic.
No part of this book may be copied or transmitted without written permission.

Other books by the same author:
Data Analytics Made Accessible, the #1 Bestseller in Data Mining
Moksha: Liberation Through Transcendence











Preface
Big Data is a new, and inclusive, natural phenomenon. It is as messy as nature itself. It
requires a new kind of Consciousness to fathom its scale and scope, and its many
opportunities and challenges. Understanding the essentials of Big Data requires
suspending many conventional expectations and assumptions about data … such as
completeness, clarity, consistency, and conciseness. Fathoming and taming the
multi-layered Big Data is a dream that is slowly becoming a reality. It is a rapidly evolving field
that is growing exponentially in value and capabilities.
There is a growing number of books being written on Big Data. They fall mostly into two
categories. The first kind focuses on business aspects, and discusses the strategic internal shifts
required for reaping the business benefits from the many opportunities offered by Big
Data. The second kind focuses on particular technology platforms, such as Hadoop or Spark.
This book aims to bring together the business context and the technologies in a seamless
way.
This book was written to meet the needs of an introductory Big Data course. It is meant
for students, as well as executives, who wish to take advantage of emerging opportunities
in Big Data. It provides an intuition of the wholeness of the field in simple language,
free from jargon and code. All the essential Big Data technology tools and platforms, such
as Hadoop, MapReduce, Spark, and NoSQL, are discussed. Most of the relevant
programming details have been moved to Appendices to ensure readability. The short
chapters make it easy to quickly understand the key concepts. A complete case study of
developing a Big Data application is included.
Thanks to Maharishi Mahesh Yogi for creating a wonderful university whose
consciousness-based environment made writing this evolutionary book possible. Thanks
to many current and former students for contributing to this book. Dheeraj Pandey assisted
with the Weblog analyzer application and its details. Suraj Thapalia assisted with the
Hadoop installation guide. Enkhbileg Tseeleesuren helped write the Spark tutorial. Thanks
to my family for supporting me in this process. My daughters Ankita and Nupur reviewed
the book and made helpful comments. My father Mr. RL Maheshwari and brother Dr.
Sunil Maheshwari also read the book and enthusiastically approved it. My colleague Dr.
Edi Shivaji too reviewed the book.
May the Big Data Force be with you!
Dr. Anil Maheshwari
August 2016, Fairfield, IA


Contents
Preface

Chapter 1 – Wholeness of Big Data
Introduction
Understanding Big Data
CASELET: IBM Watson: A Big Data system
Capturing Big Data
Volume of Data
Velocity of Data
Variety of Data
Veracity of Data
Benefitting from Big Data
Management of Big Data
Organizing Big Data
Analyzing Big Data
Technology Challenges for Big Data
Storing Huge Volumes
Ingesting streams at an extremely fast pace
Handling a variety of forms and functions of data
Processing data at huge speeds
Conclusion and Summary
Organization of the rest of the book
Review Questions
Liberty Stores Case Exercise: Step B1
Section 1
Chapter 2 - Big Data Applications
Introduction
CASELET: Big Data Gets the Flu
Big Data Sources
People to People Communications
Social Media
People to Machine Communications
Web access
Machine to Machine (M2M) Communications
RFID tags
Sensors
Big Data Applications
Monitoring and Tracking Applications
Analysis and Insight Applications
New Product Development
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B2
Chapter 3 - Big Data Architecture
Introduction
CASELET: Google Query Architecture
Standard Big data architecture
Big Data Architecture examples
IBM Watson
Netflix
Ebay
VMWare
The Weather Company
TicketMaster
LinkedIn
Paypal
CERN
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B3

Section 2
Chapter 4: Distributed Computing using Hadoop
Introduction
Hadoop Framework
HDFS Design Goals
Master-Slave Architecture
Block system
Ensuring Data Integrity
Installing HDFS
Reading and Writing Local Files into HDFS
Reading and Writing Data Streams into HDFS
Sequence Files
YARN
Conclusion
Review Questions
Chapter 5 – Parallel Processing with MapReduce
Introduction
MapReduce Overview
MapReduce programming
MapReduce Data Types and Formats
Writing MapReduce Programming
Testing MapReduce Programs
MapReduce Jobs Execution
How MapReduce Works
Managing Failures
Shuffle and Sort
Progress and Status Updates
Hadoop Streaming
Conclusion
Review Questions
Chapter 6 – NoSQL databases
Introduction
RDBMS Vs NoSQL
Types of NoSQL Databases
Architecture of NoSQL
CAP theorem
Popular NoSQL Databases
HBase
Architecture Overview
Reading and Writing Data
Cassandra
Architecture Overview
Reading and Writing Data
Hive Language
HIVE Language Capabilities
Pig Language
Conclusion
Review Questions
Chapter 7 – Stream Processing with Spark
Introduction
Spark Architecture
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
Spark Ecosystem
Spark for big data processing
MLlib
Spark GraphX
SparkR
Spark SQL
Spark Streaming
Spark applications
Spark vs Hadoop
Conclusion
Review Questions
Chapter 8 – Ingesting Data
Wholeness
Messaging Systems
Point to Point Messaging System
Publish-Subscribe Messaging System
Apache Kafka
Use Cases
Kafka Architecture
Producers
Consumers
Broker
Topic
Summary of Key Attributes
Distribution
Guarantees
Client Libraries
Apache ZooKeeper
Kafka Producer example in Java
Conclusion
Review Questions
References
Chapter 9 – Cloud Computing Primer
Introduction
Cloud Computing Characteristics
In-house storage
Cloud storage
Cloud Computing: Evolution of Virtualized Architecture
Cloud Service Models
Cloud Computing Myths
Cloud Computing: Getting Started
Conclusion
Review Questions
Section 3
Chapter 10 – Web Log Analyzer application case study
Introduction
Client-Server Architecture
Web Log analyzer
Requirements
Solution Architecture
Benefits of this solution
Technology stack
Apache Spark
Spark Deployment
Components of Spark
HDFS
MongoDB
Apache Flume
Overall Application logic
Technical Plan for the Application
Scala Spark code for log analysis
Sample Log data
Sample Input Data:
Sample Output of Web Log Analysis
Conclusion and Findings
Review Questions
Chapter 10: Data Mining Primer
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques
Mining Big Data
From Causation to Correlation
From Sampling to the Whole
From Dataset to Datastream
Data Mining Best Practices
Conclusion
Review Questions
Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating Cluster server on AWS, Install Hadoop from Cloudera
Step 1: Creating Amazon EC2 Servers.
Step 2: Connecting server and installing required Cloudera distribution of Hadoop
Step 3: Word Count using MapReduce
Appendix 2: Spark Installation and Tutorial
Step 1: Verifying Java Installation
Step 2: Verifying Scala installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation
Step 8: Application: Word Count in Scala
Additional Resources
About the Author




Chapter 1 – Wholeness of Big Data
Introduction
Big Data is an all-inclusive term that refers to extremely large, very fast, diverse, and
complex data that cannot be managed with traditional data management tools. Ideally, Big
Data would harness all kinds of data, and deliver the right information, to the right person,
in the right quantity, at the right time, to help make the right decision. Big Data can be
managed by developing infinitely scalable, totally flexible, and evolutionary data
architectures, coupled with the use of extremely cost-effective computing components.
The infinite potential knowledge embedded within this cosmic computer would help
connect everything to the Unified Field of all the laws of nature.
This book will provide a complete overview of Big Data for the executive and the data
specialist. This chapter will cover the key challenges and benefits of Big Data, and the
essential tools and technologies now available for organizing and manipulating Big Data.


Understanding Big Data
Big Data can be examined on two levels. On a fundamental level, it is data that can be
analyzed and utilized for the benefit of the business. On another level, it is a special kind
of data that poses unique challenges. This is the level that this book will focus on.

Figure 1.1: Big Data Context

At the level of business, data generated by business operations can be analyzed to
generate insights that can help the business make better decisions. This makes the business
grow bigger, and generate even more data, and the cycle continues. This is represented by
the blue cycle on the top-right of Figure 1.1. This aspect is discussed in Chapter 10, a
primer on Data Analytics.
On another level, Big Data is different from traditional data in every way: space, time, and
function. The quantity of Big Data is 1,000 times more than that of traditional data. The
speed of data generation and transmission is 1,000 times faster. The forms and functions
of Big Data are much more diverse: from numbers to text, pictures, audio, videos, activity
logs, machine data, and more. There are also many more sources of data, from individuals
to organizations to governments, using a range of devices from mobile phones to
computers to industrial machines. Not all data will be of equal quality and value. This is
represented by the red cycle on the bottom left of Figure 1.1. This aspect of Big Data, and
its new technologies, is the main focus of this book.
Big Data is mostly unstructured data. Every type of data is structured differently, and will
have to be dealt with differently. There are huge opportunities for technology providers to
innovate and manage the entire life cycle of Big Data … to generate, gather, store,
organize, analyze, and visualize this data.


CASELET: IBM Watson: A Big Data system
IBM created the Watson system as a way of pushing the boundaries of
Artificial Intelligence and natural language understanding technologies.
Watson beat the world champion human players of Jeopardy! (a quiz-style TV
show) in Feb 2011. Watson reads up on data about everything on the web,
including the entire Wikipedia. It digests and absorbs the data based on
simple generic rules such as: books have authors; stories have heroes; and
drugs treat ailments. A Jeopardy! clue, received in the form of a cryptic
phrase, is broken down into many possible sub-clues of the
correct answer. Each sub-clue is examined to see the likelihood of its
answer being the correct answer for the main problem. Watson calculates
the confidence level of each possible answer. If the confidence level
exceeds a threshold level, it decides to offer the answer to the
clue. It manages to do all this in a mere 3 seconds.
Watson is now being applied to diagnosing diseases, especially cancer.
Watson can read all the new research published in the medical journals to
update its knowledge base. It is being used to estimate the probability of
various diseases by weighing factors such as a patient’s current symptoms,
health history, genetic history, and medication records, and then to
recommend a particular diagnosis. (Source: Smartest machines on Earth:
youtube.com/watch?v=TCOhyaw5bwg)

Figure 1.2: IBM Watson playing Jeopardy!

Q1: What kinds of Big Data knowledge, technologies and skills are
required to build a system like Watson? What kind of resources are
needed?
Q2: Will doctors be able to compete with Watson in diagnosing diseases
and prescribing medications? Who else could benefit from a system like
Watson?
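
The confidence check described in the caselet can be illustrated with a small sketch. This is not IBM's implementation; the function name choose_answer, the candidate answers, and the confidence scores below are all hypothetical and serve only to show the idea of offering an answer when its confidence exceeds a threshold.

```python
from typing import Dict, Optional


def choose_answer(confidences: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most confident candidate answer, or None to stay silent."""
    if not confidences:
        return None
    # Pick the candidate with the highest confidence score.
    best = max(confidences, key=lambda answer: confidences[answer])
    # Offer it only if its confidence exceeds the threshold.
    return best if confidences[best] > threshold else None


# Hypothetical candidate answers and confidence scores for one clue.
candidates = {"Chicago": 0.82, "Toronto": 0.14, "Boston": 0.04}
print(choose_answer(candidates, threshold=0.5))  # Chicago
print(choose_answer(candidates, threshold=0.9))  # None
```

Raising the threshold makes such a system answer less often but with more certainty; lowering it has the opposite effect.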


Capturing Big Data
If data were simply growing too large, OR only moving too fast, OR only becoming too
diverse, it would be relatively easy. However, when the four Vs (Volume, Velocity,
Variety, and Veracity) arrive together in an interactive manner, they create a perfect storm.
While the Volume and Velocity of data drive the major technological concerns and the
costs of managing Big Data, these two Vs are themselves being driven by the 3rd V, the
Variety of forms and functions and sources of data.
Volume of Data
The quantity of data has been relentlessly doubling every 12-18 months. Traditional data
is measured in Gigabytes (GB) and Terabytes (TB), but Big Data is measured in Petabytes
(PB) and Exabytes (1 Exabyte = 1 Million TB).
This data is so huge that it is almost a miracle that one can find any specific thing in it, in
a reasonable period of time. Searching the world-wide web was the first true Big Data
application. Google perfected the art of this application, and developed many of the
path-breaking technologies we see today to manage Big Data.
The primary reason for the growth of data is the dramatic reduction in the cost of storing
data. The costs of storing data have decreased by 30-40% every year. Therefore, there is
an incentive to record everything that can be observed. It is called ‘datafication’ of the
world. The costs of computation and communication have also been coming down,
similarly. Another reason for the growth of data is the increase in the number of forms and
functions of data. More about this in the Variety section.
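
To get a feel for these rates, the arithmetic can be worked through in a few lines of Python. This is only an illustrative sketch: it assumes the doubling period and the yearly cost decline stay constant, and the function names and the three-year and five-year horizons are chosen here for illustration.

```python
TB_PER_EXABYTE = 1_000_000  # from the text: 1 Exabyte = 1 Million TB


def growth_factor(years: float, doubling_months: float) -> float:
    """How many times larger the data becomes after `years`."""
    return 2 ** (years * 12 / doubling_months)


def remaining_cost_fraction(years: int, annual_decline: float) -> float:
    """Fraction of today's storage cost that remains after `years`."""
    return (1 - annual_decline) ** years


print(growth_factor(3, 12))  # 8.0 (doubling every 12 months)
print(growth_factor(3, 18))  # 4.0 (doubling every 18 months)
print(round(remaining_cost_fraction(5, 0.30), 3))  # 0.168
print(round(remaining_cost_fraction(5, 0.40), 3))  # 0.078
print(2.5 * TB_PER_EXABYTE)  # 2500000.0 TB in 2.5 Exabytes
```

Under these assumptions, data grows four to eight times in three years, while storing one unit of data after five years costs roughly 8% to 17% of what it costs today.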
Velocity of Data
If traditional data is like a lake, Big Data is like a fast-flowing river. Big Data is being
generated by billions of devices, and communicated at the speed of the internet. Ingesting
all this data is like drinking from a fire hose. One does not have control over how fast the
data will come. A huge unpredictable data-stream is the new metaphor for thinking about
Big Data.
The primary reason for the increased velocity of data is the increase in internet speed.
Internet speeds available to homes and offices are now increasing from 10 MB/sec to 1
GB/sec (100 times faster). More people are getting access to high-speed internet around
the world. Another important reason is the increased variety of sources that can generate
and communicate data from anywhere, at any time. More on that in the Variety section.
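
The effect of that hundredfold speed-up can be seen with a quick calculation. The sketch below assumes decimal units (1 MB = 10^6 bytes) and a hypothetical 1 TB payload; the function name transfer_seconds is also just for illustration.

```python
MB = 10 ** 6   # decimal units assumed: 1 MB = 10^6 bytes
GB = 10 ** 9
TB = 10 ** 12


def transfer_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time needed to move `size_bytes` at a steady transfer rate."""
    return size_bytes / bytes_per_second


slow = transfer_seconds(1 * TB, 10 * MB)  # at 10 MB/sec
fast = transfer_seconds(1 * TB, 1 * GB)   # at 1 GB/sec

print(slow)         # 100000.0 seconds, a little under 28 hours
print(fast)         # 1000.0 seconds, under 17 minutes
print(slow / fast)  # 100.0
```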

Variety of Data
Big data is inclusive of all forms of data, for all kinds of functions, from all sources and
devices. If traditional data, such as invoices and ledgers, were like a small store, Big Data
is the biggest imaginable shopping mall that offers unlimited variety. There are three
major kinds of variety.
1. The first aspect of variety is the form of data. Data types range in order of
simplicity and size from numbers to text, graph, map, audio, video, and others.
There could be a composite of data that includes many elements in a single file.
For example, text documents have text and graphs and pictures embedded in them.
Videos can have charts and songs embedded in them. Audio and video have
different and more complex storage formats than numbers and text. Numbers and
text can be more easily analyzed than an audio or video file. How should
composite entities be stored and analyzed?
2. The second aspect is the variety of function of data. There are human chats and
conversation data, songs and movies for entertainment, business transaction
records, machine operations performance data, new product design data, old data
for backup, etc. Human communication data would be processed very differently
from operational performance data, with totally different objectives. A variety of
applications are needed to compare pictures in order to recognize people’s faces;
compare voices to identify the speaker; and compare handwriting to identify the
writer.
3. The third aspect of variety is the source of data. Mobile phones and tablet devices
enable a wide series of applications or apps to access data and generate data
anytime, anywhere. Web access logs are another new and huge source of
diagnostic data. ERP systems generate massive amounts of structured business
transactional information. Sensors on machines, and RFID tags on assets, generate
incessant and repetitive data. Broadly speaking, there are three types of
sources of data: human-human communications; human-machine
communications; and machine-to-machine communications. The sources of data,
and their respective applications arising from that data, will be discussed in the
next chapter.


Figure 1.3: Sources of Big Data (Source: Hortonworks.com)
Veracity of Data
Veracity relates to the believability and quality of data. Big Data is messy. There is a lot
of misinformation and disinformation. The reasons for poor quality of data can range from
human and technical error, to malicious intent.
1. The source of information may not be authoritative. For example, not all websites are
equally trustworthy. Any information from whitehouse.gov or from
nytimes.com is more likely to be authentic and complete. Wikipedia is useful, but
not all pages are equally reliable. The communicator may have an agenda or a
point of view.
2. The data may not be received correctly because of human or technical failure.
Sensors and machines for gathering and communicating data may malfunction and
may record and transmit incorrect data. Urgency may require the transmission of
the best data available at a point in time. Such data makes reconciliation with later,
accurate, records more problematic.
3. The data provided and received may, however, also be intentionally wrong, for
competitive or security reasons.
Data needs to be sifted and organized by quality factors, for it to be put to any great use.
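
One simple way to picture such sifting is a filter that separates records by how much the source is trusted and whether the value is plausible. The sketch below is only an illustration under stated assumptions: the Record type, the source names, the trust scores, and the valid range are all hypothetical, and real systems would use far richer quality factors.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Record:
    source: str
    value: Optional[float]  # None models a reading that was lost in transit


def sift(
    records: List[Record],
    trust: Dict[str, float],
    min_trust: float,
    valid_range: Tuple[float, float],
) -> Tuple[List[Record], List[Record]]:
    """Split records into (accepted, rejected) using two quality factors."""
    low, high = valid_range
    accepted: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        # Factor 1: is the source authoritative enough? Unknown sources score 0.
        trusted = trust.get(record.source, 0.0) >= min_trust
        # Factor 2: was the value received, and is it physically plausible?
        plausible = record.value is not None and low <= record.value <= high
        (accepted if trusted and plausible else rejected).append(record)
    return accepted, rejected


# Hypothetical trust scores and temperature readings in degrees Celsius.
trust_scores = {"calibrated_sensor": 0.9, "unverified_feed": 0.3}
readings = [
    Record("calibrated_sensor", 21.5),
    Record("calibrated_sensor", 999.0),  # likely a malfunction
    Record("unverified_feed", 22.0),     # source not trusted enough
    Record("calibrated_sensor", None),   # not received correctly
]
good, bad = sift(readings, trust_scores, min_trust=0.5, valid_range=(-50.0, 60.0))
print(len(good), len(bad))  # 1 3
```

Rejected records are kept rather than discarded, so that they can be reconciled later if more accurate data arrives.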


Benefitting from Big Data
Data usually belongs to the organization that generates it. There is other data, such as
social media data, that is freely accessible under an open general license. Organizations
can use this data to learn about their consumers, improve their service delivery, and design
new products to delight their customers and to gain a competitive advantage. Data is also
like a new natural resource. It is being used to design new digital products, such as
on-demand entertainment and learning.
Organizations may choose to gather and store this data for later analysis, or to sell it to
other organizations, who might benefit from it. They may also legitimately choose to
discard parts of their data for privacy or legal reasons. However, organizations cannot
afford to ignore Big Data. Organizations that do not learn to engage with Big Data could
find themselves left far behind their competition, landing in the dustbin of history.
Innovative small and new organizations can use Big Data to quickly scale up and beat
larger and more mature organizations.
Big Data applications exist in all industries and aspects of life. There are three major types
of Big Data applications: Monitoring and Tracking, Analysis and Insight, and new digital
product development.
Monitoring and Tracking Applications: Consumer goods producers use monitoring and
tracking applications to understand the sentiments and needs of their customers. Industrial
organizations use Big Data to track inventory in massive interlinked global supply chains.
Factory owners use it to monitor machine performance and do preventive maintenance.
Utility companies use it to predict energy consumption, and manage demand and supply.
Information Technology companies use it to track website performance and improve its
usefulness. Financial organizations use it to project trends better and make more effective
and profitable bets, etc.
Analysis and Insight: Political organizations use Big Data to micro-target voters and win
elections. Police use Big Data to predict and prevent crime. Hospitals use it to better
diagnose diseases and prescribe medications. Ad agencies use it to design more
targeted marketing campaigns quickly. Fashion designers use it to track trends and create
more innovative products.


Figure 1.4: The first Big Data President

New Product Development: Incoming data could be used to design new products such as
reality TV entertainment. Stock market feeds could be a digital product. This area needs
much more development.


Management of Big Data
Many organizations have started initiatives around the use of Big Data. However, most
organizations do not necessarily have a grip on it. Here are some emerging insights into
making better use of Big Data.
1. Across all industries, the business case for Big Data is strongly focused on
addressing customer-centric objectives. The first focus in deploying Big Data
initiatives is to protect and enhance customer relationships and customer
experience.
2. Solve a real pain-point. Big Data should be deployed for specific business
objectives so that management is not overwhelmed by the sheer
size of it all.
3. Organizations are beginning their pilot implementations by using existing and
newly accessible internal sources of data. It is better to begin with data under
one’s control and where one has a superior understanding of the data.
4. Put humans and data together to get the most insight. Combining data-based
analysis with human intuition and perspectives is better than going just one way.
5. Advanced analytical capabilities are required, but lacking, for organizations to get
the most value from Big Data. There is a growing awareness of the need to build or hire
those skills and capabilities.
6. Use more diverse data, not just more data. This would provide a broader
perspective into reality and better quality insights.
7. The faster you analyze the data, the greater its predictive value. The value of data
depreciates with time. If the data is not processed in five minutes, then the
immediate advantage is lost.
8. Don’t throw away data if no immediate use can be seen for it. Data has value
beyond what you initially anticipate. Data can add perspective to other data later
on in a multiplicative manner.
9. Maintain one copy of your data, not multiple. This would help avoid confusion
and increase efficiency.
10. Plan for exponential growth. Data is expected to continue to grow at exponential
rates. Storage costs continue to fall, data generation continues to grow, data-based
applications continue to grow in capability and functionality.
11. A scalable and extensible information management foundation is a prerequisite for
big data advancement. Big Data builds upon a resilient, secure, efficient, flexible,
and real-time information processing environment.
12. Big Data is transforming business, just like IT did. Big Data is a new phase
representing a digital world. Business and society are not immune to its strong
impacts.

