Big Data For Beginners
Understanding SMART Big Data, Data Mining & Data Analytics For Improved Business Performance, Life Decisions & More!
Copyright 2016 by Vince Reynolds - All rights reserved.
This document is geared towards providing exact and reliable information in regards to the topic and issue covered. The publication is sold with the idea that the publisher is not required to render accounting, officially permitted, or otherwise, qualified services. If advice is necessary, legal or professional, a practiced individual in the profession should be ordered.
- From a Declaration of Principles which was accepted and approved equally by a Committee of the American Bar Association and a Committee of Publishers and Associations.
In no way is it legal to reproduce, duplicate, or transmit any part of this document in either electronic means or in printed format. Recording of this publication is strictly prohibited and any storage of this document is not allowed unless with written permission from the publisher. All rights reserved.
The information provided herein is stated to be truthful and consistent, in that any liability, in terms of inattention or otherwise, by any usage or abuse of any policies, processes, or directions contained within is the solitary and utter responsibility of the recipient reader. Under no circumstances will any legal responsibility or blame be held against the publisher for any reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Respective authors own all copyrights not held by the publisher.
The information herein is offered for informational purposes solely, and is universal as so. The presentation of the information is without contract or any type of guarantee assurance. The trademarks that are used are without any consent, and the publication of the trademark is without permission or backing by the trademark owner. All trademarks and brands within this book are for clarifying purposes only and are owned by the owners themselves, not affiliated with this document.
Table of Contents
Introduction
Chapter 1. A Conundrum Called ‘Big Data’
So, What Does Big Data Look Like?
The Purpose and Value of ‘Big Data’
How Big Data Changes Everything
Enterprise Supercomputing
Supercomputer Platforms
Chapter 2. Understanding Big Data Better
How To Value Data: 4 Measurable Characteristics Of Big Data
Volume Based Value
Velocity Based Value
Variety Based Value
What is Structured Data?
What is Unstructured Data?
Veracity Based Value
Cloud or in-house?
Big Data as the Ultimate Computing Platform
Big Data in Airplane Production
Big Data Platforms
Big Data and the Future
Fire Fighters and High Divers
Big Data is Do It Yourself Supercomputing
Platform Engineering and Big Data
Keep It Simple, Sunshine
Chapter 3: Big Data Analytics
Big Data and Ultra Speed
The Big Data Reality of Real Time
The Real Time Big Data Analytics Stack (RTBDA)
What Can Big Data Do For You?
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics
Top High Impact Use Cases of Big Data Analytics
Customer analytics
Operational analytics
Risk and Compliance Analytics
New Products and Services Innovation
Chapter 4. Why Big Data Matters
So, does Big Data really matter?
There are, however, other obstacles that remain
Chapter 5. A Closer Look at Key Big Data Challenges
Difficulty in Understanding and Utilizing Big Data
New, Complex, and Continuously Evolving Technologies
Data Security Related to Cloud Based Big Data Solutions
Chapter 6. Generating Business Value through Data Mining
The Business Value of Data
Data Storage
So What is Data Mining?
How Data Mining Can Help Your Business
The Data Mining Process
Technologies for Data Mining
Examples of Applications of Data Mining in Real World Setting
Data Mining Prospects
Top 10 Ways to Get a Competitive Edge through Data Mining
Conclusion
Introduction
If you are in the world of IT or business, you have probably heard about the Big Data phenomenon. You might have even encountered professionals who introduced themselves as data scientists. So you may be wondering: just what is this emerging new area of science? What types of knowledge and problem-solving skills do data scientists have? What types of problems do data scientists solve through Big Data tech?
After reading this book, you will have the answers to these questions. In addition, you will begin to become proficient with important industry terms, applications, and tools, preparing you for a deeper understanding of the other important areas of Big Data.
Every day, our society creates about 3 quintillion bytes of data. You are probably wondering what 3 quintillion is. Well, it is 3 followed by 18 zeros. And that, folks, is generated EVERY DAY. With this massive stream of data, the need to make sense of it becomes more crucial, and the demand for Big Data understanding is quickly increasing. Business owners, large or small, must have basic knowledge of big data.
Chapter 1. A Conundrum Called ‘Big Data’
‘Big data’ is one of the latest technology trends that are profoundly affecting the way organizations utilize information to enhance the customer experience, improve their products and services, create untapped sources of revenue, transform business models, and even efficiently manage health care services. What makes it a highly trending topic is the fact that the effective use of big data almost always ends up with dramatic results. Yet the irony is that nobody really knows what ‘big data’ actually means.
There is no doubt that ‘big data’ is not just a highly trending IT buzzword. Rather, it is a fast-evolving concept in information technology and data management that is revolutionizing the way companies conduct their businesses. The sad part is, it is also turning out to be a classic conundrum because no one, not even a group of the best IT experts or computer geeks, can come up with a definitive explanation describing exactly what it is. They always fall short of coming up with a description of ‘big data’ that is acceptable to all. At best, what most of these computer experts can come up with are roundabout explanations and sporadic examples. Try asking several IT experts what ‘big data’ is and you will get as many different answers as the number of people you ask.
What makes it even more complicated and difficult to understand is the fact that what is deemed ‘big’ now may not be that big in the near future, due to rapid advances in software technology and the data management systems designed to handle it.
We also cannot escape the fact that we now live in a digital universe where everything and anything we do leaves a digital trace we call data. At the center of this digital universe is the World Wide Web, from which comes a deluge of data that floods our consciousness every single second. With well over one trillion web pages (50 billion of which have already been indexed by and are searchable through various major search engines), the web offers us unparalleled interconnectivity, which allows us to interact with anyone and anything within a connected network we happen to be part of. Each one of these interactions also generates data that is coursed through and recorded in the web, adding to the ‘fuzziness’ of an already fuzzy concept. As a consequence, the web is continuously overflowing with data so massive that it is almost impossible to digest or crunch into usable segments for practical applications, if they are of any use at all. This enormous, ever-growing data that goes through and is stored in the web, together with the developing technologies designed to handle it, is what is collectively referred to as ‘big data’.
So, What Does Big Data Look Like?
If you want to have an idea of what ‘big data’ really looks like or how massive it truly is, try to visualize the following statistics if you can, without getting dizzy. Think of the web, which currently covers more than 100 million domains and is still growing at the rate of 20,000 new domains every single day.
The data that comes from these domains is so massive and mind-boggling that it is practically immeasurable, much less manageable, by any conventional data management and retrieval methods available today. And that is only for starters. Add to this the 300 million daily Facebook posts, 60 million daily Facebook updates, and 250 million daily tweets coming from more than 900 million combined Facebook and Twitter users, and for sure your imagination is going to go through the roof. Don’t forget to include the voluminous data coming from over six billion smartphones currently in use, which continually access the internet to do business online, post status updates on social media, send out tweets, and carry out many other digital transactions. Remember, approximately one billion of these smartphones are GPS-enabled, which means they are constantly connected to the internet and are therefore continuously leaving behind digital trails, adding more data to the already burgeoning bulk of information stored in millions of servers that span the internet.
And if your imagination still serves you right at this point, try contemplating the more than thirty billion point-of-sale transactions per year that are coursed through electronically connected POS devices. If you are still up to it, why not also go over the more than 10,000 credit card payments being made online or through other connected devices every single second. The sheer volume of the combined torrential data that envelops us unceasingly is unbelievable. “Mind-boggling” is an understatement. Stupefying would be more appropriate.
Don’t blink now, but the ‘big data’ that has been accumulated by the web over the past five years (since 2010) and is now stored in millions of servers scattered all over the globe far exceeds all of the prior data that had been produced and recorded throughout the whole history of mankind. The ‘big data’ we refer to includes anything and everything that has been fed into big data systems, such as social network chatter, content of web pages, GPS trails, financial market data, online banking transactions, streaming music and videos, podcasts, satellite imagery, etc. It is estimated that over 2.5 quintillion bytes of data (2.5 x 10^18) is created by us every day. This massive flood of data, which we collectively call ‘big data’, just keeps on getting bigger and bigger through time. Experts estimate that its volume will reach 35 zettabytes (35 x 10^21) by 2020.
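To put these two figures side by side, here is a rough arithmetic sketch in Python. It assumes decimal (SI) units and, purely for illustration, a constant daily rate; neither assumption comes from the estimates themselves, and the variable names are made up for this example.

```python
# Decimal (SI) byte units; this convention is an assumption.
EXABYTE = 10**18
ZETTABYTE = 10**21

daily_bytes = 25 * 10**17          # 2.5 quintillion bytes created per day
projected_total = 35 * ZETTABYTE   # 35 zettabytes, the 2020 estimate

# Express the daily figure in exabytes.
daily_exabytes = daily_bytes / EXABYTE

# How long would a constant daily rate take to add up to the projection?
days_to_reach = projected_total // daily_bytes
years_to_reach = days_to_reach / 365

print(f"Daily volume: {daily_exabytes} exabytes")
print(f"Days at a constant rate: {days_to_reach}")
print(f"Years at a constant rate: {years_to_reach:.1f}")
```

At a flat 2.5 quintillion bytes per day it would take 14,000 days (roughly 38 years) to accumulate 35 zettabytes, so an estimate of reaching that volume by 2020 only makes sense if the daily rate itself keeps growing.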
In essence, if and when data sets grow extremely big or become too complex for traditional data management tools to handle, they are considered ‘Big Data’. The problem is, there is no commonly set ceiling or accepted upper threshold beyond which the bulk of information starts to be classified as big data. In practice, what most companies normally do is consider as big data whatever has outgrown their own database management tools. Big Data, in such a case, is the enormous data which they can no longer handle, either because it is too massive, too complex, or both. This means the ceiling varies from one company to another. In other words, different companies have different upper threshold limits for what constitutes big data. Almost always, the ceiling is determined by how much data their respective database management tools are able to handle at any given time. That’s probably one of the reasons why the definition of ‘big data’ is so fuzzy.
The Purpose and Value of ‘Big Data’
Just as fuzzy and nebulous as its definition, the purpose or value of Big Data also appears to be unclear to many entrepreneurs. In fact, many of them are still groping in the dark, looking for answers to such questions as ‘why’ and ‘how’ to use ‘big data’ to grow their businesses.
If we are to take our cue from a poll conducted by Digitalist Magazine among the 300 participants of the Sapphire Now 2014 event held in Orlando, Florida, it appears that about 82% of companies have started to embrace Big Data as a critical component in achieving their strategic objectives. But despite the fact that 60% of them have started digging into big data, only 3% have gained the maturity or acquired sufficient knowledge and resources to sift through and manage such massive information. Apparently, the rest continue to grope in the dark.
It is therefore quite a wonder that, despite being constantly on the lookout for new ways to build and maintain a competitive edge and despite relentlessly seeking new and innovative products and services to increase profitability, most companies still miss out on the many opportunities ‘big data’ has to offer. For whatever reason they may have, they stop short of laying down the necessary groundwork to start managing and digging into ‘big data’ to extract new insights, create new value, and discover ways to stay ahead of their competitors.
For hidden deep within the torrent of the big data information stream is a wealth of useful knowledge and valuable behavioral and market patterns that companies (big or small) can use to fuel their growth and profitability, simply waiting to be tapped. However, such valuable information has to be ‘mined’ and ‘refined’ first before it can be put to good use, much like drilling for oil that is buried underground.
Similar to oil, which has to be drilled and refined first before you can harness its awesome power to the hilt, ‘big data’ users have to dig deep, sift through, and analyze the layers upon layers of data sets that make up big data before they can extract usable sets that have specific value to them.
In other words, like oil, big data becomes more valuable only after it is ‘mined’, processed, and analyzed for pertinent data that can be used to create new value. This cumbersome process is called big data analytics. Analytics is what gives big data its shine and makes it usable for application to specific cases. To make the story short, big data goes hand in hand with analytics. Without analytics, big data is nothing more than a bunch of meaningless digital trash.
The traditional way of processing big data, however, used to be a tough and expensive task to tackle. It involved analyzing massive volumes of data which traditional analytics and conventional business intelligence solutions can’t handle. It required the use of equally massive computer hardware and the most sophisticated data management software designed primarily to handle such enormous and complicated information.
The giant corporations who started digging into big data ahead of everybody else had to spend fortunes on expensive hardware and groundbreaking data management software to be able to do it, albeit with a great deal of success. Their pioneering efforts revealed new insights that were buried deep in the maze of information clogging the internet servers, which they were able to retrieve and use to great advantage. For example, after analyzing geographical and social data and after scrutinizing every business transaction, they discovered a new marketing factor called ‘peer influence’, which played an important role in shaping shopping preferences. This discovery allowed them to establish specific market needs and segments without the need to conduct tedious product samplings, thus blazing the trail for data-driven marketing.
All this while, the not-so-well-resourced companies could only watch in awe, sidelined by the prohibitive cost of processing big data. The good news, though, is that this will not be for long, because there is now affordable commodity hardware which the not-so-well-resourced can use to start their own big data projects. There are also cloud architectures and open source software which tremendously cut the cost of big data processing, making it highly feasible even for startup companies to tap into the huge potential of big data by simply subscribing to cloud services for server time.
At this point, only one thing is clear: every enterprise is left with no choice but to embrace the concept of big data and understand its implications for their business. They have to realize that data-driven marketing and data-driven product innovation are now the new norm.
How Big Data Changes Everything
Big Data is a kind of supercomputing that can be used by governments and businesses, making it feasible to keep track of a pandemic in real time, guess where the next terrorist attack will happen, improve the efficiency of restaurant chains, project voting patterns in elections, and predict the volatility of financial markets while they are happening.
Hence, many seemingly unrelated and diverse areas will be integrated into the big data network. Similar to any powerful tech, when used properly and effectively for good, Big Data could push mankind towards many possibilities. But if used with bad intentions, the risks could be very high and the results could even be damaging.
The need to get big data is immediate for different organizations. If a malevolent organization gets the tech first, then the rest of the world could be at risk. If a terrorist organization secured the tech before the CIA, the security of the USA could be compromised.
The resolutions will need business establishments to be more creative at different levels, including organizational, financial, and technical. If the Cold War in the 1950s was all about getting the arms, today Big Data is the arms race.
Enterprise Supercomputing
Trends in the world of supercomputing are in some ways similar to those of the fashion industry: if you wait long enough, you get the chance to wear it again. Most of the tech used in Big Data, such as distributed file systems, parallel processing, and clustering, has been used in different industries for many years.
Enterprise supercomputing was developed by online companies with worldwide operations that require the processing of exponentially growing numbers of users and their profiles (Yahoo!, Google, and Facebook). But they needed to do this as fast as they could without spending too much money. This enterprise supercomputing is known as Big Data.
Big Data could cause disruptive changes to organizations, and can reach far beyond online communities to the social media platforms that span and connect the globe. Big Data is not just a fad. It is a crucial aspect of modern tech that will be used for generations to come.
Big data computing is actually not a new technology. Predicting the weather has long been a crucial big data concern, dating back to when weather models were processed on a single supercomputer that could occupy a whole gymnasium and was built from then-fast processing units with costly memory. Software during the 1970s was very crude, so most of the performance during that time was credited to the innovative engineering of the hardware components.
Software technology had improved by the 1990s, leading to an improved setup where programs processed on one huge supercomputer could be partitioned into smaller programs running simultaneously on several workstations. Once all the programs are done processing, the results are collated and analyzed to forecast the weather for several weeks.
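The partition-and-collate idea described above can be sketched in a few lines of Python. This is only an illustration, not how any weather agency actually runs its models: the "workstations" are threads from the standard library's ThreadPoolExecutor, the readings are made-up numbers, and the names partition, partial_stats, and parallel_mean are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into at most `parts` consecutive chunks of similar size."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """Work done by one 'workstation': sum and count of its own chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(data, workers=4):
    """Partition the job, run the pieces concurrently, then collate."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in results)
    count = sum(n for _, n in results)
    return total / count


# Hypothetical sensor readings standing in for real model input.
readings = list(range(1, 101))
print(parallel_mean(readings))
```

Each chunk is processed independently, and only the small partial results are combined at the end, which is why this style of work spreads well over many machines.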
But even during the 1990s, the computer simulators needed about 15 days to calculate and project the weather for a week. Of course, it doesn’t help people to know that it was cloudy last week. Nowadays, the parallel computer simulations for weather prediction for the whole week can be completed in a matter of hours.
In reality, these supercomputers cannot predict the weather. Instead, they are just trying to simulate and forecast its behavior. It is through human analysis that the weather can be predicted. Hence, supercomputers alone cannot process Big Data and make sense of it.
Many weather forecasting agencies use different simulators with varying strengths. Computer simulators that are good at forecasting where a hurricane will make landfall in New York are not that accurate in forecasting how the humidity level could affect air operations at Atlanta International Airport.
Weather forecasters in every region study the results of several simulations with various sets of initial data. They not only pore over actual output from weather agencies, but also look at different instruments such as the Doppler radar.
Even though there are tons of data involved, weather simulation is not categorized as Big Data, because there is a lot of computing required. Scientific computing problems (usually in engineering and meteorology) are instead regarded as scientific supercomputing or high-performance computing (HPC).
Early electronic computers were designed to perform scientific computing, such as deciphering codes or calculating missile trajectories, all of which involve working on mathematical problems using millions of equations. Scientific calculations can also solve equations for non-scientific problems, such as rendering animated movies.
Big Data is regarded as the enterprise equivalent of HPC, and is also known as enterprise supercomputing or high-performance commercial computing. Big Data can also resolve huge computing problems, but it is more about discovering patterns and less about equations.
During the early 1960s, financial organizations such as banks and lending firms used enterprise computers to automate accounts and manage their credit card ventures. Nowadays, online businesses such as eBay and Amazon, and even large retailers, are using enterprise supercomputing to find solutions for the numerous business problems they encounter. However, enterprise supercomputing can be used for much more than studying customer attrition, managing subscribers, or discovering idle accounts.
Big Data and Hadoop
Hadoop is regarded as the first enterprise supercomputing software platform that works at scale and is quite affordable. It exploits the easy trick of parallelism that is already in use in the high-performance computing industry. Yahoo! developed this software in order to find a specific solution for a problem, but they immediately realized that this software had the ability to solve other computing problems.
Even though the fortunes of Yahoo! changed drastically, it has made a large contribution to the incubation of Facebook, Google, and big data.
Yahoo! originally developed Hadoop to easily process the flood of clickstream data received by the search engine. Clickstream refers to the history of links clicked by users. Because it could be monetized with potential advertisers, analyzing the clickstream data from thousands of Yahoo! servers needed a huge scalable database that was cost-effective to create and run.
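To make the clickstream idea concrete, here is a toy sketch in plain Python of counting clicks per link across several servers in a map-and-reduce style. It does not use Hadoop or any Hadoop API; the log format (`user_id<TAB>url`), the sample records, and the function names are all assumptions invented for this illustration.

```python
from collections import Counter


def map_clicks(log_lines):
    """Map step: pull the clicked URL out of each 'user_id<TAB>url' line."""
    for line in log_lines:
        _user, url = line.rstrip("\n").split("\t", 1)
        yield url


def reduce_counts(partial_counts):
    """Reduce step: merge the per-server counts into one overall tally."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Made-up logs from two hypothetical servers.
server_logs = [
    ["u1\t/news", "u2\t/mail", "u1\t/mail"],
    ["u3\t/news", "u2\t/news"],
]

# Each server's log can be counted independently (and so in parallel).
partials = [Counter(map_clicks(log)) for log in server_logs]
totals = reduce_counts(partials)
print(totals.most_common())
```

Because every server's log is summarized on its own before the small summaries are merged, adding more servers adds more independent work rather than a bigger single job, which is the property that made this approach attractive at scale.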
The early search engine company discovered that many commercial solutions of that time were either very expensive or entirely incapable of scaling to such huge data. Hence, Yahoo! had to develop the software from scratch, and so DIY enterprise supercomputing began.
Similar to Linux, Hadoop is designed as an open-source software tech. Just as Linux led to commodity clouds and clusters in HPC, Hadoop has developed a big data network of disruptive possibilities, new startups, old vendors, and new products.
Hadoop was created as portable software; it can be operated on other platforms aside from Linux. The ability to run open source software like Hadoop on a Microsoft OS was a crucial success for the open source community and a huge milestone at that time.
Yahoo! and Big Data
Knowing the history of Yahoo! is crucial in understanding the history of Big Data, because Yahoo! was the first company to operate at such massive scale. David Filo and Jerry Yang began Yahoo! as a tech project in order to index the internet. But as they worked on it, they realized that traditional indexing strategies could not cope with the explosion of content that needed to be indexed.
Even before the creation of Hadoop, Yahoo! needed a computing platform that could take the same amount of time to build the web index regardless of the growth rate of internet content. The creators realized that they needed to use the parallelism tactic from the high-performance computing world for the project to become scalable, and the computing grid of Yahoo! then became the cluster network that Hadoop was based on.
Just as important as Hadoop was Yahoo!’s innovation in restructuring their Operations and Engineering teams in order to support network platforms of this scale. The experience of Yahoo! in operating a large-scale computing platform spread across several locations resulted in the re-invention of the Information Technology department.
Complicated platforms had to be developed initially and deployed by small teams. Running an organization that scales up in order to support these platforms is an altogether separate matter. However, reinventing the IT department is just as important as getting the software and hardware to scale.
Similar to many corporate departments from Sales to HR, IT organizations conventionally attain scalability by centralizing the process. Having a dedicated team of IT experts manage a thousand storage arrays is more cost-effective than paying the salaries of a large team. However, Storage Admins usually don’t have working know-how of the numerous apps on these arrays.
Centralization trades the working knowledge of the generalist for subject matter expertise and cost efficiency. Businesses are now realizing the unintended risks of trade-offs made several years ago, which created silos that inhibit their capacity to use big data.
Conventional IT organizations divide expertise and responsibilities in ways that often constrain collaboration among and between teams. Minor glitches caused by miscommunication could be acceptable on a few minor email servers, but even a small glitch in production supercomputers may cost businesses money.
Even a small margin of error could result in a large difference. In the Big Data world, 100 terabytes is just Small Data, but a 1% error in 100 TB is 1 million MB. Detecting and resolving errors at this massive scale could consume many hours.
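The arithmetic behind that figure is easy to check. The short Python sketch below assumes decimal units (1 TB = 1,000,000 MB), which is the convention under which the numbers in the text work out; the variable names are only for illustration.

```python
MB_PER_TB = 10**6          # decimal units, assumed here

dataset_tb = 100           # size of the data set in terabytes
error_rate_percent = 1     # share of the data affected by the error

# Convert to megabytes first, then take the affected share.
dataset_mb = dataset_tb * MB_PER_TB
affected_mb = dataset_mb * error_rate_percent // 100

print(f"{dataset_tb} TB = {dataset_mb:,} MB")
print(f"{error_rate_percent}% of that = {affected_mb:,} MB")
```

In other words, the affected portion alone is a full terabyte, which is why even a seemingly tiny error rate becomes a substantial clean-up job at this scale.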
Yahoo! adopted the strategy used by the HPC community for more than two decades. Yahoovians learned that specialized teams with a working knowledge of the whole platform work best. Silos of data and responsibility become obstacles in both commercial supercomputing and scientific supercomputing.
Online-scale computing works because early adopters learned a new insight: supercomputers are finely tuned platforms with numerous interdependent parts, and they don’t run as processing silos. But in the 1980s, people viewed the computer as a machine with interdependent functionality layers.
This paradigm was easier to understand, but as platforms grew exponentially more sophisticated, the layer paradigm started to obscure the underlying sophistication, which impeded or even prevented effective triage of performance and reliability concerns. Similar to a Boeing 747, platforms for supercomputing should be understood as a whole collection of technologies, or manageability and efficiency could be affected.
Supercomputer Platforms
In the early stages of computer history, systems were considered platforms. These were called mainframes, and they were usually produced by companies that also supplied specialized teams of engineers who worked closely with their customers to make certain that the platform functioned according to its design.
This method was effective so long as you were satisfied being a customer of IBM. But when IBM started to make some changes in the 1960s, other companies provided more options and better prices. However, this resulted in the partitioning of the industry into silos.
Nowadays, enterprises that still dominate their silo still have the tendency to behave like a monopoly so long as they can get away with it. When storage, server, and database companies started to proliferate, IT organizations mimicked this alignment with their respective groups of storage, server, and database specialists.
But in order to effectively stand up a big data cluster, everyone who is working on the cluster should be organizationally and physically present. The collaborative work required for effective cluster deployments at this scale could be difficult to achieve at a sub-level of a silo.
If your business wants to embrace big data or come together in that magical place where Big Data Works in the Cloud, the IT department should reorganize some silos and study the platform well.
But in reality, many business organizations cannot easily handle such changes, especially if the change is too fast. Disruption and chaos have been constants in the industry, but they have always gone hand in hand with innovation and possibility. For businesses that are willing to embrace this tech, Big Data can be a stream of new ideas as well as enterprise opportunities.
Big Data Bang
As the network of Big Data evolves over the next decades, it will surely overwhelm both customers and vendors in various ways.
1. The impact on the silo mindset, both in the industry and in the organization, will be an important milestone of big data.
2. The IT industry will be bombarded by the new tech of big data, since most of the products that predate Hadoop simply do not function at this scale. Big Data software and hardware is many times faster than existing business-scale products and also a lot cheaper.
3. Tech as disruptive and new as Big Data is usually not easily welcomed in an established IT organization, because its organizational mandate compels it to focus on minimizing OPEX rather than encouraging innovation, which forces IT to be the devil’s advocate of Big Data.
4. IT companies will be disrupted by the new generation that will come after those who have focused on working with EMC, Microsoft, and Oracle. Big Data is considered the most important force in the IT industry since the introduction of the relational database.
5. In working with Big Data, programmers and data scientists are required to set things up with a better understanding of how the data will flow beneath. This includes the introduction, as well as the reintroduction, to the computing platform that makes it possible. This could be way beyond their comfort zones if they are entrenched inside silos. IT professionals who are open to learning new ways of thinking, working, and collaborating will prosper, and this prosperity could equate to efficiency.
6. Privacy and civil liberties could be compromised, as technology advancements will make it less expensive for any organization (public or private) to study data patterns as well as the individual behavior of anyone who accesses the internet.
Great Possibilities with Big Data
Nowadays, Big Data is not just for social networking or machine-generated online logs. Enterprises and agencies can seek answers to questions they may never have had the capacity to ask, and Big Data could help in identifying such questions.
For example, car producers can now access their worldwide parts inventory across numerous plants and also acquire tons of data (usually in petabytes) coming from the sensors that can be installed in the cars they have manufactured.
Other enterprises can now analyze and process tons of data while they are still collecting it in the field. For instance, prospecting for gold reserves will involve seismic sensors in the field acquiring tons of data that could be sent to HQ and analyzed within minutes.
In the past, this data had to be taken back to a costly data center and transferred to high-powered supercomputers, a process that took a lot of time. Today, a Hadoop cluster distributed across seismic trucks parked in a vacant lot could do the task within hours, and find patterns to determine the prospecting route for the next day.
In the field of agriculture, farmers can use hundreds of farm sensors that transmit data back to a Hadoop cluster installed in a barn in order to monitor the growth of the crops.
Government agencies are also using Hadoop clusters because these are more affordable. For instance, the CDC and the WHO are now using Big Data to track the spread of pandemics such as SARS or H1N1 as they happen.
Even though Big Data involves processing large data sets, the processing can be fast, thanks to parallelism. Hadoop could also be used for data sets that are not considered Big Data. A small Hadoop cluster can be considered an artificial retina.
Regardless of the form of data transmission, the data should still be collected into a cost-effective reservoir, so that the business or enterprise can fully realize these possibilities. The data reservoir cannot be considered just another drag-and-drop business warehouse. The data stored in the reservoir, similar to the fresh water stored in a water reservoir, should be used to sustain the operations of the business.