Big Data For Beginners

Understanding SMART Big Data, Data Mining & Data Analytics For Improved Business Performance, Life Decisions & More!

Vince Reynolds




Copyright 2016 by Vince Reynolds - All rights reserved.
This document is geared towards providing exact and reliable information in regard to the topic and issue covered. The publication is sold with the idea that the publisher is not required to render accounting, officially permitted, or otherwise qualified services. If advice is necessary, legal or professional, a practiced individual in the profession should be consulted.

- From a Declaration of Principles which was accepted and approved equally by a Committee of the American Bar Association and a Committee of Publishers and Associations.
In no way is it legal to reproduce, duplicate, or transmit any part of this document by either electronic means or in printed format. Recording of this publication is strictly prohibited, and any storage of this document is not allowed without written permission from the publisher. All rights reserved.

The information provided herein is stated to be truthful and consistent, in that any liability, in terms of inattention or otherwise, by any usage or abuse of any policies, processes, or directions contained within is the solitary and utter responsibility of the recipient reader.
Under no circumstances will any legal responsibility or blame be held against the publisher for any reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Respective authors own all copyrights not held by the publisher.
The information herein is offered for informational purposes solely, and is universal as such. The presentation of the information is without contract or any type of guarantee assurance.
The trademarks that are used are without any consent, and the publication of the trademark is without permission or backing by the trademark owner. All trademarks and brands within this book are for clarifying purposes only and are owned by the owners themselves, not affiliated with this document.



Table of Contents

Introduction
Chapter 1. A Conundrum Called ‘Big Data’
So, What Does Big Data Look Like?
The Purpose and Value of ‘Big Data’
How Big Data Changes Everything
Enterprise Supercomputing
Supercomputer Platforms
Chapter 2. Understanding Big Data Better
How To Value Data: 4 Measurable Characteristics Of Big Data
Volume Based Value
Velocity Based Value
Variety Based Value
What is Structured Data?
What is Unstructured Data?
Veracity Based Value
Cloud or in-house?
Big Data as the Ultimate Computing Platform
Big Data in Airplane Production
Big Data Platforms
Big Data and the Future
Fire Fighters and High Divers
Big Data is Do It Yourself Supercomputing
Platform Engineering and Big Data
Keep It Simple, Sunshine
Chapter 3. Big Data Analytics
Big Data and Ultra Speed
The Big Data Reality of Real Time
The Real Time Big Data Analytics Stack (RTBDA)
What Can Big Data Do For You?
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics
Top High Impact Use Cases of Big Data Analytics
Customer analytics
Operational analytics
Risk and Compliance Analytics
New Products and Services Innovation
Chapter 4. Why Big Data Matters
So, does Big Data really matter?
There are, however, other obstacles that remain
Chapter 5. A Closer Look at Key Big Data Challenges
Difficulty in Understanding and Utilizing Big Data
New, Complex, and Continuously Evolving Technologies
Data Security Related to Cloud Based Big Data Solutions
Chapter 6. Generating Business Value through Data Mining
The Business Value of Data
Data Storage
So What is Data Mining?
How Data Mining Can Help Your Business
The Data Mining Process
Technologies for Data Mining
Examples of Applications of Data Mining in Real World Setting
Data Mining Prospects
Top 10 Ways to Get a Competitive Edge through Data Mining
Conclusion




Introduction
If you are in the world of IT or business, you have probably heard about the Big Data phenomenon. You might even have encountered professionals who introduced themselves as data scientists. So you may be wondering: just what is this emerging area of science? What types of knowledge and problem-solving skills do data scientists have? What types of problems do data scientists solve with Big Data technology?
After reading this book, you will have the answers to these questions. In addition, you will begin to become proficient with important industry terms, applications, and tools, preparing you for a deeper understanding of the other important areas of Big Data.
Every day, our society creates about 3 quintillion bytes of data. You are probably wondering what 3 quintillion is. It is 3 followed by 18 zeros. And that, folks, is generated EVERY DAY. With this massive stream of data, the need to make sense of it becomes more crucial, and the demand for Big Data understanding is quickly increasing.
Business owners, large or small, must have basic knowledge of big data.




Chapter 1. A Conundrum Called ‘Big Data’

‘Big data’ is one of the latest technology trends profoundly affecting the way organizations utilize information to enhance the customer experience, improve their products and services, create untapped sources of revenue, transform business models, and even manage health care services efficiently. What makes it a highly trending topic is the fact that the effective use of big data almost always ends up with dramatic results. Yet the irony is that nobody really knows what ‘big data’ actually means.
There is no doubt that ‘big data’ is not just a highly trending IT buzzword. Rather, it is a fast-evolving concept in information technology and data management that is revolutionizing the way companies conduct their businesses. The sad part is that it is also turning out to be a classic conundrum, because no one, not even a group of the best IT experts or computer geeks, can come up with a definitive explanation describing exactly what it is. They always fall short of coming up with a description of ‘big data’ that is acceptable to all. At best, what most of these computer experts can come up with are roundabout explanations and sporadic examples. Try asking several IT experts what ‘big data’ is and you will get as many different answers as the number of people you ask.
What makes it even more complicated and difficult to understand is the fact that what is deemed ‘big’ now may not be that big in the near future, due to rapid advances in software technology and in the data management systems designed to handle it.
We also cannot escape the fact that we now live in a digital universe where everything we do leaves a digital trace we call data. At the center of this digital universe is the World Wide Web, from which comes a deluge of data that floods our consciousness every single second. With well over one trillion web pages (50 billion of which have already been indexed by and are searchable through the major search engines), the web offers us unparalleled interconnectivity, which allows us to interact with anyone and anything within a connected network we happen to be part of. Each one of these interactions also generates data that is coursed through and recorded in the web, adding to the ‘fuzziness’ of an already fuzzy concept. As a consequence, the web is continuously overflowing with data so massive that it is almost impossible to digest or crunch into usable segments for practical applications, if it is of any use at all. This enormous, ever-growing data that goes through and is stored in the web, together with the developing technologies designed to handle it, is what is collectively referred to as ‘big data’.



So, What Does Big Data Look Like?
If you want to have an idea of what ‘big data’ really looks like or how massive it truly is, try to visualize the following statistics, if you can, without getting dizzy. Think of the web, which currently covers more than 100 million domains and is still growing at the rate of 20,000 new domains every single day.
The data that comes from these domains is so massive and mind-boggling that it is practically immeasurable, much less manageable, by any conventional data management and retrieval methods available today. And that is only for starters. Add to this the 300 million daily Facebook posts, 60 million daily Facebook updates, and 250 million daily tweets coming from more than 900 million combined Facebook and Twitter users, and your imagination is sure to go through the roof. Don’t forget to include the voluminous data coming from the more than six billion smartphones currently in use, which continually access the internet to do business online, post status updates on social media, send out tweets, and carry out many other digital transactions. Remember, approximately one billion of these smartphones are GPS-enabled, which means they are constantly connected to the internet and are therefore continuously leaving behind digital trails, adding more data to the burgeoning bulk of information already stored in the millions of servers that span the internet.
And if your imagination still serves you right at this point, try contemplating the more than thirty billion point-of-sale transactions per year that are coursed through electronically connected POS devices. If you are still up to it, why not also go over the more than 10,000 credit card payments being made online or through other connected devices every single second. The sheer volume of the combined torrential data that envelops us unceasingly is unbelievable. “Mind-boggling” is an understatement. Stupefying would be more appropriate.
Don’t blink now, but the ‘big data’ that has been accumulated by the web over the past five years (since 2010) and is now stored in millions of servers scattered all over the globe far exceeds all of the prior data that had been produced and recorded throughout the whole history of mankind. The ‘big data’ we refer to includes anything and everything that has been fed into big data systems, such as social network chatter, the content of web pages, GPS trails, financial market data, online banking transactions, streaming music and videos, podcasts, satellite imagery, etc. It is estimated that over 2.5 quintillion bytes of data (2.5 × 10^18) are created by us every day. This massive flood of data, which we collectively call ‘big data’, just keeps getting bigger and bigger over time. Experts estimate that its volume will reach 35 zettabytes (35 × 10^21) by 2020.
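To get a feel for these magnitudes, the short Python sketch below works through the arithmetic on the two figures quoted above. The names (`daily_bytes`, `projected_bytes`, `days_to_accumulate`) are illustrative only, the units are assumed to be decimal (SI), and the constant daily rate is a simplifying assumption, not a claim about how data volume actually grows.

```python
# Decimal (SI) byte units, assumed here for simplicity.
EXABYTE = 10**18    # one quintillion bytes
ZETTABYTE = 10**21

# Figures quoted in the text.
daily_bytes = 25 * 10**17          # 2.5 quintillion bytes per day
projected_bytes = 35 * ZETTABYTE   # 35 zettabytes


def days_to_accumulate(total_bytes: int, bytes_per_day: int) -> float:
    """Days needed to reach total_bytes at a constant daily rate."""
    return total_bytes / bytes_per_day


days = days_to_accumulate(projected_bytes, daily_bytes)
print(f"Daily volume: {daily_bytes / EXABYTE} exabytes")
print(f"Days to reach 35 ZB at a constant rate: {days:,.0f}")
print(f"That is roughly {days / 365:.0f} years")
```

At a constant 2.5 quintillion bytes per day, 35 zettabytes would take 14,000 days, or roughly 38 years, to accumulate. A projection of that volume within a few years therefore only makes sense if the daily rate itself keeps climbing steeply.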
In essence, if and when data sets grow extremely big or become too complex for traditional data management tools to handle, they are considered ‘Big Data’. The problem is that there is no commonly set ceiling or accepted upper threshold beyond which the bulk of information starts to be classified as big data. In practice, what most companies normally do is to consider as big data whatever has outgrown their own database management tools. Big Data, in such a case, is the enormous data which they can no longer handle, either because it is too massive, too complex, or both. This means the ceiling varies from one company to another. In other words, different companies have different upper threshold limits for determining what constitutes big data. Almost always, the ceiling is determined by how much data their respective database management tools are able to handle at any given time. That is probably one of the reasons why the definition of ‘big data’ is so fuzzy.



The Purpose and Value of ‘Big Data’
Just as fuzzy and nebulous as its definition, the purpose or value of Big Data also appears to be unclear to many entrepreneurs. In fact, many of them are still groping in the dark, looking for answers to such questions as ‘why’ and ‘how’ to use ‘big data’ to grow their businesses.
If we take our cue from a poll conducted by Digitalist Magazine among the 300 participants at the Sapphire Now 2014 event held in Orlando, Florida, it appears that about 82% of companies have started to embrace Big Data as a critical component in achieving their strategic objectives. But despite the fact that 60% of them have started digging into big data, only 3% have gained the maturity or acquired sufficient knowledge and resources to sift through and manage such massive information. Apparently, the rest continue to grope in the dark.
It is therefore quite a wonder that, despite being constantly on the lookout for new ways to build and maintain a competitive edge, and despite relentlessly seeking new and innovative products and services to increase profitability, most companies still miss out on the many opportunities ‘big data’ has to offer. For whatever reason, they stop short of laying down the necessary groundwork to start managing and digging into ‘big data’ to extract new insights, create new value, and discover ways to stay ahead of their competitors.
For hidden deep within the torrent of the big data information stream is a wealth of useful knowledge and valuable behavioral and market patterns, simply waiting to be tapped, that companies (big or small) can use to fuel their growth and profitability. However, such valuable information has to be ‘mined’ and ‘refined’ first before it can be put to good use, much like drilling for oil that is buried underground.
Similar to oil, which has to be drilled and refined before you can harness its awesome power to the hilt, ‘big data’ users have to dig deep, sift through, and analyze the layers upon layers of data sets that make up big data before they can extract usable sets that have specific value to them.
In other words, like oil, big data becomes more valuable only after it is ‘mined’, processed, and analyzed for pertinent data that can be used to create new value. This cumbersome process is called big data analytics. Analytics is what gives big data its shine and makes it usable for application to specific cases. To make the story short, big data goes hand in hand with analytics. Without analytics, big data is nothing more than a bunch of meaningless digital trash.
The traditional way of processing big data, however, used to be a tough and expensive task to tackle. It involves analyzing massive volumes of data which traditional analytics and conventional business intelligence solutions can’t handle. It requires equally massive computer hardware and the most sophisticated data management software, designed primarily to handle such enormous and complicated information.
The giant corporations that started digging into big data ahead of everybody else had to spend fortunes on expensive hardware and groundbreaking data management software to be able to do it, albeit with a great deal of success. Their pioneering efforts revealed new insights that were buried deep in the maze of information clogging the internet servers, which they were able to retrieve and use to great advantage. For example, after analyzing geographical and social data and scrutinizing every business transaction, they discovered a new marketing factor called ‘peer influence’, which played an important role in shaping shopping preferences. This discovery allowed them to establish specific market needs and segments without the need to conduct tedious product samplings, thus blazing the trail for data-driven marketing.
All this while, the not-so-well-resourced companies could only watch in awe, sidelined by the prohibitive cost of processing big data. The good news is that this will not last long, because there is now affordable commodity hardware which the not-so-well-resourced can use to start their own big data projects. There are also cloud architectures and open source software which tremendously cut the cost of big data processing, making it highly feasible even for startup companies to tap into the huge potential of big data by simply subscribing to cloud services for server time.
At this point, only one thing is clear: every enterprise is left with no choice but to embrace the concept of big data and understand its implications for its business. Enterprises have to realize that data-driven marketing and data-driven product innovation are now the new norm.



How Big Data Changes Everything
Big Data is a kind of supercomputing that governments and businesses can use to keep track of a pandemic in real time, anticipate where the next terrorist attack will happen, improve the efficiency of restaurant chains, project voting patterns in elections, and predict the volatility of financial markets while it is happening.
Hence, many seemingly unrelated yet diverse data sources will be integrated into the big data network. As with any powerful technology, when used properly and effectively for good, Big Data could push mankind towards many possibilities. But if used with bad intentions, the risks could be very high and could even be damaging.
The need to get big data is immediate for different organizations. If a malevolent organization gets the technology first, then the rest of the world could be at risk. If a terrorist organization secured the technology before the CIA, the security of the USA could be compromised.
The solutions will require business establishments to be more creative at different levels, including organizational, financial, and technical. If the Cold War in the 1950s was all about getting the arms, today Big Data is the arms race.



Enterprise Supercomputing
Trends in the world of supercomputing are in some ways similar to those of the fashion industry: if you wait long enough, you get the chance to wear it again. Most of the technologies used in Big Data, such as distributed file systems, parallel processing, and clustering, have been used in different industries for many years.
Enterprise supercomputing was developed by online companies with worldwide operations (Yahoo!, Google, and Facebook) that need to process exponentially growing numbers of users and their profiles. They need to do this as fast as they can without spending too much money. This enterprise supercomputing is what is known as Big Data.
Big Data could cause disruptive changes to organizations, and can reach far beyond online communities to the social media platforms that span and connect the globe. Big Data is not just a fad. It is a crucial aspect of modern technology that will be used for generations to come.
Big data computing is actually not a new technology. Predicting the weather has long been a crucial big data concern. Early weather models were processed on a single supercomputer, which could occupy a whole gymnasium and integrated then-fast processing units with costly memory. Software during the 1970s was very crude, so most of the performance during that time was credited to the innovative engineering of the hardware.
Software technology improved in the 1990s, leading to a setup where programs once processed on one huge supercomputer could be partitioned into smaller programs running simultaneously on several workstations. Once all the programs are done processing, the results are collated and analyzed to forecast the weather for several weeks.
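The partition-then-collate idea described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names (`split`, `partial_stats`, `parallel_mean`) and the temperature readings are made up for the example, and a thread pool on one machine stands in for the separate workstations a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Partition a list into at most `parts` contiguous chunks."""
    if not values:
        raise ValueError("nothing to partition")
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """The 'smaller program': summarize one chunk independently."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4):
    """Run the small programs side by side, then collate their results."""
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical temperature readings, in degrees Celsius.
readings = [10.0, 20.0, 30.0, 40.0, 50.0]
print(parallel_mean(readings, workers=2))
```

Each chunk is summarized without any knowledge of the others, which is what allows the work to be spread over many machines. Only the small partial results have to be brought together at the end.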
But even during the 1990s, computer simulators needed about 15 days to calculate and project the weather for a week. Of course, it doesn’t help people to know that it was cloudy last week. Nowadays, parallel computer simulations that predict the weather for a whole week can be completed in a matter of hours.
In reality, these supercomputers cannot predict the weather. Instead, they just try to simulate and forecast its behavior. It is through human analysis that the weather is predicted. Hence, supercomputers alone cannot process Big Data and make sense of it.
Many weather forecasting agencies use different simulators with varying strengths. Computer simulators that are good at forecasting where a hurricane will make landfall in New York are not as accurate in forecasting how the humidity level could affect air operations at Atlanta International Airport.
Weather forecasters in every region study the results of several simulations with various sets of initial data. They not only pore over actual output from weather agencies, but also look at different instruments such as Doppler radar.
Even though there are tons of data involved, weather simulation is not categorized as Big Data, because a lot of computing is required. Scientific computing problems (usually in engineering and meteorology) are instead regarded as scientific supercomputing or high-performance computing (HPC).
Early electronic computers were designed to perform scientific computing, such as deciphering codes or calculating missile trajectories, all of which involve working on mathematical problems using millions of equations. Scientific calculations can also solve equations for non-scientific problems, such as rendering animated movies.
Big Data is regarded as the enterprise equivalent of HPC, and is also known as enterprise supercomputing or high-performance commercial computing. Big Data can also resolve huge computing problems, but it is more about discovering patterns and less about equations.
During the early 1960s, financial organizations such as banks and lending firms used enterprise computers to automate accounts and manage their credit card ventures. Nowadays, online businesses such as eBay and Amazon, and even large retailers, are using enterprise supercomputing to find solutions for the numerous business problems they encounter. However, enterprise supercomputing can be used for much more than studying customer attrition, managing subscribers, or discovering idle accounts.

Big Data and Hadoop
Hadoop is regarded as the first enterprise supercomputing software platform that works at scale and is quite affordable. It exploits the easy trick of parallelism that is already in use in the high-performance computing industry. Yahoo! developed this software to find a specific solution to a problem, but immediately realized that the software had the ability to solve other computing problems.
Even though the fortunes of Yahoo! changed drastically, it has made a large contribution to the incubation of Facebook, Google, and big data.
Yahoo! originally developed Hadoop to easily process the flood of clickstream data received by the search engine. Clickstream refers to the history of links clicked by users. Because this data could be monetized with potential advertisers, analyzing the clickstream from thousands of Yahoo! servers needed a hugely scalable database that was cost-effective to create and run.
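As a rough illustration of what analyzing clickstream data from many servers can look like, here is a small Python sketch in the map-and-reduce style. It does not use Hadoop or any of its APIs. The log format (a user id and a URL separated by a tab), the sample lines, and the names `map_clicks` and `reduce_counts` are all assumptions made up for this example.

```python
from collections import Counter
from functools import reduce


def map_clicks(log_lines):
    """Map step: count clicks per URL in the log of a single server."""
    counts = Counter()
    for line in log_lines:
        _user_id, url = line.split("\t")
        counts[url] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-server counts into one total."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Hypothetical logs from two servers, one "user<TAB>url" record per line.
server_a = ["u1\t/news", "u2\t/sports", "u1\t/news"]
server_b = ["u3\t/news", "u2\t/mail"]

totals = reduce_counts([map_clicks(server_a), map_clicks(server_b)])
print(totals.most_common(1))
```

Because each server's log is counted on its own, adding more servers only adds more independent map steps. The merge at the end handles small summaries rather than the raw logs.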
The early search engine company discovered that many commercial solutions of that time were either very expensive or entirely incapable of scaling to such huge data. Hence, Yahoo! had to develop the software from scratch, and so DIY enterprise supercomputing began.
Similar to Linux, Hadoop is designed as open source software. Just as Linux led to commodity clouds and clusters in HPC, Hadoop has developed a big data network of disruptive possibilities, new startups, old vendors, and new products.
Hadoop was created as portable software; it can be operated on platforms other than Linux. The ability to run open source software such as Hadoop on a Microsoft OS was a crucial success for the open source community and a huge milestone at the time.


Yahoo! and Big Data
Knowing the history of Yahoo! is crucial to understanding the history of Big Data, because Yahoo! was the first company to operate at such massive scale. Dave Filo and Jerry Yang began Yahoo! as a tech project to index the internet. But as they worked on it, they realized that traditional indexing strategies could not keep up with the explosion of content that had to be indexed.
Even before the creation of Hadoop, Yahoo! needed a computing platform that could take the same amount of time to build the web index regardless of the growth rate of internet content. The creators realized that they needed to use the parallelism tactic from the high-performance computing world for the project to become scalable, and the computing grid of Yahoo! then became the cluster network that Hadoop was based on.
Just as important as Hadoop was Yahoo!’s innovation in restructuring its Operations and Engineering teams to support network platforms of this scale. Yahoo!’s experience in operating a large-scale computing platform spread across several locations resulted in the reinvention of the Information Technology department.

Complicated platforms had to be developed initially and deployed by small teams. Running an organization that can scale up to support these platforms is an altogether separate matter. However, reinventing the IT department is just as important as getting the software and hardware to scale.
Like many corporate departments from Sales to HR, IT organizations conventionally attain scalability by centralizing the process. Having a dedicated team of IT experts manage a thousand storage arrays is more cost-effective than paying the salaries of a larger team. However, storage admins usually don’t have a working know-how of the numerous apps running on these arrays.
Centralization exchanges the working knowledge of the generalist for subject-matter expertise and cost efficiency. Businesses are now realizing the unintended risks of exchanges made several years ago, which created silos that inhibit their capacity to use big data.
Conventional IT organizations divide expertise and responsibilities in ways that often constrain collaboration among and between teams. Minor glitches caused by miscommunication might be acceptable on a few minor email servers, but even a small glitch on production supercomputers may cost businesses money.
Even a small margin of error could result in a large difference. In the Big Data world, 100 terabytes is just Small Data, but a 1% error in 100 TB is 1 million MB. Detecting and resolving errors at this massive scale could consume many hours.
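The arithmetic behind that figure is easy to check. The Python sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the helper name `bytes_in_error` is just an illustrative label.

```python
# Decimal (SI) units, assumed for this check.
MB = 10**6
TB = 10**12


def bytes_in_error(total_bytes: int, error_percent: int) -> int:
    """Number of bytes affected by a given whole-number error percentage."""
    return total_bytes * error_percent // 100


affected = bytes_in_error(100 * TB, 1)
print(f"{affected // TB} TB in error = {affected // MB:,} MB")
```

One percent of 100 TB is a full terabyte, which is one million megabytes. An error rate that sounds negligible still leaves a very large amount of data to track down and repair.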
Yahoo! adopted the strategy used by the HPC community for more than two decades. Yahoovians learned that specialized teams with a working knowledge of the whole platform work best. Silos of data and responsibility become obstacles in both commercial supercomputing and scientific supercomputing.
Online-scale computing works because early adopters learned a new insight: supercomputers are finely tuned platforms with numerous interdependent parts, and they don’t run as processing silos. But in the 1980s, people came to view the computer as a machine made up of independent layers of functionality.

This paradigm was easier to understand, but as platforms became exponentially more sophisticated, the layer paradigm started to hide the underlying sophistication, which impeded or even prevented effective triage of performance and reliability concerns.
Similar to a Boeing 747, platforms for supercomputing should be understood as a whole collection of technologies, or their manageability and efficiency could be affected.




Supercomputer Platforms
In the early stages of computer history, systems were considered platforms. These were called mainframes, and they were usually produced by companies that also supplied specialized teams of engineers who worked closely with their customers to make certain that the platform functioned according to its design.
This method was effective so long as you were satisfied being a customer of IBM. But when IBM started to make some changes in the 1960s, other companies provided more options and better prices. However, this resulted in the partitioning of the industry into silos.
Nowadays, enterprises that still dominate their silo tend to behave like a monopoly so long as they can get away with it. When storage, server, and database companies started to proliferate, IT organizations mimicked this alignment with their own corresponding groups of storage, server, and database specialists.
But in order to effectively stand up a big data cluster, each member who is working on the cluster should be organizationally and physically present. The collaborative work required for effective cluster deployments at this scale could be difficult to achieve from within a sub-level of a silo.
If your business wants to embrace big data, or come together in that magical place where Big Data Works in the Cloud, the IT department should reorganize some silos and study the platform well.
But in reality, many business organizations cannot easily handle such changes, especially if the change is too fast. Disruption and chaos have been constants in the industry, but they have always gone hand in hand with innovation and possibility. For businesses that are willing to embrace this technology, Big Data can be a stream of new ideas as well as enterprise opportunities.

Big Data Bang
As the network of Big Data evolves over the next decades, it will surely overwhelm both customers and vendors in various ways.
1. The impact on the silo mindset, both in the industry and in the organization, will be an important milestone of big data.
2. The IT industry will be bombarded by the new technology of big data, since most of the products created before Hadoop do not function at this scale at all. Big Data software and hardware are many times faster than existing business-scale products, and also a lot cheaper.
3. Technology as disruptive and new as Big Data is usually not easily welcomed in an established IT organization, because its organizational mandate compels it to focus on minimizing OPEX rather than encouraging innovation, which forces IT to be the devil’s advocate of Big Data.
4. IT companies will be disrupted by the new generation that will come after those who have focused on working with EMC, Microsoft, and Oracle. Big Data is considered the most important force in the IT industry since the introduction of the relational database.
5. In working with Big Data, programmers and data scientists are required to understand better how the data flows beneath their work. This includes an introduction, or reintroduction, to the computing platform that makes it possible. This could be way beyond their comfort zones if they are entrenched inside silos. IT professionals who are open to learning new ways of thinking, working, and collaborating will prosper, and this prosperity could equate to efficiency.
6. Privacy and civil liberties could be compromised, as technology advancements will make it less expensive for any organization (public or private) to study data patterns as well as the individual behavior of anyone who accesses the internet.

Great Possibilities with Big Data
Nowadays, Big Data is not just for social networking or machine-generated online logs. Enterprises and agencies can seek answers to questions they never before had the capacity to ask, and Big Data can help in identifying such questions.
For example, car producers can now access their worldwide parts inventory across numerous plants and also acquire tons of data (usually in petabytes) coming from the sensors installed in the cars they have manufactured.
Other enterprises can now analyze and process tons of data while they are still collecting it in the field. For instance, prospecting for gold reserves involves seismic sensors in the field acquiring tons of data that can be sent to HQ and analyzed within minutes.
In the past, this data had to be taken back to a costly data center and transferred to high-powered supercomputers, a process that took a lot of time. Today, a Hadoop cluster distributed across seismic trucks parked in a vacant lot can do the task within hours, and find patterns to determine the prospecting route for the next day.
In the field of agriculture, farmers can use hundreds of farm sensors that transmit data back to a Hadoop cluster installed in a barn in order to monitor the growth of the crops.
Government agencies are also using Hadoop clusters because these are more affordable. For instance, the CDC and the WHO are now using Big Data to track the spread of pandemics such as SARS or H1N1 as they happen.
Even though Big Data means processing large data sets, the processing can be fast, thanks to parallelism. Hadoop can also be used for data sets that are not considered Big Data. A small Hadoop cluster can be considered an artificial retina.
Regardless of the form of data transmission, the data should still be collected into a cost-effective reservoir, so that the business or enterprise can fully realize these possibilities. The data reservoir cannot be considered just another drag-and-drop business warehouse. The data stored in the reservoir, like the fresh water stored in a water reservoir, should be used to sustain the operations of the business.







