Advanced Information and Knowledge Processing
SpringerBriefs in Advanced Information and Knowledge Processing
Series Editors
Xindong Wu
School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA
Lakhmi Jain
University of Canberra, Adelaide, SA, Australia
SpringerBriefs in Advanced Information and Knowledge Processing presents concise research in
this exciting field. Designed to complement Springer's Advanced Information and Knowledge
Processing series, this Briefs series provides researchers with a forum to publish their
cutting-edge research which is not yet mature enough for a book in the Advanced Information
and Knowledge Processing series, but which has grown beyond the level of a workshop paper
or journal article.
Typical topics may include, but are not restricted to:
Big Data analytics
Big Knowledge
Bioinformatics
Business intelligence
Computer security
Data mining and knowledge discovery
Information quality and privacy
Internet of things
Knowledge management
Knowledge-based software engineering
Machine intelligence
Ontology
Semantic Web
Smart environments
Soft computing
Social networks
SpringerBriefs are published as part of Springer's eBook collection, with millions of users
worldwide, and are available for individual print and electronic purchase. Briefs are
characterized by fast, global electronic dissemination, standard publishing contracts,
easy-to-use manuscript preparation and formatting guidelines and expedited production
schedules to assist researchers in distributing their research fast and efficiently.
More information about this series at http://www.springer.com/series/16024
Rajendra Akerkar
Models of Computation for Big Data
Rajendra Akerkar
Western Norway Research Institute, Sogndal, Norway
ISSN 1610-3947
e-ISSN 2197-8441
Advanced Information and Knowledge Processing
ISSN 2524-5198
e-ISSN 2524-5201
SpringerBriefs in Advanced Information and Knowledge Processing
ISBN 978-3-319-91850-1
e-ISBN 978-3-319-91851-8
Library of Congress Control Number: 2018951205
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2018
This work is subject to copyright. All rights are solely and exclusively licensed by the
Publisher, whether the whole or part of the material is concerned, specifically the rights of
translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and
retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in
this publication does not imply, even in the absence of a specific statement, that such names
are exempt from the relevant protective laws and regulations and therefore free for general
use.
The publisher, the authors and the editors are safe to assume that the advice and information
in this book are believed to be true and accurate at the date of publication. Neither the
publisher nor the authors or the editors give a warranty, express or implied, with respect to
the material contained herein or for any errors or omissions that may have been made. The
publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This book addresses algorithmic problems in the age of big data. Rapidly increasing volumes
of diverse data from distributed sources create challenges for extracting valuable knowledge
and commercial value from data. This motivates increased interest in the design and analysis
of algorithms for rigorous analysis of such data.
The book covers mathematically rigorous models, as well as some provable limitations of
algorithms operating in those models. Most techniques discussed in the book come from
research in the last decade, and many of the algorithms we discuss have important
applications in Web data compression, approximate query processing in databases, network
measurement, signal processing and so on. We discuss lower bound methods in some models,
showing that many of the algorithms we present are optimal or near optimal. The book itself
focuses on the underlying techniques rather than the specific applications.
This book grew out of my lectures for the course on big data algorithms. The study of
algorithms for modern data models has been a success in research, teaching and practice,
which has to be attributed to the efforts of a growing number of researchers in the field,
among them Piotr Indyk, Jelani Nelson, S. Muthukrishnan and Rajiv Motwani. Their excellent
work is the foundation of this book. This book is intended for both graduate students and
advanced undergraduate students who satisfy the prerequisites in discrete probability, basic
algorithmics and linear algebra.
I wish to express my heartfelt gratitude to my colleagues at Vestlandsforsking, Norway,
and Technomathematics Research Foundation, India, for their encouragement in persuading
me to consolidate my teaching materials into this book. I thank Minsung Hong for help with
the LaTeX typing. I would also like to thank Helen Desmond and the production team at
Springer. Thanks to the INTPART programme funding for partially supporting this book
project. The love, patience and encouragement of my father, son and wife made this project
possible.
Rajendra Akerkar
Sogndal, Norway
May 2018
Contents
1 Streaming Models
1.1 Introduction
1.2 Space Lower Bounds
1.3 Streaming Algorithms
1.4 Non-adaptive Randomized Streaming
1.5 Linear Sketch
1.6 Alon–Matias–Szegedy Sketch
1.7 Indyk's Algorithm
1.8 Branching Program
1.8.1 Light Indices and Bernstein's Inequality
1.9 Heavy Hitters Problem
1.10 Count-Min Sketch
1.10.1 Count Sketch
1.10.2 Count-Min Sketch and Heavy Hitters Problem
1.11 Streaming k-Means
1.12 Graph Sketching
1.12.1 Graph Connectivity
2 Sub-linear Time Models
2.1 Introduction
2.2 Fano's Inequality
2.3 Randomized Exact and Approximate Bound
2.4 t-Player Disjointness Problem
2.5 Dimensionality Reduction
2.5.1 Johnson–Lindenstrauss Lemma
2.5.2 Lower Bounds on Dimensionality Reduction
2.5.3 Dimensionality Reduction for k-Means Clustering
2.6 Gordon's Theorem
2.7 Johnson–Lindenstrauss Transform
2.8 Fast Johnson–Lindenstrauss Transform
2.9 Sublinear-Time Algorithms: An Example
2.10 Minimum Spanning Tree
2.10.1 Approximation Algorithm
3 Linear Algebraic Models
3.1 Introduction
3.2 Sampling and Subspace Embeddings
3.3 Non-commutative Khintchine Inequality
3.4 Iterative Algorithms
3.5 Sarlós Method
3.6 Low-Rank Approximation
3.7 Compressed Sensing
3.8 The Matrix Completion Problem
3.8.1 Alternating Minimization
4 Assorted Computational Models
4.1 Cell Probe Model
4.1.1 The Dictionary Problem
4.1.2 The Predecessor Problem
4.2 Online Bipartite Matching
4.2.1 Basic Approach
4.2.2 Ranking Method
4.3 MapReduce Programming Model
4.4 Markov Chain Model
4.4.1 Random Walks on Undirected Graphs
4.4.2 Electric Networks and Random Walks
4.4.3 Example: The Lollipop Graph
4.5 Crowdsourcing Model
4.5.1 Formal Model
4.6 Communication Complexity
4.6.1 Information Cost
4.6.2 Separation of Information and Communication
4.7 Adaptive Sparse Recovery
References
1. Streaming Models
Rajendra Akerkar (1)
(1) Western Norway Research Institute, Sogndal, Norway
1.1 Introduction
In the analysis of big data there are queries that do not scale, since they need massive
computing resources and time to generate exact results. Examples include count distinct,
most frequent items, joins, matrix computations, and graph analysis. If approximate results
are acceptable, there is a class of dedicated algorithms, known as streaming algorithms or
sketches, that can produce results orders of magnitude faster and with mathematically
proven error bounds. For interactive queries there may be no other practical option, and in
the case of real-time analysis, sketches are the only recognized solution.
Streaming data is a sequence of digitally encoded signals used to represent information in
transmission. For streaming data, the input data to be operated on are not available all
at once, but rather arrive as continuous data sequences. A data stream is a sequence of data
elements that is far larger than the amount of available memory. More often than not, an
element will simply be an (integer) number from some range. However, it is often convenient
to allow other data types, such as multidimensional points, metric points, graph vertices
and edges, etc. The goal is to approximately compute some function of the data using only
one pass over the data stream. The critical aspect in designing data stream algorithms is
that any data element that has not been stored is ultimately lost forever. Hence, it is
vital that data elements are properly selected and preserved. Data streams arise in several
real-world applications. For example, a network router must process terabits of packet data,
which cannot all be stored by the router. Nevertheless, there are many statistics and
patterns of the network traffic that are useful to know in order to detect unusual network
behaviour. Data stream algorithms enable computing such statistics fast while using little
memory. In streaming we want to maintain a sketch F(X) on the fly as X is updated. For
example, if numbers arrive on the fly, we can keep a running sum, which is a streaming
algorithm. The streaming setting appears in many places: a router, for example, can monitor
online traffic, and a sketch of the traffic volume helps to find the traffic pattern.
The fundamental mathematical ideas to process streaming data are sampling and random
projections. Many different sampling methods have been proposed, such as domain sampling,
universe sampling, reservoir sampling, etc. There are two main difficulties with sampling for
streaming data. First, sampling is not a powerful primitive for many problems, since too many
samples are needed for performing sophisticated analysis, and lower bounds on the number of
samples are known. Second, as the stream unfolds, if the samples maintained by the algorithm
get deleted, one may be forced to resample from the past, which is in general expensive or
impossible in practice and, in any case, not allowed in streaming data problems. Random
projections rely on dimensionality reduction, using projection along random vectors. The
random vectors are generated by space-efficient computation of random variables. These
projections are called sketches. There are many variations of random projections, some of
which are of simpler type.
Sampling and sketching are two basic techniques for designing streaming algorithms. The
idea behind sampling is simple to understand. Every arriving item is preserved with a certain
probability, and only a subset of the data is kept for further computation. Sampling is also
easy to implement, and has many applications. Sketching is the other technique for designing
streaming algorithms. Sketch techniques have undergone wide development within the past
few years. They are particularly appropriate for the data streaming scenario, in which large
quantities of data flow by and the sketch summary must continually be updated rapidly
and compactly. A sketch-based algorithm creates a compact synopsis of the data which has
been observed, and the size of the synopsis is usually much smaller than the full observed
data. Each update observed in the stream potentially causes this synopsis to be updated, so
that the synopsis can be used to approximate certain functions of the data seen so far. In
order to build a sketch, we should be able either to perform a single linear scan of the
input data (in no strict order), or to scan the entire stream of updates which collectively
build up the input. Note that many sketches were originally designed for computations in
situations where the input is never collected together in one place, but exists only
implicitly as defined by the stream. A sketch F(X) with respect to some function f is a
compression of the data X. It allows us to compute f(X) (approximately) given access only to
F(X). A sketch of a large-scale data set is a small data structure that lets you approximate
particular characteristics of the original data. The exact nature of the sketch depends on
what you are trying to approximate as well as on the nature of the data.
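The sampling idea can be illustrated with reservoir sampling, one of the methods named earlier in this section. The following is a minimal sketch, not taken from the book: the function name `reservoir_sample` and its parameters are hypothetical, and it assumes an insertion-only stream.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream in one pass.

    Hypothetical helper for illustration; assumes items are only inserted.
    """
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item t replaces a random slot with probability k / t.
            j = rng.randrange(t)  # uniform in {0, ..., t-1}
            if j < k:
                sample[j] = item
    return sample


rng = random.Random(0)
sample = reservoir_sample(range(1000), 10, rng)
```

Only the k sampled items and the position counter are stored, so the memory used does not grow with the length of the stream.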
The goal of the streaming algorithm is to make one pass over the data and to use limited
memory to compute functions of the data vector x, such as the frequency moments, the number
of distinct elements, the heavy hitters, and, treating x as a matrix, various quantities in
numerical linear algebra such as a low-rank approximation. Since computing these quantities
exactly or deterministically often requires a prohibitive amount of space, these algorithms
are usually randomized and approximate.
Many algorithms that we will discuss in this book are randomized, since randomization is
often necessary to achieve good space bounds. A randomized algorithm is an algorithm that can
toss coins and take different actions depending on the outcome of those tosses. Randomized
algorithms have several advantages over deterministic ones. Usually, randomized algorithms
tend to be simpler than deterministic algorithms for the same task. For example, the strategy
of picking a random element to partition the problem into subproblems and recursing on one of
the partitions is much simpler than its deterministic alternatives. Further, for some
problems randomized algorithms have a better asymptotic running time than their deterministic
counterparts. Randomization can be beneficial when the algorithm faces a lack of information,
and it is also very useful in the design of online algorithms that learn their input over
time, or in the design of oblivious algorithms that output a single solution that is good for
all inputs. Randomization, in the form of sampling, can help us estimate the size of
exponentially large spaces or sets.
1.2 Space Lower Bounds
The advent of cutting-edge communication and storage technology enables large amounts of raw
data to be produced daily, and subsequently there is a rising demand to process this data
efficiently. Since it is unrealistic for an algorithm to store even a small fraction of the
data stream, its performance is typically measured by the amount of space it uses. In many
scenarios, such as internet routing, once a stream element is examined it is lost forever
unless explicitly saved by the processing algorithm. This, along with the sheer size of the
data, makes multiple passes over the data impracticable.
Let us consider the distinct elements problem: find the number of distinct elements in
a stream, where queries and additions are allowed. We denote by s the space of the algorithm,
by n the size of the universe from which the elements arrive, and by m the length of the
stream.
Theorem 1.1 There is no deterministic exact algorithm for computing the number of distinct
elements in $o(\min\{n, m\})$ space (Alon et al. 1999).
Proof Using a streaming algorithm with space s for the problem, we are going to show how to
encode $\{0,1\}^n$ using only s bits. In other words, we are going to produce an injective
mapping from $\{0,1\}^n$ to $\{0,1\}^s$. Hence, this implies that s must be at least n. We
look for procedures Enc and Dec such that $\mathrm{Dec}(\mathrm{Enc}(x)) = x$ for all x, and
Enc(x) is a function from $\{0,1\}^n$ to $\{0,1\}^s$.
In the encoding procedure, given a string x, devise a stream by adding i at the end of the
stream if $x_i = 1$. Then Enc(x) is the memory content of the algorithm on that stream.
In the decoding procedure, let us consider each i in turn, add it at the end of the stream,
and then query the number of distinct elements. If the number of distinct elements increases,
this implies that $x_i = 0$; otherwise it implies that $x_i = 1$. So we can recover x
completely. Hence proved.
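The encoding argument can be traced in code. The following is a minimal sketch under stated assumptions: `ExactDistinctCounter` is a hypothetical stand-in for an arbitrary deterministic exact algorithm (here it simply stores a set), and `encode` and `decode` are illustrative names for the procedures Enc and Dec of the proof.

```python
import copy


class ExactDistinctCounter:
    """Hypothetical deterministic exact algorithm; its state plays the role of memory."""

    def __init__(self):
        self._seen = set()

    def add(self, item):
        self._seen.add(item)

    def query(self):
        return len(self._seen)


def encode(x):
    """Enc(x): feed index i whenever x[i] == 1 and return the memory content."""
    alg = ExactDistinctCounter()
    for i, bit in enumerate(x):
        if bit == 1:
            alg.add(i)
    return alg


def decode(memory, n):
    """Dec: for each i, restart from the saved memory, append i, and query."""
    x = []
    for i in range(n):
        alg = copy.deepcopy(memory)
        before = alg.query()
        alg.add(i)
        # The count increases exactly when i was not in the encoded stream.
        x.append(0 if alg.query() > before else 1)
    return x


x = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = decode(encode(x), len(x))
```

Because every x is recovered from the memory content alone, distinct strings must lead to distinct memory states, which is the injectivity used in the proof.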
Now we show that even approximation does not help a deterministic algorithm for this problem.
Theorem 1.2 Any deterministic $F_0$ algorithm that provides a 1.1 approximation requires
$\Omega(n)$ space.
Proof Suppose we had a collection F of subsets of $[n]$ fulfilling the following:
$|F| \ge 2^{cn}$, for some constant $c < 1$.
$|S| = n/100$ for every $S \in F$.
$|S \cap T| \le n/2000$ for all $S \ne T \in F$.
Let us consider the algorithm to encode the vectors $x_S$ for all $S \in F$, where $x_S$ is
the indicator vector of the set S. The lower bound follows since we must have $s \ge cn$. The
encoding procedure is the same as in the previous proof.
In the decoding procedure, let us iterate over all sets and test for each set S if it
corresponds to our initial encoded set. Let M be the memory contents of the streaming
algorithm after the initial string has been inserted. Then for each S, we initialize the
algorithm with memory contents M and then feed element i if $i \in S$. If S equals the
initial encoded set, the reported number of distinct elements changes at most slightly,
whereas if it is not, it almost doubles. Considering the approximation guarantee of the
algorithm, we can tell the two cases apart, because if S is not our initial set then the
number of distinct elements grows by a factor of at least 1.95.
In order to confirm the existence of such a family of sets F, we partition $[n]$ into
$n/100$ intervals of length 100 each. To form a set S we select one number from each interval
uniformly at random. Obviously, such a set has size exactly $n/100$. For two sets S, T
selected uniformly at random as before, let $U_i$ be the random variable that equals 1 if
they have the same number selected from interval i. So, $P[U_i = 1] = 1/100$. Hence the
expected size of the intersection is just
$E \sum_{i=1}^{n/100} U_i = \frac{n}{100} \cdot \frac{1}{100}$. The probability that this
intersection is bigger than five times its mean is smaller than $e^{-c'n}$ for some constant
$c'$, by a standard Chernoff bound.
Finally, by applying a union bound over all feasible intersections one can prove the result.
1.3 Streaming Algorithms
An important aspect of streaming algorithms is that these algorithms have to be
approximate. There are a few things that one can compute exactly in a streaming manner, but
there are many crucial things that one cannot compute that way, so we have to approximate.
Most significant aggregates can be approximated online. There are two ways: (1) Hashing,
which turns the identity of an item into a pseudo-random hash value. (2) Sketching: you can
take a very large amount of data and build a very small sketch of the data. Carefully done,
you can use the sketch to get values of interest. The challenge is to find a good sketch.
All of the algorithms discussed in this chapter use sketching of some kind and some use
hashing as well. One popular streaming algorithm is HyperLogLog by Flajolet. Cardinality
estimation is the task of determining the number of distinct elements in a data stream. While
the cardinality can be easily computed using space linear in the cardinality, for several
applications this is totally unrealistic and requires too much memory. Therefore, many
algorithms that approximate the cardinality while using fewer resources have been developed.
HyperLogLog is one of them. These algorithms play an important role in network monitoring
systems, data mining applications, as well as database systems. The basic idea is that if we
have n samples that are hashed and inserted into the interval [0, 1), those n samples are
going to make $n + 1$ intervals. Therefore, the average size of the $n + 1$ intervals has to
be $1/(n+1)$. By symmetry, the average distance to the minimum of those hashed values is also
going to be $1/(n+1)$. Furthermore, duplicate values will go exactly on top of previous
values, thus n is the number of unique values we have inserted. For instance, if we have ten
samples, the minimum is going to be right around 1/11. HyperLogLog is shown to be near
optimal among algorithms that are based on order statistics.
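The order-statistics idea can be tried out directly. The sketch below is not HyperLogLog: it is a simpler k-minimum-values estimator, chosen here as an assumption because it applies the same interval argument to the k-th smallest hash instead of only the minimum, which lowers the variance. The names `hash01` and `kmv_estimate` are hypothetical, and SHA-256 merely stands in for an idealized hash into [0, 1).

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-random point in [0, 1) (SHA-256 used as a stand-in)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def kmv_estimate(stream, k=256):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # negated values, so heap[0] is minus the largest kept hash
    members = set()  # hash values currently kept
    for item in stream:
        h = hash01(item)
        if h in members:
            continue  # duplicates land exactly on top of earlier values
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: the count is exact
    # The k-th smallest of n uniform points sits near k / (n + 1).
    return (k - 1) / (-heap[0])


stream = [i % 5000 for i in range(20000)]  # 5000 distinct values, each repeated
estimate = kmv_estimate(stream, k=256)
```

The memory used is k hash values regardless of the stream length, and the relative error shrinks roughly like the inverse square root of k.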
1.4 Non-adaptive Randomized Streaming
Non-trivial update time lower bounds for randomized streaming algorithms in the Turnstile
Model were presented in (Larsen et al. 2014). Only a specific restricted class of
randomized streaming algorithms, namely those that are non-adaptive, could be bounded.
Most well-known turnstile streaming algorithms in the literature are non-adaptive. The lower
bounds of (Larsen et al. 2014) apply to both randomized and deterministic turnstile streaming
algorithms, and hold when the algorithms are non-adaptive.
Definition 1.1 A non-adaptive randomized streaming algorithm is an algorithm that may toss
random coins before processing any elements of the stream, and in which, on any update
operation, the words read from and written to memory are determined by the index of the
updated element and the initially tossed coins.
These constraints mean that memory must not be read or written based on the current state of
the memory, but only according to the coins and the index. Comparing the above definition to
the sketches, a hash function chosen independently from any desired hash family can emulate
these coins, enabling the update algorithm to find the specific words of memory to update
using only the hash function and the index of the element to update. This makes the
non-adaptive restriction fit well with the Turnstile Model algorithms. Both the Count-Min
Sketch and the Count-Median Sketch are non-adaptive and support point queries.
1.5 Linear Sketch
Many data stream problems cannot be solved with just a sample. We can rather make use of
data structures which include a contribution from the entire input, instead of simply the
items picked in the sample. For instance, consider trying to count the number of distinct
objects in a stream. It is easy to see that unless almost all items are included in the
sample, we cannot tell whether they are the same or distinct. Since a streaming algorithm
gets to see each item in turn, it can do better. We consider a sketch as a compact data
structure which summarizes the stream for certain types of query. It is a linear
transformation of the stream: we can imagine the stream as defining a vector, and the
algorithm computes the product of a matrix with this vector.
As we know, a data stream is a sequence of data, where each item belongs to the universe.
A data streaming algorithm takes a data stream as input and computes some function of the
stream. Further, the algorithm has access to the input in a streaming fashion, i.e. the
algorithm cannot read the input in another order, and in most cases the algorithm can only
read the data once. Depending on how items in the universe are expressed in the data stream,
there are two typical models:
Cash Register Model: Each item in the stream is an item of the universe. Different items come
in an arbitrary order.
Turnstile Model: In this model we have a multi-set. Every incoming item is linked with one
of two special symbols to indicate the dynamic changes of the data set. The turnstile model
captures most practical situations in which the data set may change over time. The model is
also known as dynamic streams.
We now discuss the turnstile model in streaming algorithms. In the turnstile model, the
stream consists of a sequence of updates where each update either inserts an element or
deletes one, but a deletion cannot delete an element that does not exist. When there are
duplicates, this means that the multiplicity of any element cannot go negative.
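In the model there is a vector $x \in \mathbb{R}^n$ that starts as the all-zero vector, and
then a sequence of updates comes. Each update is of the form $(i, \Delta)$, where $i \in [n]$
and $\Delta \in \mathbb{R}$. This matches to the operation $x_i \leftarrow x_i + \Delta$.
Given a function f, we want to approximate f(x). For example, in the distinct elements
problem $\Delta$ is always 1 and $f(x) = |\{i : x_i \ne 0\}|$.
The well-known approach for designing turnstile algorithms is linear sketching. The idea
is to preserve in memory $y = \Pi x$, where $\Pi \in \mathbb{R}^{m \times n}$, a matrix that
is short and fat. We know that $m < n$, obviously much smaller. We can see that y is
m-dimensional, so we can store it efficiently, but if we need to store the whole $\Pi$ in
memory then we will not get a space-wise better algorithm. Hence, there are two options in
creating and storing $\Pi$.
$\Pi$ is deterministic, and so we can easily compute $\Pi_{ij}$ without keeping the whole
matrix in memory.
$\Pi$ is defined by k-wise independent hash functions for some small k, so we can afford
storing the hash functions and computing $\Pi_{ij}$.
Let $\Pi^i$ be the ith column of the matrix $\Pi$. Then $\Pi x = \sum_{i=1}^{n} \Pi^i x_i$.
So by storing $y = \Pi x$, when the update $(i, \Delta)$ occurs we have that the new y equals
$\Pi(x + \Delta e_i) = \Pi x + \Delta \Pi^i$. The first summand is the old y and the second
summand is simply a multiple of the ith column of $\Pi$.
This is how updates take place when we have a linear sketch.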
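The update rule can be checked on a toy instance. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the random sign matrix is stored explicitly only so that the result can be compared against a direct product, which is precisely what a real algorithm would avoid.

```python
import random


def make_sign_matrix(m, n, rng):
    """Explicit m x n matrix with random +1/-1 entries (for illustration only)."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]


def apply_update(y, Pi, i, delta):
    """Turnstile update (i, delta): add delta times the i-th column of Pi to y."""
    for r in range(len(y)):
        y[r] += delta * Pi[r][i]


rng = random.Random(1)
n, m = 20, 4
Pi = make_sign_matrix(m, n, rng)
updates = [(3, 5), (7, -2), (3, -5), (11, 4), (7, 1)]

x = [0] * n  # kept here only to verify the sketch
y = [0] * m  # the linear sketch y = Pi x
for i, delta in updates:
    x[i] += delta
    apply_update(y, Pi, i, delta)

direct = [sum(Pi[r][j] * x[j] for j in range(n)) for r in range(m)]
```

After the stream has been processed, `y` agrees with `direct`, the product computed from the final vector, although the algorithm never needed x itself.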
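Now let us consider the Moment Estimation Problem (Alon et al. 1999). The problem of
estimating (frequency) moments of a data stream has attracted a lot of attention since the
inception of streaming algorithms. Let $F_p = \|x\|_p^p = \sum_{i=1}^{n} |x_i|^p$. We want to
estimate the space needed to solve the moment estimation problem as p changes. There is a
transition point in the complexity of $F_p$.
For $0 \le p \le 2$, $\mathrm{poly}(\log n / \varepsilon)$ space is achievable for a
$(1 \pm \varepsilon)$ approximation with 2/3 success probability (Alon et al. 1999;
Indyk 2006). For $p > 2$ we need exactly
$\Theta(n^{1-2/p}\, \mathrm{poly}(\log n / \varepsilon))$ bits of space for a
$(1 \pm \varepsilon)$ approximation with 2/3 success probability (Bar-Yossef et al. 2004;
Indyk and Woodruff 2005).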
1.6 Alon–Matias–Szegedy Sketch
Streaming algorithms aim to summarize a large volume of data into a compact summary, by
maintaining a data structure that can be incrementally modified as updates are observed.
They allow the approximation of particular quantities. Alon–Matias–Szegedy (AMS) sketches
(Alon et al. 1999) are randomized summaries of the data that can be used to compute
aggregates such as the second frequency moment and sizes of joins. AMS sketches can be
viewed as random projections of the data in the frequency domain on ±1 pseudo-random
vectors. The key property of AMS sketches is that the product of projections on the same
random vector of frequencies of the join attribute of two relations is an unbiased estimate
of the size of the join of the relations. While a single AMS sketch is inaccurate, multiple
such sketches can be computed and combined using averages and medians to obtain an estimate
of any desired precision.
In particular, the AMS Sketch is focused on approximating the sum of squared entries of a
vector defined by a stream of updates. This quantity is naturally related to the Euclidean
norm of the vector, and so has many applications in high-dimensional geometry, and in data
mining and machine learning settings that use vector representations of data. The data
structure maintains a linear projection of the stream with a number of randomly chosen
vectors. These random vectors are defined implicitly by simple hash functions, and so do not
have to be stored explicitly. Varying the size of the sketch changes the accuracy guarantees
on the resulting estimation. The fact that the summary is a linear projection means that it
can be updated flexibly, and sketches can be combined by addition or subtraction, yielding
sketches corresponding to the addition and subtraction of the underlying vectors.
A common feature of (Count-Min and AMS) sketch algorithms is that they rely on hash
functions on item identifiers, which are relatively easy to implement and fast to compute.
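Definition 1.2 A family H of functions from $[a]$ to $[b]$ is a k-wise independent hash
family if for all distinct $i_1, \ldots, i_k \in [a]$ and all $j_1, \ldots, j_k \in [b]$,
$\Pr_{h \in H}[h(i_1) = j_1 \wedge \cdots \wedge h(i_k) = j_k] = 1/b^k$.
There are two versions of the AMS algorithm. The faster version, based on hashing, is also
referred to as fast AMS to distinguish it from the original "slower" sketch, since each
update is very fast.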
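Algorithm:
1. Consider a random hash function $h : [n] \to \{-1, +1\}$ from a four-wise independent
family.
2. Let $v_i = h(i)$.
3. Let $y = \langle v, x \rangle$, output $y^2$.
4. $y^2$ is an unbiased estimator with variance big-Oh of the square of its expectation.
5. Sample $y^2$ $m_1 = O(1/\varepsilon^2)$ independent times: $\{y_1^2, \ldots, y_{m_1}^2\}$.
Use Chebyshev's inequality to obtain a $(1 \pm \varepsilon)$-approximation with probability
2/3.
6. Let $\bar{y} = \frac{1}{m_1} \sum_{i=1}^{m_1} y_i^2$.
7. Sample $\bar{y}$ $m_2 = O(\log(1/\delta))$ independent times:
$\{\bar{y}_1, \ldots, \bar{y}_{m_2}\}$. Take the median to get a
$(1 \pm \varepsilon)$-approximation with probability $1 - \delta$.
Each of the hash functions takes $O(\log n)$ bits to store, and there are
$O(\varepsilon^{-2} \log(1/\delta))$ hash functions in total.
Lemma 1.1 $E[y^2] = \|x\|_2^2$.
Proof
$E[y^2] = E\big[(\sum_i v_i x_i)^2\big] = E\big[\sum_i v_i^2 x_i^2 + \sum_{i \ne j} v_i v_j x_i x_j\big] = \sum_i x_i^2 = \|x\|_2^2$,
where $E[v_i v_j] = E[v_i] E[v_j] = 0$ for $i \ne j$ since pair-wise independence.
Lemma 1.2 $E[(y^2 - E[y^2])^2] \le 2\|x\|_2^4$.
Proof
$E[(y^2 - E[y^2])^2] = E\big[(\sum_{i \ne j} v_i v_j x_i x_j)^2\big] = 4 \sum_{i < j} x_i^2 x_j^2 \le 2\|x\|_2^4$,
where $E[v_i^2 v_j v_k] = E[v_j] E[v_k] = 0$ for distinct i, j, k since pair-wise
independence, and $E[v_i v_j v_k v_l] = 0$ for distinct i, j, k, l since four-wise
independence.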
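The algorithm and its median-of-means analysis can be put together in a few lines. The following is a minimal sketch under stated assumptions: the class name `AMSSketch` and its parameters are hypothetical, the four-wise independent signs are obtained from a random cubic polynomial modulo the prime $2^{61} - 1$ (the low bit of the value serves as the sign, which is unbiased only up to a negligible term), and the group sizes are illustrative rather than tuned to a particular $\varepsilon$ and $\delta$.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # modulus for the polynomial hash (an assumption of this sketch)


class AMSSketch:
    """Median-of-means estimate of F_2 = sum_i x_i^2 under turnstile updates."""

    def __init__(self, groups, per_group, rng):
        self.groups = groups
        self.per_group = per_group
        total = groups * per_group
        # One random cubic polynomial per counter gives four-wise independence.
        self.coeffs = [[rng.randrange(PRIME) for _ in range(4)] for _ in range(total)]
        self.y = [0] * total

    @staticmethod
    def _sign(coeffs, i):
        a, b, c, d = coeffs
        v = (((a * i + b) % PRIME * i + c) % PRIME * i + d) % PRIME
        return 1 if v & 1 else -1

    def update(self, i, delta):
        """Process the update (i, delta): every counter y_r changes by delta * h_r(i)."""
        for r, coeffs in enumerate(self.coeffs):
            self.y[r] += delta * self._sign(coeffs, i)

    def estimate_f2(self):
        """Average y^2 inside each group, then take the median over the groups."""
        means = []
        for g in range(self.groups):
            block = self.y[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(v * v for v in block) / len(block))
        return statistics.median(means)


sketch = AMSSketch(groups=5, per_group=200, rng=random.Random(7))
for i, delta in [(1, 4), (2, -3), (1, 6), (5, 7), (9, 1), (3, 2), (3, -2)]:
    sketch.update(i, delta)
f2_estimate = sketch.estimate_f2()  # exact F_2 is 10^2 + 3^2 + 7^2 + 1^2 = 159
```

Since each counter is a linear function of x, deletions such as the pair (3, 2), (3, -2) cancel exactly, which is the flexibility of linear projections mentioned earlier in this section.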
In the next section we will present an idealized algorithm with infinite precision, given by
Indyk (Indyk 2006). Though the sampling-based algorithms are simple, they cannot be
employed for turnstile streams, and we need to develop other techniques.
Let us call a distribution $\mathcal{D}$ over $\mathbb{R}$ p-stable if for
$z_1, \ldots, z_n$ drawn independently from this distribution and for all
$x \in \mathbb{R}^n$ we have that $\sum_{i=1}^{n} z_i x_i$ is a random variable with
distribution $\|x\|_p \mathcal{D}$. Examples of such distributions are the Gaussians for
$p = 2$ and, for $p = 1$, the Cauchy distribution, which has probability density function
$\mathrm{pdf}(x) = \frac{1}{\pi (1 + x^2)}$.
From probability theory, we know that the central limit theorem establishes that, in some
situations, when independent random variables are added, their properly normalized sum
tends toward a normal distribution even if the original variables themselves are not normally
distributed. Hence, by the Central Limit Theorem an average of d samples from a distribution
approaches a Gaussian as d goes to infinity.
1.7 Indyk's Algorithm
Indyk's algorithm is one of the oldest algorithms which works on data streams. The main
drawback of this algorithm is that it is a two-pass algorithm, i.e., it requires two linear
scans of the data, which leads to high running time.
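Let the ith row of $\Pi$ be $z_i$, as before, where $z_i$ comes from a p-stable distribution.
Then consider $y_i = \langle z_i, x \rangle$. When a query arrives, output the median of all
the $|y_i|$. Without loss of generality, let us suppose a p-stable distribution has median
equal to 1, which in fact means that for z from this distribution
$P(-1 \le z \le 1) = 1/2$.
Let $\Pi = \{\pi_{ij}\}$ be an $m \times n$ matrix where every element $\pi_{ij}$ is sampled
from a p-stable distribution, $\mathcal{D}_p$. Given $x \in \mathbb{R}^n$, Indyk's algorithm
(Indyk 2006) estimates the p-norm of x as
$\|x\|_p \approx \mathrm{median}_{i = 1, \ldots, m} |y_i|$, where $y = \Pi x$.
In a turnstile streaming model, each element in the stream reflects an update to an entry
in x. Whereas an algorithm that maintains x in memory and calculates $\|x\|_p$ at the end
would need $\Theta(n)$ space, Indyk's algorithm stores y and $\Pi$. Combined with a
space-efficient way to produce $\Pi$, we attain superior space complexity.
Let us suppose $\Pi$ is generated with $\mathcal{D}_p$ such that if $Q \sim \mathcal{D}_p$
then $\mathrm{median}(|Q|) = 1$. So, we assume the probability mass of $\mathcal{D}_p$
assigned to the interval $[-1, 1]$ is 1/2. Moreover, let $I_{[a,b]}(x)$ be an indicator
function defined as
$I_{[a,b]}(x) = 1$ if $x \in [a, b]$, and $I_{[a,b]}(x) = 0$ otherwise.
Let $Z_i$ be the ith row of $\Pi$. We have
$y_i = \sum_{j=1}^{n} Z_{ij} x_j \sim \|x\|_p Q$ (1.1)
which follows from the definition of p-stable distributions and noting that the $Z_{ij}$'s
are sampled from $\mathcal{D}_p$. This implies
$E\big[I_{[-1,1]}(y_i / \|x\|_p)\big] = 1/2$ (1.2)
since $y_i / \|x\|_p \sim \mathcal{D}_p$.
Moreover, it is possible to show that
$E\big[I_{[-1-\varepsilon,\, 1+\varepsilon]}(y_i / \|x\|_p)\big] = 1/2 + \Theta(\varepsilon)$ (1.3)
$E\big[I_{[-1+\varepsilon,\, 1-\varepsilon]}(y_i / \|x\|_p)\big] = 1/2 - \Theta(\varepsilon)$ (1.4)
Next, consider the following quantities:
$C_1 = \frac{1}{m} \sum_{i=1}^{m} I_{[-1-\varepsilon,\, 1+\varepsilon]}(y_i / \|x\|_p)$ (1.5)
$C_2 = \frac{1}{m} \sum_{i=1}^{m} I_{[-1+\varepsilon,\, 1-\varepsilon]}(y_i / \|x\|_p)$ (1.6)
$C_1$ represents the fraction of $y_i$'s that satisfy $|y_i| \le (1 + \varepsilon)\|x\|_p$,
and likewise, $C_2$ represents the fraction of $y_i$'s that satisfy
$|y_i| \le (1 - \varepsilon)\|x\|_p$. Using the linearity of expectation property, we have
$E[C_1] = 1/2 + \Theta(\varepsilon)$ and $E[C_2] = 1/2 - \Theta(\varepsilon)$. Therefore, in
expectation the median of $|y_i|$ lies in
$[(1 - \varepsilon)\|x\|_p, (1 + \varepsilon)\|x\|_p]$ as desired.
Next step is to analyze the variance of $C_1$ and $C_2$. We have
$\mathrm{Var}(C_1) = \frac{1}{m^2} \times m \times (\text{variance of the indicator variable})$ (1.7)
Since the variance of any indicator variable is not more than 1,
$\mathrm{Var}(C_1) \le 1/m$. Likewise, $\mathrm{Var}(C_2) \le 1/m$. With an appropriate
choice of m we can now trust that the median of $|y_i|$ is in the desired
$\varepsilon$-range of $\|x\|_p$ with high probability.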
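For $p = 1$ the median estimator is easy to try out, because standard Cauchy variables are 1-stable and the median of their absolute value is exactly 1. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the matrix of Cauchy samples is stored in full (the idealized setting, before any derandomization), and the stream is processed in a single pass of turnstile updates.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF; median of |value| is 1."""
    return math.tan(math.pi * (rng.random() - 0.5))


def indyk_l1_estimate(updates, n, m, rng):
    """Maintain y = Pi x for an m x n Cauchy matrix Pi; return median_i |y_i|."""
    Pi = [[cauchy(rng) for _ in range(n)] for _ in range(m)]
    y = [0.0] * m
    for i, delta in updates:
        for r in range(m):
            y[r] += delta * Pi[r][i]  # y <- y + delta * (i-th column of Pi)
    return statistics.median(abs(v) for v in y)


updates = [(0, 3.0), (4, -2.0), (7, 5.0), (4, -1.0)]  # final x has L1 norm 3 + 3 + 5 = 11
l1_estimate = indyk_l1_estimate(updates, n=10, m=1001, rng=random.Random(5))
```

Each $y_i$ is distributed as $\|x\|_1$ times a standard Cauchy variable, so the median of the $|y_i|$ concentrates around $\|x\|_1$ as m grows, in line with the variance bound above.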
Hence, Indyk's algorithm works, but independently producing and storing all mn elements
of $\Pi$ is computationally costly. To invoke the definition of p-stable distributions for
Eq. 1.1, we need the entries in each row to be independent from one another. The rows need to
be pairwise independent for the calculation of the variance to hold.
Let us assume $w_i = \sum_{j=1}^{n} W_{ij} x_j$, where the $W_{ij}$'s are k-wise independent
p-stable distribution samples.
$E\big[I_{[a,b]}(w_i / \|x\|_p)\big] \approx_{\varepsilon} E\big[I_{[a,b]}(y_i / \|x\|_p)\big]$ (1.8)
If we can make this claim, then we can use k-wise independent samples in each row
instead of fully independent samples to invoke the same arguments in the analysis above.
This has been shown for $k = \Omega(1/\varepsilon^p)$ (Kane et al. 2010). With this
technique, we can state $\Pi$ using only $O(k \lg n)$ bits; across rows, we only need to use
a 2-wise independent hash function that maps a row index to an $O(k \lg n)$ bit seed for the
k-wise independent hash function.
Indyk's approach for the $L_p$ norm is based on the property of the median. However, it is
possible to construct estimators based on other quantiles, and they may even outperform the
median estimator in terms of estimation accuracy. However, since the improvement is
marginal for our parameter settings, we stick to the median estimator.
1.8 Branching Program
Branching programs are built on directed acyclic graphs and work by starting at a source
vertex, testing the values of the variables that each vertex is labeled with, and following
the appropriate edge till a sink is reached, accepting or rejecting based on the identity of
the sink. Here we picture the vertices as a grid whose columns correspond to time steps. The
program starts at a source vertex which is not part of the grid. At each step, the
program reads S bits of input, reflecting the fact that space is bounded by S, and makes a
decision about which vertex in the subsequent column of the grid to jump to. After R steps,
the last vertex visited by the program represents the outcome. The entire input, which can be
represented as a length-RS bit string, induces a distribution over the final states. Here we
wish to generate the input string using fewer ($\ll RS$) random bits such that the original
distribution over final states is well preserved. The following theorem addresses this idea.
Theorem 1.3 (Nisan 1992) There exists $h : \{0,1\}^t \to \{0,1\}^{RS}$ for
$t = O(S \lg R)$ such that
$\big| P_{x \sim U(\{0,1\}^{RS})}[f(B(x)) = 1] - P_{x \sim U(\{0,1\}^{t})}[f(B(h(x))) = 1] \big| \le \frac{1}{2^S}$ (1.9)
for any branching program B and any function $f : \{0,1\}^S \to \{0,1\}$.
The function h can simulate the input to the branching program with only t random bits such
that it is almost impossible to discriminate the outcome of the simulated program from that
of the original program.
Take a random sample x from $\{0,1\}^S$ and add x at the root. Repeat the following procedure
to create a complete binary tree. At each vertex, create two children and copy the string
over to the left child. For the right child, use a random 2-wise independent hash function
$h_j : [2^S] \to [2^S]$ chosen for the corresponding level of the tree and record the result
of the hash. Once we reach $\lg R$ levels, output the concatenation of all leaves, which is a
length-RS bit string. Since each hash function requires S random bits and there are $\lg R$
levels in the tree, this function uses $O(S \lg R)$ bits total.
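The tree construction translates directly into code. The following is a minimal sketch under stated assumptions: the names `make_hash` and `nisan_blocks` are hypothetical, S is fixed to 16 bits, and the hash family $x \mapsto ((ax + b) \bmod 65537) \bmod 2^{16}$ is used as a simple stand-in that is only approximately 2-wise independent on $[2^S]$.

```python
import random

S = 16         # block length in bits (illustrative choice)
PRIME = 65537  # a prime just above 2**S


def make_hash(rng):
    """Random function x -> ((a*x + b) mod PRIME) mod 2**S; approximately 2-wise independent."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % (1 << S)


def nisan_blocks(seed, hashes):
    """Expand a seed level by level: left child copies x, right child stores h(x)."""
    blocks = [seed]
    for h in hashes:  # hashes[0] is applied at the root level
        blocks = [child for x in blocks for child in (x, h(x))]
    return blocks     # the leaves from left to right, 2**len(hashes) blocks of S bits


rng = random.Random(3)
hashes = [make_hash(rng) for _ in range(4)]  # lg R = 4 levels, so R = 16 blocks
seed = 12345
blocks = nisan_blocks(seed, hashes)
```

The output consists of R = 16 blocks of S bits each, while the randomness consumed is only the seed plus one hash function per level.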
One way to simulate randomized computations with deterministic ones is to build a
pseudorandom generator, namely, an efficiently computable function g that can stretch a
short uniformly random seed of s bits into n bits that cannot be distinguished from uniform
ones by small space machines. Once we have such a generator, we can obtain a deterministic
computation by carrying out the computation for every fixed setting of the seed. If the seed
is short enough, and the generator is efficient enough, this simulation remains efficient. We
will use Nisan's pseudorandom generator (PRG) to derandomize $\Pi$ in Indyk's algorithm.
Specifically, when the column indexed by x is required, Nisan's generator takes x as the
input and, together with the original seed, the generator outputs a sequence of pseudorandom
sequences.
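1. Initialize $c_1 \leftarrow 0$, $c_2 \leftarrow 0$
2. For $i = 1, \ldots, m$:
a. Initialize $y \leftarrow 0$
b. For $j = 1, \ldots, n$:
i. Update $y \leftarrow y + \pi_{ij} x_j$
c. If $|y| \le (1 + \varepsilon)\|x\|_p$, then increment $c_1$
d. If $|y| \le (1 - \varepsilon)\|x\|_p$, then increment $c_2$
This procedure uses $O(\lg n)$ bits and is a branching algorithm that imitates the proof of
correctness for Indyk's algorithm. The algorithm succeeded if and only if at the end of the
computation $c_1 > m/2$ and $c_2 < m/2$. The only source of randomness in this program are
the $\pi_{ij}$'s.
We will apply Nisan's PRG to generate these random numbers. We invoke Theorem 1.3 with
the algorithm given above as B and an indicator function checking whether the algorithm
succeeded or not as f. See that the space bound is $S = O(\lg n)$ and the number of steps
taken by the program is $R = O(mn)$, or $O(n^2)$ since $m \le n$. This means we can fool the
proof of correctness of Indyk's algorithm by using $O(\lg^2 n)$ random bits to produce
$\Pi$. Indyk's algorithm uses p-stable distributions, which only exist for $p \in (0, 2]$.
We shall consider the case when $p > 2$.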
Theorem 1.4 For $p > 2$, $n^{1-2/p}\, \mathrm{poly}(\lg n / \varepsilon)$ space is necessary
and sufficient.
Details of the nearly optimal lower bound are discussed in (Bar-Yossef et al. 2004) and
(Indyk and Woodruff 2005).
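In this chapter we will discuss the algorithm of Andoni (Andoni 2012), which is based on
(Andoni et al. 2011; Jowhari et al. 2011). We will focus on $\varepsilon = \Theta(1)$. In
this algorithm, we let $\Pi = PD$. P is an $m \times n$ matrix, where each column has a
single non-zero element that is either 1 or $-1$. D is an $n \times n$ diagonal matrix with
$d_{ii} = u_i^{-1/p}$, where $u_i \sim \mathrm{Exp}(1)$.
That is to say, $P(u_i > t) = 1$ for $t \le 0$ and $P(u_i > t) = e^{-t}$ for $t > 0$.
So, same as the $0 < p \le 2$ case, we will keep $y = \Pi x$, but we estimate $\|x\|_p$ with
$\|y\|_\infty = \max_i |y_i|$ (1.10)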
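Theorem 1.5
$P\big[\tfrac{1}{4}\|x\|_p \le \|y\|_\infty \le 4\|x\|_p\big] \ge 11/20$ for
$m = \Theta(n^{1-2/p} \lg n)$.
Let $z = Dx$, which means $y = Pz$.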
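To prove Theorem 1.5, we will begin by showing that $\|z\|_\infty$, with $z = Dx$, delivers a
good estimate and then prove that applying P to z maintains it.
Claim $P\big[\tfrac{1}{2}\|x\|_p \le \|z\|_\infty \le 2\|x\|_p\big] \ge 3/4$.
Proof
Let $q = \min\{u_1 / |x_1|^p, \ldots, u_n / |x_n|^p\}$, so that $\|z\|_\infty = q^{-1/p}$.
We have
$P(q > t) = P(\forall i,\ u_i > t |x_i|^p)$ (1.11)
$= \prod_{i=1}^{n} e^{-t |x_i|^p}$ (1.12)
$= e^{-t \|x\|_p^p}$ (1.13)
which implies $q \sim \mathrm{Exp}(1) / \|x\|_p^p$. Thus,
$P\big[\tfrac{1}{2}\|x\|_p \le \|z\|_\infty \le 2\|x\|_p\big] = P\big[2^{-p}\|x\|_p^{-p} \le q \le 2^{p}\|x\|_p^{-p}\big]$ (1.14)
$= e^{-2^{-p}} - e^{-2^{p}}$ (1.15)
$\ge 3/4$ (1.16)
for $p > 2$.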
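The claim can be checked empirically. The following is a minimal simulation sketch under stated assumptions: the function names are hypothetical, the test vector is arbitrary, and $p = 3$ is chosen only as an example of $p > 2$.

```python
import math
import random


def z_inf_norm(x, p, rng):
    """Draw u_i ~ Exp(1) and return max_i |x_i| * u_i^(-1/p), i.e. the max norm of Dx."""
    return max(abs(xi) * rng.expovariate(1.0) ** (-1.0 / p) for xi in x if xi != 0)


def success_fraction(x, p, trials, rng):
    """Fraction of trials in which the max norm lies within a factor 2 of the p-norm."""
    norm = sum(abs(xi) ** p for xi in x) ** (1.0 / p)
    hits = 0
    for _ in range(trials):
        value = z_inf_norm(x, p, rng)
        if 0.5 * norm <= value <= 2.0 * norm:
            hits += 1
    return hits / trials


p = 3
x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
fraction = success_fraction(x, p, trials=4000, rng=random.Random(11))
predicted = math.exp(-2.0 ** -p) - math.exp(-2.0 ** p)  # the value in the proof above
```

For $p = 3$ the predicted probability is about 0.88, comfortably above the 3/4 stated in the claim, and the simulated fraction should land close to it regardless of the vector x.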
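The following claim establishes that if we could maintain $Q = Dx$ (the vector called z
above) instead of y then we would have a better solution to our problem. However, we cannot
store Q in memory because it is n-dimensional and $n \gg m$. Thus we need to analyze
$PQ \in \mathbb{R}^m$.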
Claim
Let
.Then
Let us suppose each entry in y is a sort of counter, and the matrix P takes each entry in Q,
hashes it to a random counter, and adds that entry of Q times a random sign to the counter.
There will be collisions because $n > m$ and there are only m counters. These will cause
different $Q_j$ to potentially cancel each other out or add together in a way that one might
expect to cause problems. We shall show that there are very few large $Q_j$'s.
Interestingly, small $Q_j$'s and big $Q_j$'s might collide with each other. When we add the
small $Q_j$'s, we multiply them with a random sign. So the expectation of the aggregate
contributions of the small $Q_j$'s to each bucket is 0. We shall bound their variance as
well, which will show that if they collide with big $Q_j$'s then with high probability this
would not considerably change the relevant counter. Ultimately, the maximal counter value
(i.e., $\|y\|_\infty$) is close to the maximal $|Q_j|$ and so to $\|x\|_p$ with high
probability.
1.8.1 Light Indices and Bernstein's Inequality
Bernstein's inequality in probability theory is a more precise formulation of the classical
Chebyshev inequality, proposed by S. N. Bernshtein in 1911; it permits one to estimate the
probability of large deviations by a monotone decreasing exponential function. In order to
analyse the light indices, we will use Bernstein's inequality.
Theorem 1.6 (Bernstein's inequality) Suppose $R_1, \ldots, R_n$ are independent, and for all
i, $|R_i| \le K$, and $\mathrm{var}(\sum_i R_i) = \sigma^2$. Then for all $t > 0$,
$P\big(\big|\sum_i R_i - E \sum_i R_i\big| > t\big) \lesssim e^{-ct^2/\sigma^2} + e^{-ct/K}$.
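We now argue that the light indices together will not distort the heavy indices. Let us
parametrize P as follows: choose a function $h : [n] \to [m]$ as well as a function
$\sigma : [n] \to \{-1, 1\}$. Then,
$P_{ij} = \sigma(j)$ if $h(j) = i$, and $P_{ij} = 0$ otherwise.
Therefore, h states which element of column j to make non-zero, and $\sigma$ states which
sign to use for column j.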
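This parametrization means that P never has to be written down: applying it only requires one bucket and one sign per column. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, h and σ are stored as fully random tables (a real implementation would use hash functions of limited independence), and the dense matrix is built only to verify the result.

```python
import random


def make_h_sigma(n, m, rng):
    """Tables for h : [n] -> [m] and sigma : [n] -> {-1, +1} (fully random, for illustration)."""
    h = [rng.randrange(m) for _ in range(n)]
    sigma = [rng.choice((-1, 1)) for _ in range(n)]
    return h, sigma


def apply_P(z, h, sigma, m):
    """Compute P z by adding sigma(j) * z_j to counter h(j), without forming P."""
    y = [0] * m
    for j, zj in enumerate(z):
        y[h[j]] += sigma[j] * zj
    return y


def dense_P(h, sigma, m):
    """Explicit m x n matrix with P[i][j] = sigma(j) if h(j) == i, else 0."""
    n = len(h)
    return [[sigma[j] if h[j] == i else 0 for j in range(n)] for i in range(m)]


rng = random.Random(2)
n, m = 12, 4
h, sigma = make_h_sigma(n, m, rng)
z = [5, -2, 7, 0, 3, 1, -4, 8, 0, 6, -1, 2]
y = apply_P(z, h, sigma, m)
P = dense_P(h, sigma, m)
y_dense = [sum(P[i][j] * z[j] for j in range(n)) for i in range(m)]
```

Each column of the dense matrix has exactly one non-zero entry, and the bucketed computation takes time proportional to n rather than mn.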
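The following light indices claim holds with constant probability: for all $i \in [m]$,
$\big| \sum_{j \in L : h(j) = i} \sigma(j) Q_j \big| < T/10$, where L denotes the set of light
indices.
Claim
If $y_i$ has no heavy indices then the magnitude of $y_i$ is much less than T. Obviously, it
would not interfere with the estimate. If $y_i$ is assigned the maximal $Q_j$, then by the
previous claim that is the only heavy index assigned to $y_i$. Therefore, all the light
indices assigned to $y_i$ would not change it by more than T/10, and since $Q_j$ is within a
factor of 2 of T, $y_i$ will still be within a constant multiplicative factor of T. If $y_i$
is assigned some other heavy index, then the corresponding $Q_j$ is less than 2T since
$Q_j$ is less than the maximal $Q_j$. This claim concludes that $y_i$ will be at most 2.1T.
Ultimately:
$y_i = \sum_{j \in L : h(j) = i} \sigma(j) Q_j + \sigma(j_{heavy}) Q_{j_{heavy}}$
where the second term is added only if $y_i$ has a heavy index. By the triangle inequality,
$|y_i| \in |Q_{j_{heavy}}| \pm \big| \sum_{j \in L : h(j) = i} \sigma(j) Q_j \big| = |Q_{j_{heavy}}| \pm T/10$.
Applying this to the bucket containing the maximal $Q_j$ shows that this bucket of y should
hold at least 0.4T. Furthermore, by a similar argument all other buckets should hold at most
2.1T.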
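Proof
Fix $i \in [m]$. Then for $j \in L$, define
$\delta_j = 1$ if $h(j) = i$, and $\delta_j = 0$ otherwise.
Then
$\sum_{j \in L : h(j) = i} \sigma(j) Q_j = \sum_{j \in L} \delta_j \sigma(j) Q_j$.
We will call the jth term of the summand $R_j$ and then use Bernstein's inequality.
1. We have $E(\sum_j R_j) = 0$, since the $\sigma(j)$ represent random signs.
2. We also have $K \le \max_{j \in L} |Q_j|$, since $|\delta_j| \le 1$, $|\sigma(j)| \le 1$,
and we iterate over light indices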