Models of computation for big data

Advanced Information and Knowledge Processing
SpringerBriefs in Advanced Information and Knowledge Processing

Series Editors
Xindong Wu
School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA
Lakhmi Jain
University of Canberra, Adelaide, SA, Australia

SpringerBriefs in Advanced Information and Knowledge Processing presents concise research in this exciting field. Designed to complement Springer’s Advanced Information and Knowledge Processing series, this Briefs series provides researchers with a forum to publish their cutting-edge research which is not yet mature enough for a book in the Advanced Information and Knowledge Processing series, but which has grown beyond the level of a workshop paper or journal article.
Typical topics may include, but are not restricted to:
Big Data analytics
Big Knowledge
Bioinformatics
Business intelligence
Computer security
Data mining and knowledge discovery
Information quality and privacy
Internet of things
Knowledge management
Knowledge-based software engineering
Machine intelligence
Ontology
Semantic Web
Smart environments
Soft computing
Social networks
SpringerBriefs are published as part of Springer’s eBook collection, with millions of users worldwide, and are available for individual print and electronic purchase. Briefs are characterized by fast, global electronic dissemination, standard publishing contracts, easy-to-use manuscript preparation and formatting guidelines and expedited production schedules to assist researchers in distributing their research fast and efficiently.
More information about this series at http://www.springer.com/series/16024


Rajendra Akerkar

Models of Computation for Big Data

Rajendra Akerkar
Western Norway Research Institute, Sogndal, Norway

ISSN 1610-3947
e-ISSN 2197-8441
Advanced Information and Knowledge Processing
ISSN 2524-5198
e-ISSN 2524-5201
SpringerBriefs in Advanced Information and Knowledge Processing
ISBN 978-3-319-91850-1
e-ISBN 978-3-319-91851-8
Library of Congress Control Number: 2018951205
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


Preface
This book addresses algorithmic problems in the age of big data. Rapidly increasing volumes of diverse data from distributed sources create challenges for extracting valuable knowledge and commercial value from data. This motivates increased interest in the design and analysis of algorithms for rigorous analysis of such data.
The book covers mathematically rigorous models, as well as some provable limitations of algorithms operating in those models. Most techniques discussed in the book come from research in the last decade, and many of the algorithms we discuss have wide applications in Web data compression, approximate query processing in databases, network measurement, signal processing and so on. We discuss lower bound methods in some models, showing that many of the algorithms we present are optimal or near optimal. The book itself focuses on the underlying techniques rather than the specific applications.
This book grew out of my lectures for the course on big data algorithms. The algorithmic treatment of modern data models has been a success in research, teaching and practice, which must be attributed to the efforts of a growing number of researchers in the field, among them Piotr Indyk, Jelani Nelson, S. Muthukrishnan and Rajiv Motwani. Their excellent work is the foundation of this book. This book is intended for both graduate students and advanced undergraduate students who satisfy the discrete probability, basic algorithmics and linear algebra prerequisites.
I wish to express my heartfelt gratitude to my colleagues at Vestlandsforsking, Norway, and Technomathematics Research Foundation, India, for their encouragement in persuading me to consolidate my teaching materials into this book. I thank Minsung Hong for help with the LaTeX typing. I would also like to thank Helen Desmond and the production team at Springer. Thanks to the INTPART programme funding for partially supporting this book project. The love, patience and encouragement of my father, son and wife made this project possible.
Rajendra Akerkar
Sogndal, Norway
May 2018


Contents
1 Streaming Models
1.1 Introduction
1.2 Space Lower Bounds
1.3 Streaming Algorithms
1.4 Non-adaptive Randomized Streaming
1.5 Linear Sketch
1.6 Alon–Matias–Szegedy Sketch
1.7 Indyk’s Algorithm
1.8 Branching Program
1.8.1 Light Indices and Bernstein’s Inequality
1.9 Heavy Hitters Problem
1.10 Count-Min Sketch
1.10.1 Count Sketch
1.10.2 Count-Min Sketch and Heavy Hitters Problem
1.11 Streaming k-Means
1.12 Graph Sketching
1.12.1 Graph Connectivity
2 Sub-linear Time Models
2.1 Introduction
2.2 Fano’s Inequality
2.3 Randomized Exact and Approximate Bound
2.4 t-Player Disjointness Problem
2.5 Dimensionality Reduction
2.5.1 Johnson Lindenstrauss Lemma
2.5.2 Lower Bounds on Dimensionality Reduction
2.5.3 Dimensionality Reduction for k-Means Clustering
2.6 Gordon’s Theorem
2.7 Johnson–Lindenstrauss Transform
2.8 Fast Johnson–Lindenstrauss Transform
2.9 Sublinear-Time Algorithms: An Example
2.10 Minimum Spanning Tree
2.10.1 Approximation Algorithm
3 Linear Algebraic Models
3.1 Introduction
3.2 Sampling and Subspace Embeddings
3.3 Non-commutative Khintchine Inequality
3.4 Iterative Algorithms
3.5 Sarlós Method
3.6 Low-Rank Approximation
3.7 Compressed Sensing
3.8 The Matrix Completion Problem
3.8.1 Alternating Minimization
4 Assorted Computational Models
4.1 Cell Probe Model
4.1.1 The Dictionary Problem
4.1.2 The Predecessor Problem
4.2 Online Bipartite Matching
4.2.1 Basic Approach
4.2.2 Ranking Method
4.3 MapReduce Programming Model
4.4 Markov Chain Model
4.4.1 Random Walks on Undirected Graphs
4.4.2 Electric Networks and Random Walks
4.4.3 Example: The Lollipop Graph
4.5 Crowdsourcing Model
4.5.1 Formal Model
4.6 Communication Complexity
4.6.1 Information Cost
4.6.2 Separation of Information and Communication
4.7 Adaptive Sparse Recovery
References


1. Streaming Models

Rajendra Akerkar (1)
(1) Western Norway Research Institute, Sogndal, Norway

Rajendra Akerkar
Email:

1.1 Introduction
In the analysis of big data there are queries that do not scale, since they need massive computing resources and time to generate exact results. Examples include count distinct, most frequent items, joins, matrix computations, and graph analysis. If approximate results are acceptable, there is a class of dedicated algorithms, known as streaming algorithms or sketches, that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may be no other practical option, and in the case of real-time analysis, sketches are the only recognized solution.
Streaming data is a sequence of digitally encoded signals used to represent information in transmission. For streaming data, the input data to be operated on are not available all at once, but rather arrive as continuous data sequences. A data stream is a sequence of data elements that is much larger than the amount of available memory. More often than not, an element will simply be an (integer) number from some range. However, it is often convenient to allow other data types, such as multidimensional points, metric points, graph vertices and edges, etc. The goal is to approximately compute some function of the data using only one pass over the data stream. The critical aspect in designing data stream algorithms is that any data element that has not been stored is ultimately lost forever. Hence, it is vital that data elements are properly selected and preserved. Data streams arise in several real-world applications. For example, a network router must process terabits of packet data, which cannot all be stored by the router. Nevertheless, there are many statistics and patterns of the network traffic that are useful to know in order to detect unusual network behaviour. Data stream algorithms enable computing such statistics fast and using little memory. In streaming we want to maintain a sketch F(X) on the fly as X is updated. For example, if numbers arrive on the fly, we can keep a running sum, which is a streaming algorithm. The streaming setting appears in a lot of places: for example, a router can monitor online traffic, and we can sketch the traffic volume to find the traffic pattern.
The fundamental mathematical ideas for processing streaming data are sampling and random projections. Many different sampling methods have been proposed, such as domain sampling, universe sampling, reservoir sampling, etc. There are two main difficulties with sampling for streaming data. First, sampling is not a powerful primitive for many problems, since too many samples are needed for performing sophisticated analysis, and a lower bound to this effect is known. Second, as the stream unfolds, if the samples maintained by the algorithm get deleted, one may be forced to resample from the past, which is in general expensive or impossible in practice and, in any case, not allowed in streaming data problems. Random projections rely on dimensionality reduction, using projection along random vectors. The random vectors are generated by space-efficient computation of random variables. These projections are called sketches. There are many variations of random projections, some of which are of a simpler type.
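Reservoir sampling, one of the sampling methods named above, can be illustrated with a short sketch. The version below is the textbook single-item-replacement scheme and is offered only as an illustration; the function name `reservoir_sample` and its parameters are assumptions made for this example, not notation from the book.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Uses O(k) memory and a single pass, as required in the streaming setting.
    """
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            reservoir.append(item)      # the first k items fill the reservoir
        else:
            j = rng.randint(0, t)       # uniform over {0, ..., t}
            if j < k:
                reservoir[j] = item     # item t is kept with probability k/(t+1)
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(10_000), 5, random.Random(42)))
```

Every item that is not placed in the reservoir is lost, which matches the remark above that unstored stream elements cannot be recovered later.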
Sampling and sketching are two basic techniques for designing streaming algorithms. The idea behind sampling is simple to understand. Every arriving item is preserved with a certain probability, and only a subset of the data is kept for further computation. Sampling is also easy to implement, and has many applications. Sketching is the other technique for designing streaming algorithms. Sketch techniques have undergone wide development within the past few years. They are particularly appropriate for the data streaming scenario, in which large quantities of data flow by and the sketch summary must continually be updated rapidly and compactly. A sketch-based algorithm creates a compact synopsis of the data which has been observed, and the size of the synopsis is usually smaller than the full observed data. Each update observed in the stream potentially causes this synopsis to be updated, so that the synopsis can be used to approximate certain functions of the data seen so far. In order to build a sketch, we should be able either to perform a single linear scan of the input data (in no strict order), or to scan the entire stream of updates which collectively build up the input. Note that many sketches were originally designed for computations in situations where the input is never collected together in one place, but exists only implicitly as defined by the stream. A sketch F(X) with respect to some function f is a compression of the data X. It allows us to compute f(X) (approximately) given access only to F(X). A sketch of large-scale data is a small data structure that lets you approximate particular characteristics of the original data. The exact nature of the sketch depends on what you are trying to approximate as well as on the nature of the data.
The goal of the streaming algorithm is to make one pass over the data and to use limited memory to compute functions of x, such as the frequency moments, the number of distinct elements, the heavy hitters, and, treating x as a matrix, various quantities in numerical linear algebra such as a low rank approximation. Since computing these quantities exactly or deterministically often requires a prohibitive amount of space, these algorithms are usually randomized and approximate.
Many algorithms that we will discuss in this book are randomized, since randomization is often necessary to achieve good space bounds. A randomized algorithm is an algorithm that can toss coins and take different actions depending on the outcome of those tosses. Randomized algorithms have several advantages over deterministic ones. Usually, randomized algorithms tend to be simpler than deterministic algorithms for the same task. For example, the strategy of picking a random element to partition the problem into subproblems and recursing on one of the partitions is much simpler than its deterministic counterpart. Further, for some problems randomized algorithms have a better asymptotic running time than their deterministic counterparts. Randomization can be beneficial when the algorithm faces a lack of information, and it is also very useful in the design of online algorithms that learn their input over time, or in the design of oblivious algorithms that output a single solution that is good for all inputs. Randomization, in the form of sampling, can help us estimate the size of exponentially large spaces or sets.

1.2 Space Lower Bounds
The advent of cutting-edge communication and storage technology enables large amounts of raw data to be produced daily, and subsequently there is a rising demand to process this data efficiently. Since it is unrealistic for an algorithm to store even a small fraction of the data stream, its performance is typically measured by the amount of space it uses. In many scenarios, such as internet routing, once a stream element is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with the sheer size of the data, makes multiple passes over the data impracticable.
Let us consider the distinct elements problem: find the number of distinct elements in a stream, where queries and additions are allowed. We denote by s the space used by the algorithm, by n the size of the universe from which the elements arrive, and by m the length of the stream.
Theorem 1.1 There is no deterministic exact algorithm for computing the number of distinct elements in $o(\min\{n, m\})$ space (Alon et al. 1999).
Proof Using a streaming algorithm with space $s$ for the problem, we are going to show how to encode any $x \in \{0,1\}^n$ using only $s$ bits. In other words, we are going to produce an injective mapping from $\{0,1\}^n$ to $\{0,1\}^s$. This implies that $s$ must be at least $n$. We look for procedures Enc and Dec such that $\mathrm{Dec}(\mathrm{Enc}(x)) = x$ for all $x$, where $\mathrm{Enc}(x)$ is a function from $\{0,1\}^n$ to $\{0,1\}^s$.
In the encoding procedure, given a string $x$, devise a stream by appending $i$ to the stream if $x_i = 1$. Then $\mathrm{Enc}(x)$ is the memory content of the algorithm after processing that stream.
In the decoding procedure, consider each $i$ in turn, add it at the end of the stream, and then query the number of distinct elements. If the number of distinct elements increases, this implies that $x_i = 0$; otherwise it implies that $x_i = 1$. So we can recover $x$ completely, which proves the claim.
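The encoding argument in the proof can be made concrete with a toy exact counter. In the sketch below the "memory content" is simply the set of items seen so far; the class `ExactDistinctCounter` and the functions `enc` and `dec` are hypothetical names introduced for illustration. Unlike the proof, the decoder restarts from the encoded memory for each index, which does not change the argument.

```python
class ExactDistinctCounter:
    """Toy exact distinct-element counter; its memory is the set of seen items."""

    def __init__(self, state=frozenset()):
        self.seen = set(state)

    def add(self, item):
        self.seen.add(item)

    def query(self):
        return len(self.seen)

    def memory(self):
        return frozenset(self.seen)


def enc(x):
    """Feed index i to the algorithm whenever x[i] == 1; return its memory."""
    alg = ExactDistinctCounter()
    for i, bit in enumerate(x):
        if bit == 1:
            alg.add(i)
    return alg.memory()


def dec(memory, n):
    """Recover x from the memory alone by probing every index."""
    x = []
    for i in range(n):
        alg = ExactDistinctCounter(memory)   # resume from the encoded state
        before = alg.query()
        alg.add(i)
        # The count grows exactly when i was absent from the stream (x[i] == 0).
        x.append(0 if alg.query() > before else 1)
    return x


if __name__ == "__main__":
    x = [1, 0, 0, 1, 1, 0]
    print(dec(enc(x), len(x)) == x)
```

Because `dec(enc(x), n)` returns `x` for every bit string, the memory of any exact deterministic algorithm must distinguish all $2^n$ strings, which is the content of the lower bound.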
Now we show that allowing approximation does not help deterministic algorithms for this problem.
Theorem 1.2 Any deterministic $F_0$ algorithm (where $F_0$ denotes the number of distinct elements) that provides a 1.1 approximation requires $\Omega(n)$ space.
Proof Suppose we had a collection $F$ of subsets of $[n]$ fulfilling the following:
$|F| \ge 2^{cn}$, for some constant $c < 1$.
$|S| = n/100$ for all $S \in F$.
$|S \cap T| \le n/2000 = |S|/20$ for all $S \ne T \in F$.
Let us use the algorithm to encode the vectors $x_S$ for all $S \in F$, where $x_S$ is the indicator vector of the set $S$. The lower bound follows since we must have $s \ge cn$. The encoding procedure is the same as in the previous proof.
In the decoding procedure, we iterate over all sets and test for each set $S$ whether it corresponds to our initial encoded set. Let $M$ be the memory contents of the streaming algorithm after the initial string has been inserted. For each $S$, we initialize the algorithm with memory contents $M$ and then feed it element $i$ if $i \in S$. If $S$ equals the initial encoded set, the number of distinct elements changes at most slightly, whereas if it does not, the number almost doubles. Considering the approximation guarantee of the algorithm, we see that if $S$ is not our initial set then the reported number of distinct elements grows by a factor of at least $3/2$.
In order to confirm the existence of such a family of sets $F$, we partition $[n]$ into $n/100$ intervals of length 100 each. To form a set $S$ we select one number from each interval uniformly at random. Obviously, such a set has size exactly $n/100$. For two sets $S, T$ selected uniformly at random as before, let $U_i$ be the random variable that equals 1 if they have the same number selected from interval $i$. So $P[U_i = 1] = 1/100$. Hence the expected size of the intersection is just $E\left[\sum_{i=1}^{n/100} U_i\right] = \frac{n}{100} \cdot \frac{1}{100}$. The probability that this intersection is bigger than five times its mean is smaller than $e^{-c'n}$ for some constant $c'$, by a standard Chernoff bound. Finally, by applying a union bound over all feasible intersections one can prove the result.

1.3 Streaming Algorithms
An important aspect of streaming algorithms is that these algorithms have to be approximate. There are a few things that one can compute exactly in a streaming manner, but there are lots of crucial things that one cannot compute that way, so we have to approximate. Most significant aggregates can be approximated online. There are two main tools: (1) hashing, which replaces item identities by hash values; and (2) sketching, which takes a very large amount of data and builds a very small sketch of the data. Carefully done, the sketch can be used to obtain values of interest, so the task becomes one of finding a good sketch. All of the algorithms discussed in this chapter use sketching of some kind and some use hashing as well. One popular streaming algorithm is HyperLogLog by Flajolet. Cardinality estimation is the task of determining the number of distinct elements in a data stream. While the cardinality can be easily computed using space linear in the cardinality, for several applications this is totally unrealistic and requires too much memory. Therefore, many algorithms that approximate the cardinality while using fewer resources have been developed. HyperLogLog is one of them. These algorithms play an important role in network monitoring systems, data mining applications, as well as database systems. The basic idea is that if we have $n$ samples that are hashed and inserted into the interval $[0, 1)$, those $n$ samples are going to make $n + 1$ intervals. Therefore, the average size of the $n + 1$ intervals has to be $1/(n+1)$. By symmetry, the average distance to the minimum of those hashed values is also going to be $1/(n+1)$. Furthermore, duplicate values will land exactly on top of previous values, thus $n$ is the number of unique values we have inserted. For instance, if we have ten samples, the minimum is going to be right around 1/11. HyperLogLog is shown to be near optimal among algorithms that are based on order statistics.
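The minimum-of-hashes idea described above can be tried out directly. The following is a minimal sketch of that basic idea only, not of HyperLogLog itself: it averages the minimum over several salted hash functions to reduce variance. The names `hash01` and `MinHashCounter`, the use of SHA-256 as the hash, and the default of 128 hash functions are all assumptions made for this illustration.

```python
import hashlib


def hash01(item, salt):
    """Map (salt, item) to a pseudo-uniform point in [0, 1) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class MinHashCounter:
    """Estimate the number of distinct items from the minimum hash values."""

    def __init__(self, num_hashes=128):
        self.mins = [1.0] * num_hashes   # one running minimum per hash function

    def add(self, item):
        for s in range(len(self.mins)):
            v = hash01(item, s)
            if v < self.mins[s]:
                self.mins[s] = v

    def estimate(self):
        # With n distinct items the expected minimum is 1 / (n + 1).
        mean_min = sum(self.mins) / len(self.mins)
        return 1.0 / mean_min - 1.0


if __name__ == "__main__":
    counter = MinHashCounter()
    for i in range(1000):
        counter.add(f"user-{i}")
    print(round(counter.estimate()))
```

Duplicates hash to the same points and therefore leave every minimum, and hence the estimate, unchanged. The memory used is one number per hash function, independent of the stream length.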


1.4 Non-adaptive Randomized Streaming
Non-trivial update time lower bounds for randomized streaming algorithms in the turnstile model were presented in (Larsen et al. 2014). Only a specific restricted class of randomized streaming algorithms, namely those that are non-adaptive, could be bounded. Most well-known turnstile streaming algorithms in the literature are non-adaptive. Reference (Larsen et al. 2014) gives non-trivial update time lower bounds for both randomized and deterministic turnstile streaming algorithms, which hold when the algorithms are non-adaptive.
Definition 1.1 A non-adaptive randomized streaming algorithm is an algorithm that may toss random coins before processing any elements of the stream, and in which, on any update operation, the words read from and written to memory are determined by the index of the updated element and the initially tossed coins.
These constraints mean that memory must not be read or written based on the current state of the memory, but only according to the coins and the index. Comparing the above definition to sketches, a hash function chosen independently from any desired hash family can emulate these coins, enabling the update algorithm to find the specific words of memory to update using only the hash function and the index of the element to update. The non-adaptive restriction therefore fits these turnstile model algorithms well. Both the Count-Min Sketch and the Count-Median Sketch are non-adaptive and support point queries.

1.5 Linear Sketch
Many data stream problems cannot be solved with just a sample. We can instead make use of data structures which include a contribution from the entire input, rather than only from the items picked in the sample. For instance, consider trying to count the number of distinct objects in a stream. It is easy to see that unless almost all items are included in the sample, we cannot tell whether they are the same or distinct. Since a streaming algorithm gets to see each item in turn, it can do better. We consider a sketch as a compact data structure which summarizes the stream for certain types of query. It is a linear transformation of the stream: we can imagine the stream as defining a vector, and the algorithm computes the product of a matrix with this vector.
As we know, a data stream is a sequence of data, where each item belongs to the universe. A data streaming algorithm takes a data stream as input and computes some function of the stream. Further, the algorithm accesses the input in a streaming fashion, i.e. the algorithm cannot read the input in another order, and in most cases the algorithm can only read the data once. Depending on how items in the universe are expressed in the data stream, there are two typical models:
Cash Register Model: Each item in the stream is an item of the universe. Different items come in an arbitrary order.
Turnstile Model: In this model we have a multi-set. Every incoming item is linked with one of two special symbols (insert or delete) to indicate the dynamic changes of the data set. The turnstile model captures most practical situations in which the data set may change over time. The model is also known as dynamic streams.
We now discuss the turnstile model in streaming algorithms. In the turnstile model, the stream consists of a sequence of updates where each update either inserts an element or deletes one, but a deletion cannot delete an element that does not exist. When there are duplicates, this means that the multiplicity of any element cannot go negative.
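In the model there is a vector $x \in \mathbb{R}^n$ that starts as the all-zero vector, and then a sequence of updates comes. Each update is of the form $(i, \Delta)$, where $\Delta \in \mathbb{R}$ and $i \in \{1, \ldots, n\}$. This corresponds to the operation $x_i \leftarrow x_i + \Delta$.
Given a function $f$, we want to approximate $f(x)$. For example, in the distinct elements problem $\Delta$ is always 1 and $f(x) = |\{i : x_i \ne 0\}|$.
The well-known approach for designing turnstile algorithms is linear sketching. The idea is to preserve in memory $y = \Pi x$, where $\Pi \in \mathbb{R}^{m \times n}$ is a matrix that is short and fat. We know that $m < n$, and in fact $m$ is much smaller. We can see that $y$ is $m$-dimensional, so we can store it efficiently, but if we need to store the whole of $\Pi$ in memory then we will not get a more space-efficient algorithm. Hence, there are two options for creating and storing $\Pi$:
$\Pi$ is deterministic, and so we can easily compute $\Pi_{ij}$ without keeping the whole matrix in memory.
$\Pi$ is defined by $k$-wise independent hash functions for some small $k$, so we can afford storing the hash functions and computing $\Pi_{ij}$.
Let $\Pi^i$ be the $i$th column of the matrix $\Pi$. Then $\Pi x = \sum_{i=1}^{n} \Pi^i x_i$. So, storing $y = \Pi x$, when the update $(i, \Delta)$ occurs we have that the new $y$ equals $\Pi(x + \Delta e_i) = \Pi x + \Delta \Pi^i$. The first summand is the old $y$ and the second summand is simply a multiple of the $i$th column of $\Pi$.
This is how updates take place when we have a linear sketch.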
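The update rule $y \leftarrow y + \Delta \Pi^i$ can be sketched in a few lines. In the example below the entries of $\Pi$ are $\pm 1$ values recomputed on demand from a hash, so the matrix is never stored. The SHA-256-based `pi_entry` is only a stand-in for the $k$-wise independent hash functions mentioned above, and all names (`pi_entry`, `LinearSketch`) are assumptions of this illustration.

```python
import hashlib


def pi_entry(row, col, seed=0):
    """Entry Pi[row][col] in {-1, +1}, recomputed from a hash instead of stored."""
    digest = hashlib.sha256(f"{seed}:{row}:{col}".encode()).digest()
    return 1 if digest[0] & 1 else -1


class LinearSketch:
    """Maintain y = Pi x under turnstile updates (i, delta)."""

    def __init__(self, m, seed=0):
        self.m = m
        self.seed = seed
        self.y = [0] * m

    def update(self, i, delta):
        # new y = old y + delta * (i-th column of Pi)
        for r in range(self.m):
            self.y[r] += delta * pi_entry(r, i, self.seed)


if __name__ == "__main__":
    sketch = LinearSketch(m=4, seed=1)
    for i, delta in [(3, 2), (10, -1), (3, 5)]:
        sketch.update(i, delta)
    print(sketch.y)
```

Because the map $x \mapsto \Pi x$ is linear, the sketch after any sequence of updates equals $\Pi$ applied to the final vector, and sketches built with the same seed can be added or subtracted entrywise.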
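Now let us consider the moment estimation problem (Alon et al. 1999). The problem of estimating (frequency) moments of a data stream has attracted a lot of attention since the inception of streaming algorithms. Let $F_p = \|x\|_p^p = \sum_{i=1}^{n} |x_i|^p$. We want to estimate the space needed to solve the moment estimation problem as $p$ changes. There is a transition point in the complexity of $F_p$ at $p = 2$. For $0 \le p \le 2$, $\mathrm{poly}(\log n / \varepsilon)$ space is achievable for a $(1+\varepsilon)$ approximation with $2/3$ success probability (Alon et al. 1999; Indyk 2006). For $p > 2$ we need exactly $\Theta(n^{1-2/p}\,\mathrm{poly}(\log n / \varepsilon))$ bits of space for a $(1+\varepsilon)$ approximation with $2/3$ success probability (Bar-Yossef et al. 2004; Indyk and Woodruff 2005).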

1.6 Alon–Matias–Szegedy Sketch
Streaming algorithms aim to summarize a large volume of data into a compact summary, by maintaining a data structure that can be incrementally modified as updates are observed. They allow the approximation of particular quantities. Alon–Matias–Szegedy (AMS) sketches (Alon et al. 1999) are randomized summaries of the data that can be used to compute aggregates such as the second frequency moment and sizes of joins. AMS sketches can be viewed as random projections of the data in the frequency domain on ±1 pseudo-random vectors. The key property of AMS sketches is that the product of projections on the same random vector of frequencies of the join attribute of two relations is an unbiased estimate of the size of the join of the relations. While a single AMS sketch is inaccurate, multiple such sketches can be computed and combined using averages and medians to obtain an estimate of any desired precision.
In particular, the AMS sketch is focused on approximating the sum of squared entries of a vector defined by a stream of updates. This quantity is naturally related to the Euclidean norm of the vector, and so has many applications in high-dimensional geometry, and in data mining and machine learning settings that use vector representations of data. The data structure maintains a linear projection of the stream with a number of randomly chosen vectors. These random vectors are defined implicitly by simple hash functions, and so do not have to be stored explicitly. Varying the size of the sketch changes the accuracy guarantees on the resulting estimation. The fact that the summary is a linear projection means that it can be updated flexibly, and sketches can be combined by addition or subtraction, yielding sketches corresponding to the addition and subtraction of the underlying vectors.
A common feature of (Count-Min and AMS) sketch algorithms is that they rely on hash functions on item identifiers, which are relatively easy to implement and fast to compute.
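Definition 1.2 $H$ is a $k$-wise independent hash family of functions from $[n]$ to $[m]$ if for all distinct $i_1, \ldots, i_k \in [n]$ and all $j_1, \ldots, j_k \in [m]$,
$$\Pr_{h \in H}\left[h(i_1) = j_1 \wedge \cdots \wedge h(i_k) = j_k\right] = \frac{1}{m^k}.$$
There are two versions of the AMS algorithm. The faster version, based on hashing, is also referred to as fast AMS to distinguish it from the original “slower” sketch, since each update is very fast.
Algorithm:
1. Consider a random hash function $h : [n] \to \{-1, +1\}$ from a four-wise independent family.
2. Let $v_i = h(i)$.
3. Let $y = \langle v, x \rangle$, output $y^2$.
4. $y^2$ is an unbiased estimator with variance big-Oh of the square of its expectation.
5. Sample $y^2$ $m_1 = O(1/\varepsilon^2)$ independent times: $\{y_1^2, y_2^2, \ldots, y_{m_1}^2\}$. Use Chebyshev’s inequality to obtain a $(1 \pm \varepsilon)$ approximation with $2/3$ probability.
6. Let $\bar{y} = \frac{1}{m_1} \sum_{i=1}^{m_1} y_i^2$.
7. Sample $\bar{y}$ $m_2 = O(\log(1/\delta))$ independent times: $\{\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{m_2}\}$. Take the median to get a $(1 \pm \varepsilon)$-approximation with probability $1 - \delta$.
Each of the hash functions takes $O(\log n)$ bits to store, and there are $O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$ hash functions in total.
Lemma 1.1 $E[y^2] = \|x\|_2^2$.
Proof
$$E[y^2] = E\left[\langle v, x \rangle^2\right] = E\left[\sum_{i=1}^{n} v_i^2 x_i^2 + \sum_{i \ne j} v_i v_j x_i x_j\right] = \sum_{i=1}^{n} x_i^2 + \sum_{i \ne j} E[v_i v_j]\, x_i x_j = \|x\|_2^2,$$
where $E[v_i v_j] = E[v_i]\,E[v_j] = 0$ since pair-wise independence.
Lemma 1.2 $E\left[(y^2 - E[y^2])^2\right] \le 2\|x\|_2^4$.
Proof
$$E\left[(y^2 - E[y^2])^2\right] = E\left[\Big(\sum_{i \ne j} v_i v_j x_i x_j\Big)^2\right] = E\left[4 \sum_{i<j} v_i^2 v_j^2 x_i^2 x_j^2 + 4 \sum_{i \ne j \ne k} v_i^2 v_j v_k x_i^2 x_j x_k + 24 \sum_{i<j<k<l} v_i v_j v_k v_l x_i x_j x_k x_l\right] = 4 \sum_{i<j} x_i^2 x_j^2 \le 2\|x\|_2^4,$$
where $E[v_i^2 v_j v_k] = E[v_j]\,E[v_k] = 0$ since pair-wise independence, and $E[v_i v_j v_k v_l] = E[v_i]\,E[v_j]\,E[v_k]\,E[v_l] = 0$ since four-wise independence.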
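The AMS estimator for $\|x\|_2^2$ (steps 1 to 7) can be sketched as follows. This is an illustrative version under stated assumptions: the sign hash is taken from a random degree-3 polynomial over $\mathbb{Z}_P$ with $P = 2^{61} - 1$, a common way to obtain four-wise independent values, and its low bit is used as the sign, which introduces a negligible bias. The class names and the parameters `m1`, `m2` and `seed` are hypothetical.

```python
import random
from statistics import median

P = 2**61 - 1  # Mersenne prime, assumed larger than every index used


class FourWiseSign:
    """Sign hash h(i) in {-1, +1} from a random cubic polynomial over Z_P."""

    def __init__(self, rng):
        self.coef = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        v = 0
        for c in self.coef:            # Horner evaluation of the polynomial
            v = (v * i + c) % P
        return 1 if v & 1 else -1      # low bit as sign (tiny bias, ignored)


class AMSSketch:
    """Median of m2 means, each over m1 independent copies of y^2."""

    def __init__(self, m1, m2, seed=0):
        rng = random.Random(seed)
        self.m1, self.m2 = m1, m2
        self.signs = [FourWiseSign(rng) for _ in range(m1 * m2)]
        self.y = [0] * (m1 * m2)       # one counter y = <v, x> per hash function

    def update(self, i, delta):
        for k, h in enumerate(self.signs):
            self.y[k] += delta * h(i)

    def estimate(self):
        means = []
        for g in range(self.m2):
            group = self.y[g * self.m1:(g + 1) * self.m1]
            means.append(sum(v * v for v in group) / self.m1)
        return median(means)


if __name__ == "__main__":
    sketch = AMSSketch(m1=200, m2=5, seed=1)
    for i in range(50):
        sketch.update(i, (i % 7) - 3)
    print(sketch.estimate())
```

Each counter is a linear function of the stream, so deletions are handled by updates with negative `delta`, and the sketch size depends only on `m1 * m2`, not on the length of the stream.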

In the next section we will present an idealized algorithm with infinite precision, given by Indyk (Indyk 2006). Though the sampling-based algorithms are simple, they cannot be employed for turnstile streams, and we need to develop other techniques.
Let us call a distribution $D$ over $\mathbb{R}$ $p$-stable if for $z_1, \ldots, z_n$ drawn independently from this distribution and for all $x \in \mathbb{R}^n$ we have that $\sum_{i=1}^{n} z_i x_i$ is a random variable with distribution $\|x\|_p D$. Examples of such distributions are the Gaussians for $p = 2$ and, for $p = 1$, the Cauchy distribution, which has probability density function $\mathrm{pdf}(x) = \frac{1}{\pi(1 + x^2)}$.
From probability theory, we know that the central limit theorem establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. Hence, by the Central Limit Theorem, an average of $d$ samples from a distribution with finite variance approaches a Gaussian as $d$ goes to infinity.

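1.7 Indyk’s Algorithm
Indyk’s algorithm is one of the oldest algorithms that works on data streams. The main drawback of this algorithm is that it is a two-pass algorithm, i.e., it requires two linear scans of the data, which leads to a high running time.
Let the $i$th row of $\Pi$ be $z_i$, as before, where $z_i$ comes from a $p$-stable distribution. Then consider $y_i = \sum_{j=1}^{n} z_{ij} x_j$. When a query arrives, output the median of all the $|y_i|$. Without loss of generality, let us suppose the $p$-stable distribution has median equal to 1, which in fact means that for $z$ from this distribution $P(-1 \le z \le 1) = 1/2$.
Let $\Pi = \{\pi_{ij}\}$ be an $m \times n$ matrix where every element $\pi_{ij}$ is sampled from a $p$-stable distribution, $D_p$. Given $x \in \mathbb{R}^n$, Indyk’s algorithm (Indyk 2006) estimates the $p$-norm of $x$ as
$$\|x\|_p \approx \mathrm{median}_{i = 1, \ldots, m} |y_i|,$$
where $y = \Pi x$.
In a turnstile streaming model, each element in the stream reflects an update to an entry in $x$. Whereas a naive algorithm would maintain $x$ in memory and calculate $\|x\|_p$ at the end, hence needing $\Theta(n)$ space, Indyk’s algorithm stores $y$ and $\Pi$. Combined with a space-efficient way to produce $\Pi$, we attain superior space complexity.
Let us suppose $\Pi$ is generated with $D_p$ such that if $Z \sim D_p$ then $\mathrm{median}(|Z|) = 1$. So, we assume the probability mass of $D_p$ assigned to the interval $[-1, 1]$ is 1/2. Moreover, let $I_{[a,b]}(x)$ be an indicator function defined as
$$I_{[a,b]}(x) = \begin{cases} 1 & x \in [a, b], \\ 0 & \text{otherwise.} \end{cases}$$
Let $Z_i$ be the $i$th row of $\Pi$. We have
$$y_i = \sum_{j=1}^{n} Z_{ij} x_j \sim \|x\|_p D_p, \tag{1.1}$$
which follows from the definition of $p$-stable distributions and noting that the $Z_{ij}$’s are sampled from $D_p$. This implies
$$E\left[I_{[-1,1]}\left(\frac{y_i}{\|x\|_p}\right)\right] = \frac{1}{2}, \tag{1.2}$$
since $y_i / \|x\|_p \sim D_p$.
Moreover, it is possible to show that
$$E\left[I_{[-1-\varepsilon,\, 1+\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right] = \frac{1}{2} + \Theta(\varepsilon), \tag{1.3}$$
$$E\left[I_{[-1+\varepsilon,\, 1-\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right] = \frac{1}{2} - \Theta(\varepsilon). \tag{1.4}$$
Next, consider the following quantities:
$$C_1 = \frac{1}{m} \sum_{i=1}^{m} I_{[-1-\varepsilon,\, 1+\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right), \tag{1.5}$$
$$C_2 = \frac{1}{m} \sum_{i=1}^{m} I_{[-1+\varepsilon,\, 1-\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right). \tag{1.6}$$
$C_1$ represents the fraction of $y_i$’s that satisfy $|y_i| \le (1+\varepsilon)\|x\|_p$, and likewise, $C_2$ represents the fraction of $y_i$’s that satisfy $|y_i| \le (1-\varepsilon)\|x\|_p$. Using the linearity of expectation property, we have $E[C_1] = 1/2 + \Theta(\varepsilon)$ and $E[C_2] = 1/2 - \Theta(\varepsilon)$. Therefore, in expectation, the median of $|y_i|$ lies in $[(1-\varepsilon)\|x\|_p, (1+\varepsilon)\|x\|_p]$ as desired.
The next step is to analyze the variance of $C_1$ and $C_2$. We have
$$\mathrm{Var}(C_1) = \frac{1}{m^2} \times m \times (\text{variance of the indicator variable}). \tag{1.7}$$
Since the variance of any indicator variable is not more than 1, $\mathrm{Var}(C_1) \le 1/m$. Likewise, $\mathrm{Var}(C_2) \le 1/m$. With an appropriate choice of $m$ we can now trust that the median of $|y_i|$ is in the desired $\varepsilon$-range of $\|x\|_p$ with high probability.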

Hence, Indyk’s algorithm works, but independently producing and storing all $mn$ elements of $\Pi$ is computationally costly. To invoke the definition of $p$-stable distributions for Eq. 1.1, we need the entries in each row to be independent from one another. The rows need to be pairwise independent for the calculation of variance to hold.
Let us assume $w_i = \sum_{j=1}^{n} Q_{ij} x_j$, where the $Q_{ij}$’s are $k$-wise independent $p$-stable distribution samples, and suppose that
$$E\left[I_{[a,b]}\left(\frac{w_i}{\|x\|_p}\right)\right] \approx_{\varepsilon} E\left[I_{[a,b]}\left(\frac{y_i}{\|x\|_p}\right)\right]. \tag{1.8}$$
If we can make this claim, then we can use $k$-wise independent samples in each row instead of fully independent samples to invoke the same arguments in the analysis above. This has been shown for $k = \Omega(1/\varepsilon^p)$ (Kane et al. 2010). With this technique, we can specify $\Pi$ using only $O(k \lg n)$ bits; across rows, we only need to use a 2-wise independent hash function that maps a row index to an $O(k \lg n)$-bit seed for the $k$-wise independent hash function.
Indyk’s approach for the $L_p$ norm is based on the property of the median. However, it is possible to construct estimators based on other quantiles, and they may even outperform the median estimator in terms of estimation accuracy. Since the improvement is marginal for our parameter settings, we stick to the median estimator.
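For $p = 1$ the median estimator can be tried out with Cauchy random variables, which are 1-stable and satisfy $\mathrm{median}(|Z|) = 1$. The sketch below is an idealized illustration only: instead of the $k$-wise independent construction discussed above, each entry of $\Pi$ is regenerated from a seeded pseudo-random generator, which is an assumption of this example, as are the names `cauchy_entry` and `L1Sketch`.

```python
import math
import random
from statistics import median


def cauchy_entry(row, col, seed=0):
    """Standard Cauchy sample for Pi[row][col], regenerated from an integer seed."""
    rng = random.Random((seed * 1_000_003 + row) * 1_000_003 + col)
    # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy for uniform U.
    return math.tan(math.pi * (rng.random() - 0.5))


class L1Sketch:
    """Maintain y = Pi x with Cauchy entries; estimate ||x||_1 by median |y_i|."""

    def __init__(self, m, seed=0):
        self.m, self.seed = m, seed
        self.y = [0.0] * m

    def update(self, j, delta):
        for i in range(self.m):
            self.y[i] += delta * cauchy_entry(i, j, self.seed)

    def estimate(self):
        return median(abs(v) for v in self.y)


if __name__ == "__main__":
    sketch = L1Sketch(m=400, seed=7)
    for j, delta in [(0, 3.0), (1, -4.0), (5, 10.0), (9, 2.5)]:
        sketch.update(j, delta)
    print(sketch.estimate())   # should be near ||x||_1 = 19.5
```

The memory used is the $m$ counters in `y`; the matrix itself is never materialized because every entry can be recomputed from its row, column and seed.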

1.8 Branching Program
Branching programs are built on directed acyclic graphs and work by starting at a source vertex, testing the values of the variables that each vertex is labeled with, and following the appropriate edge until a sink is reached, accepting or rejecting based on the identity of the sink. Here we picture the vertices other than the source as arranged in a grid, with one column per step and one row per memory state. The program starts at a source vertex which is not part of the grid. At each step, the program reads $S$ bits of input, reflecting the fact that space is bounded by $S$, and makes a decision about which vertex in the subsequent column of the grid to jump to. After $R$ steps, the last vertex visited by the program represents the outcome. The entire input, which can be represented as a length-$RS$ bit string, induces a distribution over the final states. Here we wish to generate the input string using fewer ($\ll RS$) random bits such that the original distribution over final states is well preserved. The following theorem addresses this idea.
Theorem 1.3 (Nisan 1992) There exists $h : \{0,1\}^t \to \{0,1\}^{RS}$ for $t = O(S \lg R)$ such that
$$\left| P_{x \sim U(\{0,1\}^{RS})}\left[f(B(x)) = 1\right] - P_{y \sim U(\{0,1\}^{t})}\left[f(B(h(y))) = 1\right] \right| \le \frac{1}{2^S} \tag{1.9}$$
for any branching program $B$ and any function $f : \{0,1\}^S \to \{0,1\}$.
The function $h$ can simulate the input to the branching program with only $t$ random bits such that it is almost impossible to discriminate the outcome of the simulated program from that of the original program.
Take a random sample $x$ from $\{0,1\}^S$ and place $x$ at the root. Repeat the following procedure to create a complete binary tree. At each vertex, create two children and copy the string over to the left child. For the right child, use a random 2-wise independent hash function $h_j : [2^S] \to [2^S]$ chosen for the corresponding level of the tree and record the result of the hash. Once we reach $\lg R$ levels, output the concatenation of all leaves, which is a length-$RS$ bit string. Since each hash function requires $O(S)$ random bits and there are $\lg R$ levels in the tree, this function uses $O(S \lg R)$ bits in total.
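The binary-tree construction just described can be written as a short recursion. The sketch below is illustrative only and makes a simplifying assumption: blocks are elements of $\mathbb{Z}_P$ for the prime $P = 2^{31} - 1$ rather than $S$-bit strings, so that the maps $z \mapsto (az + b) \bmod P$ form a pairwise-independent family. The function name `nisan_generator` and the representation of each hash as a pair `(a, b)` are choices made for this example.

```python
import random

P = 2**31 - 1  # prime modulus; blocks live in Z_P (an assumption of this sketch)


def nisan_generator(x, hashes):
    """Expand the seed block x into 2**len(hashes) blocks.

    hashes[j] = (a, b) describes the pairwise-independent map z -> (a*z + b) mod P
    used at one level of the tree; hashes[-1] is applied at the root.
    """
    if not hashes:
        return [x]                                    # a leaf outputs its block
    a, b = hashes[-1]
    rest = hashes[:-1]
    left = nisan_generator(x, rest)                   # left child copies the block
    right = nisan_generator((a * x + b) % P, rest)    # right child hashes it
    return left + right


if __name__ == "__main__":
    rng = random.Random(0)
    levels = 3
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(levels)]
    seed_block = rng.randrange(P)
    print(nisan_generator(seed_block, hashes))        # 2**3 = 8 output blocks
```

The seed consists of one block plus two field elements per level, so its length grows with the number of levels (that is, with $\lg R$) while the output has $R$ blocks.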

One way to simulate randomized computations with deterministic ones is to build a pseudorandom generator, namely, an efficiently computable function $g$ that can stretch a short uniformly random seed of $s$ bits into $n$ bits that cannot be distinguished from uniform ones by small-space machines. Once we have such a generator, we can obtain a deterministic computation by carrying out the computation for every fixed setting of the seed. If the seed is short enough, and the generator is efficient enough, this simulation remains efficient. We will use Nisan’s pseudorandom generator (PRG) to derandomize $\Pi$ in Indyk’s algorithm. Specifically, when the column indexed by $x$ is required, Nisan’s generator takes $x$ as the input and, together with the original seed, outputs a sequence of pseudorandom values. Consider the following procedure:
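1. Initialize $c_1 \leftarrow 0$, $c_2 \leftarrow 0$
2. For $i = 1, \ldots, m$:
   a. Initialize $y \leftarrow 0$
   b. For $j = 1, \ldots, n$:
      i. Update $y \leftarrow y + \pi_{ij} x_j$
   c. If $|y| \le (1+\varepsilon)\|x\|_p$, then increment $c_1$
   d. If $|y| \le (1-\varepsilon)\|x\|_p$, then increment $c_2$
This procedure uses $O(\lg n)$ bits and is a branching program that imitates the proof of correctness for Indyk’s algorithm. The algorithm succeeds if and only if at the end of the computation $c_1 > m/2$ and $c_2 < m/2$. The only source of randomness in this program is the $\pi_{ij}$’s.
We will apply Nisan’s PRG to generate these random numbers. We invoke Theorem 1.3 with the algorithm given above as $B$ and an indicator function checking whether the algorithm succeeded or not as $f$. Note that the space bound is $S = O(\lg n)$ and the number of steps taken by the program is $R = O(mn)$, or $O(n^2)$ since $m \le n$. This means we can fool the proof of correctness of Indyk’s algorithm by using $O(\lg^2 n)$ random bits to produce $\Pi$. Indyk’s algorithm uses $p$-stable distributions, which only exist for $p \in (0, 2]$. We shall now consider the case when $p > 2$.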

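Theorem 1.4 $n^{1-2/p}\,\mathrm{poly}(\lg n / \varepsilon)$ space is necessary and sufficient.
Details on nearly optimal lower bounds are discussed in (Bar-Yossef et al. 2004) and (Indyk and Woodruff 2005).
In this section we will discuss the algorithm of Andoni (Andoni 2012), which is based on (Andoni et al. 2011; Jowhari et al. 2011). We will focus on $\varepsilon = \Theta(1)$. In this algorithm, we let $\Pi = PD$. $P$ is an $m \times n$ matrix, where each column has a single non-zero element that is either $1$ or $-1$. $D$ is an $n \times n$ diagonal matrix with $d_{ii} = u_i^{-1/p}$, where $u_i \sim \mathrm{Exp}(1)$. That is to say,
$$P(u_i > t) = \begin{cases} 1 & t \le 0, \\ e^{-t} & t > 0. \end{cases}$$
So, same as in the $0 < p \le 2$ case, we will keep $y = \Pi x$, but we estimate $\|x\|_p$ with
$$\|y\|_\infty = \max_i |y_i|. \tag{1.10}$$
Theorem 1.5 $P\left[\frac{1}{4}\|x\|_p \le \|y\|_\infty \le 4\|x\|_p\right] \ge \frac{11}{20}$ for $m = \Theta(n^{1-2/p} \lg n)$.
Let $z = Dx$, which means $y = Pz$. To prove Theorem 1.5, we will begin by showing that $\|z\|_\infty$ delivers a good estimate and then prove that applying $P$ to $z$ maintains it.
Claim $P\left[\frac{1}{2}\|x\|_p \le \|z\|_\infty \le 2\|x\|_p\right] \ge \frac{3}{4}$.
Proof Let $q = \min\left\{\frac{u_1}{|x_1|^p}, \ldots, \frac{u_n}{|x_n|^p}\right\}$, so that $\|z\|_\infty = q^{-1/p}$. We have
$$P\{q > t\} = P\{\forall i,\; u_i > t|x_i|^p\} \tag{1.11}$$
$$= \prod_{i=1}^{n} e^{-t|x_i|^p} \tag{1.12}$$
$$= e^{-t\|x\|_p^p}, \tag{1.13}$$
which implies $q \sim \mathrm{Exp}(1) / \|x\|_p^p$. Thus,
$$P\left\{\tfrac{1}{2}\|x\|_p \le \|z\|_\infty \le 2\|x\|_p\right\} = P\left\{\tfrac{1}{2^p}\|x\|_p^{-p} \le q \le 2^p \|x\|_p^{-p}\right\} \tag{1.14}$$
$$= e^{-1/2^p} - e^{-2^p} \tag{1.15}$$
$$\ge \frac{3}{4} \tag{1.16}$$
for $p > 2$.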
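The claim about $\|Dx\|_\infty$ can be checked numerically. The following Monte Carlo sketch draws $D$ repeatedly, with $d_{ii} = u_i^{-1/p}$ and $u_i \sim \mathrm{Exp}(1)$, and counts how often $\|Dx\|_\infty$ falls within a factor 2 of $\|x\|_p$. The function names, the number of trials and the test vector are arbitrary choices made for this illustration.

```python
import random


def z_inf_over_norm(x, p, rng):
    """Return ||Dx||_inf / ||x||_p for one random draw of the diagonal matrix D."""
    norm_p = sum(abs(v) ** p for v in x) ** (1.0 / p)
    # d_ii = u_i ** (-1/p) with u_i ~ Exp(1), drawn independently per coordinate.
    z_inf = max(abs(v) * rng.expovariate(1.0) ** (-1.0 / p) for v in x)
    return z_inf / norm_p


def success_rate(x, p, trials=2000, seed=0):
    """Fraction of trials with 1/2 <= ||Dx||_inf / ||x||_p <= 2."""
    rng = random.Random(seed)
    hits = sum(0.5 <= z_inf_over_norm(x, p, rng) <= 2.0 for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    x = list(range(1, 51))
    print(success_rate(x, p=3))   # the analysis predicts about exp(-1/8) - exp(-8)
```

For $p = 3$ the derivation above gives a success probability of $e^{-1/8} - e^{-8}$, which is comfortably above the $3/4$ stated in the claim.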

The following claim establishes that if we could maintain $Q = Dx$ (the vector called $z$ above) instead of $y$, then we would have a better solution to our problem. However, we cannot store $Q$ in memory because it is $n$-dimensional and $n \gg m$. Thus we need to analyze $PQ \in \mathbb{R}^m$.
Claim Let $Q = Dx$. Then
$$P\left[\tfrac{1}{2}\|x\|_p \le \|Q\|_\infty \le 2\|x\|_p\right] \ge \frac{3}{4}.$$
Let us suppose each entry in $y$ is a sort of counter, and the matrix $P$ takes each entry in $Q$, hashes it to a random counter, and adds that entry of $Q$ times a random sign to the counter. There will be collisions because $n > m$ and there are only $m$ counters. These will cause different $Q_i$ to potentially cancel each other out or add together in a way that one might expect to cause problems. We shall show that there are very few large $Q_i$’s.
Interestingly, small $Q_i$’s and big $Q_i$’s might collide with each other. When we add the small $Q_i$’s, we multiply them with a random sign. So the expectation of the aggregate contributions of the small $Q_i$’s to each bucket is 0. We shall bound their variance as well, which will show that if they collide with big $Q_i$’s then with high probability this would not considerably change the affected counter. Ultimately, the maximal counter value (i.e., $\|y\|_\infty$) is close to the maximal $|Q_i|$ and so to $\|x\|_p$ with high probability.

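1.8.1 Light Indices and Bernstein’s Inequality
Bernstein’s inequality in probability theory is a more precise formulation of the classical Chebyshev inequality, proposed by S. N. Bernshtein in 1911; it permits one to estimate the probability of large deviations by a monotone decreasing exponential function. In order to analyse the light indices, we will use Bernstein’s inequality.
Theorem 1.6 (Bernstein’s inequality) Suppose $R_1, \ldots, R_n$ are independent, and for all $i$, $|R_i| \le K$, and $\mathrm{var}\left(\sum_i R_i\right) = \sigma^2$. Then for all $t > 0$,
$$P\left(\left|\sum_i R_i - E\sum_i R_i\right| > t\right) \lesssim e^{-ct^2/\sigma^2} + e^{-ct/K}.$$
We want to show that the light indices together will not distort the heavy indices. Let us parametrize $P$ as follows: choose a function $h : [n] \to [m]$ as well as a function $\sigma : [n] \to \{-1, 1\}$. Then,
$$P_{ij} = \begin{cases} \sigma(j) & \text{if } h(j) = i, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore, $h$ states which element of column $j$ to make non-zero, and $\sigma$ states which sign to use for column $j$.
Claim (light indices) With constant probability, for all $i \in [m]$,
$$\left|\sum_{j \in L : h(j) = i} \sigma(j) Q_j\right| < T/10,$$
where $L$ denotes the set of light indices.
If $y_i$ has no heavy indices then the magnitude of $y_i$ is much less than $T$, so it would not interfere with the estimate. If $y_i$ is assigned the maximal $Q_j$, then by the previous claim that is the only heavy index assigned to $y_i$. Therefore, all the light indices assigned to $y_i$ would not change it by more than $T/10$, and since $|Q_j|$ is within a factor of 2 of $T$, $y_i$ will still be within a constant multiplicative factor of $T$. If $y_i$ is assigned some other heavy index, then the corresponding $|Q_j|$ is less than $2T$ since it is less than the maximal $|Q_j|$. The claim then shows that $|y_i|$ will be at most $2.1T$.
Ultimately:
$$y_i = \sum_{j : h(j) = i} \sigma(j) Q_j = \sum_{j \in L : h(j) = i} \sigma(j) Q_j + \sigma(j_{\mathrm{heavy}})\, Q_{j_{\mathrm{heavy}}},$$
where the second term is added only if $y_i$ has a heavy index. By the triangle inequality,
$$|y_i| \in |Q_{j_{\mathrm{heavy}}}| \pm \left|\sum_{j \in L : h(j) = i} \sigma(j) Q_j\right| = |Q_{j_{\mathrm{heavy}}}| \pm T/10.$$
Applying this to the bucket containing the maximal $|Q_j|$ shows that this bucket of $y$ should hold at least $0.4T$. Furthermore, by a similar argument all other buckets should hold at most $2.1T$.
Proof Fix $i \in [m]$. Then for $j \in L$, define
$$\delta_j = \begin{cases} 1 & \text{if } h(j) = i, \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$\sum_{j \in L : h(j) = i} \sigma(j) Q_j = \sum_{j \in L} \delta_j \sigma(j) Q_j.$$
We will call the $j$th term of the sum $R_j$ and then use Bernstein’s inequality.
1. We have $E\left(\sum_j R_j\right) = 0$, since the $\sigma(j)$ represent random signs.
2. We also have a bound $K$ on $|R_j|$, since $|\delta_j| \le 1$, $|\sigma(j)| \le 1$, and we iterate over light indices only, for which $|Q_j|$ is small.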