Advanced Machine Learning with Python

Table of Contents
Advanced Machine Learning with Python
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What is advanced machine learning?
What should you expect from this book?
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Unsupervised Machine Learning
Principal component analysis
PCA – a primer
Employing PCA
Introducing k-means clustering
Clustering – a primer
Kick-starting clustering analysis
Tuning your clustering configurations
Self-organizing maps
SOM – a primer
Employing SOM
Further reading
Summary
2. Deep Belief Networks
Neural networks – a primer
The composition of a neural network
Network topologies
Restricted Boltzmann Machine
Introducing the RBM
Topology
Training
Applications of the RBM
Further applications of the RBM
Deep belief networks
Training a DBN
Applying the DBN
Validating the DBN
Further reading
Summary
3. Stacked Denoising Autoencoders
Autoencoders
Introducing the autoencoder
Topology
Training
Denoising autoencoders
Applying a dA
Stacked Denoising Autoencoders
Applying the SdA
Assessing SdA performance
Further reading
Summary
4. Convolutional Neural Networks
Introducing the CNN
Understanding the convnet topology
Understanding convolution layers
Understanding pooling layers
Training a convnet
Putting it all together
Applying a CNN
Further reading
Summary
5. Semi-Supervised Learning
Introduction
Understanding semi-supervised learning
Semi-supervised algorithms in action
Self-training
Implementing self-training
Finessing your self-training implementation
Improving the selection process
Contrastive Pessimistic Likelihood Estimation
Further reading
Summary
6. Text Feature Engineering
Introduction
Text feature engineering
Cleaning text data
Text cleaning with BeautifulSoup
Managing punctuation and tokenizing
Tagging and categorising words
Tagging with NLTK
Sequential tagging
Backoff tagging
Creating features from text data
Stemming
Bagging and random forests
Testing our prepared data
Further reading
Summary
7. Feature Engineering Part II
Introduction
Creating a feature set
Engineering features for ML applications
Using rescaling techniques to improve the learnability of features
Creating effective derived variables
Reinterpreting non-numeric features
Using feature selection techniques
Performing feature selection
Correlation
LASSO
Recursive Feature Elimination
Genetic models
Feature engineering in practice
Acquiring data via RESTful APIs
Testing the performance of our model
Twitter
Translink Twitter
Consumer comments
The Bing Traffic API
Deriving and selecting variables using feature engineering techniques
The weather API
Further reading
Summary
8. Ensemble Methods
Introducing ensembles
Understanding averaging ensembles
Using bagging algorithms
Using random forests
Applying boosting methods
Using XGBoost
Using stacking ensembles
Applying ensembles in practice
Using models in dynamic applications
Understanding model robustness
Identifying modeling risk factors
Strategies to managing model robustness
Further reading
Summary
9. Additional Python Machine Learning Tools
Alternative development tools
Introduction to Lasagne
Getting to know Lasagne
Introduction to TensorFlow
Getting to know TensorFlow
Using TensorFlow to iteratively improve our models
Knowing when to use these libraries
Further reading
Summary
A. Chapter Code Requirements
Index


Advanced Machine Learning with Python
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means, without the prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express or
implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for
any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
First published: July 2016
Production reference: 1220716
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-863-7
www.packtpub.com


Credits
Author
John Hearty
Reviewers
Jared Huffman
Ashwin Pajankar
Commissioning Editor
Akram Hussain
Acquisition Editor
Sonali Vernekar
Content Development Editor
Mayur Pawanikar
Technical Editor
Suwarna Patil
Copy Editor
Tasneem Fatehi
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta


About the Author
John Hearty is a consultant in digital industries with substantial expertise in data science and
infrastructure engineering. Having started out in mobile gaming, he was drawn to the challenge of AAA
console analytics.
Keen to start putting advanced machine learning techniques into practice, he signed on with Microsoft to
develop player modelling capabilities and big data infrastructure at an Xbox studio. His team made
significant strides in engineering and data science that were replicated across Microsoft Studios. Some of
the more rewarding initiatives he led included player skill modelling in asymmetrical games, and the
creation of player segmentation models for individualized game experiences.
Eventually John struck out on his own as a consultant offering comprehensive infrastructure and analytics
solutions for international client teams seeking new insights or data-driven capabilities. His favourite
current engagement involves creating predictive models and quantifying the importance of user
connections for a popular social network.
After years spent working with data, John is largely unable to stop asking questions. In his own time, he
routinely builds ML solutions in Python to fulfil a broad set of personal interests. These include a novel
variant on the StyleNet computational creativity algorithm and solutions for algo-trading and
geolocation-based recommendation. He currently lives in the UK.


About the Reviewers
Jared Huffman is a lifelong gamer and extreme data geek. After completing his bachelor's degree in
computer science, he started his career in his hometown of Melbourne, Florida. While there, he honed his
software development skills, including work on a credit card-processing system and a variety of web
tools. He finished it off with a fun contract working at NASA's Kennedy Space Center before migrating to
his current home in the Seattle area.
Diving head first into the world of data, he took up a role working on Microsoft's internal finance tools
and reporting systems. Feeling that he could no longer resist his love for video games, he joined the Xbox
division to build their Business. To date, Jared has helped ship and support 12 games and presented at
several events on various machine learning and other data topics. His latest endeavor has him applying
both his software skills and analytics expertise in leading the data science efforts for Minecraft. There he
gets to apply machine learning techniques, trying out fun and impactful projects, such as customer
segmentation models, churn prediction, and recommendation systems.
Outside of work, Jared spends much of his free time playing board games and video games with his
family and friends, as well as dabbling in occasional game development.
First I'd like to give a big thanks to John for giving me the honor of reviewing this book; it's been a great
learning experience. Second, thanks to my amazing wife, Kalen, for allowing me to repeatedly skip chores
to work on it. Last, and certainly not least, I'd like to thank God for providing me the opportunities to
work on things I love and still make a living doing it. Being able to wake up every day and create games
that bring joy to millions of players is truly a pleasure.
Ashwin Pajankar is a software professional and IoT enthusiast with more than 8 years of experience in
software design, development, testing, and automation.
He graduated from IIIT Hyderabad, earning an M.Tech in computer science and engineering. He holds
multiple professional certifications from Oracle, IBM, Teradata, and ISTQB in development, databases,
and testing. He has won several awards in college through outreach initiatives, at work for technical
achievements, and community service through corporate social responsibility programs.
He was introduced to Raspberry Pi while organizing a hackathon at his workplace, and has been hooked
on Pi ever since. He writes plenty of code in C, Bash, Python, and Java on his cluster of Pis. He's already
authored two books on Raspberry Pi and reviewed three other titles related to Python for Packt
Publishing.
I would like to thank my wife, Kavitha, for the motivation.



www.PacktPub.com


eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer,
you are entitled to a discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of
free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here,
you can search, access, and read Packt's entire library of books.


Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Of the many people I feel gratitude towards, I particularly want to thank my parents… mostly for
their patience. I'd like to extend thanks to Tyler Lowe for his invaluable friendship, to Mark
Huntley for his bothersome emphasis on accuracy, and to the former team at Lionhead Studios. I
also greatly value the excellent work done by Jared Huffman and the industrious editorial team at
Packt Publishing, who were hugely positive and supportive throughout the creation of this book.
Finally, I'd like to dedicate the work and words herein to you, the reader. There has never been a
better time to get to grips with the subjects of this book; the world is stuffed with new
opportunities that can be seized using creativity and an appropriate model. I hope for your every
success in the pursuit of those solutions.



Preface
Hello! Welcome to this guide to advanced machine learning using Python. It's possible that you've picked
this up with some initial interest, but aren't quite sure what to expect. In a nutshell, there has never been a
more exciting time to learn and use machine learning techniques, and working in the field is only getting
more rewarding. If you want to get up to speed with some of the more advanced data modeling techniques
and gain experience using them to solve challenging problems, this is a good book for you!


What is advanced machine learning?
Ongoing advances in computational power (per Moore's Law) have begun to make machine learning, once
mostly a research discipline, more viable in commercial contexts. This has caused an explosion of new
applications and new or rediscovered techniques, catapulting the obscure concepts of data science, AI,
and machine learning into the public consciousness and strategic planning of companies internationally.
The rapid development of machine learning applications is fueled by an ongoing struggle to continually
innovate, playing out at an array of research labs. The techniques developed by these pioneers are seeding
new application areas and experiencing growing public awareness. While some of the innovations sought
in AI and applied machine learning are still elusively far from readiness, others are a reality. Among them
are self-driving cars, sophisticated image recognition and altering capability, ever-greater strides in
genetics research, and perhaps most pervasively of all, increasingly tailored content in our digital stores,
e-mail inboxes, and online lives.
With all of these possibilities and more at the fingertips of the committed data scientist, the profession is
seeing meteoric, if clumsy, growth. Not only are there far more data scientists and AI practitioners now
than there were even two years ago (in early 2014), but the accessibility and openness around solutions at
the high end of machine learning research have increased.
Research teams at Google and Facebook began to share more and more of their architecture, languages,
models, and tools in the hope of seeing them applied and improved on by the growing data scientist
population.
The machine learning community matured enough to begin seeing trends as popular algorithms were
defined or rediscovered. To put this more accurately, pre-existing trends from a mainly research
community began to receive great attention from industry, with one product being a group of machine
learning experts straddling industry and academia. Another product, the subject of this section, is a
growing awareness of advanced algorithms that can be used to crack the frontier problems of the current
day. From month to month, we see new advances made, scores rise, and the frontier moves ever further
out.
What all of this means is that there may never have been a better time to move into the field of data
science and develop your machine learning skillset. The introductory algorithms (including clustering,
regression models, and neural network architectures) and tools are widely covered in web courses and
blog content. While the techniques at the cutting edge of data science (including deep learning,
semi-supervised algorithms, and ensembles) remain less accessible, the techniques themselves are now
available through software libraries in multiple languages. All that's needed is the combination of
theoretical knowledge and practical guidance to implement models correctly. That is the requirement that
this book was written to address.


What should you expect from this book?
You've begun to read a book that focuses on teaching some of the advanced modeling techniques that have
emerged in recent years. This book is aimed at anyone who wants to learn about those algorithms,
whether you're an experienced data scientist or a developer looking to parlay existing skills into a new
environment.
I aimed first and foremost at making sure that you understand the algorithms in question. Some of them are
fairly tricky and tie into other concepts in statistics and machine learning.
For neophyte readers, I definitely recommend gathering an initial understanding of key concepts, including
the following:
Neural network architectures including the MLP architecture
Learning method components including gradient descent and backpropagation
Network performance measures, for example, root mean squared error
K-means clustering
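As a quick refresher on one of the performance measures named above, root mean squared error is the square root of the mean squared difference between predictions and targets. The following is only a minimal sketch: the function name rmse and the sample values are assumptions made for illustration, not code from this book.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences (hypothetical helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: two exact predictions and one that is off by 2
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Lower values indicate predictions that sit closer to the targets, and a value of 0 means every prediction matched exactly.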
At times, this book won't be able to give a subject the attention that it deserves. We cover a lot of ground
in this book and the pace is fairly brisk as a result! At the end of each chapter, I refer you to further
reading, in a book or online article, so that you can build a broader base of relevant knowledge. I'd
suggest that it's worth doing additional reading around any unfamiliar concept that comes up as you work
through this book, as machine learning knowledge tends to tie together synergistically; the more you have,
the more readily you'll understand new concepts as you expand your toolkit.
This concept of expanding a toolkit of skills is fundamental to what I've tried to achieve with this book.
Each chapter introduces one or multiple algorithms and looks to achieve several goals:
Explaining at a high level what the algorithm does, what problems it'll solve well, and how you should expect to apply it
Walking through key components of the algorithm, including topology, learning method, and performance measurement
Identifying how to improve performance by reviewing model output
Beyond the transfer of knowledge and practical skills, this book looks to achieve a more important goal;
specifically, to discuss and convey some of the qualities that are common to skilled machine learning
practitioners. These include creativity, demonstrated both in the definition of sophisticated architectures
and problem-specific cleaning techniques. Rigor is another key quality, emphasized throughout this book
by a focus on measuring performance against meaningful targets and critically assessing early efforts.
Finally, this book makes no effort to obscure the realities of working on solving data challenges: the
mixed results of early trials, large iteration counts, and frequent impasses. Yet at the same time, using a
mixture of toy examples, dissection of expert approaches and, toward the end of the book, more
real-world challenges, we show how a creative, tenacious, and rigorous approach can break down these
barriers and deliver meaningful results.


As we proceed, I wish you the best of luck and encourage you to enjoy yourself as you go, tackling the
content prepared for you and applying what you've learned to new domains or data.
Let's get started!


What this book covers
Chapter 1, Unsupervised Machine Learning, shows you how to apply unsupervised learning techniques
to identify patterns and structure within datasets.
Chapter 2, Deep Belief Networks, explains how the RBM and DBN algorithms work; you'll know how to
use them and will feel confident in your ability to improve the quality of the results that you get out of
them.
Chapter 3, Stacked Denoising Autoencoders, continues to build our skill with deep architectures by
applying stacked denoising autoencoders to learn feature representations for high-dimensional input data.
Chapter 4, Convolutional Neural Networks, shows you how to apply the convolutional neural network
(or Convnet).
Chapter 5, Semi-Supervised Learning, explains how to apply several semi-supervised learning
techniques, including CPLE, self-learning, and S3VM.
Chapter 6, Text Feature Engineering, discusses data preparation skills that significantly increase the
effectiveness of all the models that we've previously discussed.
Chapter 7, Feature Engineering Part II, shows you how to interrogate the data to weed out or mitigate
quality issues, transform it into forms that are conducive to machine learning, and creatively enhance that
data.
Chapter 8, Ensemble Methods, looks at building more sophisticated model ensembles and methods of
building robustness into your model solutions.
Chapter 9, Additional Python Machine Learning Tools, reviews some of the best recent tools
available to data scientists, identifies the benefits that they offer, and discusses how to apply them
alongside tools and techniques discussed earlier in this book, within a consistent working process.
Appendix A, Chapter Code Requirements, discusses tool requirements for the book, identifying required
libraries for each chapter.


What you need for this book
The entirety of this book's content leverages openly available data and code, including open source
Python libraries and frameworks. While each chapter's example code is accompanied by a README file
documenting all the libraries required to run the code provided in that chapter's accompanying scripts, the
content of these files is collated here for your convenience.
It is recommended that some libraries required for earlier chapters be available when working with code
from any later chapter. These requirements are identified using bold text. Particularly, it is important to set
up the first chapter's required libraries for any content later in the book.



Who this book is for
This title is for Python developers and analysts or data scientists who are looking to add to their existing
skills by accessing some of the most powerful recent trends in data science. If you've ever considered
building your own image or text-tagging solution or entering a Kaggle contest, for instance, this book is
for you!
Prior experience of Python and grounding in some of the core concepts of machine learning would be
helpful.


Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information.
Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy
URLs, user input, and Twitter handles are shown as follows: "We will begin applying PCA to the
handwritten digits dataset with the following code."
A block of code is set as follows:
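import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# The original sklearn.lda module is no longer present in current scikit-learn
# releases; LDA is provided by sklearn.discriminant_analysis instead.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import matplotlib.cm as cm

# Load the handwritten digits dataset and record its basic dimensions
digits = load_digits()
data = digits.data
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target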


Any command-line input or output is written as follows:
[0.39276606 0.49571292 0.43933243 0.53573558 0.42459285
0.55686854 0.4573401 0.49876358 0.50281585 0.4689295]
0.4772857426

Note
Warnings or important notes appear in a box like this.

Tip
Tips and tricks appear like this.


Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you
liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get
the most out of.
To send us general feedback, simply e-mail us and mention the book's title in
the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a
book, see our author guide at www.packtpub.com/authors.


Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most
from your purchase.

