Building Machine Learning Systems with Python
Second Edition


Table of Contents
Building Machine Learning Systems with Python Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Python Machine Learning
Machine learning and Python – a dream team
What the book will teach you (and what it will not)
What to do when you are stuck
Getting started
Introduction to NumPy, SciPy, and matplotlib
Installing Python
Chewing data efficiently with NumPy and intelligently with SciPy
Learning NumPy
Indexing
Handling nonexisting values
Comparing the runtime
Learning SciPy
Our first (tiny) application of machine learning
Reading in the data
Preprocessing and cleaning the data
Choosing the right model and learning algorithm
Before building our first model…
Starting with a simple straight line
Towards some advanced stuff
Stepping back to go forward – another look at our data
Training and testing
Answering our initial question
Summary
2. Classifying with Real-world Examples
The Iris dataset
Visualization is a good first step
Building our first classification model
Evaluation – holding out data and cross-validation
Building more complex classifiers
A more complex dataset and a more complex classifier
Learning about the Seeds dataset
Features and feature engineering
Nearest neighbor classification
Classifying with scikit-learn
Looking at the decision boundaries
Binary and multiclass classification
Summary
3. Clustering – Finding Related Posts
Measuring the relatedness of posts
How not to do it
How to do it
Preprocessing – similarity measured as a similar number of common words
Converting raw text into a bag of words
Counting words
Normalizing word count vectors
Removing less important words
Stemming
Installing and using NLTK
Extending the vectorizer with NLTK’s stemmer
Stop words on steroids
Our achievements and goals
Clustering
K-means
Getting test data to evaluate our ideas on
Clustering posts
Solving our initial challenge
Another look at noise
Tweaking the parameters
Summary
4. Topic Modeling
Latent Dirichlet allocation
Building a topic model
Comparing documents by topics
Modeling the whole of Wikipedia
Choosing the number of topics
Summary
5. Classification – Detecting Poor Answers
Sketching our roadmap
Learning to classify classy answers
Tuning the instance
Tuning the classifier
Fetching the data
Slimming the data down to chewable chunks
Preselection and processing of attributes
Defining what is a good answer
Creating our first classifier
Starting with kNN
Engineering the features
Training the classifier
Measuring the classifier’s performance
Designing more features
Deciding how to improve
Bias-variance and their tradeoff
Fixing high bias
Fixing high variance
High bias or low bias
Using logistic regression
A bit of math with a small example
Applying logistic regression to our post classification problem
Looking behind accuracy – precision and recall
Slimming the classifier
Ship it!
Summary
6. Classification II – Sentiment Analysis
Sketching our roadmap
Fetching the Twitter data
Introducing the Naïve Bayes classifier
Getting to know the Bayes’ theorem
Being naïve
Using Naïve Bayes to classify
Accounting for unseen words and other oddities
Accounting for arithmetic underflows
Creating our first classifier and tuning it
Solving an easy problem first
Using all classes
Tuning the classifier’s parameters
Cleaning tweets
Taking the word types into account
Determining the word types
Successfully cheating using SentiWordNet
Our first estimator
Putting everything together
Summary
7. Regression
Predicting house prices with regression
Multidimensional regression
Cross-validation for regression
Penalized or regularized regression
L1 and L2 penalties
Using Lasso or ElasticNet in scikit-learn
Visualizing the Lasso path
P-greater-than-N scenarios
An example based on text documents
Setting hyperparameters in a principled way
Summary
8. Recommendations
Rating predictions and recommendations
Splitting into training and testing
Normalizing the training data
A neighborhood approach to recommendations
A regression approach to recommendations
Combining multiple methods
Basket analysis
Obtaining useful predictions
Analyzing supermarket shopping baskets
Association rule mining
More advanced basket analysis
Summary
9. Classification – Music Genre Classification
Sketching our roadmap
Fetching the music data
Converting into a WAV format
Looking at music
Decomposing music into sine wave components
Using FFT to build our first classifier
Increasing experimentation agility
Training the classifier
Using a confusion matrix to measure accuracy in multiclass problems
An alternative way to measure classifier performance using receiver-operator characteristics
Improving classification performance with Mel Frequency Cepstral Coefficients
Summary
10. Computer Vision
Introducing image processing
Loading and displaying images
Thresholding
Gaussian blurring
Putting the center in focus
Basic image classification
Computing features from images
Writing your own features
Using features to find similar images
Classifying a harder dataset
Local feature representations
Summary


11. Dimensionality Reduction
Sketching our roadmap
Selecting features
Detecting redundant features using filters
Correlation
Mutual information
Asking the model about the features using wrappers
Other feature selection methods
Feature extraction
About principal component analysis
Sketching PCA
Applying PCA
Limitations of PCA and how LDA can help
Multidimensional scaling
Summary
12. Bigger Data
Learning about big data
Using jug to break up your pipeline into tasks
An introduction to tasks in jug
Looking under the hood
Using jug for data analysis
Reusing partial results
Using Amazon Web Services
Creating your first virtual machines
Installing Python packages on Amazon Linux
Running jug on our cloud machine
Automating the generation of clusters with StarCluster
Summary
A. Where to Learn More Machine Learning
Online courses
Books
Question and answer sites
Blogs
Data sources
Getting competitive
All that was left out
Summary
Index



Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Second edition: March 2015
Production reference: 1230315
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-277-2
www.packtpub.com



Credits
Authors
Luis Pedro Coelho
Willi Richert
Reviewers
Matthieu Brucher
Maurice HT Ling
Radim Řehůřek
Commissioning Editor
Kartikey Pandey
Acquisition Editors
Greg Wild
Richard Harvey
Kartikey Pandey
Content Development Editor
Arun Nadar
Technical Editor
Pankaj Kadam
Copy Editors
Relin Hedly
Sameen Siddiqui
Laxmi Subramanian
Project Coordinator
Nikhil Nair
Proofreaders
Simran Bhogal
Lawrence A. Herman
Linda Morris
Paul Hindle
Indexer
Hemangini Bari
Graphics
Sheetal Aute
Abhinash Sahu
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta



About the Authors
Luis Pedro Coelho is a computational biologist: someone who uses computers as a tool to
understand biological systems. In particular, Luis analyzes DNA from microbial
communities to characterize their behavior. Luis has also worked extensively in bioimage
informatics, the application of machine learning techniques for the analysis of images of
biological specimens. His main focus is on the processing and integration of large-scale
datasets.
Luis has a PhD from Carnegie Mellon University, one of the leading universities in the
world in the area of machine learning. He is the author of several scientific publications.
Luis started developing open source software in 1998 as a way to apply real code to what
he was learning in his computer science courses at the Technical University of Lisbon. In
2004, he started developing in Python and has contributed to several open source libraries
in this language. He is the lead developer of mahotas, the popular computer vision package
for Python, as well as a contributor of several machine learning codes.
Luis currently divides his time between Luxembourg and Heidelberg.
I thank my wife, Rita, for all her love and support and my daughter, Anna, for being the
best thing ever.
Willi Richert has a PhD in machine learning/robotics, where he used reinforcement
learning, hidden Markov models, and Bayesian networks to let heterogeneous robots learn
by imitation. Currently, he works for Microsoft in the Core Relevance Team of Bing,
where he is involved in a variety of ML areas such as active learning, statistical machine
translation, and growing decision trees.
This book would not have been possible without the support of my wife, Natalie, and my
sons, Linus and Moritz. I am especially grateful for the many fruitful discussions with my
current or previous managers, Andreas Bode, Clemens Marschner, Hongyan Zhou, and
Eric Crestan, as well as my colleagues and friends, Tomasz Marciniak, Cristian Eigel,
Oliver Niehoerster, and Philipp Adelt. The interesting ideas are most likely from them; the
bugs belong to me.



About the Reviewers
Matthieu Brucher holds an engineering degree from the Ecole Supérieure d’Electricité
(Information, Signals, Measures), France and has a PhD in unsupervised manifold
learning from the Université de Strasbourg, France. He currently holds an HPC software
developer position in an oil company and is working on the next generation reservoir
simulation.
Maurice HT Ling has been programming in Python since 2003. Having completed his
PhD in Bioinformatics and BSc (Hons.) in Molecular and Cell Biology from The
University of Melbourne, he is currently a Research Fellow at Nanyang Technological
University, Singapore, and an Honorary Fellow at The University of Melbourne, Australia.
Maurice is the Chief Editor for Computational and Mathematical Biology, and co-editor
for The Python Papers. Recently, Maurice cofounded the first synthetic biology start-up in
Singapore, AdvanceSyn Pte. Ltd., as the Director and Chief Technology Officer. His
research interests lie in life (biological life, artificial life, and artificial intelligence),
using computer science and statistics as tools to understand life and its numerous aspects.
In his free time, Maurice likes to read, enjoy a cup of coffee, write his personal journal, or
philosophize on various aspects of life. His website and LinkedIn profile are
and respectively.
Radim Řehůřek is a tech geek and developer at heart. He founded and led the research
department at Seznam.cz, a major search engine company in central Europe. After
finishing his PhD, he decided to move on and spread the machine learning love, starting
his own privately owned R&D company, RaRe Consulting Ltd. RaRe specializes in
made-to-measure data mining solutions, delivering cutting-edge systems for clients ranging
from large multinationals to nascent start-ups.
Radim is also the author of a number of popular open source projects, including gensim
and smart_open.
A big fan of experiencing different cultures, Radim has lived around the globe with his
wife for the past decade, with his next steps leading to South Korea. No matter where he
stays, Radim and his team always try to evangelize data-driven solutions and help
companies worldwide make the most of their machine learning opportunities.



www.PacktPub.com


Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as
a print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at <> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt books
and eBooks.

Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital
book library. Here, you can search, access, and read Packt’s entire library of books.

