Building Machine Learning Systems with Python Second Edition
Table of Contents
Building Machine Learning Systems with Python Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Python Machine Learning
Machine learning and Python – a dream team
What the book will teach you (and what it will not)
What to do when you are stuck
Getting started
Introduction to NumPy, SciPy, and matplotlib
Installing Python
Chewing data efficiently with NumPy and intelligently with SciPy
Learning NumPy
Indexing
Handling nonexisting values
Comparing the runtime
Learning SciPy
Our first (tiny) application of machine learning
Reading in the data
Preprocessing and cleaning the data
Choosing the right model and learning algorithm
Before building our first model…
Starting with a simple straight line
Towards some advanced stuff
Stepping back to go forward – another look at our data
Training and testing
Answering our initial question
Summary
2. Classifying with Real-world Examples
The Iris dataset
Visualization is a good first step
Building our first classification model
Evaluation – holding out data and cross-validation
Building more complex classifiers
A more complex dataset and a more complex classifier
Learning about the Seeds dataset
Features and feature engineering
Nearest neighbor classification
Classifying with scikit-learn
Looking at the decision boundaries
Binary and multiclass classification
Summary
3. Clustering – Finding Related Posts
Measuring the relatedness of posts
How not to do it
How to do it
Preprocessing – similarity measured as a similar number of common words
Converting raw text into a bag of words
Counting words
Normalizing word count vectors
Removing less important words
Stemming
Installing and using NLTK
Extending the vectorizer with NLTK’s stemmer
Stop words on steroids
Our achievements and goals
Clustering
K-means
Getting test data to evaluate our ideas on
Clustering posts
Solving our initial challenge
Another look at noise
Tweaking the parameters
Summary
4. Topic Modeling
Latent Dirichlet allocation
Building a topic model
Comparing documents by topics
Modeling the whole of Wikipedia
Choosing the number of topics
Summary
5. Classification – Detecting Poor Answers
Sketching our roadmap
Learning to classify classy answers
Tuning the instance
Tuning the classifier
Fetching the data
Slimming the data down to chewable chunks
Preselection and processing of attributes
Defining what is a good answer
Creating our first classifier
Starting with kNN
Engineering the features
Training the classifier
Measuring the classifier’s performance
Designing more features
Deciding how to improve
Bias-variance and their tradeoff
Fixing high bias
Fixing high variance
High bias or low bias
Using logistic regression
A bit of math with a small example
Applying logistic regression to our post classification problem
Looking behind accuracy – precision and recall
Slimming the classifier
Ship it!
Summary
6. Classification II – Sentiment Analysis
Sketching our roadmap
Fetching the Twitter data
Introducing the Naïve Bayes classifier
Getting to know the Bayes’ theorem
Being naïve
Using Naïve Bayes to classify
Accounting for unseen words and other oddities
Accounting for arithmetic underflows
Creating our first classifier and tuning it
Solving an easy problem first
Using all classes
Tuning the classifier’s parameters
Cleaning tweets
Taking the word types into account
Determining the word types
Successfully cheating using SentiWordNet
Our first estimator
Putting everything together
Summary
7. Regression
Predicting house prices with regression
Multidimensional regression
Cross-validation for regression
Penalized or regularized regression
L1 and L2 penalties
Using Lasso or ElasticNet in scikit-learn
Visualizing the Lasso path
P-greater-than-N scenarios
An example based on text documents
Setting hyperparameters in a principled way
Summary
8. Recommendations
Rating predictions and recommendations
Splitting into training and testing
Normalizing the training data
A neighborhood approach to recommendations
A regression approach to recommendations
Combining multiple methods
Basket analysis
Obtaining useful predictions
Analyzing supermarket shopping baskets
Association rule mining
More advanced basket analysis
Summary
9. Classification – Music Genre Classification
Sketching our roadmap
Fetching the music data
Converting into a WAV format
Looking at music
Decomposing music into sine wave components
Using FFT to build our first classifier
Increasing experimentation agility
Training the classifier
Using a confusion matrix to measure accuracy in multiclass problems
An alternative way to measure classifier performance using receiver-operator characteristics
Improving classification performance with Mel Frequency Cepstral Coefficients
Summary
10. Computer Vision
Introducing image processing
Loading and displaying images
Thresholding
Gaussian blurring
Putting the center in focus
Basic image classification
Computing features from images
Writing your own features
Using features to find similar images
Classifying a harder dataset
Local feature representations
Summary
11. Dimensionality Reduction
Sketching our roadmap
Selecting features
Detecting redundant features using filters
Correlation
Mutual information
Asking the model about the features using wrappers
Other feature selection methods
Feature extraction
About principal component analysis
Sketching PCA
Applying PCA
Limitations of PCA and how LDA can help
Multidimensional scaling
Summary
12. Bigger Data
Learning about big data
Using jug to break up your pipeline into tasks
An introduction to tasks in jug
Looking under the hood
Using jug for data analysis
Reusing partial results
Using Amazon Web Services
Creating your first virtual machines
Installing Python packages on Amazon Linux
Running jug on our cloud machine
Automating the generation of clusters with StarCluster
Summary
A. Where to Learn More Machine Learning
Online courses
Books
Question and answer sites
Blogs
Data sources
Getting competitive
All that was left out
Summary
Index
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Second edition: March 2015
Production reference: 1230315
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-277-2
www.packtpub.com
Credits
Authors
Luis Pedro Coelho
Willi Richert
Reviewers
Matthieu Brucher
Maurice HT Ling
Radim Řehůřek
Commissioning Editor
Kartikey Pandey
Acquisition Editors
Greg Wild
Richard Harvey
Kartikey Pandey
Content Development Editor
Arun Nadar
Technical Editor
Pankaj Kadam
Copy Editors
Relin Hedly
Sameen Siddiqui
Laxmi Subramanian
Project Coordinator
Nikhil Nair
Proofreaders
Simran Bhogal
Lawrence A. Herman
Linda Morris
Paul Hindle
Indexer
Hemangini Bari
Graphics
Sheetal Aute
Abhinash Sahu
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Authors
Luis Pedro Coelho is a computational biologist: someone who uses computers as a tool to
understand biological systems. In particular, Luis analyzes DNA from microbial
communities to characterize their behavior. Luis has also worked extensively in bioimage
informatics, the application of machine learning techniques for the analysis of images of
biological specimens. His main focus is on the processing and integration of large-scale
datasets.
Luis has a PhD from Carnegie Mellon University, one of the leading universities in the
world in the area of machine learning. He is the author of several scientific publications.
Luis started developing open source software in 1998 as a way to apply real code to what
he was learning in his computer science courses at the Technical University of Lisbon. In
2004, he started developing in Python and has contributed to several open source libraries
in this language. He is the lead developer of mahotas, the popular computer vision package
for Python, as well as a contributor to several machine learning codes.
Luis currently divides his time between Luxembourg and Heidelberg.
I thank my wife, Rita, for all her love and support and my daughter, Anna, for being the
best thing ever.
Willi Richert has a PhD in machine learning/robotics, where he used reinforcement
learning, hidden Markov models, and Bayesian networks to let heterogeneous robots learn
by imitation. Currently, he works for Microsoft in the Core Relevance Team of Bing,
where he is involved in a variety of ML areas such as active learning, statistical machine
translation, and growing decision trees.
This book would not have been possible without the support of my wife, Natalie, and my
sons, Linus and Moritz. I am especially grateful for the many fruitful discussions with my
current or previous managers, Andreas Bode, Clemens Marschner, Hongyan Zhou, and
Eric Crestan, as well as my colleagues and friends, Tomasz Marciniak, Cristian Eigel,
Oliver Niehoerster, and Philipp Adelt. The interesting ideas are most likely from them; the
bugs belong to me.
About the Reviewers
Matthieu Brucher holds an engineering degree from the Ecole Supérieure d’Electricité
(Information, Signals, Measures), France, and has a PhD in unsupervised manifold
learning from the Université de Strasbourg, France. He currently holds an HPC software
developer position in an oil company and is working on the next generation reservoir
simulation.
Maurice HT Ling has been programming in Python since 2003. Having completed his
PhD in Bioinformatics and BSc (Hons.) in Molecular and Cell Biology from The
University of Melbourne, he is currently a Research Fellow at Nanyang Technological
University, Singapore, and an Honorary Fellow at The University of Melbourne, Australia.
Maurice is the Chief Editor for Computational and Mathematical Biology, and co-editor
for The Python Papers. Recently, Maurice cofounded the first synthetic biology start-up in
Singapore, AdvanceSyn Pte. Ltd., as the Director and Chief Technology Officer. His
research interests lie in life (biological life, artificial life, and artificial intelligence),
using computer science and statistics as tools to understand life and its numerous aspects.
In his free time, Maurice likes to read, enjoy a cup of coffee, write his personal journal, or
philosophize on various aspects of life.
Radim Řehůřek is a tech geek and developer at heart. He founded and led the research
department at Seznam.cz, a major search engine company in central Europe. After
finishing his PhD, he decided to move on and spread the machine learning love, starting
his own privately owned R&D company, RaRe Consulting Ltd. RaRe specializes in
made-to-measure data mining solutions, delivering cutting-edge systems for clients ranging from
large multinationals to nascent start-ups.
Radim is also the author of a number of popular open source projects, including gensim
and smart_open.
A big fan of experiencing different cultures, Radim has lived around the globe with his
wife for the past decade, with his next steps leading to South Korea. No matter where he
stays, Radim and his team always try to evangelize data-driven solutions and help
companies worldwide make the most of their machine learning opportunities.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as
a print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters, and receive exclusive discounts and offers on Packt books
and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital
book library. Here, you can search, access, and read Packt’s entire library of books.