Data Mining and Predictive Analytics (Larose, Larose), 2015-03-16


Table of Contents
Cover
Series
Title Page
Copyright
Dedication
Preface
What is Data Mining? What is Predictive Analytics?
Why is this Book Needed?
Who Will Benefit from this Book?
Danger! Data Mining is Easy to do Badly
“White-Box” Approach
Algorithm Walk-Throughs
Exciting New Topics
The R Zone
Appendix: Data Summarization and Visualization
The Case Study: Bringing it all Together
How the Book is Structured
The Software
Weka: The Open-Source Alternative
The Companion Web Site: www.dataminingconsultant.com
Data Mining and Predictive Analytics as a Textbook
Acknowledgments
Daniel's Acknowledgments
Chantal's Acknowledgments
Part I: Data Preparation
Chapter 1: An Introduction to Data Mining and Predictive Analytics
1.1 What is Data Mining? What Is Predictive Analytics?
1.2 Wanted: Data Miners
1.3 The Need For Human Direction of Data Mining
1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM
1.5 Fallacies of Data Mining
1.6 What Tasks can Data Mining Accomplish
The R Zone
R References
Exercises
Chapter 2: Data Preprocessing
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min–Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are not Useful
2.19 Variables that Should Probably not be Removed
2.20 Removal of Duplicate Records
2.21 A Word About ID Fields
The R Zone
R Reference
Exercises
Chapter 3: Exploratory Data Analysis
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know The Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary of Our EDA
The R Zone
R References
Exercises
Chapter 4: Dimension-Reduction Methods
4.1 Need for Dimension-Reduction in Data Mining
4.2 Principal Components Analysis
4.3 Applying PCA to the Houses Data Set
4.4 How Many Components Should We Extract?
4.5 Profiling the Principal Components
4.6 Communalities
4.7 Validation of the Principal Components
4.8 Factor Analysis
4.9 Applying Factor Analysis to the Adult Data Set
4.10 Factor Rotation
4.11 User-Defined Composites
4.12 An Example of a User-Defined Composite
The R Zone
R References
Exercises
Part II: Statistical Analysis
Chapter 5: Univariate Statistical Analysis
5.1 Data Mining Tasks in Discovering Knowledge in Data
5.2 Statistical Approaches to Estimation and Prediction
5.3 Statistical Inference
5.4 How Confident are We in Our Estimates?
5.5 Confidence Interval Estimation of the Mean
5.6 How to Reduce the Margin of Error
5.7 Confidence Interval Estimation of the Proportion
5.8 Hypothesis Testing for the Mean
5.9 Assessing The Strength of Evidence Against The Null Hypothesis
5.10 Using Confidence Intervals to Perform Hypothesis Tests
5.11 Hypothesis Testing for The Proportion
Reference
The R Zone
R Reference
Exercises
Chapter 6: Multivariate Statistics
6.1 Two-Sample t-Test for Difference in Means
6.2 Two-Sample Z-Test for Difference in Proportions
6.3 Test for the Homogeneity of Proportions
6.4 Chi-Square Test for Goodness of Fit of Multinomial Data
6.5 Analysis of Variance
Reference
The R Zone
R Reference
Exercises
Chapter 7: Preparing to Model the Data
7.1 Supervised Versus Unsupervised Methods
7.2 Statistical Methodology and Data Mining Methodology
7.3 Cross-Validation
7.4 Overfitting
7.5 Bias–Variance Trade-Off
7.6 Balancing The Training Data Set
7.7 Establishing Baseline Performance
The R Zone
R Reference
Exercises
Chapter 8: Simple Linear Regression
8.1 An Example of Simple Linear Regression
8.2 Dangers of Extrapolation
8.3 How Useful is the Regression? The Coefficient of Determination, r²
8.4 Standard Error of the Estimate, s
8.5 Correlation Coefficient
8.6 ANOVA Table for Simple Linear Regression
8.7 Outliers, High Leverage Points, and Influential Observations
8.8 Population Regression Equation
8.9 Verifying The Regression Assumptions
8.10 Inference in Regression
8.11 t-Test for the Relationship Between x and y
8.12 Confidence Interval for the Slope of the Regression Line
8.13 Confidence Interval for the Correlation Coefficient ρ
8.14 Confidence Interval for the Mean Value of y Given x
8.15 Prediction Interval for a Randomly Chosen Value of y Given x
8.16 Transformations to Achieve Linearity
8.17 Box–Cox Transformations
The R Zone
R References
Exercises
Chapter 9: Multiple Regression and Model Building
9.1 An Example of Multiple Regression
9.2 The Population Multiple Regression Equation
9.3 Inference in Multiple Regression
9.4 Regression With Categorical Predictors, Using Indicator Variables
9.5 Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
9.6 Sequential Sums of Squares
9.7 Multicollinearity
9.8 Variable Selection Methods
9.9 Gas Mileage Data Set
9.10 An Application of Variable Selection Methods
9.11 Using the Principal Components as Predictors in Multiple Regression
The R Zone
R References
Exercises
Part III: Classification
Chapter 10: k-Nearest Neighbor Algorithm
10.1 Classification Task
10.2 k-Nearest Neighbor Algorithm
10.3 Distance Function
10.4 Combination Function
10.5 Quantifying Attribute Relevance: Stretching the Axes
10.6 Database Considerations
10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
10.8 Choosing k
10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
The R Zone
R References
Exercises
Chapter 11: Decision Trees
11.1 What is a Decision Tree?
11.2 Requirements for Using Decision Trees
11.3 Classification and Regression Trees
11.4 C4.5 Algorithm
11.5 Decision Rules
11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
The R Zone
R References
Exercises
Chapter 12: Neural Networks
12.1 Input and Output Encoding
12.2 Neural Networks for Estimation and Prediction
12.3 Simple Example of a Neural Network
12.4 Sigmoid Activation Function
12.5 Back-Propagation
12.6 Gradient-Descent Method
12.7 Back-Propagation Rules
12.8 Example of Back-Propagation
12.9 Termination Criteria
12.10 Learning Rate
12.11 Momentum Term
12.12 Sensitivity Analysis
12.13 Application of Neural Network Modeling
The R Zone
R References
Exercises
Chapter 13: Logistic Regression
13.1 Simple Example of Logistic Regression
13.2 Maximum Likelihood Estimation
13.3 Interpreting Logistic Regression Output
13.4 Inference: Are the Predictors Significant?
13.5 Odds Ratio and Relative Risk
13.6 Interpreting Logistic Regression for a Dichotomous Predictor
13.7 Interpreting Logistic Regression for a Polychotomous Predictor
13.8 Interpreting Logistic Regression for a Continuous Predictor
13.9 Assumption of Linearity
13.10 Zero-Cell Problem
13.11 Multiple Logistic Regression
13.12 Introducing Higher Order Terms to Handle Nonlinearity
13.13 Validating the Logistic Regression Model
13.14 WEKA: Hands-On Analysis Using Logistic Regression
The R Zone
R References
Exercises
Chapter 14: Naïve Bayes and Bayesian Networks
14.1 Bayesian Approach
14.2 Maximum A Posteriori (MAP) Classification
14.3 Posterior Odds Ratio
14.4 Balancing The Data
14.5 Naïve Bayes Classification
14.6 Interpreting The Log Posterior Odds Ratio
14.7 Zero-Cell Problem
14.8 Numeric Predictors for Naïve Bayes Classification
14.9 WEKA: Hands-on Analysis Using Naïve Bayes
14.10 Bayesian Belief Networks
14.11 Clothing Purchase Example
14.12 Using The Bayesian Network to Find Probabilities
The R Zone
R References
Exercises
Chapter 15: Model Evaluation Techniques
15.1 Model Evaluation Techniques for the Description Task
15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
15.3 Model Evaluation Measures for the Classification Task
15.4 Accuracy and Overall Error Rate
15.5 Sensitivity and Specificity
15.6 False-Positive Rate and False-Negative Rate
15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
15.9 Decision Cost/Benefit Analysis
15.10 Lift Charts and Gains Charts
15.11 Interweaving Model Evaluation with Model Building
15.12 Confluence of Results: Applying a Suite of Models
The R Zone
R References
Exercises
Hands-On Analysis


Chapter 16: Cost-Benefit Analysis Using Data-Driven Costs
16.1 Decision Invariance Under Row Adjustment
16.2 Positive Classification Criterion
16.3 Demonstration Of The Positive Classification Criterion
16.4 Constructing The Cost Matrix
16.5 Decision Invariance Under Scaling
16.6 Direct Costs and Opportunity Costs
16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
16.8 Rebalancing as a Surrogate for Misclassification Costs
The R Zone
R References
Exercises
Chapter 17: Cost-Benefit Analysis for Trinary and k-Nary Classification Models
17.1 Classification Evaluation Measures for a Generic Trinary Target
17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
17.4 Comparing CART Models With and Without Data-Driven Misclassification Costs
17.5 Classification Evaluation Measures for a Generic k-Nary Target
17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
The R Zone
R References
Exercises
Chapter 18: Graphical Evaluation of Classification Models
18.1 Review of Lift Charts and Gains Charts
18.2 Lift Charts and Gains Charts Using Misclassification Costs
18.3 Response Charts
18.4 Profits Charts
18.5 Return on Investment (ROI) Charts
The R Zone
R References
Exercises
Hands-On Exercises
Part IV: Clustering
Chapter 19: Hierarchical and k-Means Clustering
19.1 The Clustering Task
19.2 Hierarchical Clustering Methods
19.3 Single-Linkage Clustering
19.4 Complete-Linkage Clustering
19.5 k-Means Clustering
19.6 Example of k-Means Clustering at Work
19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
19.8 Application of k-Means Clustering Using SAS Enterprise Miner
19.9 Using Cluster Membership to Predict Churn
The R Zone
R References
Exercises
Hands-On Analysis
Chapter 20: Kohonen Networks
20.1 Self-Organizing Maps
20.2 Kohonen Networks
20.3 Example of a Kohonen Network Study
20.4 Cluster Validity
20.5 Application of Clustering Using Kohonen Networks
20.6 Interpreting The Clusters
20.7 Using Cluster Membership as Input to Downstream Data Mining Models
The R Zone
R References
Exercises
Chapter 21: BIRCH Clustering
21.1 Rationale for BIRCH Clustering
21.2 Cluster Features
21.3 Cluster Feature Tree
21.4 Phase 1: Building The CF Tree
21.5 Phase 2: Clustering The Sub-Clusters
21.6 Example of BIRCH Clustering, Phase 1: Building The CF Tree
21.7 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
21.8 Evaluating The Candidate Cluster Solutions
21.9 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
The R Zone
R References
Exercises
Chapter 22: Measuring Cluster Goodness
22.1 Rationale for Measuring Cluster Goodness
22.2 The Silhouette Method
22.3 Silhouette Example
22.4 Silhouette Analysis of the IRIS Data Set
22.5 The Pseudo-F Statistic
22.6 Example of the Pseudo-F Statistic
22.7 Pseudo-F Statistic Applied to the IRIS Data Set
22.8 Cluster Validation
22.9 Cluster Validation Applied to the Loans Data Set
The R Zone
R References
Exercises
Part V: Association Rules
Chapter 23: Association Rules
23.1 Affinity Analysis and Market Basket Analysis
23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
23.3 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
23.4 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
23.5 Extension From Flag Data to General Categorical Data
23.6 Information-Theoretic Approach: Generalized Rule Induction Method
23.7 Association Rules are Easy to do Badly
23.8 How Can We Measure the Usefulness of Association Rules?
23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
23.10 Local Patterns Versus Global Models
The R Zone
R References
Exercises
Part VI: Enhancing Model Performance
Chapter 24: Segmentation Models
24.1 The Segmentation Modeling Process
24.2 Segmentation Modeling Using EDA to Identify the Segments
24.3 Segmentation Modeling using Clustering to Identify the Segments
The R Zone
R References
Exercises
Chapter 25: Ensemble Methods: Bagging and Boosting
25.1 Rationale for Using an Ensemble of Classification Models
25.2 Bias, Variance, and Noise
25.3 When to Apply, and not to apply, Bagging
25.4 Bagging
25.5 Boosting
25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler
References
The R Zone
R Reference
Exercises
Chapter 26: Model Voting and Propensity Averaging
26.1 Simple Model Voting
26.2 Alternative Voting Methods
26.3 Model Voting Process
26.4 An Application of Model Voting
26.5 What is Propensity Averaging?
26.6 Propensity Averaging Process
26.7 An Application of Propensity Averaging
The R Zone
R References
Exercises
Hands-On Analysis
Part VII: Further Topics
Chapter 27: Genetic Algorithms
27.1 Introduction To Genetic Algorithms
27.2 Basic Framework of a Genetic Algorithm
27.3 Simple Example of a Genetic Algorithm at Work
27.4 Modifications and Enhancements: Selection
27.5 Modifications and Enhancements: Crossover
27.6 Genetic Algorithms for Real-Valued Variables
27.7 Using Genetic Algorithms to Train a Neural Network
27.8 WEKA: Hands-On Analysis Using Genetic Algorithms
The R Zone
R References
Chapter 28: Imputation of Missing Data
28.1 Need for Imputation of Missing Data
28.2 Imputation of Missing Data: Continuous Variables
28.3 Standard Error of the Imputation
28.4 Imputation of Missing Data: Categorical Variables
28.5 Handling Patterns in Missingness
Reference
The R Zone
R References
Part VIII: Case Study: Predicting Response to Direct-Mail Marketing
Chapter 29: Case Study, Part 1: Business Understanding, Data Preparation, and EDA
29.1 Cross-Industry Standard Practice for Data Mining
29.2 Business Understanding Phase
29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set
29.4 Data Preparation Phase
29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis
Chapter 30: Case Study, Part 2: Clustering and Principal Components Analysis
30.1 Partitioning the Data
30.2 Developing the Principal Components
30.3 Validating the Principal Components
30.4 Profiling the Principal Components
30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
30.7 Application of k-Means Clustering
30.8 Validating the Clusters
30.9 Profiling the Clusters
Chapter 31: Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability
31.1 Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
31.2 Modeling And Evaluation Overview
31.3 Cost-Benefit Analysis Using Data-Driven Costs
31.4 Variables to be Input To The Models
31.5 Establishing The Baseline Model Performance
31.6 Models That Use Misclassification Costs
31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs
31.8 Combining Models Using Voting and Propensity Averaging
31.9 Interpreting The Most Profitable Model
Chapter 32: Case Study, Part 4: Modeling and Evaluation for High Performance Only
32.1 Variables to be Input to the Models
32.2 Models that use Misclassification Costs
32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs
32.4 Combining Models using Voting and Propensity Averaging
32.5 Lessons Learned
32.6 Conclusions
Appendix A: Data Summarization and Visualization
Part 1: Summarization 1: Building Blocks Of Data Analysis
Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
Part 3: Summarization 2: Measures Of Center, Variability, and Position
Part 4: Summarization And Visualization Of Bivariate Relationships
Index
End User License Agreement



List of Illustrations
Figure 1.1
Figure 1.2
Figure 1.3
Figure 2.1
Figure 2.2
Figure 2.3
Figure 2.4
Figure 2.5
Figure 2.6
Figure 2.7
Figure 2.8
Figure 2.9
Figure 2.10
Figure 2.11
Figure 2.12
Figure 2.13
Figure 2.14
Figure 2.15
Figure 2.16
Figure 2.17
Figure 2.18
Figure 2.19
Figure 2.20
Figure 2.21
Figure 2.22
Figure 3.1
Figure 3.2
Figure 3.3
Figure 3.4
Figure 3.5
Figure 3.6
Figure 3.7
Figure 3.8
Figure 3.9
Figure 3.10
Figure 3.11
Figure 3.12
Figure 3.13
Figure 3.14
Figure 3.15
Figure 3.16
Figure 3.17b
Figure 3.18b
Figure 3.19a
Figure 3.20
Figure 3.21
Figure 3.22
Figure 3.23
Figure 3.24
Figure 3.25
Figure 3.26
Figure 3.27
Figure 3.28
Figure 3.29
Figure 4.1
Figure 4.2
Figure 4.3
Figure 4.4
Figure 4.6
Figure 4.7
Figure 5.1
Figure 5.2
Figure 5.3
Figure 5.4
Figure 6.1
Figure 6.2
Figure 7.1
Figure 7.2
Figure 7.3
Figure 7.4
Figure 8.1
Figure 8.2
Figure 8.3
Figure 8.4
Figure 8.5
Figure 8.6
Figure 8.7
Figure 8.8
Figure 8.9
Figure 8.10
Figure 8.12
Figure 8.11
Figure 8.13
Figure 8.14
Figure 8.15
Figure 8.16
Figure 8.17
Figure 8.18
Figure 8.19
Figure 8.20
Figure 8.21
Figure 8.22
Figure 8.23
Figure 9.1
Figure 9.2
Figure 9.3
Figure 9.4
Figure 9.5
Figure 9.6
Figure 9.7
Figure 9.8
Figure 9.9
Figure 9.10
Figure 9.11
Figure 9.12
Figure 9.13
Figure 9.14
Figure 9.15
Figure 9.16
Figure 9.17
Figure 9.18
Figure 10.1
Figure 10.2
Figure 10.3
Figure 10.4
Figure 10.5
Figure 11.1
Figure 11.2
Figure 11.3
Figure 11.4
Figure 11.5
Figure 11.6
Figure 11.7
Figure 11.8
Figure 11.9
Figure 12.1
Figure 12.2
Figure 12.3
Figure 12.4
Figure 12.5
Figure 12.6
Figure 12.7
Figure 12.8
Figure 12.9
Figure 12.10
Figure 13.1
Figure 13.2
Figure 13.3
Figure 13.4
Figure 13.5
Figure 13.6
Figure 13.7
Figure 14.1
Figure 14.2
Figure 14.3
Figure 14.4
Figure 14.5
Figure 14.6
Figure 15.1
Figure 15.2
Figure 15.3
Figure 15.4
Figure 16.1a–c
Figure 18.1
Figure 18.2
Figure 18.3
Figure 18.4
Figure 18.5
Figure 18.6
Figure 18.7
Figure 18.8
Figure 19.1
Figure 19.2
Figure 19.3
Figure 19.4
Figure 19.5
Figure 19.6
Figure 19.7
Figure 19.8
Figure 19.9
Figure 19.10
Figure 19.11
Figure 20.1
Figure 20.2
Figure 20.3
Figure 20.4
Figure 20.5
Figure 20.6
Figure 20.7
Figure 20.8
Figure 20.9
Figure 20.10
Figure 21.1
Figure 21.2
Figure 21.3
Figure 21.4
Figure 21.5
Figure 21.6
Figure 21.7
Figure 21.8
Figure 21.9
Figure 21.10
Figure 21.11
Figure 21.12
Figure 21.13
Figure 21.14
Figure 22.1
Figure 22.2
Figure 22.3
Figure 22.4
Figure 22.5
Figure 22.6
Figure 22.7
Figure 22.8
Figure 22.9
Figure 22.10
Figure 22.11
Figure 22.12
Figure 23.1
Figure 23.2
Figure 23.3
Figure 23.4
Figure 23.5
Figure 24.1
Figure 24.2
Figure 24.4
Figure 24.3
Figure 24.5
Figure 24.6
Figure 24.7
Figure 24.8
Figure 24.9
Figure 25.1
Figure 25.2
Figure 25.3
Figure 25.4
Figure 25.5
Figure 25.6
Figure 25.7
Figure 25.8
Figure 25.9
Figure 25.10
Figure 26.1
Figure 26.3
Figure 26.2
Figure 27.1
Figure 27.2
Figure 27.3
Figure 27.4
Figure 27.5
Figure 27.6
Figure 27.7
Figure 27.8
Figure 27.9
Figure 27.10
Figure 27.11
Figure 28.1
Figure 28.2
Figure 29.1
Figure 29.2
Figure 29.3
Figure 29.4
Figure 29.5
Figure 29.6
Figure 29.7
Figure 29.8
Figure 29.9
Figure 29.10
Figure 29.11
Figure 29.12
Figure 29.13
Figure 29.14
Figure 29.15
Figure 29.16
Figure 29.17
Figure 29.18
Figure 29.19
Figure 29.20
Figure 29.21
Figure 29.22
Figure 29.23
Figure 29.24
Figure 29.25
Figure 29.26
Figure 29.27
Figure 29.28
Figure 29.29
Figure 30.1
Figure 30.2
Figure 30.3
Figure 30.4
Figure 30.5
Figure 30.6
Figure 30.7
Figure 30.8
Figure 30.11
Figure 30.12
Figure 30.13
Figure 30.14
Figure 31.1
Figure 31.2
Figure 31.3
Figure 31.4
Figure 31.5
Figure 32.1
Figure 32.2
Figure A.1
Figure A.2
Figure A.3
Figure A.4
Figure A.5
Figure A.6
Figure A.7
Figure A.9
Figure A.8

