Big data analytics made easy


Notion Press
Old No. 38, New No. 6
McNichols Road, Chetpet
Chennai - 600 031
First Published by Notion Press 2016
Copyright © Y. Lakshmi Prasad 2016
All Rights Reserved.
ISBN 978-1-946390-72-1
This book has been published with all efforts taken to make the material error-free after the
consent of the authors. However, the authors and the publisher do not assume and hereby
disclaim any liability to any party for any loss, damage, or disruption caused by errors or
omissions, whether such errors or omissions result from negligence, accident, or any other
cause.
No part of this book may be used or reproduced in any manner whatsoever without written
permission from the authors, except in the case of brief quotations embodied in critical
articles and reviews.


This book is dedicated to

A.P.J. Abdul Kalam
(Thinking should become your capital asset, no matter whatever ups and downs you come
across in your life.)


To download the data files used in this book,
please use the link below:
www.praanalytix.com/Bigdata-Analytics-MadeEasy-Datafiles.rar



Contents
Preface
Author’s Note
Acknowledgements
STEP 1 Introduction to Big Data Analytics
STEP 2 Getting started with R
STEP 3 Data Exploration
STEP 4 Data Preparation
STEP 5 Statistical Thinking
STEP 6 Introduction to Machine Learning
STEP 7 Dimensionality Reduction
STEP 8 Clustering
STEP 9 Market Basket Analysis
STEP 10 Kernel Density Estimation
STEP 11 Regression
STEP 12 Logistic Regression
STEP 13 Decision Trees
STEP 14 K-Nearest Neighbor Classification
STEP 15 Bayesian Classifiers
STEP 16 Neural Networks
STEP 17 Support Vector Machines
STEP 18 Ensemble Learning


Preface
This book is an indispensable guide to Machine Learning and R Programming, written in an
instructive and conversational tone. It helps those who want to build a career in Big Data
Analytics/Data Science, as well as entry-level Data Scientists in their day-to-day tasks, with
practical examples, detailed descriptions, issues, resolutions, key techniques and much more.

This book is like your personal trainer: it explains the art of Big Data Analytics/Data Science
with R Programming in 18 steps, covering Statistics, Unsupervised Learning, Supervised
Learning and Ensemble Learning. Many Machine Learning concepts are explained in an easy
way so that you feel confident while using them in programming. If you are already working
as a Data Analyst, you still need this book to sharpen your skills. This book will be an asset
to you and your career by making you a better Data Scientist.


Author’s Note
One interesting thing about Big Data Analytics is that it is a career option for people from
various study backgrounds. I have seen Data Analysts, Business Analysts and Data Scientists
with different qualifications such as M.B.A., Statistics, M.C.A., M.Tech, M.Sc. Mathematics
and many more. It is wonderful to see people with different backgrounds working on the same
project, but how can we expect both Machine Learning and domain knowledge from a person
with a purely technical qualification?
Every person might be strong in their own subject, but a Data Scientist needs to know more
than one subject (Programming, Machine Learning, Mathematics, Business Acumen and
Statistics). This is why I thought it would be beneficial to have a resource that brings
together all these aspects in one volume, so that it would help everybody who wants to make
Big Data Analytics/Data Science their career.
This book was written to assist learners in getting started, while at the same time providing
techniques that I have found to be useful to entry-level Data Analysts and R programmers. This
book is aimed more at the R programmer who is responsible for providing insights on both
structured and unstructured data.
This book assumes that the reader has no prior knowledge of Machine Learning and R
programming. Each one of us has our own style of approach to an issue; it is likely that
others will find alternate solutions for many of the issues discussed in this book. The sample
data that appears in a number of examples throughout this book is imaginary; any
resemblance to real data is purely accidental.
This book is organized in 18 steps, from the introduction to Ensemble Learning, which offer
the different thinking patterns of a Data Scientist’s work environment. The solutions to
some of the questions are not written out fully; only some steps or hints are mentioned. This
is just to help recall the important facts involved in common practice.
Y. Lakshmi Prasad


Acknowledgements
A great deal of information was received from the numerous people who offered their time. I
would like to thank each and every person who helped me in creating this book.
I heartily express my gratitude to all of my peers, ISB colleagues, friends and students,
whose sincere responses helped me meet the exacting demands of expressing the contents. I am
very grateful to our press, editors and designers, whose scrupulous assistance brought this
work to completion and into your hands.
Finally, I am personally indebted to my wonderful partner Prajwala and my kid Prakhyath
for their support, enthusiasm and tolerance, without which this book would never have been
completed.
Y. Lakshmi Prasad


STEP 1
Introduction to Big Data Analytics
1.1 WHAT IS BIG DATA?
Big Data is any voluminous amount of structured, semi-structured and unstructured data
that has the potential to be mined for information, where the individual records stop mattering
and only aggregates matter. Data becomes Big Data when it is difficult to process using
traditional techniques.

1.2 CHARACTERISTICS OF BIG DATA
There are many characteristics of Big Data. Let me discuss a few here.
1. Volume: Big Data implies enormous volumes of data generated by sensors and machines,
combined with the internet explosion, social media, e-commerce, GPS devices etc.
2. Velocity: It refers to the rate at which the data is pouring in; for example, Facebook users
generate 3 million likes per day and around 450 million tweets are created per day by
users.
3. Variety: It refers to the types of formats, which can be classified into 3 types:
Structured: RDBMS like Oracle and MySQL, legacy systems like Excel and Access
Semi-structured: Emails, tweets, log files, user reviews
Unstructured: Photos, video, audio files
4. Veracity: It refers to the biases, noise, and abnormality in data. If we want meaningful
insight from this data, we need to cleanse it first.
5. Validity: It refers to the appropriateness and precision of the data, since the validity of the
data is very important for making decisions.
6. Volatility: It refers to how long the data is valid, since data which is valid right now
might not be valid just a few minutes or a few days later.

1.3 WHY IS BIG DATA IMPORTANT?
The success of an organization lies not just in how good it is at doing its business but
also in how well it can analyze its data and derive insights about the company, its
competitors etc. Big Data can help you take the right decision at the right time.
Why not RDBMS? Scalability is the major problem with RDBMS: it is very difficult to
manage an RDBMS when the requirements or the number of users change. One more problem
with RDBMS is that we need to decide the structure of the database at the start, and making
any changes later might be a huge task. While dealing with Big Data we need flexibility and,
unfortunately, RDBMS cannot provide that.


1.4 ANALYTICS TERMINOLOGY
Analytics is one of the few fields where a lot of different terms are thrown around by everyone.
Many of these terms sound similar to each other, yet they are used in different contexts.
Other terms sound very different from each other, yet they are similar and can be used
interchangeably. Someone who is new to Analytics can be expected to get confused by the
abundance of terminology in this field.
Analytics is the process of breaking a problem into simpler parts and using inferences
based on data to take decisions. Analytics is not a tool or a technology; rather, it is a way of
thinking and acting. Business Analytics refers to the application of Analytics in the sphere of
business. It includes Marketing Analytics, Risk Analytics, Fraud Analytics, CRM Analytics,
Loyalty Analytics, Operations Analytics as well as HR Analytics. Within business, Analytics
is used in all sorts of industries, giving us Finance Analytics, Healthcare Analytics, Retail
Analytics, Telecom Analytics and Web Analytics. Predictive Analytics has gained popularity in
the recent past, in contrast to retrospective approaches such as OLAP and BI. Descriptive
Analytics describes or explores any kind of data; data exploration and data preparation rely
heavily on descriptive analytics. Big Data Analytics is the newer term used for analyzing
unstructured data and very large data, such as terabytes or even petabytes of data. Big Data
is any data set which cannot be analyzed with conventional tools.

1.5 TYPES OF ANALYTICS
Analytics can be applied to so many problems and in so many different industries that it
becomes important to take some time to understand the scope of analytics in business.
Let us classify the different types of analytics. We are going to look closer at 3 broad
classifications of analytics: 1. Based on the industry. 2. Based on the business function. 3.
Based on the kind of insights offered.
Let’s start by looking at industries where analytics usage is very prevalent. There are
certain industries which have always created a huge amount of data, like credit cards and
consumer goods. These industries were among the first ones to adopt analytics. Analytics is
often classified on the basis of the industry it is being applied to; hence you will hear terms
such as insurance analytics, retail analytics, web analytics and so on. We can even classify
analytics on the basis of the business function it is used in. Classification of analytics on the
basis of business function and impact goes as follows:
Marketing analytics
Sales and HR analytics
Supply chain analytics and so on
This can be a fairly long list, as analytics has the potential to impact virtually any
business activity within a large organization. But the most popular way of classifying
analytics is on the basis of what it allows us to do. In descriptive analytics, the information
has already been collected from different industries and different departments; all we need to
do is slice and dice the data in diverse ways, maybe looking at it from different angles or
along different dimensions.
As you can see, descriptive analysis is possibly the simplest type of analytics to perform,
simply because it uses existing information from the past to understand decisions in the
present and, hopefully, helps decide an effective course of action for the future. However,
because of its relative ease of understanding and application, descriptive analytics has
often been considered the subdued twin of analytics. But it is also extremely powerful in its
potential, and in most business situations descriptive analytics can help address most
problems.
Retailers are very interested in understanding the relationship between products. They want
to know whether a person who buys product A is also likely to buy product B or product C. This
is called product affinity analysis or association analysis, and it is commonly used in the retail
industry. It is also called market basket analysis, a term that refers to a set of techniques that
can be applied to analyze a shopping basket or a transaction. Have you ever wondered why milk
is placed right at the back of the store while magazines and chewing gum are right by the
check-out? That is because, through analytics, retailers realized that while traveling all the way
to the back of the store to pick up your essentials, you just may be tempted to pick up
something else, and also because magazines and chewing gum are cheap impulse buys. You
decide to throw them in your cart since they are not too expensive and you have probably
been eyeing them as you waited in line at the counter.
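
To make product affinity a little more concrete, here is a minimal sketch in base R. The five baskets and the product names are made up purely for illustration, and support, confidence and lift are the usual association-rule measures rather than anything defined so far in this step, so please treat the names and numbers as assumptions.

```r
# Five hypothetical shopping baskets (illustrative data only)
baskets <- list(
  c("milk", "bread", "gum"),
  c("milk", "bread"),
  c("bread", "magazine"),
  c("milk", "magazine", "gum"),
  c("milk", "bread", "magazine")
)

# TRUE for each basket that contains every item in `items`
contains <- function(items) {
  sapply(baskets, function(b) all(items %in% b))
}

support_milk  <- mean(contains("milk"))              # share of baskets with milk
support_bread <- mean(contains("bread"))             # share of baskets with bread
support_both  <- mean(contains(c("milk", "bread")))  # share with both products

# Confidence of the rule milk -> bread: of the baskets with milk,
# what fraction also contain bread?
confidence <- support_both / support_milk

# Lift compares that confidence with how common bread is overall
lift <- confidence / support_bread

print(c(support = support_both, confidence = confidence, lift = lift))
```

In this toy data the lift comes out slightly below 1, which would mean that buying milk makes bread a little less likely than average; with real transactions, a lift well above 1 is what a retailer would look for.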
Predictive Analytics works by identifying patterns in historical data and then using
statistics to make inferences about the future. At a very simplistic level, we try to fit the data
into a certain pattern, and if we believe the data is following that pattern then we can
predict what will happen in the future. Let’s look at an example involving
predictive analytics in the telecom industry. A large telecom company has access to all kinds
of information about its customers’ calling habits:
How much time do they spend on the phone? How many international calls do they make?
Do they prefer SMS, or calling numbers outside their city?
This is information one can obtain purely by observation, or descriptive analytics. But such
companies would, more importantly, like to know which customers plan to leave and
take a new connection with their competitors. This uses historical information but relies on
predictive modeling and analysis to obtain results. This is predictive analysis.
While descriptive analytics is a very powerful tool, it still gives us information only about
the past whereas, in reality, most users’ primary concern will always be the future. A hotel
owner would want to predict how many of his rooms will be occupied next week. The CEO of a
pharma company will want to know which of the drugs under test is most likely to succeed.
This is where predictive analytics is a lot more useful.
In addition to these tools, there is a third type of analytics, which came into existence very
recently, maybe just a decade ago. This is called prescriptive analytics. Prescriptive analytics
goes beyond predictive analytics by telling you not only what is going on but also what might
happen and, most importantly, what to do about it. It could also inform you about the impact
of these decisions, which is what makes prescriptive analytics so cutting edge. Business
domains that are great examples of where prescriptive analytics can be used are the aviation
industry and nationwide road networks. Prescriptive analytics can predict and effectively
correct road bottlenecks, or identify roads where tolls can be implemented to streamline
traffic. To see how prescriptive analytics functions in the aviation industry, let’s look at the
following example.
Airlines are always looking for ways to optimize their routes for maximum efficiency. This
can mean billions of dollars in savings, but it is not that easy to do. With over 50 million
commercial flights in the world every year, that is more than one flight every second. Just a
simple flight route between two cities, let’s say San Francisco and Boston, has a possibility of
2000 route options. So the aviation industry often relies on prescriptive analytics to decide
what, where and how they should fly their airplanes to keep costs down and profits up.
So, we have taken a fairly in-depth look at descriptive, predictive and prescriptive analytics.
The focus of this course is going to be descriptive analytics. Towards the end, we will also
spend some time on understanding some of the more popular predictive modeling techniques.

1.6 ANALYTICS LIFE CYCLE
The Analytics life cycle has different stages, and many people describe it in many ways, but the
overall idea remains the same. Here, let us consider the following stages of an Analytics project
life cycle.
Problem Identification
Hypothesis Formulation
Data Collection
Data Exploration
Data Preparation/Manipulation
Model Planning/Building
Validate Model
Evaluate/Monitor Results
1. Problem Identification: A problem is a situation that is judged as something that needs to
be corrected. It is our job to make sure we are solving the right problem; it may not be the
one presented to us by the client. What do we really need to solve?
Sometimes the problem statements that we get from the business are very straightforward.
For example:
How do I identify the most valuable customers?
How do I ensure that I minimize losses from the product not being available on the shelf?
How do I optimize my inventory?
How do I detect customers that are likely to default on a bill payment?
These are straightforward problem statements, and there is really no confusion around what
it is that we are trying to achieve with an analytical project. However, a business
statement does not always lead to clear problem identification. Sometimes, the business
statements are very high level, and therefore you will need to spend time with the business to
understand the needs and obtain the context. You may need to break down that issue into
sub-issues to identify critical requirements. You may need to think about the constraints that
need to be included in the solution.
Let us take an example. Suppose that you work for a credit card company, and the
business tells you that this is the problem statement they want you to look at:
“We want to receive credit card applications only from good customers.” Now, from a
business perspective, is this a valid business statement? Certainly, at a very high level, this is
a valid business requirement. However, for your purpose, which is to build a solution to
address this question, is this a valid statement or a sufficient starting point for the
data analysis? No, because there are multiple problems with a business statement like this.
Let us look at the problems with that statement: “We want to receive credit card
applications only from good customers.” One of the most obvious problems with it is: who are
good customers?
If you have any knowledge of the credit card industry, one of the answers for a good
customer could be people who don’t default on payments. That is, you spend on the credit
card and you pay the credit card company back on time. However, another definition of a
good customer could be people who don’t pay on time. Why is that? Because if you don’t
pay on time, the credit card company has the opportunity to charge you high rates of
interest on the balance on your credit card. These kinds of customers are called revolvers.
Who really is the good customer for a credit card company? Is it the customers who pay
on time? Or is it the customers who don’t pay on time? An answer could be that
both are good customers. How is that possible? It really depends on your perspective.
If you are interested in minimizing risk, say because you work in the risk function of the
credit card company, your definition of a good customer is customers who pay on time,
customers who don’t default. Whereas, if you were looking at revenue, then your perspective
on a good customer could be people who spend a lot on the credit card and don’t pay it all
back. They have a high revolving balance. Now, as an analyst, who decides who the good
customers are? When the credit card company gives you a business statement that says “we
want to accept credit card applications from only good customers”, do you know whether they
are looking at risk or revenue? It really depends on the business interest; it depends on the
business goals for that year. In fact, a good customer this year may be a bad customer next
year. This is why it is important to obtain the context of the problem statement before
starting on an analysis. But this is not the only problem with this problem statement.
Another problem concerns the decision itself: can you really insist on receiving
good applications, or can you only insist on approving good applications? Is the decision at the
application stage or the approval stage? Can you really control applications to be good, or
can you only control the decisions so that only good customers come on board? Yet another
problem with this problem statement is the word “only”: we only want to receive credit card
applications from good customers. Is it realistic for you to assume that you will have a
solution that will never accept a bad customer? Again, that is not a realistic outcome. Coming
back to the problem definition stage: given a business problem such as “I want to get good
customers as a credit card company”, how do you frame that problem into something that an
analytical approach can tackle?
One way is to add specifics to the problem statement. So, think about specific, measurable,
attainable, realistic, and timely outcomes that you can attach to that problem statement. That
is why we emphasize that you need to understand the business context thoroughly and talk
to the business to confirm that you are tackling the right problem. How would I be able to add
specifics to this problem statement? Let us assume that I am looking at it from the risk
perspective, because this year my credit card company is focused on reducing portfolio risk.
I could then have a variety of business problem statements. For example: reduce losses from
credit card default by at least 30 percent in the first 12 months post implementation of the
new strategy.
Another could be to develop an algorithm to screen out applications that do not meet the
defined good-customer criteria, reducing defaults by 20 percent in the next 3 months. A third
could be to identify strategies to reduce defaults by 20 percent in the next three months by
allowing at-risk customers additional payment options. We have decided that the good problem
definition is something that we are tackling from a risk perspective. But, for the same
business statement, we now have three different problem statements that are tackling three
different things. Again, which of these should I choose as a starting point for my analysis?
Should I identify strategies for my existing customers, or should I look at identifying
potential new customers? Again, this is something that may be driven by business needs. So, it
is important to constantly talk to the business to make sure that when you are starting an
analytics project you are tackling the right problem statement.
Getting to a clearly defined problem is often discovery driven: start with a conceptual
definition, and through analysis (root cause, impact analysis, etc.) shape and redefine
the problem in terms of issues. A problem becomes known when a person observes a
discrepancy between the way things are and the way things ought to be. Problems can be
identified through:
Comparative/benchmarking studies
Performance reporting: assessment of current performance against goals and
objectives
SWOT analysis: assessment of strengths, weaknesses, opportunities, and threats
Complaints/surveys
Sometimes the thing we think is a problem is not the real problem, so to get at the real
problem, probing is necessary. Root Cause Analysis is an effective method of probing: it
helps identify what, how, and why something happened.
Let us consider a case where the employee turnover rate in our organization is increasing
and we need to find out why. Five Whys refers to the practice of asking, five times, why the
problem exists in order to get to the root cause of the problem:
Why are employees leaving for other jobs?
Why are employees not satisfied?
Why do employees feel that they are underpaid?
Why are other employers paying higher salaries?
Why has demand for such employees increased in the market?

Basic Questions to Ask in Defining the Problem:
Who is causing the problem?
Who is impacted by this problem?
What will happen if this problem is not solved? What are the impacts?
Where and when does this problem occur?
Why is this problem occurring?
How should the process work?
How are people currently handling the problem?
2. Formulating the hypothesis: Break down problems and formulate hypotheses. Frame the
questions which need to be answered, or the topics which need to be explored, in order to solve
a problem.
Develop a comprehensive list of all possible issues related to the problem
Reduce the comprehensive list by eliminating duplicates and combining overlapping
issues
Using consensus building, get down to a list of major issues
3. Data Collection: In order to answer the key questions and validate the hypotheses,
collection of realistic information is necessary. Depending on the type of problem being
solved, different data collection techniques may be used. Data collection is a critical stage in
problem solving: if it is superficial, biased or incomplete, data analysis will be difficult.
Data Collection Techniques:
Using data that has already been collected by others
Systematically selecting and watching characteristics of people, objects or events
Oral questioning of respondents, either individually or as a group
Collecting data based on answers provided by respondents in written form
Facilitating free discussions on specific topics with a selected group of participants
4. Data Exploration: Before a formal data analysis can be conducted, the analyst must know
how many cases are in the dataset, what variables are included, how many missing
observations there are and what general hypotheses the data is likely to support. An initial
exploration of the dataset helps answer these questions by familiarizing analysts with the
data with which they are working.
Analysts commonly use visualization for data exploration because it allows users to quickly
and simply view most of the relevant features of their dataset. By doing this, users can
identify variables that are likely to have interesting observations. By displaying data
graphically through scatter plots or bar charts, users can see if two or more variables
correlate and determine if they are good candidates for further in-depth analysis.
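
As a rough illustration of these exploration questions, the sketch below uses airquality, a small data set that ships with R. The choice of data set and of the two variables is my assumption and is not taken from the book's data files.

```r
# airquality ships with R (datasets package); used here only as an example
data(airquality)

dim(airquality)             # number of cases (rows) and variables (columns)
str(airquality)             # variable names and types
colSums(is.na(airquality))  # missing observations per variable
summary(airquality$Temp)    # quick numeric summary of one variable

# Scatter plot to check visually whether two variables move together
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temperature", ylab = "Ozone",
     main = "Ozone vs. temperature")

# Correlation on complete pairs only, because Ozone has missing values
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
print(r)
```

The use = "complete.obs" argument matters here: without it, a single missing value in either variable would make the correlation NA.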
5. Data Preparation: Data rarely comes to you in a form that is easy to analyze. We need to
clean the data and check it for consistency; extensive manipulation of the data is often needed
before it can be analyzed.
Data Preparation steps may include:
Importing the data
Variable identification/creating new variables
Checking and summarizing the data
Selecting subsets of the data
Selecting and managing variables
Combining data
Splitting data into many datasets
Missing values treatment
Outlier treatment
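
A few of the steps in this list can be sketched in base R as follows. The sales data frame is invented for illustration, and median imputation and the 1.5 x IQR rule are just two common choices that I am assuming here; the later steps of the book may treat missing values and outliers differently.

```r
# Small made-up data frame (values are illustrative only)
sales <- data.frame(
  region = c("North", "South", "East", "West", "North", "South"),
  amount = c(125, 135, NA, 128, 950, 131)
)

summary(sales)                      # checking and summarizing the data

# Missing value treatment: replace NA with the median of the observed values
med <- median(sales$amount, na.rm = TRUE)
sales$amount[is.na(sales$amount)] <- med

# Outlier treatment: flag values more than 1.5 * IQR outside the quartiles
q     <- unname(quantile(sales$amount, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
sales$outlier <- sales$amount < q[1] - fence | sales$amount > q[2] + fence

# Creating a new variable and selecting a subset of the data
sales$amount_k <- sales$amount / 1000
clean <- subset(sales, !outlier)

print(sales)
print(clean)
```

Here only the value 950 is flagged, so clean keeps the other five rows.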
Variable Identification: First, identify the Predictor (input) and Target (output) variables.
Then, identify the data type and category of the variables.
Univariate Analysis: At this stage, we explore variables one by one. The method used to perform
univariate analysis will depend on whether the variable type is categorical or continuous.
Let’s look at these methods and statistical measures for categorical and continuous
variables individually.
Continuous Variables: In the case of continuous variables, we need to understand the
central tendency and spread of the variable. These are measured using various statistical
metrics and visualization methods.
Categorical Variables: For categorical variables, we use a frequency table to understand
the distribution of each category. We can also read it as a percentage of values under each
category. It can be measured using two metrics, Count and Percent, against each category.
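
The following sketch shows one way to do this univariate analysis in R. It uses mtcars, a data set that ships with R; treating mpg as the continuous variable and cyl as the categorical one is my assumption for the sake of the example.

```r
# mtcars ships with R; mpg is continuous, cyl is treated as categorical here
data(mtcars)

# Continuous variable: central tendency and spread
mean(mtcars$mpg)
median(mtcars$mpg)
sd(mtcars$mpg)
range(mtcars$mpg)
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Categorical variable: Count and Percent against each category
cyl_count   <- table(mtcars$cyl)
cyl_percent <- round(100 * prop.table(cyl_count), 1)
print(cyl_count)
print(cyl_percent)
barplot(cyl_count, main = "Cars by number of cylinders")
```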
6. Model Building: This is really the entire process of building the solution and implementing
the solution. The majority of the project time is spent in the solution implementation stage.
One interesting thing to remember is that building analytical models is a very iterative
process, because there is no such thing as a final solution or a perfect solution. Typically,
you will spend time building multiple models and multiple solutions before arriving at the
best solution that the business will work with.
7. There are many ways of taking decisions from a business perspective. Analytics is one way.
There are other ways of taking a decision: it could be experience-based decision making, or it
could be gut-based decision making. You will not choose an analytical approach every
single time. However, in the long run, it makes sense to build analytical capability
because that leads to more objective decision making. But fundamentally, if you want data
to drive decision making, you need to make sure that you have invested in collecting the
right data to enable it.
8. Model Evaluation/Monitoring: This is an ongoing process essentially aimed at looking at
the effectiveness of the solution over time. Remember that an analytical problem-solving
approach is different from the standard problem-solving approach. We need to
remember these points:
There is a clear reliance on data to drive solution identification.
We are using analytical techniques based on numeric theories.
You need to have a good understanding of how theoretical concepts apply to business
situations in order to build a feasible solution.
What that means is that you need a good understanding of the business situation and the
business context, as well as a strong knowledge of analytical approaches, and you need to be
able to merge the two and come up with a workable solution. In some industries, the rate of
change is very high, so solutions age very fast. In other industries, the rate of change may
not be as high, and when you build a solution you may have 2-3 years where your solution
works well, but after that it will need to be tweaked to handle the new business conditions.
The way to assess whether or not your solution is working is to periodically check solution
effectiveness.
You need to track dependability over time, and you may need to make minor changes to
bring the solution back on track. Sometimes, you may have to build an entire solution from
scratch because the environment has changed so dramatically that the solution you built
does not hold any more in the current business context.

1.7 COMMON MISTAKES IN ANALYTICAL THINKING
The client’s definition of the problem may not be correct. The client may lack the knowledge
and experience that you have. Since most problems are not unique, we may be able to
corroborate the problem and possible solutions against other sources. The best solutions to a
problem are often too difficult for the client to implement, so be cautious about recommending
the optimal solution to a problem. Most solutions require some degree of compromise to
implement.


STEP 2
Getting started with R
2.1 INTRODUCTION
R is a programming language for statistical analysis and reporting. R is a simple
programming language which includes many functions for Data Analytics, and it has an
effective data handling and storage facility. R also provides graphical facilities for data
analysis and reporting. Please install R and RStudio, both of which are freely downloadable.
In this book I use code written in RStudio. To work in RStudio you also need R itself at
the back end, so please go to the CRAN site and install the latest version of R for your
operating system. So let’s start (rock and roll) with RStudio:
When you first open RStudio, you will see four windows.
1. Scripts: Serves as an area to write and save R code
2. Workspace: Lists the datasets and variables in the R environment
3. Plots: Displays the plots generated by the R code
4. Console: Provides a history of the executed R code and the output

2.2 ELEMENTARY OPERATIONS IN R
1. Expressions:
If you are working with only numbers, R can be used as an advanced calculator. Just type
4 + 5
and press Enter; you will get the value 9.
R can perform mathematical calculations without requiring you to store the result in an
object.
The result is printed on the console.
Try calculating the product of 2 or more numbers (* is the multiplication operator).
6 * 9 # you will get 54
Anything written after the # sign is treated as a comment in R.
R follows the BODMAS rules when performing mathematical operations.
Type the following commands and understand the difference.
20 - 15 * 2 # you will get -10
(20 - 15) * 2 # here you will get 10
Be careful: dividing a positive number by 0 does not stop with an error; it gives you Inf
(infinity). Type this command in the console and check:
8 / 0
These mathematical operations can be combined into long formulae to achieve specific
tasks.
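
A few more expressions are worth trying at the console. This is only a small sketch of standard R arithmetic; the particular numbers are arbitrary.

```r
8 / 0             # Inf
-8 / 0            # -Inf
0 / 0             # NaN (not a number)
is.finite(8 / 0)  # FALSE
2 + 3 * 4^2       # 50: exponent first, then multiplication, then addition
(2 + 3) * 4^2     # 80: brackets are evaluated first
7 %/% 2           # 3, integer division
7 %% 2            # 1, remainder
```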
2. Logical Values:
Some expressions return a "logical value": TRUE or FALSE (also known as "Boolean" values).
Look at this expression, which gives us a logical value:
6 < 9 # TRUE

3. Variables:
We can store a value in a variable to access it later.
X <- 48 # stores the value 48 in X
Y <- "YL, Prasad" # don't forget the quotes
Now X and Y are objects created in R, and they can be used in expressions in place of the
original values.
Try to call X and Y just by typing the object name:
Y # [1] "YL, Prasad"
We need to remember that R is case sensitive. If you assign a value to capital X and then call
small x, R will show you an error.
Try dividing X by 2 (/ is the division operator):
X / 2 # you will get 24 as the answer
We can re-assign any value to a variable at any time. Assign "Lakshmi" to Y.
Y <- "Lakshmi"
We can print the value of a variable just by typing its name in the console. Try printing the
current value of Y.
If you wrote this code, congratulations! You wrote your first code in R and created an object.
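
The case-sensitivity point can be checked without crashing your script. In this sketch, Marks and marks are hypothetical variable names chosen so that they do not clash with anything already in your workspace, and tryCatch() is used only to capture the error message instead of stopping.

```r
Marks <- 48

exists("Marks")   # TRUE
exists("marks")   # FALSE, as long as no lowercase marks has been created

# Using an undefined name raises an error; tryCatch lets us capture the message
msg <- tryCatch(marks / 2, error = function(e) conditionMessage(e))
print(msg)

Marks / 2         # 24
class(Marks)      # "numeric"
ls()              # names of the objects currently in the workspace
```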
4. Functions:
We can call a function by typing its name, followed by the arguments to that function in
parentheses.
Try the sum function to add up a few numbers. Enter:
sum(1, 3, 5) # 9
We use the sqrt function to get the square root of 16.
sqrt(16) # 4
16^.5 # also gives the same answer, 4
The square root transformation, along with the log transformation, is among the most widely
used transformations in data preparation.
Type the following commands and check the answers:
log(1) # 0
log(10) # 2.302585 (log is the natural logarithm)
log10(100) # this will return 2, since the log of 100 to the base 10 is 2
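
As a small follow-up sketch, here are a few more calls built on the same functions. They are all part of base R; the inputs are arbitrary.

```r
log(100, base = 10)   # 2, same as log10(100)
log2(8)               # 3
exp(1)                # 2.718282, Euler's number e
log(exp(3))           # 3, log() and exp() undo each other
round(sqrt(2), 3)     # 1.414
sum(c(1, 3, 5))       # 9, sum() also accepts a vector
```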
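Any time you want to access the help window, you can type either of the following commands:
help(exp)
?exp
If you want some examples for a function, give this command:
example(log)
R allows one to save the workspace environment (the objects you have created, such as
variables and functions) into an .RData file using the save.image() function. An existing
.RData file can be loaded using the load() function. Loaded libraries are not stored in the
file and have to be attached again with library().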
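
Here is a minimal sketch of saving and restoring objects with save() and load(), the base R functions for writing and reading .RData files. The object names and the temporary file path are assumptions made for the example; in practice you would pick your own file name.

```r
score <- 48
label <- "Lakshmi"

# tempfile() gives a throwaway path; in practice you would choose your own
path <- tempfile(fileext = ".RData")
save(score, label, file = path)   # save selected objects
# save.image(file = path) would instead save every object in the workspace

rm(score, label)                  # remove them from the workspace
load(path)                        # restores score and label under their names
print(score)
print(label)
```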
5. Files
R commands can be written and stored in plain text files (with the ".R" extension) for
executing later.
Assuming that we have stored some sample scripts, we can list the files in the current
directory from within R by calling the list.files function.
list.files()
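
To see a stored script being executed, the sketch below writes a tiny script and runs it with source(). The file name hello.R and its contents are hypothetical, and the temporary folder is used only so that the example does not touch your own files.

```r
# Write a two-line script into a temporary folder (hypothetical file name)
script <- file.path(tempdir(), "hello.R")
writeLines(c("greeting <- 'Hello from a script'",
             "print(greeting)"),
           script)

list.files(tempdir(), pattern = "\\.R$")  # the script shows up in the listing
source(script)                            # runs every command in the file
```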

2.3 SETTING UP A WORKING DIRECTORY
Before getting deeper into R, it is always better to set up a working directory to store all our
files, scalars, vectors, data frames etc. First, we want to know which directory R is using by
default. To find out, type the command:
getwd() # [1] "C:/Users/admin/Documents"
Now I want to set the folder Rdata, which is located on the D drive, as my working directory.
To do this I will give the command:
setwd("D:/Rdata")
Press Enter (or click the Submit icon) to make sure that your command has been executed and
the working directory has been set. We set the folder Rdata on the D drive as the working
directory. This does not mean that we created anything new here; we just assigned an existing
folder as the working directory. This is where all the files will be saved.
To check whether the working directory has been set up correctly, give the command:

getwd()
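
Since setwd() only points R at a folder that already exists, a script that should run on any machine can create the folder first. This is a sketch under that assumption; the folder name Rdata_demo and the file demo.csv are hypothetical.

```r
old <- getwd()                                  # remember where we started
new_dir <- file.path(tempdir(), "Rdata_demo")   # hypothetical folder
dir.create(new_dir, showWarnings = FALSE)       # create it if it is missing

setwd(new_dir)                        # would fail if the folder did not exist
getwd()

write.csv(data.frame(x = 1:3), "demo.csv", row.names = FALSE)
list.files()                          # demo.csv now lives in new_dir

setwd(old)                            # go back to the original directory
```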

2.4 DATA STRUCTURES IN R
A data structure is an interface to data organized in computer memory. R provides several
kinds of data structures, each designed to optimize some aspect of storage, access, or
processing.
Examples of data structures: 1. Vector 2. Matrix 3. Factor 4. Data Frame


1. Vectors
Vectors are a basic building block for data in R. R variables are actually vectors. A vector can
only consist of values of the same class. Tests for vectors can be conducted using the
is.vector() function.
The name may sound frightening, but a vector is simply a list of values. A vector's values
can be numbers, strings, logical values, or any other type, as long as they are all the same
type.
Types of vectors: Integer, Numeric, Logical, Character, Complex.
R provides functionality that enables the easy creation and manipulation of vectors. The
following R code illustrates how a vector can be created using the combine function, c(),
or the colon operator, :.
Let us create a vector of numbers:
c(4, 7, 9)
The c function (c is short for Combine) creates a new vector by combining a list of values.
Create a vector with strings:
c('a', 'b', 'c')
Sequence Vectors
We can create a vector with start:end notation to make sequences. Let us build a vector from
the sequence of integers from 5 to 9.
5:9 # Creates a vector with values from 5 through 9
We can even call the seq function. Let's try to do the same thing with seq:
seq(5, 9)
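To make the "same type" rule concrete, here is a short sketch that checks the class of a few vectors and shows what happens when types are mixed. The variable names are illustrative only.

```r
# Each vector has a single class
nums  <- c(4, 7, 9)          # numeric
ints  <- 5:9                 # integer (the colon operator yields integers)
words <- c("a", "b", "c")    # character
flags <- c(TRUE, FALSE)      # logical

print(class(nums))
print(class(ints))
print(class(words))
print(class(flags))
print(is.vector(nums))

# Mixing types does not fail: R converts every value to one common type
mixed <- c(1, "a", TRUE)
print(class(mixed))          # "character"
print(mixed)

# seq() can also step by a value other than 1
steps <- seq(5, 9, by = 2)
print(steps)
```

Because of this silent conversion, it is worth checking class() after combining values from different sources.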
Vector Access
After creating a vector with some strings in it and storing it, we can retrieve an individual
value within the vector by providing its numeric index in square brackets.
sentence <- c('Learn', 'Data', 'Analytics')
sentence[3] # [1] "Analytics"
We can assign new values within an existing vector. Try changing the third word to
“Science”:
sentence[3] <- "Science"
If you add new values to the vector, the vector will grow to accommodate them. Let's add a
fourth word:
sentence[4] <- 'By YL, Prasad'
We can use a vector within the square brackets to access multiple values.
Try getting the first and fourth words:
sentence[c(1, 4)]
This means you can retrieve ranges of values. Get the second through fourth words:
sentence[2:4]
We can set ranges of values by providing the values in a vector.
sentence[5:7] <- c('at', 'PRA', 'Analytix') # to add words 5 through 7
Try accessing the seventh word of the sentence vector:
sentence[7]
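The sketch below looks a little closer at how a vector grows. It assumes a fresh copy of the sentence vector, and the added words are hypothetical.

```r
sentence <- c("Learn", "Data", "Analytics")
print(length(sentence))          # 3

# Assigning one position past the end grows the vector
sentence[4] <- "Today"           # hypothetical word, for illustration
print(length(sentence))          # 4

# Skipping positions leaves NA in the gap
sentence[7] <- "End"
print(sentence)                  # positions 5 and 6 are NA
print(length(sentence))          # 7

# A negative index drops elements instead of selecting them
without_first <- sentence[-1]
print(length(without_first))     # 6
```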
Vector Names
Let us create a 3-item vector and store it in the ranks variable. We can assign names to a
vector's elements by passing a second vector filled with names to the names assignment
function, like this:
ranks <- 1:3
names(ranks) <- c("first", "second", "third")
ranks

We can use the names to access the vector's values.
ranks["first"]
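A few further uses of names are sketched here. The scores vector and its labels are hypothetical and serve only as an illustration.

```r
ranks <- 1:3
names(ranks) <- c("first", "second", "third")

# Look up several values by name at once
top_two <- ranks[c("first", "second")]
print(top_two)

# Change a value through its name
ranks["third"] <- 10L
print(ranks)

# Names can also be supplied when the vector is created
scores <- c(maths = 90, science = 85)   # hypothetical names and values
print(names(scores))
```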
2. Matrices
A matrix in R is a collection of homogeneous elements arranged in 2 dimensions.
A matrix is a vector with a dim attribute, i.e. an integer vector giving the number of rows
and columns.
The functions dim(), nrow() and ncol() provide the attributes of the matrix.
Rows and columns can have names: dimnames(), rownames(), colnames().
Let us look at the basics of working with matrices: creating them, accessing them and
plotting them.
Let us create a matrix 3 rows high by 4 columns wide, with all its fields set to 0.
Sample <- matrix(0, 3, 4)
Matrix Construction
We can construct a matrix directly with data elements. By default, the matrix content is
filled column by column.
Look at the following code; the content of Sample is filled with the columns consecutively.
Sample <- matrix(1:20, nrow = 4, ncol = 5)
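To see the column-wise fill order, the sketch below builds a small matrix twice: once with the default and once with the byrow argument of matrix(). The names by_col and by_row are illustrative.

```r
# Default: values fill the first column, then the second, and so on
by_col <- matrix(1:6, nrow = 2, ncol = 3)
print(by_col)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE fills the first row, then the second
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# A matrix is a vector with a dim attribute
print(dim(by_col))
```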
Matrix Access
To obtain values from matrices, you just have to provide two indices instead of one.
Let's print our Sample matrix:
print(Sample)
Try getting the value from the second row in the third column of Sample:
Sample[2, 3]
We can get an entire row of the matrix by omitting the column index (but keep the comma).
Try retrieving the third row:
Sample[3, ] # [1]  3  7 11 15 19
To get an entire column, omit the row index. Retrieve the fourth column:
Sample[, 4] # [1] 13 14 15 16
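The attribute and naming functions listed earlier can be tried on the same matrix. In this sketch the row and column labels are hypothetical.

```r
Sample <- matrix(1:20, nrow = 4, ncol = 5)

print(dim(Sample))    # 4 5
print(nrow(Sample))   # 4
print(ncol(Sample))   # 5

# Hypothetical row and column labels, for illustration
rownames(Sample) <- c("r1", "r2", "r3", "r4")
colnames(Sample) <- c("a", "b", "c", "d", "e")

# Names can be used in place of numeric indices
print(Sample["r2", "c"])     # same element as Sample[2, 3]

# Selecting several rows and columns returns a smaller matrix
block <- Sample[1:2, c("a", "b")]
print(block)
```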

3. Factors
When we want the data to be grouped by category, R has a special type called a factor to track
these categorized values. A factor is a vector whose elements can take on one of a specific set
of values. For example, “Gender” will usually take on only the values “Male”, “Female” and
“NA”. The set of values that the elements of a factor can take is called its levels.
Creating Factors
To categorize the values, simply pass the vector to the factor function:
gender <- c('male', 'female', 'male', 'NA', 'female')
types <- factor(gender)
print(gender)
You see the raw list of strings, repeated values and all. Now print the types factor:
print(types)
Let's take a look at the underlying integers. Pass the factor to the as.integer function:
as.integer(types) # [1] 2 1 2 3 1
You can get only the factor levels with the levels function:
levels(types) # [1] "female" "male" "NA"
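One detail is worth a short sketch: in the vector above, 'NA' is a quoted string, so it becomes an ordinary level. The example below contrasts this with a real missing value (NA without quotes); the names gender_missing and types_missing are illustrative.

```r
gender <- c("male", "female", "male", "NA", "female")
types <- factor(gender)

# Count how many values fall into each level
counts <- table(types)
print(counts)

# The string "NA" is an ordinary level here, not a missing value
print(nlevels(types))              # 3

# With a real missing value, factor() leaves it out of the levels
gender_missing <- c("male", "female", "male", NA, "female")
types_missing <- factor(gender_missing)
print(levels(types_missing))       # "female" "male"
print(sum(is.na(types_missing)))   # 1
```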
4.DataFrames
data frames provide a structure for storing and accessing several variables of possibly
differentdatatypes.Becauseoftheirflexibilitytohandlemanydatatypes,dataframesarethe
preferredinputformatformanyofthemodelingfunctionsavailableinR.
LetuscreatethreeindividualobjectsnamedId,Gender,andAgeandtiethemtogetherinto
adataset.
Id<-c(101,102,103,104,105)
Gender<-c(‘male’,‘female’,‘male’,‘NA’,‘female’)
Age<-c(38,29,NA,46,53)
TheId,Gender,andAgearethreeindividualobjects,Rhasastructureknownasthedata
framewhichcantieallthesevariablestogetherinasingletableoranExcelspreadsheet.Ithas


aspecificnumberofcolumns,eachofwhichisexpectedtocontainvaluesofaparticulartype.

Italsohasanindeterminatenumberofrows-setsofrelatedvaluesforeachcolumn.
It’seasytocreateadatasetjustcallthedata.framefunction,andpassId,Gender,andAgeas
thearguments.AssigntheresulttotheTestdataset:
Test<-data.frame(Id,Gender,Age)
PrintTesttoseeitscontents:
print(Test)
fix(Test)#Toviewthisobjectdataset
Data Frame Access: It is easy to access individual portions of a data frame. We can get
individual columns by providing their index number in double brackets. Try getting the
second column (Gender) of Test:
Test[[2]]
You could provide a column name as a string in double brackets for more readability:
Test[["Age"]]
We can even use a shorthand notation: the data frame name, a dollar sign, and the column
name without quotes.
Test$Gender
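Beyond single columns, rows and cells can be selected in much the same way as with a matrix. The following sketch rebuilds the Test data set and tries a few such selections; the name older is illustrative.

```r
Id <- c(101, 102, 103, 104, 105)
Gender <- c("male", "female", "male", "NA", "female")
Age <- c(38, 29, NA, 46, 53)
Test <- data.frame(Id, Gender, Age)

str(Test)            # column names and types
print(dim(Test))     # 5 rows, 3 columns

# Row and column together: row 2, column "Age"
print(Test[2, "Age"])

# Keep only the rows that satisfy a condition (rows with a missing Age are dropped)
older <- Test[!is.na(Test$Age) & Test$Age > 40, ]
print(older)

# Summaries need na.rm = TRUE when a column contains NA
print(mean(Test$Age, na.rm = TRUE))
```

Without na.rm = TRUE, mean() returns NA as soon as one value in the column is missing.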

2.5 IMPORTING AND EXPORTING DATA
Quite often we need to get our data from external files like text files, Excel sheets and CSV
files. For this purpose, R has the capability to easily load data in from external files.
Your environment might have many objects and values, which you can delete using the
following code:
rm(list = ls())
The rm() function allows you to remove objects from a specified environment.
Importing TXT files: If you have a .txt or a tab-delimited text file, you can easily import it
with the basic R function read.table().
setwd("D:/Rdata")
Inc_ds <- read.table("Income.txt")
For files that use separator strings other than commas, you can use the read.table function.
The sep= argument defines the separator character, and you can specify a tab character with
"\t". Call read.table on "Inc_tab.txt", using tab separators:
read.table("Inc_tab.txt", sep = "\t")
Notice the "V1", "V2" and "V3" column headers? The first line is not automatically treated
as column headers with read.table. This behavior is controlled by the header argument. Call
read.table again, setting header to TRUE:
Inc_th <- read.table("Inc_tab.txt", sep = "\t", header = TRUE)
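For readers who do not have the data files at hand, here is a self-contained sketch of the same round trip. The income data frame and its values are invented for illustration and are not the contents of Inc_tab.txt; the export step uses write.table(), the base R counterpart of read.table().

```r
# Hypothetical income data, invented for illustration
income <- data.frame(
  Id = c(101, 102, 103),
  Gender = c("male", "female", "male"),
  Income = c(52000, 61000, 48000)
)

# Export to a tab-delimited text file in a temporary location
tab_file <- tempfile(fileext = ".txt")
write.table(income, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Without header = TRUE the first line is read as data
no_header <- read.table(tab_file, sep = "\t")
print(names(no_header))    # "V1" "V2" "V3"
print(nrow(no_header))     # 4, because the header line counts as a row

# With header = TRUE the first line supplies the column names
with_header <- read.table(tab_file, sep = "\t", header = TRUE)
print(names(with_header))  # "Id" "Gender" "Income"
print(nrow(with_header))   # 3
```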

