Data Analytics for Beginners, by Paul Kinley

Data Analytics for Beginners
Basic Guide to Master Data Analytics


Table of Contents:
Introduction
Chapter 1: Overview of Data Analytics
Foundations of Data Analytics
Getting Started
Mathematics and Analytics
Analysis and Analytics
Communicating Data Insights
Automated Data Services
Chapter 2: The Basics of Data Analytics
Planning a Study
Surveys
Experiments
Gathering Data
Selecting a Useful Sample
Avoiding Bias in a Data Set
Explaining Data
Descriptive analytics
Charts and Graphs
Chapter 3: Measures of Central Tendency
Mean
Median
Mode
Variance
Standard Deviation
Coefficient of Variation
Drawing Conclusions


Chapter 4: Charts and Graphs
Pie Charts
Create a Pie Chart in MS Excel
Bar Graphs
Create a Bar Graph with MS Excel
Customizing the Bar Graph
Time Charts and Line Graphs
Create a Line Graph in MS Excel
Customizing Your Chart
Annual Employee Losses


Adding another Set of Data
Histograms
Create a Histogram with MS Excel
Creating a Histogram
Scatter Plots
Create a Scatter Chart with MS Excel
Spatial Plots and Maps
Chapter 5: Applying Data Analytics to Business and Industry
Business Intelligence (BI)
Data Analytics in Business and Industry
BI and Data Analytics
Chapter 6: Final Thoughts on Data
Conclusion


Introduction
We live in thrilling and innovative times. As business moves to the digital environment, virtually every
action we take produces data. Information is collected from every online interaction. All sorts of devices

gather and store data about who we are, where we are, and what we are doing. Increasingly massive
warehouses of data are now freely available to the public. Skilled analysis of all this data can help
businesses, governments, and organizations make better-informed decisions, respond quickly to
changing needs, and gain deeper insights into our rapidly changing environment. Even attempting to
make good use of all of the available data is a challenge. In order to answer specific questions, a
person must decide what data to collect, which methods to use, and how to interpret the results.
Data analytics is a way to make valuable use of all types of information. Analytics is used to help
categorize data, identify patterns, and predict results. Data use has become so ubiquitous that it has
become necessary for individuals in every profession to learn how to work with data. Those who
become the most proficient at working with data in useful and creative ways will be the most successful
in the new world of business.
Until recently, data analytics was limited to an exclusive culture of data analysts, who characteristically
presented this topic in complicated and often unintelligible terminology. Fortunately, data analytics is
not as complicated as many believe. It simply consists of using analytical methods and processes to
develop and explain specific and useful information from data. The point of data analytics is to enhance
practices and to support better-informed decisions. This can result in: safer practices within an industry,
greater revenues for a business, higher customer satisfaction, or any other object of focus. This eBook
introduces a wide range of ideas and concepts used for deriving useful information from a set of data,
including data analytics techniques and what can be achieved by using them.


Chapter 1: Overview of Data Analytics
With a little statistical understanding and procedural training, you will be able to use analytical methods
to make data-based insights. Data analytics offers new ways to understand the world. Businesses and
organizations were in the habit of making decisions based on assumptions and hoping for favorable
outcomes. Data analytics gives people the insights that they need to plan for improvements and specific
results. Analytics are generally used for the following purposes:


To enhance business organizations and increase returns on investment (ROIs).




To improve the success of sales and marketing campaigns.



To identify trends and emerging developments.



To make society safer.

Foundations of Data Analytics
Data analytics requires the use of mathematical and statistical procedures. It also requires the skills to work
with certain software applications and a knowledge of the subject area you are working with. Without
knowledge of the subject matter, analytics is reduced to simple analysis. Due to the increasing demand
for data insights, every field of business has begun to implement data analytics. This has resulted in a
variety of analytic specialties, such as: market analytics, financial analytics, clinical analytics,
geographical analytics, retail analytics, educational analytics, and many other areas of interest.

Getting Started
This chapter explains the major components of data analytics: gathering, exploring, and
interpreting data. As a data analyst, you will be collecting and sorting large volumes of raw,
unstructured, and partially-structured data. The amounts of data that you are likely to be working with
can be too large for a normal database system to process effectively. A data set that is too large, changes
too quickly, or does not conform to the structure of standard database designs requires a special
skillset to manage. Data analytics consists of analyzing, predicting, and visualizing data. When data
analysts gather, query, and interpret data, they conduct a process that is quite similar to data engineering.
Although useful insights can be produced from an individual source of data, the blending of several

sources gives context to the data that is necessary to make more informed decisions. As a data analyst,
you can combine multiple datasets that are maintained in a single database. You can also work with
several different databases maintained within a large data warehouse. Data can also be maintained and
managed within a cloud-based platform specially designed for that purpose. However the data is pooled
and wherever it is stored, the analyst must still issue queries on the data and make commands to retrieve
specific information. This is typically done using a specialized database language called Structured
Query Language (SQL).
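As a rough illustration of what issuing such a query can look like, the sketch below uses Python's built-in sqlite3 module against a small in-memory database. The table, columns, and values are invented for the example; a real analyst would query an existing database instead.

```python
import sqlite3

# Build a small in-memory database (table and values are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 50.0)],
)

# A typical analyst query: total sales per region.
query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
for region, total in conn.execute(query):
    print(region, total)
# North 170.0
# South 80.0
```

The same SQL statement would work largely unchanged against a full database server; only the connection setup differs.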


When using a database software application or conducting an analysis using other programming
languages, like R or Python, you can utilize a variety of digital file formats, such as:





Comma-separated values (CSV) files: Virtually all data-based software applications (including
cloud-based programs) and scripting languages are compatible with the CSV file type.
Programming Scripts: Professional data analysts generally know how to write programming
scripts in order to work with data and visualizations in languages like Python and R.
Common File Extensions: MS Excel files have the .xls or .xlsx extension. Geospatial
applications are saved with their own file formats (e.g., the .mxd extension for ArcGIS and the .qgs
extension for QGIS).
Web Programming Files: Web-based data visualizations often use the Data Driven Documents
JavaScript library (D3.js). D3.js files are saved as .html files.
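As a small illustration of the first item, the following Python sketch parses CSV data with the standard csv module. The sample rows are invented; note that CSV values always load as text and must be converted to numbers explicitly.

```python
import csv
import io

# A tiny CSV sample (in practice this would come from a file on disk).
raw = "name,score\nAda,91\nGrace,87\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], rows[0]["score"])  # Ada 91

# CSV fields are strings, so numeric work requires conversion.
scores = [int(row["score"]) for row in rows]
print(sum(scores))  # 178
```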

Mathematics and Analytics
Data analytics requires the ability to perform mathematical and statistical operations. These skills are
necessary both to make sense of the data and to evaluate its relative significance. They are
also important in data analytics because they can be used to conduct data forecasting, decision analytics,

and testing of hypotheses. Before getting into more advanced explorations of mathematical and
statistical procedures, we will take some time to explain some distinctions between mathematics and
analytics.
Mathematics relies on specific numerical procedures and deductive reasoning to develop a mathematical
explanation of some phenomenon. Like mathematics, analytics provides a mathematical description of a
phenomenon. Analytics is actually a type of analysis that is based on mathematics. However, analytics
uses inductive reasoning and probability to form conclusions and explanations.
Data analysts use mathematical procedures to make decision models, to produce estimations, and to
make forecasts. In order to follow this book, you need little more than common math skills. This book
will teach you how to use statistical techniques to develop insights from data. In the field of data analytics,
statistical procedures are used to determine the meaning and significance of data. This can then be utilized
to test hypotheses, build data simulations, and make predictions about future outcomes.

Analysis and Analytics
The major difference between data analysis and data analytics is the need for subject knowledge. Typical
statisticians specialize in data procedures and have little-to-no knowledge of other fields of study. They
must consult with others who have subject-specific expertise to know which data to look for and to help
find meaning in that data. Data analysts, on the other hand, must understand their subject matter. They
combine the insights they gain from the data with their subject-matter expertise to make meaning of
those insights. Below is a list of ways that subject-matter experts use analytics to enhance performance
in their areas:







Engineering analysts use data analytics with building designs.
Clinical data analysts use predictive methods to foresee future health issues.
Marketing data analysts use regression data to predict and moderate customer turnover.
Data journalists search databases for patterns that may be worth investigating.
Crime data analysts develop spatial models to identify patterns and predict future crimes.
Disaster relief data analysts work to organize and explain important data about the effects of
disasters, which is then used to determine the types of assistance needed.


Communicating Data Insights
Data analysts often have to explain data in ways that non-technical people can comprehend. They must
be able to create understandable data visualizations and reports. Generally, people have to visually
process data in the form of charts, graphs, and pictures in order to understand it. Analysts have
to be both creative and practical in the ways that they communicate their findings.
Organizational leaders often have difficulties trying to figure out what to do with all of the data that their
organization collects. What they do know, however, is that effectively using analytical tools can help
them both to strengthen their business or organization and to gain a valuable competitive edge.
Currently, very few of these leaders know the available options for engaging in the process. The
following section discusses the major data analytics solutions and the benefits that can be gained by
organizations.
When implementing data analytics within an organization, there are three key approaches. One can
create an internal data analytics department, one can contract out the assignments to independent data
analysts, or one can pay for a cloud-based software-as-a-service (SaaS) solution that enables novices to
utilize powerful data analytics tools.
There are a few major ways to create an internal data analytics team:


Train current personnel. This can be an inexpensive way to provide an organization with
ongoing data analytics. This training can be used to transform certain employees into
highly-skilled subject-matter experts who are proficient in data analysis.




Train current personnel and also hire professional analysts. This strategy follows the same
process as the first method, but also includes hiring a few data professionals to oversee the
process and personally handle the most challenging problems and tasks.



Hire data professionals. An organization can get its needs met by hiring or contracting with
professional data analysts. This is the most expensive option, because professional data analysts
are in short supply and generally have high salary requirements.

Securing highly-skilled data analysts to meet the needs of an organization can be extremely difficult.
Many businesses and organizations outsource their data analytics jobs to external experts. This happens
in two different ways: An organization can contract with someone to develop a wide-ranging data
analytics plan to serve the entire organization, or it can contract with experts to provide individual
data analytics solutions for specific situations and problems that the organization may encounter.

Automated Data Services
Although you must understand certain statistical and mathematical procedures, it is not essential to
learn how to code like professional analysts. Software applications have been developed that
provide powerful capabilities without the need to code or script. Cloud-based platform
solutions can meet most or all of an organization's data analytics needs, although training is
still required for personnel to operate the cloud platform programs.
This book will teach you how to use the power of data analytics to achieve individual and
organizational goals. Regardless of your field of work, learning data analytics can help you to become a
more proficient and sought-after professional. Below is a brief list of benefits that data analytics
provides for various areas:



Benefits for corporations: Cost minimization, higher return on investment (ROI), increased staff productivity, reduction of customer loss, higher customer satisfaction, sales forecasting, pricing-model enhancement, loss detection, and more efficient processes.



Benefits for governments: Increased staff productivity, improved decision-making models, more
reliable budget forecasting, more efficient resource allocations, and discovery of organizational
patterns.



Benefits for academia: More efficient resource allocations, improved instructional focus and
student performance, increased student retention, refinement of processes, reliable budget
forecasting, and increased ROI for student recruitment practices.

This chapter provided an introduction to the concept of data analytics. Analytics is a growing field of
science that brings together traditional statistical procedures and computer science in order to ascertain
meaningful insights from huge sets of raw data for the benefit of businesses, organizations,
governments, and society. Data analytics is sometimes confused with Business Intelligence (BI) because
of the common tools they both share, particularly data visualizations, such as traditional charts and
graphs. BI, however, is a discipline designed for business leaders without the advanced training
necessary to engage in data analytics. The following chapter discusses the basic principles of data
analytics.


Chapter 2: The Basics of Data Analytics
This chapter will help you to understand the big picture of the field of analytics. It will discuss the steps
of the scientific method, and it will help you to learn how to apply analytics at each step of the scientific
process. Analytics does not only consist of analyzing data. It also consists of using the scientific process
to find answers to questions and make important decisions. The process includes designing studies,
gathering useful information, explaining the data with figures and charts, exploring the data, and

drawing conclusions. We will now examine each step in this process and discuss the critical role of
analytics.

Planning a Study
Once the research question is established, it is time to design a study to answer that specific question.
This requires figuring out the methods that you will use to extract the necessary data. This section covers
the two main types of studies: descriptive studies and experimental studies.

Surveys
With a descriptive study, data are gathered from people in a way that does not have an impact on them.
The most widely used type of descriptive study is a survey. Surveys are questionnaires that are given to
people who are randomly selected from a target population. Surveys are useful data tools for gathering
information. As with all methods of gathering data, improperly conducted surveys are likely to result in
inaccurate information. Common issues with surveys include inadequately worded questions, which can
be confusing, lack of participant response, or lack of randomization in the selection process. Any of
these problems can invalidate the results of the survey, therefore surveys must be carefully planned
before they are implemented.
A limitation of the survey method is that surveys can only provide information on relationships that exist
between variables, not information on causes and effects. If survey researchers observe that
people who smoke cigarettes, for example, tend to work longer hours per day than those who do not
smoke, they are not in a position to suggest that smoking is the cause of the longer work hours.
Variables that were not part of the research design might cause the relationship, such as the number of
hours participants sleep each night.

Experiments
Experiments involve the application of one or more treatments to subjects in a controlled environment.
The treatments are things that may or may not affect the subject under study. Some studies involve
medical experiments, wherein the subjects are patients who undergo medical treatments. Other
experiments might include students who receive tutoring, or exposure to a particular instructional tool as
the treatment. Businesses engage in experiments that involve sample participants from the consumer

market. These participants may be exposed to a certain type of advertisement and asked how they were
emotionally affected.
Once the treatments are applied, the responses are systematically recorded. For instance, to study the
effect of a drug dosage amount on blood pressure, a group of subjects may be administered 15 mg of a
medicine. A different sample group may be administered 30 mg of the same drug. Typically, a control
group is also involved, where subjects each receive a placebo treatment (i.e., a substance with no
medicinal properties).


Experiments are often designed to take place in a controlled setting, in order to reduce the number of
potential unrelated variables and possible biases that might affect the results. Some possible problems
might include: researchers knowing which participants received particular treatments; a particular
circumstance or condition, not factored into the study, that may impact the results (e.g., other
medications that a participant may be taking), or not including an experimental control group. However,
when experiments are designed correctly, differences in responses found when the groups are compared
allow the researchers to conclude that there is a cause-and-effect relationship. No matter what the study,
it must be designed so that the original questions can be answered in a credible way.

Gathering Data
Once a research plan (whether descriptive or experimental) has been designed, the subjects must be
selected, and data must be gathered. This stage of the research process is essential to generating
meaningful data. The ways in which data are collected vary with the type of study. In experimental
designs, the data should be collected in the most controlled manner possible, in order to reduce the
possibility of generating contaminated results. Some experiments require more stringent procedures
than others. When gathering data on people’s perceptions of a new business marketing strategy or data
concerning the effectiveness of a new teaching strategy, the consequences of inaccurate results are not as
critical as they would be in a medical study. Therefore, in low-stakes experiments, it is sometimes
preferable to use less robust data gathering procedures in order to save time and money.

Selecting a Useful Sample

In analytics, as with computer programming, garbage in results in garbage out. If subjects are
improperly chosen, for example by giving some a greater chance of being selected than others, the results
will be unreliable and not useful for making decisions. For example, John is researching the attitudes of
individuals about a possible new tax. John stands in front of a local grocery store and asks passers-by to
share their thoughts and attitudes. The problem is that John will only get the attitudes of a)
individuals who shop at that grocery store; b) on that specific day; c) at that specific time; and d) who
actually chose to participate. Because of his limited selection process, the subjects in his survey are not
representative of the entire population of the town. Likewise, John might design an online survey and ask
people to input their feedback on the new tax. However, only people who are aware of the website, have
access to the Internet, and choose to participate will provide data. Characteristically, only people with
the strongest attitudes are likely to participate. Again, these participants would not be representative
of everyone in the town.
In order to avoid such selection bias, it is necessary to select the sample randomly, using some type of
process that gives everyone in the population the same statistical opportunity to be chosen. There are
various methods for randomly selecting subjects in order to get valid and usable results.
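One such method, simple random sampling, can be sketched with Python's standard random module. The population of resident IDs below is hypothetical; the key property is that random.sample gives every member the same chance of being chosen.

```python
import random

# Hypothetical population: every resident of the town gets an ID.
population = list(range(1, 10001))  # resident IDs 1..10000

random.seed(42)  # seeded only so the sketch is repeatable
sample = random.sample(population, 100)  # each ID equally likely, no repeats

print(len(sample), len(set(sample)))  # 100 distinct residents
```

In practice the population list would come from a sampling frame such as a voter roll or customer database, and the seed would be omitted.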

Avoiding Bias in a Data Set
If you were conducting a phone survey on political voting preferences, and you made your calls to
people’s land lines at home between the hours of 8:00 a.m. and 4:00 p.m., you would fail to get feedback
from individuals who work at that time. Perhaps those who work during those hours have different
preferences than those at home during those hours. For example, more business owners may be at home
and express voting preferences for something completely different than members of the working class.
Surveys that are poorly designed may be too lengthy, resulting in some participants quitting before they
finish. Participants may not be completely honest if the questions are too personal. If the list of choices
is too limited, the survey will not be able to capture valuable data that people would have provided.
Many things can render survey data invalid.


Experiments can be even more problematic in terms of gathering data. If you want to test how well
people retain information when exposed to loud music, a variety of factors could affect the outcomes.

The experiment designer should consider if everyone will listen to the same song, if they will be asked
about the amount of sleep they got the night before, if they have prior knowledge about the type of
subject matter, how they feel about being there participating in the experiment, whether they use drugs
or alcohol regularly, and a host of other factors that must be considered in order to control for
outside variables.

Explaining Data
Once data has been collected, it is time to compile it in order to get a view of the entire data set.
Analysts describe data in two basic ways: with images, like graphs and charts, and with figures, called
descriptive analytics. Charts and graphs are the most commonly used methods for describing data to the
general population. When used effectively, a chart or graph can easily explain volumes of data in a
single snapshot.

Descriptive analytics
Data can be summarized by using descriptive analytics. Descriptive analytics are numerical
representations of data that highlight the most important features of a dataset. With categorical data,
wherein everything is sorted into groups (e.g., age group, gender, ethnicity, currency, price range), things
are usually summarized by the number of units in each category. This is referred to in terms of frequency
or percentage.
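The frequency-and-percentage summary described above can be sketched in Python with the standard Counter class; the survey responses are invented for illustration.

```python
from collections import Counter

# Hypothetical categorical data: preferred car type per respondent.
responses = ["sedan", "SUV", "sedan", "truck", "SUV", "sedan"]

counts = Counter(responses)  # frequency of each category
for category, freq in counts.items():
    print(category, freq, f"{100 * freq / len(responses):.1f}%")
# sedan 3 50.0%
# SUV 2 33.3%
# truck 1 16.7%
```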
Numerical data consists of literal quantities or totals (e.g., height, weight, amount of money, etc.),
wherein the actual numbers are meaningful. When working with numerical data, more aspects can be
summarized than just the number or percentage within each category. Such elements include measures
of center (i.e., the center point of the data) and measures of variance (i.e., how widely spread or how
tightly-clustered the data are around the center). Another consideration is a measure of the relationship
between different variables.
Depending on the particular situation, certain descriptive analytics are more appropriate than others. For
example, if you were to assign the codes 1 for men and 2 for women, when analyzing the data it would not
make sense to attempt to average those numbers. Likewise, attempting to use a percentage to explain a
singular amount of time would not be useful.
Another type of data, ordinal data, is somewhat of a combination of the first two types. Ordinal data
appear in categories; however, the categories have a hierarchical order, such as rankings from 1 to 10,
or student ranks of freshman through senior. This data can be analyzed the same way as categorical
data. Numerical data procedures can also be used when the categories represent meaningful numbers.

Charts and Graphs
Data can be presented visually with graphs and charts. Such graphs include pie charts and bar charts,
which can be used with categorical variables like gender or type of car. A bar graph might present data
about attitudes using, for example, a series of five ordered bars labeled from “Strongly Disagree”
through “Strongly Agree.”
Not all data, however, can be presented clearly with these types of charts. Numerical data, such as
height, time, or dollars, that represent measures or totals of something require the types of graphs that
can either summarize the numbers or group them numerically. One such graph is the histogram, which
will be discussed later in this book.


Once the data is collected and described with pictures and numbers, it is time to begin the process of
data analysis. Assuming that the study was planned well, the research question can be properly answered
by applying an appropriate data analysis. As with all previous steps in the process, selecting an
appropriate analytical procedure determines the usefulness of the results.
This chapter discussed the foundations of data analytics. Using mathematical techniques and scientific
procedures to collect, measure, analyze, and draw conclusions from data is what data analytics is all
about. The following chapter discusses the major kinds of data analyses necessary to conduct effective
data analytics. In the following chapter you will learn the basics of calculating and measuring common
descriptive analytics for measuring central tendency and variation within a set of data, as well as the
analytics necessary to evaluate the relative position of a specific value within that data set.


Chapter 3: Measures of Central Tendency
The essence of data analytics is the analysis of data. Analysts use analytical procedures to make sense
out of large amounts of data and their characteristics. Analytical methods can be applied to find
commonalities within groups of people, which can then be used to influence the decisions that they
make. This is done all of the time by advertisers and politicians. A governmental department, for
example, may want to find out the average number of people below the age of 18 that use smokeless
tobacco products. Based on the results of their study, the department could propose a new requirement
that smokeless tobacco products be restricted from advertising near schools. Likewise, a fashion
designer might want to learn the height and weight of U.S. women with full time jobs. A great deal of
data analytics is conducted to find averages and other measures of central characteristics among sets of
data.
When investigating a total of 100 units, it can be convenient to gather the entire population and apply
measurements. When dealing with larger numbers, reaching the millions or even billions, measuring the
entire population becomes far more challenging. In such situations, it is necessary to take random samples
from the total population and allow the sample to represent the total group. This section discusses
some of the essential principles of data analytics that lay the foundation for all types of data analysis.
These important concepts are: mean, median, mode, variance, and standard deviation.

Mean
The mean, or average, of a set of data is the sum of all the numbers within a group divided by the number
of units in the group. The mean of a group is a representative property of the collective group. Useful
assumptions can be made about an entire set of data by figuring out its mean. The formula for
calculating the mean is below.
Mean = Sum of all the set elements / Number of elements.
For example: (1+2+3+4+5) / 5 = 3
The mean of a data set summarizes all of the data with a single value. An analyst might want to compare
the average price of houses between two different neighborhoods. In order to compare these housing
prices, it would be illogical to compare the price of each individual house to the price of every other
house in the study. The best way to approach this research question would be to find the mean prices of
houses in each of the two neighborhoods, and then compare the two means with each other. By doing
this, the analyst will be able to make a valid assumption about which neighborhood has the more
expensive houses.
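The neighborhood comparison can be sketched in a few lines of Python; the house prices are invented for illustration.

```python
# Hypothetical house prices for two neighborhoods.
neighborhood_a = [210_000, 250_000, 240_000, 300_000]
neighborhood_b = [190_000, 220_000, 205_000, 185_000]

def mean(values):
    # Sum of all the set elements divided by the number of elements.
    return sum(values) / len(values)

print(mean(neighborhood_a))  # 250000.0
print(mean(neighborhood_b))  # 200000.0
```

Comparing the two means (250,000 vs. 200,000) supports the conclusion that neighborhood A is the more expensive of the two.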


Median
The median is the middle number of a data set. For a set of data that is composed of an odd number of
values, the value in the middle is the median. For a set of data composed of an even number of values,
the average of the two middle numbers is the median. The median is commonly utilized to divide a
collection of data into two separate halves.
In order to find the median of a set of data, write the numbers of the set in order from smallest to largest,
then count the number of units and identify the one or two numbers in the center. This is different from


calculating the mean, because the range of number values is not taken into consideration. Consider this
set of numbers: (1, 2, 3, 4, 20):
Mean: (1+2+3+4+20) / 5 = 6

Median: (1, 2, 3, 4, 20) = 3

The median of a data set is important, because it is not affected by abnormal deviations in the data set.
As we can see in our example, the value "20" disproportionately affects our mean, making it appear as
though half of the values would be below 6 and the other half above 6. The mean, in this case, does not
provide a realistic representation of the data set. If the values represented dollars per week in allowance,
it would appear that the individual receives amounts that are half over and half under $6, when in fact,
the person only once received more than $4. The median, in this case, provides us with a
more accurate description of the contents of the data set. Bear in mind that this small collection of data
only consists of 5 values, so it is easy to understand with a quick glance. When the data set contains
hundreds of thousands of values, accurate estimations cannot be made with a quick glance.
The most significant feature of this data set is the single outlier that raises the mean. An outlier is an
outstanding deviation from the majority of the data set. For instance, if a set of data contains the values:
10, 20, 30, 40, 1000, the value 1000 is considered an outlier. Outliers can move the value of the mean far
from its logical central location. The mean of the above set is 1100 / 5 = 220, and the median is 30. The
median of this set more accurately represents the data set than does the mean.
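Python's standard statistics module reproduces the mean and median of this outlier example directly:

```python
import statistics

data = [10, 20, 30, 40, 1000]  # the outlier set from the text

print(statistics.mean(data))    # 220
print(statistics.median(data))  # 30
```

The gap between the two values (220 vs. 30) is itself a useful signal that the data set contains at least one extreme outlier.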


Mode
In a data set, the mode is the value that occurs most frequently. Mode is a measure of central tendency
like mean and median. The mode also represents a set of data with a single value. For instance, the mode
of the dataset (1,2,3,3,3,4,4,4,4,4,5,5,6,7) is 4, because it appears more than any other value.
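The mode of this dataset can be confirmed with Python's statistics module:

```python
from statistics import mode

values = [1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 7]
print(mode(values))  # 4, the most frequently occurring value
```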
If a data set has a normal distribution of values, the mode is equal to the values of the median and the
mean. With data distributions that are skewed (not normal), the mean, median, and mode values may
all be different. Data is symmetrical to the central value in a normal distribution. The distribution curve
in a normally-distributed data set is also symmetrical to an axis. Also, in a perfectly normal distribution,
half of the data values are lower than the mean, and the other half are higher.
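All three measures of central tendency are available in Python's standard `statistics` module; here is a short sketch using the mode example above:

```python
import statistics

data = [1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 7]

mode = statistics.mode(data)      # most frequent value: 4 appears five times
median = statistics.median(data)  # average of the two middle values: 4.0
mean = statistics.mean(data)      # sum of all values divided by the count

print("mode:", mode)
print("median:", median)
print("mean:", mean)
```

Note that this data set is skewed, so the mean is not exactly equal to the median and the mode.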

Variance
It is sometimes necessary, and always helpful, to measure the variation from the mean value within a set of data. As we saw earlier, one or two outliers can result in an inaccurate representation of the data set. For example, a large variance within family income data for a city may indicate that a mostly poor population with a few wealthy members is producing the same average income as a solidly middle-class population. Measuring variance adds context to a standard data analysis. Below is the procedure for finding variance:
-----------------------------------------------------------------------------------------------
Step 1
Calculate the mean of the data set.
Example: (1, 2, 3, 4, 5)
Mean: (1+2+3+4+5) / 5 = 3
Step 2


Find the difference between the mean of the data set and each individual value (using absolute values...no negative numbers).
|3 - 1| = 2, |3 - 2| = 1, |3 - 3| = 0, |3 - 4| = 1, |3 - 5| = 2
Step 3
Square each of those differences.
2 x 2 = 4, 1 x 1 = 1, 0 x 0 = 0, 1 x 1 = 1, 2 x 2 = 4

Step 4
Add all of the squared differences together.
4 + 1 + 0 + 1 + 4 = 10
Step 5
Divide that total by the number of values in the data set minus one.
10 / (5 - 1) = 2.5
Var = 2.5
-----------------------------------------------------------------------------------------------
Because the differences are squared (see step 3), the variance of a data set is always positive. Calculating the variance of a real data set by hand would take far too long; using data software, the variance of a data set with thousands of values can be calculated within seconds (actually, microseconds). Perhaps the most important function of the variance is that it is used to calculate the standard deviation, which is a critical concept of data analytics.
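The five steps above can be sketched directly in Python; the hand calculation matches the built-in sample variance from the `statistics` module:

```python
import statistics

data = [1, 2, 3, 4, 5]

# Step 1: calculate the mean of the data set
mean = sum(data) / len(data)  # 15 / 5 = 3.0

# Steps 2 and 3: square the difference between each value and the mean
squared_diffs = [(x - mean) ** 2 for x in data]  # [4.0, 1.0, 0.0, 1.0, 4.0]

# Steps 4 and 5: add the squared differences, then divide by n - 1
variance = sum(squared_diffs) / (len(data) - 1)  # 10 / 4 = 2.5

print("variance:", variance)                      # 2.5
print("built-in:", statistics.variance(data))     # 2.5, the same result
```

Dividing by n - 1 rather than n gives the sample variance, which is what `statistics.variance` also computes.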

Standard Deviation
Standard deviation is a single value that represents how widely spread the values in a data set are from the central value (the mean). The more spread out a data distribution is, the greater its standard deviation. Standard deviation is derived by calculating the square root of the variance. This value provides a precise measure of how widely dispersed the values in a data set are, making it a highly reliable analytical value that supports more advanced statistical procedures. Standard deviation is also necessary for performing probability calculations, making it that much more important to data analytics.
Step 1
Calculate the variance of the data set. This is necessary to find the standard deviation.
In our earlier example the variance was 2.5.
Step 2
Calculate the square root of the variance.
√2.5 = 1.58.
Check to verify that 3 out of 5 (60%) of the values in the data set (1, 2, 3, 4, 5) fall within one standard deviation (1.58) of the mean (3). We know what the standard deviation is...but what does it really mean? To determine whether our standard deviation is low (which means that the distribution is uniform and therefore representative of the average member of the population) or high (which means that the distribution is not very uniform and therefore less representative of the average member), we must normalize it by calculating the Coefficient of Variation.


Coefficient of Variation
The coefficient of variation (CV) is the standard deviation divided by the mean. This formula is applied to normalize the standard deviation so that it can be evaluated. Generally, a CV >= 1 indicates high variation, and a CV < 1 indicates low variation; the greater the distance from 1 in either direction, the more significant the result. Let us consider our example:
CV = Std. Dev. / mean
CV: 1.58 / 3 = 0.53
Because the CV < 1, we can assume that our data set is strongly representative of the average member of
the total population.
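These calculations can be sketched in Python; note that the CV divides the standard deviation by the mean of the data set, which is 3:

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

variance = statistics.variance(data)  # sample variance: 2.5
std_dev = math.sqrt(variance)         # square root of the variance
mean = statistics.mean(data)          # 3

cv = std_dev / mean                   # coefficient of variation

print("std dev:", round(std_dev, 2))  # 1.58
print("CV:", round(cv, 2))            # 0.53
```

The `statistics` module also provides `statistics.stdev(data)`, which returns the sample standard deviation in one call.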
Example:
Imagine that the population of a particular city has an average monthly income of $5,000. Based on the mean, we might assume that the average citizen of this city is doing well financially. As data analysts, however, we know that before we can make a reasonable assumption, it is necessary to determine how uniformly income is distributed among the population by calculating the variation of the data set. If the standard deviation is high, we may assume that salaries are unevenly distributed throughout the population. In that event, we should not assume that the average member earns a monthly income in the neighborhood of $5,000. If the standard deviation is low, then we may reasonably consider the population generally affluent.
In a normally distributed data set, about 68% of the values will be within one standard deviation of the group mean. About 95% of the values will be within two standard deviations of the mean, and about 99.7% of all values will be within three standard deviations of the mean. Consider the statement, "Ninety-five percent of a town's residents are between the ages of 4 and 84 years old." Assuming the ages are roughly normally distributed, the mean is the midpoint of that range: (4 + 84) / 2 = 44. Because the range covers 95% of the population, it spans about two standard deviations on each side of the mean, so the standard deviation is (84 - 44) / 2 = 20. We can therefore expect that about 68% of the citizens are within one standard deviation of 44, that is, between the ages of 24 and 64.
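The 68-95-99.7 rule can be verified empirically by sampling from a normal distribution; here is a sketch using Python's standard `random` module (the mean of 50 and standard deviation of 10 are arbitrary values chosen for illustration):

```python
import random
import statistics

# Draw a large sample from a hypothetical normal distribution
random.seed(42)
data = [random.gauss(50, 10) for _ in range(10_000)]

mean = statistics.mean(data)
std_dev = statistics.stdev(data)

# Fraction of values within one and two standard deviations of the mean
within_one = sum(1 for x in data if abs(x - mean) <= std_dev) / len(data)
within_two = sum(1 for x in data if abs(x - mean) <= 2 * std_dev) / len(data)

print("within 1 std dev:", round(within_one, 2))   # roughly 0.68
print("within 2 std devs:", round(within_two, 2))  # roughly 0.95
```

The fractions will not be exactly 68% and 95% for any finite sample, but they converge on those values as the sample grows.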

Drawing Conclusions
Analysts utilize computers and formulas. However, neither computers nor formulas can detect whether they are being used to perform useful operations, nor can they determine the meaning or significance of the results. A common error in analytics is to overemphasize the significance of the results, or to apply the results to the general population when there is no logical basis for doing so. For example, suppose a research team is studying which types of restaurants airline travelers prefer to frequent. They interview 100 travelers at the local airport and ask them to rate each restaurant from a provided list. They produce a top-5 list and conclude that travelers like those 5 restaurants the most. In reality, they only know which restaurants those particular travelers like the most; they cannot draw conclusions about travelers everywhere.
Analytics is much more than just numbers. It is important for analysts to know how to draw sensible
conclusions from their results.
This chapter discussed measures of central tendency and the role they play in data analytics. Analytical concepts were explained, including standard deviation, variance, relative standing, and other measures of variation. All data analysis is affected by variation and by how the values within the data set are distributed. Normally distributed data values strengthen both the inferences that can be drawn and the predictions that can be made from statistical procedures conducted on a set of data.


Chapter 4: Charts and Graphs
This chapter presents visual ways to present data, including pie charts and bar graphs for categorical data, time charts for time series data, and histograms and boxplots for numerical data. The primary purpose of data displays is to organize and present data clearly and effectively. The reader will learn the most common types of data displays used to present both categorical and numerical data. Also discussed are caveats concerning data interpretation and guidelines for data evaluation.


Pie Charts
Pie charts are used for categorical data. They illustrate the percentage of individuals that fall into each category, and the pieces of the pie together total 100%. Because a pie chart is visually straightforward, categories can be clearly compared and contrasted with one another. Budgets are typically presented with pie charts to show how money is distributed.

Total Yearly Sales by Quarter
In order to assess the accuracy of a pie chart:

- Make sure that the percentages add up to 100%, or very close to it.
- Check for pieces of the pie labeled "other" that are disproportionately big in relation to other slices.
- Verify that the pie chart consists of percentages for each category, and not the literal numbers in each group.


Create a Pie Chart in MS Excel
Step 1. Open a new MS Excel spreadsheet. Enter your data into two columns.

In this example, the pie chart will be created to identify the relative percentage of money spent on each item at the grocery store. The data table includes a column with the list of grocery items and another column with the amount of money spent on each item. The process is the same whether it is for a small list of groceries or a large list of corporate transactions.
Items      Amount Spent
Cereal     $5.50
Milk       $4.10
Bananas    $1.25
Yogurt     $0.75
Total      $11.60
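The percentages that the pie chart will display can be computed directly; here is a short sketch in Python using the grocery figures above:

```python
# Grocery amounts from the table above
items = {
    "Cereal": 5.50,
    "Milk": 4.10,
    "Bananas": 1.25,
    "Yogurt": 0.75,
}

total = sum(items.values())  # $11.60

# Each slice of the pie is the item's share of the total, in percent
shares = {name: 100 * amount / total for name, amount in items.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

Cereal, for example, works out to about 47.4% of the total, and the shares of all the slices together add up to 100%.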

Step 2. Highlight the information that you would like to include in your pie chart. You do not have to include all of the data in your table; however, you must include at least one data record. Do this by clicking and dragging your mouse over the area. Be sure to include the column headings when you do this. In this example, those would be "Items" and "Amount Spent." This way, you can include the headings in your chart.


Step 3. Click on the "Insert" menu on the tool bar along the top of the screen. Select "Chart" from the
list of options. Then select the Pie Chart.


Step 4. Choose the type of pie chart you would like to make from the range of options. The pie chart options consist of a flat chart, a 3-D chart, an exploded chart, a pie-of-pie chart, and a bar-of-pie chart; the last two break out a section of the chart in more detail.


If you would like to preview a pie chart, click the "Press and Hold to View Sample" button.

Step 5. Press Enter, and review your pie chart. To edit or modify your chart, right-click on it and select from the extensive range of options.


Percentages Spent on Groceries

Bar Graphs
Bar graphs are another way to summarize categorical data. Like pie charts, bar graphs display data by category, indicating how many objects are in each group, or the percentage each category represents. Analysts typically use bar graphs to compare and contrast categorical groups by separating the categories and displaying the resulting bars next to each other.

Average Number of Cars Sold per Month
Below is a checklist for evaluating bar graphs:

- Make sure that the units on the Y-axis are evenly spaced.
- Consider the units of measurement on the scale of the bar graph. Smaller scales can make minor differences appear to be huge.
- If the bars represent percentages, as opposed to total numbers, look for the total number of units being summarized.


Create a Bar Graph with MS Excel
Step 1. Create a data table with 1 independent variable. Bar graphs are horizontal visualizations that
illustrate values or data from a single variable.


Include labels for the data and variable at the head of each column. If you want to graph the number of
military personnel recruited in a month, you would write "Branch" at the head of the first column and
"Recruited" at the head of the second column.
Branch      Recruited
Army        210
Navy        165
Air Force   130
Marines     75



As an option, you could insert a third column containing a sub-data category. The Bar Graph menu allows you to choose from a standard, clustered, or stacked bar graph. A stacked bar graph displays an additional value related to the variable.
Branch      Recruited   Code
Army        210         77B
Navy        165         50A
Air Force   130         45C
Marines     75          22D

Highlight the data that you would like to include in your graph. You can include everything in the data table or just a selection from the data set. Microsoft Excel will separate the X and Y axes by column.


Step 2. Click on the "Insert" menu on the tool bar along the top of the screen. Select "Chart" from the list of options, then select "Bar Graph." Click on the kind of bar graph you want from the choices available in the bar menu. Bar graph options include 2-D, 3-D, cylinder, cone, and pyramid-shaped bar graphs.

