


A Guide to Improving Data
Integrity and Adoption

A Case Study in Verifying Usage Data

Jessica Roper

Beijing • Boston • Farnham • Sebastopol • Tokyo


A Guide to Improving Data Integrity and Adoption
by Jessica Roper
Copyright © 2017 O’Reilly Media Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department: 800-998-9938.

Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

December 2016: First Edition

Revision History for the First Edition
2016-12-12: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to
Improving Data Integrity and Adoption, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the author disclaim all responsibility for errors or omissions, including without limi‐
tation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If
any code samples or other technology this work contains or describes is subject to
open source licenses or the intellectual property rights of others, it is your responsi‐
bility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97052-2
[LSI]


Table of Contents


A Guide to Improving Data Integrity and Adoption
  Validating Data Integrity as an Integral Part of Business
  Using the Case Study as a Guide
  An Overview of the Usage Data Project
  Getting Started with Data
  Managing Layers of Data
  Performing Additional Transformation and Formatting
  Starting with Smaller Datasets
  Determining Acceptable Error Rates
  Creating Work Groups
  Reassessing the Value of Data Over Time
  Checking the System for Internal Consistency
  Verifying Accuracy of Transformations and Aggregation Reports
  Allowing for Tests to Evolve
  Implementing Automation
  Conclusion
  Further Reading



A Guide to Improving Data
Integrity and Adoption

In most companies, quality data is crucial to measuring success and
planning for business goals. Unlike sample datasets in classes and
examples, real data is messy and requires processing and effort to be
utilized, maintained, and trusted. How do we know if the data is
accurate or whether we can trust final conclusions? What steps can
we take to not only ensure that all of the data is transformed cor‐
rectly, but also to verify that the source data itself can be trusted as
accurate? How can we motivate others to treat data and its accuracy
as a priority? What can we do to expand adoption of data?

Validating Data Integrity as an Integral Part of
Business
Data can be messy for many reasons. Unstructured data such as log
files can be complicated to understand and parse. A lot of data, even
when structured, is still not standardized. For example, parsing text
from online forums can be complicated and might need to include logic
to accommodate slang such as "bad ass," which is a positive phrase
built from negative words. The system creating the data can also make
it messy, because different languages and frameworks have different
design expectations; Ruby on Rails, for example, requires a separate
table to represent many-to-many relationships.
Implementation or design can also lead to messy data. For example,
the process or code that creates data, and the database storing that
data might use incompatible formats. Or, the code might store a set
of values as one column instead of many columns. Some languages
parse and store values in a format that is not compatible with the
databases used to store and process it, such as YAML (YAML Ain’t
Markup Language), which is not a valid data type in some databases
and is stored instead as a string. Because this format is intended to
work much like a hash with key-and-value pairs, searching with the
database language can be difficult.
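
As a small illustration of why that is awkward, consider the sketch below; the table and column names are hypothetical, and it simply assumes a YAML blob serialized into a string column. Because the database only sees text, the keys have to be parsed out in application code rather than filtered in the query.

# A minimal sketch, assuming a hypothetical SQLite table `user_profiles`
# whose `settings` column holds YAML serialized as plain text.
import sqlite3
import yaml  # PyYAML

conn = sqlite3.connect("app.db")
rows = conn.execute("SELECT id, settings FROM user_profiles").fetchall()

# The database cannot index or filter on individual YAML keys, so every
# row has to be pulled back and parsed in application code instead.
users_with_newsletter = []
for user_id, settings_text in rows:
    settings = yaml.safe_load(settings_text) or {}
    if settings.get("newsletter_opt_in"):
        users_with_newsletter.append(user_id)

print(len(users_with_newsletter))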
Also, code design can inadvertently produce a table that holds data
for many different, unrelated models (such as categories, address,
name, and other profile information) that is also self-referential. For
example, the dataset in Table 1-1 is self-referential, wherein each
row has a parent ID representing the type or category of the row.
The value of the parent ID refers to the ID column of the same table.
In Table 1-1, all information around a “User Profile” is stored in the
same table, including labels for profile values, resulting in some val‐
ues representing labels, whereas others represent final values for
those labels. The data in Table 1-1 shows that “Mexico” is a “Coun‐
try,” part of the “User Profile” because the parent ID of “Mexico” is
11, the ID for “Country,” and so on. I’ve seen this kind of example in

the real world, and this format can be difficult to query. I believe this
relationship was mostly the result of poor design. My guess is that, at
the time, the idea was to keep all “profile-like” things in one table
and, as a result, relationships between different parts of the profile
also needed to be stored in the same place.
Table 1-1. Self-referential data example (source: Jessica Roper and Brian
Johnson)
ID    Parent ID    Value
16    11           Mexico
11    9            Country
9     NULL         User Profile
Data quality is important for a lot of reasons, chiefly that it's difficult
to draw valid conclusions from partial or inaccurate data. With a
dataset that is too small, skewed, inaccurate, or incomplete, it’s easy
to draw invalid conclusions. Organizations that make data quality a
priority are said to be data driven; to be a data-driven company
means priorities, features, products used, staffing, and areas of focus
are all determined by data rather than intuition or personal experi‐
ence. The company’s success is also measured by data. Other things

that might be measured include ad impression inventory, user
engagement with different products and features, user-base size and
predictions, revenue predictions, and most successful marketing
campaigns. Improving data priority and quality will likely require
some work to make the data more usable and reportable, and will
almost certainly require working with others within the organization.

Using the Case Study as a Guide
In this report, I will follow a case study from a large and critical data
project at Spiceworks, where I’ve worked for the past seven years as
part of the data team, validating, processing and creating reports.
Spiceworks is a software company that aims to be “everything IT for
everyone IT,” bringing together vendors and IT pros in one place.
Spiceworks offers many products including an online community
for IT pros to do research and collaborate with colleagues and ven‐
dors, a help desk with a user portal, network monitoring tools, net‐
work inventory tools, user management, and much more.
Throughout much of the case study project, I worked with other
teams at Spiceworks to understand and improve our datasets. We
have many teams and applications that either produce or consume
data, from the network-monitoring tool and online community that

create data, to the business analysts and managers who consume
data to create internal reports and prove return on investment to
customers. My team helps to analyze and process the data to provide
value and enable further utilization by other teams and products via
standardizing, filtering, and classifying the data. (Later in this
report, I will talk about how this collaboration with other teams is a
critical component to achieving confidence in the accuracy and
usage of data.)
This case study demonstrates Spiceworks’ process for checking each
part of the system for internal and external consistency. Throughout
the discussion of the usage data case study, I’ll provide some quick
tips to keep in mind when testing data, and then I’ll walk through
strategies and test cases to verify raw data sources (such as parsing
logs) and work with transformations (such as appending and sum‐
marizing data). I will also use the case study to talk about vetting
data for trustworthiness and explain how to use data monitors to
identify anomalies and system issues for the future. Finally, I will
discuss automation and how you can automate different tests at
different levels and in different ways. This report should serve as a
guide for how to think about data verification and analysis and some
of the tools that you can use to determine whether data is reliable
and accurate, and to increase the usage of data.


An Overview of the Usage Data Project
The case study, which I’ll refer to as the usage data project, or UDP,
began with a high-level goal: to determine usage across all of Spice‐
works’ products and to identify page views and trends by our users.
The need for this new processing and data collection came after a
long road of hodge-podge reporting wherein individual teams and
products were all measured in different ways. Each team and depart‐
ment collected and assessed data in its own way—how data was
measured in each team could be unique. Metrics became increas‐
ingly important for us to measure success and determine which fea‐
tures and products brought the most value to the company and,
therefore, should have more resources devoted to them.
The impetus for this project was partially due to company growth—
Spiceworks had reached a size at which not everyone knew exactly
what was being worked on and how the data from each place corre‐
lated to their own. Another determining factor was inventory—to
improve and increase our inventory, we needed to accurately deter‐
mine feature priority and value. We also needed to utilize and
understand our users and audience more effectively to know what to
show, to whom, and when (such as when to display ads or send emails).
As access to this data moved up to the executive level, it became even
more necessary to be able to easily compare products and understand
the data as a whole to answer questions like: "How many total
active users do we have across all of our products?" and "How many
users are in each product?" without needing to understand how
each product's data worked. We also needed to be able to do analysis
on cross-product adoption and usage.
The product-focused reporting and methods of measuring perfor‐
mance that were already in place made comparison and analysis of

products impossible. The different data pieces did not share the
same mappings, and some were missing critical statistics such as
which specific user was active on a feature. We thus needed to find a
new source for data (discussed in a moment).



When our new metrics proved to be stable, individual teams began
to focus more on the quality of their data. After all, the product bugs
and features that should be focused on are all determined by data
they collect to record usage and performance. After our experience
with the UDP and wider shared data access, teams have learned to
ensure that their data is being collected correctly during beta testing
of the product launch instead of long after. This guarantees them
easy access to reports dynamically created from the data collected.
After we made the switch to this new way of collecting and manag‐
ing data from the start—which was automatic and easy—more peo‐
ple in the organization were motivated to focus on data quality,
consistency, and completeness. These efforts moved us to being a
more truly data-driven company and, ultimately, a stronger com‐
pany because of it.

Getting Started with Data
Where to begin? After we determined the goals of the project, we
were ready to get started. As I previously remarked, the first task
was to find new data. After some research, we identified that much of
the data needed was available in logs from Spiceworks' advertising
service (see Figure 1-1), which is used to identify the target audiences
a user qualifies for and therefore which set of ads should be
displayed to them. On each page of our applications, the advertising
service is loaded, usually even when no ads are displayed. Each new
page and even context changes, such as switching to a new tab, cre‐
ate a log entry. We parsed these logs into tables to analyze usage
across all products; then, we identified places where tracking was
missing or broken to show what parts of the advertising-service data
source could be trusted.
As Figure 1-1 demonstrates, each log entry offered a wealth of data
from the web request that we scraped for further analysis, including
the uniform resource locator (URL) of the page, the user who
viewed it, the referrer of the page, the Internet Protocol (IP) address,
and, of course, a time stamp to indicate when the page was viewed.
We parsed these logs into structured data tables, appended more
information (such as geography, and other user profile informa‐
tion), and created aggregate data that could provide insights into
product usage and cohort analysis.



Figure 1-1. Ad service log example (source: Jessica Roper and Brian
Johnson)
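
As a rough sketch of what this parsing step looks like, the example below pulls the fields mentioned above out of a hypothetical combined-log-style line; it is not the actual Spiceworks parser or log format.

# A minimal sketch, not the actual Spiceworks parser: it assumes a
# hypothetical combined-log-style line and pulls out the fields the
# report describes (IP, timestamp, URL, referrer, user ID).
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"GET (?P<url>\S+) HTTP/[\d.]+" \d+ \d+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)" user_id=(?P<user_id>\d*)'
)

def parse_line(line):
    """Return a dict of fields for one log entry, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('203.0.113.7 - - [12/Dec/2016:10:15:32 +0000] '
          '"GET /community/topics/123 HTTP/1.1" 200 5120 '
          '"https://community.spiceworks.com/" "Mozilla/5.0" user_id=42')
print(parse_line(sample))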

Managing Layers of Data

There are three layers of data useful to keep in mind, each used dif‐
ferently and with different expectations (Figure 1-2). The first layer
is raw, unprocessed data, often produced by an application or exter‐
nal process; for example, some raw data from the usage data study
comes from products such as Spiceworks’ cloud helpdesk, where
users can manage IT tickets and requests, and our community,
which is where users can interact online socially through discus‐
sions, product research, and so on. This data is in a format that
makes sense for how the application itself works. Most often, it is
not easily consumed, nor does it lend itself well to creating reports.
For example, in the community, due to the frameworks used, we
break apart different components and ideas of users and relation‐
ships so that email, subscriptions, demographics and interests, and
so forth are all separated into many different components, but for
analysis and reporting it’s better to have these different pieces of
information all connected. Because this data is in a raw format, it is
more likely to be unstructured and/or somewhat random, and
sometimes even incomplete.

Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)



The next layer of data is processed and structured following some
format, usually created from the raw dataset. At this layer, compres‐
sion can be used if needed; either way, the final format will be a
result of general processing, transformation, and classification. To

use and analyze even this structured and processed layer of data still
usually requires deep understanding and knowledge and can be a bit
more difficult to report on accurately. Deeper understanding is
required to work with this dataset because it still includes all of the
raw data, complete with outliers and invalid data but in a formatted
and consistent representation with classifications and so on.
The final layer is reportable data that excludes outliers, incomplete
data, and unqualified data; it includes only the final classifications
without the raw source for the classification included, allowing for
segmentation and further analysis at the business and product levels
without confusion. This layer is also usually built from the previous
layer, processed and structured data. If needed, other products and
processes using this data can further format and standardize it for
the individual needs as well as apply further filtering.
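
The sketch below illustrates the three layers with made-up fields; it shows the shape of the flow rather than the actual UDP pipeline.

# A minimal sketch of the three data layers using hypothetical fields;
# it shows the shape of the flow, not the actual UDP pipeline.
import pandas as pd

# Layer 1: raw, unprocessed entries straight from application logs.
raw = pd.DataFrame([
    {"ts": "2016-12-12 10:15:32", "url": "/community/topics/123", "user_id": "42"},
    {"ts": "2016-12-12 10:16:01", "url": "/help-desk/tickets/9", "user_id": ""},
    {"ts": "not-a-timestamp", "url": "/community/topics/123", "user_id": "42"},
])

# Layer 2: processed and structured, typed and classified, but still
# containing every row, including the invalid and outlier ones.
processed = raw.copy()
processed["ts"] = pd.to_datetime(processed["ts"], errors="coerce")
processed["product"] = processed["url"].str.split("/").str[1]
processed["logged_in"] = processed["user_id"] != ""

# Layer 3: reportable, with invalid rows excluded and the data aggregated
# so it can be analyzed without knowing the raw format.
valid = processed.dropna(subset=["ts"]).copy()
valid["date"] = valid["ts"].dt.date
reportable = (
    valid.groupby(["date", "product"]).size().rename("page_views").reset_index()
)
print(reportable)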

Performing Additional Transformation and
Formatting
The most frequent reasons additional transformation and format‐
ting are needed are when it is necessary to improve performance for
the analysis or report being created, to work with analysis tools
(which can be quite specific as to how data must be formatted to
work well), and to blend data sources together.
An example of a use case in which we added more filtering was to
analyze changes in how different products were used and determine
what changes had positive long-term effects. This analysis required
further filtering to create cohort groups and ensure that the users
being observed were in the ideal audiences for observation. Remov‐
ing users unlikely to engage in a product from analysis helped us to
determine what features changed an engaged user’s behavior.
In addition, further transformations were required. For example, we

used a third-party business intelligence tool to feed in the data to
analyze and filter final data results for project managers. One trans‐
formation we had to make was to create a summary table that broke
out the categorization and summary data needed into columns
instead of rows.
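
A sketch of that kind of reshaping, using pandas and hypothetical column names rather than the tooling we actually used, might look like this:

# A minimal sketch of turning per-category rows into per-category columns,
# the shape many BI tools prefer; column names are hypothetical.
import pandas as pd

summary_rows = pd.DataFrame([
    {"date": "2016-12-01", "category": "community", "page_views": 1200},
    {"date": "2016-12-01", "category": "help_desk", "page_views": 450},
    {"date": "2016-12-02", "category": "community", "page_views": 1310},
])

summary_columns = summary_rows.pivot_table(
    index="date", columns="category", values="page_views", fill_value=0
).reset_index()
print(summary_columns)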
For a long time, a lot of the processed and compressed data at Spice‐
works was developed and formatted in a way that was highly related
to the reporting processes that would be consuming the data. This
usually would be the final reporting data, but many of the reports
created were fairly standard, so we could create a generic way for
consumption. Then, each report applied filters, and further aggrega‐
tions on the fly. Over time, as data became more widely used and
dynamically analyzed as well as combined with different data sour‐
ces, these generic tables proved to be difficult to use for digging
deeper into the data and using it more broadly.
Frequently, the format could not be used at all, forcing analysts to go
back to the raw, unprocessed data, which required a higher level of
knowledge about the data to use correctly. If the wrong
assumptions were made about the data or if the wrong pieces of data
were used (perhaps some that was no longer actively updated),
incorrect conclusions might have been drawn. For example, when
digging into the structured data parsed from the logs, some of our

financial analysts incorrectly assumed that the presence of a user ID
(a generic, anonymous user identifier) indicated the user was
logged in. However, in some cases we identified the user through
other means and included flags to indicate the source of the ID.
Because the team did not have a full understanding of these flags or
the true meaning of the field they were using, they got wildly differ‐
ent results than other reports tracking only logged-in users, which
caused a lot of confusion.
To be able to create new reports from the raw, unprocessed data, we
blended additional sources and analyzed the data as a whole. One
problem arose from different data sources with different representa‐
tions of the same entities. Of course, this is not surprising, because
each product team needed to have its own idea of users, and usually
some sort of profile for those users. Blending the data required cre‐
ating mappings and relationships among the different datasets,
which of course required a deep understanding of those relation‐
ships and datasets. Over time, as data consumption and usage grew,
we updated, refactored, and reassessed how data is processed and
aggregated. Our protocol has evolved over time to fit the needs for
our data consumption.



Starting with Smaller Datasets
A few things to keep in mind when you’re validating data include
becoming deeply familiar with the data, using small datasets, and
testing components in isolation. Beginning with smaller datasets

when necessary allows for faster iterations of testing before working
on the full dataset. The sample data is a great place to begin digging
into what the raw data really looks like to better understand how
it needs to be processed and to identify patterns that are considered
valid.
When you’re creating smaller datasets to work with, it is important
to try to be as random as possible but still ensure that the sample is
large enough to be representative of the whole. I usually aim for
about 10 percent, but this will vary between datasets. Keep in mind
that it’s important to include data over time, from varying geograph‐
ical locations, and include data that will be used for filtering such as
demographics. This understanding will define the parameters
around the data needed to create tests.
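
As a sketch of how such a sample might be drawn (the file and column names are hypothetical), the example below takes roughly 10 percent from each month and country group so the subset keeps the same spread over time and geography as the full dataset:

# A minimal sketch of drawing a ~10% sample that still covers each
# month and region, rather than sampling the table blindly.
# The file and column names are hypothetical.
import pandas as pd

page_views = pd.read_csv("page_views.csv", parse_dates=["viewed_at"])
page_views["month"] = page_views["viewed_at"].dt.to_period("M")

# Sample within each month/country group so the subset keeps the
# same spread over time and geography as the full dataset.
sample = (
    page_views.groupby(["month", "country"], group_keys=False)
    .apply(lambda group: group.sample(frac=0.10, random_state=42))
)
print(len(sample), "of", len(page_views), "rows sampled")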
For example, one of Spiceworks' products identifies computer
manufacturer data that is collected in aggregate anonymously and
then categorized and grouped for further analysis. This information
is originally sourced from devices such as my laptop, which is a
MacBook Pro (Retina, 15-inch, mid-2014) (Figure 1-3). Categoriz‐
ing and grouping the data into a set for all MacBook Pros requires
spending time understanding what kind of titles are possible for
Apple and MacBook in the dataset by searching through the data
for related products. To really understand the data, however, it is
important to also understand titles in their raw format to gain some
insight into how they are aggregated and changed before being
pushed into the dataset that is being categorized and grouped.
Therefore, testing a data scrubber requires familiarity with the data‐
set and the source, if possible, so that you know which patterns and
edge case conditions to check for and how you should format
the data.




Figure 1-3. Example of laptop manufacturing information

Determining Acceptable Error Rates
It’s important to understand acceptable errors in data. This will vary
between datasets, but, overall, you want to understand what an
acceptable industry standard is and understand the kinds of deci‐
sions that are being made with the data to determine the acceptable
error rate. The rule of thumb I use is that edge case issues represent‐
ing less than one percent of the dataset are not worth a lot of time
because they will not affect trends or final analysis. However, you
should still investigate all issues at some level to ensure that the set
affected is indeed small or at least outside the system (e.g., caused by
someone removing code that tracks usage because that person
believed it “does not do anything”).
In some cases, this error rate is not obtainable or not exact; for
example, some information we appended assigned sentiment (posi‐
tive, negative, neutral) to posts viewed by users in the online forums
that are part of Spiceworks’ community.
To determine our acceptable error rate, we researched sentiment
analysis as a whole in the industry and found that the average accu‐
racy rate is between 65 and 85 percent. We decided on a goal of 25
percent error rate for posts with incorrect sentiment assigned
because it kept us in that top half of accuracy levels achieved in

the industry.



When errors are found, understanding the sample size affected will
also help you to determine severity and priority of the errors. I gen‐
erally try to ensure that the amount of data “ignored” in each step
makes up less of the dataset than the allowable error so that the
combined error will still be within the acceptable rate. For example,
if we allow an error rate of one-tenth of a percent in each of 10
products, we can assume that the total error rate is still around or
less than 1 percent, which is the overall acceptable error rate.
After a problem is identified, the next goal is to find examples of
failures and identify patterns. Some patterns to look for are failures
for the same day of the week, time of day, or from only a small set of
applications or users. For example, we once found a pattern
in which a web page’s load time increased significantly every Mon‐
day at the same time during the evening. After further digging, this
information led us to find that a database backup was locking large
tables and causing slow page loads. To account for this, we added
extra servers and dedicated one of them to backups so that
performance would be retained even while backups ran. Any data
available that can be grouped and counted to check for patterns
in the problematic data can be helpful. This can assist in identifying
the source of issues and better estimate the impact the errors could
have.
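
For instance, a quick way to look for such patterns (again with hypothetical file and column names, not the exact check we ran) is to group the problematic rows by day of week and hour and look for spikes:

# A minimal sketch of counting problem rows by day-of-week and hour
# to surface patterns like the Monday-evening backup slowdown.
# The file and column names are hypothetical.
import pandas as pd

errors = pd.read_csv("slow_requests.csv", parse_dates=["occurred_at"])

pattern = (
    errors.assign(
        day_of_week=errors["occurred_at"].dt.day_name(),
        hour=errors["occurred_at"].dt.hour,
    )
    .groupby(["day_of_week", "hour"])
    .size()
    .sort_values(ascending=False)
)
print(pattern.head(10))  # spikes point at when (and often why) failures cluster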
Examples are key to providing developers, or whoever is doing data

processing, with insight into the cause of the problem and providing
a solution or way to account for the issue. A bug that cannot be
reproduced is very difficult to fix or understand. Determining pat‐
terns will also help you to identify how much data might be affected
and how much time should be spent investigating the problem.

Creating Work Groups
Achieving data integrity often requires working with other groups
or development teams to understand and improve accuracy. In
many cases, these other teams must do work to improve the systems.
Data integrity and comparability becomes more important the
higher up in an organization that data is used. Generally, lots of
people or groups can benefit from good data, but they are not usually
the ones who must create and maintain the data to keep it good [4].
Therefore, the higher the level of support and usage of data (such as
from managers and executives), the more accurate the system will
be and the more likely the system will evolve to improve accuracy. It
will require time and effort to coordinate between those who con‐
sume data and those who produce it, and some executive direction
will be necessary to ensure that coordination. When managers or
executives utilize data, collection and accuracy will be easier to make
a priority for other teams helping to create and maintain that data.

This does not mean that collection and accuracy aren’t important
before higher-level adoption of data, but it can be a much longer
and more difficult process to coordinate between teams and maintain
data.
One effective way to influence this coordination among team mem‐
bers is consistently showing metrics in a view that’s relevant to the
audience. For example, to show the value of data to a developer, you
can compare usage of products before and after a new feature is
added to show how much of an effect that feature has had on usage.
Another way to use data is to connect unrelated products or data
sources together to be used in a new way.
As an example, a few years ago at Spiceworks, each product—and
even different features within some products—had individual defi‐
nitions for categories. After weeks of work to consolidate the cate‐
gories and create a new way to manage and maintain them that was
more versatile, it took additional effort and coordination to educate
others on the team about the new system, and it took work with
individual teams to help enable and encourage them to apply the
new system. The key to getting others to adopt the new system was
showing value to them. In this case, my goal was to show value by
making it easier to connect different products such as how-to’s and
groups for our online forums.
There were only a few adopters in the beginning, but each new
adopter helped to push others to use the same definitions and pro‐
cesses for categorization—very slowly, but always in the same con‐
sistent direction. As we blended more data together, a unified
categorization grew in priority, making it more heavily adopted and
used. Now, the new system is widely used and delivers the potential
we saw when building it initially several years ago. It took time for
team members to see the value in the new system and to ultimately

adopt it, but as soon as the tipping point was crossed, the work
already put in made final adoption swift and easy in comparison.



Collaboration helped achieve confidence in the accuracy because
each different application was fully adopting the new categorization,
which then vetted individual product edge cases against the design
and category set defined. In a few cases, the categorization system
needed to be further refined to address those edge cases, such as to
account for some software and hardware that needed to belong to
more than one category.

Reassessing the Value of Data Over Time
The focus and value placed on data in a company evolves over time.
In the beginning, data collection might be a lower priority than
acquiring new clients and getting products out the door. In my
experience, data consumption begins with the product managers
and developers working on products who want to understand if and
how their features and products are being used or to help in the
debugging process. It can also include monitoring system perfor‐
mance and grow quickly when specific metrics are set as goals for
performance of products.
After a product manager adopts data to track success and failure,
the goal is to make that data reportable and sharable so that others
can also view the data as a critical method of measuring success. As
more product managers and teams adopt data metrics, those metrics

can be shared and standardized. Executive-level adoption of data
metrics is much easier with the data in a uniform and reportable
format that can measure company goals.
If no parties are already interested in data and the value it can bring,
this is a good opportunity to begin using data to track the success
of products that you are invested in and share the results with man‐
agers, teammates, and so on. If you can show value and success in
the products or prove opportunity that is wasted, the data is more
likely to be seen as valuable and as a metric for success.
Acting only as an individual, you can show others the value you can
get out of data, and thereby push them to invest and use it for their
own needs. The key is to show value and, when possible, make it
easy for others to maintain and utilize the data. Sometimes, it might
require building a small tool or defining relationships between data
structures that make the data easy to use and maintain.



An example of this is when we were creating the unified mechanism
and process to categorize everything, including products, groups,
topics, vendors, and how-to's, as well as the category hierarchy (recall
the example introduced in the preceding section). The process
required working with each team that created and consumed the
data around categories to determine a core set that could be used
and shared. I wanted to identify which pieces had unique require‐

ments for which we could account. In some products, a category
hierarchy was required to be able to identify relationships such as
antivirus software, which was grouped under both the security and
software categories. Other products such as online forums for users
to have discussions did not have this hierarchy and therefore did not
use that part of the feature. Creating default hierarchies and rela‐
tionship inheritance with easy user interfaces for management made
creating and maintaining this data easier for others.
When it became vital to connect different products, improve search
engine optimization, and provide more complete visibility between
products, the work done years earlier to maintain and define cate‐
gories easily was recognized as incredibly vital.

Checking the System for Internal Consistency
Utilizing the usage data case study, I will talk through the process we
used at Spiceworks for checking each part of the system for internal
and external consistency, including examples of why, when, and how
this was done (see Figure 1-4). We began with validation of the raw
data to ensure that the logging mechanism and applications created
the correct data. Next, we validated parsing the data and appending
information from other sources into the initial format. After that, we
verified that transformations and aggregation reports were accurate.
Finally, we vetted the data against other external and internal data
sources and used trends to better understand the data and add mon‐
itors to the systems to maintain accuracy. Let’s look at each of these
steps in more detail.
First, we tested the raw data itself. This step is nearly identical to
most other software testing, but in this instance the focus is on
ensuring that different possible scenarios and products all produce
the expected log entries. This step is where we became familiar with

the source data, how it was formatted, and what each field meant in
a log entry.



Figure 1-4. UDP workflow (source: Jessica Roper and Brian Johnson)
We included tests such as creating page views from several IP
addresses, different locations, and from every product and critical
URL path. We also used grep (a Unix command used to search files
for the occurrence of a string of characters that matches a specified
pattern) to search the logs for more URLs, locations, and IP
addresses to “smoke-test” for data completeness and understand
how different fields were recorded. During this investigation, we saw
that several fields had values that looked like IP addresses. Some of
them actually had multiple listings, so we needed to investigate fur‐
ther to learn what each of those fields meant. In this case, it required
further research into how event headers are set and what the non‐
simple results meant.




When you cannot test or validate the raw data formally to under‐
stand the flow of the system, try to understand as much about the
raw data as possible—including the meaning behind different com‐
ponents and how everything connects together. One way to do this
is to observe the data relationships and determine outlier cases by
understanding boundaries. For example, if we had been unable to
observe how the logs were populated, my search would have
included looking at counts for different fields.
Even with source data testing, we manually dug through the logs to
investigate what each field might mean and defined a normal
request. We also looked at the counts of IP addresses to see which
might have had overly high counts or any other anomalies. We
looked for cases in which fields were null to understand why and
when this occurred and understand how different fields seemed to
correlate. This raw data testing was a big part of becoming familiar
with the data and defining parameters and expectations.
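
A quick exploratory pass of that sort can be done in a few lines; the sketch below assumes a hypothetical export of the parsed logs and column names:

# A minimal sketch of the kind of exploratory counting described above:
# top IP addresses by volume and per-field null counts.
# The parsed-log file and its columns are hypothetical.
import pandas as pd

entries = pd.read_csv("parsed_ad_service_logs.csv")

# IPs with unusually high counts often point at bots, proxies, or
# internal/development traffic worth excluding later.
print(entries["ip_address"].value_counts().head(20))

# Null counts per field show which fields are ever missing and how often,
# which drives the later decisions about which nulls are acceptable.
print(entries.isna().sum().sort_values(ascending=False))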
When digging in, you should try to identify and understand the
relationship between components; in other words, determine which
relationships are one-to-one versus one-to-many or many-to-many
between different tables and models. One relationship that was
important to keep in mind during the UDP was the many-to-many
relationship between application installations and users. Users can
have multiple installations and multiple users use each installation,
so overlap must be accounted for in testing. Edge case parameters of
the datasets are also important to understand, such as determining
the valid minimum and maximum values for different fields.
Finally, in this step you want to understand what is common for the
dataset, essentially identifying the “normal” cases. When investigat‐
ing the page views data for products, the normal case had to be

defined for each product separately. For example, in most cases
users were expected to be logged in, but in a couple of our products,
being a visitor was actually more common; so, the normal use case
for those products was quite different.

Validating the Initial Parsing Process
After we felt confident that the raw data was reliable, it was time to
validate the initial parsing process that converted unstructured logs
into data tables. We began by validating that all of the data was
present. This included ensuring all of the log files we expected did
in fact exist (there should be several for each day) as well as check‐
ing that the numbers of log entries match the total rows in the initial
table.
For this test, two senior developers each created a complex grep regular-expression
command to search the logs. Both searches were then compared and
reviewed by other senior staff members, who worked together to define
the best clause. One key part of the exercise was determining which
rows were included in one expression's results but not the other's, and
then using those rows to determine the best patterns to employ.
We also dug into any rows the grep found that were not in the table,
and vice versa. The key goal in this first set of tests was to ensure
that the source data and final processing match before filtering out
invalid and outlier data.
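
A small sketch of that reconciliation, assuming a hypothetical log file, table export, and key field, compares the entries grep extracts from the logs against the rows that made it into the parsed table:

# A minimal sketch of reconciling log entries against parsed table rows.
# File names and the key field (request_id) are hypothetical.
import csv
import subprocess

# Entries grep finds in the raw logs, keyed by a request identifier.
grep_output = subprocess.run(
    ["grep", "-oE", "request_id=[0-9]+", "ad_service.log"],
    capture_output=True, text=True, check=True,
).stdout
log_ids = {line.split("=")[1] for line in grep_output.splitlines()}

# Rows that made it into the parsed table (exported as CSV here).
with open("parsed_page_views.csv", newline="") as f:
    table_ids = {row["request_id"] for row in csv.DictReader(f)}

print("in logs but missing from table:", len(log_ids - table_ids))
print("in table but missing from logs:", len(table_ids - log_ids))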
For this project, tests included comparing totals such as unique user

identification counts, validating the null data found was acceptable,
checking for duplicates, and verifying proper formats of fields such
as IP addresses and the unique application IDs that are assigned to
each product installation. This required an understanding of which
data pieces are valid when null, how much data could be tolerated
to have null fields, and what is considered valid formatting for each
field.
In the advertising service logs, some fields should never be null,
such as IP address, timestamp, and URL. This set of tests identified
cases in which IP addresses were excluded completely because the IP
set was parsed incorrectly. This prompted us to rework the IP parsing logic to get
better-quality data. Some fields, such as user ID, were valid in some
cases when null; only some applications require users to be logged
in, so we were able to use that information as a guide. We validated
that user ID was present in all products and URLs that are accessible
only by logged-in users. We also ensured that visitor traffic came
only from applications for which login was not required.
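
A sketch of those null checks, written as assertions over a hypothetical parsed-log table, might look like the following; the column and product names are stand-ins, not our actual schema:

# A minimal sketch of the null checks described above; the DataFrame
# and its columns are hypothetical stand-ins for the parsed-log table.
import pandas as pd

page_views = pd.read_csv("parsed_page_views.csv")

# Fields that should never be null in the ad-service logs.
for field in ["ip_address", "timestamp", "url"]:
    null_count = page_views[field].isna().sum()
    assert null_count == 0, f"{field} has {null_count} null rows"

# user_id may be null only for products that allow visitor (logged-out) traffic.
login_required = {"help_desk", "user_management"}  # hypothetical product names
violations = page_views[
    page_views["product"].isin(login_required) & page_views["user_id"].isna()
]
assert violations.empty, f"{len(violations)} logged-in-only rows missing user_id"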
You should carry out each test in isolation to ensure that your test
results are not corrupted by other values being checked. This helps
you to find smaller edge cases, such as small sections of null data.
For our purposes at Spiceworks, by individually testing for each
classification of data and each column of data, we found small edge
cases for which most of the data seemed normal but had only one
field that was invalid in a small, but significant way. All relationships
existed for the field, but when examined at the individual level, we
saw the field was invalid in some cases, which led us to find some
of the issues described later, such as invalid IP addresses and a sig‐
nificant but small number of development installations that needed
to be excluded because they skewed the data with duplicate and
invalid results.

Checking the Validity of Each Field
After the null checks were complete, it was time to check the validity
of each field to determine if the data is structured correctly and all
the data is valid. After working with the team that built the logger as
well as some product teams, we defined proper formats for fields
and verified that the unique user IDs were correct. Another field
we checked was the IP address, where we looked for private or
invalid (unassigned) IPs. After we identified some IP addresses as
invalid, we dug into the raw log results and determined some of the
invalid addresses were a result of how the request was forwarded
and recorded in the headers.
We improved parsing to grab only the most valid IP and added a
flag to the data to indicate data from those IPs that could be filtered
out in reporting. We also checked for URLs and user IDs that might
not be valid. For example, our user IDs are integers, so a quick valid‐
ity test included getting the minimum and maximum value of user
IDs found in the logs after parsing to ensure that they fell within the
user-base size and that all user IDs were numerical. One test identi‐
fied invalid unique application IDs.
Upon further digging, we determined that almost all of the invalid
application IDs were prerelease versions, usually from the Spice‐

works IP address. This led us to decide to exclude any data from our
IP and development environments because that data was not valua‐
ble for our end goals. By identifying the invalid data and finding the
pattern, we were able to determine the cause and ultimately remove
the unwanted data so that we could gain better insights into the
behavior of actual users.
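
A sketch of these validity checks, with hypothetical names and thresholds, combines the integer user ID range test with a filter that drops internal and prerelease traffic:

# A minimal sketch of the field-validity checks described above;
# table, column, and threshold values are hypothetical.
import pandas as pd

page_views = pd.read_csv("parsed_page_views.csv")

# User IDs should all be numeric and fall within the known user-base size.
user_ids = pd.to_numeric(page_views["user_id"], errors="coerce").dropna()
assert user_ids.min() >= 1, "user_id below the valid range"
assert user_ids.max() <= 5_000_000, "user_id above the current user-base size"

# Exclude traffic from internal IPs and prerelease/development installs,
# which skewed results with duplicate and invalid rows.
internal_ips = {"198.51.100.10"}  # hypothetical office IP
valid_app_id = page_views["app_id"].astype(str).str.fullmatch(r"[0-9a-f]{32}")  # hypothetical format
filtered = page_views[~page_views["ip_address"].isin(internal_ips) & valid_app_id]
print(f"kept {len(filtered)} of {len(page_views)} rows")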
Next, we checked for duplicates; usually there will be specific sets of
fields that should be unique even if other values in the row are not.
For example, every timestamp should be unique because parallel
logging mechanisms were not used. After the raw logs were parsed
into table format, more information was appended and the tables
were summarized for easier analysis. One of the goals for these more
processed tables, which included any categorizations and final filter‐
ing of invalid data such as that from development environments,
was to indicate total page views of a product per day by user. We
verified that rows were unique across user ID, page categorization,
and date.
Usually at this point, we had the data in some sort of database,
which allowed us to do this check by simply writing a query that
selected and counted the columns that should be unique and ensuring
that the resulting counts equal 1. As Figure 1-5 illustrates, you also
can run this check in Excel by using the Remove Duplicates action,
which will return the number of duplicate rows that are removed.

The goal is for zero rows to be removed to show that the data is
unique.

Figure 1-5. Using Excel to test for duplicates (source: Jessica Roper)
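
A sketch of the query-based version of this check, against a hypothetical summarized page-view table, simply groups by the fields that should be unique and asserts that no group appears more than once:

# A minimal sketch of the duplicate check described above, run against
# a hypothetical SQLite table of summarized page views.
import sqlite3

conn = sqlite3.connect("usage.db")
duplicates = conn.execute(
    """
    SELECT user_id, page_category, view_date, COUNT(*) AS n
    FROM daily_page_views
    GROUP BY user_id, page_category, view_date
    HAVING COUNT(*) > 1
    """
).fetchall()

# The goal is zero rows: every (user, category, date) combination is unique.
assert not duplicates, f"{len(duplicates)} duplicate combinations found"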

Verifying Accuracy of Transformations and
Aggregation Reports
Because all of the data needed was not already in the logs, we added
more to the parsed logs and then translated and aggregated the
results so that everything we wanted to filter and report on was
included. Data transformations like appending information, aggregation,
and reformatting require more validation. Anything converted
or categorized needs to be individually checked.
