

Strata+Hadoop World



A Guide to Improving Data Integrity and
Adoption
A Case Study in Verifying Usage Data
Jessica Roper


A Guide to Improving Data Integrity and Adoption
by Jessica Roper
Copyright © 2017 O’Reilly Media Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles. For more information, contact our
corporate/institutional sales department: 800-998-9938.
Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
December 2016: First Edition
Revision History for the First Edition
2016-12-12: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to Improving Data
Integrity and Adoption, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.


While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions contained in
this work is at your own risk. If any code samples or other technology this work contains or describes
is subject to open source licenses or the intellectual property rights of others, it is your responsibility
to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-97052-2
[LSI]


A Guide to Improving Data Integrity and
Adoption
In most companies, quality data is crucial to measuring success and planning for business goals.
Unlike sample datasets in classes and examples, real data is messy and requires processing and effort
to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can
trust final conclusions? What steps can we take to not only ensure that all of the data is transformed
correctly, but also to verify that the source data itself can be trusted as accurate? How can we
motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?


Validating Data Integrity as an Integral Part of Business
Data can be messy for many reasons. Unstructured data such as log files can be complicated to
understand and parse. A lot of data, even when structured, is still not standardized. For
example, parsing text from online forums can be complicated and might need to include logic to
accommodate slang such as “bad ass,” which is a positive phrase made with negative words. The
system creating the data can also make it messy, because different languages have different
design conventions, such as Ruby on Rails, which requires a separate table to represent many-to-many relationships.
Implementation or design can also lead to messy data. For example, the process or code that creates
data and the database storing that data might use incompatible formats. Or, the code might store a set
of values as one column instead of many columns. Some languages parse and store values in a format
that is not compatible with the databases used to store and process them, such as YAML (YAML Ain’t
Markup Language), which is not a valid data type in some databases and is stored instead as a string.
Because this format is intended to work much like a hash with key-and-value pairs, searching it with the
database’s query language can be difficult.
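As a minimal sketch of the problem, assume a hypothetical SQLite table user_profiles with a YAML-serialized preferences column (and the PyYAML package available); because the database only sees a string, filtering on a key inside the YAML hash has to happen in application code after every row is deserialized:

import sqlite3
import yaml  # PyYAML; hypothetical dependency for this sketch

# Hypothetical table: user_profiles(id INTEGER, preferences TEXT), where
# `preferences` holds a YAML hash such as "newsletter: true\ncountry: MX".
conn = sqlite3.connect("app.db")
rows = conn.execute("SELECT id, preferences FROM user_profiles").fetchall()

# Filtering on a key inside the YAML hash cannot be pushed down to SQL,
# so each row is deserialized and checked in application code.
newsletter_subscribers = []
for user_id, raw_yaml in rows:
    prefs = yaml.safe_load(raw_yaml) or {}
    if prefs.get("newsletter") is True:
        newsletter_subscribers.append(user_id)

print(f"{len(newsletter_subscribers)} users opted into the newsletter")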
Also, code design can inadvertently produce a table that holds data for many different, unrelated
models (such as categories, address, name, and other profile information) that is also self-referential.
For example, the dataset in Table 1-1 is self-referential, wherein each row has a parent ID
representing the type or category of the row. The value of the parent ID refers to the ID column of the
same table. In Table 1-1, all information around a “User Profile” is stored in the same table,
including labels for profile values, resulting in some values representing labels, whereas others
represent final values for those labels. The data in Table 1-1 shows that “Mexico” is a “Country,”
part of the “User Profile” because the parent ID of “Mexico” is 11, the ID for “Country,” and so on.
I’ve seen this kind of example in the real world, and this format can be difficult to query. I believe
this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to
keep all “profile-like” things in one table and, as a result, relationships between different parts of the
profile also needed to be stored in the same place.
Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

ID   Parent ID   Value
16   11          Mexico
11   9           Country
9    NULL        User Profile
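To make the shape of this data concrete, here is a minimal sketch, assuming just the three rows above held in memory, of how answering even a simple question such as “what is ‘Mexico’ part of?” means walking the parent chain row by row rather than issuing one flat query:

# Rows from Table 1-1, keyed by ID: (parent_id, value)
rows = {
    16: (11, "Mexico"),
    11: (9, "Country"),
    9: (None, "User Profile"),
}

def parent_chain(row_id):
    """Walk parent IDs up to the root, returning values from leaf to root."""
    chain = []
    while row_id is not None:
        parent_id, value = rows[row_id]
        chain.append(value)
        row_id = parent_id
    return chain

# Prints: Mexico -> Country -> User Profile
print(" -> ".join(parent_chain(16)))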

Data quality is important for a lot of reasons, chiefly that it’s difficult to draw valid conclusions from
partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it’s
easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data
driven; to be a data-driven company means priorities, features, products used, staffing, and areas of
focus are all determined by data rather than intuition or personal experience. The company’s success
is also measured by data. Other things that might be measured include ad impression inventory, user
engagement with different products and features, user-base size and predictions, revenue predictions,
and the most successful marketing campaigns. Raising the priority and quality of data will likely require some
work to make the data more usable and reportable and will almost certainly require working with
others within the organization.

Using the Case Study as a Guide
In this report, I will follow a case study from a large and critical data project at Spiceworks, where
I’ve worked for the past seven years as part of the data team, validating, processing and creating
reports. Spiceworks is a software company that aims to be “everything IT for everyone IT,” bringing
together vendors and IT pros in one place. Spiceworks offers many products including an online
community for IT pros to do research and collaborate with colleagues and vendors, a help desk with
a user portal, network monitoring tools, network inventory tools, user management, and much more.
Throughout much of the case study project, I worked with other teams at Spiceworks to understand
and improve our datasets. We have many teams and applications that either produce or consume data,
from the network-monitoring tool and online community that create data, to the business analysts and
managers who consume data to create internal reports and prove return on investment to customers.

My team helps to analyze and process the data to provide value and enable further utilization by other
teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will
talk about how this collaboration with other teams is a critical component to achieving confidence in
the accuracy and usage of data.)
This case study demonstrates Spiceworks’ process for checking each part of the system for internal
and external consistency. Throughout the discussion of the usage data case study, I’ll provide some
quick tips to keep in mind when testing data, and then I’ll walk through strategies and test cases to
verify raw data sources (such as parsing logs) and work with transformations (such as appending and
summarizing data). I will also use the case study to talk about vetting data for trustworthiness and
explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will
discuss automation and how you can automate different tests at different levels and in different ways.
This report should serve as a guide for how to think about data verification and analysis and some of
the tools that you can use to determine whether data is reliable and accurate, and to increase the usage
of data.

An Overview of the Usage Data Project
The case study, which I’ll refer to as the usage data project, or UDP, began with a high-level goal: to


determine usage across all of Spiceworks’ products and to identify page views and trends by our
users. The need for this new processing and data collection came after a long road of hodge-podge
reporting wherein individual teams and products were all measured in different ways. Each team and
department collected and assessed data in its own way, so the same metric could mean something
different from one team to the next. Metrics became increasingly important for us to measure success and determine which
features and products brought the most value to the company and, therefore, should have more
resources devoted to them.
The impetus for this project was partially due to company growth—Spiceworks had reached a size at
which not everyone knew exactly what was being worked on and how the data from each place
correlated to their own. Another determining factor was inventory—to improve and increase our
inventory, we needed to accurately determine feature priority and value. We also needed to utilize

and understand our users and audience more effectively to know what to show, to whom, and when
(for example, when to display ads or send emails).
When this data was consumed at the executive level, it became even more necessary to be able to
easily compare products and understand the data as a whole, to answer questions like: “How many
total active users do we have across all of our products?” and “How many users are in each
product?” At that level, it shouldn’t have been necessary to understand how each product’s data worked. We also needed to be
able to do analysis on cross-product adoption and usage.
The product-focused reporting and methods of measuring performance that were already in place
made comparison and analysis of products impossible. The different data pieces did not share the
same mappings, and some were missing critical statistics such as which specific user was active on a
feature. We thus needed to find a new source for data (discussed in a moment).
When our new metrics proved to be stable, individual teams began to focus more on the quality of
their data. After all, the product bugs and features that should be focused on are all determined by
data they collect to record usage and performance. After our experience with the UDP and wider
shared data access, teams have learned to ensure that their data is being collected correctly during
beta testing of the product launch instead of long after. This guarantees them easy access to
reports dynamically created from the data collected. After we made the switch to this new way of
collecting and managing data from the start—which was automatic and easy—more people in the
organization were motivated to focus on data quality, consistency, and completeness. These efforts
moved us to being a more truly data-driven company and, ultimately, a stronger company because of
it.

Getting Started with Data
Where to begin? After we determined the goals of the project, we were ready to get started. As I
previously remarked, the first task was to find new data. After some research, we identified that much of
the data needed was available in logs from Spiceworks’ advertising service (see Figure 1-1), which
is used to identify the target audiences a user qualifies for and, therefore, which set of ads should be
displayed to them. On each page of our applications, the advertising service is loaded, usually even
when no ads are displayed. Each new page, and even a context change such as switching to a new tab,
creates a log entry. We parsed these logs into tables to analyze usage across all products; then, we
identified places where tracking was missing or broken to show what parts of the advertising-service
data source could be trusted.
As Figure 1-1 demonstrates, each log entry was parsed into structured results so that everything we wanted to filter and report on was
included. Transformations like appending information, aggregating, and converting data require
more validation. Anything converted or categorized needs to be individually checked.


In the usage data project, we converted each timestamp to a date and had to ensure that the timestamps
were converted correctly. One way we did this was to manually find the first and last log entries for a
day in the logs and compare them to the first and last entries in the parsed data tables. This test
revealed an issue with a time zone difference between the logs and the database system, which shifted
and excluded results for several hours. To account for this, we processed all logs for a given day as
well as the following day, and then filtered the results based on the date after adjusting for the time
zone difference.
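Here is a minimal sketch of that idea, assuming hypothetical log timestamps recorded in UTC and an assumed reporting time zone of America/Chicago; it takes rows covering a target day plus the following day, converts each timestamp, and keeps only rows that fall on the target local date:

from datetime import datetime, date, timezone
from zoneinfo import ZoneInfo

REPORT_TZ = ZoneInfo("America/Chicago")  # assumed reporting time zone

def rows_for_local_date(log_rows, target: date):
    """Keep rows whose UTC timestamp falls on `target` in the reporting zone.

    `log_rows` is an iterable of (utc_timestamp_string, payload) tuples,
    expected to cover both the target day and the following day so that
    entries shifted across midnight by the zone offset are not lost.
    """
    kept = []
    for ts, payload in log_rows:
        utc_dt = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
        local_dt = utc_dt.astimezone(REPORT_TZ)
        if local_dt.date() == target:
            kept.append((local_dt, payload))
    return kept

# Example: an entry logged at 03:30 UTC on July 2 belongs to July 1 locally.
sample = [("2016-07-02T03:30:00", "page_view"), ("2016-07-02T14:00:00", "page_view")]
print(len(rows_for_local_date(sample, date(2016, 7, 1))))  # -> 1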
We also validated all data appended to the original source. One piece of data that we appended was
location information for each page view based on the IP address. To do this, we used a third-party
company that provides an application program interface (API) to correlate IP addresses with location
data, such as country, for further processing and geographical analysis. For each value, we verified
that the source of the appended data matched what was found in the final table. For example, we
ensured that the country and user information appended was correct by comparing the source location
data from the third-party and user data to the final appended results. We did this by joining the source
data to the parsed dataset and comparing values.
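As a minimal sketch of that comparison, assuming hypothetical pandas DataFrames (geo_source for the third-party IP-to-country lookups and parsed for the final table with the appended country column), the join-and-compare step looks roughly like this:

import pandas as pd

# Hypothetical source of appended data: third-party IP -> country lookups.
geo_source = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.2"],
    "country": ["MX", "US"],
})

# Hypothetical final table after the append step.
parsed = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.2"],
    "page": ["/apps", "/community"],
    "country": ["MX", "CA"],  # second row was appended incorrectly
})

# Join the source to the parsed data and compare the appended values.
check = parsed.merge(geo_source, on="ip", how="left", suffixes=("_final", "_source"))
mismatches = check[check["country_final"] != check["country_source"]]
print(mismatches[["ip", "country_final", "country_source"]])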
For the aggregations, we checked that raw row counts from the parsed advertising service log tables
matched the sum of the aggregate values. In this case, we wanted to roll up our data by pages viewed
per user in each product, requiring validation that the total count of rows parsed matched the summary
totals stored in the aggregate table.
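A minimal sketch of that check, assuming a hypothetical parsed page-view DataFrame and the aggregate rolled up from it, might look like the following; in practice the aggregate table would be read from the pipeline output rather than rebuilt inside the test:

import pandas as pd

# Hypothetical parsed log rows: one row per page view.
parsed = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "product": ["community", "community", "helpdesk", "helpdesk", "helpdesk", "inventory"],
})

# The aggregate table rolls the data up to page views per user per product.
aggregate = parsed.groupby(["user_id", "product"]).size().reset_index(name="page_views")

# Validation: the sum of the aggregate must match the raw row count exactly.
assert aggregate["page_views"].sum() == len(parsed), "aggregate totals do not match raw rows"
print("row counts match:", len(parsed))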
Part of the UDP required building aggregated data for the reporting layer, generically customized for
the needs of the final reports. In most cases, consumers of data (individuals, other applications, or

custom tools) will need to transform, filter, and aggregate data for the unique needs of the report or
application. We created this transformation for them in a way that allowed the final product to easily
filter and further aggregate the data (in this case, we used business intelligence software). Those
final transformations also required validation for completeness and accuracy, such as ensuring that
any total summaries equal the sum of their parts, nothing was double counted, and so on.
The goal for this level of testing is to validate that aggregate and appended data has as much integrity
as the initial data set. Here are some questions that you should ask during this process:
If values are split into columns, do the columns add up to the total?
Are any values negative that should not be, such as a calculated “other” count?
Is all the data from the source found in the final results?
Is there any data that should be filtered out but is still present?
Do appended counts match totals of the original source?
As an example of the last point, when dealing with Spiceworks’ advertising service, there are a
handful of sources and services that can make a request for an ad and cause a log entry to be added.
Some different kinds of requests included new organic page views, ads refreshing automatically, and
requests for pages with ad-block software. One test we included checked that the total requests equaled the
sum of the different request types. As we built this report and continued to evolve our understanding
of the data and requirements for the final results, the reportable tables and tests also evolved. The test
process itself helped us to define some of this when outliers or unexpected patterns were found.

Allowing for Tests to Evolve
It is common for tests to evolve as more data is introduced and consumed and therefore better
understood. As specific edge cases and errors are discovered, you might need to add more
automation or processes. One such case I encountered was caused by the fact that every few years
there are 53 weeks in the calendar year. This extra “week” (it is approximately half a week, actually)
results in 5 weeks in December and 14 weeks in the last quarter. When this situation occurred for the
first time after building our process, the reporting for the last quarter of the year, as well as for the
following quarter, was incorrect. When the issue and cause were discovered, special logic for the
process and new test cases were added to account for this unexpected edge case.
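As a minimal sketch, using Python’s ISO-calendar support as a stand-in for however the reporting calendar defines weeks, a check that flags 53-week years so the special-case logic can be applied might look like this:

from datetime import date

def iso_weeks_in_year(year: int) -> int:
    """Return 52 or 53: December 28 always falls in the last ISO week of its year."""
    return date(year, 12, 28).isocalendar()[1]

# Years such as 2015 and 2020 have 53 ISO weeks and need the special-case logic.
for year in range(2014, 2022):
    if iso_weeks_in_year(year) == 53:
        print(f"{year}: 53-week year; the last quarter will contain 14 weeks")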
For scrubbing transformations or clustering of data, your tests should search through all unique
possible options, filtering one piece at a time. Look for under folding, whereby data has not
clustered/grouped everything it should have, and over folding, which is when things are over-grouped
or categorized where they should not be [2]. Part of the aggregations for this project required us to
classify URLs based on the different products of which they were a part.
To test and scrub these, we first broke apart the required URL components to ensure that all
variations are captured. For example, one of the products that required its own category was an “app
center” where users can share and download small applications or plug-ins for our other products; to
test this, we began by searching for all URLs that had “app” and “center” in the URL. We did not
require “app center,” “app%center,” or other combined variations, because we wanted to make no
assumptions about the format of the URL. By searching in this more generic way, we were able to
identify many URLs with formats of “appcenter,” “app-center,” and “app center.”
Next, we looked for URLs that match only part of the string. In this case, we found the URL “/apps”
by looking for URLs that had the word “app” but not “center.” This test identified several URLs that
looked similar to other app center URLs, but after further investigation were found to be part of
another product. This allowed us to add automated tests that ensured those URLs were always
categorized correctly and separately. Categorizing this data required using an acceptable error rate to
decide which URLs should drive the classification logic. In this case, we did not need to get down
and dirty with our “long tail”: usually thousands of pages that each have only a handful of
views. A few page views account for well below one thousandth of a percent of the total and would provide
virtually no value even if scrubbed, and most of the time those URLs are still captured by the other
logic we created.
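A minimal sketch of this search, assuming a hypothetical list of URLs pulled from the parsed log tables, casts a wide net for anything containing both “app” and “center” in any format and then surfaces partial matches for manual review:

import re

# Hypothetical URLs pulled from the parsed log tables.
urls = [
    "/appcenter/plugins", "/app-center/download", "/app%20center/share",
    "/apps/settings", "/community/how-to",
]

# Wide net: "app" followed anywhere by "center", with no assumptions about separators.
app_center = [u for u in urls if re.search(r"app.*center", u, re.IGNORECASE)]

# Partial matches: contain "app" but not the full pattern; candidates for manual
# review, since they may belong to a different product (here, "/apps" did).
partial = [u for u in urls if re.search(r"app", u, re.IGNORECASE) and u not in app_center]

print("app center URLs:", app_center)
print("needs review:", partial)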

Checking for External Consistency: Analyzing and Monitoring Trends


The last two components of data validation, vetting data with trend analysis and monitoring, are the
most useful for determining data reliability and helping to ensure continued validity. This layer is
heavily dependent on the kind of data that is going to be reported on and what data is considered

critical for any analysis. It is part of maintaining and verifying the reportable data layer, especially
when data comes from external or multiple sources.
First is vetting data by comparing what was collected to other related data sources to
comprehensively cover known boundaries. This helps to ensure the data is complete and that other
data sources correlate and agree with the data being tested.
Of course, other data sources will not represent the exact same information, but they can be used to
check things such as whether trends over time match, if total unique values are within the bounds of
expectations, and so on. For example, in Spiceworks’ advertising service log data, there should not
be more active users than the total users registered. Even further, active users should not be higher
than total users that have logged in during the time period. The goal is to verify the data against any
reliable external data possible.
External data could be from a source such as Google Analytics, which is a reliable source for page
views, user counts, and general usage with some data available for free. We used external data
available to us to compare total active users and page views over time for many products. Even
public data such as general market-share comparisons is a good option; compare sales records to
product usage, active application counts, and associated users to total users, and so on.
Checking against external sources is just a different way to think about the data and other data related
to it. It provides boundaries and expectations for what the data should look like and edge-case
conditions that might be encountered. Some things to include are comparing total counts, averages,
and categories or classifications. In some cases, the counts from the external source might be
summaries or estimates, so it’s important to understand those values, as well, to determine if
inconsistencies among datasets indicate an error.
For the UDP, we were fortunate to have many internal sources of data that we used to verify that our
data was within expected bounds and matched trends. One key component is to compare trends. We
compared data over time for unique user activity, total page views, and active installations (and users
related to those installations) and checked our results against available data sources (a hypothetical
example is depicted in Figure 1-6).


Figure 1-6. Hypothetical example of vetting data trends (source: Jessica Roper and Brian Johnson)


We aimed to answer questions such as the following:
Does the number of active users over time correlate to unique users seen in our other stats?
Do the total page views correlate to the usage we see from users in our installation stats?
Can all the URLs we see in our monitoring tools and request logs be found in the new data set?
During this comparison, we found that several pages were not being tracked with the system. This led
us to work with the appropriate development teams to start tracking those pages and to determine what
the impact would be on the final analysis.
The total number of active users for each product was critical for reporting teams and project
managers. During testing, we found some products only had data available to indicate the number of
active installations and the total number of users related to the installation. As users change jobs and
add installations, they can be a user in all of those applications, making the user-to-installation
relationship many-to-many. Some application users also misunderstood the purpose of adding new
users, which is meant to be for the IT pros providing support to the people in their companies.
However, in some cases an IT pro would add not only all other support staff but also all of the end
users they support, even though those end users are neither expected to interact directly with the
application nor considered official users of it, and so are never active on the installation itself. We
wanted to define what assumptions were testable in the new data set from the ad service. For
example, at least one user should be active on every installation; otherwise no page views could be
generated. Also, there should not be more active users than the total number of users associated with an
installation.
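A minimal sketch of those two checks, assuming a hypothetical per-installation summary of total associated users and the active-user counts derived from the new page-view data:

# Hypothetical per-installation summary: (installation_id, total_users, active_users),
# limited to installations that appear in the new page-view data.
installations = [
    (101, 12, 5),
    (102, 3, 0),   # violates: an installation with page views should have >= 1 active user
    (103, 8, 9),   # violates: more active users than users associated with the installation
]

violations = []
for installation_id, total_users, active_users in installations:
    if active_users < 1:
        violations.append((installation_id, "no active users on installation"))
    if active_users > total_users:
        violations.append((installation_id, "more active users than associated users"))

for installation_id, reason in violations:
    print(f"installation {installation_id}: {reason}")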
After we defined all the expectations for the data, we built several tests. One tested that each
installation had fewer active users than the total number of users associated with it. More important, however, we also
tested that total active users, trended over time, was consistent with the trends for total active
installations and total users registered to the application (Figure 1-6). We expected the trend to be
consistent between the two values and follow the same patterns of usage for time of day, day of week,
and so on. The key to this phase is having as much understanding as possible of the data, its

boundaries, how it is represented, and how it is produced so that you know what to expect from the
test results. Trends usually will match generally, but in my experience, it’s rare for them to match
exactly.

Performing Time–Series Analyses
The next step is time–series analysis—understanding how the data behaves over time and what
“makes sense” for the dataset. Time–series analysis provides insights needed to monitor the system
over the long term and to validate data consistency. This sort of analysis also verifies the data’s
accuracy and reliability. Are there large changes from month to month or week to week? Is change
consistent over time?
One way to verify whether a trend makes sense is by looking for expected anomalies such as new
product launch dates causing spikes, holidays causing a dip, known outage times, and expected low-usage periods (e.g., 2 AM). A hypothetical example is provided in Figure 1-7. This can also help
identify other issues such as missing data or even problems in the system itself. After you understand
trends and how they change over time, you might find it helpful to implement alerts that ensure the
data fits within expected bounds. You can do this by checking for thresholds being crossed, or by
verifying that new updates to the dataset grow or decline at a rate that is similar to the average seen
across the previous few datasets.


Figure 1-7. Hypothetical example of page view counts over time vetting (source: Jessica Roper and Brian Johnson)
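As a minimal sketch of such an alert, assuming a hypothetical list of monthly page-view totals, the newest month’s change can be compared to the average month-over-month change across the preceding months and flagged when it deviates beyond a chosen tolerance:

def growth_alert(monthly_totals, tolerance=0.25):
    """Warn if the latest month-over-month change deviates too far from the recent average.

    `monthly_totals` is a chronologically ordered list of page-view counts;
    `tolerance` is the allowed absolute deviation in growth rate (0.25 = 25 points).
    """
    changes = [
        (curr - prev) / prev
        for prev, curr in zip(monthly_totals, monthly_totals[1:])
    ]
    baseline = sum(changes[:-1]) / len(changes[:-1])  # average of the earlier changes
    latest = changes[-1]
    if abs(latest - baseline) > tolerance:
        return f"ALERT: latest growth {latest:+.1%} vs. recent average {baseline:+.1%}"
    return "ok"

# Example: a sudden drop in the newest month trips the alert.
print(growth_alert([100_000, 104_000, 103_000, 107_000, 61_000]))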

For example, in the UDP, we looked at how page views by product change over time by month
compared to the growth of the most recent month. We verified that the change we saw from month to
month as new data arrived was stable over time, and that dips or spikes were seen only when expected (e.g.,
when a new product was launched). We used the average over several months to account for
anomalies caused by months with several holidays during the week. We wanted to identify thresholds
and data existence expectations.
During this testing, we found several issues, including a failing log copy process and products that
stopped sending up data to the system. This test verified that each product was present in the final
dataset. Using this data, we were able to identify a problem with ad server tracking in one of our

products before it caused major problems. This kind of issue was previously difficult to detect
without time–series analysis.
We knew the number of active installations for different products and total users associated with each
of those installations, but we could not determine which users were actually active before the new
data source was created. To validate the new data and these active user counts, we ensured that the
total number of users we saw making page views in each product was higher than the total number of
installations, but lower than the total associated users, because not all users in an installation would
be active.

Putting the Right Monitors in Place
The time–series analysis was key to identifying the kinds of monitors needed, such as ones for user
and client growth. It also identified general usage trends to expect, such as average page views per


product. Monitors are used to test new data being appended to and created by the system going
forward; one-time or single historical reports will not require monitoring. One thing we had to account
for when creating monitors was traffic changes throughout the week, such as significant drops on the
weekends. A couple of trend complications we had to deal with were weeks that have holidays and
general annual trends such as drops in traffic in December and during the summer. It is not enough to
verify that the month looks similar to the month before it or that a week has similar data to the week
before; we also had to determine a list of known holidays to add indicators to those dates when the
monitors are triggered and compare averages over a reasonable amount of time.
It is important to note that we did not allow holidays to mute errors; instead, we added indicators and
high-level data trend summaries in the monitor errors that allowed us to easily determine if the alert
could be ignored.
Some specific monitors we added included looking at total page views over time and ensuring that
the total was close to the average total over the previous three months. We also added the same
monitors for the total page views of each product and category, which tracked that all categories
collect data consistently. This also ensured that issues in the system creating the data were monitored
and changes such as accidental removal of tracking code would not go unnoticed.

Other tests included looking at these same trends for totals and by category for registered users and
visitors to ensure that tracking around users remained consistent. We added many tests around users
because knowing active users and their demographics was critical to our reporting. The main
functionality for monitors is to ensure that critical data continues to have the integrity required.
A large change in a trend is an indicator that something might not be working as expected in all parts
of the system. A good rule of thumb for what defines a “large” change is when the data in question is
outside one to two standard deviations from the average. For example, we found one application that
collected the expected data for three months while in beta, but when the final product was deployed,
the tracking was removed. Our monitors discovered this issue by detecting a drop in total page views
for that product category, allowing us to dig in and correct the issue before it had a large impact.
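A minimal sketch of that rule of thumb, assuming hypothetical daily page-view counts for one product category:

from statistics import mean, stdev

def is_large_change(history, latest, num_std=2.0):
    """Flag `latest` if it lies more than `num_std` standard deviations from the mean of `history`."""
    avg = mean(history)
    spread = stdev(history)
    return abs(latest - avg) > num_std * spread

# Hypothetical daily page views for one product category.
recent_days = [52_000, 49_500, 51_200, 50_800, 53_100, 50_200, 51_700]
today = 31_000  # e.g., tracking code accidentally removed in a release

if is_large_change(recent_days, today):
    print("ALERT: today's page views are outside the expected range")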
There are other monitors we also added that do not focus heavily on trends over time. Rather, they
ensured that we would see the expected number of total categories and that the directory containing
all the files being processed had the minimum number of expected files, each with the minimum
expected size. This was determined to be critical because we found one issue in which some log files
were not properly copied for parsing and therefore significant portions of data were missing for a
day. Missing even only a few hours of data can have large effects on different product results,
depending on what part of the day is missing from our data. These monitors helped us to ensure data
copy processes and sources were updated correctly and provided high-level trackers to make sure the
system is maintained.
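A minimal sketch of such a monitor, assuming a hypothetical log directory and thresholds (one file per hour, with an assumed lower bound on healthy file size):

from pathlib import Path

LOG_DIR = Path("/data/ad_service_logs/2016-12-01")  # hypothetical location
MIN_FILE_COUNT = 24              # e.g., one log file per hour
MIN_FILE_SIZE_BYTES = 1_000_000  # assumed lower bound for a healthy hourly log

def check_log_directory(log_dir: Path):
    """Return a list of problems found with the incoming log files, if any."""
    if not log_dir.is_dir():
        return [f"log directory {log_dir} is missing"]
    problems = []
    files = sorted(log_dir.glob("*.log"))
    if len(files) < MIN_FILE_COUNT:
        problems.append(f"only {len(files)} files found; expected at least {MIN_FILE_COUNT}")
    for f in files:
        size = f.stat().st_size
        if size < MIN_FILE_SIZE_BYTES:
            problems.append(f"{f.name} is suspiciously small ({size} bytes)")
    return problems

for problem in check_log_directory(LOG_DIR):
    print("ALERT:", problem)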
As with other testing, the monitors can change over time. In fact, we did not start out with a monitor to
ensure that all the files being processed were present and the correct sizes. The monitor was added
when we discovered data missing after running a very long process. When new data or data
processes are created, it is important to treat them skeptically until no new issues or questions have surfaced
for a reasonable amount of time. How long that takes is usually related to how the processed data is consumed and
used.
Much of the data I work with at Spiceworks is produced and analyzed monthly, so we monitor the
system closely, and largely manually, until the process has run fully and successfully for several months.
This included working closely with our analysts as they worked with the data to find any potential

issues or remaining edge cases in the data. Anytime we found a new issue or unexpected change, a
new monitor was added. Monitors were also updated over time to be more tolerant of acceptable
changes. Many of these monitors were less about the system itself (there are different kinds of tests for
that) and more about data integrity and ensuring reliability.
Finally, another way to monitor the system is to “provide end users with a dead-easy way to raise an
issue the moment an inaccuracy is discovered,” and, even better, let them fix it. If you can provide a
tool that both allows a user to report on data as well as make corrections, the data will be able to
mature and be maintained more effectively. One tool we created at Spiceworks helped maintain how
different products are categorized. We provided a user interface with a database backend that
allowed interested parties to update classifications of URLs. This created a way to dynamically
update and maintain the data without requiring code changes and manual updates.
Yet another way we did this was to incorporate regular communications and meetings with all of the
users of our data. This included our financial planning teams, business analysts, and product
managers. We spent time understanding the way the data would be used and what the end goals were
for those using it. In every application, we included a way to give feedback on each page, usually
through a form that includes all the page’s details. Anytime the reporting tool did not have enough
data for the user, we provided an easy way to connect with us directly to help obtain the necessary
data.

Implementing Automation
At each layer of testing, automation can help ensure long-term reliability of the data and quickly
identify problems during development and process updates. This can include unit tests, trend alerts,
or anything in between. These are valuable for products that are being changed frequently or require
heavy monitoring.
In the UDP, we automated almost all of the tests around transformations and aggregations, which
allowed for shorter test cycles while iterating on the process and provided long-term stability
monitoring of the parsing process in case anything changes in the future or a new system needs to be
tested.
Not all tests need to be automated or created as monitors. To determine which tests should be
automated, I try to focus on three areas:

Overall totals that indicate system health and accuracy
Edge cases that have a large effect on the data
How much effect code changes can have on the data
There are four general levels of testing, and each of these levels generally describes how the tests are
implemented:
Unit
Unit tests focus on single, complete components in isolation.
Integration
Integration tests focus on two components working together to build a new or combined data set.
System
System tests verify the infrastructure and the overall process as a whole.
Acceptance
Acceptance tests validate data as reasonable before publishing or appending data sets.
In the UDP, because having complete sets of logs was critical, a separate system-level test was
created to run before the rest of the process to ensure that data for each day and hour could be
identified in the log files. This approach further ensures that critical and difficult-to-find errors would
not go unnoticed. Other tests we focused on were between transformations of the data such as
comparing initial parsed logs as well as aggregate counts of users and total page views.
Some tests, such as categorization verification, were only done manually because most changes to the
process should not affect this data and any change in categorization would require more manual
testing either way. Different tests require different kinds of automation; for example, we created an
automated test to validate the final reporting tables, which included a column for total impressions as
well as a breakdown by impression type (an impression caused by a new page view versus an ad
refresh, and so on). This test was implemented as a unit test to ensure that, at a low
level, the total was equal to the sum of the impression types.
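A minimal sketch of such a unit test, assuming a hypothetical reporting row with a total_impressions column and per-type columns, using Python’s unittest module:

import unittest

# Hypothetical reporting row: total impressions plus a breakdown by impression type.
def fetch_reporting_row():
    return {"total_impressions": 1500, "new_page_view": 1200, "ad_refresh": 250, "other": 50}

class TestImpressionBreakdown(unittest.TestCase):
    def test_total_equals_sum_of_types(self):
        row = fetch_reporting_row()
        breakdown = ["new_page_view", "ad_refresh", "other"]
        self.assertEqual(row["total_impressions"], sum(row[col] for col in breakdown))

    def test_no_negative_counts(self):
        # Also catches a calculated "other" column going negative.
        row = fetch_reporting_row()
        for column, value in row.items():
            self.assertGreaterEqual(value, 0, f"{column} should never be negative")

if __name__ == "__main__":
    unittest.main()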
Another unit test included creating samples for the log parsing logic including edge cases as well as
both common and invalid examples. These were fed through the parsing logic after each change to it
as we discovered new elements of the data. One integration test included in the automation suite
ensured that country data from the third-party geographical dataset was valid and present. The
automated tests for data integrity and reliability using monitors and trends ran at the
acceptance level, after processing, to ensure the data was valid and followed the expected patterns before
publishing it. Usually, when automated tests are needed, there will be some at every level.
It is helpful to document test suites and coverage, even if they are not automated immediately or at all.
This makes it easy to review tests and coverage as well as allow for new or inexperienced testers,
developers, and so on, to assist in automation and manual testing. Usually, I just record tests as they
are manually created and executed. This helps to document edge cases and other expectations and
attributes of the data.
As needed, when critical tests were identified, we worked to automate those tests to allow for faster
iterations working with the data. Because almost all code changes required some regression testing,


covering critical and high-level tests automatically provided easy smoke testing for the system and
gave some confidence in the continued integrity of the data when changes were made.

Conclusion
Establishing confidence in data accuracy and integrity can be a daunting task, but it can be accomplished
without having a Ph.D. or background in data analysis. Although you cannot use some of these
strategies in every scenario or project, they should provide a guide for how you think about data
verification, analysis, and automation, as well as give you the tools and ways to think about data to be
able to provide confidence that the data you’re using is trustworthy. It is important that you become
familiar with the data at each layer and create tests between each transformation to ensure consistency
in the data. Becoming familiar with the data will allow you to understand what edge cases to look for
as well as trends and outliers to expect. It will usually be necessary to work with other teams and
groups to improve and validate data accuracy (a quick drink never hurts to build rapport). Some ways
to make this collaboration easier are to understand the focus of the teams you are collaborating with
and to show how the data can be valuable for them to use themselves. Finally, you can ensure
and monitor reliability through automation of process tests and acceptance tests that verify trends and
boundaries and also allow the data collection processes to be converted and iterated on easily.


Further Reading
1. Peters, M. (2013). “How Do You Know If Your Data is Accurate?” Retrieved December 12, 2016.
2. Polovets, L. (2011). “Data Testing Challenge.” Retrieved December 12, 2016.
3. Chen, W. (2010). “How to Measure Data Accuracy?” Retrieved December 12, 2016.
4. Chen, W. (2010). “What’s the Root Cause of Bad Data?” Retrieved December 12, 2016.
5. Jain, K. (2013). “Being paranoid about data accuracy!” Retrieved December 12, 2016.

About the Author
Since graduating from the University of Texas at Austin with a BS in computer science, Jessica Roper
has worked as a software developer focused on data: maintaining, processing, scrubbing, warehousing,
testing, reporting on, and creating products from it. She is an avid mentor and teacher, taking any opportunity
available to share knowledge.
Jessica is currently a senior developer in the data analytics division of Spiceworks, Inc., a network
used by IT professionals to stay connected and monitor their systems.
Outside of her technical work, she enjoys biking, swimming, cooking, and traveling.


