
Managing the Data Lake
Moving to Big Data Analysis

Andy Oram


Make Data Work
strataconf.com
Presented by O’Reilly and Cloudera, Strata + Hadoop World is where cutting-edge data science and new business fundamentals intersect—and merge.

• Learn business applications of data technologies
• Develop new skills through trainings and in-depth tutorials
• Connect with an international community of thousands who work with data




Managing the Data Lake

Moving to Big Data Analysis

Andy Oram


Managing the Data Lake
by Andy Oram
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (). For
more information, contact our corporate/institutional sales department:
800-998-9938 or

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Karen Montgomery

September 2015: First Edition

Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Managing the Data Lake and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Cover photo credit: “55 Flying Fish” by Michal (Flickr).

978-1-491-94168-3
[LSI]


Table of Contents

Moving to Big Data Analysis
    Why Companies Move to Hadoop
    Acquisition and Ingestion
    Metadata (Cataloguing)
    Data Preparation and Cleaning
    Managing Workflows
    Access Control
    Conclusion



Moving to Big Data Analysis

Can you tell by sailing the surface of a lake whether it has been well
maintained? Can local fish and plants survive? Dare you swim? And
how about the data maintained in your organization’s data lake? Can
you tell whether it’s healthy enough to support your business needs?
An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called “data lakes.” They use a variety of storage and processing tools—especially in the Hadoop family—to extract value quickly and inform key organizational decisions.
This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model—introduced in 2004 in a paper1 by Jeffrey Dean and Sanjay Ghemawat—completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you’ve got.
I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization’s data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.

1 (PDF)
Essentially, you’ll need to take care of challenges that never came up
with traditional relational databases and data warehouses, or that
were handled by the constraints that the relational model placed on
data. There is wonderful value in those constraints, and most of us
will be entrusting data to relational systems for the foreseeable
future. But some data tasks just don’t fit. And once you escape the
familiarity and safety of the relational model, you need other tools
to manage the inconsistencies, unpredictability, and breakneck pace
of the data you’re handling.
The risk of the new tools is having many disparate sources of data—
and perhaps multiple instances of Hadoop or other systems offering
analytics operating inefficiently—which in turn causes you to lose
track of basic information you need to know about your data. This
makes it hard to set up new jobs that could provide input to the
questions you urgently need to answer.
The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.

The main topics covered in this report are:
Acquisition and ingestion
Nowadays, data comes from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can’t force everyone to provide the data in a format that’s convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into them. The problems of data acquisition and ingestion have to be solved with a degree of automation.
Metadata (cataloguing)
Questions such as who provided the data, when it came in, and
how it was formatted—a slew of concerns known as lineage or
provenance—are critical to managing your data well. A catalog
can keep this metadata and make it available to later stages of
processing.

Data preparation and cleaning
Just as you can’t control incoming formats, you can’t control data quality. You will inevitably deal with data that does not conform. Data may be missing, entered in diverse formats, contain errors, and so on. In addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning the data takes up 90% of their time.
Managing workflows
The actual jobs you run on data need to be linked with the three
other stages just described. Users should be able to submit jobs
of their own, based on the work done by experts before them, to
handle ingestion, cataloguing, and cleaning. You want staff to
quickly get a new visualization or report without waiting weeks
for a programmer to code it up.
Access control
Data is the organization’s crown jewels. You can’t give everybody
access to all data. In fact, regulations require you to restrict
access to sensitive customer data. Security and access controls
are therefore critical at all stages of data handling.

Why Companies Move to Hadoop
To set the stage for exploration of data management, it is helpful to
remind ourselves of why organizations are moving in the direction
of big data tools.
Size
“Volume” is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architectural restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, data processing with conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.


Variety
Typical sources include flat files, RDBMSes, logs from web servers, devices and sensors, and even legacy mainframe data. Sometimes you also want to export data from Hadoop to an RDBMS or other repository.
Free-form data
Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.
Streaming data
If you don’t keep up with changes in the world around you, it
will pass you by—and probably reward a competitor who does
adapt to it. Streaming has evolved from a few rare cases, such as
stock markets and sensor data, to everyday data such as product
usage data and social media.
Fitting the task to the tool
Data maintained in relational databases—let alone cruder storage formats, such as spreadsheets—is structured well for certain analytics. But for new paradigms such as Spark or the MapReduce model, preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined into the format that the analytics engine can efficiently process.
Frequent failures
Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You’ll want to get notifications when a job finishes successfully or unsuccessfully. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.
Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn’t ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.

A failure to collect or store data may be an extreme example of management problems, but other hindrances—such as storing it in a format that is hard to read, or failing to remember when it arrived—will also slow down processing to the point where you give up opportunities for learning insights from your data.
When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can’t afford to spend time and resources doing such coding in order to integrate them.
In systems with large, messy data, you have to decide what the system should do when input is bad. When do you skip a record, when do you run a program to try to fix corrupted data, and when do you abort the whole job?
A minor error such as a missing ZIP code probably shouldn’t stop a
job, or even prevent that record from being processed. A missing
customer ID, though, might prevent you from doing anything useful
with the data. (There may be ways to recover from these errors too,
as we’ll see.)
Your choice depends of course on your goal. If you’re counting sales
of a particular item, you don’t need the customer ID. If you want to
update customer records, you probably do.
A more global problem with data ingestion comes when someone
changes the order of fields in all the records of an incoming data set.
Your program might be able to detect what happened and adjust, or
might have to abort.
At some point, old data will pile up and you will have to decide whether to buy more disk space, archive the data (magnetic tape is still in everyday use), or discard it. Archiving or discarding has to be automated to reduce errors. You’ll find old data surprisingly useful if you can manage to hold on to it. And of course, having it readily at hand (instead of on magnetic tape) will permit you to quickly run analytics on that data.



Acquisition and Ingestion
At this point we turn to the steps in data processing. Acquisition
comes first. Nowadays it involves much more than moving data
from an external source to your own repository. In fact, you may
not be storing every source you get data from at all: you might
accept streams of fast-changing data from sensors or social media,
process them right away, and save only the results.
On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.
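
As a minimal illustration of that conversion step (not tied to any particular vendor’s tool), the following Python sketch converts a hypothetical CSV feed to Parquet using pandas and PyArrow; the file paths are invented for the example.

    # Minimal sketch: convert an incoming CSV feed to Parquet for Hadoop-friendly storage.
    # The file paths here are hypothetical; adjust them to your own feed.
    import pandas as pd

    def csv_to_parquet(src_csv: str, dest_parquet: str) -> None:
        # Read the raw feed into a DataFrame.
        df = pd.read_csv(src_csv)
        # Write a columnar Parquet file (requires the pyarrow package).
        df.to_parquet(dest_parquet, engine="pyarrow", index=False)

    if __name__ == "__main__":
        csv_to_parquet("incoming/orders_2015-09-01.csv",
                       "landing/orders_2015-09-01.parquet")
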
The health care field provides a particularly complex data collection
case. You may be collecting:
• Electronic health records from hospitals using different formats
• Claims data from health care providers or payers
• Profiles from health plans
• Data from individuals’ fitness devices
Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several messaging systems that differ from any other kind of data exchange used in the computer field.
In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You’ll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.
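
To make that concrete, here is a minimal, hypothetical sketch of such a parameterized ingestion entry point; the feed names, handler functions, and parameters are invented for illustration and would be replaced by your own ingestion logic.

    # Hypothetical sketch of parameterized ingestion: a registry of general-purpose
    # handlers, selected by feed type, with source/destination/frequency as parameters.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class FeedConfig:
        source: str        # e.g., an sftp path or HTTP endpoint
        destination: str   # e.g., an HDFS directory
        frequency: str     # e.g., "hourly", "daily"

    def ingest_hl7(cfg: FeedConfig) -> None:
        print(f"Ingesting HL7 messages from {cfg.source} into {cfg.destination}")

    def ingest_claims(cfg: FeedConfig) -> None:
        print(f"Ingesting claims files from {cfg.source} into {cfg.destination}")

    HANDLERS: Dict[str, Callable[[FeedConfig], None]] = {
        "hl7": ingest_hl7,
        "claims": ingest_claims,
    }

    def ingest(feed_type: str, cfg: FeedConfig) -> None:
        # A new feed only needs a feed type and parameters, not new code.
        HANDLERS[feed_type](cfg)

    ingest("claims", FeedConfig("sftp://payer.example/claims", "/data/raw/claims", "daily"))
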
Successful ingestion requires you to know in detail how the data is coming in. Read the documentation carefully: you may find that the data doesn’t contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.


As mentioned earlier, you may be able to anticipate how incoming data changes—such as reordered fields—and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.
Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together—for example, an Order Header and Details merged into one file. Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that these small files be combined into a single large file before processing, to leverage the Hadoop cluster more efficiently.
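
As a rough sketch of that compaction step, a small script can concatenate the day’s files into one large file before handing it to Hadoop. This assumes newline-delimited files with identical layouts and no header rows; the paths and glob pattern are hypothetical.

    # Rough sketch: merge many small newline-delimited files into one large file
    # before loading it into HDFS. Assumes identical layouts and no header rows;
    # paths and the glob pattern are hypothetical.
    import glob
    import shutil

    def compact(pattern: str, merged_path: str) -> int:
        parts = sorted(glob.glob(pattern))
        with open(merged_path, "wb") as out:
            for part in parts:
                with open(part, "rb") as f:
                    shutil.copyfileobj(f, out)  # stream each small file into the big one
        return len(parts)

    count = compact("incoming/orders/2015-09-01/*.csv", "staging/orders_2015-09-01.csv")
    print(f"merged {count} small files")
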
This example highlights an important principle governing all the
processing discussed in this report: use open formats if possible, and
leverage everything the open source and free software communities
have made available. This will give you more options, because you
won’t be locked into one vendor. Open source also makes it easier to
hire staff and get them productive quickly.
However, current open source tools don’t do everything you need.
You’ll have to fill in the gaps with commercial solutions or handcrafted scripts.
For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop, and it supports incremental loads. However, building a complete insert-update-delete solution to keep a Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from Zaloni’s Bedrock product, which offers a Change Data Capture (CDC) action that handles inserts, updates, and deletes and is easy to configure.
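
To illustrate the idea behind CDC (this is a toy sketch, not how Bedrock or any other product implements it), the snippet below applies a hypothetical change log of inserts, updates, and deletes to an in-memory snapshot keyed by primary key; a production pipeline performs the same merge at Hive-table scale.

    # Conceptual sketch of change data capture (CDC): apply a log of insert/update/
    # delete operations to a snapshot keyed by primary key. Illustration only.
    from typing import Dict, List, Tuple

    Row = Dict[str, object]

    def apply_changes(snapshot: Dict[object, Row], changes: List[Tuple[str, Row]]) -> None:
        for op, row in changes:
            key = row["id"]
            if op in ("I", "U"):          # inserts and updates both overwrite by key
                snapshot[key] = row
            elif op == "D":               # deletes remove the key if present
                snapshot.pop(key, None)

    table = {1: {"id": 1, "city": "Raleigh"}, 2: {"id": 2, "city": "Boston"}}
    log = [("U", {"id": 2, "city": "Cambridge"}),
           ("I", {"id": 3, "city": "Durham"}),
           ("D", {"id": 1})]
    apply_changes(table, log)
    print(table)
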

Metadata (Cataloguing)
Why do you need to preserve metadata about your data? Reasons
for doing so abound:

• For your analytics, you will want to choose data from the right
place and time. For instance, you may want to go back to old
data from all your stores in a particular region.
• Data preparation and cleaning require a firm knowledge of which data set you’re working on. Different sets require different types of preparation, based on what you have learned about them historically.
• Analytical methods are often experimental and have some
degree of error. To determine whether you can trust results, you
may want to check the data that was used to achieve the results,
and review how it was processed.
• When something goes wrong at any stage from ingestion through processing, you need to quickly pinpoint the data causing the problem. You also must identify the source so you can contact them and make sure the problem doesn’t recur in future data sets.
• In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data.
• Access has to be restricted to sensitive data. If users deliberately
or inadvertently try to start a job on data they’re not supposed
to see, your system should reject the job.
• Regulatory requirements may require the access restrictions
mentioned in the previous bullet, as well as imposing other
requirements that depend on the data source.

• Licenses may require access restrictions and other special treatment of some data sources.
Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single source of truth” from the diverse data sets you take in. By creating a data catalog, you can store this metadata for use by downstream programs.
Zaloni divides metadata roughly into three types:
Business metadata
This can include the business names and descriptions that you assign to data fields to make them easier to find and understand. For instance, the technical staff may have a good reason to assign the name loc_outlet to a field that represents a retail store, but you will want users to be able to find it through common English words. This kind of metadata also covers business rules, such as putting an upper limit (perhaps even a lower limit) on salaries, or determining which data must be removed from some jobs for security and privacy.
Operational metadata
This is generated automatically by the processes described in this report, and includes such things as the source and target locations of data, file size, number of records, how many records were rejected during data preparation or a job run, and the success or failure of that run itself.

Technical metadata
This includes the data’s type and format (text, images, JSON, Avro, etc.) and the structure or schema. This structure includes the names of fields, their data types, their lengths, whether they can be empty, and so on. Structure is commonly provided by a relational database or the headings in a spreadsheet, but may also be added during ingestion and data preparation. Zaloni’s Bedrock integrates with Apache HCatalog for technical metadata so that other tools in the Hadoop ecosystem can take advantage of the structure definition.
As suggested in the previous list, one can also categorize metadata
by the way it is gathered:
• Some metadata is embedded in the data, such as the schema in a
relational database.
• Some metadata pertains to the data acquisition process: the source of the data, filename, time of creation, time of acquisition, file size, redundancy checks generated to make sure the transmission was not corrupted, and MD5 hashes generated to uniquely identify a file.
• Some metadata is created during ingestion. For instance, a watermark can be added to a file or to a column within the file. If you take JSON or other relatively unstructured data and create a schema around it, that schema becomes part of the metadata.
• Some metadata is created during a job run, such as the number
of records successfully processed, the number of bad fields or
bad records, and how long a job took.
The next question is how to create metadata. Many tools can extract the easy stuff, such as file sizes and timestamps, as the stages of processing proceed. Other metadata requires custom-written programs that do such things as tag particular data fields you’ll want to extract later.
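
For the “easy stuff,” a small, hypothetical collector might look like the sketch below: it records file size, timestamps, an MD5 hash, and a record count into a catalog entry. The field names and JSON catalog format are assumptions for illustration, not any particular product’s schema.

    # Hypothetical sketch: gather basic operational/technical metadata for a newly
    # ingested file. Field names and the JSON catalog format are illustrative only.
    import hashlib
    import json
    import os
    from datetime import datetime, timezone

    def catalog_entry(path: str, source: str) -> dict:
        md5 = hashlib.md5()
        records = 0
        with open(path, "rb") as f:
            for line in f:          # stream the file once for both hash and record count
                md5.update(line)
                records += 1
        stat = os.stat(path)
        return {
            "source": source,
            "filename": os.path.basename(path),
            "size_bytes": stat.st_size,
            "acquired_at": datetime.now(timezone.utc).isoformat(),
            "modified_at": datetime.fromtimestamp(stat.st_mtime, timezone.utc).isoformat(),
            "md5": md5.hexdigest(),
            "record_count": records,
        }

    entry = catalog_entry("staging/orders_2015-09-01.csv", "web-store")
    print(json.dumps(entry, indent=2))
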
At any stage of processing, you may choose to update the metadata.
Each stage can also consult the metadata when applying rules for
user access, cleaning, and submitting data to jobs. We’ll see later
how, at least in theory, storing feedback in metadata can create an
environment of continuous quality improvement.
Currently, one of the huge challenges in data management is communicating metadata to downstream parts of a workflow. A good deal of Zaloni Bedrock’s benefits rest on its ability to do this conveniently. Work is just starting on an open source project named Apache Atlas, which addresses some of these issues as well.

Data Preparation and Cleaning
Assume that your data will come with a certain amount of errors,
corrupted formats, and duplicates. I’m not using “assume” in a
hypothetical sense here—you had better assume the presence of
errors or you will be blindsided when they happen.
What will be the impacts of such errors? Suppose data transfers don’t complete, for instance? Your workflows should be able to handle the most common problems, and you’ll need to research your data feeds to discover those problems.

A sense of what you can run into comes, like several other examples in this report, from health care. The US government’s Centers for Medicare & Medicaid Services (CMS), which covers a large percentage of health care payments in the country, requires participating health care providers to submit quality data in a format called the Healthcare Effectiveness Data and Information Set (HEDIS). This format is strict, demanding, and absolutely gigantic. Fields that get mixed up or have incorrect coding cost huge amounts of money as providers rush to fix them.


Why is HEDIS hard to fill out? Because the data is drawn from reports that undergo many processing steps, in paper or electronic forms. You would not want your organs during a surgery to pass through as many hands as HEDIS data does. The doctor’s original note is processed by a business office within the provider, after which it is sent to an outside billing service because payer requirements are so strict and complicated. The forms then go to the insurer, who may question the claim and send it back through the route on which it came.
The trek may undergo several iterations, taking months. As the
health care provider strives to get payment, lost data and errors in
coding are likely to enter the data.
Rest assured, therefore, that your data will need processing and cleaning. There are two types of fixes that require different responses from your organization: fixes that can be done on a single piece of data and fixes that require analytics to be run on large data sets.
Note that even a fix on a single piece of data may be developed by analytics carried out within your organization, or by a vendor. For instance, research can show that the state of California is commonly represented as Ca, CA, Cal, or Cali in data sets. A simple programming check, using fixed strings or regular expressions, can identify the various possible values and harmonize them on a single standard, such as CA.
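
A minimal sketch of that kind of check, assuming the variants listed above (a real deployment would build the mapping from your own profiling results):

    # Minimal sketch: harmonize common variants of a state name onto one standard code.
    # The variant list is illustrative; in practice you would derive it from profiling.
    import re

    CALIFORNIA = re.compile(r"^\s*(ca|cal|cali|california)\s*$", re.IGNORECASE)

    def normalize_state(value: str) -> str:
        if CALIFORNIA.match(value):
            return "CA"
        return value.strip().upper()   # fall back to a simple uppercase trim

    for raw in ["Ca", "CA", " cal ", "Cali", "NY"]:
        print(raw, "->", normalize_state(raw))
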
Similar research can help with the HL7 example I cited earlier, where different vendors implement a standard differently and put data in different places. Once you identify how a particular vendor codes an address, you can write a program to read it into the format of your choice. This program must be updated, of course, if the vendor changes their coding, and that will probably happen without notice: a good reason for running more analytics.
A missing customer ID probably can’t be fixed by examining a single record, although it is possible you’ll discover the ID entered into a different field of the record. More likely, you’ll run a job to match customers by name, gender, address, and other characteristics. You can probably find a record in a different data set and be able to trust, with a good deal of confidence, that it’s the customer with the missing ID.
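
As a rough illustration of such a matching job (deliberately simplified; real record linkage uses fuzzier matching and confidence scores), the pandas join below fills in missing customer IDs from a reference data set by matching on name and address. The column names and values are assumptions.

    # Simplified sketch of a matching job: recover missing customer IDs by joining
    # against a reference data set on name and address. Column names are hypothetical,
    # and real record linkage would use fuzzier matching plus confidence scoring.
    import pandas as pd

    orders = pd.DataFrame({
        "customer_id": [101, None, None],
        "name": ["Ann Lee", "Bob Cruz", "Dana Kim"],
        "address": ["1 Elm St", "9 Oak Ave", "5 Pine Rd"],
    })
    customers = pd.DataFrame({
        "customer_id": [101, 204, 305],
        "name": ["Ann Lee", "Bob Cruz", "Dana Kim"],
        "address": ["1 Elm St", "9 Oak Ave", "5 Pine Rd"],
    })

    merged = orders.merge(customers, on=["name", "address"], how="left",
                          suffixes=("", "_ref"))
    # Prefer the existing ID; fall back to the one found in the reference set.
    merged["customer_id"] = merged["customer_id"].fillna(merged["customer_id_ref"])
    print(merged[["customer_id", "name", "address"]])
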


A job can also identify duplicates: two records that refer to the same customer. This mistake often happens when combining data sets from different sources. It could also happen out in the real world for many reasons: the customer changed his name, moved to a new address, decided to use a different email address, got a misspelled name because someone entered it into the system sloppily, etc.
Another example where a job can help enforce quality is checking city names against ZIP codes in US addresses. A US ZIP code is generally assigned to a single city (although a city can have many ZIP codes), so if you find two different cities sharing the same ZIP code in your data, at least one of them is probably incorrect.
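
A quick sketch of that kind of consistency check (column names and sample values are invented for the example):

    # Quick sketch: flag ZIP codes that appear with more than one city name.
    # Column names and sample values are hypothetical.
    import pandas as pd

    addresses = pd.DataFrame({
        "zip": ["27601", "27601", "02139", "02139"],
        "city": ["Raleigh", "Raleigh", "Cambridge", "Boston"],
    })

    cities_per_zip = addresses.groupby("zip")["city"].nunique()
    suspect_zips = cities_per_zip[cities_per_zip > 1].index
    print(addresses[addresses["zip"].isin(suspect_zips)])   # rows needing review
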
To decide what needs to be cleaned up and how, work with the business team and come up with rules for data quality. When you check individual records, typical rules might include:
• Data older than a certain age should be discarded, or marked as
less trustworthy because it might have changed.
• Certain fields must not be empty. An empty field may be hard to
identify because some people enter meaningless strings such as
X or 9999 when they don’t know something. Sometimes you can
find the data elsewhere and fill it in, but sometimes you’ll
choose to reject the whole record.
• Dates and times must be correct, and must be in a standard format.
Many commercial tools provide built-in functions to do common
checks and even make fixes, but many sites write filters of their own
at least part of the time.
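
A hand-written filter of that sort might look like the minimal sketch below; the required fields, placeholder values, and date format are assumptions chosen for illustration.

    # Minimal sketch of a hand-written record filter: reject records with missing or
    # placeholder required fields, and require dates in a standard format.
    # Required fields, placeholder values, and the date format are illustrative.
    from datetime import datetime

    PLACEHOLDERS = {"", "X", "9999", "N/A"}
    REQUIRED = ["customer_id", "order_date"]

    def is_valid(record: dict) -> bool:
        for field in REQUIRED:
            value = str(record.get(field, "")).strip()
            if value.upper() in PLACEHOLDERS:
                return False                      # empty or meaningless filler
        try:
            datetime.strptime(record["order_date"], "%Y-%m-%d")
        except ValueError:
            return False                          # not in the standard date format
        return True

    rows = [{"customer_id": "101", "order_date": "2015-09-01"},
            {"customer_id": "9999", "order_date": "2015-09-01"},
            {"customer_id": "102", "order_date": "09/01/2015"}]
    print([is_valid(r) for r in rows])            # [True, False, False]
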
In addition to checking each field, you usually need some higher-level checks that involve files and metadata. For instance, did incoming data conform to the schema you expected? Are you getting two identical files? Comparing the MD5 hashes generated on the files is a simple way to answer the latter question.
The data preparation stage is often where sensitive data, such as financial and health information, is protected. Although terms for this differ, most systems distinguish two types of protection: removing a field completely (often called masking) and changing the field to something innocuous (often called tokenization). As an example of tokenization, test data sets substitute realistic but fake names for real names so that developers can test their code against these sets.
Another kind of tokenization is to run the value from a field through a one-way hash (such as MD5), which ensures that the same value is always represented by the same hash, but prevents anyone from deriving the original value. This is a type of pseudonymity.
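
A minimal sketch of that approach follows; the salt is a hypothetical secret added so that common values cannot be guessed simply by hashing candidate strings.

    # Minimal sketch of hash-based tokenization: the same input always yields the same
    # token, but the original value cannot be recovered from it. The salt is a
    # hypothetical secret that makes guessing by hashing candidate values harder.
    import hashlib

    SALT = b"replace-with-a-secret-key"

    def tokenize(value: str) -> str:
        return hashlib.md5(SALT + value.encode("utf-8")).hexdigest()

    print(tokenize("jane.doe@example.com"))
    print(tokenize("jane.doe@example.com"))   # identical token for the identical value
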
Often, you need to bring a human back into the loop to clean data.
A system may suggest that 1-800-REDCROSS is an incorrect phone
number because it contains letters (and is one character too long, as
well), but a human observer can tell the system to accept it. Over
time, the system picks up more and more such information and
becomes smarter, even when processing new data sets.
One of the most interesting experiments currently taking place in big data research is a form of continuous quality improvement, according to Ihab Ilyas, professor at the University of Waterloo and co-founder of Tamr. A program analyzes the data to find error patterns and develop some rules to restore consistency. It can then use these rules to catch current and future errors at earlier stages of data processing, and perhaps even fix the errors or suggest fixes.

Managing Workflows
You have designed your filters and jobs for ingestion, cataloguing metadata, data preparation, and Hadoop itself. Can you make regular, productive use of all these things? That depends on how easily you can combine the tasks in end-to-end workflows.
First, you should make workflows for each task. How is data from a
particular source ingested? Do you have a general workflow to
which you can just assign parameters such as the source and type of
data?
And how is the workflow triggered? Forcing someone to launch the
job manually is a waste of staff time, and prone to errors.
You could do something as simple as scheduling a job at regular intervals. (Unix and Linux provide cron for that purpose.) YARN is an open source tool that helps with resource allocation and scheduling.
Resource allocation gets particularly complex in the cloud. You want to ensure you can get the number and capacity of systems you need for the turnaround time you need, while avoiding the risk of jobs growing to an enormous, costly scale.
Your workflow processor should also be able to handle triggers, so that when something important happens, like the arrival of new data, the job launches on its own. For instance, AWS Data Pipeline lets you specify that a job starts whenever a particular file is uploaded to S3 storage. The open source Oozie project can also start a job based on the availability of data.
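
Even a simple poller can provide file-arrival triggering when no such tool is in place; the sketch below (directory paths and the launch command are hypothetical) starts a workflow whenever a new file lands in a drop directory.

    # Simple sketch of file-arrival triggering: poll a drop directory and launch a
    # workflow for each new file. Paths and the launch command are hypothetical; a real
    # deployment would use a scheduler or a tool like Oozie or AWS Data Pipeline.
    import glob
    import subprocess
    import time

    SEEN = set()

    def launch_workflow(path: str) -> None:
        # Placeholder: in practice this might submit an Oozie job or call your own pipeline.
        subprocess.run(["echo", f"ingesting {path}"], check=True)

    while True:
        for path in glob.glob("/data/dropzone/*.csv"):
            if path not in SEEN:
                launch_workflow(path)
                SEEN.add(path)
        time.sleep(60)   # check for new files once a minute
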
Scheduling should also be flexible. One site I talked to sometimes
delays a workflow for a few hours when the servers are at capacity.
Having small workflows in place, you should be able to compose
larger workflows from sub-workflows. In that way you can robustly
construct a single workflow covering data acquisition, ingestion
(putting it in the right repository), cleansing, format conversion,
enrichment, and provisioning of the results.
Most sites have multiple environments: for instance, development,
test, and production. It should be possible to run the same workflow
in these environments, with different parameters appropriate for
each environment. With such a system in place, you can have strong
confidence that the programs your developers and testers work on
will hold up in production.
Currently, most sites create workflows through a programming language. Some developers use Java because that’s the basic way of creating jobs for Hadoop and related tools. Most use popular scripting languages such as Python, or simply the Unix shell. However, not all formats handled by Hadoop are supported by all languages. Libraries are continually being added to fill the gap, but you are likely to find a need to incorporate a Java program to format data into your workflow. One advantage of using a programming or scripting language is that you can use source control and testing as you would on any program.
Ideally, users without a technical background could construct and
launch their own workflows. To enable this, Zaloni provides a
graphical user interface where users can drag and drop predefined
workflows, connect them by dragging arrows between them, and
then schedule the job.
Job failures, as mentioned before, may sometimes be handled by rerunning the job at various levels of your system, but you’ll have to plan what to do if the job can’t recover from an error. Thus, workflows should send notifications on important events, particularly success or failure. They should also embody rules to decide when to skip a record, or when to stop entirely.
For instance, suppose you have two rules during data preparation, one making sure that the input is a number and the other making sure it’s within an allowed range. If the input isn’t a number, it would be meaningless to check it against a range, and there is no point in running the second rule.
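
A tiny sketch of that kind of rule ordering (the field name and allowed range are assumptions): the range rule only runs if the numeric rule passed.

    # Tiny sketch of dependent data-preparation rules: the range check only runs if the
    # numeric check passed. The field name and allowed range are illustrative.
    def check_amount(value: str, low: float = 0.0, high: float = 10_000.0):
        errors = []
        try:
            number = float(value)
        except ValueError:
            errors.append("not a number")
            return errors              # skip the range rule; it would be meaningless
        if not (low <= number <= high):
            errors.append("out of allowed range")
        return errors

    print(check_amount("129.95"))      # []
    print(check_amount("abc"))         # ['not a number']
    print(check_amount("-5"))          # ['out of allowed range']
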
After a run, reports can include lots of useful statistics in addition to
success or failure. How many records were dropped because they
were corrupt? Were input files missing? What were the percentages
of such errors, in relation to the whole job?
Such information can be stored in logs and then processed by the operations team to produce web displays and dashboards. You’ll want to track errors on several levels: for a single job, for a collection of jobs, and over time. That way you can tell whether your input data is slipping in quality, and whether your tools are doing as good a job as they did on the data where you first ran them.
Your metadata catalog can prove especially valuable at the error stage. The operations team should be able to see from a log or other report where the problem occurred (which file, which record) and go back to the original data to diagnose the cause.
Tags and watermarks enable this forensic research. The tag you
assign to a particular column in a particular source should last
throughout the pipeline and appear in the log entry that reports a
problem.

Access Control
We have seen that access control is crucial for organizational safety, privacy, and regulatory compliance. Large organizations achieve security by dividing users into groups—research teams, operations teams, etc.—and grouping data into resources with access rights. Then you can grant users or groups access to particular data resources. For instance, one research team may be researching the effectiveness of a website, so you can grant it access to all logs and data about the website without letting it see other things, such as sales data.
One site I talked to isolated personally identifiable information (PII) through a hybrid solution. It’s often easy to tell by the column name whether data is personally identifiable, and to route such columns to a different repository with different access rights. Sometimes a processor needs to tag data with special identifiers so that it is routed later to the secure repository. Each stage, including the analytics, can be restricted to the repositories that don’t contain PII.
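
A simple sketch of the column-name heuristic (the patterns and the record are assumptions; a real system would also honor tags from the metadata catalog):

    # Simple sketch of routing columns by name: fields whose names look like PII go to a
    # secure repository, everything else to the general one. Patterns are illustrative;
    # a real system would also honor tags from the metadata catalog.
    import re

    PII_PATTERN = re.compile(r"(ssn|social|email|phone|dob|birth|name|address)", re.IGNORECASE)

    def split_record(record: dict):
        pii, general = {}, {}
        for column, value in record.items():
            (pii if PII_PATTERN.search(column) else general)[column] = value
        return pii, general

    pii, general = split_record({
        "customer_email": "jane@example.com",
        "order_total": 42.50,
        "ship_date": "2015-09-01",
    })
    print("secure repository:", pii)
    print("general repository:", general)
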
In another sub-optimal type of operating environment, teams keep
their data and Hadoop jobs separate within silos or “data puddles.”
With appropriate access controls, your organization should be able
to save money, increase security, and leverage your data better by
managing it all in a systematic manner.

Conclusion
A recent report2 found that governments and other organizations
are opening up large quantities of data, but many of the companies
who could benefit from it don’t know it exists. The same problem
can happen within your own organization.
Hadoop, at its core, is a file system and a set of libraries to process large quantities of data. Management of that data—ingestion, data preparation, job scheduling, and access rights—must be addressed by other tools. Tools such as Sqoop and YARN are emerging in the open source community to pick off various pieces of the data management problem. You should use robust open source tools where they are available and keep data in transparent formats so that it can be submitted to these tools, while taking advantage of commercial products aimed at the data lake.
You’re spending a lot of money to accumulate and store data. Therefore, the people who need the data must be able to find it and combine it quickly into analytic jobs that produce useful insights.

Recognizing the specific tasks you need for acquisition and ingestion, cataloguing, data cleaning, and analytical jobs can help you prepare for the problems you’ll encounter in these phases and have production-ready solutions at hand. Workflows and access control contribute important management solutions across the entire system. All that shiny data is there for your users to enjoy—make it a pleasure for them.

2 (PDF)



About the Author
Andy Oram is an editor at O’Reilly Media. An employee of the
company since 1992, Andy currently specializes in programming
and health IT. His work for O’Reilly includes the first books ever
published commercially in the United States on Linux, and the 2001
title Peer-to-Peer.



