Managing the Data Lake
Moving to Big Data Analysis
Andy Oram


Managing the Data Lake
by Andy Oram
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department: 800-998-9938.

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Karen Montgomery
September 2015: First Edition


Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Managing the Data Lake and related trade dress are trademarks of O’Reilly
Media, Inc.
While the publisher and the author have used good faith efforts to ensure that
the information and instructions contained in this work are accurate, the
publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
Cover photo credit: “55 Flying Fish” by Michal (Flickr).
978-1-491-94168-3
[LSI]


Chapter 1. Moving to Big Data Analysis
Can you tell by sailing the surface of a lake whether it has been well
maintained? Can local fish and plants survive? Dare you swim? And how
about the data maintained in your organization’s data lake? Can you tell
whether it’s healthy enough to support your business needs?
An increasing number of organizations maintain fast-growing repositories of
data, usually from multiple sources and formatted in multiple ways, that are
commonly called “data lakes.” They use a variety of storage and processing
tools — especially in the Hadoop family — to extract value quickly and
inform key organizational decisions.
This report looks at the common needs that modern organizations have for
data management and governance. The MapReduce model — introduced in
2004 in a paper1 by Jeffrey Dean and Sanjay Ghemawat — completely
overturned the way the computing community approached big data analysis.
Many other models, such as Spark, have come since then, creating
excitement and seeing eager adoption by organizations of all sizes to solve
the problems that relational databases were not suited for. But these
technologies bring with them new demands for organizing data and keeping
track of what you’ve got.
I take it for granted that you understand the value of undertaking a big data
initiative, as well as the value of a framework such as Hadoop, and are in the
process of transforming the way you manage your organization’s data. I have
interviewed a number of experts in data management to find out the common
challenges you are about to face, so you can anticipate them and put solutions
in place before you find yourself overwhelmed.
Essentially, you’ll need to take care of challenges that never came up with
traditional relational databases and data warehouses, or that were handled by
the constraints that the relational model placed on data. There is wonderful
value in those constraints, and most of us will be entrusting data to relational
systems for the foreseeable future. But some data tasks just don’t fit. And
once you escape the familiarity and safety of the relational model, you need
other tools to manage the inconsistencies, unpredictability, and breakneck
pace of the data you’re handling.
The risk with the new tools is ending up with many disparate sources of data — and
perhaps multiple instances of Hadoop or other analytics systems operating
inefficiently — which in turn causes you to lose track of basic
information you need to know about your data. This makes it hard to set up
new jobs that could provide input to the questions you urgently need to
answer.
The fix is to restore some of the controls you had over old data sources
through careful planning and coding, while still being flexible and responsive
to fast-moving corporate data needs.
The main topics covered in this report are:
Acquisition and ingestion
Data nowadays comes from many different sources: internal business
systems, product data from customers, external data providers, public
data sets, and more. You can’t force everyone to provide the data in a
format that’s convenient for you. Nor can you take the time (as in the
old days) to define strict schemas and enter all data into schemas. The
problems of data acquisition and ingestion have to be solved with a
degree of automation.
Metadata (cataloguing)
Questions such as who provided the data, when it came in, and how it
was formatted — a slew of concerns known as lineage or provenance —
are critical to managing your data well. A catalog can keep this metadata
and make it available to later stages of processing.
Data preparation and cleaning
Just as you can’t control incoming formats, you can’t control data
quality. You will inevitably deal with data that does not conform. Data
may be missing, entered in diverse formats, or contain errors. In
addition, data might be lost or corrupted because sensors run out of
battery power, networks fail, software along the way harbored a bug, or
the incoming data had an unrecognized format. Some data users estimate
that detecting and cleaning up these anomalies takes 90% of their time.
Managing workflows
The actual jobs you run on data need to be linked with the three other
stages just described. Users should be able to submit jobs of their own,
based on the work done by experts before them, to handle ingestion,
cataloguing, and cleaning. You want staff to quickly get a new
visualization or report without waiting weeks for a programmer to code
it up.
Access control
Data is the organization’s crown jewels. You can’t give everybody
access to all data. In fact, regulations require you to restrict access to
sensitive customer data. Security and access controls are therefore
critical at all stages of data handling.


Why Companies Move to Hadoop
To set the stage for exploration of data management, it is helpful to remind
ourselves of why organizations are moving in the direction of big data tools.
Size
“Volume” is one of the main aspects of big data. Relational databases
cannot scale beyond a certain volume due to architecture restrictions.
Organizations find that data processing in relational databases takes too
long, and as they do more and more analytics, such processing with
conventional ETL tools becomes such a big time sink that it holds users
back from making full use of the data.
Variety
Typical sources include flat files, RDBMSes, logs from web servers,
devices and sensors, and even legacy mainframe data. Sometimes you also
want to export data from Hadoop to an RDBMS or other repository.
Free-form data
Some data may be almost completely unstructured, as in the case of
product reviews and social media postings. Other data will come to you
inconsistently structured. For instance, different data providers may
provide the same information in very different formats.
Streaming data
If you don’t keep up with changes in the world around you, it will pass
you by — and probably reward a competitor who does adapt to it.
Streaming has evolved from a few rare cases, such as stock markets and
sensor data, to everyday data such as product usage data and social media.
Fitting the task to the tool
Data maintained in relational databases — let alone cruder storage
formats, such as spreadsheets — is structured well for certain analytics.
But for new paradigms such as Spark or the MapReduce model, preparing
data can take more time than doing the analytics. Data in normalized
relational form resides in many different tables and must be combined
into a format that the analytics engine can process efficiently.
Frequent failures
Modern processing systems such as Hadoop contain redundancy and
automatic restart to handle hardware failures or software glitches. Even
so, you can expect jobs to be aborted regularly by bad data. You’ll want
to get notifications when a job finishes successfully or unsuccessfully.
Log files should show you what goes wrong, and you should be able to
see how many corrupted rows were discarded and what other errors
occurred.
Unless you take management into consideration in advance, you end up
unable to make good use of this data. One example comes from a telecom
company whose network generated records about the details of phone calls
for monthly billing purposes. Their ETL system didn’t ingest data from calls
that were dropped or never connected, because no billing was involved. So
years later, when they realized they should be looking at which cell towers
had low quality, they had no data with which to do so.
A failure to collect or store data may be an extreme example of management
problems, but other hindrances — such as storing it in a format that is hard to
read, or failing to remember when it arrived — will also slow down
processing to the point where you give up opportunities for learning insights
from your data.
When the telecom company just mentioned realized that they could use
information on dropped and incomplete calls, their ETL system required a
huge new programming effort and did not have the capacity to store or
process the additional data. Modern organizations may frequently get new
sources of data from brokers or publicly available repositories, and can’t
afford to spend time and resources doing such coding in order to integrate
them.
In systems with large, messy data, you have to decide what the system should
do when input is bad. When do you skip a record, when do you run a
program to try to fix corrupted data, and when do you abort the whole job?
A minor error such as a missing ZIP code probably shouldn’t stop a job, or
even prevent that record from being processed. A missing customer ID,
though, might prevent you from doing anything useful with the data. (There
may be ways to recover from these errors too, as we’ll see.)
Your choice depends of course on your goal. If you’re counting sales of a
particular item, you don’t need the customer ID. If you want to update
customer records, you probably do.
A more global problem with data ingestion comes when someone changes the
order of fields in all the records of an incoming data set. Your program might
be able to detect what happened and adjust, or might have to abort.
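As a concrete illustration, here is a minimal Python sketch of that kind of check, assuming a delimited feed with a header row; the column names are invented for the example. It distinguishes a reordering the program can adapt to from a missing column that should abort the job.

```python
import csv

EXPECTED_COLUMNS = ["customer_id", "order_date", "zip_code", "amount"]  # hypothetical schema

def check_header(path):
    """Compare the file's header row against the expected column order.

    Returns a mapping from expected column name to its actual position,
    or raises if a column is missing entirely (in which case we abort).
    """
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"columns missing, aborting ingestion: {missing}")
    if header != EXPECTED_COLUMNS:
        # Same fields, different order: we can remap instead of aborting.
        print("field order changed; remapping columns")
    return {c: header.index(c) for c in EXPECTED_COLUMNS}
```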
At some point, old data will pile up and you will have to decide whether to
buy more disk space, archive the data (magnetic tape is still in everyday use),
or discard it. Archiving or discarding has to be automated to reduce errors.
You’ll find old data surprisingly useful if you can manage to hold on to it.
And of course, having it readily at hand (instead of on magnetic tape) will
permit you to quickly run analytics on that data.



Acquisition and Ingestion
At this point we turn to the steps in data processing. Acquisition comes first.
Nowadays it involves much more than moving data from an external source
to your own repository. In fact, you may not be storing every source you get
data from at all: you might accept streams of fast-changing data from sensors
or social media, process them right away, and save only the results.
On the other hand, if you want to keep the incoming data, you may need to
convert it to a format understood by Hadoop or other processing tools, such
as Avro or Parquet.
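As a sketch of that conversion step, the snippet below uses the pyarrow library (assumed to be installed) to rewrite a delimited file as Parquet; the file names are placeholders.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the raw delimited feed and store it in a columnar, splittable format
# that Hadoop-family tools can process efficiently.
table = pv.read_csv("incoming/orders_2015-09-01.csv")
pq.write_table(table, "datalake/raw/orders_2015-09-01.parquet")
```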
The health care field provides a particularly complex data collection case.
You may be collecting:
Electronic health records from hospitals using different formats
Claims data from health care providers or payers
Profiles from health plans
Data from individuals’ fitness devices
Electronic health records illustrate the variety and inconsistency of all these
data types. Although there are standards developed by the HL7 standards
group, they are implemented differently by each EHR vendor. Furthermore,
HL7 exchanges data through several messaging systems that differ from any
other kind of data exchange used in the computer field.
In a situation like this, you will probably design several general methods of
ingesting data: one to handle the HL7 messages from EHRs, another to
handle claims data, and so on. You’ll want to make it easy for a user to
choose one of these methods and adjust parameters such as source,
destination file, and frequency in order to handle a new data feed.
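One simple way to express such a reusable method is a small parameter object, so that handling a new feed means filling in values rather than writing code. This is only a sketch; the field names and example values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class IngestFeed:
    """Parameters a user fills in to reuse a general ingestion method."""
    method: str        # e.g. "hl7_messages" or "claims_flat_file"
    source: str        # URI or directory the feed arrives in
    destination: str   # target path in the data lake
    frequency: str     # e.g. "hourly" or "daily"

# A new claims feed becomes a new parameter set, not new code.
new_feed = IngestFeed(
    method="claims_flat_file",
    source="sftp://partner.example.com/claims/",
    destination="/datalake/raw/claims/",
    frequency="daily",
)
```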
Successful ingestion requires you to know in detail how the data is coming
in. Read the documentation carefully: you may find that the data doesn’t
contain what you wanted at all, or needs complex processing to extract just
what you need. And the documentation may not be trustworthy, so you have
to test your ingestion process on actual input.
As mentioned earlier, you may be able to anticipate how incoming data
changes — such as reordered fields — and adapt to it. However, there are
risks to doing this. First, your tools become more complicated and harder to
maintain. Second, they may misread a change and silently make the wrong
adjustment.
Another common ingestion task is to create a consolidated record from
multiple files of related information that are used frequently together — for
example, an Order Header and Details merged into one file. Hadoop has a
particular constraint on incoming data: it was not designed for small files.
Input may consist of many small files, but submitting them individually will
force a wasteful input process onto Hadoop and perhaps even cause a failure.
For this reason, it is recommended that you combine these small files into a
single large file before processing, so that the Hadoop cluster is used more
efficiently.
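A minimal sketch of that consolidation step follows, assuming the small files share a format and carry no header rows; the paths and file pattern are placeholders.

```python
import glob
import shutil

# Concatenate many small daily files into one larger file before handing it to Hadoop.
small_files = sorted(glob.glob("incoming/sensor_readings_*.csv"))

with open("staging/sensor_readings_combined.csv", "wb") as out:
    for path in small_files:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)   # stream each small file into the big one
```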
This example highlights an important principle governing all the processing
discussed in this report: use open formats if possible, and leverage everything
the open source and free software communities have made available. This
will give you more options, because you won’t be locked into one vendor.
Open source also makes it easier to hire staff and get them productive
quickly.
However, current open source tools don’t do everything you need. You’ll
have to fill in the gaps with commercial solutions or hand-crafted scripts.
For instance, Sqoop is an excellent tool for importing data from a relational
database to Hadoop and supports incremental loads. However, building a
complete insert-update-delete solution to keep the Hive table in sync with the
RDBMS table would be a pretty complex task. Here you might benefit from
Zaloni’s Bedrock product, which offers a Change Data Capture (CDC) action
that handles inserts, updates, and deletes and is easy to configure.


Metadata (Cataloguing)
Why do you need to preserve metadata about your data? Reasons for doing so
abound:
For your analytics, you will want to choose data from the right place and
time. For instance, you may want to go back to old data from all your
stores in a particular region.
Data preparation and cleaning require a firm knowledge of which data set
you’re working on. Different sets require different types of preparation,
based on what you have learned about them historically.
Analytical methods are often experimental and have some degree of error.
To determine whether you can trust results, you may want to check the
data that was used to achieve the results, and review how it was processed.
When something goes wrong at any stage, from ingestion through processing,
you need to quickly pinpoint the data causing the problem. You must also
identify the source so you can contact them and make sure the problem
doesn’t recur in future data sets.
In addition to cleaning data and preventing errors, you may have other
reasons related to quality control to preserve the lineage or provenance of
data.
Access has to be restricted to sensitive data. If users deliberately or
inadvertently try to start a job on data they’re not supposed to see, your
system should reject the job.
Regulations may mandate the access restrictions mentioned in the previous
bullet, as well as impose other requirements that depend on the data source.
Licenses may require access restrictions and other special treatment of
some data sources.



Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single
source of truth” from the diverse data sets you take in. By creating a data
catalog, you can store this metadata for use by downstream programs.
Zaloni divides metadata roughly into three types:
Business metadata
This can include the business names and descriptions that you assign to
data fields to make them easier to find and understand. For instance, the
technical staff may have a good reason to assign the name loc_outlet to a
field that represents a retail store, but you will want users to be able to
find it through common English words. This kind of metadata also
covers business rules, such as putting an upper limit (perhaps even a
lower limit) on salaries, or determining which data must be removed
from some jobs for security and privacy.
Operational metadata
This is generated automatically by the processes described in this report,
and includes such things as the source and target locations of data, file
size, number of records, how many records were rejected during data
preparation or a job run, and the success or failure of that run itself.
Technical metadata
This includes the data’s type and format (text, images, JSON, Avro, etc.)
and the structure or schema. This structure includes the names of fields,
their data types, their lengths, whether they can be empty, and so on.
Structure is commonly provided by a relational database or the headings
in a spreadsheet, but may also be added during ingestion and data
preparation. Zaloni’s Bedrock integrates with Apache HCatalog for
technical metadata so that other tools in the Hadoop ecosystem can take
advantage of the structure definition.
As suggested in the previous list, one can also categorize metadata by the
way it is gathered:
Some metadata is embedded in the data, such as the schema in a relational
database.


Some metadata pertains to the data acquisition process: the source of the
data, filename, time of creation, time of acquisition, file size, redundancy
checks generated to make sure the transmission was not corrupted, and
MD5 hashes generated to uniquely identify a file.
Some metadata is created during ingestion. For instance, a watermark can
be added to a file or to a column within the file. If you take JSON or other
relatively unstructured data and create a schema around it, that schema
becomes part of the metadata.
Some metadata is created during a job run, such as the number of records
successfully processed, the number of bad fields or bad records, and how
long a job took.
The next question is how to create metadata. Many tools can extract the easy
stuff, such as file sizes and timestamps, as the stages of processing proceed.
Other metadata requires custom-written programs that do such things as tag
particular data fields you’ll want to extract later.
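For illustration, here is a small Python sketch of a custom step that records acquisition metadata of the kinds listed above (source, filename, size, arrival time, and an MD5 hash) as a catalog entry. The field names and the JSON catalog format are assumptions for the example, not a prescribed schema.

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def catalog_entry(path, source):
    """Build a catalog record for one acquired file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            md5.update(chunk)
    return {
        "source": source,
        "filename": os.path.basename(path),
        "file_size": os.path.getsize(path),
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "md5": md5.hexdigest(),          # identifies the file's contents
    }

# The entry could then be appended to whatever catalog store you use.
print(json.dumps(catalog_entry("incoming/orders_2015-09-01.csv", "webstore"), indent=2))
```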
At any stage of processing, you may choose to update the metadata. Each
stage can also consult the metadata when applying rules for user access,
cleaning, and submitting data to jobs. We’ll see later how, at least in theory,
storing feedback in metadata can create an environment of continuous quality
improvement.
Currently, one of the huge challenges in data management is communicating
metadata to downstream parts of a workflow. A good deal of Zaloni
Bedrock’s benefits rest on its ability to do this conveniently. Work is just
starting on an open source project named Apache Atlas, which addresses some
of these issues as well.



Data Preparation and Cleaning
Assume that your data will come with its share of errors, corrupted
formats, and duplicates. I’m not using “assume” in a hypothetical sense here
— you had better assume the presence of errors or you will be blindsided
when they happen.
What will be the impacts of such errors? Suppose data transfers don’t
complete, for instance? Your workflows should be able to handle the most
common problems, and you’ll need to research your data feeds to discover
those problems.
A sense of what you can run into comes, like several other examples in this
report, from health care. The US government’s Center for Medicare &
Medicaid Services (CMS), which covers a large percentage of health care
payments in the country, requires participating health care providers to
submit quality data in a format called the Healthcare Effectiveness Data and
Information Set (HEDIS). This format is strict, demanding, and absolutely
gigantic. Fields that get mixed up or have incorrect coding cost huge amounts
of money as providers rush to fix them.
Why is HEDIS hard to fill out? Because the data is drawn from reports that
undergo many processing steps, in paper or electronic form. You would not
want your organs to pass through as many hands during surgery as HEDIS
data does. The doctor’s original note is processed by a business office within
the provider, after which it is sent to an outside billing service because payer
requirements are so strict and complicated. The forms then go to the insurer,
who may question the claim and send it back through the route on which it
came.
The trek may undergo several iterations, taking months. As the health care
provider strives to get payment, data loss and coding errors are likely to
creep in.

Rest assured, therefore, that your data will need processing and cleaning.
There are two types of fixes that require different responses from your
organization: fixes that can be done on a single piece of data and fixes that
require analytics to be run on large data sets.
Note that even a fix on a single piece of data may be developed by analytics
carried out within your organization, or a vendor. For instance, research can
show that the state of California is commonly represented as Ca, CA, Cal, or
Cali in data sets. A simple programming check, using fixed strings or regular
expressions, can identify the various possible values and harmonize them on
a single standard, such as CA.
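A minimal sketch of such a check might look like the following; the list of variants is illustrative rather than exhaustive.

```python
import re

# Variants observed in the data, harmonized to the standard two-letter code.
CALIFORNIA = re.compile(r"^\s*(ca|cal|cali|calif|california)\s*$", re.IGNORECASE)

def harmonize_state(value):
    """Return the standard code for recognized variants, else the input unchanged."""
    return "CA" if CALIFORNIA.match(value) else value

assert harmonize_state("Cali") == "CA"
assert harmonize_state("NY") == "NY"
```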
Similar research can help with the HL7 example I cited earlier, where
different vendors implement a standard differently and put data in different
places. Once you identify how a particular vendor codes an address, you can
write a program to read it into the format of your choice. This program must
be updated, of course, if the vendor changes their coding, which will probably
happen without notice. That is a good reason to keep running analytics.
A missing customer ID probably can’t be fixed by examining a single record,
although it is possible you’ll discover the ID entered into a different field of
the record. More likely, you’ll run a job to match customers by name, gender,
address, and other characteristics. You can probably find a record in a
different data set and be able to trust, with a good deal of confidence, that it’s
the customer with the missing ID.
A job can identify two records that refer to the same customer. This mistake
often happens when combining data sets from different sources. It could also
happen out in the real world for many reasons: the customer changed his
name, moved to a new address, decided to use a different email address, got a
misspelled name because someone entered it into the system sloppily, etc.
Another example where a job can help enforce quality is checking city names
against ZIP codes in US addresses. If you find two cities with the same ZIP
code in your data, at least one is incorrect. Every ZIP code in the U.S. is
assigned to only one city (although a city can have many ZIP codes).
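A simple job along those lines can group records by ZIP code and flag any code that appears with more than one city name. The records below are invented for the example, including a deliberate misspelling.

```python
from collections import defaultdict

records = [
    {"zip": "94103", "city": "San Francisco"},
    {"zip": "94103", "city": "San Fransisco"},   # same ZIP, two spellings: at least one is wrong
    {"zip": "10001", "city": "New York"},
]

cities_by_zip = defaultdict(set)
for r in records:
    cities_by_zip[r["zip"]].add(r["city"])

for zip_code, cities in cities_by_zip.items():
    if len(cities) > 1:
        print(f"ZIP {zip_code} appears with multiple cities: {sorted(cities)}")
```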
To decide what needs to be cleaned up and how, work with the business team
and come up with rules for data quality. When you check individual records,
typical rules might include:


Data older than a certain age should be discarded, or marked as less
trustworthy because it might have changed.
Certain fields must not be empty. An empty field may be hard to identify
because some people enter meaningless strings such as X or 9999 when
they don’t know something. Sometimes you can find the data elsewhere
and fill it in, but sometimes you’ll choose to reject the whole record.
Dates and times must be correct, and must be in a standard format.
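To make such rules concrete, here is a minimal sketch of record-level checks in Python; the field names, placeholder strings, and age threshold are assumptions for the example, not a real schema.

```python
from datetime import datetime

PLACEHOLDERS = {"", "X", "9999", "N/A"}   # strings people enter when they don't know a value

def check_record(record, max_age_days=730):
    """Apply simple quality rules to one record (a dict); return a list of problems."""
    problems = []
    if record.get("customer_id", "").strip() in PLACEHOLDERS:
        problems.append("missing customer_id")
    try:
        entered = datetime.strptime(record.get("entry_date", ""), "%Y-%m-%d")
        if (datetime.now() - entered).days > max_age_days:
            problems.append("record older than allowed age")
    except ValueError:
        problems.append("entry_date not in standard YYYY-MM-DD format")
    return problems

print(check_record({"customer_id": "9999", "entry_date": "2015/09/01"}))
```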
Many commercial tools provide built-in functions to do common checks and
even make fixes, but many sites write filters of their own at least part of the
time.
In addition to checking each field, you usually need some higher-level checks
that involve files and metadata. For instance, did incoming data conform to
the schema you expected? Are you getting two identical files? Comparing the
MD5 hashes generated on the files is a simple way to find out.
The data preparation stage is often where sensitive data, such as financial and
health information, is protected. Although terms for this differ, most systems
distinguish two types of protection: removing a field completely (often called
masking) and changing the field to something innocuous (often called
tokenization). As an example of tokenization, test data sets substitute realistic
but fake names for real names so that developers can test their code against
these sets.

Another kind of tokenization is to run the value from a field through a one-way
hash (such as MD5), which ensures that the same value is always
represented by the same hash, but prevents anyone from deriving the original
value. This is a type of pseudonymity.
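A minimal sketch of that kind of tokenization follows. It uses a salted SHA-256 rather than plain MD5, a common substitution that makes the tokens harder to reverse by brute force; the salt and sample value are placeholders.

```python
import hashlib

def tokenize(value, salt="per-project-secret"):
    """Replace a sensitive field with a stable one-way token.

    The same input always yields the same token, so joins still work,
    but the original value cannot be read back out of the token.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

print(tokenize("jane.doe@example.com"))
```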
Often, you need to bring a human back into the loop to clean data. A system
may suggest that 1-800-REDCROSS is an incorrect phone number because it
contains letters (and is one character too long, as well), but a human observer
can tell the system to accept it. Over time, the system picks up more and
more such information and becomes smarter, even when processing new data
sets.
One of the most interesting experiments currently taking place in big data
research is a form of continuous quality improvement, according to Ihab
Ilyas, professor at the University of Waterloo and co-founder of Tamr. A
program analyzes the data to find error patterns and develop some rules to
restore consistency. It can then use these rules to catch current and future
errors at earlier stages of data processing, and perhaps even fix or suggest
fixes to the errors.


Managing Workflows
You have designed your filters and jobs for ingestion, cataloguing metadata,
data preparation, and Hadoop itself. Can you make regular, productive use of
all these things? That depends on how easily you can combine the tasks in
end-to-end workflows.
First, you should make workflows for each task. How is data from a
particular source ingested? Do you have a general workflow to which you
can just assign parameters such as the source and type of data?
And how is the workflow triggered? Forcing someone to launch the job
manually is a waste of staff time, and prone to errors.

You could do something as simple as schedule a job at regular intervals.
(Unix and Linux provide cron for that purpose.) YARN is an open source
tool that helps with resource allocation and scheduling. Resource allocation
gets particularly complex in the cloud. You want to ensure you can get enough
systems, with enough capacity, to meet your turnaround time, while avoiding
the risk of jobs growing to an enormous, costly scale.
Your workflow processor should also be able to handle triggers, so that when
something important happens like the arrival of new data, the job launches on
its own. For instance, AWS Data Pipeline lets you specify that a job starts
whenever a particular file is uploaded to S3 storage. The open source Oozie
project can also start a job based on the availability of data.
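In the absence of such a tool, even a bare-bones polling loop can act as a trigger. The sketch below only illustrates the idea, with placeholder paths and a placeholder launch command; it is not a substitute for Oozie or Data Pipeline.

```python
import glob
import subprocess
import time

seen = set()

# Poll a landing directory and launch the ingestion workflow for each new file.
while True:
    for path in glob.glob("/landing/claims/*.csv"):
        if path not in seen:
            seen.add(path)
            subprocess.run(["python", "ingest_claims.py", path], check=False)
    time.sleep(60)   # check once a minute
```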
Scheduling should also be flexible. One site I talked to sometimes delays a
workflow for a few hours when the servers are at capacity.
With small workflows in place, you should be able to compose larger
workflows from sub-workflows. In that way you can robustly construct a
single workflow covering data acquisition, ingestion (putting it in the right
repository), cleansing, format conversion, enrichment, and provisioning of
the results.
Most sites have multiple environments: for instance, development, test, and
production. It should be possible to run the same workflow in all of these
environments, with parameters appropriate to each one.
With such a system in place, you can have strong confidence that the
programs your developers and testers work on will hold up in production.
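A minimal sketch of that idea: one code path, with a parameter set chosen per environment. The paths and values are illustrative.

```python
# One workflow, different parameters per environment.
ENVIRONMENTS = {
    "dev":  {"source": "/dev/landing/",  "target": "/dev/lake/",  "max_workers": 2},
    "test": {"source": "/test/landing/", "target": "/test/lake/", "max_workers": 4},
    "prod": {"source": "/prod/landing/", "target": "/prod/lake/", "max_workers": 32},
}

def run_workflow(environment):
    params = ENVIRONMENTS[environment]
    print(f"reading {params['source']}, writing {params['target']}, "
          f"using {params['max_workers']} workers")

run_workflow("dev")   # the identical code path later runs with "prod"
```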
Currently, most sites create workflows through a programming language.
Some developers use Java because that’s the basic way of creating jobs for
Hadoop and related tools. Most use popular scripting languages such as
Python or simply the Unix shell. However, not all formats handled by
Hadoop are supported by all languages. Libraries are continually being added
to fill the gap, but you are likely to find a need to incorporate a Java program
to format data into your workflow. One advantage of using a programming or
scripting language is that you can use source control and testing as you would
on any program.
Ideally, users without a technical background could construct and launch
their own workflows. To enable this, Zaloni provides a graphical user
interface where users can drag and drop predefined workflows, connect them
by dragging arrows between them, and then schedule the job.
Job failures, as mentioned before, may sometimes be handled by rerunning
the job at various levels of your system, but you’ll have to plan what to do if
the job can’t recover from an error. Thus, workflows should send
notifications on important events, particularly success or failure. They should
also embody rules to decide when to skip a record, or when to stop entirely.
For instance, suppose you have two rules during data preparation, one
making sure that the input is a number and the other making sure it’s within
an allowed range. If the input isn’t a number, it would be meaningless to
check it against a range, and there is no point to running the second rule.
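A small sketch of such rule chaining, with illustrative rules and an illustrative range, might look like this:

```python
def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def in_range(value, low=0, high=150):
    return low <= float(value) <= high

RULES = [("must be a number", is_number), ("must be in allowed range", in_range)]

def validate(value):
    """Run rules in order and stop at the first failure."""
    for description, rule in RULES:
        if not rule(value):
            return f"rejected: {description}"   # later rules are never evaluated
    return "accepted"

print(validate("abc"))    # fails the first rule; the range check never runs
print(validate("200"))    # numeric but out of range
print(validate("42"))
```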
After a run, reports can include lots of useful statistics in addition to success
or failure. How many records were dropped because they were corrupt? Were
input files missing? What were the percentages of such errors, in relation to
the whole job?
Such information can be stored in logs and then processed by the operations
team to produce web displays and dashboards. You’ll want to track errors on
several levels: for a single job, for a collection of jobs, and over time. That
way you can tell whether your input data is slipping in quality, and whether
your tools are doing as good a job as they did on the data where you first ran
them.
Your metadata catalog can prove valuable at the error stage. The operations
team should be able to see from a log or other report where the problem
occurred (which file, which record) and go back to the original data to
diagnose the cause.
Tags and watermarks enable this forensic research. The tag you assign to a
particular column in a particular source should last throughout the pipeline
and appear in the log entry that reports a problem.


Access Control
We have seen that access control is crucial for organizational safety, privacy,
and regulatory compliance. Large organizations achieve security by dividing
users into groups — research teams, operations teams, etc. — and grouping
data into resources with access rights.
Then you can grant users or groups access to particular data resources. For
instance, one research team may be researching the effectiveness of a
website, so you can grant it access to all logs and data about the website
without letting it see other things such as sales data.
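As a toy illustration of that model, the sketch below maps groups to the data resources they may read and rejects anything else; the group and resource names are invented for the example.

```python
# Groups map to the data resources they have been granted.
GROUP_RESOURCES = {
    "web_research": {"web_logs", "site_analytics"},
    "finance":      {"sales", "invoices"},
}

def can_access(user_groups, resource):
    """True if any of the user's groups has been granted the resource."""
    return any(resource in GROUP_RESOURCES.get(g, set()) for g in user_groups)

# The web research team can read website data, but a job against sales data is rejected.
assert can_access(["web_research"], "web_logs")
assert not can_access(["web_research"], "sales")
```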
One site I talked to isolated personally identifiable information (PII) through
a hybrid solution. It’s often easy to tell by the column name whether data is
personally identifiable and to route such columns to a different repository with
different access rights. Sometimes a processor needs to tag data with special
identifiers so that it is routed later to the secure repository. Each stage,
including the analytics, can be restricted to the repositories that don’t contain
PII.
In a common but suboptimal type of operating environment, teams keep their data
and Hadoop jobs separate within silos or “data puddles.” With appropriate
access controls, your organization should be able to save money, increase
security, and leverage your data better by managing it all in a systematic
manner.



Conclusion
A recent report2 found that governments and other organizations are opening
up large quantities of data, but many of the companies who could benefit
from it don’t know it exists. The same problem can happen within your own
organization.
Hadoop, at its core, is a file system and a set of libraries to process large
quantities of data. Management of that data — ingestion, data preparation,
job scheduling, and access rights — must be addressed by other tools. Tools
such as Sqoop and YARN are emerging in the open source community to
pick off various pieces of the data management problem. You should use
robust open source tools where they are available and keep data in
transparent formats so that it can be submitted to these tools, while taking
advantage of commercial products aimed at the data lake.
You’re spending a lot of money to accumulate and store data. Therefore, the
people who need the data must be able to find it and combine it quickly into
analytic jobs that produce useful insights.
Recognizing the specific tasks you need for acquisition and ingestion,
cataloguing, data cleaning, and analytical jobs can help you prepare for the
problems you’ll encounter in these phases and have production-ready
solutions at hand. Workflows and access control contribute important
management solutions across the entire system. All that shiny data is there for
your users to enjoy — make it a pleasure for them.
1. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters” (PDF).

2. (PDF)


