
Big Data All-Stars
Real-World Stories and Wisdom
from the Best in Big Data

Presented by Datanami

Sponsored by MapR



Introduction
Those of us looking to take a significant step towards creating
a data-driven business sometimes need a little inspiration from
those who have traveled the path we are looking to tread. This
book presents a series of real-world stories from those on the big
data frontier who have moved beyond experimentation to creating
sustainable, successful big data solutions within their organizations. Read these stories to get an inside look at nine “big data
all-stars” who have been recognized by MapR and Datanami as
having achieved great success in the expanding field of big data.
Use the examples in this guide to help you develop your own
methods, approaches, and best practices for creating big data
solutions within your organization. Whether you are a business
analyst, data scientist, enterprise architect, IT administrator, or
developer, you’ll gain key insights from these big data luminaries—insights that will help you tackle the big data challenges
you face in your own company.





Table of Contents

How comScore Uses Hadoop and MapR to Build its Business

Michael Brown, CTO at comScore
comScore uses MapR to manage and scale their Hadoop cluster of 450 servers, create more files, process more data faster, and produce better streaming and random
I/O results. MapR allows comScore to easily access data in the cluster and just as
easily store it in a variety of warehouse environments.


Making Good Things Happen at Wells Fargo

Paul Cao, Director of Data Services for Wells Fargo’s Capital
Markets business
Wells Fargo uses MapR to serve the company’s data needs across the entire banking
business, which involve a variety of data types including reference data, market data,
and structured and unstructured data, all under the same umbrella. Using NoSQL
and Hadoop, their solution requires the utmost in security, ease of ingest, ability to
scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.


Coping with Big Data at Experian–“Don’t Wait, Don’t Stop”

Tom Thomas, Director of IT at Experian
Experian uses MapR to store in-bound source data. The files are then available for
analysts to query with SQL via Hive, without the need to build and load a structured
database. Experian is now able to achieve significantly more processing power and
storage space, and clients have access to deeper data.


Trevor Mason and Big Data: Doing What Comes Naturally

Trevor Mason, Vice President Technology Research at IRI
IRI used MapR to maximize file system performance, facilitate the use of a large
number of smaller files, and send files via FTP from the mainframe directly to the
cluster. With Hadoop, they have been able to speed up data processing while reducing mainframe load, saving more than $1.5 million.


Leveraging Big Data to Economically Fuel Growth

Kevin McClowry, Director of Analytics Application Development
at TransUnion
TransUnion uses a hybrid architecture made of commercial databases and Hadoop
so that their analysts can work with data in a way that was previously out of reach.
The company is introducing the analytics architecture worldwide and sizing it to fit
the needs and resources of each country’s operation.


Making Big Data Work for a Major Oil & Gas Equipment Manufacturer


Warren Sharp, Big Data Engineer at National Oilwell VARCO (NOV)
NOV created a data platform for time-series data from sensors and control systems
to support deep analytics and machine learning. The organization is now able to build,
test, and deliver complicated condition-based maintenance models and applications.


The NIH Pushes the Boundaries of Health Research with Data Analytics

Chuck Lynch, Chief Knowledge Officer at National Institutes of Health
The National Institutes of Health created a five-server cluster that enables the office
to effectively apply analytical tools to newly-shared data. NIH can now do things
with health science data it couldn’t do before, and in the process, advance medicine.


Keeping an Eye on the Analytic End Game at UnitedHealthcare

Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare
UnitedHealthcare uses Hadoop as a basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims,
prescriptions, plan participants, care providers, and claim review outcomes. They can
now identify mispaid claims in a systematic, consistent way.


Creating Flexible Big Data Solutions for Drug Discovery

David Tester, Application Architect at Novartis Institutes for
Biomedical Research

Novartis Institutes for Biomedical Research built a workflow system that uses
Hadoop for performance and robustness. Bioinformaticians use their familiar tools
and metadata to write complex workflows, and researchers can take advantage of
the tens of thousands of experiments that public organizations have conducted.



How comScore Uses Hadoop and
MapR to Build its Business
Michael Brown
CTO at comScore

When comScore was founded in 1999, Mike Brown, the company’s first
engineer, was immediately immersed in the world of Big Data.
The company was created to provide digital marketing intelligence and digital
media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown’s job was to
create the architecture and design to support the founders’ ambitious plans.
It worked. Over the past 15 years comScore has built a highly successful
business and a customer base composed of some of the world’s top companies—Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC
to name just a few. Overall the company has more than 2,100 clients worldwide. Measurements are derived from 172 countries with 43 markets reported.

To service this extensive client base, well over 1.8 trillion interactions are captured
monthly, equal to about 40% of the monthly page views of the entire Internet.
This is Big Data on steroids.
Brown, who was named CTO in 2012, continues to grow and evolve the company’s IT infrastructure to keep pace with this constantly increasing data deluge.
“We were a Dell shop from the beginning. In 2002 we put together our own grid
processing stack to tie all our systems together in order to deal with the fast
growing data volumes,” Brown recalls.

Introducing Unified Digital Measurement
In addition to its ongoing business, in 2009 the company embarked on
a new initiative called Unified Digital Measurement (UDM), which directly
addresses the frequent disparity between census-based site analytics data and
panel-based audience measurement data. UDM blends these two
approaches into a “best of breed” approach that combines person-level
measurement from the two million person comScore global panel with census
informed consumption to account for 100 percent of a client’s audience.
UDM helped prompt a new round of IT infrastructure upgrades. “The volume of
data was growing rapidly and processing requirements were growing dramatically as well," Brown says. "In addition, our clients were asking us to turn the
data around much faster. So we looked into building our own stack again, but
decided we’d be better off adopting a well accepted, open source, heavy duty
processing model—Hadoop.”
With the implementation of Hadoop, comScore continued to expand its server cluster.
Multiple servers also meant they had to solve the Hadoop shuffle problem. During
the high volume, parallel processing of data sets coming in from
around the world, data is scattered across the server farm. To count the number
of events, all this data has to be gathered, or “shuffled” into one location.
comScore needed a Hadoop platform that could not only scale, but also provide
data protection and high availability while being easy to use.
It was requirements like these that led Brown to adopt the MapR distribution
for Hadoop.
He was not disappointed—by using the MapR distro, the company is able
to more easily manage and scale their Hadoop cluster, create more files and
process more data faster, and produce better streaming and random I/O results
than other Hadoop distributions. “With MapR we see a 3X performance increase
running the same data and the same code—the jobs just run faster.”
In addition, the MapR solution provides the requisite data protection and
disaster recovery functions: “MapR has built in to the design an automated DR
strategy,” Brown notes.

Solving the Shuffle
He said they leveraged a feature in MapR known as volumes to directly address the shuffle problem. “It allows us to make this process run superfast. We
reduced the processing time from 36 hours to three hours—no new hardware,
no new software, no new anything, just a design change. This is just what we
needed to colocate the data for efficient processing.”
Using volumes to optimize processing was one of several unique solutions that
Brown and his team applied to processing comScore’s massive amounts of data.
Another innovation is pre-sorting the data before it is loaded into the Hadoop
cluster. Sorting optimizes the data’s storage compression ratio, from the usual
ratio of 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade
of benefits: more efficient processing with far fewer IOPS, less data to read
from disk, and less equipment which, in turn, means savings on power, cooling
and floor space.
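The effect Brown describes can be reproduced in miniature. The sketch below, which uses a made-up record layout and standard-library gzip rather than comScore's actual pipeline, compresses the same synthetic event records twice; pre-sorting groups similar values together, so the compressor finds longer repeated runs and the sorted copy shrinks further. The exact ratios depend on the data and the codec.

    import gzip
    import json
    import random

    # Synthetic "event" records with a made-up layout; real comScore data differs.
    random.seed(7)
    records = [
        {"site": f"site{random.randint(0, 199)}",
         "event": random.choice(["pageview", "click", "video"]),
         "ts": random.randint(1_400_000_000, 1_400_086_400)}
        for _ in range(200_000)
    ]

    def gzip_size(rows):
        """Serialize one record per line and return the gzip-compressed size in bytes."""
        payload = "\n".join(json.dumps(r, sort_keys=True) for r in rows).encode("utf-8")
        return len(gzip.compress(payload))

    unsorted_size = gzip_size(records)
    presorted_size = gzip_size(sorted(records, key=lambda r: (r["site"], r["event"], r["ts"])))

    print(f"unsorted:   {unsorted_size:,} bytes")
    print(f"pre-sorted: {presorted_size:,} bytes")   # smaller: similar values now sit together
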
“HDFS is great internally,” says Brown. “But to get data in and out of Hadoop,
you have to do some kind of HDFS export. With MapR, you can just mount
HDFS as NFS and then use native tools whether they’re in Windows, Unix, Linux
or whatever. NFS allowed our enterprise to easily access data in the cluster and
just as easily store it in a variety of warehouse environments.”
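Because the cluster appears as an ordinary NFS mount point, everyday file APIs work against it with no export step. The sketch below illustrates the idea in plain Python; the mount path and directory names are hypothetical and depend on how the cluster is mounted on the client machine.

    import shutil
    from pathlib import Path

    # Hypothetical NFS mount point for the cluster; the real path depends on the
    # cluster name and how the mount is configured on this machine.
    CLUSTER_DIR = Path("/mapr/my.cluster.com/warehouse/staging")

    def copy_into_cluster(local_file: str) -> Path:
        """Copy a local file into the NFS-mounted cluster with plain file operations."""
        CLUSTER_DIR.mkdir(parents=True, exist_ok=True)
        dest = CLUSTER_DIR / Path(local_file).name
        shutil.copy2(local_file, dest)       # no HDFS export or 'hadoop fs -put' step
        return dest

    def preview(cluster_file: Path, n: int = 5) -> list:
        """Read the first n lines back with the same standard-library calls."""
        with cluster_file.open() as fh:
            return [line.rstrip("\n") for _, line in zip(range(n), fh)]
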
For the near future, Brown says the comScore IT infrastructure will continue to
scale to meet new customer demand. The Hadoop cluster has grown to 450
servers with 17,000 cores and more than 10 petabytes of disk.
MapR’s distro of Hadoop is also helping to support a major new product
announced in 2012 and enjoying rapid growth. Known as validated Campaign
Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single,
third-party source. vCE also allows the identification of non-human traffic and
fraudulent delivery.


Start Small
When asked if he had any advice for his peers in IT who are also wrestling with
Big Data projects, Brown commented, “We all know we have to process mountains of data, but when you begin developing your environment, start small.
Cut out a subset of the data and work on that first while testing your code and
making sure everything functions properly. Get some small wins. Then you can
move on to the big stuff.”



Making Good Things Happen
at Wells Fargo
Paul Cao
Director of Data Services
for Wells Fargo’s Capital
Markets business

When Paul Cao joined Wells Fargo several years ago, his timing was perfect.
Big Data analytic technology had just made a major leap forward, providing
him with the tools he needed to implement an ambitious program designed
to meet the company’s analytic needs.
Wells Fargo is big—a nationwide, community-based financial services company
with $1.8 trillion in assets. It provides its various services through 8,700 locations
as well as on the Internet and through mobile apps. The company has some
265,000 employees and offices in 36 countries. They generate a lot of data.


Cao has been working with data for twenty years. Now, as the Director of Data
Services for Wells Fargo’s Capital Markets business, he is creating systems that
support the Business Intelligence and analytic needs of its far-flung operations.

Meeting Customer and Regulatory Needs
“We receive massive amounts of data from a variety of different systems,
covering all types of securities (equity, fixed income, FX, etc.) from around the
world,” Cao says. “Many of our models reflect the interactions between these
systems—it’s multi-layered. The analytic solutions we offer are not only driven
by customers’ needs, but by regulatory considerations as well.
“We serve the company’s data needs across the entire banking business and
so we work with a variety of data types including reference data, market data,
structured and unstructured data, all under the same umbrella,” he continues.
“Because of the broad scope of the data we are dealing with, we needed tools
that could handle the volume, speed and variety of data as well as all the requirements that had to be met in order to process that data. Just one example is
market tick data. For North American cash equities, we are dealing with up to
three million ticks per second, a huge amount of data that includes all the different price points for the various equity stocks and the movement of those stocks.”

Enterprise NoSQL on Hadoop
Cao says that given his experience with various Big Data solutions in the past
and the recent revolution in the technology, he and his team were well aware
of the limitations of more traditional relational databases. So they concentrated their attention on solutions that support NoSQL and Hadoop. They wanted
to deal with vendors like MapR that could provide commercial support for the
Hadoop distribution rather than relying on open source channels. The vendors
had to meet criteria such as their ability to provide the utmost in security, ease of
ingest, ability to scale, high performance, and—particularly important for Wells
Fargo—multi-tenancy.
Cao explains that he is partnering with the Wells Fargo Enterprise Data & Analytics and Enterprise Technology Infrastructure teams to develop a platform
servicing many different kinds of capital markets-related data, including files
of all sizes and real-time and batch data from a variety of sources within Wells
Fargo. Multi-tenancy is a must to cost-efficiently and securely share IT resources
and allow different business lines, data providers and data consumer applications to coexist on the same cluster with true job isolation and customized
security. The MapR solution, for example, provides powerful features to logically
partition a physical cluster to provide separate administrative control, data
placement, job execution and network access.
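As a rough illustration of how that kind of logical partitioning is typically scripted, the sketch below creates one volume per tenant through the maprcli command-line tool. The tenant names and paths are invented, and the per-volume quotas, placement topologies, and access controls that a real deployment would add are left out.

    import subprocess

    # Hypothetical tenants; volume names and cluster paths are illustrative only.
    TENANTS = ["capital_markets", "reference_data", "market_data"]

    for tenant in TENANTS:
        # Each volume becomes its own directory in the cluster namespace, so it can be
        # administered, placed, and secured independently of the other tenants.
        subprocess.run(
            ["maprcli", "volume", "create",
             "-name", f"tenant_{tenant}",
             "-path", f"/tenants/{tenant}"],
            check=True,
        )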

Dramatic Change to Handling Data
“The new technology we are introducing is not an incremental change—this

is a dramatic change in the way we are handling data,” Cao says. “Among our
challenges is to get users to accept working with the new Hadoop and NoSQL
infrastructure, which is so different from what they were used to. Within Data
Services, we have been fortunate to have people who not only know the new
technology, but really know the business. This domain expertise is essential to
an understanding of how to deploy and apply the new technologies to solve
essential business problems and work successfully with our users.”
When asked what advice he would pass on to others working with Big Data,
Cao reiterates his emphasis on gaining a solid understanding of the new technologies along with a comprehensive knowledge of their business domain.
“This allows you to marry business and technology to solve business problems,”
he concludes. "You'll be able to understand your users' concerns and work with
them to make good things happen.”



Coping with Big Data at Experian–
“Don’t Wait, Don’t Stop”
Tom Thomas
Director of IT at Experian

Experian is no stranger to Big Data. The company can trace its origins back
to 1803 when a group of London merchants began swapping information on
customers who had failed to meet their debts.
Fast forward 211 years. The rapid growth of the credit reference industry and
the market for credit risk management services set the stage for the reliance on
increasing amounts of consumer and business data that has culminated in an
explosion of Big Data. Data that is Experian's lifeblood.
With global revenues of $4.8 billion ($2.4 billion in North America) and 16,000
employees worldwide (6,000 in North America), Experian is an international
information services organization working with a majority of the world's
largest companies. It has four primary business lines: credit services, decision
analytics, direct-to-consumer products, and a marketing services group.
Tom Thomas is the director of the Data Development Technology Group within
the Consumer Services Division. “Our group provides production operations
support as well as technology solutions for our various business units including Automotive, Business, Collections, Consumer, Fraud, and various Data Lab
joint-development initiatives,” he explains. “I work closely with Norbert Frohlich
and Dave Garnier, our lead developers. They are responsible for the design
and development of our various solutions, including those that leverage MapR
Hadoop environments.”

Processing More Data in Less Time
Until recently, the Group had been getting by, as Thomas puts it “…with solutions running on a couple of Windows servers and a SAN.” But as the company
added new products and new sets of data quality rules, more data had to be
processed in the same or less time. It was time to upgrade. But simply adding
to the existing Windows/SAN system wasn’t an option—too cumbersome
and expensive.
So the group upgraded to a Linux-based HPC cluster with—for the time being—
six nodes. Says Thomas, “We have a single customer solution right now. But as
we get new customers who can use this kind of capability, we can add additional
nodes and storage and processing capacity at the same time.”



NFS Provides Direct Access to Data
“All our solutions leverage MapR NFS functionality,” he continues. “This allows us
to transition from our previous internal or SAN storage to Hadoop by mounting
the cluster directly. In turn, this provides us with access to the data via HDFS and
Hadoop environment tools, such as Hive.”
ETL tools like DMX-h from Syncsort also figured prominently in the new infrastructure, as does MapR NFS. MapR is the only distribution for Apache Hadoop
that leverages the full power of the NFS protocol for remote access to shared
disks across the network.
“Our first solution includes well-known and defined metrics and aggregations,”
Thomas says. “We leverage DMX-h to determine metrics for each record and
pre-aggregate other metrics, which are then stored in Hadoop to be used in
downstream analytics as well as real-time rules based actions. Our second
solution follows a traditional data operations flow, except in this case we use
DMX-h to prepare in-bound source data that is then stored in MapR Hadoop.
Then we run Experian-proprietary models that read the data via Hive and create
client-specific and industry-unique results.

Data Analysts Use SQL to Query on Hadoop
“Our latest endeavor copies data files from a legacy dual application server and
SAN product solution to a MapR Hadoop cluster quite easily as facilitated by
the MapR NFS functionality,” Thomas continues. “The files are then available
for analysts to query with SQL via Hive – without the need to build and load a
structured database. Since we are just starting to work with this data, we are
not 'stuck' with the initial database schema that we would otherwise have developed,
thus eliminating that rework time. Our analysts have Tableau and DMX-h available to them, and will generate our initial reports and any analytics data files.
Once the useful data, reports, and results formats are firmed up, we will work
on optimizing production.”
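A minimal sketch of that query path is shown below, assuming a HiveServer2 endpoint and the PyHive client; the host, table, and column names are placeholders rather than Experian's actual schema. The files copied onto the cluster are exposed as an external Hive table and then queried with ordinary SQL.

    from pyhive import hive   # assumes PyHive is installed and HiveServer2 is reachable

    # Hypothetical connection details and schema; Experian's actual tables differ.
    conn = hive.connect(host="hive.example.internal", port=10000, username="analyst")
    cur = conn.cursor()

    # Expose the files already copied onto the cluster as a table,
    # without building and loading a separate structured database.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS inbound_source (
            client_id    STRING,
            record_date  DATE,
            metric_value DOUBLE
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/experian/inbound'
    """)

    # Analysts then work in ordinary SQL via Hive.
    cur.execute("""
        SELECT client_id, COUNT(*) AS records, AVG(metric_value) AS avg_metric
        FROM inbound_source
        WHERE record_date >= '2015-01-01'
        GROUP BY client_id
    """)
    for row in cur.fetchall():
        print(row)
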
Developers Garnier and Frohlich point out that by taking advantage of the
Hadoop cluster, the team was able to realize substantially more processing power
and storage space, without the costs associated with traditional blade servers
equipped with SAN storage. Two of the servers from the cluster are also application
servers running SmartLoad code and components. The result is a more efficient
use of hardware with no need for separate servers to run the application.

Improved Speed to Market
Here’s how Thomas summarizes the benefits of the upgraded system to both
the company and its customers: “We are realizing increased processing speed
which leads to shorter delivery times. In addition, reduced storage expenses
mean that we can store more, not acquire less. Both the company's internal
operations and our clients have access to deeper data supporting and aiding
insights into their business areas.
“Overall, we are seeing reduced storage expenses while gaining processing
and storage capabilities and capacities," he adds. "This translates into an improved
speed to market for our business units. It also positions our Group to grow our
Hadoop ecosystem to meet future Big Data requirements.”

And when it comes to being a Big Data All Star in today's information-intensive world, Thomas' advice is short and to the point: "Don't wait and
don’t stop.”



Trevor Mason and Big Data:
Doing What Comes Naturally
Trevor Mason
Vice President Technology
Research at IRI

Mason is the vice president for Technology Research at IRI, a 30-year-old Chicago-based company that provides information, analytics, business intelligence and
domain expertise for the world’s leading CPG, retail and healthcare companies.
“I’ve always had a love of mathematics and proved to be a natural when it
came to computer science,” Mason says. “So I combined both disciplines and
it has been my interest ever since. I joined IRI 20 years ago to work with Big
Data (although it wasn’t called that back then). Today I head up a group that is
responsible for forward looking research into tools and systems for processing,
analyzing and managing massive amounts of data. Our mission is two-fold:
keep technology costs as low as possible while providing our clients with the
state-of-the-art analytic and intelligence tools they need to drive their insights.”

Big Data Challenges

Recent challenges facing Mason and his team included a mix of business and
technological issues. They were attempting to realize significant cost reductions
by reducing mainframe load, and continue to reduce mainframe support risk
that is increasing due to the imminent retirement of key mainframe support personnel. At the same time, they wanted to build the foundations for a more cost
effective, flexible and expandable data processing and storage environment.
The technical problem was equally challenging. The team wanted to achieve
random extraction rates averaging 600,000 records per second, peaking at
over one million records per second from a 15 TB fact table. This table feeds a
large multi-TB downstream client-facing reporting farm. Given IRI’s emphasis on
economy, the solution had to be very efficient, using only 16 to 24 nodes.
“We looked at traditional warehouse technologies, but Hadoop was by far the
most cost effective solution,” Mason says. “Within Hadoop we investigated all
the main distributions and various hardware options before settling on MapR
on a Cisco UCS (Unified Computing System) cluster.”
The fact table resides on the mainframe where it is updated and maintained
daily. These functions are very complex and proved costly to migrate to the
cluster. However, the extraction process, which represents the majority of the
current mainframe load, is relatively simple, Mason says.


“The solution was to keep the update and maintenance processes on the
mainframe and maintain a synchronized copy on the Hadoop cluster by using
our mainframe change logging process,” he notes. “All extraction processes go
against the Hadoop cluster, significantly reducing the mainframe load. This met
our objective of maximum performance with minimal new development.”
The team chose MapR to maximize file system performance, facilitate the use of
a large number of smaller files, and take full advantage of its NFS capability so
files could be sent via FTP from the mainframe directly to the cluster.

Shaking up the System
They also gave their system a real workout. Recalls Mason, “To maximize efficiency we had to see how far we could push the hardware and software before
it broke. After several months of pushing the system to its limits, we weeded
out several issues, including a bad disk, a bad node, and incorrect OS, network
and driver settings. We worked closely with our vendors to root out and correct
these issues.”
Overall, he says, the development took about six months followed by two
months of final testing and running in parallel with the regular production processes. He also stressed that “Much kudos go to the IRI engineering team and
Zaloni consulting team who worked together to implement all the minute
details needed to create the current fully functional production system in only
six months.”
To accomplish their ambitious goals, the team took some unique approaches.

For instance, the methods they used to organize the data and structure the extraction process allowed them to achieve between two million and three million
records per second extraction rates on a 16 node cluster.
They also developed a way to always have a consistent view of the data used in
the extraction process while continuously updating it.
By far one of the most effective additions to the IRI IT infrastructure was the
implementation of Hadoop. Before Hadoop the technology team relied on the
mainframe running 24×7 to process the data in accordance with their customers'
tight timelines. With Hadoop, they have been able to speed up the process
while reducing mainframe load. The result: annual savings of more than
$1.5 million.
Says Mason, “Hadoop is not only saving us money, it also provides a flexible
platform that can easily scale to meet future corporate growth. We can do a lot
more in terms of offering our customers unique analytic insights—the Hadoop
platform and all its supporting tools allow us to work with large datasets in a
highly parallel manner.
“IRI specialized in Big Data before the term became popular—this is not new to
us,” he concludes. “Big Data has been our business now for more than 30 years.

Our objective is to continue to find ways to collect, process and manage Big
Data efficiently so we can provide our clients with leading insights to drive their
business growth.”
And finally, when asked what advice he might have for others who would like
to become Big Data All Stars, Mason is very clear: “Find and implement efficient
and innovative ways to solve critical Big Data processing and management
problems that result in tangible value to the company.”



Leveraging Big Data to
Economically Fuel Growth
Kevin McClowry
Director of Analytics
Application Development
at TransUnion

Kevin McClowry has been working with Big Data since before the term hit the
mainstream. And these days, McClowry, currently the lead architect over analytics with TransUnion, LLC, is looking to Big Data technologies to add even more
impetus to his organization’s growth.
“My role is to build systems that enable new insights and innovative product
development,” he says. “And as we grow, these systems need to go beyond
traditional consumer financial information. When people hear TransUnion, they
immediately think of their credit score—and that is a huge part of our business. What they don’t realize is that we also have been providing services across
a number of industries for years—insurance, telecommunications, banking,
automotive to name a few—and we have a wealth of information. Enabling our
analysts to more effectively experiment within and across data domains is what
keeps that needle of innovation moving.”
But growth is rarely achieved without combatting some degree of inertia.

“Within most organizations, it is the customer facing, mission critical systems
that warrant the financial investment in ‘enterprise-class’ commercial solutions.
In contrast, R&D environments and innovation centers tend to receive discretionary funding. We started down this road because we knew we wanted to do
something great, but we were feeling ‘the pinch’ from our current technology
stack,” McClowry says.

Data to the People
“The first problem I wanted to address was the amount of time our analysts
spent requesting, waiting on, and piecing together disparate data from across
the organization. But there was a lot of data, so we wanted to incorporate lower
cost storage platforms to reduce our investment.”
So McClowry went to a hybrid architecture that made use of commercial databases for the most desirable, well-understood data assets—where the cost
can be justified by the demand—surrounded by a more cost-effective Hadoop
platform for everything else. “The more recent, enterprise data assets are what
most people want, and we've invested in that. But the trends in historical data
or the newly acquired sources that don’t have the same demand are where a
lot of the unseen potential lies. We put that data in Hadoop, and get value from
it without having those uncomfortable conversations about why our analysts
need it around—most of us have been there before.”

Since adopting this new “tiered” architecture, McClowry’s organization has seen
the benefits. “We’re seeing analysts work with data in a way that was previously
out of reach, and it’s fantastic. When the new normal for our folks starts to
include trillion-row datasets at their fingertips…that’s fun.”


New Tools, Enabling Better Insights


The next step for McClowry’s team was to enable tools that would make that
data accessible and usable by the company’s statisticians and analysts.

“We’re finding that a lot of the Data Scientists we’re hiring, particularly those
coming out of academia, are proficient with tools like R and Python. So we’re
making sure they have those tools at their disposal. And that is having a great
influence on our other team members and leading them to adopt these tools.”
But that was not the only type of analyst that McClowry’s team is looking to
empower. "I'm consistently impressed by the number of skilled analysts we have
hidden within our organization that have simply been lacking the opportunity

to ‘fall down the rabbit hole.’ These are the people that have invaluable tribal
knowledge about our data, and we’re seeing them use that knowledge and data
visualization tools like Tableau to tell some really powerful stories.”

Hitting the Road
McClowry’s team is globalizing their Big Data capabilities, introducing the
analytics architecture worldwide and sizing it to fit the needs and resources
of each country’s operation. In the process, they are building an international
community that is taking full advantage of Big Data—a community that did
not exist before.



“We are trying to foster innovation and growth. Embracing these new Big Data
platforms and architectures has helped lay that foundation. Nurturing the expertise and creativity within our analysts is how we’ll build on that foundation.”
And when asked what he would say to other technologists introducing Big Data
into their organizations, McClowry advises, “You’ll be tempted and pressured
to over-promise the benefits. Try to keep the expectations grounded in reality.
You can probably reliably avoid costs, or rationalize a few silos to start out, and
that can be enough as a first exercise. But along the way, acknowledge every
failure and use it to drive the next foray into the unknown; those unintended
consequences are often where you get the next big idea. And as far as technologies go, these tools are not just for the techno-giants anymore and there is no
indication that there will be fewer of them tomorrow. Organizations need to
understand how they will and will not leverage them.”



Making Big Data Work for a Major
Oil & Gas Equipment Manufacturer
Warren Sharp
Big Data Engineer at
National Oilwell VARCO
(NOV)

Big data requires a big vision. This was one of the primary reasons that Warren
Sharp was asked to join National Oilwell Varco (NOV) a little over six months
ago. NOV is a worldwide leader in the design, manufacture and sale of equipment and components used in oil and gas drilling and production operations
and the provision of oilfield services to the upstream oil and gas industry.
Sharp, whose title is Big Data Engineer in NOV’s Corporate Engineering and
Technology Group, honed his Big Data analytic skills with a previous employer—
a leading waste management company that was collecting information about
driver behavior by analyzing GPS data for 15,000 trucks around the country.
The goals are more complicated and challenging at NOV. Says Sharp, “We are
creating a data platform for time-series data from sensors and control systems
to support the deep analytics and machine learning. This platform will efficiently
ingest and store all time-series data from any source within the organization
and make it widely available to tools that talk Hadoop or SQL. The first business
use case is to support Condition-Based Maintenance efforts by making years of
equipment sensor information available to all machine learning applications
from a single source."

MapR at NOV
For Sharp using the MapR data platform was a given—he was already familiar
with its features and capabilities. Coincidentally, his boss-to-be at NOV had
already come to the same conclusion six months earlier and made MapR a part
of their infrastructure. “Learning that MapR was part of the infrastructure was
one of the reasons I took the job," comments Sharp. "I realized we had compatible ideas about how to solve Big Data problems."
"MapR is relatively easy to install and set up, and the POSIX-compliant NFS-enabled clustered file system makes loading data onto MapR very easy," Sharp

adds. “It is the quickest way to get started with Hadoop and the most flexible in
terms of using ecosystem tools. The next step was to figure out which tools in
the Hadoop ecosystem to include to create a viable solution.”


Querying OpenTSDB
The initial goal was to load large volumes of data into OpenTSDB, a time series
database. However, Sharp realized that other Hadoop SQL-based tools could
not query the native OpenTSDB data table easily. So he designed a partitioned
Hive-table to store all ingested data as well. This hybrid storage approach supported options to negotiate the tradeoffs between storage size and query time,
and has yielded some interesting results. For example, Hive allowed data to be
accessed by common tools such as Spark and Drill for analytics with query times
in minutes, whereas OpenTSDB offered near-instantaneous visualization of
months and years of data. The ultimate solution, says Sharp, was to ingest data
into a canonical partitioned Hive table for use by Spark and Drill and use Hive to
generate files for the OpenTSDB import process.
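A sketch of that last step might look like the PySpark job below, which reads one day of data from a canonical partitioned Hive table and writes text lines in the format OpenTSDB's bulk importer expects (metric, epoch timestamp, value, then tag=value pairs). The table and column names are hypothetical, not NOV's actual schema.

    from pyspark.sql import SparkSession, functions as F

    # Hypothetical schema: sensor_readings(metric, ts, value, rig_id, sensor_id, source, day)
    spark = (SparkSession.builder
             .appName("tsdb-export")
             .enableHiveSupport()
             .getOrCreate())

    readings = spark.table("sensor_readings").where(F.col("day") == "2015-06-01")

    # One line per data point: "<metric> <epoch-seconds> <value> rig=<id> sensor=<id>"
    lines = readings.select(
        F.concat_ws(
            " ",
            F.col("metric"),
            F.col("ts").cast("string"),
            F.col("value").cast("string"),
            F.concat(F.lit("rig="), F.col("rig_id").cast("string")),
            F.concat(F.lit("sensor="), F.col("sensor_id").cast("string")),
        ).alias("line")
    )

    # The resulting text files are what the OpenTSDB import process reads.
    lines.write.mode("overwrite").text("/exports/tsdb/2015-06-01")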


Coping with Data Preparation
Storage presented another problem. "Hundreds of billions of data points use
a lot of storage space,” he notes. “Storage space is less expensive now than it’s
ever been, but the physical size of the data also affects read times of the data
while querying. Understanding the typical read patterns of the data allows
us to lay down the data in MapR in a way to maximize the read performance.
Moreover, partitioning data by its source and date leads to compact daily files.”
Sharp found both ORC (Optimized Row Columnar) format and Spark were
essential tools for handling time-series data and analytic queries over larger
time ranges.
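A minimal PySpark sketch of that layout strategy is shown below, with invented paths and column names: incoming readings are partitioned by source and day and sorted within each partition before being written as ORC, which is what keeps the daily files compact and scans cheap.

    from pyspark.sql import SparkSession, functions as F

    # Illustrative only; the paths and column names are not NOV's actual schema.
    spark = (SparkSession.builder
             .appName("sensor-ingest")
             .enableHiveSupport()
             .getOrCreate())

    raw = spark.read.json("/ingest/incoming")           # one record per sensor reading

    (raw
     .withColumn("day", F.to_date(F.from_unixtime(F.col("ts"))))
     .repartition("source", "day")                      # compact, per-source daily files
     .sortWithinPartitions("sensor_id", "ts")           # ordered data compresses well in ORC
     .write
     .format("orc")
     .partitionBy("source", "day")
     .mode("append")
     .saveAsTable("sensor_readings"))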

Bottom Line
As a result of his efforts, he has created a very compact, lossless storage mechanism for sensor data. Each terabyte of storage has the capacity to store 750 billion to 5 trillion data points. This is equivalent to 20,000—150,000 sensor-years
of 1 Hz data and will allow NOV to store all sensor data on a single MapR cluster.
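Those figures check out on the back of an envelope, treating each sensor as producing one sample per second:

    SECONDS_PER_YEAR = 365 * 24 * 3600    # about 31.5 million 1 Hz samples per sensor-year

    for points_per_tb in (750e9, 5e12):
        sensor_years = points_per_tb / SECONDS_PER_YEAR
        print(f"{points_per_tb:.2e} points/TB is about {sensor_years:,.0f} sensor-years of 1 Hz data")

    # Prints roughly 23,800 and 158,500 sensor-years per terabyte, in line with the
    # 20,000 to 150,000 range quoted above.
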
“Our organization now has platform data capabilities to enable Condition-Based
Maintenance," Sharp says. "All sensor data are accessible by any authorized user
or application at any time for analytics, machine learning, and visualization
with Hive, Spark, OpenTSDB and other vendor software. The Data Science and
Product teams have all the tools and data necessary to build, test, and deliver
complicated CBM models and applications.”

Becoming a Big Data All Star
When asked what advice he might have for other potential Big Data All Stars,
Sharp comments, "Have a big vision. Use cases are great to get started; vision is
critical to creating a sustainable platform.”
“Learn as much of the ecosystem as you can, what each tool does and how it
can be applied. End-to-end solutions won’t come from a single tool or implementation, but rather by assembling the use of a broad range of available Big
Data tools to create solutions.”



The NIH Pushes the Boundaries of
Health Research with Data Analytics
Chuck Lynch
Chief Knowledge Officer at
National Institutes of Health

Few things probably excite a data analyst more than data on a mission, especially when that mission has the potential to literally save lives.
That fact might make the National Institutes of Health the mother lode of
gratifying work projects for the data analysts who work there. In fact, the NIH is
27 separate Institutes and Centers under one umbrella title, all dedicated to
the most advanced biomedical research in the world.

At approximately 20,000 employees strong, including some of the most prestigious experts in their respective fields, the NIH is generating a tremendous
amount of data on healthcare research. From studies on cancer, to infectious
diseases, to AIDS, to women's health issues, the NIH probably has more data on
each topic than nearly anyone else. Even the agency's library—the National
Library of Medicine—is the largest of its kind in the world.

Data Lake Gives Access to Research Data
‘Big data’ has been a very big thing for the NIH for some time. But this fall the
NIH will benefit from a new ability to combine and compare separate institute
grant data sets in a single ‘data lake’.
With the help of MapR, the NIH created a five-server cluster—with approximately
150 terabytes of raw storage—that will be able to “accumulate that data, manipulate the data and clean it, and then apply analytics tools against it,” explains
Chuck Lynch, a senior IT specialist with the NIH Office of Portfolio Analysis, in
the Division of Program Coordination, Planning, and Strategic Initiatives.
If Lynch’s credentials seem long, they actually get longer. Add to the above the
Office of the Director, which coordinates the activities of all of the institutes.
Each individual institute in turn has its own director, and a separate budget, set
by Congress.
“What the NIH does is basically drive the biomedical research in the United
States in two ways," Lynch explains. "There's an intramural program where we
have the scientists here on campus do biomedical research in laboratories. They
are highly credentialed and many of them are world famous.”


“Additionally, we have an extramural program where we issue billions of dollars
in grants to universities and to scientists around the world to perform biomedical research—both basic and applied—to advance different areas of research
that are of concern to the nation and to the world,” Lynch says.


This is all really great stuff, but it just got a lot better. The new cluster enables
the office to effectively apply analytical tools to the newly-shared data. The hope
is that the NIH can now do things with health science data it couldn’t do before,
and in the process advance medicine.


Expanding Access to ‘Knowledge Stores’


As Lynch notes, ‘big data’ is not about having volumes of information. It is
about the ability to apply analytics to data to find new meaning and value in it.
That includes the ability to see new relationships between seemingly unrelated
data, and to discover gaps in those relationships. As Lynch describes it, analytics
helps you better know what you don’t know. If done well, big data raises as
many questions as it provides answers, he says.


The challenge for the NIH was the large number of institutes collecting and
managing their own data. Lynch refers to them as “knowledge stores” of
the scientific research being done.
“We would tap into these and do research on them, but the problem was that we
really needed to have all the information at one location where we could manipulate it without interfering with the [original] system of record,” Lynch says.
“For instance, we have an organization that manages all of the grants, research,
and documentation, and we have the Library of Medicine that handles all of the
publications in medicine. We need that information, but it’s very difficult to tap
into those resources and have it all accumulated to do the analysis that we need
to do. So the concept that we came up with was building a data lake,” Lynch recalls.
That was exactly one year ago, and the NIH initially hoped to undertake the
project itself.



“We have a system here at NIH that we’re not responsible for called Biowulf,
which is a play on Beowulf. It’s a high speed computing environment but it’s
not data intensive. It’s really computationally intensive,” Lynch explains. “We first
talked to them but we realized that what they had wasn’t going to serve
our purposes.”
So the IT staff at NIH worked on a preliminary design, and then engaged vendors to help formulate a more formal design. From that process the NIH chose
MapR to help it develop the cluster.
“We used end-of-year funding in September of last year to start the procurement of the equipment and the software,” Lynch says. “That arrived here in the
November/December timeframe and we started to coordinate with our office
of information technology to build the cluster out. Implementation took place
in the April to June timeframe, and the cluster went operational in August."

Training Mitigates Learning Curve
“What we’re doing is that we’re in the process of testing the system and basically wringing out the bugs,” Lynch notes. “Probably the biggest challenge that
we’ve faced is our own learning curve; trying to understand the system. The
challenge that we have right now as we begin to put data into the system is
how do we want to deploy that data? Some of the data lends itself to the different elements of the MapR ecosystem. What should we be putting it into—
not just raw data, but should we be using Pig or Hive or any of the
other ecosystem elements?”
Key to the project success so far, and going forward, is training.
“Many of the people here are biomedical scientists. The vast majority of them
have PhDs in biomedical science or chemistry or something. We want them to
be able to use the system directly,” Lynch says. “We had MapR come in and give
us training and also give our IT people training on administering the MapR
system and using the tools.”
