
Compliments of

Delivering
Embedded Analytics
in Modern Applications
A Product Manager’s Guide to
Integrating Contextual Analytics

Federico Castanedo
& Andy Oram




Beijing · Boston · Farnham · Sebastopol · Tokyo


Delivering Embedded Analytics in Modern Applications
by Federico Castanedo and Andy Oram
Copyright © 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938.

Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Rachel Monaghan
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

May 2017: First Edition

Revision History for the First Edition
2017-04-25: First Release


The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Delivering Embedded Analytics in Modern Applications, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.

978-1-491-98442-0
[LSI]


Table of Contents

1. Delivering Embedded Analytics in Modern Applications
   Overview of Trends Driving Embedded Analytics
   The Impact of Trends on Embedded Analytics
   Modern Applications of Big Data
   Considerations for Embedding Visual Analytics into Modern Data Environments
   Deep Dive: Visualizations
   Conclusion

A. Self-Assessment Rating Chart



CHAPTER 1

Delivering Embedded Analytics in
Modern Applications

Organizations in all industries are rapidly consuming more data than ever, faster than ever, and need it in forms that are easy to visualize and interact with. Ideally, these capabilities can be seamlessly embedded into the applications and business processes that employees use in their everyday activities, so they can make more effective data-driven decisions instead of decisions based on intuition and guesswork. Research indicates that data-driven employees and organizations outcompete those that are not data-driven. In keeping with these trends, business leaders increasingly require their vendors and IT organizations to embed analytics in business applications. With embedded analytics, organizations leverage vendors’ domain expertise to provide analytics guidance for the application users.
The trade press has recently focused on helping businesses become “data-driven organizations.” Reporting on a study, the Harvard Business Review bluntly announced, “The more companies characterized themselves as data-driven, the better they performed on objective measures of financial and operational results” and “The evidence is clear: data-driven decisions tend to be better decisions.” The article follows up with specific examples. Infoworld stresses speed, automation of data collection, and independent thinking among staff. The barriers to becoming data-driven are the focus of a recent Forbes article; ways forward include a compelling justification for the data and a central organization capable of handling it, adding up to a “data-driven culture.” McKinsey (which helped conduct the HBR study) also stresses organizational transformation, which involves learning analytical skills and integrating data into decision making.
The trends and research all point to a single conclusion: organizations, customers, and employees seek a data-driven environment, and they value applications that make it easier to make sense of all the data available for decision making. This report examines the architecture and characteristics that allow software vendors and developers to meet this need and increase the value of their applications by embedding analytics.
What are the basic requirements of embedded analytics on a modern data platform? Essentially, to provide speed-of-thought interaction with powerful visualizations that help knowledge workers take action on data. While data comes from multiple sources (some of it historical, some of it streaming in real time) and is stored with various technologies for cost efficiency, all these complexities must be managed by the visual platform, which delivers visual analytics in a wide variety of formats and on a wide variety of devices. The following list summarizes the requirements of an embedded analytics tool on a modern data platform:
• Integrate with the modern data architecture by accepting data from a wide variety of input sources, each of which may have different data types.
• Process large amounts of data quickly, and respond to interactive requests within seconds.
• Embed easily into web pages or other media browsers, including applications created by third-party developers.
• Scale up automatically and have the ability to process streaming data.
• Adhere to security restrictions, so users see only the data to which they are supposed to have access.
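The last requirement is often implemented as row-level filtering: the platform applies a per-user predicate before any query or visualization runs. Here is a minimal sketch of the idea; the records, the permission scheme, and all names are hypothetical, not taken from any particular product.

```python
# Row-level security sketch: each user sees only the rows their
# permissions authorize. All data and names here are invented.

sales = [
    {"region": "east", "amount": 1200},
    {"region": "west", "amount": 800},
    {"region": "east", "amount": 450},
]

# Map each user to the set of regions they may see.
permissions = {"alice": {"east"}, "bob": {"east", "west"}}

def visible_rows(user, rows):
    """Return only the rows the given user is authorized to view."""
    allowed = permissions.get(user, set())
    return [r for r in rows if r["region"] in allowed]

# alice sees only the two "east" rows; an unknown user sees nothing.
print(sum(r["amount"] for r in visible_rows("alice", sales)))  # 1650
print(len(visible_rows("bob", sales)))                         # 3
```

In a real embedded-analytics deployment this predicate would be enforced in the data layer (for example, as a filter appended to every generated query), not in application code, so that no unauthorized rows ever leave the database.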
To begin, let’s examine in more detail the trends driving embedded
analytics.



Overview of Trends Driving Embedded
Analytics
For application vendors, embedding modern analytics is an opportunity to provide added value, increasing customer stickiness and revenue. For organizations, it is an opportunity to leap forward in their data-driven initiatives. These initiatives are critical for competitive advantage and rely heavily on the following trends:
Speed of change (velocity)
In business, finance, and technology, trends that used to unfold over weeks or months now take place within minutes and must be responded to as fast as possible. Until recently, companies could get by with checking data every few months and changing their strategies a couple of times a year. But now, consumers and clients can get news within minutes of it happening and change their preferences based on what’s posted to the internet. Rollouts of new products are traditionally spaced across seasons (spring, summer, etc.). But nowadays a trend can catch fire in a matter of days, thanks to the speed at which information spreads through social media. Bad news travels even faster: when there is a problem, such as when the Galaxy Note 7 phone started to catch fire, markets shift instantly. As the pace of change increases, having the right data and analyzing it quickly to make decisions is key to an organization’s ability to switch directions much more rapidly.
Knowledge workers
Although the term knowledge work first appeared in 1959, at that time the people whose decisions were crucial to organizational success were stuck at their desks, perhaps in isolated offices with nothing but a telephone and a pile of journals or stacks of mainframe printouts connecting them to information. It took the arrival of the internet era to provide knowledge workers with continuous streams of data about the outside world, a resource particularly exploited by young people in the workforce. These people know just as much about what is happening among clients, suppliers, and competitors as about what is happening within their own organization. And they are speaking up to demand access at speed-of-thought response times: seconds, not minutes or hours.


Availability of data
Sources of data that were unimaginable to earlier generations are now commonly used. Today companies can enhance their internal data with social, weather, credit history, census, and a wide range of other data sets that are readily available. For example, these data sets may include the dutifully logged behavior of web visitors, real-time updates on inventory and sales in retail stores, and the terabytes of data streaming in from sensors in the field.
While data may be more available, availability is only the first step in making data useful and impactful to the organization. Data becomes relevant only when it is utilized, that is, when people act or make decisions based on it. The challenge of making data useful to drive actions is even harder with so much data available. One reason is that it is necessary to combine multiple sources of information (such as customer interactions, real-time transactions, social communication, and location data) to obtain insights that were not available before. Another reason is that understanding the data well enough to explore it requires domain expertise, and users may easily get lost when dealing with large amounts of data. Free exploration can be a gift or a waste of time. Finally, since some data may be sensitive, it is also important to restrict access to authorized users only.
Lowering costs
Decreasing costs of technology and its components, especially storage and hardware, allow organizations to do more with less. Organizations have typically archived data to tape storage, which is fine for emergency recovery from system failures but does not allow instant access or fine-tuned queries. The need for massive storage driven by the large amounts of data collected, combined with the falling cost of disk storage, means organizations are now keeping billions of records within reach, contributing to the data explosion.
As memory also shrinks in size, gets cheaper, and is distributed across clusters of commodity hardware, calculations that used to suffer from slow disk access can be carried out in primary memory. Massive data sets are now stored in memory, allowing lightning-fast random access. Analytical tools can also run in-memory on clusters of low-cost servers to process real-time streams in a timely manner.


Cloud/SaaS platforms allow organizations to lower costs further while also benefiting from flexibility and the agility to scale. For instance, if a product line takes off or a sudden event brings widespread attention to an organization, it needs to respond quickly by ramping up the resources for a service and scaling accordingly. Cloud platforms are selected for their cost effectiveness as well as for the option to offload the administrative hassles of reliability, security, and constant growth to a third party.
More and more startups turn to cloud services exclusively as a strategy to access large amounts of compute and storage resources without incurring high upfront hardware costs. Even established companies like Netflix have moved operations to the cloud to increase their computing power and flexibility.
Customer 360 insights
Customer 360 has become a popular term for an ideal situation in which an organization gains competitive advantage by combining all the information available about the customer, delivering a holistic view that is accessible to all parts of the organization: sales, finance, services, marketing, and so on.
Nowadays, customers are tracked through multiple aspects of their daily lives. Much of their data is surrendered voluntarily through channels such as customer loyalty cards and social media “likes.” The need to analyze customer interactions holistically is known as customer journey analytics; it is based on collecting information about every step of the customer’s experience from first contact through purchase, and determining from analytics how to make the process more likely to lead to a sale. Customer journey analytics is very important to online channels, where advertisers need to decide their investment strategies based on where the sales are coming from. Organizations that use this data at key points of interaction with customers (for example, next-best recommendations such as “What movie do we recommend next?”) have a clear competitive advantage fueling their growth.
Technological advances
In the past 10 to 15 years we have seen a wide range of new tools (many of them open source) attempting to meet the needs of data-driven organizations. The rapid change in the technology to collect and process data at massive scale has become a trend on its own.
New types of databases have revolutionized the field, which continues to see significant change. These tools include databases (such as Kudu, HBase, Cassandra, MongoDB, and other products loosely known as NoSQL), data processing frameworks (such as Hadoop and Spark), query engines (such as Apache Impala, Apache Hive, and Presto), stream processing tools (such as Storm, Kafka, Flume, Apache NiFi, Amazon Kinesis, and Apex), and text indexing (such as Elasticsearch, Solr, and Cloudera Search).
In addition, access to data may be facilitated by cloud providers, which centralize and feed data to analytical tools. Cloud providers have optimized much of the underlying infrastructure required for storing and querying Big Data by offering scalable and cost-effective datastores such as Amazon’s Redshift, Google’s BigQuery and Cloud Spanner, and Microsoft’s HDInsight. Both the quick pace at which data requirements change and the pressure to lower costs drive organizations to experiment with these new technologies and adopt them at a higher rate than previously seen.

The Impact of Trends on Embedded Analytics
The preceding trends are driving embedded analytics. For instance,
to meet knowledge workers’ demand for data at the speed of
thought, organizations have adopted visual analytics platforms.
Business users have grown accustomed to dashboards, reports, and
free exploration of data sets. But as data grows in complexity and
size, business users require more guidance in order to turn Big Data
into insights with ease.
Embedded analytics is the use of reporting and analytic capabilities
within business applications. It is a technology designed to make
data analysis and business intelligence (BI) more accessible to users.
Embedded analytics is easily accessible from inside the application’s
workflow without forcing users to switch to a separate standalone BI
tool.
Because not everyone in the organization can become astute with data, embedded analytics provides guidelines that help business users understand data more quickly. Given the complexity of data (as discussed earlier), software application vendors are in a unique position to help organizations, today more than ever, by adding the guidelines that help business users explore data more efficiently. They do so by infusing their domain expertise and packaging the analytics most relevant to the particular business processes managed by the application.
For example, advanced visualization frameworks such as Zoomdata
can embed visual analytics into any application connected to
modern technologies. With such advanced frameworks, knowledge
workers can turn data into actions in any environment, from web
browsers to touch-oriented mobile devices, by interacting with intuitive visualizations and dashboards.
In the next section we will show use cases of modern applications
using Big Data.

Modern Applications of Big Data
Modern decision making rests on a phenomenon popularly known
as Big Data, defined by Wikipedia as data sets that are so large or
complex that traditional data processing application software is
inadequate to deal with them. Challenges include capture, storage,
analysis, data curation, search, sharing, transfer, visualization,
querying, updating, and information privacy. Scientists, business
executives, practitioners of medicine, advertisers, and governments
alike regularly encounter difficulties with large data sets in areas
including internet search, finance, and urban and business infor‐
matics.
Data sets grow rapidly—in part because they are increasingly gath‐
ered by cheap and numerous information-sensing mobile devices,
aerial (remote sensing), software logs, cameras, microphones, radiofrequency identification (RFID) readers, and wireless sensor net‐
works.

Big Data “size” is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. According to research firm Gartner, “Big Data represents the information assets characterized by such a high volume, velocity, and variety to require specific technology and analytical methods for its transformation into value.” It requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex, and of a massive scale.
Following on from Gartner’s definition, Big Data applications are
usually defined by the three Vs: the amount of data they use (volume), the velocity of data, and the variety of input data being used.
Having a huge amount of data is inevitable when you have a lot of
different possible combinations for a specific task—for instance,
when you have a lot of independent users/customers/interactions
and you want to make decisions about them. In this scenario, if you
take any random pair of users/customers they will most likely be
very different, but having enough users and enough data about them
allows the system to extract useful similarities.
Having a lot of data is only valuable when you can easily ask good questions and retrieve insight from the data. Most of the time, visualizing data, presenting it in a pictorial or graphical format, is the quickest form of interacting with it in order to answer these questions. In fact, visualizing data is the missing link between collecting data and understanding it. With visual access, users can digest huge amounts of data easily. As an example of how important visualization is for drawing the correct conclusions about data, consider Anscombe’s famous data set, depicted in Figure 1-1.

8

| Chapter 1: Delivering Embedded Analytics in Modern Applications


Figure 1-1. The four data sets defined by Francis Anscombe. Source: Francis J. Anscombe, “Graphs in Statistical Analysis,” American Statistician 27 (1973): 17–21.
Anscombe’s data set comprises four different sets that have nearly identical mean, variance, correlation, and linear regression, yet appear very different when graphed. This example demonstrates the added value of visualizing data rather than just analyzing it. The problem is even more challenging when you need to analyze massive amounts of data that arrive in a continuous stream. Proper visualization reduces the risk of making bad decisions.
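The effect is easy to verify numerically. The following snippet, a small illustration using the first two of Anscombe’s four sets (values from the 1973 paper), shows that their summary statistics agree to two decimal places even though their scatterplots look nothing alike:

```python
# The x values and the first two y series of Anscombe's quartet.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(v):
    return sum(v) / len(v)

def variance(v):
    # Sample variance (divide by n - 1).
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

# The summary statistics of the two sets match to two decimal places.
print(round(mean(y1), 2), round(mean(y2), 2))          # 7.5 7.5
print(round(variance(y1), 2), round(variance(y2), 2))  # 4.13 4.13
```

Only a plot reveals that the first set is a noisy line while the second is a smooth curve, which is exactly the point of the example.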
Acquiring interesting data itself is very difficult; that’s why companies commonly buy other companies just for their customer databases and information. Microsoft took the practice to a new level of data intensity when it paid a 50% premium on LinkedIn’s stock to purchase the company. The asset driving this bounty, observers noted, was LinkedIn’s vast database of clients with detailed professional information. Not only were these clients a useful resource in themselves, they were also raw input to Microsoft’s machine learning algorithms.




Industries and Use Cases Ripe for Modern Data-Driven
Solutions
Let’s take a look at the pressures you are likely to face in your organization by examining a couple of companies with especially huge
requirements for Big Data. Then, we will explore several use cases
for modern data-driven solutions.

Retail: The TJX Corporation
There is no doubt that data is important to the TJX corporation,
which runs TJ Maxx and many other stores. Here we’ll speculate a
bit about data’s relevance. TJX is perhaps the fastest-growing retailer
in the US. You probably don’t have the size and market pressures
that TJX faces now, but you might aspire to that status.
TJ Maxx and related stores are positioned to be low-priced in a
crowded market, and they depend on quick turnaround and efficiency to thrive. They provide a sterling example of responsiveness
to consumer needs in the modern era.
Let’s try to estimate the volume and velocity of data handled by TJX.
The corporation reached 3,395 stores in 2015. Because it added 176
stores in 2014, we’ll use the figure of 3,219 stores to estimate its data
requirements in 2014.
An insightful story in Fortune suggests that TJX shipped 2 billion

units in 2014. That’s a lot of volume. To indicate velocity, consider
the statement “Former employees say that the stuff moves so rapidly
that merchandise is often sold before TJX has paid its vendors for
it.” TJX’s ability to make quick decisions is evident in another quote
from the article: “Insiders say the ability to contract on the fly with
manufacturers lets TJX offer customers at least some merchandise
in a hot fashion trend (say, crop tops or slide sandals) when it can’t
get enough brand-name supply.”
Starting with that crude estimate of 2 billion units per year, we arrive
at an average of about 5.5 million units shipped each day, or 1,702
units per store per day. Let’s assume, to keep things simple, that a
store is open from 8 AM to 8 PM, for a 12-hour day. With that
assumption, TJX ships about 7,610 units per minute chain-wide, or roughly 2 units per minute per store.
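These back-of-the-envelope figures can be reproduced in a few lines; the unit and store counts are the article’s estimates from above, not exact data:

```python
# Rough estimates from the Fortune story and TJX's store counts.
units_per_year = 2_000_000_000   # estimated units shipped in 2014
stores = 3_395 - 176             # 2015 store count minus stores added in 2014

units_per_day = units_per_year / 365
units_per_store_per_day = units_per_day / stores

# Assume stores are open 8 AM to 8 PM, a 12-hour (720-minute) day.
minutes_open = 12 * 60
units_per_minute = units_per_day / minutes_open                      # chain-wide
units_per_store_per_minute = units_per_store_per_day / minutes_open  # per store

print(round(units_per_day))            # 5479452  (~5.5 million)
print(round(units_per_store_per_day))  # 1702
print(round(units_per_minute))         # 7610
```

The per-store rate comes out near 2.4 units per minute, which the text rounds down to 2.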
What about the third V, variety? TJX succeeds “by selling blouses…pots and pans…and bedding, sunglasses, sriracha seasoning, yoga mats, and the occasional $1,250 Stella McCartney dress.” The range of sales between a $20 belt (a usual TJ Maxx purchase) and a $1,250 dress could challenge any decision maker.
By doing analytics on products, store reviews, or call center logs, what could a TJX executive do with this data? Suppose she wants to judge the impact of selling that Stella McCartney dress. How does she compare it to other dresses sold by her stores at the same time? She has to filter sales information to find all dresses sold, limiting the data to dresses with a price comparable to the Stella McCartney dress. She may want to get a report for each individual store, or group the stores by zip code to get a sense of the dress buyers’ demographics. Thus, queries by managers will home in on types of merchandise, price ranges, and time intervals. A manager may want a chart showing sales of particular items or classes of items at holiday season each year. She may also want to compare sales at stores that offered the Stella McCartney dress at a reduced price versus those that did not. During a sale, the manager may want to view real-time data about particular types of merchandise.
Thus, both historical data and real-time data should be available for processing, with support for filtering on multiple dimensions (time, price, etc.) and for comparing the data along multiple dimensions.
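The filter-and-group pattern behind those queries can be sketched over a few invented records; all field names and values here are illustrative only, and a real deployment would push these operations down to the query engine rather than run them in application code:

```python
from collections import defaultdict

# Hypothetical sales records; fields and values are invented for illustration.
sales = [
    {"item": "dress", "price": 1250, "store_zip": "02101", "on_sale": False},
    {"item": "dress", "price": 1100, "store_zip": "02101", "on_sale": True},
    {"item": "dress", "price": 950,  "store_zip": "94103", "on_sale": False},
    {"item": "belt",  "price": 20,   "store_zip": "94103", "on_sale": False},
]

# Filter: dresses in a price band comparable to the $1,250 dress.
comparable = [s for s in sales if s["item"] == "dress" and s["price"] >= 900]

# Group by zip code to approximate buyer demographics.
by_zip = defaultdict(list)
for s in comparable:
    by_zip[s["store_zip"]].append(s["price"])

print(len(comparable))   # 3
print(sorted(by_zip))    # ['02101', '94103']
```

An embedded-analytics layer exposes exactly these operations (filter, group, compare) interactively, across both historical and streaming data.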

Direct service: Kaiser Permanente
Let’s turn now to a very different organization: a major health provider interested in reducing costs while maintaining a high quality of care. Kaiser Permanente is one of the largest not-for-profit health plans in the US, serving more than 11.3 million members. It has 38 hospitals and 626 outpatient facilities.
Although data on patient interactions is hard to find, some relevant data from 2007 were reported. In that year, Kaiser members in Hawaii made an average of 3.7 office visits per year, 1.68 telephone contacts, and 0.23 secure text messages.
If the average Kaiser patient has these 5.61 interactions per year, Kaiser as a whole experiences more than 63 million interactions with patients per year, or 173,679 per day. Kaiser is trying to shift interactions from office visits to less expensive phone and text messaging contacts, but overall, the volume of all these interactions is likely to increase as the organization pursues the health care field’s overarching goal of more patient engagement.
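The estimate extrapolates the 2007 Hawaii averages to the full membership, which is a rough assumption but easy to reproduce:

```python
members = 11_300_000                    # total Kaiser membership
visits, calls, texts = 3.7, 1.68, 0.23  # 2007 per-member averages (Hawaii)

interactions_per_member = visits + calls + texts
interactions_per_year = members * interactions_per_member
interactions_per_day = interactions_per_year / 365

print(round(interactions_per_member, 2))  # 5.61
print(round(interactions_per_year))       # 63393000  (>63 million)
print(round(interactions_per_day))        # 173679
```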


What would a Kaiser manager look for in the data? Patients suffer from multiple medical conditions that sometimes interact (for instance, obesity can contribute to knee problems), so Kaiser’s volume and velocity of data are matched by variety. In addition to relatively structured data on billing and tests, crucial information is stored in free-text notes. Surveying all this data, a manager might want to know the medical conditions associated with the most visits, whether an increase in interactions (such as text messages) is correlated with medical improvements, and whether an increase in one type of interaction leads to an increase or a decrease in other types. Real-time data may be crucial to management as well, such as when a suspected epidemic strikes and managers want to know what geographic regions are affected.

Pharmaceutical research and development
Pharmaceutical research and development (R&D) organizations make heavy use of analytics to find new drugs and modify treatments. Input to the process may consist of billions of rows of data. With huge data sets and multiple sources, data crunching can take hours in a traditional BI environment, but in order to work effectively, visual analytics needs to respond in seconds. Achieving this reduction in time lag will allow significant improvements in the services of any pharmaceutical R&D organization. Fast response time could extend the use of analytics from the research lab to the front line. For instance, armed with responsive and interactive visual analytics, salespeople could visualize for each doctor the history and current state of treatment outcomes, drilling down to nation, state, and their own patients. The benefit goes beyond sales efficiency, as it impacts health care overall, leading to more effective treatments.

Insurance: Markerstudy
Markerstudy, a leading UK insurance company, is using a Big Data
platform across key areas of operation to get 360-degree customer
insights, achieving a 120% increase in policy counts in 18 months.
Matching the right price to an insurance quote is critical; one must
weigh the risks of offering premiums too high to compete against
those of offering a price too low to be profitable. Efficient quotes
rely on a huge number of factors, such as driving records and credit
history. Online insurance providers may be offering millions of price quotes each day, creating rich knowledge of market sensitivity.
This knowledge can be turned into competitive advantage in future
quoting when staff can perform ad hoc queries that corral all this

data—not just a subset—to make quotes, and drill down in real time
into details such as why a particular quote was offered. Markerstudy uses the open source Apache Impala SQL query interface on top of Hadoop and the Zoomdata framework. In the first year after the company expanded its decision-making process to incorporate the full range of available data, it reduced policy cancellations by 50% and cut fraud by $7.5 million. Markerstudy won a Strata Data Impact award in 2015.

Cybersecurity
Cybersecurity is a field that relies on large-scale network security data sets. The objective is to improve situational awareness and shorten the time it takes to respond to threats. Speed of analysis is of the essence. The cybersecurity industry is an illustrative use case of the advantages of embedded analytics, particularly visual analytics. Network operators need to look for patterns in traffic suggesting a potential attack, and to quickly see the events leading up to an attack. This requires sophisticated journeys through real-time data to identify suspicious behavior, as well as drilling down into the underlying data with a wide variety of powerful visualizations that help the analyst make decisions. And all those visualizations should be integrated into the existing screens at the network operations center. A salient feature in this scenario is the ability to facilitate focus, since a human operator cannot pay attention to a large number of screens.

Internet of Things
The Internet of Things (IoT) is a broad concept involving the inter‐
connection of different kinds of assets, from the smallest wearables
and consumer devices to the largest vehicles and industrial installa‐

tions. IoT connects objects such as cars, buildings, and machines to
the internet, turning them into “intelligent” assets that can commu‐
nicate with people, applications, and each other.
IoT is also known as Machine to Machine (M2M) applications. Data
is captured from sensors attached to things and transmitted over a
network to be used by applications. The M2M market is highly
attractive, with strong growth prospects in industries such as automotive, consumer electronics, and energy and utilities. These leading sectors forecast 30% adoption rates for IoT.
IoT isn’t a future concept; it’s already here. In fact, 28% of organizations already use IoT, and a further 35% are less than a year away from launching their own projects. It is also important to note that more than three-quarters of businesses say that IoT will be “critical” to the future success of any organization in their sector. By 2020, the number of connected devices is estimated to exceed 14 billion, compared to approximately 3 billion people using the internet today.
IoT supports the development of new business solutions. For instance, Vodafone M2M provides the connectivity that supports in-car entertainment services, remote maintenance and diagnostics, and emergency and breakdown alerts. Vodafone is helping BMW stay in touch with its customers by installing a robust M2M SIM in all of its cars during manufacturing. The BMW Connected Drive initiative offers customers a wide range of intelligent services and apps. These apps, launched in more than 30 countries, provide information and entertainment during journeys. Another related example is the BMW ReachNow car-sharing app, which allows customers to locate and reserve cars via the web or a mobile app. The initiative has more than 100,000 registered customers and is live in Munich, Berlin, Dusseldorf, Koln, and San Francisco.
Netherlands-based Philips Lighting is yet another example of a business that has put IoT at the heart of its strategy. It offers IoT-based
solutions that enable public sector and corporate customers to
remotely monitor and control street and building lighting, driving
down energy costs and improving quality of service.
A key question with these consumer-facing IoT initiatives is what their direct impact will be on users. Embedding analytics helps developers of consumer IoT applications plan products and understand the impact and uptake of their initiatives by integrating consumer usage patterns. M2M platforms also require a management layer where users can manage SIM status through a web GUI and visualize all communication logs, which may include start time, end time, data volume, location, status, and more.
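As a minimal sketch of the kind of per-SIM usage summary such a management platform might compute and visualize (the field names and records here are illustrative, not drawn from any real Vodafone API):

```python
from datetime import datetime

# Illustrative communication-log records, one per SIM session.
# Field names are hypothetical stand-ins for a real platform's schema.
logs = [
    {"sim": "A1", "start": datetime(2017, 3, 22, 9, 0),
     "end": datetime(2017, 3, 22, 9, 5), "bytes": 2048, "status": "ok"},
    {"sim": "A1", "start": datetime(2017, 3, 22, 10, 0),
     "end": datetime(2017, 3, 22, 10, 1), "bytes": 512, "status": "error"},
    {"sim": "B7", "start": datetime(2017, 3, 22, 9, 30),
     "end": datetime(2017, 3, 22, 9, 45), "bytes": 8192, "status": "ok"},
]

def usage_by_sim(records):
    """Aggregate total data volume per SIM, the kind of summary a
    web GUI would chart for operators."""
    totals = {}
    for r in records:
        totals[r["sim"]] = totals.get(r["sim"], 0) + r["bytes"]
    return totals
```

The same loop generalizes to any of the logged attributes (session duration, error counts, location) by swapping the aggregated field.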
With the growth of IoT, there is a great opportunity for application vendors who can help organizations manage and deploy IoT-generated data. Within those applications, embedded analytics provide insight and visuals about what's going on in the IoT network and ease the management of large-scale deployments.


Logistics
Logistics is an area where the use of IoT will provide huge benefits. In fact, in the last 12 months, 95% of transport and logistics adopters have increased their IoT spend.
Shippers and service companies are a prime example of big and fast data: they need to route their vehicles as efficiently as possible. Deliveries or repairs are often scheduled or rescheduled at the last minute, so the ability to quickly calculate new routes can greatly improve service and save money by making more efficient use of staff and fuel. Information from vehicles can be used to optimize routes via fleet management solutions, an important part of which will be visualization. For vehicle route optimization with real-time requirements, as at Uber, the challenges of embedding visual analytics are even greater. Such applications will eventually evolve into support for better traffic patterns in smart cities, improving the environment and the quality of life for everyone.
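Quick route recalculation rests on classic shortest-path search. A minimal sketch using Dijkstra's algorithm over a toy road network (the depots and travel times are invented for illustration):

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: recompute the cheapest route when a
    delivery is rescheduled or a road segment's cost changes."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []

# Travel times (minutes) between points; a last-minute change just
# updates an edge weight and reruns the search.
roads = {
    "depot": {"a": 10, "b": 4},
    "a": {"customer": 2},
    "b": {"a": 3, "customer": 11},
}
```

Production fleet systems layer live traffic feeds and time windows on top of this core idea, but the recompute-on-change pattern is the same.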
In the next section, we’ll look at the software and architectural
requirements for these kinds of applications.

Considerations for Embedding Visual Analytics
into Modern Data Environments
Business intelligence based on spreadsheets and custom-generated
charts has been in use for decades. It helps organizational staff set
goals, plan new products and marketing campaigns, and generally
find out what’s working and what’s not. The backbone to traditional
BI is the data warehouse into which data is batch-loaded using
extract-transform-load (ETL) tools, and then subsets of it, or cubes,
are exposed to business users. As data becomes bigger, comes in
more varieties (including structured and unstructured data), and

increases in velocity (with streaming data), the traditional methods
of managing the backbone of BI do not work. And yet organizations
have comparable BI goals to manage in modern environments, but
with a huge turnover in types of input and processing tools, accom‐
panied by shorter decision-making cycles. Key differences can be



seen in data sources, channels of input, data storage, and visualiza‐
tions.
Product managers developing modern applications likely face the complexities of Big Data and a modern data environment. Embedding visual analytics into such an environment has
considerations that differ from the traditional embedded analytics
or BI in organizations. The differences are related to the granularity
of collected information, the use of immutable data, the way data is
retrieved, the data models employed, the interactivity of the output,
and the speed of data. We’ll look at these differences in the following
sections.

Collection
Modern data sources go far beyond the transactional sales data that
organizations historically collected and stored. In the past, customer
interaction was limited to data captured in transactional business
applications, but today customer interaction and engagement is cap‐

tured from multiple sources—clickstream, social, transactions—and
the information often streams in real time. With the Internet of
Things, organizations also track sensors, each of which transmits a
stream of data that updates its status every minute, or even every
second. Today’s organizations need to manage data that can be
transactional, real-time streams, or historical data. To manage both
history and real time, an important concept has emerged in recent years: the master data set (and, relatedly, immutable data). The master data set is the source of truth in any system and must be protected from corruption. In hybrid real-time and batch systems, the master data set stores both the historical data and the real-time data coming from streams. The approach is to store incoming data in the master data set as raw as possible, since rawness allows you to
deduce more information. To better understand this concept, see the
intraday stock graph of Google, Amazon, and Apple on March 22 in
Figure 1-2. Considering close prices alone, AAPL gained 1.12%, AMZN gained 0.57%, and GOOG lost 0.10%; yet at some point around 10:45 AM the three stocks were very close, at around 0.25%. This kind of detailed analysis cannot be done with open and close prices alone; it requires enough data granularity, which translates into collecting more data and having more storage.
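A small sketch of why granularity matters: with minute-level raw prices (the numbers below are invented, not real market data), an application can chart each stock's percent change throughout the day rather than only its open-to-close move:

```python
# Minute-level price observations per ticker (illustrative values).
# With only open and close prices, the intraday shape is invisible.
prices = {
    "AAPL": [100.0, 100.3, 100.2, 101.1],
    "GOOG": [200.0, 200.5, 200.4, 199.8],
}

def pct_change_series(series):
    """Percent change of each observation relative to the first,
    the series a granular intraday chart would plot."""
    base = series[0]
    return [round(100 * (p - base) / base, 2) for p in series]
```

Keeping the raw per-minute observations in the master data set is what makes this derived view possible; the coarser summary can always be recomputed from it, but not the reverse.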



Figure 1-2. Stock comparison of Google, Amazon, and Apple. Source:
Google Finance, screenshot taken by Federico Castanedo.


Storage
Data storage has spawned many new types of databases in the past
15 years. Relational databases remain central, but new platforms such as Hadoop (which introduced its own storage layer, the Hadoop Distributed File System, or HDFS), document databases, and other formats falling under the broad umbrella of NoSQL are popular. These flexible, analytics-friendly formats are pushing flat files and spreadsheets out of office and business processes, which have to handle ever-growing quantities of data.
unstructured data, search frameworks such as Elasticsearch and Solr
also become central to many applications, particularly because they
can pull actionable information out of unstructured text. They can
help answer questions such as “Where do people complain about
food poisoning?” in restaurant reviews.
As we have seen, modern organizations use multiple sources with
diverse formats, and want to keep the data in one place without
copying it. In contrast, traditional BI typically copies data into a data
warehouse or data mart, stored in relational databases. To manage
multiple sources (variety of data) cost effectively, organizations and
software vendors turn to modern and flexible storage architectures
instead of relational databases. So, the use of NoSQL databases is
becoming more common.
Data from different sources is usually joined through metadata that
links up data that goes together—for instance, matching a customer
in a data set obtained from a broker with the customer who just vis‐
ited your business’s website. Each data set will have a unique ID for
each customer (barring duplicates, which are common and can be
found through various analytic techniques). Foreign keys are commonly used in relational databases to link different tables. But your application will have to match customers from different data sets in many formats. To do this, it's useful to have a separate table, such as a dimension table.
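A minimal sketch of such a dimension table, mapping each source's native customer ID to one shared surrogate key (the sources, IDs, and keys are hypothetical):

```python
# Dimension table: (source, native ID) -> shared customer key.
# Matching records across sources resolve to the same key.
customer_dim = {
    ("broker", "BRK-552"): 1001,
    ("web", "u-8842"): 1001,   # same person, different source IDs
    ("web", "u-1234"): 1002,
}

def resolve(source, native_id):
    """Map a source-specific customer ID to the shared key, or None
    if the customer has not yet been matched."""
    return customer_dim.get((source, native_id))
```

Populating this table is where the deduplication techniques mentioned above do their work; once built, every downstream join goes through the shared key.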
Unlike with traditional databases, when you store immutable data in
a NoSQL database, the information isn’t just updated but rather is
added along with timestamp information. As a result, data is "eternally true": each record remains true because it was true at the time it was captured. This is also known as a fact-based system. One salient feature of storing immutable data is that you can recover your system's state at any point in time by replaying the facts up to that timestamp. Of course, this comes at the cost of increasing the storage required.
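A toy sketch of a fact-based, append-only store, illustrating point-in-time recovery by replaying timestamped facts (the record layout and values are invented for illustration):

```python
# Append-only fact store: updates are new timestamped facts,
# never overwrites, so any past state can be reconstructed.
facts = []  # list of (timestamp, key, value) tuples

def record(ts, key, value):
    """Append a new fact; existing facts are never modified."""
    facts.append((ts, key, value))

def state_as_of(ts):
    """Replay facts up to ts to rebuild the state at that moment."""
    state = {}
    for t, key, value in sorted(facts):
        if t <= ts:
            state[key] = value
    return state

# A customer moves: both addresses remain stored, each true at its time.
record(1, "address", "12 Oak St")
record(5, "address", "98 Elm St")
```

The growing `facts` list is the storage cost the text mentions; real systems compact or snapshot it, but the raw facts remain the source of truth.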

Retrieval
Modern data environments are challenged by the size and speed of
data and often do not have the luxury of moving the data or repli‐
cating it before analyzing it. In traditional BI environments data is
moved from the source systems into the data warehouse in a com‐
plex task called extract, transform, and load (ETL). Traditionally,
ETL is an in-between step manually coded by a data expert for each
BI application before users can explore the data. This meant that
data was most likely “old” by the time business users used it for their
decisions. Modern environments use data federation, which involves
a much lighter-weight process of joining data from the various
native formats in which it is stored, without physically moving it.

For example, a common pattern is to maintain reference data (cus‐
tomer name, address, etc.) in a relational database while placing
incoming records about customer behavior in a more flexible and
scalable data format such as Hadoop. The relational database is valuable for doing queries on relatively structured forms of data, but Hadoop is more appropriate for analytics that find new relationships
among different types of customer attributes, such as age and shop‐
ping habits. Data federation will define how visual analytics should
merge the data for combined insights, without moving it. The benefits are twofold: it retains the integrity of the data and enables access to fresh data.
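A minimal in-memory sketch of this federation pattern; real deployments would query the relational store and Hadoop through their own interfaces rather than the Python dictionaries stubbed in here:

```python
# Reference data stays in its (relational-style) store, behavior
# events stay in their (Hadoop-style) store; the join happens at
# read time without copying either source.
reference = {  # stand-in for a relational customer table
    1001: {"name": "Ada", "city": "Munich"},
}
events = [     # stand-in for raw clickstream records in Hadoop
    {"customer": 1001, "page": "/pricing"},
    {"customer": 1001, "page": "/signup"},
]

def federated_view(customer_id):
    """Combine both sources for one customer at query time."""
    return {
        "profile": reference.get(customer_id),
        "events": [e["page"] for e in events
                   if e["customer"] == customer_id],
    }
```

Because nothing is copied, the view always reflects the freshest events, which is exactly the twofold benefit described above.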
Data quality remains an issue in old and modern environments.
Some data cleanup is required, for example, removing outliers or
harmonizing different strings like “CA” and “California.” Modern
environments automate the cleanup process in one way or another, and a huge business has grown up around data normalization and automatic data cataloging, with startups such as Tamr and Gamalon.
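A tiny sketch of string harmonization like the "CA"/"California" case; the synonym table is an illustrative hand-built sample, whereas products like Tamr's learn such mappings at scale rather than hard-coding them:

```python
# Map variant spellings onto one canonical form before analysis.
# The synonym table here is a small illustrative sample.
STATE_SYNONYMS = {
    "ca": "California", "calif.": "California", "california": "California",
    "ny": "New York", "new york": "New York",
}

def normalize_state(raw):
    """Return the canonical state name for a free-form input string,
    leaving unknown values untouched (aside from whitespace)."""
    return STATE_SYNONYMS.get(raw.strip().lower(), raw.strip())
```

Running every incoming record through a normalizer like this keeps downstream group-bys and charts from splitting one real-world value across several labels.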
The speed of retrieving data for visualization and analytics is also
impacted by organizations’ increased use of the cloud for storage.
More companies are finding it more convenient to rent storage in
the cloud than to keep adding servers on-premise. The varying distances data must travel from different locations can put extra stress on speed. In addition, different cloud services offer different APIs, and

the tools pulling the data have to understand and adapt to each API.
For these reasons, organizations often deploy a hybrid architecture
where historical data may be stored on-premise because it is queried
frequently and would tally up high costs if stored in a third-party
cloud provider. So modern visualization and analytics tools should
be able to run queries across hybrid cloud and on-premise data,
combining results where appropriate.
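A minimal sketch of combining partial results from the two tiers; both stores are stubbed in memory here, and the period labels and counts are invented, whereas a real tool would call each system's own query API:

```python
# Run the same aggregation against an on-premise store (historical
# data) and a cloud store (recent data), then merge the partials.
def query_on_prem():
    return {"2016": 420, "2017-Q1": 95}   # historical totals

def query_cloud():
    return {"2017-Q1": 18, "2017-Q2": 7}  # fresh, still-arriving totals

def hybrid_totals():
    """Merge the two partial aggregates, summing overlapping periods."""
    combined = dict(query_on_prem())
    for period, count in query_cloud().items():
        combined[period] = combined.get(period, 0) + count
    return combined
```

The interesting work in a real system is deciding where each partial query runs and reconciling overlapping periods, as the shared "2017-Q1" key illustrates.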

Data Models
BI data warehouses develop semantic data models to establish rela‐
tionships between the data in different tables. These models are just
as useful in modern, real-time data processing, but the newer forms
of intelligence can also discover new relationships in more loosely
structured data. Instead of having a relational database, modern
environments use NoSQL databases with schema-free data models.
These more recent data models allow you to store unstructured or
semistructured data.
Architecturally, an application should connect directly to data in its
original storage format. From the 1980s on, many businesses created
data warehouses that collected data from diverse sources and organ‐
ized them in a single relational format. The rationale behind this
copying made sense at the time, because the data warehouse could
comprehensively handle all the queries businesspeople made. But in
a real-time decision-making environment, duplicating and copying
data is no longer viable, nor do you always want data in a single
warehouse format. Hadoop, NoSQL databases, and streaming sour‐
ces were invented because they are superior for many common ana‐
lytics on certain kinds of data, especially real-time data. You want to
take advantage of all appropriate formats.

