Search-Driven
Business Analytics
Designing a New Search Engine for Data

Andy Oram



Search-Driven Business Analytics
by Andy Oram
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (). For
more information, contact our corporate/institutional sales department:
800-998-9938 or

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

August 2015: First Edition

Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Search-Driven
Business Analytics, the cover image, and related trade dress are trademarks of
O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is sub‐
ject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.

978-1-491-93813-3
[LSI]


Table of Contents

Search-Driven Business Analytics
  A New Generation of Vendors Offering Interactive Visualizations
  Data Access Methods Are Being Transformed by Search
  Getting Insights from Diverse Data
  Interpreting User Input
  Translating Queries into Answers
  Validating Answers
  Creating the Simplicity of a Search-Like Query
  Creating Instant Visualizations
  Sharing Answers and Visualizations
  Bringing Search-Driven Analytics to the Masses



Search-Driven Business Analytics

We are all accustomed to instant results with the use of major web
search engines. However, when we pull up a business intelligence
(BI) product at work, the situation is quite different. In comparison
to Internet services that we use every day, these products seem stiff
and unresponsive. Business leaders are served with pre-built reports
and dashboards put together by their BI teams, and they wait days
or weeks to get reports on new inquiries about customers, products,
or markets. Thus, when a business manager moves from Facebook,
Amazon.com, or Google to her BI tool, it feels like time travel back
to a different century.
This report examines what it takes to make business intelligence as
simple and responsive as today’s consumer search engines, where
the user gets answers and visualizations as quickly as questions
come to mind.
We’ll look at:
• The convergence of BI and search
• What a search-driven user experience looks like
• The intelligence required for analytical search
• Data sources and their associated data modeling requirements
• Turning on-the-fly calculations into visualizations
• Applying enterprise scale and security to search

The techniques described here are general and draw on well-established
practices in the field. The main reference platform for this report is
the ThoughtSpot Analytical Search Appliance. The author will also
incorporate information gleaned from discussions with technical staff
from Microsoft’s Power BI service and from Adatao, a firm that offers
collaborative and predictive analytics.

A New Generation of Vendors Offering
Interactive Visualizations
ThoughtSpot’s Analytical Search engine allows users to ask ad hoc
questions of their data through a search interface. The engine
computes results on the fly based on the search query, and offers
visualizations of interest to the user. Its interactive interface
allows you to search through billions of rows and compute results
from any data source.

Figure 1. Data display in ThoughtSpot
Microsoft’s Power BI service lets you quickly create dashboards,
share reports, and directly connect to (and incorporate) all the data
available within the organization, through partners, or publicly
posted to the Internet. Power BI Desktop enables you to transform data
and create reports and visualizations. Figure 2 shows a typical
dashboard created in the Desktop.

Figure 2. Dashboard produced by Microsoft Power BI
Adatao takes a problem-solving approach to all data, big and small,
where the user starts with a hypothesis and pulls answers out of data
sources to validate or invalidate the hypothesis. Figure 3 shows
typical output from Adatao, known as a narrative, which enables data
discovery and presentation in the form of attractive visualizations.

Figure 3. Narrative produced by Adatao

Data Access Methods Are Being
Transformed by Search
So how have these new-generation technologies transformed data
interaction for the business user? An enlightening analogy can be
drawn between the way managers use BI today and how information
access on the Internet has evolved.
Typically, a manager at a data-rich company has access to certain
canned business reports. The managers have generated a list of busi‐
ness questions such as “a chart showing the product revenue from
each store, to compare same-store sales year-by-year” and a pro‐
grammer has dutifully coded up an analytics application to provide
those answers. If the business managers want a different report con‐
taining metrics and relationships not provided ahead of time, a
recoding effort is involved. This severely limits the data analysis sys‐
tems, leaving them unresponsive to intuitive questioning by the
business managers. The systems and humans are operating at very
different paces in this world of old-generation BI software.
Drawing an analogy to the evolution of the Internet, this is similar
to the sites that curated content for users more than a decade ago.
Users would subscribe to forums to find out what was new. Hot
products like Encarta (introduced by Microsoft in the early 1990s
when the Web was quite young) provided predetermined sets of
information in an encyclopedia format. Getting access to these
resources was much easier than pacing through the card catalog of
one’s local library, but they opened access only to a limited set of
information chosen by the site. Existing BI reports are similar to
these offerings in their inelasticity and lack of real-time interactivity
to serve the needs of the business user.
The advent of the AltaVista search engine, and subsequently Google,
transformed information access. The search engines didn’t add a jot
to the information already available. But they radically broadened
the sites to which we had access, and put us only a few seconds and
a few clicks away from the wealth of information and opinions on
the Web. Immediate options are now taken for granted as we search
an online bookseller for books, a travel site for hotels and airline
tickets, etc. Within minutes we sample a mind-boggling range of
opinions from around the world, whether the subject is the best data
store for fast-moving input or the latest sports news.

What does it take to bring the same kind of instant feedback and
broad searchability to business intelligence? Some requirements
include:
Real-time interactivity
When you start typing “flowers” into a modern search engine
such as Google or Bing, it anticipates what you want and sug‐
gests popular completions, such as “flowers online” and “flowers
for algernon” (a popular book and movie title). Typing “restau‐
rants” will probably offer you local results. Similarly, a BI solu‐
tion should instantly fashion charts or other answers while you
are typing, predicting what you want based on its knowledge of
previous queries and the data sets themselves. It should get bet‐
ter over time as it learns more about what each user wants and
offer more relevant suggestions.
A single, accurate answer
Unlike web search engines that can return multiple results in
relevance-ranked order, the BI interface should return just what
the user asked for, leaving out extraneous results. Ideally, when
the user wants a simple answer such as “revenue for California
last year” the interface should return a single figure instead of a
table of values the user has to interpret, or a list of links to past
reports or dashboards for the user to sift through to find the
answer.
Diverse data sets
The BI solution should be able to use structured data through‐
out the organization, from many different databases and even
more informal sources such as spreadsheets. All these sources
should be combined smoothly, and the solution should recog‐
nize relationships among the columns of databases so that it can
combine this data in visualizations and other results.

A simple interface
User experience and system usability have to be similar to con‐
sumer applications. Anyone should be able to use the solution
as easily as a search engine, without the need for a training class.
Scalability
Modern firms deal with terabytes of data or more. The solution
should be able to quickly search large amounts of data from
many columns of many tables and still return results in real
time.
Security
IT staff should be able to restrict access to specific columns or
rows of data, or to particular objects such as dashboards created
by users, assigning rights to individuals or groups. The product
needs to work with existing identity management solutions,
providing support for LDAP and Active Directory integration
and single sign-on capabilities. This will allow users to easily log
in using their corporate credentials.
Administrators should be able to set up security for individual
users or for groups, controlling access at the level of a saved
dashboard or chart, a column (such as a column in an HR table
that has compensation data), or a row (customer information
for the West Coast might be hidden from a sales rep on the East
Coast, for example).
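
As a rough illustration of the row-level and column-level restrictions described above, the following minimal sketch filters a result set against a per-user policy. The `Policy` class, its field names, and the sample data are assumptions made for this example; they do not describe any vendor's actual security model.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical per-user access policy (not a real product API)."""
    hidden_columns: set = field(default_factory=set)
    # Row filter: column name -> set of values the user may see.
    allowed_values: dict = field(default_factory=dict)


def apply_policy(rows, policy):
    """Return only the rows and columns the policy permits."""
    visible = []
    for row in rows:
        # Row-level check: every restricted column must hold an allowed value.
        if any(row.get(col) not in allowed
               for col, allowed in policy.allowed_values.items()):
            continue
        # Column-level check: drop hidden columns from the surviving row.
        visible.append({k: v for k, v in row.items()
                        if k not in policy.hidden_columns})
    return visible


customers = [
    {"name": "Acme", "region": "West", "credit_limit": 5000},
    {"name": "Bolt", "region": "East", "credit_limit": 7000},
]
# An East Coast rep: sees only East rows, and never the credit_limit column.
east_rep = Policy(hidden_columns={"credit_limit"},
                  allowed_values={"region": {"East"}})
east_view = apply_policy(customers, east_rep)
```

In a real deployment the policy would come from the identity system (LDAP or Active Directory groups, for instance) rather than being built by hand.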
How does a BI solution like this change the way we do business?
How does the reduction in response time for a query, from days to
seconds, lead to a higher top line and lower costs?
Instead of waiting to see past performance of sales, the general
manager of a business unit can see real-time sales performance and
make inventory allocation decisions based on real-time demand.
Business processes are being disrupted as transformations that once
had to be precalculated become possible on demand.
The impact becomes even greater as interfaces are able to anticipate
what a user wants and bring into sharp focus ideas that are just
emerging. This anticipation can be based on previous queries—for
instance, if someone searches for information on California, the
interface would check its cached queries and notice similar searches
for information on New York, then suggest a related result. Everyone
has a unique approach to asking questions, so personalizing the
suggestions makes the experience a lot more relevant and
user-friendly. The interface can also look at the data itself: for
instance, in each column the interface anticipates that the user is
likely to request values that are more commonly found there.

Getting Insights from Diverse Data
Enterprises’ data sources come in several flavors:
• Data warehouses often store tens or hundreds of terabytes of
historical data in relational tables accessed through SQL.
• Applications, both on-premise and in the cloud, produce results
that can be input into BI. Recent years have seen a notable
increase in cloud enterprise applications offered by vendors
such as Salesforce and NetSuite.
• Spreadsheets are ubiquitous, scattered across desktops and laptops
throughout the enterprise, where individuals use them to analyze
subsets of data.
• With the increasing spread of Hadoop, Spark, and other “big
data” technologies within the enterprise, data sources with rela‐
tively loose document formats are becoming an important cate‐
gory as well.
The more sources of data a search engine can handle, the more use‐
ful it becomes—not only because more of the organization’s data is
searchable, but because the different sources can work together and
add extra meaning. However, one of the most time-consuming
problems faced by BI analysts is the integration of multiple data
sources, especially non-relational data. A search-driven interface can
help with this, by offering a visual and easy way for analysts to dis‐
cover bad or stale data, and exclude it from the scope of data that’s
visible to business users.
Therefore, integrating sources and indexing their content for quick
retrieval is the key initial task for interactive BI and analytics. The
ThoughtSpot Analytical Search Appliance uses a variety of interfaces
to integrate data from various sources:
• Data is loaded from data marts or data warehouses through the
enterprises’ chosen ETL tools, and through a JDBC/ODBC
interface that can be used to connect data sources directly to
ThoughtSpot. Data can also be directly loaded into Thought‐
Spot through bulk data load scripts. These are highly efficient,
loading the data at multi-terabyte-per-hour speeds in a scale-out
fashion across all the nodes.
• For cloud data sources, in addition to the above options,
ThoughtSpot has partnered with vendors to use their individual
products, such as Informatica’s Cloud Connector, to load data.
• Spreadsheets can be uploaded by individual users through an
interface in the product that guides the user through the pro‐
cess. As part of that workflow, the user can also specify whether
she wishes to link a column from this spreadsheet to any other
column present in the system so that she can analyze local data
present on her computer against company-wide data from their
data warehouse.
ThoughtSpot understands the underlying schema and relationships
between your data when you load it, so as soon as it is loaded, it is
ready to be searched without any additional modeling work. The
system also works across any time granularity—weekly, quarterly,
yearly—without requiring the BI team to build new aggregate tables,
OLAP cubes, and materialized views. This helps business users to
start using the system as soon as the IT/BI team has loaded data into
it. And as the user types queries that connect multiple tables
together, the multiple join path choices are all handled under the
hood so the user does not have to know any SQL terminology to
connect diverse data sets together and complete her query. ThoughtSpot
is able to provide sub-second response times for searches over
billions of rows of data because of its purpose-built, in-memory
relational cache. This cache understands search semantics and secu‐
rity rules, as well as query plans, and is able to scale out across hun‐
dreds of nodes.
Once the data is loaded, ThoughtSpot creates an index to maximize
the speed of queries. For data volumes in terabytes, the index needs
to be efficiently sharded and distributed across multiple nodes
without compromising on search latency. The creation of the index
itself must be distributed so that there is minimal delay between
when new data shows up in the system and when it is ready to be
searched.
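
One common way to shard an index across nodes is to hash each token to a shard, so that a lookup only has to consult the shard that owns the token. The sketch below illustrates that general idea only; the shard count, the `crc32` hash, and the in-process lists standing in for nodes are all assumptions, and nothing here is taken from ThoughtSpot's implementation.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed node count, for illustration only


def shard_for(token):
    """Pick a shard with a stable hash so every node agrees on placement."""
    return zlib.crc32(token.encode("utf-8")) % NUM_SHARDS


def build_index(rows):
    """Build one inverted index per shard: token -> set of (column, row id)."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            token = str(value).lower()
            shards[shard_for(token)][token].add((column, row_id))
    return shards


def lookup(shards, term):
    """Only the shard that owns the token has to be consulted."""
    token = term.lower()
    return shards[shard_for(token)].get(token, set())


rows = [
    {"state": "California", "year": 2015},
    {"state": "New York", "year": 2015},
]
shards = build_index(rows)
hits = lookup(shards, "california")
```

Because each shard can be built from its own slice of tokens, index construction can run in parallel, which is one way to keep the delay between data arrival and searchability small.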
Microsoft’s Power BI features integration with external tools, both
from Microsoft and from partners such as Salesforce and Zendesk.
The Power BI interface helps the user find these resources—databa‐
ses, spreadsheets, Hadoop data stores, even social media sites—and
connect to them. A relational database provides its own schema,
whereas Power BI creates the schema for a spreadsheet, normally
using the first row as column names. Figure 4 shows an
entity-relationship diagram created by Power BI to represent an
incoming schema.

Figure 4. Schema in Power BI
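
To make the "first row as column names" idea concrete, here is a minimal sketch that derives a schema from spreadsheet data exported as CSV. The type-guessing rule (try integer, then float, else text) is an assumption for illustration and is not a description of how Power BI infers types.

```python
import csv
import io


def infer_schema(csv_text):
    """Use the first row as column names and guess a type for each column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    def guess(values):
        # Try the strictest type first; fall back to text if nothing fits.
        for caster, name in ((int, "integer"), (float, "float")):
            try:
                for value in values:
                    caster(value)
                return name
            except ValueError:
                continue
        return "text"

    return {name: guess([row[i] for row in rows])
            for i, name in enumerate(header)}


sheet = "store,year,revenue\nOakland,2015,1200.50\nFresno,2015,980\n"
schema = infer_schema(sheet)
```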
In Power BI, the user can also attach to a stream of incoming data
and see a dashboard updated in real time as new data comes in. The
user can then provide this dashboard to colleagues—by sending an
email with a URL, or through SharePoint—and they too can see
real-time changes.
Power BI takes integration further by supporting single sign-on. For
instance, a user would log into Power BI and enter her Salesforce
credentials. After this, the user just needs to log into Power BI for
future sessions and would be able to search Salesforce without reau‐
thorizing the connection.

Interpreting User Input
Let’s see how the solutions in this report handle user questions.
Power BI and Adatao estimate what a user’s intent is using natural
language processing (NLP) techniques. They accept a range of rela‐
tively free text and resolve ambiguities by examining the context of
the words used.
ThoughtSpot, on the other hand, chose not to use NLP in order to
remove any chance of ambiguity. ThoughtSpot’s search engine
guides the user as they type with intelligent search suggestions,
making sure that the user’s intent and the search engine are always in
sync. As such, ThoughtSpot is always able to provide a single,
accurate result, rather than a list of probabilistic answers.

All of these tools are fault-tolerant at the user-input level, allowing
users to get to answers even with misspellings, changed word orders,
or incorrect grammar. The tools can execute the kind of type-ahead
autocomplete that Google has made familiar (see Figure 5).

Figure 5. Autocomplete in ThoughtSpot
Figure 6 shows the output of an NLP query in Power BI.

Figure 6. Natural-language query in Power BI

Figure 7 shows a typical set of relevant business questions suggested
by Adatao.

Figure 7. Adatao search suggestions
To recognize natural-language phrases such as “What is the average
cost per trip by region of travel?”, Power BI incorporated advanced
technology from other Microsoft tools, notably Bing. Corrections to
spelling and alternative columns can be presented to the user.
As we have seen, the ThoughtSpot Analytical Search Appliance can
handle a wide range of user requests and help the user structure her
queries. Let’s focus on a simple request such as “Revenue California
2015 county.” If the user types “Cal,” the engine fills in “California”
as a suggested completion. The algorithms that calculate and rank
the completions take into account many factors, including how
often a word shows up in the data (its cardinality) and how often
people have searched for it. As the product gets used, the sugges‐
tions get more relevant and personalized to each user, as with search
engines like Google.
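
The ranking idea can be sketched in a few lines. The example below scores each candidate completion by how often the value appears in the data plus how often it has been searched for; the equal weighting, the function name, and the sample counts are assumptions for illustration, since the report does not give the actual formula.

```python
from collections import Counter


def rank_completions(prefix, value_counts, search_counts, limit=3):
    """Rank values starting with `prefix` by data frequency plus search popularity."""
    prefix = prefix.lower()
    candidates = [v for v in value_counts if v.lower().startswith(prefix)]
    # The equal weighting of the two signals is an arbitrary assumption.
    return sorted(
        candidates,
        key=lambda v: (-(value_counts[v] + search_counts.get(v, 0)), v),
    )[:limit]


# How often each value occurs in the indexed data (made-up numbers).
value_counts = Counter({"California": 900, "Calgary": 40, "Colorado": 300})
# How often users have searched for each value (made-up numbers).
search_counts = Counter({"California": 120, "Calgary": 5})
suggestions = rank_completions("Cal", value_counts, search_counts)
```

Personalization could be approximated by keeping a separate `search_counts` per user and blending it with the global counts.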
To facilitate this type of personalization, index matching has to sup‐
port exact matches as well as prefix, suffix, and substring matches; it
also looks for synonyms. If there are no matches—for example, if
the user makes a typographical error—the engine offers suggestions
based on spellcheck-based algorithms and phonetic matching algo‐
rithms, such as metaphone. While performing these over potentially
billions of rows of data, the engine also needs to apply sophisticated
row-level, column-level, and object-level security rules so that only
the entities the user is allowed to see are visible even in the search
suggestions.
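
A minimal sketch of this matching cascade follows. It tries exact, prefix, suffix, and substring matches in that order, and falls back to similarity-based spelling suggestions only when nothing matches. Two simplifications are assumed: Python's `difflib` stands in for the spellcheck step, and phonetic matching such as metaphone is left out because it is not part of the standard library. The vocabulary and column names are hypothetical.

```python
import difflib


def match_tokens(query, vocabulary, allowed_columns):
    """vocabulary maps each indexed value to the column it came from."""
    q = query.lower()
    # Security first: values from hidden columns never reach the suggestions.
    visible = {v: c for v, c in vocabulary.items() if c in allowed_columns}
    tiers = {"exact": [], "prefix": [], "suffix": [], "substring": []}
    for value in visible:
        v = value.lower()
        if v == q:
            tiers["exact"].append(value)
        elif v.startswith(q):
            tiers["prefix"].append(value)
        elif v.endswith(q):
            tiers["suffix"].append(value)
        elif q in v:
            tiers["substring"].append(value)
    ranked = [v for tier in tiers.values() for v in tier]
    if ranked:
        return ranked
    # No match at all: fall back to similarity-based spelling suggestions.
    return difflib.get_close_matches(query, list(visible), n=3, cutoff=0.6)


vocabulary = {
    "California": "state",
    "Colorado": "state",
    "Carol Smith": "employee_name",
}
typo_result = match_tokens("Califronia", vocabulary, {"state"})
```

Applying the security filter before any matching means a user without access to `employee_name` cannot discover its values even through a near-miss spelling.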
Within the ThoughtSpot Analytical Search Appliance, when a user
types “2015,” the engine knows that the text refers to a year, not a
product part number or some other arbitrary number. The engine can
predict this with high accuracy because 2015 appears frequently in a
Year column in a database it indexed.
A crucial prerequisite for joining data to respond to user queries is
to recognize relationships. The “California” and “2015” in the user’s
query lead the engine to filter the data so it uses only the rows that
are related to California and are from the year 2015. In our “Califor‐
nia 2015” example, ThoughtSpot can determine that a relationship
exists if a foreign key connects two tables.
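
One way to picture relationship discovery is as a graph search: tables are nodes, foreign keys are edges, and a join path is a route between the two tables a query touches. The sketch below uses breadth-first search to find the shortest such route. The table names and the choice of breadth-first search are assumptions for illustration, not a description of ThoughtSpot's planner.

```python
from collections import deque


def join_path(foreign_keys, start, goal):
    """Breadth-first search for the shortest chain of foreign-key joins."""
    graph = {}
    for left, right in foreign_keys:
        # A foreign key lets us join in either direction.
        graph.setdefault(left, []).append(right)
        graph.setdefault(right, []).append(left)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no relationship connects the two tables


# Hypothetical schema: each pair is (table with the key, table it references).
foreign_keys = [("sales", "stores"), ("stores", "counties"), ("sales", "dates")]
path = join_path(foreign_keys, "counties", "dates")
```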
The interface offers suggestions in a dropdown box as the user
types, and the user can immediately choose the one she intends. For
instance, as the user types “Revenue California,” the interface sug‐
gests several completions such as “Revenue California by county”
and “Revenue California by customer,” drawing on its knowledge of
the columns in the database. The suggestions include those gener‐
ated through the analytical search algorithm already described, as
well as those generated by typical document search algorithms, like
Apache Lucene. This instant responsiveness keeps the user and
engine in lock step. It allows the user to focus on her thought pro‐
cess instead of her interactions with the engine. It also allows her to
create a new answer or consume a saved answer based on what she’s
looking for, without having to limit herself to saved charts and dash‐
boards.
The ThoughtSpot engine only produces suggestions that adhere to
any security restrictions. If a user has not been granted access to a
column, it is not used to generate search suggestions, let alone pro‐
duce results.
In our search for “Revenue California 2015 county,” the engine com‐
putes that the data in the joined State and Year columns should be
grouped by county in the display. The ThoughtSpot user interface
also recognizes common aggregate functions, such as “sum” or
“standard deviation,” and computations such as “growth of” that are
complex to express in SQL. Figure 8 shows the results of a complex
calculation in ThoughtSpot.


Figure 8. Monthly Sales Growth chart example

Translating Queries into Answers
Programmers reading this account will quickly see that the services
in this report manipulate SQL behind the scenes, generating rela‐
tional database queries of considerable complexity and sophistica‐
tion. But the user doesn’t have to think in terms of relational data or
SQL at all. The inputs are mildly structured but close to everyday
language—the original premise of SQL in the 1970s.
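
To give a feel for the translation step, the sketch below turns the search “Revenue California 2015 county” into a SQL query and runs it against an in-memory SQLite table. Everything about it is an assumption made for illustration: the single `sales` table, the hard-coded sets of measures and attributes, and the small value index. A real engine would derive these from its index and would also handle joins.

```python
import sqlite3

MEASURES = {"revenue"}                     # assumed numeric columns
ATTRIBUTES = {"state", "year", "county"}   # assumed groupable columns
VALUE_TO_COLUMN = {"california": "state", "2015": "year"}  # assumed value index


def to_sql(search, table="sales"):
    """Translate search tokens into an aggregate query with filters and grouping."""
    selects, filters, params, groups = [], [], [], []
    for token in search.lower().split():
        if token in MEASURES:
            selects.append(f"SUM({token}) AS {token}")
        elif token in ATTRIBUTES:
            groups.append(token)
        elif token in VALUE_TO_COLUMN:
            column = VALUE_TO_COLUMN[token]
            filters.append(f"LOWER(CAST({column} AS TEXT)) = ?")
            params.append(token)
    sql = f"SELECT {', '.join(groups + selects)} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    if groups:
        sql += " GROUP BY " + ", ".join(groups) + " ORDER BY " + ", ".join(groups)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, year INTEGER, county TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("California", 2015, "Alameda", 100.0),
    ("California", 2015, "Alameda", 50.0),
    ("California", 2015, "Marin", 70.0),
    ("California", 2014, "Marin", 999.0),
    ("New York", 2015, "Kings", 10.0),
])
sql, params = to_sql("Revenue California 2015 county")
result = conn.execute(sql, params).fetchall()
```

The user typed four words; the generated statement carries the aggregate, two filters, and the grouping that she would otherwise have had to write by hand.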
In the 1980s and 1990s, a number of products promised a “natural-language-on-SQL” approach, but failed to meet the market’s need.
The query suggestion/completion interfaces implemented by the
services in this report are along the lines of popular search engines,
and add a crucial missing piece to those older approaches. It turns
out that the query suggestion/completion interface is a significant
factor in helping users effectively go from thought to question to
answer. Figure 9 shows how ThoughtSpot extends the user’s query
with suggestions.

Figure 9. Search suggestions are refined as you type your query
We have assumed so far that a column named “county” is in some
input table. However, if the column has some other name (say,
“region”), the services in this report allow a user or administrator to
define synonyms, so that they can map an oddly named column
(such as “cust_reg”) to words in everyday language (such as
“region”). Power BI, for instance, lets users do this through PowerPi‐
vot in Excel. ThoughtSpot allows administrators to relate column
names to synonyms. In this hypothetical case, the administrator
could indicate that when a user requests a “county,” the engine
should map that to the “region” column from the input. Thought‐
Spot also uses synonym sets and other matching algorithms to offer
meaningful suggestions based on what the user meant, and lets the
user pick the correct choice to move forward with the query compu‐
tation.
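
A synonym table of this kind can be as simple as a dictionary from everyday words to physical column names. The sketch below reuses the hypothetical “cust_reg” column from the text; the function name and the extra entries are assumptions for illustration.

```python
SYNONYMS = {
    # everyday word -> physical column (hypothetical names, as in the text)
    "county": "cust_reg",
    "region": "cust_reg",
    "sales": "revenue",
}


def resolve_column(term, columns, synonyms=SYNONYMS):
    """Map a user's word to a real column, directly or through a synonym."""
    word = term.lower()
    if word in columns:
        return word
    target = synonyms.get(word)
    return target if target in columns else None


columns = {"cust_reg", "revenue"}
resolved = resolve_column("County", columns)
```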
All three tools in this report, working with the original data sources,
let the user “slice and dice” data through filtering and drill-down
operations (by region, product segment, etc.). With ThoughtSpot,
users can also slice-and-dice directly in the search bar by adding or
removing search terms.
In short, a responsive user interface should—in real time—compare
user inputs to both column names and values in the input databases.
It should be able to make a savvy guess as to what column the user
wants and offer that as a higher-ranked suggestion, based both on
exact matches and on considerations such as which columns contain
the most rows containing “California” or “2015” as a value. The user
can pick the suggestion that matches what she is looking for and
disambiguate the request.
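
The “which column does this value belong to” guess can be sketched as a simple count: for each column, tally the rows that hold the typed value, and rank columns by that tally. The sample rows and column names below are made up for illustration, and a production engine would read these counts from its index rather than scanning rows.

```python
from collections import Counter


def rank_columns(value, rows):
    """Count, per column, how many rows hold `value`; most frequent first."""
    needle = str(value).lower()
    counts = Counter()
    for row in rows:
        for column, cell in row.items():
            if str(cell).lower() == needle:
                counts[column] += 1
    return counts.most_common()


rows = [
    {"year": 2015, "part_number": 7781, "state": "California"},
    {"year": 2015, "part_number": 2015, "state": "California"},
    {"year": 2015, "part_number": 9120, "state": "Nevada"},
]
ranking = rank_columns("2015", rows)
```

Here “2015” matches the year column three times and a part number once, so the year interpretation is offered first while the alternative stays available for the user to pick.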

Validating Answers
The final piece of input interpretation is helping users verify the
intermediate steps that the product used to arrive at a result. This
helps adoption because users can now trust results that they see by
verifying the data sources used to compute the answer.
When a search result is shown, alongside the result the user is also
given an option to hover over each search term and understand the
lineage of the data (which source table and column it came from).
For example, the user could see if she has chosen revenue data from
an official data source such as the data warehouse, or a spreadsheet
shared with her by a coworker. Each source and object in the system
can also be “tagged” to show its associations (e.g., marketing, sales),
and these could serve as useful inputs to help the user understand
what data sources she picked to arrive at the answer in front of her.
In Figure 10, a ThoughtSpot user has selected the “store region” part
of the query for deeper investigation.

Figure 10. User selects parts of a ThoughtSpot query to delve into
A button next to the search box lets the user translate the search
string to an almost plain-English form that explains how the different
tables were joined, what filters were applied, and what final
result was computed. Figure 11 shows the internal information that
ThoughtSpot displays about the “store region” part of the query in
Figure 10.

Figure 11. Delving into a ThoughtSpot query
This helps business users gain confidence that the product is indeed
performing the computations the way they expect it to. Users can
share their query answers with BI analysts to reconcile any differ‐
ences. For example, the business user might have wanted to see
order date, and the BI analyst could have made her report using the
ship date—the key is that the two are different. By looking at the
output provided by ThoughtSpot, she is able to see that the date
used was the order date and could change it by picking the date
from the ship date column to get her desired answer.

Creating the Simplicity of a Search-Like Query

To show how a search interface can form and execute a query while
totally hiding the complexity of the schema and SQL, we’ll track the
ThoughtSpot Analytical Search engine through its underlying pro‐
cesses when handling a user query.
Say we have two fact tables called Contacts and Sales Details, along
with a dimension table called the Phone table that connects the
other two. Assume that for each category and product, we want to
find the following:
• Number of unique phone numbers contacted
• Number of contacts made
• How many clicks were counted
• How many sales were made
• Total revenue

With ThoughtSpot, the user just needs to type these terms into the
interface and all the complex joins happen in the backend. The
query would be: “count phone count contact count sale clicks revenue
category product”.
If you were to write the full SQL for something like this, it would
look like:
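
As an illustration only of the kind of SQL involved, the sketch below runs a hand-written query through Python's `sqlite3` module. The schema is an assumption, not the report's: it supposes that category and product live on the Phone table, that clicks are recorded on Contacts, and that revenue is recorded on Sales Details. The point it demonstrates is that two fact tables sharing a dimension cannot simply be joined together, because each contact row would be repeated for every sale on the same phone. Each fact table is aggregated separately first, then the two summaries are combined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phone (phone_id INTEGER PRIMARY KEY, category TEXT, product TEXT);
CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, phone_id INTEGER, clicks INTEGER);
CREATE TABLE sales_details (sale_id INTEGER PRIMARY KEY, phone_id INTEGER, revenue REAL);
INSERT INTO phone VALUES
  (1, 'Mobile', 'Plan A'), (2, 'Mobile', 'Plan A'), (3, 'Landline', 'Plan B');
INSERT INTO contacts VALUES (1, 1, 2), (2, 1, 1), (3, 2, 0), (4, 3, 5);
INSERT INTO sales_details VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 3, 10.0);
""")

QUERY = """
WITH contact_stats AS (          -- aggregate the first fact table on its own
  SELECT p.category, p.product,
         COUNT(DISTINCT c.phone_id) AS phones,
         COUNT(*)                   AS contacts,
         SUM(c.clicks)              AS clicks
  FROM contacts c JOIN phone p ON p.phone_id = c.phone_id
  GROUP BY p.category, p.product
),
sale_stats AS (                  -- aggregate the second fact table on its own
  SELECT p.category, p.product,
         COUNT(*)       AS sales,
         SUM(s.revenue) AS revenue
  FROM sales_details s JOIN phone p ON p.phone_id = s.phone_id
  GROUP BY p.category, p.product
)
SELECT cs.category, cs.product, cs.phones, cs.contacts, cs.clicks,
       COALESCE(ss.sales, 0)   AS sales,
       COALESCE(ss.revenue, 0) AS revenue
FROM contact_stats cs
LEFT JOIN sale_stats ss            -- simplification: keeps only groups with contacts
  ON ss.category = cs.category AND ss.product = cs.product
ORDER BY cs.category, cs.product
"""
report = conn.execute(QUERY).fetchall()
```

Even in this toy form the statement needs two subqueries, five aggregates, and three joins, which is the complexity the nine-word search hides from the user.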

This search brings together data from the following tables:

With that data, ThoughtSpot does all the complex joins and pro‐
duces the result in Figure 12.

Figure 12. ThoughtSpot produced this result after hiding all the com‐
plex logic under the hood
The user can ask ThoughtSpot to explain how it put together and
interpreted the data, and receive the display in Figure 13.
