


Search-Driven Business
Analytics
Designing a New Search Engine for Data
Andy Oram


Search-Driven Business Analytics
by Andy Oram
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(). For more information, contact our
corporate/institutional sales department: 800-998-9938 or

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
August 2015: First Edition


Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Search-Driven Business Analytics, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.


While the publisher and the authors have used good faith efforts to ensure
that the information and instructions contained in this work are accurate, the
publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-93813-3
[LSI]


Chapter 1. Search-Driven
Business Analytics
We are all accustomed to instant results with the use of major web search
engines. However, when we pull up a business intelligence (BI) product at
work, the situation is quite different. In comparison to Internet services that
we use every day, these products seem stiff and unresponsive. Business
leaders are served with pre-built reports and dashboards put together by their
BI teams, and they wait days or weeks to get reports on new inquiries about
customers, products, or markets. Thus, when a business manager moves from
Facebook, Amazon.com, or Google to her BI tool, it feels like time travel
back to a different century.
This report examines what it takes to make business intelligence as simple
and responsive as today’s consumer search engines, where the user gets
answers and visualizations as quickly as questions come to mind.
We’ll look at:
The convergence of BI and search
What a search-driven user experience looks like
The intelligence required for analytical search
Data sources and their associated data modeling requirements
Turning on-the-fly calculations into visualizations
Applying enterprise scale and security to search
The techniques described here are general and draw on well-established
practices in the field. The main reference platform for this report is the
ThoughtSpot Analytical Search Appliance. The author will also incorporate
information gleaned from discussions with technical staff from Microsoft’s
Power BI service and from Adatao, a firm that offers collaborative and
predictive analytics.


A New Generation of Vendors Offering
Interactive Visualizations
ThoughtSpot’s Analytical Search engine allows the user to ask ad-hoc
questions of their data through a search interface. The engine computes
results on-the-fly based on the search query, and offers visualizations of
interest to the user. It features an interactive interface that allows you to
search through billions of rows and compute results on-the-fly from any data
source.

Figure 1. Data display in ThoughtSpot

Microsoft’s Power BI service lets you quickly create dashboards, share
reports, and directly connect to (and incorporate) all the data available within
the organization, through partners, or publicly posted to the Internet. Power
BI Desktop enables you to transform data and create reports and
visualizations. Figure 2 shows a typical dashboard created in the Desktop.



Figure 2. Dashboard produced by Microsoft Power BI

Adatao takes a problem-solving approach to all data, big and small, where the
user starts with a hypothesis and pulls answers out of data sources to validate
or invalidate the hypothesis. Figure 3 shows typical output from Adatao,
known as a narrative, which enables data discovery and presentation in the
form of attractive visualizations.


Figure 3. Narrative produced by Adatao


Data Access Methods Are Being Transformed
by Search
So how have these new-generation technologies transformed data interaction
for the business user? An enlightening analogy can be drawn between the
way managers use BI today and how information access on the Internet has
evolved.
Typically, a manager at a data-rich company has access to certain canned
business reports. The managers have generated a list of business questions
such as “a chart showing the product revenue from each store, to compare
same-store sales year-by-year” and a programmer has dutifully coded up an
analytics application to provide those answers. If the business managers want
a different report containing metrics and relationships not provided ahead of
time, a recoding effort is involved. This severely limits the data analysis
systems, leaving them unresponsive to intuitive questioning by the business
managers. The systems and humans are operating at very different paces in
this world of old-generation BI software.

Drawing an analogy to the evolution of the Internet, this is similar to the sites
that curated content for users more than a decade ago. Users would subscribe
to forums to find out what was new. Hot products like Encarta (introduced by
Microsoft in the early 1990s when the Web was quite young) provided
predetermined sets of information in an encyclopedia format. Getting access
to these resources was much easier than pacing through the card catalog of
one’s local library, but they opened access only to a limited set of
information chosen by the site. Existing BI reports are similar to these
offerings in their inelasticity and lack of real-time interactivity to serve the
needs of the business user.
The advent of the AltaVista search engine, and subsequently Google,
transformed information access. The search engines didn’t add a jot to the
information already available. But they radically broadened the sites to which
we had access, and put us only a few seconds and a few clicks away from the
wealth of information and opinions on the Web. Immediate options are now
taken for granted as we search an online bookseller for books, a travel site for
hotels and airline tickets, etc. Within minutes we sample a mind-boggling
range of opinions from around the world, whether the subject is the best data
store for fast-moving input or the latest sports news.
What does it take to bring the same kind of instant feedback and broad
searchability to business intelligence? Some requirements include:
Real-time interactivity
When you start typing “flowers” into a modern search engine such as
Google or Bing, it anticipates what you want and suggests popular
completions, such as “flowers online” and “flowers for algernon” (a
popular book and movie title). Typing “restaurants” will probably offer
you local results. Similarly, a BI solution should instantly fashion charts
or other answers while you are typing, predicting what you want based
on its knowledge of previous queries and the data sets themselves. It
should get better over time as it learns more about what each user wants
and offer more relevant suggestions.
A single, accurate answer
Unlike web search engines that can return multiple results in relevance-ranked order, the BI interface should return just what the user asked for,
leaving out extraneous results. Ideally, when the user wants a simple
answer such as “revenue for California last year” the interface should
return a single figure instead of a table of values the user has to interpret,
or a list of links to past reports or dashboards for the user to sift through
to find the answer.
Diverse data sets
The BI solution should be able to use structured data throughout the
organization, from many different databases and even more informal
sources such as spreadsheets. All these sources should be combined
smoothly, and the solution should recognize relationships among the
columns of databases so that it can combine this data in visualizations
and other results.
A simple interface
User experience and system usability have to be similar to consumer
applications. Anyone should be able to use the solution as easily as a
search engine, without the need for a training class.
Scalability
Modern firms deal with terabytes of data or more. The solution should
be able to quickly search large amounts of data from many columns of
many tables and still return results in real time.
Security
IT staff should be able to restrict access to specific columns or rows of
data, or to particular objects such as dashboards created by users,
assigning rights to individuals or groups. The product needs to work
with existing identity management solutions, providing support for
LDAP and Active Directory integration and single sign-on capabilities.
This will allow users to easily log in using their corporate credentials.
Administrators should be able to set up security for individual users or
for groups, controlling access at the level of a saved dashboard or chart,
a column (such as a column in an HR table that has compensation data),
or a row (customer information for the West Coast might be hidden
from a sales rep on the East Coast, for example).
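
The row-level and column-level restrictions described in the Security item can be pictured with a minimal sketch. The `Policy` class, the column names, and the sample rows are hypothetical and are not taken from any of the products discussed in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical access policy for one user group."""
    hidden_columns: set = field(default_factory=set)
    # Row filter: column name -> set of values the group may see.
    allowed_values: dict = field(default_factory=dict)


def visible_rows(rows, policy):
    """Return the rows the policy allows, with hidden columns stripped."""
    result = []
    for row in rows:
        # Row-level check: every restricted column must hold an allowed value.
        if any(row.get(col) not in allowed
               for col, allowed in policy.allowed_values.items()):
            continue
        # Column-level check: drop columns the group may not see.
        result.append({k: v for k, v in row.items()
                       if k not in policy.hidden_columns})
    return result


rows = [
    {"customer": "Acme", "region": "West", "compensation": 90000},
    {"customer": "Birch", "region": "East", "compensation": 85000},
]
east_sales = Policy(hidden_columns={"compensation"},
                    allowed_values={"region": {"East"}})
print(visible_rows(rows, east_sales))
```

In a real deployment the policies would presumably be derived from groups in LDAP or Active Directory rather than built by hand as they are here.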
How does a BI solution like this change the way we do business? How does
the reduction in response time for a query, from days to seconds, lead to a
higher top line and lower costs?
Instead of waiting to see past performance of sales, the general manager of a
business unit can see real-time sales performance and make inventory
allocation decisions based on real-time demand. Business processes are
being disrupted as transformations that once had to be pre-calculated become
possible on demand.
The impact becomes even greater as interfaces are able to anticipate what a
user wants and bring into sharp focus ideas that are just emerging. This
anticipation can be based on previous queries — for instance, if someone
searches for information on California, the interface would check its cached
queries and notice similar searches for information on New York, then
suggest a related result. Everyone has a unique approach to asking questions,
so personalizing the suggestions makes the experience a lot more relevant
and user-friendly. The interface can also look at the data itself: for instance,
in each column the interface anticipates that the user is likely to request
values that are more commonly found there.



Getting Insights from Diverse Data
Enterprises’ data sources come in several flavors:
Data warehouses often store tens or hundreds of terabytes of historical data
in relational tables accessed through SQL.
Applications, both on-premise and in the cloud, produce results that can
be input into BI. Recent years have seen a notable increase in cloud
enterprise applications offered by vendors such as Salesforce and
NetSuite.
Spreadsheets, ubiquitous on desktops and laptops across the enterprise, are
used by individuals to analyze subsets of data.
With the increasing spread of Hadoop, Spark, and other “big data”
technologies within the enterprise, data sources with relatively loose
document formats are becoming an important category as well.
The more sources of data a search engine can handle, the more useful it
becomes — not only because more of the organization’s data is searchable,
but because the different sources can work together and add extra meaning.
However, one of the most time-consuming problems faced by BI analysts is
the integration of multiple data sources, especially non-relational data. A
search-driven interface can help with this, by offering a visual and easy way
for analysts to discover bad or stale data, and exclude it from the scope of
data that’s visible to business users.
Therefore, integrating sources and indexing their content for quick retrieval is
the key initial task for interactive BI and analytics. The ThoughtSpot
Analytical Search Appliance uses a variety of interfaces to integrate data
from various sources:
Data is loaded from data marts or data warehouses through the
enterprises’ chosen ETL tools, and through a JDBC/ODBC interface that
can be used to connect data sources directly to ThoughtSpot. Data can also
be directly loaded into ThoughtSpot through bulk data load scripts. These
are highly efficient, loading the data at multi-terabyte-per-hour speeds in a
scale-out fashion across all the nodes.
For cloud data sources, in addition to the above options, ThoughtSpot has
partnered with vendors to use their individual products, such as
Informatica’s Cloud Connector, to load data.
Spreadsheets can be uploaded by individual users through an interface in
the product that guides the user through the process. As part of that
workflow, the user can also specify whether she wishes to link a column
from this spreadsheet to any other column present in the system so that
she can analyze local data present on her computer against company-wide
data from their data warehouse.
ThoughtSpot understands the underlying schema and relationships between
your data when you load it, so as soon as it is loaded, it is ready to be
searched without any additional modeling work. The system also works
across any time granularity — weekly, quarterly, yearly — without requiring
the BI team to build new aggregate tables, OLAP cubes, and materialized
views. This helps business users to start using the system as soon as the IT/BI
team has loaded data into it. And as the user types queries that connect
multiple tables together, the multiple join path choices are all handled under
the hood so the user does not have to know any SQL terminology to connect
diverse data sets together and complete her query. ThoughtSpot is able to
provide sub-second response times for searches over billions of rows of data
because of its purpose-built, in-memory relational cache. This cache
understands search semantics and security rules, as well as query plans, and
is able to scale out across hundreds of nodes.
Once the data is loaded, ThoughtSpot creates an index to maximize the speed
of queries. For data volumes in terabytes, the index needs to be efficiently
sharded and distributed across multiple nodes without compromising on
search latency. The creation of the index itself must be distributed so that
there is minimal delay between when new data shows up in the system and
when it is ready to be searched.
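
As a rough illustration of a sharded index, the sketch below assigns each indexed value to a shard with a stable hash, so a lookup only has to consult the shard that owns the value. The function names and the choice of MD5 as the hash are assumptions made for this example; the report does not describe how ThoughtSpot actually partitions its index.

```python
import hashlib
from collections import defaultdict


def shard_for(token, num_shards):
    """Pick a shard with a stable hash so every node agrees on placement."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_sharded_index(rows, num_shards):
    """Map each lowercased cell value to the (column, row id) pairs holding it."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            token = str(value).lower()
            shards[shard_for(token, num_shards)][token].append((column, row_id))
    return shards


def lookup(shards, token):
    """Consult only the shard that owns the token."""
    token = token.lower()
    return shards[shard_for(token, len(shards))].get(token, [])


rows = [{"state": "California", "year": 2015},
        {"state": "New York", "year": 2015}]
shards = build_sharded_index(rows, num_shards=4)
print(lookup(shards, "California"))
print(lookup(shards, "2015"))
```

Because placement depends only on the value itself, each node could in principle build its own shard independently as new rows arrive, which is one way to keep indexing delay low.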
Microsoft’s Power BI features integration with external tools, both from
Microsoft and from partners such as Salesforce and Zendesk. The Power BI
interface helps the user find these resources — databases, spreadsheets,
Hadoop data stores, even social media sites — and connect to them. A
relational database provides its own schema, whereas Power BI creates the
schema for a spreadsheet, normally using the first row as column names.
Figure 4 shows an entity-relationship diagram created by Power BI to
represent an incoming schema.

Figure 4. Schema in Power BI
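
The idea of deriving a schema from a spreadsheet by treating its first row as column names can be sketched in a few lines. This is an illustration using Python's standard `csv` module with a deliberately simple type guess; it is not Power BI's implementation, and the function names and sample data are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its values: integer, then real, else text."""
    for caster, name in ((int, "integer"), (float, "real")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"


def infer_schema(csv_text):
    """Use the first row as column names and infer a type per column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = list(zip(*reader))  # transpose the remaining rows
    return {name: infer_type(col) for name, col in zip(header, columns)}


sheet = "store,year,revenue\nOakland,2015,1200.50\nFresno,2015,980\n"
print(infer_schema(sheet))
```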

In Power BI, the user can also attach to a stream of incoming data and see a
dashboard updated in real time as new data comes in. The user can then
provide this dashboard to colleagues — by sending an email with a URL, or
through SharePoint — and they too can see real-time changes.
Power BI takes integration further by supporting single sign-on. For instance,
a user would log into Power BI and enter her Salesforce credentials. After
this, the user just needs to log into Power BI for future sessions and would be
able to search Salesforce without reauthorizing the connection.


Interpreting User Input
Let’s see how the solutions in this report handle user questions. Power BI and
Adatao estimate what a user’s intent is using natural language processing
(NLP) techniques. They accept a range of relatively free text and resolve
ambiguities by examining the context of the words used.
ThoughtSpot, on the other hand, chose not to use NLP in order to remove any
chance of ambiguity. ThoughtSpot’s search engine guides the user as they
type with intelligent search suggestions, making sure that the user’s intent
and the search engine are always in sync. As such, ThoughtSpot is always
able to provide a single, accurate result, rather than a list of probabilistic
answers.
All of these tools are fault-tolerant at the user-input level, allowing users to
get to answers even with misspellings, changed word orders, or incorrect
grammar. The tools can execute the kind of type-ahead autocomplete that
Google has made familiar (see Figure 5).


Figure 5. Autocomplete in ThoughtSpot

Figure 6 shows the output of an NLP query in Power BI.

Figure 6. Natural-language query in Power BI


Figure 7 shows a typical set of relevant business questions suggested by
Adatao.

Figure 7. Adatao search suggestions

To recognize natural-language phrases such as “What is the average cost per
trip by region of travel?”, Power BI incorporated advanced technology from
other Microsoft tools, notably Bing. Corrections to spelling and alternative
columns can be presented to the user.
As we have seen, the ThoughtSpot Analytical Search Appliance can handle a
wide range of user requests and help the user structure her queries. Let’s
focus on a simple request such as “Revenue California 2015 county.” If the
user types “Cal,” the engine fills in “California” as a suggested completion.
The algorithms that calculate and rank the completions take into account
many factors, including how often a word shows up in the data (its
cardinality) and how often people have searched for it. As the product gets
used, the suggestions get more relevant and personalized to each user, as with
search engines like Google.
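
A minimal sketch of this kind of ranking follows. It scores each candidate completion by how often the value appears in the data plus a weighted count of past searches. The weighting scheme, the `search_weight` parameter, and the sample counts are assumptions for illustration only; the report does not give ThoughtSpot's actual scoring formula.

```python
from collections import Counter


def rank_completions(prefix, data_counts, search_counts, limit=3,
                     search_weight=2.0):
    """Rank values starting with `prefix` by a weighted popularity score."""
    prefix = prefix.lower()
    scored = []
    for value, in_data in data_counts.items():
        if value.lower().startswith(prefix):
            score = in_data + search_weight * search_counts.get(value, 0)
            scored.append((score, value))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [value for _, value in scored[:limit]]


data_counts = Counter({"California": 500, "Calgary": 40, "Colorado": 300})
search_counts = Counter({"California": 25, "Calgary": 1})
print(rank_completions("Cal", data_counts, search_counts))
```

Keeping a separate `search_counts` table per user would be one simple way to personalize the ranking over time.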
To facilitate this type of personalization, index matching has to support exact
matches as well as prefix, suffix, and substring matches; it also looks for
synonyms. If there are no matches (for example, if the user makes a
typographical error), the engine offers suggestions based on spellcheck-based
algorithms and phonetic matching algorithms, such as Metaphone.
While performing these over potentially billions of rows of data, the engine
also needs to apply sophisticated row-level, column-level, and object-level
security rules so that only the entities the user is allowed to see are visible
even in the search suggestions.
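
The tiered matching described above can be sketched as a function that tries each kind of match in turn and falls back to approximate matching only when nothing else hits. The ordering of the tiers is an assumption. For the fallback, this sketch uses edit-similarity matching from Python's standard `difflib` module as a stand-in for the spellcheck and phonetic (Metaphone) algorithms mentioned in the text, because Metaphone is not part of the standard library.

```python
import difflib


def match_terms(query, vocabulary, synonyms=None):
    """Return (match kind, terms) using the first tier that yields results."""
    q = query.lower()
    synonyms = synonyms or {}
    lowered = {term.lower(): term for term in vocabulary}

    if q in lowered:
        return "exact", [lowered[q]]
    if q in synonyms:
        return "synonym", [synonyms[q]]
    for kind, test in (("prefix", str.startswith),
                       ("suffix", str.endswith),
                       ("substring", str.__contains__)):
        hits = sorted(t for low, t in lowered.items() if test(low, q))
        if hits:
            return kind, hits
    # Fallback for typos: edit-similarity matching stands in for the
    # spellcheck and phonetic algorithms mentioned in the text.
    close = difflib.get_close_matches(q, list(lowered), n=3, cutoff=0.7)
    return "fuzzy", [lowered[c] for c in close]


vocabulary = ["California", "Colorado", "revenue"]
print(match_terms("Cal", vocabulary))
print(match_terms("Califrnia", vocabulary))
print(match_terms("turnover", vocabulary, {"turnover": "revenue"}))
```

A production engine would also filter `vocabulary` by the user's access rights before matching, so that restricted values never appear as suggestions.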
Within the ThoughtSpot Analytical Search Appliance, when a user types
“2015,” the engine knows that the text refers to a year — not a product part
number or some other arbitrary number. The engine can predict this with high
accuracy because 2015 appears frequently in a Year column in a database it
indexed.
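
One simple way to picture this inference is to rank columns by the share of their values that equal the typed token. The function and sample data below are hypothetical; the report states only that frequency in an indexed Year column drives the prediction, not how the engine computes it.

```python
def likely_columns(token, column_values):
    """Rank columns by the share of their values equal to the token."""
    scores = {}
    for column, values in column_values.items():
        if values:
            share = sum(1 for v in values if str(v) == token) / len(values)
            if share > 0:
                scores[column] = share
    return sorted(scores, key=scores.get, reverse=True)


column_values = {
    "year": [2014, 2015, 2015, 2015],
    "part_number": [2015, 7731, 8802, 9120],
    "state": ["California", "Nevada", "California", "Oregon"],
}
print(likely_columns("2015", column_values))
```

Here "2015" makes up three quarters of the `year` column but only a quarter of `part_number`, so `year` is ranked first.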
A crucial prerequisite for joining data to respond to user queries is to
recognize relationships. The “California” and “2015” in the user’s query lead
the engine to filter the data so it uses only the rows that are related to
California and are from the year 2015. In our “California 2015” example,
ThoughtSpot can determine that a relationship exists if a foreign key connects
two tables.
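
Finding a chain of foreign keys that connects two tables can be treated as a shortest-path search over a graph whose nodes are tables. The sketch below uses breadth-first search; the table names are hypothetical, and the report does not say which algorithm ThoughtSpot uses to choose join paths.

```python
from collections import deque


def join_path(foreign_keys, start, goal):
    """Breadth-first search for the shortest chain of tables linking two tables."""
    graph = {}
    for child, parent in foreign_keys:
        # A foreign key can be followed in either direction when joining.
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no relationship connects the two tables


foreign_keys = [("sales", "stores"), ("sales", "dates"), ("stores", "counties")]
print(join_path(foreign_keys, "dates", "counties"))
```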
The interface offers suggestions in a dropdown box as the user types, and the
user can immediately choose the one she intends. For instance, as the user
types “Revenue California,” the interface suggests several completions such
as “Revenue California by county” and “Revenue California by customer,”
drawing on its knowledge of the columns in the database. The suggestions
include those generated through the analytical search algorithm already
described, as well as those generated by typical document search algorithms,
such as those implemented in Apache Lucene. This instant responsiveness keeps the user and engine in
lock step. It allows the user to focus on her thought process instead of her
interactions with the engine. It also allows her to create a new answer or
consume a saved answer based on what she’s looking for, without having to
limit herself to saved charts and dashboards.
The ThoughtSpot engine only produces suggestions that adhere to any
security restrictions. If a user has not been granted access to a column, it is
not used to generate search suggestions, let alone produce results.
In our search for “Revenue California 2015 county,” the engine computes
that the data in the joined State and Year columns should be grouped by
county in the display. The ThoughtSpot user interface also recognizes
common aggregate functions, such as “sum” or “standard deviation”, and
computations such as “growth of” that are complex to express in SQL.
Figure 8 shows the results of a complex calculation in ThoughtSpot.


Figure 8. Monthly Sales Growth chart example
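
A "growth of" computation such as monthly sales growth compares each period with the one before it. The sketch below shows the arithmetic on already-aggregated monthly totals; the function name and sample figures are hypothetical.

```python
def monthly_growth(sales_by_month):
    """Return percentage growth of each month over the previous month."""
    months = sorted(sales_by_month)
    growth = {}
    for prev, curr in zip(months, months[1:]):
        base = sales_by_month[prev]
        # Growth is undefined when the previous month had no sales.
        growth[curr] = None if base == 0 else round(
            100.0 * (sales_by_month[curr] - base) / base, 1)
    return growth


sales = {"2015-01": 200.0, "2015-02": 250.0, "2015-03": 225.0}
print(monthly_growth(sales))
```

Expressing the same comparison in SQL typically requires a self-join or a window function, which is why a single search keyword for it saves the user real effort.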


Translating Queries into Answers
Programmers reading this account will quickly see that the services in this
report manipulate SQL behind the scenes, generating relational database
queries of considerable complexity and sophistication. But the user doesn’t
have to think in terms of relational data or SQL at all. The inputs are mildly
structured but close to everyday language — the original premise of SQL in
the 1970s.
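
To make the translation concrete, the sketch below turns the already-recognized terms of the search "Revenue California 2015 county" into a parameterized SQL query and runs it against an in-memory SQLite table. The `sales` table, its columns, and the `build_query` helper are hypothetical; real engines generate far more elaborate queries, including joins.

```python
import sqlite3


def build_query(measure, filters, group_by):
    """Assemble a parameterized aggregate query from recognized search terms.

    Column names are interpolated, so they must come from the trusted schema,
    never directly from user text. Filter values are passed as parameters.
    """
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"SELECT {group_by}, SUM({measure}) AS total FROM sales"
    if where:
        sql += f" WHERE {where}"
    sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return sql, list(filters.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, county TEXT, year INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("California", "Alameda", 2015, 100.0),
    ("California", "Alameda", 2015, 50.0),
    ("California", "Marin", 2015, 70.0),
    ("California", "Marin", 2014, 30.0),
    ("Nevada", "Clark", 2015, 90.0),
])

# "Revenue California 2015 county" after the terms have been recognized.
sql, params = build_query(
    "revenue", {"state": "California", "year": 2015}, "county")
results = conn.execute(sql, params).fetchall()
print(sql)
print(results)
```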
In the 1980s and 1990s, a number of products promised a “natural-language-on-SQL” approach, but failed to meet the market’s need. The query
suggestion/completion interfaces implemented by the services in this report
are along the lines of popular search engines, and add a crucial missing piece
to those older approaches. It turns out that the query suggestion/completion
interface is a significant factor in helping users effectively go from thought to
question to answer. Figure 9 shows how ThoughtSpot extends the user’s
query with suggestions.


Figure 9. Search suggestions are refined as you type your query

We have assumed so far that a column named “county” is in some input
table. However, if the column has some other name (say, “region”), the
services in this report allow a user or administrator to define synonyms, so
that they can map an oddly named column (such as “cust_reg”) to words in
everyday language (such as “region”). Power BI, for instance, lets users do
this through PowerPivot in Excel. ThoughtSpot allows administrators to
relate column names to synonyms. In this hypothetical case, the administrator
could indicate that when a user requests a “county,” the engine should map
that to the “region” column from the input. ThoughtSpot also uses synonym
sets and other matching algorithms to offer meaningful suggestions based on
what the user meant, and lets the user pick the correct choice to move
forward with the query computation.
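
Synonym mapping of this kind amounts to a lookup table maintained by an administrator. In the sketch below, the `resolve_column` helper and the synonym table are hypothetical; only the "cust_reg" and "region" names come from the example in the text.

```python
def resolve_column(term, columns, synonyms):
    """Map a user's word to a real column name, directly or via a synonym."""
    term = term.lower()
    if term in columns:
        return term
    target = synonyms.get(term)
    return target if target in columns else None


columns = {"cust_reg", "revenue", "state"}
# Administrator-defined synonyms: everyday word -> actual column name.
synonyms = {"region": "cust_reg", "county": "cust_reg", "sales": "revenue"}

print(resolve_column("Region", columns, synonyms))
print(resolve_column("revenue", columns, synonyms))
print(resolve_column("profit", columns, synonyms))
```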
All three tools in this report, working with the original data sources, let the
user “slice and dice” data through filtering and drill-down operations (by
region, product segment, etc.). With ThoughtSpot, users can also slice-and-dice directly in the search bar by adding or removing search terms.

In short, a responsive user interface should — in real time — compare user
inputs to both column names and values in the input databases. It should be

