Search-Driven Business Analytics
Designing a New Search Engine for Data
Andy Oram


Search-Driven Business Analytics
by Andy Oram
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (). For more information,
contact our corporate/institutional sales department: 800-998-9938 or
Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
August 2015: First Edition
Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Search-Driven Business
Analytics, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions contained in
this work is at your own risk. If any code samples or other technology this work contains or describes
is subject to open source licenses or the intellectual property rights of others, it is your responsibility
to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93813-3
[LSI]


Chapter 1. Search-Driven Business Analytics
We are all accustomed to instant results with the use of major web search engines. However, when
we pull up a business intelligence (BI) product at work, the situation is quite different. In comparison
to Internet services that we use every day, these products seem stiff and unresponsive. Business
leaders are served with pre-built reports and dashboards put together by their BI teams, and they wait
days or weeks to get reports on new inquiries about customers, products, or markets. Thus, when a
business manager moves from Facebook, Amazon.com, or Google to her BI tool, it feels like time
travel back to a different century.
This report examines what it takes to make business intelligence as simple and responsive as today’s
consumer search engines, where the user gets answers and visualizations as quickly as questions
come to mind.
We’ll look at:
The convergence of BI and search
What a search-driven user experience looks like
The intelligence required for analytical search
Data sources and their associated data modeling requirements
Turning on-the-fly calculations into visualizations
Applying enterprise scale and security to search
The techniques described here are general and draw on well-established practices in the field. The
main reference platform for this report is the ThoughtSpot Analytical Search Appliance. The author
will also incorporate information gleaned from discussions with technical staff from Microsoft’s
Power BI service and from Adatao, a firm that offers collaborative and predictive analytics.

A New Generation of Vendors Offering Interactive Visualizations
ThoughtSpot’s Analytical Search engine allows the user to ask ad-hoc questions of their data through
a search interface. The engine computes results on-the-fly based on the search query, and offers
visualizations of interest to the user. It features an interactive interface that allows you to search
through billions of rows and compute results on-the-fly from any data source.


Figure 1. Data display in ThoughtSpot

Microsoft’s Power BI service lets you quickly create dashboards, share reports, and directly connect
to (and incorporate) all the data available within the organization, through partners, or publicly
posted to the Internet. Power BI Desktop enables you to transform data and create reports and
visualizations. Figure 2 shows a typical dashboard created in the Desktop.

Figure 2. Dashboard produced by Microsoft Power BI


Adatao takes a problem-solving approach to all data, big and small, where the user starts with a
hypothesis and pulls answers out of data sources to validate or invalidate the hypothesis. Figure 3
shows typical output from Adatao, known as a narrative, which enables data discovery and
presentation in the form of attractive visualizations.

Figure 3. Narrative produced by Adatao

Data Access Methods Are Being Transformed by Search
So how have these new-generation technologies transformed data interaction for the business user?
An enlightening analogy can be drawn between the way managers use BI today and how information
access on the Internet has evolved.
Typically, a manager at a data-rich company has access to certain canned business reports. The
managers have generated a list of business questions such as “a chart showing the product revenue
from each store, to compare same-store sales year-by-year” and a programmer has dutifully coded up
an analytics application to provide those answers. If the business managers want a different report
containing metrics and relationships not provided ahead of time, a recoding effort is involved. This
severely limits the data analysis systems, leaving them unresponsive to intuitive questioning by the
business managers. The systems and humans are operating at very different paces in this world of
old-generation BI software.
Drawing an analogy to the evolution of the Internet, this is similar to the sites that curated content for
users more than a decade ago. Users would subscribe to forums to find out what was new. Hot
products like Encarta (introduced by Microsoft in the early 1990s when the Web was quite young)
provided predetermined sets of information in an encyclopedia format. Getting access to these
resources was much easier than pacing through the card catalog of one’s local library, but they
opened access only to a limited set of information chosen by the site. Existing BI reports are similar
to these offerings in their inelasticity and lack of real-time interactivity to serve the needs of the
business user.
The advent of the AltaVista search engine, and subsequently Google, transformed information access.
The search engines didn’t add a jot to the information already available. But they radically broadened
the sites to which we had access, and put us only a few seconds and a few clicks away from the
wealth of information and opinions on the Web. Immediate options are now taken for granted as we
search an online bookseller for books, a travel site for hotels and airline tickets, etc. Within minutes
we sample a mind-boggling range of opinions from around the world, whether the subject is the best
data store for fast-moving input or the latest sports news.
What does it take to bring the same kind of instant feedback and broad searchability to business
intelligence? Some requirements include:
Real-time interactivity
When you start typing “flowers” into a modern search engine such as Google or Bing, it
anticipates what you want and suggests popular completions, such as “flowers online” and
“flowers for algernon” (a popular book and movie title). Typing “restaurants” will probably offer
you local results. Similarly, a BI solution should instantly fashion charts or other answers while
you are typing, predicting what you want based on its knowledge of previous queries and the data
sets themselves. It should get better over time as it learns more about what each user wants and
offer more relevant suggestions.
A single, accurate answer
Unlike web search engines that can return multiple results in relevance-ranked order, the BI
interface should return just what the user asked for, leaving out extraneous results. Ideally, when
the user wants a simple answer such as “revenue for California last year” the interface should
return a single figure instead of a table of values the user has to interpret, or a list of links to past
reports or dashboards for the user to sift through to find the answer.
Diverse data sets
The BI solution should be able to use structured data throughout the organization, from many
different databases and even more informal sources such as spreadsheets. All these sources
should be combined smoothly, and the solution should recognize relationships among the columns
of databases so that it can combine this data in visualizations and other results.
A simple interface
User experience and system usability have to be similar to consumer applications. Anyone should
be able to use the solution as easily as a search engine, without the need for a training class.
Scalability
Modern firms deal with terabytes of data or more. The solution should be able to quickly search
large amounts of data from many columns of many tables and still return results in real time.
Security
IT staff should be able to restrict access to specific columns or rows of data, or to particular
objects such as dashboards created by users, assigning rights to individuals or groups. The
product needs to work with existing identity management solutions, providing support for LDAP
and Active Directory integration and single sign-on capabilities. This will allow users to easily
log in using their corporate credentials.
Administrators should be able to set up security for individual users or for groups, controlling
access at the level of a saved dashboard or chart, a column (such as a column in an HR table that
has compensation data), or a row (customer information for the West Coast might be hidden from
a sales rep on the East Coast, for example).
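The row- and column-level restrictions described above can be illustrated with a minimal sketch. The `Policy` class, the `apply_policy` helper, and the sample data are hypothetical and are not taken from any of the products discussed. The sketch only shows the general idea of filtering rows and hiding columns before results reach a user.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Hypothetical access policy for one user or group."""
    hidden_columns: set = field(default_factory=set)
    # Maps a column name to the set of values this user may see in it.
    row_filters: dict = field(default_factory=dict)

def apply_policy(rows, policy):
    """Return only the rows and columns that the policy allows."""
    visible = []
    for row in rows:
        allowed = all(row.get(col) in values
                      for col, values in policy.row_filters.items())
        if allowed:
            visible.append({k: v for k, v in row.items()
                            if k not in policy.hidden_columns})
    return visible

# Illustrative data: an East Coast rep may not see West Coast rows
# or the (assumed) sensitive "discount" column.
customers = [
    {"name": "Acme", "region": "West", "discount": 0.10},
    {"name": "Bolt", "region": "East", "discount": 0.05},
]
east_rep = Policy(hidden_columns={"discount"}, row_filters={"region": {"East"}})
east_view = apply_policy(customers, east_rep)
```

In a real deployment the policy would presumably be derived from group membership in the identity system rather than written by hand.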
How does a BI solution like this change the way we do business? How does the reduction in response
time for a query, from days to seconds, lead to a higher top line and lower costs?
Instead of waiting to see past performance of sales, the general manager of a business unit can see
real-time sales performance and make inventory allocation decisions based on real-time demand.
Business processes are undergoing complete disruptions as pre-calculated transformations are now
possible on demand.
The impact becomes even greater as interfaces are able to anticipate what a user wants and bring into
sharp focus ideas that are just emerging. This anticipation can be based on previous queries—for
instance, if someone searches for information on California, the interface would check its cached
queries and notice similar searches for information on New York, then suggest a related result.
Everyone has a unique approach to asking questions, so personalizing the suggestions makes the
experience a lot more relevant and user-friendly. The interface can also look at the data itself: for
instance, in each column the interface anticipates that the user is likely to request values that are more
commonly found there.

Getting Insights from Diverse Data
Enterprises’ data sources come in several flavors:
Data warehouses often store tens or hundreds of terabytes of historical data in relational tables
accessed through SQL.
Applications, both on-premise and in the cloud, produce results that can be input into BI. Recent
years have seen a notable increase in cloud enterprise applications offered by vendors such as
Salesforce and NetSuite.
The ubiquitous spreadsheets, scattered across desktops and laptops throughout the enterprise, that
individuals use to analyze subsets of data.
With the increasing spread of Hadoop, Spark, and other “big data” technologies within the
enterprise, data sources with relatively loose document formats are becoming an important
category as well.

The more sources of data a search engine can handle, the more useful it becomes—not only because
more of the organization’s data is searchable, but because the different sources can work together and
add extra meaning. However, one of the most time-consuming problems faced by BI analysts is the
integration of multiple data sources, especially non-relational data. A search-driven interface can
help with this, by offering a visual and easy way for analysts to discover bad or stale data, and
exclude it from the scope of data that’s visible to business users.
Therefore, integrating sources and indexing their content for quick retrieval is the key initial task for
interactive BI and analytics. The ThoughtSpot Analytical Search Appliance uses a variety of
interfaces to integrate data from various sources:
Data is loaded from data marts or data warehouses through the enterprises’ chosen ETL tools, and
through a JDBC/ODBC interface that can be used to connect data sources directly to ThoughtSpot.
Data can also be directly loaded into ThoughtSpot through bulk data load scripts. These are highly
efficient, loading the data at multi-terabyte-per-hour speeds in a scale-out fashion across all the
nodes.
For cloud data sources, in addition to the above options, ThoughtSpot has partnered with vendors
to use their individual products, such as Informatica’s Cloud Connector, to load data.
Spreadsheets can be uploaded by individual users through an interface in the product that guides
the user through the process. As part of that workflow, the user can also specify whether she
wishes to link a column from this spreadsheet to any other column present in the system so that she
can analyze local data present on her computer against company-wide data from their data
warehouse.
ThoughtSpot understands the underlying schema and relationships between your data when you load
it, so as soon as it is loaded, it is ready to be searched without any additional modeling work. The
system also works across any time granularity—weekly, quarterly, yearly—without requiring the BI
team to build new aggregate tables, OLAP cubes, and materialized views. This helps business users
to start using the system as soon as the IT/BI team has loaded data into it. And as the user types
queries that connect multiple tables together, the multiple join path choices are all handled under the
hood so the user does not have to know any SQL terminology to connect diverse data sets together
and complete her query. ThoughtSpot is able to provide sub-second response times for searches over
billions of rows of data because of its purpose-built, in-memory relational cache. This cache
understands search semantics and security rules, as well as query plans, and is able to scale out
across hundreds of nodes.
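One way to picture how join paths can be resolved without the user naming them is a shortest-path search over the foreign-key links between tables. The sketch below is an assumption-laden illustration, not a description of ThoughtSpot's internals. The table names and the `FOREIGN_KEYS` map are hypothetical.

```python
from collections import deque

# Hypothetical foreign-key links: table -> tables it can be joined to.
FOREIGN_KEYS = {
    "sales": ["stores", "products"],
    "stores": ["sales", "regions"],
    "products": ["sales"],
    "regions": ["stores"],
}

def join_path(start, goal, links=FOREIGN_KEYS):
    """Breadth-first search for the shortest chain of joinable tables."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in links.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two tables are not connected

path = join_path("products", "regions")
```

When several paths of equal length exist, a real engine would need some rule or a user prompt to pick one. This sketch simply returns the first one found.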
Once the data is loaded, ThoughtSpot creates an index to maximize the speed of queries. For data
volumes in terabytes, the index needs to be efficiently sharded and distributed across multiple nodes
without compromising on search latency. The creation of the index itself must be distributed so that
there is minimal delay between when new data shows up in the system and when it is ready to be
searched.
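As a rough illustration of sharding an index, the sketch below hashes each token to one of several shards, so a lookup only has to consult the shard that owns the token. The shard count, the helper names, and the in-process lists standing in for nodes are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed cluster size, for illustration only

def shard_for(token):
    """Stable hash so every node agrees on which shard owns a token."""
    return zlib.crc32(token.encode("utf-8")) % NUM_SHARDS

def build_index(rows):
    """Build one inverted index (token -> row ids) per shard."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for row_id, values in enumerate(rows):
        for value in values:
            token = str(value).lower()
            shards[shard_for(token)][token].add(row_id)
    return shards

def lookup(shards, token):
    """Only the owning shard has to be consulted for a token."""
    token = token.lower()
    return shards[shard_for(token)].get(token, set())

rows = [("California", 2015), ("New York", 2015), ("California", 2014)]
shards = build_index(rows)
```

Because each shard can be built independently from its own slice of tokens, index construction can in principle be spread across nodes in the same way as lookups.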
Microsoft’s Power BI features integration with external tools, both from Microsoft and from partners
such as Salesforce and Zendesk. The Power BI interface helps the user find these resources—
databases, spreadsheets, Hadoop data stores, even social media sites—and connect to them. A
relational database provides its own schema, whereas Power BI creates the schema for a
spreadsheet, normally using the first row as column names. Figure 4 shows an entity-relationship
diagram created by Power BI to represent an incoming schema.

Figure 4. Schema in Power BI
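Deriving a schema from a spreadsheet by treating the first row as column names can be sketched in a few lines. This is a simplified illustration that uses Python's standard `csv` module. The type-guessing rule (a column is a number only if every value parses as a float) is an assumption, not Power BI's actual logic.

```python
import csv
import io

def infer_schema(csv_text):
    """Use the header row as column names and guess a type per column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    schema = {}
    for i, name in enumerate(header):
        try:
            for row in rows:
                float(row[i])  # raises ValueError on non-numeric text
            schema[name] = "number"
        except ValueError:
            schema[name] = "text"
    return schema

sheet = "store,region,revenue\nS1,West,1200.50\nS2,East,980\n"
schema = infer_schema(sheet)
```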

In Power BI, the user can also attach to a stream of incoming data and see a dashboard updated in
real time as new data comes in. The user can then provide this dashboard to colleagues—by sending
an email with a URL, or through SharePoint—and they too can see real-time changes.
Power BI takes integration further by supporting single sign-on. For instance, a user would log into
Power BI and enter her Salesforce credentials. After this, the user just needs to log into Power BI for
future sessions and would be able to search Salesforce without reauthorizing the connection.

Interpreting User Input
Let’s see how the solutions in this report handle user questions. Power BI and Adatao estimate what a
user’s intent is using natural language processing (NLP) techniques. They accept a range of relatively
free text and resolve ambiguities by examining the context of the words used.
ThoughtSpot, on the other hand, chose not to use NLP in order to remove any chance of ambiguity.
ThoughtSpot’s search engine guides the user as they type with intelligent search suggestions, making
sure that the user’s intent and the search engine are always in sync. As such, ThoughtSpot is always
able to provide a single, accurate result, rather than a list of probabilistic answers.
All of these tools are fault-tolerant at the user-input level, allowing users to get to answers even with
misspellings, changed word orders, or incorrect grammar. The tools can execute the kind of
type-ahead autocomplete that Google has made familiar (see Figure 5).

Figure 5. Autocomplete in ThoughtSpot

Figure 6 shows the output of an NLP query in Power BI.


Figure 6. Natural-language query in Power BI

Figure 7 shows a typical set of relevant business questions suggested by Adatao.

Figure 7. Adatao search suggestions

To recognize natural-language phrases such as “What is the average cost per trip by region of
travel?”, Power BI incorporated advanced technology from other Microsoft tools, notably Bing.
Corrections to spelling and alternative columns can be presented to the user.
As we have seen, the ThoughtSpot Analytical Search Appliance can handle a wide range of user
requests and help the user structure her queries. Let’s focus on a simple request such as “Revenue
California 2015 county.” If the user types “Cal,” the engine fills in “California” as a suggested
completion. The algorithms that calculate and rank the completions take into account many factors,
including how often a word shows up in the data (its cardinality) and how often people have searched
for it. As the product gets used, the suggestions get more relevant and personalized to each user, as
with search engines like Google.
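A toy version of completion ranking might combine the two signals mentioned here: how often a value appears in the data and how often it has been searched. The function name, the sample counts, and the 2x weight on past searches are illustrative assumptions. The real ranking is described as using many more factors.

```python
def rank_completions(prefix, data_counts, search_counts, limit=3):
    """Rank candidate completions by data frequency plus past searches.

    The 2x weight on past searches is an arbitrary illustrative choice.
    """
    prefix = prefix.lower()
    candidates = [w for w in data_counts if w.lower().startswith(prefix)]

    def score(word):
        return data_counts[word] + 2 * search_counts.get(word, 0)

    # Highest score first; ties are broken alphabetically.
    return sorted(candidates, key=lambda w: (-score(w), w))[:limit]

data_counts = {"California": 500, "Calgary": 40, "Calabasas": 5, "Texas": 300}
search_counts = {"Calgary": 10}
suggestions = rank_completions("Cal", data_counts, search_counts)
```

Keeping a separate `search_counts` table per user would be one simple way to personalize the ordering.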
To facilitate this type of personalization, index matching has to support exact matches as well as
prefix, suffix, and substring matches; it also looks for synonyms. If there are no matches (for
example, if the user makes a typographical error), the engine offers suggestions using spellcheck-based
algorithms and phonetic matching algorithms, such as Metaphone. While performing these over
potentially billions of rows of data, the engine also needs to apply sophisticated row-level,
column-level, and object-level security rules so that only the entities the user is allowed to see are
visible even in the search suggestions.
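The tiers of matching described above can be sketched as a cascade: exact, then prefix, suffix, and substring, with an approximate fallback for typos. In this sketch the fallback uses `difflib.get_close_matches` from the Python standard library as a stand-in for the spellcheck and phonetic algorithms named in the text. It is not Metaphone, and the vocabulary is hypothetical.

```python
import difflib

def match_terms(query, vocabulary):
    """Try exact, prefix, suffix, then substring matches; fall back to fuzzy."""
    q = query.lower()
    lowered = {term.lower(): term for term in vocabulary}
    if q in lowered:
        return [lowered[q]]
    # Each test is called as test(indexed_term, query).
    for test in (str.startswith, str.endswith, str.__contains__):
        hits = [orig for low, orig in lowered.items() if test(low, q)]
        if hits:
            return sorted(hits)
    # No literal match: approximate string matching as a spellcheck stand-in.
    close = difflib.get_close_matches(q, list(lowered), n=3, cutoff=0.7)
    return [lowered[c] for c in close]

vocabulary = ["California", "Colorado", "New York", "Revenue"]
typo_hits = match_terms("Califronia", vocabulary)
```

A production engine would also filter the returned terms through the security rules before showing them, so that suggestions never reveal restricted values.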
Within the ThoughtSpot Analytical Search Appliance, when a user types “2015,” the engine
knows that the text refers to a year—not a product part number or some other arbitrary number. The
engine can predict this with high accuracy because 2015 appears frequently in a Year column in a
database it indexed.
A crucial prerequisite for joining data to respond to user queries is to recognize relationships. The
“California” and “2015” in the user’s query lead the engine to filter the data so it uses only the rows
that are related to California and are from the year 2015. In our “California 2015” example,
ThoughtSpot can determine that a relationship exists if a foreign key connects two tables.
The interface offers suggestions in a dropdown box as the user types, and the user can immediately
choose the one she intends. For instance, as the user types “Revenue California,” the interface
suggests several completions such as “Revenue California by county” and “Revenue California by
customer,” drawing on its knowledge of the columns in the database. The suggestions include those
generated through the analytical search algorithm already described, as well as those generated by
typical document search algorithms, like Apache Lucene. This instant responsiveness keeps the user
and engine in lock step. It allows the user to focus on her thought process instead of her interactions
with the engine. It also allows her to create a new answer or consume a saved answer based on what
she’s looking for, without having to limit herself to saved charts and dashboards.
The ThoughtSpot engine only produces suggestions that adhere to any security restrictions. If a user
has not been granted access to a column, it is not used to generate search suggestions, let alone
produce results.
In our search for “Revenue California 2015 county,” the engine computes that the data in the joined
State and Year columns should be grouped by county in the display. The ThoughtSpot user interface
also recognizes common aggregate functions, such as “sum” or “standard deviation”, and
computations such as “growth of” that are complex to express in SQL. Figure 8 shows the results of a
complex calculation in ThoughtSpot.


Figure 8. Monthly Sales Growth chart example
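To make a “growth of” computation concrete, the sketch below calculates month-over-month percentage growth from monthly totals. The figures are invented for illustration, and the function assumes every previous month has a nonzero total.

```python
def monthly_growth(sales):
    """Percent change of each month's total versus the previous month.

    Assumes keys sort chronologically (e.g. "2015-01") and that no
    previous-month total is zero.
    """
    growth = {}
    months = sorted(sales)
    for prev, cur in zip(months, months[1:]):
        change = (sales[cur] - sales[prev]) / sales[prev] * 100
        growth[cur] = round(change, 1)
    return growth

sales = {"2015-01": 200.0, "2015-02": 250.0, "2015-03": 225.0}
growth = monthly_growth(sales)
```

Here February grows 25 percent over January, and March falls 10 percent from February.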

Translating Queries into Answers
Programmers reading this account will quickly see that the services in this report manipulate SQL
behind the scenes, generating relational database queries of considerable complexity and
sophistication. But the user doesn’t have to think in terms of relational data or SQL at all. The inputs
are mildly structured but close to everyday language—the original premise of SQL in the 1970s.
In the 1980s and 1990s, a number of products promised a “natural-language-on-SQL” approach, but
failed to meet the market’s need. The query suggestion/completion interfaces implemented by the
services in this report are along the lines of popular search engines, and add a crucial missing piece
to those older approaches. It turns out that the query suggestion/completion interface is a significant
factor in helping users effectively go from thought to question to answer. Figure 9 shows how
ThoughtSpot extends the user’s query with suggestions.


Figure 9. Search suggestions are refined as you type your query

We have assumed so far that a column named “county” is in some input table. However, if the column
has some other name (say, “region”), the services in this report allow a user or administrator to
define synonyms, so that they can map an oddly named column (such as “cust_reg”) to words in
everyday language (such as “region”). Power BI, for instance, lets users do this through PowerPivot
in Excel. ThoughtSpot allows administrators to relate column names to synonyms. In this hypothetical
case, the administrator could indicate that when a user requests a “county,” the engine should map that
to the “region” column from the input. ThoughtSpot also uses synonym sets and other matching
algorithms to offer meaningful suggestions based on what the user meant, and lets the user pick the
correct choice to move forward with the query computation.
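A synonym table of the kind an administrator might maintain can be as simple as a dictionary from everyday words to physical column names. The column names and synonyms below are hypothetical and are built from the “cust_reg” example in the text.

```python
# Hypothetical physical column names in the loaded data.
COLUMNS = {"cust_reg", "revenue", "order_date"}

# Hypothetical administrator-defined synonyms: everyday word -> column.
SYNONYMS = {"region": "cust_reg", "county": "cust_reg", "sales": "revenue"}

def resolve_column(word):
    """Map a user's word to a column name, or None if nothing matches."""
    word = word.lower()
    if word in COLUMNS:
        return word
    return SYNONYMS.get(word)

resolved = resolve_column("County")
```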
All three tools in this report, working with the original data sources, let the user “slice and dice”
data through filtering and drill-down operations (by region, product segment, etc.). With ThoughtSpot,
users can also slice-and-dice directly in the search bar by adding or removing search terms.
In short, a responsive user interface should—in real time—compare user inputs to both column names
and values in the input databases. It should be able to make a savvy guess as to what column the user
wants and offer that as a higher-ranked suggestion, based both on exact matches and on considerations
such as which columns contain the most rows containing “California” or “2015” as a value. The user
can pick the suggestion that matches what she is looking for and disambiguate the request.
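The idea of guessing which column a typed value belongs to can be illustrated by counting how many rows hold that value in each column, then ranking columns by the count. The table and the `rank_columns` helper are hypothetical. A real engine would presumably read these counts from its index rather than scanning rows.

```python
from collections import Counter

def rank_columns(token, table):
    """Order columns by how many rows hold the token as a value."""
    counts = Counter()
    for row in table:
        for column, value in row.items():
            if str(value).lower() == token.lower():
                counts[column] += 1
    return counts.most_common()

table = [
    {"state": "California", "year": 2015, "part_no": 1042},
    {"state": "California", "year": 2015, "part_no": 2015},
    {"state": "New York", "year": 2015, "part_no": 7},
]
ranking = rank_columns("2015", table)
```

In this sample, “2015” appears three times in the year column and once as a part number, so the year interpretation is ranked first while the alternative remains available to the user.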

Validating Answers
The final piece of input interpretation is helping users verify the intermediate steps that the product
used to arrive at a result. This helps adoption because users can now trust results that they see by
verifying the data sources used to compute the answer.
When a search result is shown, alongside the result the user is also given an option to hover over each
search term and understand the lineage of the data (which source table and column it came from). For
example the user could see if she has chosen revenue data from an official data source such as the
data warehouse, or a spreadsheet shared with her by a coworker. Each source and object in the
system can also be “tagged” to show its associations (e.g., marketing, sales), and these could serve as
useful inputs to help the user understand what data sources she picked to arrive at the answer in front
of her. In Figure 10, a ThoughtSpot user has selected the “store region” part of the query for deeper
investigation.

Figure 10. User selects parts of a ThoughtSpot query to delve into

A button next to the search box lets the user translate the search string to an almost plain-English form
that explains how the different tables were joined, what filters were applied, and what final result
was computed. Figure 11 shows the internal information that ThoughtSpot displays about the “store
region” part of the query in Figure 10.


Figure 11. Delving into a ThoughtSpot query


This helps business users gain confidence that the product is indeed performing the computations the
way they expect it to. Users can share their query answers with BI analysts to reconcile any
differences. For example, the business user might have wanted results by order date, while the BI
analyst built her report on the ship date; the two can produce different numbers. By looking at the
output provided by ThoughtSpot, the user can see which date column was used (here, the order date) and
can switch to the ship date column if that is the answer she wants.

Creating the Simplicity of a Search-Like Query
To show how a search interface can form and execute a query while totally hiding the complexity of
the schema and SQL, we’ll track the ThoughtSpot Analytical Search engine through its underlying
processes when handling a user query.
Say we have two fact tables called Contacts and Sales Details, along with a dimension table called
the Phone table that connects the other two. Assume that for each category and product, we want to
find the following:
Number of unique phone numbers contacted
Number of contacts made
How many clicks were counted
How many sales were made
Total revenue
With ThoughtSpot, the user just needs to type these terms into the interface and all the complex joins
happen in the backend. The query would be: “count phone count contact count sale clicks revenue
category product”.
If you were to write the full SQL for something like this, it would look like:



This search brings together data from the following tables:


With that data, ThoughtSpot does all the complex joins and produces the result in Figure 12.

Figure 12. ThoughtSpot produced this result after hiding all the complex logic under the hood


The user can ask ThoughtSpot to explain how it put together and interpreted the data, and receive the
display in Figure 13.

Figure 13. Explanation of how ThoughtSpot performed query
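To give a feel for the SQL that a search such as “count phone count contact count sale clicks revenue category product” hides, here is a hedged sketch using Python's built-in `sqlite3` module. The schema, column names, and sample rows are assumptions invented for this illustration. They are not the tables or the SQL from the original figures. The sketch aggregates each fact table separately before combining them, which avoids double counting when two fact tables share a dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one dimension table and two fact tables.
conn.executescript("""
CREATE TABLE phone (phone_id INTEGER PRIMARY KEY, category TEXT, product TEXT);
CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, phone_id INTEGER, clicks INTEGER);
CREATE TABLE sales_details (sale_id INTEGER PRIMARY KEY, phone_id INTEGER, revenue REAL);
INSERT INTO phone VALUES (1, 'Audio', 'Speaker'), (2, 'Audio', 'Speaker'), (3, 'Video', 'Camera');
INSERT INTO contacts VALUES (1, 1, 3), (2, 1, 2), (3, 2, 1), (4, 3, 4);
INSERT INTO sales_details VALUES (1, 1, 100.0), (2, 2, 50.0), (3, 2, 25.0);
""")

QUERY = """
WITH c AS (  -- contact-side measures per category and product
    SELECT p.category, p.product,
           COUNT(DISTINCT p.phone_id) AS phones,
           COUNT(*) AS contacts,
           SUM(ct.clicks) AS clicks
    FROM contacts ct JOIN phone p ON p.phone_id = ct.phone_id
    GROUP BY p.category, p.product
),
s AS (       -- sales-side measures per category and product
    SELECT p.category, p.product,
           COUNT(*) AS sales,
           SUM(sd.revenue) AS revenue
    FROM sales_details sd JOIN phone p ON p.phone_id = sd.phone_id
    GROUP BY p.category, p.product
)
SELECT c.category, c.product, c.phones, c.contacts, c.clicks,
       COALESCE(s.sales, 0), COALESCE(s.revenue, 0.0)
FROM c LEFT JOIN s ON s.category = c.category AND s.product = c.product
ORDER BY c.category, c.product
"""
results = conn.execute(QUERY).fetchall()
```

Even for this small hypothetical schema, the query needs two grouped subqueries and an outer join, which suggests why a search interface that writes such SQL for the user is appealing.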

Creating Instant Visualizations
We all like aptly chosen charts and figures that show us trends, and business intelligence solutions
excel at creating these. A search-driven business intelligence engine should therefore choose
appropriate visualizations instantly, making a best guess at what relationships the user wants to see
and how they should be compared.
With all the tools discussed in this report, the types of data that a user has entered into the search bar
automatically determine the chart type that gets plotted by default. For example, if a user looks for
revenue over a particular time period, a line chart might be picked automatically, with time shown
along the X axis and revenue along the Y axis. If the user were to look at revenue by store location, a
bar chart might be chosen instead—with stacked bars if the user wanted to further subdivide results
by product category. When there are two measures, such as when charting GDP of countries and life
expectancies, a scatterplot is the first choice. Three continuously changing variables can be
visualized through a bubble chart. Any geographical information, such as ZIP codes or latitudes and
longitudes, will automatically get displayed on a map.


Figure 14. Sample visualization in ThoughtSpot
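The default chart choices described above amount to a small set of rules over the shape of the query. The sketch below encodes those rules as a hypothetical `choose_chart` function. The parameter names and the order in which rules are checked are assumptions, and real products are described as weighing further factors.

```python
def choose_chart(measures, attributes, has_time=False, has_geo=False):
    """Pick a default chart from the shape of the query (illustrative rules).

    measures:   number of numeric measures in the query (e.g. revenue)
    attributes: number of grouping attributes (e.g. store, category)
    """
    if has_geo:
        return "map"          # ZIP codes, latitudes and longitudes
    if measures >= 3:
        return "bubble"       # three continuously changing variables
    if measures == 2:
        return "scatter"      # e.g. GDP versus life expectancy
    if has_time:
        return "line"         # one measure over a time period
    if attributes >= 2:
        return "stacked bar"  # e.g. revenue by store, split by category
    return "bar"

default_chart = choose_chart(measures=1, attributes=1, has_time=True)
```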

As with the query, the visualization can be changed by the user in real time. The user can simply click
on a chart and pull up a menu of possible charts and display options. In addition to the types of data
included in a search, the algorithm also looks at factors such as cardinality of the different attributes
to determine the best visualization to represent the data.
Having found an appropriate query and visualization, the user can save it for future use by “pinning”
it to a dashboard—a feature similar to “pinning” photos on sites such as Pinterest. In ThoughtSpot,
charts, tables, and summary statistics can all be pinned to the “pinboard,” and organized in the best
sequence to support whatever story the user has in mind. Each time a user views a chart, ThoughtSpot
refreshes it with current data.

Sharing Answers and Visualizations
For a search experience to feel complete, it needs to provide you with the ability to express your
thoughts, create answers, and share those with others.
Today’s business intelligence world has a fragmented approach to the full workflow that starts when
a business user thinks of a question and continues through to when she shares her findings with others.
The user expresses her question in some documented form for a BI analyst, and then waits for a
dashboard to be built. Then she looks at the dashboard, slices-and-dices based on the limited options
she has, and if she finds what she was looking for, she saves the charts to PowerPoint for
presentation.
Once a user has created a visualization in ThoughtSpot she can share the answer or dashboard with
anyone in the enterprise that she’s allowed to share it with, using share options in the product. She
can also provide the URL for the shared answer to anyone that has group-level privileges to see that
answer.
ThoughtSpot also offers a full-screen, presentation mode for charts to be presented directly from the
product in meetings so the user can cut down time between search and discussions in a meeting. This
provides the added advantage of being able to edit search queries live in a meeting in case questions
come up regarding any particular charts or tables.

Bringing Search-Driven Analytics to the Masses

Intelligent search, instant visualization, and the ability to validate and verify answers provide the
foundations for a search-driven data analytics product. The goal is an easy entry point to doing
research with the organization’s data without requiring training.
Today, search has also become synonymous with speed at scale. A business user should be able to
access all of her enterprise data, even if it is billions of rows, with the same speed as looking up the
weather on Google. Therefore, to get mass adoption, search-driven analytics products need to be
architected from the ground up for scale (e.g., terabytes of data being accessed by all the users in a
company).
Given the recent industry trends, a significant shift in direction for business intelligence is in the
works. This shift is being led by search. Within the next decade, search will transform the world of
business data as it did the world of public data over the last two decades.


About the Author
Andy Oram is an editor at O’Reilly Media. An employee of the company since 1992, Andy currently
specializes in open source technologies and software engineering. His work for O’Reilly includes the
first books ever released by a US publisher on Linux, the 2001 title Peer-to-Peer, and the 2007
bestseller Beautiful Code.


