
MarkLogic Cookbook

Documents, Triples, and Values:
Powering Search

Dave Cassel

Beijing · Boston · Farnham · Sebastopol · Tokyo

MarkLogic Cookbook
by Dave Cassel
Copyright © 2017 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938.

Editor: Shannon Cutt
Production Editor: Kristen Brown
Copyeditor: Sonia Saruba
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

August 2017: First Edition

Revision History for the First Edition
2017-06-09: Part 1
2017-08-16: Part 2
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. MarkLogic Cookbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-99458-0
[LSI]


Table of Contents

Introduction

1. Document Searches
   Search by Root Element
   Find Documents That Are Missing an Element

2. Scoring Search Results
   Sort Results to Promote Recent Documents
   Weigh Matches Based on Document Parts

3. Understanding Your Data and How It Gets Used
   Logging Search Requests
   Count Documents in Directories

4. Searching with the Optic API
   Paging Over Results
   Group By
   Extract Content from Retrieved Documents
   Select Documents Based on Criteria in Joined Documents



Introduction

MarkLogic is a database capable of storing many types of data, but it
also includes a search engine built into the core, complete with an
integrated suite of indexes working across multiple data models.
This combination allows for a simpler architecture (one software
system to deploy, configure, and maintain rather than two), simpler
application-level code (application code goes to one resource for
query and search, rather than two), and better security (because the
search engine has the same security configuration as the database
and is updated transactionally whenever data changes).
The recipes in this book, the second of a three-part series, provide
guidance on how to solve common search-related problems. Some
of the recipes work with older versions of MarkLogic, while others
take advantage of newer features in MarkLogic 9.
MarkLogic supports both XQuery and JavaScript as internal languages. Most of the recipes in this book are written in JavaScript, but have corresponding XQuery versions at marklogic.com/recipes. JavaScript is very well suited for JSON content, while XQuery is great for XML; both are natively managed inside of MarkLogic.
Recipes are a useful way to distill simple solutions to common problems—copy and paste these into MarkLogic’s Query Console or your source code, and you’ve solved the problem. In choosing recipes for this book, I looked for a couple of factors. First, I wanted problems that occur with some frequency. Some problems in this book are more common than others, but all occur often enough in real-world situations that one of my colleagues wrote down a solution. Second, I selected four recipes that illustrate how to use the new Optic API, to help developers get used to that feature. Finally, some recipes require explanations that provide insight into how to approach programming with MarkLogic.
Developers will get the most value from these recipes and the accompanying discussions after they’ve worked with MarkLogic for at least a few months and built an application or two. If you’re just getting started, I suggest spending some time on MarkLogic University classes first, then come back to this material.
If you would like to suggest or request a recipe, please write to the author.

Acknowledgments
Many people contributed to this book. Erik Hennum provided code
for Optic recipes and helped me understand what I needed to know
in order to write up the discussions. Tom Ternquist provided the
original version of the “Search By Root Element” recipe. Jason
Hunter suggested the “Weigh Matches” recipe and provided ideas
for “Logging Search Requests.” Puneet Rawal proposed “Count
Documents in Directories.” Bob Starbird, Gabo Manuel, and Mae
Isabelle Turiana reviewed the content. Diane Burley gave feedback
on the content and made sure I actually got this done. Thank you to
all!



CHAPTER 1

Document Searches

Finding documents is a core feature for searching in MarkLogic. Searches often begin with looking for simple words or phrases. Facets in the user interface, in the form of lists, graphs, or maps, allow users to drill into results. But MarkLogic’s Universal Index also captures the structure of documents.
The recipes in this chapter take advantage of the Universal Index to find documents with a specific root element and to look for documents that are missing some type of structure.

Search by Root Element
Problem
You want to look for documents that have a particular root XML
element or JSON property and combine that with other search
criteria.

Solution
Applies to MarkLogic versions 7 and higher
(: Return a query that finds documents with
 : the specified root element :)
declare function local:query-root($qname as xs:QName)
{
  let $ns := fn:namespace-uri-from-QName($qname)
  let $prefix := if ($ns eq "") then "" else "pre:"
  return
    xdmp:with-namespaces(
      (: "qry" is the namespace used in xdmp:plan output :)
      map:new(
        map:entry("qry", "http://marklogic.com/cts/query")
      ),
      cts:term-query(
        xdmp:value(
          "xdmp:plan(/" || $prefix ||
            fn:local-name-from-QName($qname) || ")",
          map:entry("pre", $ns)
        )/qry:final-plan//qry:term-query/qry:key
      )
    )
};

You can then call it like this:
(: The "ml" namespace URI below is a placeholder; use the
   namespace of your documents' root element. :)
declare namespace ml = "http://example.com/ml";
cts:search(
  fn:doc(),
  cts:and-query((
    local:query-root(xs:QName("ml:base")),
    cts:collection-query("published")
  ))
)

Discussion
It’s easy to find all the documents that have a particular root element or property: use XPath (/ml:base). However, that limits the other search criteria you can use. For instance, you can’t combine a cts:collection-query with XPath. What we need is a way to express /ml:base as a cts:query.
The local:query-root function in the solution returns a cts:term-query that finds the target element as a root. We’re using a bit of trickery to get there (including the fact that cts:term-query is an undocumented function). Let’s dig in a bit deeper to see what’s happening.
We can use xdmp:plan to ask MarkLogic how it will evaluate an XPath expression like this:
declare namespace ml = "http://example.com/ml"; (: placeholder namespace :)
xdmp:plan(/ml:base)

The result looks like this (note that if you run this, the identifiers will be different):



<qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
  <qry:expr-trace>xdmp:eval(
    "declare namespace ml =
     "http://example.com/ml"; xdm...", (),
    <options xmlns="xdmp:eval">
      <database>17588436587394393575</database>...
    </options>)
  </qry:expr-trace>
  <qry:info-trace>
    Analyzing path: fn:collection()/ml:base
  </qry:info-trace>
  <qry:info-trace>
    Step 1 is searchable: fn:collection()
  </qry:info-trace>
  <qry:info-trace>Step 2 is searchable: ml:base</qry:info-trace>
  <qry:info-trace>Path is fully searchable.</qry:info-trace>
  <qry:info-trace>Gathering constraints.</qry:info-trace>
  <qry:info-trace>Executing search.</qry:info-trace>
  <qry:final-plan>
    <qry:and-query>
      <qry:term-query weight="0">
        <qry:key>682925892541848129</qry:key>
        <qry:annotation>
          doc-root(element(ml:base),doc-kind(document))
        </qry:annotation>
      </qry:term-query>
    </qry:and-query>
  </qry:final-plan>
  <qry:info-trace>Selected 0 fragments</qry:info-trace>
  <qry:result estimate="0"/>
</qry:query-plan>

Looking at the term-query in the <final-plan> element, we get some visibility into the Universal Index—the index that stores terms and structure for every XML, JSON, and text document that we store in MarkLogic. This index records things like the words, XML elements or JSON properties, parent/child relationships among elements and properties, and words that occur within specific elements or properties. Exactly what is recorded depends on the settings you have configured in your database. In each case, the word or structure is mapped to a key.
Take another look at the <final-plan> element—this is the query that MarkLogic will run. We can see that it’s using a term query, and the annotation tells us what it means. A bit of XPath pulls out that index key, which we then use to build a cts:query that we can combine with other queries.



declare namespace qry = "http://marklogic.com/cts/query";
declare namespace ml = "http://example.com/ml"; (: placeholder :)
xdmp:plan(/ml:base)/qry:final-plan//qry:term-query/qry:key

So why are we using xdmp:value? We can run xdmp:plan with an explicit XPath expression, but if we want to work with a dynamic path (provided at runtime), then we can’t build a string and pass it to xdmp:plan. However, we can build a string that includes the reference to xdmp:plan and then pass the whole thing to xdmp:value, which will evaluate it. xdmp:value also accepts bindings, which allow us to use namespaces in the string we pass into xdmp:plan.
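As a minimal sketch of that pattern (the namespace URI and element name here are placeholders, not values from the recipe), you can evaluate a dynamically built plan like this:
declare namespace qry = "http://marklogic.com/cts/query";

let $ns := "http://example.com/ml"   (: placeholder namespace :)
let $local := "base"                 (: placeholder local name :)
return
  xdmp:value(
    (: build the XPath as a string, binding "pre" to the namespace :)
    "xdmp:plan(/pre:" || $local || ")",
    map:entry("pre", $ns)
  )/qry:final-plan//qry:term-query/qry:key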
I used xdmp:with-namespaces so that the function can be self-contained. Without that, the code would require the qry namespace declaration at the top of the module where the local:query-root function lives.
One more interesting bit: notice $prefix as part of the string passed to xdmp:value. With a QName, there might be a prefix (if constructed with xs:QName) or there might not be (if constructed with fn:QName or if the QName doesn’t use a namespace). To handle all these cases, the recipe assigns whatever namespace is present to the prefix “pre.” However, if the namespace URI is the empty string, then we skip the prefix in the XPath that we send to xdmp:plan.
That last complexity is there because the parameter to the function takes an xs:QName. The function could be written to take a string (like /ml:base), or a namespace and a local name. Requiring an xs:QName lets the caller build the QName using any of the available methods (xs:QName, fn:QName; note that this approach doesn’t create any prefix), but also limits what goes into xdmp:value. Keeping tight control over this data typing is important to prevent code injection.

See Also
• Documentation: “Understanding Namespaces in XQuery”
(XQuery and XSLT Reference Guide)



Find Documents That Are Missing an Element
Problem
You want to find all XML documents that are missing a particular element. This can be used to find documents that have not yet gone through some transformation.

Solution
Applies to MarkLogic versions 7 and higher
cts.search(
  cts.notQuery(
    cts.elementQuery(
      xs.QName("target"), cts.trueQuery()
    )
  )
)

For MarkLogic 7 and earlier, replace the cts:true-query() with
cts:and-query(()).

Discussion
MarkLogic’s built-in search engine uses query criteria to identify matching fragments. The indexes map terms (words, phrases, structures, etc.) to fragment identifiers. To run a search, the specified terms are looked up in the appropriate index to find fragment identifiers. In the case of cts:not-query, the search will return any fragment identifiers except those matched by the nested query.
cts:element-query is a useful way to constrain a search to part of a document. The function restricts the nested query to matching within the specified XML element. Without the cts:not-query, this same approach can be used to find documents that do have a particular element, or to find terms that occur within a specific element.
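For example, here are the positive forms in XQuery, assuming the same target element as in the solution (the word “marklogic” is just an illustrative search term):
(: Documents that DO have a <target> element :)
cts:search(
  fn:doc(),
  cts:element-query(xs:QName("target"), cts:true-query())
)

(: Documents where "marklogic" appears inside a <target> element :)
cts:search(
  fn:doc(),
  cts:element-query(xs:QName("target"), cts:word-query("marklogic"))
)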

The query passed to cts:element-query is cts:true-query for MarkLogic 8 and later, and cts:and-query(()) for MarkLogic 7 and earlier. cts:true-query does what it sounds like—it matches everything. Passed to cts:element-query, this provides a simple way to test for the existence of an element. If you’re using a version of MarkLogic that predates cts:true-query, the way to simulate it is to use cts:and-query and pass in the empty sequence to it. An and-query matches if all queries passed into it are true; if none are passed in, then it matches, thus making cts:and-query(()) work the same as cts:true-query.

See Also
• XQuery version: “Find documents that do NOT have an
element”
• Recipe: “Find documents that do NOT have a JSON Property”



CHAPTER 2


Scoring Search Results

MarkLogic is a database that contains a powerful search engine.
There are advantages to this, such as the fact that data does not need
to be replicated to a search engine to provide that functionality,
search results are up to date as soon as a transaction completes, and
the search is subject to the same security as the database content.
While running a search, MarkLogic assigns a score that accounts for the frequency of your target terms within the database, the frequency of the terms within each document, and the length of the document. For a detailed explanation of how scores are calculated, see “Understanding How Scores and Relevance are Calculated” in the Search Developer’s Guide.
The recipes in this chapter show some tricks to affect the way search
results are scored.

Sort Results to Promote Recent Documents
Problem
Show more recent documents higher in a result set than older documents. For instance, when searching blog posts, more recent content is more likely to be current and relevant than older content.


Solution
Applies to MarkLogic versions 8 and higher
With server-side code:
var jsearch = require('/MarkLogic/jsearch.sjs');

jsearch.documents()
  .where([
    cts.elementRangeQuery(
      fn.QName("", "pubdate"), "<=", fn.currentDateTime(),
      "score-function=reciprocal")
  ])
  .result()

With the MarkLogic REST API:
{
  "search": {
    "qtext": "recent LE " + fn.currentDateTime(),
    "options": {
      "constraint": [
        {
          "name": "recent",
          "range": {
            "facet": false,
            "type": "xs:dateTime",
            "element": {
              "name": "pubdate"
            },
            "range-option": [ "score-function=reciprocal" ]
          }
        }
      ]
    }
  }
}

Required Index

• A dateTime index on the target element or property

Discussion
Part of searching is determining the order in which to present the
results. This ordering is based on the relevancy score—how well
does each document match the query? By default, range constraints
don’t affect the score, but we can override that. This is useful in

8

|

Chapter 2: Scoring Search Results


preferring recent content, or in finding documents with a geospatial
component near a particular point.
In the example above, our content documents have an element
called pubdate. If we set up a dateTime index on this element, then
we can do range queries. We might use those to limit our results to
just content within the last year, but in this case, the goal is just to
affect the scoring. As such, the JSearch example performs a <= com‐
parison with the current date and time—we’d expect this to match
all documents (note that documents without a score will fail to
match and will drop out of the result set). The current date and time
provides an anchor for the comparison; the distance between a doc‐
ument’s pubdate value and the anchor value is fed into the reciprocal
score function. This means that the more recent documents will get
a boost in score. You may want to adjust the weight parameter to the
element range query to tune how much impact recency has.
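For instance, here is an XQuery sketch of the same idea with an explicit weight on the range query; the word query text ("databases") and the weight of 10.0 are illustrative, not values from the recipe:
(: Combine a text search with a recency boost on pubdate :)
cts:search(
  fn:doc(),
  cts:and-query((
    cts:word-query("databases"),
    cts:element-range-query(
      xs:QName("pubdate"), "<=", fn:current-dateTime(),
      "score-function=reciprocal", 10.0)
  ))
)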

To use this approach with the REST API, create a range constraint
and specify the score-function=reciprocal range option. You’ll
need to provide an anchor point with the constraint, for instance
recent:2017-05-22T15:36:00. The anchor time will be added by
your middle tier, combining it with the user inputs.
The score-function option can be reversed by specifying the linear function. This rewards values that are further away from the anchor value. In the case of pubdate, score-function=linear would favor older documents. This could be useful for a content manager looking for content that needs to be updated.
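Only the range option changes; a sketch of the reversed query, using the same pubdate element:
(: Favor older documents instead of newer ones :)
cts:element-range-query(
  xs:QName("pubdate"), "<=", fn:current-dateTime(),
  "score-function=linear")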

See Also
• XQuery version: “Sort results to promote recent documents”
• Documentation: “Range Query Scoring Examples” (Search
Developer’s Guide)

Weigh Matches Based on Document Parts
Problem
When doing a text search, some matches are more valuable than others. For instance, if you’re searching for a book, a match in an ISBN field is a sure thing, a match on the title or author is very useful, a match in the abstract is good, and a match in the rest of the text is a normal hit.

Solution
Applies to MarkLogic versions 7 and higher
Part of the challenge of rewarding matches from different parts of a
document is determining how to weight them. To do that, start with
an easily adjustable query, like this one:
let $text := ("databases")
return
cts:search(
fn:doc(),
cts:or-query((
cts:element-query(xs:QName("isbn"),
cts:word-query($text, (), 64)),
cts:element-query(xs:QName("author"),
cts:word-query($text, (), 16)),
cts:element-query(xs:QName("title"),
cts:word-query($text, (), 16)),
cts:element-query(xs:QName("summary"),
cts:word-query($text, (), 4)),
cts:element-query(xs:QName("content"),
cts:word-query($text, (), 1))
))
)

Once you have settled on the weights, set up a field. Here is the
Management API configuration for a field that matches the query
above; PUT this to /manage/v2/databases/{id|name}/properties:
<database-properties xmlns="http://marklogic.com/manage">
  <fields>
    <field>
      <field-name>book</field-name>
      <field-path>
        <path>isbn</path>
        <weight>64</weight>
      </field-path>
      <field-path>
        <path>author</path>
        <weight>16</weight>
      </field-path>
      <field-path>
        <path>title</path>
        <weight>16</weight>
      </field-path>
      <field-path>
        <path>summary</path>
        <weight>4</weight>
      </field-path>
      <field-path>
        <path>content</path>
        <weight>1.0</weight>
      </field-path>
      <word-lexicons/>
      <included-elements/>
      <excluded-elements/>
      <tokenizer-overrides/>
    </field>
  </fields>
</database-properties>

Here’s an updated query to use the field:
let $text := ("databases")
return
cts:search(
fn:doc(),
cts:field-word-query("book", $text)
)

Required Indexes
• word-positions
• element-word-positions

Discussion
Suppose you’re using keywords to search for a book and you get three matches. If I tell you one matched in the title, one matched in the summary, and one matched in the text, which book do you want to see first? Probably the one with the title match, since the title will likely have key terms in it. The summary is a bit bigger, but describes the general purpose of the content. The rest of the content may have lots of terms that are much more broadly related. This is the intuition that drives awarding higher scores to matches in different parts of a document.
MarkLogic’s cts: queries take a weight parameter. The default value is 1.0, but you can set it in a range from -16 to 64. The higher the value, the more points a match earns. Since we’re using cts:element-query, we need to turn on the word-positions and element-word-positions indexes.

The biggest challenge with this scoring is figuring out how to weight the various parts of the document. How much more relevant is a match in the title than a match in the summary? The answer will be application-specific and requires experimentation. Setting up an or-query makes it easy to run a set of experiments.
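For example, while tuning the weights you can inspect the scores directly; this sketch uses two of the weighted element queries from the solution above:
for $result in cts:search(
  fn:doc(),
  cts:or-query((
    cts:element-query(xs:QName("title"),
      cts:word-query("databases", (), 16)),
    cts:element-query(xs:QName("content"),
      cts:word-query("databases", (), 1))
  ))
)[1 to 10]
return <hit uri="{xdmp:node-uri($result)}" score="{cts:score($result)}"/>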
Once you have settled on the weights, you can simplify the query
code by creating a field. The field will include specification of the
paths and their relative weights.

See Also
• Documentation: “Understanding How Scores and Relevance are
Calculated” (Search Developer’s Guide)
• Documentation: “Adding a Weight to Boost or Lower the Rele‐
vance of an Included Element or Property” (Administrator’s
Guide)



CHAPTER 3

Understanding Your Data and
How It Gets Used

MarkLogic provides a platform for storing large amounts of heterogeneous data. Understanding what a database holds and how your users interact with it is key to improving the content over time. The first recipe in this chapter shows how to log the searches that your users are running. Based on this information, you can discover gaps in your content or see what provides the best draw to your application. The second recipe analyzes how content is divided among directories, which are likely used to contain logical or physical segments of your data.

Logging Search Requests
Problem
Record searches run by users, in order to build a recommendation
system, understand user needs, or determine what type of content to
add. The goal is to record more information than the access logs
would provide, and perhaps to associate it with user profiles.

Solution
Applies to MarkLogic versions 7 and higher
There are a variety of ways to implement your search feature. If you are using XQuery or JavaScript main modules to provide this capability, as opposed to working with the REST API, then you can combine logging the search parameters with executing the search request.
import module
  namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $query := xdmp:get-request-field("query", "")
let $parsed := search:parse($query)
let $user-info :=
  <user>
    <!-- username, timestamp, -->
  </user>
let $results := search:resolve($parsed)
let $log :=
  xdmp:invoke(
    "/service/log-search.xqy",
    map:new((
      map:entry("query", $parsed),
      map:entry("total", $results/@total),
      map:entry("user", $user-info)
    )),
    <options xmlns="xdmp:eval">
      <isolation>different-transaction</isolation>
    </options>
  )
return $results

Discussion
The code above gets the query string with xdmp:get-request-field. The Search API parses the query, which will be used both for logging and for running the actual search. The $query is passed to search:parse, which interprets the query and generates a serialized (XML) version of it. The serialized version can then be passed into the logging process, as well as sent to search:resolve for execution.
The details of what gets logged and where it is stored (the implementation of /service/log-search.xqy) are beyond the scope of this recipe—your requirements will determine what user information you need to capture and how you want to store it. You probably have user profile documents that you could add to. Alternatively, you might record the information as managed triples, eliminating the worry of individual profile documents getting too large.
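That said, here is a minimal sketch of one possible /service/log-search.xqy; the external variables match the map keys passed by xdmp:invoke above, while the log document URI, element names, and collection name are all placeholders:
xquery version "1.0-ml";

(: Sketch only: adapt the document shape and URI scheme to your needs. :)
declare variable $query external;
declare variable $total external;
declare variable $user external;

xdmp:document-insert(
  "/search-log/" || xs:string(xdmp:random()) || ".xml",
  <search-log>
    <logged>{fn:current-dateTime()}</logged>
    {$user}
    <total>{xs:integer($total)}</total>
    {$query}
  </search-log>,
  xdmp:default-permissions(),
  "search-log"
)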



The key point to notice is that logging is done in a separate transaction with xdmp:invoke. This is important to minimize the locking required. A search can run as a query, while logging the search has to run as an update. This transaction type has an impact on what type of locks are required.
MarkLogic handles transactions using Multi-Version Concurrency Control (MVCC). Each version of a document has a creation timestamp and a deletion timestamp. To change a document, a new version is created; when it’s time to commit, the deletion timestamp of the earlier version and the creation timestamp of the new version are both set to the current system timestamp. An update statement is assigned a timestamp when it commits. Conversely, a query statement runs at the latest committed timestamp available at the time the statement begins work.
With that in mind, we can take a closer look at the locks required to do a search and log it. The search is a query statement, running at a particular timestamp. Because of MVCC, we know that nothing will change in the database at that timestamp, therefore no locks are needed. This is especially useful for a search, which could potentially touch a large number of documents.
The logging portion needs to make a transactional update to the database, which requires write locks for any documents that will be updated, as well as read locks. By running the update in a separate transaction and passing in the required data, the documents touched by the search don’t need to be locked, only the documents that will record the information about the search.
Having the logging done in the same request (though a different transaction) from the search means that there is more work for the database to do than if the logging is done separately from the search request. In practice, this has very little impact (thanks to eliminating the need for most locks); however, it’s worth considering why this is better than other strategies.
The simplest approach would be to skip the separate transaction and do all the work in one. Hopefully the locking discussion above shows why that’s not ideal.
You might decide to not wait for the logging to finish before moving on to the next request, spawning the log process instead of invoking it. This puts the logging request on the Task Server, which seems like a win—the logging will be done asynchronously, without making the search results wait. However, there are a couple of risks. On a busy system, the Task Server queue could potentially fill up, losing requests. Also, the queue is not persisted, so if the server goes down, the logging information will be lost.

See Also
• Documentation: “Search API: Understanding and Using”
(Search Developer’s Guide)
• Documentation: “Understanding Transactions in MarkLogic
Server” (Application Developer’s Guide)

Count Documents in Directories
Problem
Get a count of how many documents are in each directory so that
you or your users can understand your data set better.

Solution
Applies to MarkLogic versions 7 and higher
declare function local:map-uris($uris as xs:string*)
{
  let $map := map:map()
  let $_ :=
    for $uri in $uris
    let $toks := fn:tokenize($uri, "/")
    for $t at $i in
        fn:subsequence($toks, 1, fn:count($toks) - 1)
    let $key := fn:string-join($toks[1 to $i], "/") || "/"
    let $count := (map:get($map, $key), 0)[1]
    return map:put($map, $key, ($count + 1))
  return $map
};

local:map-uris(cts:uris())

Required Index
• URI lexicon



Discussion
MarkLogic allows you to segment content by collections and by
directories. If you’re using directories, it can be helpful to know how
many documents are in a directory. This information might be used
just by you, as a content manager, or presented to your end users.
The local:map-uris function is given a sequence of URIs and returns a map. The keys of the map are the directories, starting with the root (“/”). The results are deep counts, showing the number of documents somewhere under a directory. A URI of “/a/b/c/1.xml” will contribute one count to “/”, “/a/”, “/a/b/”, and “/a/b/c/”. The count for “/” will match the total number of URIs passed in, assuming that all URIs begin with “/”. Thus, for any particular directory, the count will be the same number as found by cts:directory-query($dir, "infinity").
Rather than running on all URIs, you can pass a query to cts:uris, which will run unfiltered. This could be useful in building a multi-tier facet to give users information about available content.
If you run this on a large database, there’s a good chance that it will
time out. In that case, you might want to make separate calls for
each of your top-level directories using cts:uri-match().

See Also
• Documentation: “Collections Versus Directories” (Search
Developer’s Guide)
• Documentation: “Understanding Unfiltered Searches” (Query
Performance and Tuning Guide)


