Contents
Overview
Components of a SharePoint Portal Server Search
Adding Content Sources
Managing Content Sources
Lab A: Adding External Content to a Workspace
Review

Module 6: Adding and Managing External Content




Information in this document is subject to change without notice. The names of companies,
products, people, characters, and/or data mentioned herein are fictitious and are in no way intended
to represent any real individual, company, product, or event, unless otherwise noted. Complying
with all applicable copyright laws is the responsibility of the user. No part of this document may
be reproduced or transmitted in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of Microsoft Corporation. If, however, your only
means of access is electronic, permission to print one copy is hereby granted.


Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual
property rights covering subject matter in this document. Except as expressly provided in any
written license agreement from Microsoft, the furnishing of this document does not give you any
license to these patents, trademarks, copyrights, or other intellectual property.

© 2001 Microsoft Corporation. All rights reserved.

Microsoft, Active Directory, ActiveX, FrontPage, JScript, MS-DOS, NetMeeting, Outlook, PowerPoint,
SharePoint, Windows, Windows NT, Visio, Visual Basic, Visual SourceSafe, Visual Studio, and Win32
are either registered trademarks or trademarks of Microsoft Corporation in the U.S.A. and/or other
countries.

Other product and company names mentioned herein may be the trademarks of their respective
owners.




Instructor Notes
This module provides students with the information necessary to add and
manage a Microsoft® SharePoint Portal Server content source.
After completing this module, students will be able to:

Describe the components that are used in the searching and indexing
features of SharePoint Portal Server.

Define content source and describe the types of content that are supported,
how a content source is used, and how to add a content source.

Manage a content source by setting schedules, scope, and rules, and
describe additional functions that apply to content sources.

Materials and Preparation
This section provides the materials and preparation tasks that you need to teach
this module.
Required Materials
To teach this module, you need the Microsoft PowerPoint® file 2095a_6.ppt.
Preparation Tasks
To prepare for this module, you should:

Read all of the materials for this module.

Complete the lab.

Instructor Setup for a Lab
This section provides setup instructions that are required to prepare the
instructor computer or classroom configuration for a lab.
Lab A: Adding External Content to a Workspace

To prepare for the lab
• Classroom configured according to the setup guide for course 2095a.


Presentation: 60 Minutes
Lab: 30 Minutes


Module Strategy
Use the following strategy to present this module:

Components of a SharePoint Portal Server Search
Describe the five components of a SharePoint Portal Server search, which
include the Gatherer, IFilters, word breakers and noise words, plug-ins, and
indexing databases. Describe the function of each of these components and
then briefly explain how each component works.

Adding Content Sources
Explain that SharePoint Portal Server provides access to content that is
stored outside the workspace and that this content is referred to as a content
source. Describe the basic features of content sources and then explain how
to add various content sources to a Content Sources folder.

Managing Content Sources
Explain that once a content source has been added, it must be managed to
ensure that it is used effectively during searches. Discuss how to manage a
content source by configuring crawl settings, search scopes, index updates,
rules, gatherer log files, and discussion settings, as well as other management
functions.


Customization Information
This section identifies the lab setup requirements for a module and the
configuration changes that occur on student computers during the labs. This
information is provided to assist you in replicating or customizing Training and
Certification courseware.

Important
The lab in this module is also dependent on the classroom
configuration that is specified in the Customization Information section in the
Classroom Setup Guide for Course 2095A, Implementing Microsoft®
SharePoint Portal Server 2001.

Lab Setup
The following list describes the setup requirements for the lab in this module.
Setup Requirement 1
The lab in this module requires no additional configuration. To prepare student
computers to meet this requirement, perform the following actions:

Configure the instructor computer according to the classroom setup guide
for course 2095a.

Configure the student computers according to the classroom setup guide
for course 2095a.

Lab Results
There are no configuration changes on student computers that affect replication
or customization.




Overview

Components of a SharePoint Portal Server Search

Adding Content Sources

Managing Content Sources

*****************************ILLEGAL FOR NON-TRAINER USE*****************************
Microsoft® SharePoint Portal Server 2001 stores content that is both internal
and external to the workspace. A content source is used to specify a set of
content that is stored outside the workspace. The Microsoft Search (MSSearch)
service is a full-text indexing and search engine that is used to crawl, retrieve,
create, and update indexes for this content. This module discusses this process
and examines the use of content sources for accessing content that is external to
the SharePoint Portal Server computer.
After completing this module, you will be able to:

Describe the components that are used in the searching and indexing
features of SharePoint Portal Server.

Define content source and describe the types of content that are supported,
how a content source is used, and how to add a content source.

Manage a content source by setting schedules, scope, and rules, and
describe additional functions that apply to content sources.

Topic Objective
To provide an overview of the module topics and objectives.
Lead-in
In this module, you will learn about adding and managing content with
SharePoint Portal Server.







Components of a SharePoint Portal Server Search

The Gatherer

IFilters

Word Breakers and Noise Words

Plug-Ins

Indexing Database

This topic provides an overview of the technology that is used in the searching
and indexing features of SharePoint Portal Server. These components are used
to create and manage content sources.
Topic Objective
To outline this topic.
Lead-in
In this topic, we will examine the components of MSSearch.


The Gatherer
[Figure: the Gatherer and the Filter Daemon Process, with the stages Accessing, Filtering, and Indexing]

Core Component of MSSearch

Manages How Content Is Accessed, Filtered, and Indexed

Includes Native and Registered Protocol Handlers


The Microsoft Gatherer performance object is the core component of
MSSearch. As SharePoint Portal Server processes transactions on your system,
it generates performance data that Windows 2000 can track and log. This data is
described as a performance object and is typically named for the component
generating the data. The Gatherer manages the way that content is accessed,
filtered, and indexed.
How the Gatherer Works
The Gatherer runs inside MSSearch and interacts with a separate filter daemon
process (mssdmn.exe) that performs data access and content filtering. The
following steps describe how the Gatherer works:
1. The filter daemon uses protocol handlers and IFilters to extract data. These
filters are data type–specific components that SharePoint Portal Server uses
to communicate with and filter the documents in the content source.
2. The Gatherer runs the data through a series of plug-ins to process and filter
the data. Plug-ins are used to interpret the data and properties as they are
pulled from the documents in a content source.
3. The data passes through the plug-ins before the index is created and the
document properties are saved to an index database (Microsoft Jet property
store).

Note
A Jet property store is separate from the Microsoft Web Storage
System used by SharePoint Portal Server.
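The three steps above can be pictured as a small pipeline. The following Python sketch is illustrative only and is not how MSSearch is implemented. The Document class, the filter_daemon and word_count_plugin functions, and the convention that the first line of a file is its title are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A crawled item: raw content plus what the pipeline extracts from it."""
    url: str
    raw: str
    text: str = ""
    properties: dict = field(default_factory=dict)


def filter_daemon(doc: Document) -> Document:
    """Step 1: stand-in for the protocol handler and IFilter work."""
    # Assumed convention: the first line is the title, the rest is body text.
    title, _, body = doc.raw.partition("\n")
    doc.properties["title"] = title.strip()
    doc.text = body.strip()
    return doc


def word_count_plugin(doc: Document) -> Document:
    """Step 2: an example plug-in that derives one extra property."""
    doc.properties["word_count"] = len(doc.text.split())
    return doc


def run_gatherer(docs, plugins):
    """Steps 1-3: filter, run the plug-ins, then save properties and index text."""
    property_store = {}  # stands in for the property store
    index = {}           # word -> set of URLs that contain it
    for doc in docs:
        doc = filter_daemon(doc)
        for plugin in plugins:
            doc = plugin(doc)
        property_store[doc.url] = doc.properties
        for word in doc.text.lower().split():
            index.setdefault(word, set()).add(doc.url)
    return property_store, index


store, index = run_gatherer(
    [Document("file://server/share/a.txt", "Status Report\nbudget approved today")],
    [word_count_plugin],
)
print(store)
print(sorted(index))
```

In this sketch the plug-ins run before anything is saved, which mirrors the order given in step 3.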


Topic Objective
To explain the function of the Gatherer.
Lead-in
In this topic we will examine the Gatherer, a core component of SharePoint
Portal Server MSSearch.


Using Protocol Handlers to Access Data Store Content
The Gatherer accesses documents in a data store by using the appropriate
protocol by way of a protocol handler interface. The protocol handler, which
has no relation to a network protocol, is an interface between the index and
SharePoint Portal Server. When the Gatherer processes a Uniform Resource
Locator (URL) during indexing, the filter daemon determines which protocol
handler to use based on the URL prefix, loads the associated dynamic-link
library (DLL), and passes the URL and security credentials to the protocol
handler.
Native Protocol Handlers
SharePoint Portal Server includes native protocol handlers, or handlers that
ship with the product, for Hypertext Transfer Protocol (HTTP), file, Microsoft
Exchange 5.5, Microsoft Exchange 2000 Server, and Lotus Notes.
Exchange 2000 and SharePoint Portal Server share the Web Storage System
technology and the same protocol handler. This protocol handler accesses a
local Web Storage System by using Microsoft OLE DB Provider for
Exchange 2000 Server (EXOLEDB) and uses Web Distributed Authoring and
Versioning (WebDAV) to access the Web Storage System on a remote
Exchange or SharePoint Portal Server computer.
Registered Protocol Handlers
The following table lists the registered protocol handlers that are included with
SharePoint Portal Server.
Prefix        DLL            ProgID

File          Mssph.dll      MSSearch.FileHandler.1
HTTP          Mssph.dll      MSSearch.HttpHandler.1
Exch          Mssexph.dll    MSSearch.MapiHandler.1
PKM Exstore   Pkmexsph.dll   PKM.ExstoreHandler.1
Notes         Notesph.dll    MSSearch.NotesHandler.1
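The prefix-based selection described earlier in this topic amounts to a table lookup. The following Python sketch uses the rows from the table above. The select_handler function is hypothetical, and the PKM Exstore row is left out because this module does not show how that prefix appears in a URL.

```python
from urllib.parse import urlsplit

# Rows taken from the registered protocol handler table; keys are lowercased
# URL prefixes. The HTTP DLL name is assumed to match the File row.
PROTOCOL_HANDLERS = {
    "file": ("Mssph.dll", "MSSearch.FileHandler.1"),
    "http": ("Mssph.dll", "MSSearch.HttpHandler.1"),
    "exch": ("Mssexph.dll", "MSSearch.MapiHandler.1"),
    "notes": ("Notesph.dll", "MSSearch.NotesHandler.1"),
}


def select_handler(url: str):
    """Return the (DLL, ProgID) pair registered for the prefix of the URL."""
    prefix = urlsplit(url).scheme.lower()
    try:
        return PROTOCOL_HANDLERS[prefix]
    except KeyError:
        raise LookupError(f"no protocol handler registered for prefix {prefix!r}")


print(select_handler("exch://backofficestorage/"))
print(select_handler("file://server/share/page.htm"))
```

A custom protocol handler would correspond to one more entry in the dictionary.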

Gatherer Project
A search application can have one or more Microsoft Gatherer Projects
performance objects. Gatherer Projects are located inside a search application,
such as SharePoint Portal Server. SharePoint Portal Server has one Gatherer
Project for each internal or external workspace. These workspaces have their
own settings, such as indexing schedules. The Search service uses Gatherer
Projects to keep each workspace separate so that it can have its own schedule.
A SharePoint Portal Server workspace is a Gatherer Project with its own index.
Each Gatherer Project contains its own set of build parameters, crawl
restrictions, and plug-ins. Each Gatherer Project contains its own run-time
transaction log containing all URLs to be crawled and maintains its own
statistics.


IFilters
Extract Content and Properties from Documents

Open Data Streams and Expose the Data as Indexable Chunks

SharePoint Portal Server Provides IFilters for:
Office (offfilt.dll)
HTML (nlhtml.dll)
Text (query.dll)
MIME (mimefilt.dll)
TIFF (mspfilt.dll)
Null Filter (tquery.dll)

IFilters are the components of MSSearch that extract a document’s content and
its properties.
How IFilters Work
During the filter daemon process, IFilters open data streams and expose the data
so that it can be indexed. In particular, the Hypertext Markup Language
(HTML) filter strips a document of all HTML tags and emits various HTML
syntactic elements as properties, such as author or title, and also emits the body
text. Each file type, indicated by its file extension, has an IFilter associated with
it.
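The behavior described for the HTML filter can be approximated in a few lines. The following Python sketch uses the standard html.parser module. It is not the nlhtml.dll implementation, and the HtmlFilterSketch class and filter_html function are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class HtmlFilterSketch(HTMLParser):
    """Strips tags, emits title and author as properties, and collects body text."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self.body_chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self.properties["author"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.properties["title"] = text
        else:
            self.body_chunks.append(text)


def filter_html(markup):
    """Return (properties, body text) for one HTML document."""
    parser = HtmlFilterSketch()
    parser.feed(markup)
    parser.close()
    return parser.properties, " ".join(parser.body_chunks)


props, body = filter_html(
    '<html><head><title>Q3 Plan</title><meta name="author" content="Kim"></head>'
    "<body><h1>Goals</h1><p>Ship the <b>portal</b> today</p></body></html>"
)
print(props)
print(body)
```

The properties and the body text come out as separate results, which is what allows them to be stored and indexed differently later in the pipeline.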

SharePoint Portal Server provides IFilters for HTML, Microsoft Office, text,
Multipurpose Internet Mail Extensions (MIME) and Tagged Image File Format
(TIFF).

Note
You should convert documents created using Office applications to the
Office 95 format or later. The Office IFilter does not expose the document
properties of older Office documents.

Topic Objective
To explain the function of IFilters.
Lead-in
In this topic we will examine how filters extract content and properties from
documents for indexing.


IFilter DLLs
The following table lists the IFilters that are included with SharePoint Portal
Server.
IFilter       DLL

Office        offfilt.dll
HTML          nlhtml.dll
Text          query.dll
MIME          mimefilt.dll
TIFF          mspfilt.dll
Null filter   tquery.dll



Word Breakers and Noise Words

Word Breakers

Break words apart

Remove punctuation and symbols

Follow language-specific rules

Follow special case rules

Noise Words

Words that do not add value to a query (“and”, “the”)

MSSearch filters out noise words

Word breakers and noise words are used to facilitate indexing.
Word Breakers
To correctly crawl a document to add it to an index, SharePoint Portal Server
must use word breakers. A word breaker determines where the word boundaries
are in the stream of characters in the query or in a document being crawled. The
word breaker that is used during indexing is determined by the language that is
identified and emitted by the IFilter.
Function of Word Breakers
Common functions of word breakers include:

Breaking words apart at white spaces and at line and paragraph separators.

Removing most punctuation and symbols.

Following language-specific rules to handle such things as URLs, e-mail
addresses, currency, hyphenation, and time/date. For example, an e-mail
address is broken at the @ sign and at the period.

Following special case rules. For example, SharePoint Portal Server word
breakers leave the string C++ intact, because if the ++ were deleted, the
resulting “C” would be discarded as a noise word.
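The functions listed above can be illustrated with a small word breaker. The following Python sketch is a simplification and does not reproduce the language-specific rules of the real word breakers. The break_words function and the SPECIAL_CASES set are hypothetical.

```python
import re

# Special-case strings that must survive punctuation removal (C++ is the
# example given in the text).
SPECIAL_CASES = {"c++"}


def break_words(text):
    """Split text into words using the simplified rules described above."""
    words = []
    # str.split() with no argument breaks at white space and line separators.
    for token in text.split():
        token = token.strip(".,;:!?\"'()")
        if token.lower() in SPECIAL_CASES:
            words.append(token)  # leave the string intact
            continue
        # Remove remaining punctuation and symbols; this also breaks an
        # e-mail address at the @ sign and at the period.
        words.extend(part for part in re.split(r"\W+", token) if part)
    return words


print(break_words("Send C++ questions to kim@example.com, please."))
```

Without the special case, the same rule that splits the e-mail address would reduce C++ to the single letter C.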

Topic Objective
To explain the function of word breakers and noise words.
Lead-in
In this topic, we will examine how word breakers and noise words are used to
facilitate indexing.


Using Word Breakers in Indexing
The content index uses the word breaker component in the following two
situations:

When an index is created or updated. The word breaker splits all text that is
referenced by the content index. The index is updated continuously as
documents are modified and closed.

At query time. A word breaker is used to break query strings into words and
phrases.


Note
For more information about word breaking at query time, see Module 7,
“Searching for Content,” in Course 2095A, Implementing Microsoft®
SharePoint Portal Server 2001.

Using SharePoint Portal Server and Operating System Word Breakers
The word breakers included in SharePoint Portal Server override existing
operating system word breakers. SharePoint Portal Server calls the operating
system word breaker if a special one for SharePoint Portal Server does not
exist. If neither Windows 2000 nor SharePoint Portal Server has a
language-specific word breaker, the neutral word breaker is used. The neutral
word breaker (query.dll) provided by the operating system breaks at white
spaces and several other breaking characters.
Noise Words
Both noise words and noise word lists are used by MSSearch.
Using Noise Words
Noise words are words that do not add value to a query, such as “and”, “the”,
and single letters. MSSearch filters out noise words to save index space and
increase performance.
Using Noise Word Lists
Noise word lists are customizable language-specific text files that are stored in
the %systemroot%\program files\SharePoint Portal Server\data\ftdata\
SharePoint Portal Server\config folder. There is one noise word list for each
language that is supported. For example, the noise word list for U.S. English is
noiseenu.txt. Each file contains a list of words, with one word per line. If you
change the noise word list, you must perform a full update of the index to
incorporate the changes.
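Noise word filtering can be sketched as follows. The one-word-per-line format comes from the description above. The load_noise_words and remove_noise functions are hypothetical, and an in-memory text stream with made-up contents stands in for a file such as noiseenu.txt.

```python
import io


def load_noise_words(lines):
    """Read a noise word list: one word per line, blank lines ignored."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_noise(words, noise_words):
    """Drop listed noise words and single letters before indexing."""
    return [
        word for word in words
        if len(word) > 1 and word.lower() not in noise_words
    ]


# Hypothetical list contents; the shipped list is not reproduced here.
noise = load_noise_words(io.StringIO("and\nthe\n\nof\n"))
print(remove_noise(["The", "budget", "and", "a", "plan"], noise))
```

Because words are dropped before they reach the index, changing the list affects only documents that are indexed afterward, which is why a full update is required.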


Plug-Ins
Filter Plug-ins

Plug-in Categories

Consumer plug-in


Active plug-in

Default Plug-ins

Auto-Categorization Module plug-in

PQS plug-in

Indexing plug-in

Gatherer plug-in

A plug-in is a component that resides in the Gatherer data pipeline and
processes the data that is emitted by the content filters. The Gatherer Project
uses plug-ins to process the text and properties of collected content.
Plug-in Categories
The Gatherer includes the following two categories of plug-ins:

Consumer plug-in. This plug-in uses only the text and properties that are
emitted and does not affect the pipeline.


Active plug-in. This plug-in can affect the pipeline by adding, modifying, or
deleting properties.
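The difference between the two categories can be shown in a short sketch. The following Python example is illustrative only. The ConsumerPlugin and ActivePlugin classes, the run_pipeline function, and the property names are assumptions and do not describe the actual plug-in interfaces.

```python
from types import MappingProxyType


class ConsumerPlugin:
    """Uses the emitted text and properties without affecting the pipeline."""

    def __init__(self):
        self.seen_titles = []

    def process(self, text, properties):
        self.seen_titles.append(properties.get("title"))


class ActivePlugin:
    """Can add, modify, or delete properties as data moves down the pipeline."""

    def process(self, text, properties):
        updated = dict(properties)
        updated["word_count"] = len(text.split())  # add a property
        updated.pop("scratch", None)               # delete a property
        return updated


def run_pipeline(text, properties, plugins):
    """Pass one document through every plug-in in order."""
    for plugin in plugins:
        if isinstance(plugin, ActivePlugin):
            properties = plugin.process(text, properties)
        else:
            # A read-only view guarantees that a consumer cannot change anything.
            plugin.process(text, MappingProxyType(properties))
    return properties


consumer = ConsumerPlugin()
result = run_pipeline(
    "budget approved today",
    {"title": "Report", "scratch": "x"},
    [consumer, ActivePlugin()],
)
print(result)
print(consumer.seen_titles)
```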

Default Plug-ins
The Gatherer contains four default plug-ins: the Auto-categorization Module
plug-in, the Persistent Query Server (PQS) plug-in, the Indexing plug-in, and
the Gatherer plug-in.
Auto-Categorization Module Plug-In
The Auto-categorization (AutoCat) Module plug-in is a consumer plug-in that
processes the data being streamed and uses statistical information to
automatically associate certain SharePoint Portal Server categories with
documents.
PQS Plug-In
The PQS plug-in is used for the SharePoint Portal Server Subscriptions feature.
The active PQS plug-in checks the data in the stream against subscription rules
and notifies the subscription engine to generate notifications if needed.
Topic Objective
To explain the function of plug-ins.
Lead-in
In this topic, we will describe how the Gatherer uses plug-ins.


Indexing Plug-In
The Indexing plug-in is essentially a wrapper that interacts with the full-text
engine. The indexing plug-in performs the following tasks:


Checks the schema to determine which properties to include in the index.
For properties that are retrievable in user search queries, it saves the data
in the property store. For properties and text that are marked for indexing, it
performs additional processing and stores the result in the full-text index.

Regulates the amount of data that is being passed to the full-text engine by
blocking the data pipeline when a threshold is reached.

Saves the data to the Jet property store. The data is saved in the property
store first and then the indexing engine saves the data. The property store is
located at %program files%\SharePoint Portal Server\data\ftdata\SharePoint
Portal Server.

Gatherer Plug-In
The Gatherer plug-in can be thought of as the crawl manager. It receives the
call to start a crawl, checks for crawl restrictions, and maintains the crawl queue
and history. It is present in every Gatherer project, regardless of the
configuration.


Indexing Database
The Indexing Database Provides a Consistent Structure for

Word lists

Shadow indexes

One or more master indexes

The indexing database is a collection of word lists, shadow indexes, and one or
more master indexes. Each data structure contains the same type of information
and is optimized for a different stage in the life cycle of the index.
Word List
Word lists are held in memory, so they can be created and accessed quickly.
The crawl is not held up for long while a word list is being written, and it
can move quickly from document to document.
Shadow Index
Because word lists exist only in memory and take up too much space to be used
for long-term storage, the MSSearch service automatically transfers data in
word lists to a shadow index. A shadow index is a disk-based structure that is
created when a specified number of word lists exists. Because data in a shadow
index is compressed, access time is slower than for a word list. Creating a
shadow index is also much slower than creating a word list. After a shadow
index is created, it cannot be modified. Further, if MSSearch determines that
there are too many shadow indexes, it merges them to create new shadow
indexes, building on existing shadow indexes and word lists.
Because shadow indexes cannot be modified, the number of shadow indexes in
the content index will grow over time as new word lists are converted to
shadow indexes.
Topic Objective
To explain the function of an indexing database and its collection of four
indexes.
Lead-in
In this topic, we will examine how SharePoint Portal Server provides a
consistent structure for the components of the indexing database.


Master Index
Because the access time for a shadow index is almost constant regardless of
size, content index performance will decrease as more shadow indexes are
created. Therefore, it is advantageous to merge shadow indexes into a master
index. In SharePoint Portal Server, this process is called a master merge. By
default, it happens every night at midnight, after a specific number of
documents have been indexed, or if disk space gets too low. You cannot
manually initiate the creation of a master index. The master index, which is the
final repository for all indexing information, is by far the largest index. The
optimal content index is a single master index with no word lists or shadow
indexes. In that state, the content of the word lists and shadow indexes
exists only in the master index.
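The life cycle of the three structures can be modeled in a few lines. The following Python sketch is a toy model, not the MSSearch storage format. The ContentIndexSketch class and its threshold of two word lists per shadow index are assumptions made for the example.

```python
class ContentIndexSketch:
    """Toy model of word lists, shadow indexes, and a master index."""

    def __init__(self, word_lists_per_shadow=2):
        # Hypothetical threshold; the real value is not given in this module.
        self.word_lists_per_shadow = word_lists_per_shadow
        self.word_lists = []      # in-memory structures, one per document
        self.shadow_indexes = []  # stand-ins for read-only on-disk structures
        self.master_index = {}    # word -> set of document names

    def add_document(self, name, text):
        word_list = {}
        for word in text.lower().split():
            word_list.setdefault(word, set()).add(name)
        self.word_lists.append(word_list)
        if len(self.word_lists) >= self.word_lists_per_shadow:
            self._flush_word_lists()

    def _flush_word_lists(self):
        merged = {}
        for word_list in self.word_lists:
            for word, names in word_list.items():
                merged.setdefault(word, set()).update(names)
        # frozenset models the rule that a shadow index cannot be modified.
        self.shadow_indexes.append({w: frozenset(n) for w, n in merged.items()})
        self.word_lists = []

    def master_merge(self):
        if self.word_lists:
            self._flush_word_lists()
        for shadow in self.shadow_indexes:
            for word, names in shadow.items():
                self.master_index.setdefault(word, set()).update(names)
        self.shadow_indexes = []

    def lookup(self, word):
        word = word.lower()
        hits = set(self.master_index.get(word, ()))
        for shadow in self.shadow_indexes:
            hits |= shadow.get(word, frozenset())
        for word_list in self.word_lists:
            hits |= word_list.get(word, set())
        return hits


index = ContentIndexSketch(word_lists_per_shadow=2)
index.add_document("doc1", "budget report")
index.add_document("doc2", "budget plan")
index.add_document("doc3", "travel plan")
counts_before = (len(index.word_lists), len(index.shadow_indexes))
before = index.lookup("plan")
index.master_merge()
after = index.lookup("plan")
print(counts_before, before == after)
```

A lookup must consult every structure that still exists, so the query has the least work to do after a master merge, when only the master index remains.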







Adding Content Sources

Adding a Content Source

Adding a Web Content Source

Adding an Exchange 5.5 Content Source

Adding an Exchange 2000 Content Source

Adding a Lotus Notes Content Source

In addition to storing content in standard and enhanced folders in the
workspace, SharePoint Portal Server provides access to content that is stored
outside the workspace, by means of content sources. SharePoint Portal Server
provides read access to, and searching within, content sources, but content
sources cannot be edited, checked in, or checked out. This section describes
some of the basic features of content sources and how to add them to your
Content Sources folder.
Topic Objective
To outline this topic.
Lead-in
In this section, you will learn about the basic procedure for adding a
content source.


Adding a Content Source
[Figure: Content Management, Content Sources, Users, Index]

A content source represents an external location, indicated by a URL, where the
content is stored and accessed for indexing. You create and store links to this
content in the Content Sources folder that is located in the Management folder.
Content can be located on the same server, a server on your intranet, or a server
on the Internet.
Defining a Content Source
A content source is defined by:

The type of data store that is accessed, such as a network file server, a Web
server, an Exchange server, or a Lotus Notes database.

The address, a URL containing the host name and a path, that is required to
locate the content.

Additional parameters that control how the index of the content is created.
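The three parts of this definition map naturally onto a small record type. The following Python sketch is one way to picture it. The ContentSource class, its field names, and the crawl_depth parameter name are hypothetical and are not part of SharePoint Portal Server.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class ContentSource:
    """A content source as defined above: type, address, and index parameters."""
    source_type: str   # for example "Web site" or "File share"
    address: str       # URL containing the host name and a path
    index_parameters: dict = field(default_factory=dict)

    @property
    def host(self):
        return urlsplit(self.address).netloc

    @property
    def path(self):
        return urlsplit(self.address).path


source = ContentSource(
    "Web site",
    "http://server/workspace/folder/",
    {"crawl_depth": 2},  # assumed parameter name, for illustration only
)
print(source.host, source.path)
```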

Topic Objective
To describe the function of a content source as well as how to add a
content source.
Lead-in
In this topic, you will learn how to prepare for adding a content source.



Types of Content Sources
When you add a content source to the Content Sources folder, you must provide
an address or URL for that content. The following table lists the types of
information that you can add to the workspace as a content source.
Source type                                   Sample address

Web site or Web page.

File share.                                   file://server/share/page.htm
                                              -OR- \\server\share\folder

Exchange 5.5 public folder. The               http://server/Public/Public Folders
SharePoint Portal Server computer must        -OR- exch://backofficestorage/
be configured to crawl this type of folder.

Exchange 2000 public folder.                  http://server/Public/Public Folders/folder

Lotus Notes database. Before you can          Provide the name of the database and the
create this content source, the Lotus Notes   address of the database server, such as:
client must be properly installed on the      //noteserver
SharePoint Portal Server computer, and
the computer must be properly configured
with the NotesSetup utility.

Other SharePoint Portal Server                http://server/workspace/folder/
workspaces.

Creating and Updating an Index of the Content
On a regular basis, SharePoint Portal Server creates and updates an index of the
content that is made available through content sources. After SharePoint Portal
Server includes a content source in the workspace index, users with appropriate
permissions can search for and view its content on the dashboard site. However,
users cannot check out and edit content sources or the documents that are
accessed through the content sources.
SharePoint Portal Server supports indexing of content that is stored on Web
sites, network file shares, Lotus Notes version 4.6a / R5 databases,
Exchange 5.5 servers, Exchange 2000 servers, and other SharePoint Portal
Server workspaces. You can also write custom protocol handlers that gather
content from additional stores.
File Formats
SharePoint Portal Server supports only certain document file formats.
File Formats Supported by SharePoint Portal Server
SharePoint Portal Server supports any of the following document file formats:
Microsoft Office Suite, TIFF, MIME, HTML, and Lotus Notes. Plug-ins are
available from the vendors’ Web sites for Adobe PDF files and Corel
WordPerfect files.
File Formats Not Supported by SharePoint Portal Server
The current version of SharePoint Portal Server does not support some
document file formats. For example, Microsoft Visio® and Microsoft Project
are not supported file types. This information is important to remember when
you crawl content or create an index.


Adding a Content Source to the Workspace
To add a content source, you use the Content Source Wizard in the Content
Sources folder under the Management folder. Before you can add a content
source to your workspace, you must have read access to the source, know where
the content source files are stored, and know how the files will be searched.
Important
Before you can add a content source to the workspace, the workspace
administrator must specify a default content access account.
If the administrator has not configured a default account for SharePoint Portal
Server to crawl, the wizard will prompt for one. This account will be used to
connect to the content source. SharePoint Portal Server also will allow you to
create indexes immediately, or you may choose to do so later.
To add a content source to your SharePoint Portal Server workspace:
1. Specify the location of the external content that you want to add to the
workspace.
You can add any one of five types of content sources using the Content Source Wizard.
You must choose content that is external to the current
workspace.

2. Open the Management folder, and then open the Content Sources folder.
3. Double-click Add Content Source.
4. The Add Content Source Wizard opens.
a. Define the content type by selecting the content source type that you
want to incorporate into the index.
b. Provide a path that directs SharePoint Portal Server to the linked content
by providing an address or URL for Web content or by providing the
database address and name for a Lotus Notes database.

The new content source is placed in the Content Sources folder. The
information available from the source is included in the workspace index and is
available for users to search for and view on the dashboard site.

Note
For information about content access accounts, see Module 9, “Managing
SharePoint Portal Server,” in Course 2095A, Implementing Microsoft®
SharePoint Portal Server 2001.


Adding a Web Content Source
To Add a Web Content Source:
Run the Add Content Source Wizard
Select Web Site, File Share, or SharePoint Portal Server as the content type
Enter a valid URL or UNC path to the content, and specify the desired crawl depth
Assign the content source a unique display name
On the Finish page, you can choose to start the full build immediately or you
can initiate it later

Adding a Web content source for a Web server, a network file share, or a remote
SharePoint Portal Server workspace requires only a URL or a Uniform Naming
Convention (UNC) file path.
To add a Web content source:
1. Run the Add Content Source Wizard.
2. Select Web Site, File Share, or SharePoint Portal Server as the content type.
3. Enter a valid URL or UNC path to the content, and specify the desired crawl
depth.
4. Assign a unique display name to the content source.
5. On the Finish page, you can choose to start the full build immediately, or
you can initiate it later.
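The effect of the crawl depth that you specify in step 3 can be illustrated with a breadth-first traversal. The following Python sketch works on an in-memory link graph instead of a live site. The crawl function, the sample addresses, and the assumption that depth is counted in link hops from the starting address are all illustrative and do not describe the exact SharePoint Portal Server setting.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Return the pages reached from start within max_depth link hops."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site structure: page address -> addresses it links to.
site = {
    "http://server/": ["http://server/a", "http://server/b"],
    "http://server/a": ["http://server/a/deep"],
    "http://server/b": ["http://server/"],
}
print(crawl("http://server/", site, max_depth=1))
```

Each additional hop can multiply the number of pages that are visited, so the depth setting has a direct effect on bandwidth use and index size.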

For network file shares, you can specify any standard shared folder on a
Windows file system. MSSearch is also able to crawl mounted network file
shares on other operating systems that support the server message block (SMB)
protocol. Examples include IBM OS/2, Novell NetWare, and UNIX running an SMB
service such as Samba.

Important
In Microsoft Site Server 3.0, users can map custom properties stored
in HTML META tags to Office properties using the text files schema.txt and
gathererprm.txt so that the metadata will be indexed. SharePoint Portal Server
version 1 does not support schema mapping using these files. Custom properties
in META tags will not be included in the index if they match properties in the
SharePoint Portal Server schema.

Topic Objective
To describe how to add a Web content source.
Lead-in
In this topic, we will explore how to add a Web content source.


Connecting to a Secure Site
When you are connecting to a secure site, you must specify an account that has
the appropriate type of access and authentication credentials. MSSearch runs as
a local system account and must impersonate an access account by using the
credentials that you provide. You must specify a default content access account
during Setup. You can change the account at any time by using the Accounts
tab on the Properties page of the server in SharePoint Portal Server
Administration. A coordinator can also specify an account other than the default
by creating a site path rule for the URL or UNC path.
Using HTTP Protocol and Authentication Methods
When the Gatherer connects to a SharePoint Portal Server or Web content
source, it uses the HTTP protocol and HTTP authentication methods. To
validate the content access account, it can use the Basic, Anonymous, or
Integrated Windows authentication method. By default, content sources use
the Integrated Windows authentication method. To configure a content
source to use the Basic authentication method, you must create a site path rule.
Because the Basic authentication method sends credentials over the network
unencrypted, an administrator must ensure that this does not pose a security
risk. To secure portal connections, you can enable Secure Sockets Layer (SSL)
on the workspace virtual directory in Microsoft Internet Information Services
(IIS).

When the Gatherer connects to a file content source, it uses the SMB protocol
and Integrated Windows authentication. When accessing file systems other than
Windows, such as UNIX or NetWare, you must use the Basic authentication
method.
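
To see why Basic authentication is described as unencrypted, it helps to look at what the HTTP Authorization header carries. The following Python sketch builds and then reverses a Basic header; it illustrates the general HTTP scheme only, not the Gatherer's code, and the function names are assumptions made for this example.

```python
import base64
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Authorization header for Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth_header(header: str) -> Tuple[str, str]:
    """Recover the user name and password from a Basic Authorization header."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authentication header")
    # Base64 is an encoding, not encryption: no key is needed to reverse it.
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth_header(header))
```

Anyone who can read the request on the network can run the second function, which is why enabling SSL on the connection matters when Basic authentication is in use.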

Important
When crawling content in a non-trusted domain, you must use the
Basic authentication method, which you can set by using a site path rule. You
also cannot set a default content access account that resides in a non-trusted
domain.

Warning
Be careful when you set the crawl settings. If you configure a site to
follow all links, make sure that you are aware of the depth and size of the site.
You might use excessive bandwidth and not have enough disk space to crawl
large sites.
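
The effect of crawl depth on the amount of content gathered can be illustrated with a breadth-first walk over a small link graph. This Python sketch is a simplified model, not the Gatherer's implementation; the link graph and the function name are hypothetical.

```python
from collections import deque


def pages_within_depth(links: dict, start: str, max_depth: int) -> set:
    """Return every page reachable from `start` in at most `max_depth` link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not follow links beyond the configured depth.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Hypothetical site: each key links to the pages in its list.
site = {
    "home": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": ["f"],
}

for depth in range(4):
    print(depth, len(pages_within_depth(site, "home", depth)))
```

Even in this tiny graph the page count grows with each additional level, which is why it is worth knowing the depth and size of a site before configuring it to follow all links.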


Adding an Exchange 5.5 Content Source
Required
The Outlook 2000 client must be installed
The Exchange server name
The Outlook Web Access server name
The Exchange site the server belongs to
The Exchange organization the server belongs to
An access account
To Add Exchange 5.5 Content Source:
Provide the path to the public folders
*****************************ILLEGAL FOR NON-TRAINER USE*****************************
Before you can add an Exchange 5.5 content source, you must enable this
feature in SharePoint Portal Server. Because MSSearch requires Messaging
Application Programming Interface (MAPI) files and Windows 2000 does not
contain these files, you must install Microsoft Outlook® 2000, including the
Collaboration Data Objects (CDO) component, on the SharePoint Portal Server
computer before you can crawl Exchange 5.5 content. You do not need to
configure a MAPI profile. If these conditions are not met, the content source
will not be created. The following information is also required:

The name of the Exchange server.

The name of the Microsoft Outlook Web Access server. If the name of the
server is not specified, Outlook Web Access is assumed to be installed on the
Exchange server that is being indexed. You do not need to use Outlook Web
Access, but if you do not, SharePoint Portal Server requires additional
configuration to crawl the public folders.

The Exchange site that the server belongs to.

The Exchange organization that the server belongs to.

An access account with Administrator privileges on the Organization
container.


Important
Enter the name of the site and the name of the organization exactly,
including the correct capitalization.

Tip
To crawl the public folders that reside on a different server, you must
replicate the folders to the crawled server. For information about replicating
public folders, see Module 10, “Examining an Enterprise-Level
Implementation,” in Course 2095A, Implementing Microsoft® SharePoint
Portal Server 2001.

Topic Objective
To describe how to add an Exchange 5.5 content source.
Lead-in
In this topic, we will explore how to add an Exchange 5.5 content source.


Using the Exchange Service Account
Although you can use the Exchange service account to crawl content, you can
use any account that has Administrator rights on the Organization container.
It is not necessary to grant permissions on the Site, Site Configuration, or
Server containers. Exchange Administrator privileges are required because:

Exchange 5.5 does not use Windows access control lists (ACLs) to secure
content, so MSSearch must communicate with the Exchange 5.5 directory
(dir.edb) at query time to filter out any results to which the user does not
have access.

Crawling Exchange 5.5 uses MAPI calls that require Administrator
privileges.

Providing a Public Folder Path
When you add a content source, you simply provide the path to the public
folder. The path format reflects the hierarchy of the public folders and starts
with exch://. Folder names are separated by a slash mark (/).
For example, to crawl a folder called Company News, use the start address
exch://ExchangeServer/Public Folders/All Public Folders/Company News,
where ExchangeServer is the name of the Exchange server that is configured
for Search and the name of the public folder tree is All Public Folders. To crawl
all public folders, the path must end with All Public Folders/ (note the
trailing slash mark).
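
The path format described above can be assembled programmatically. The following Python sketch is a minimal helper built only on the format shown in this topic; the function name and its parameters are hypothetical and are not part of any SharePoint Portal Server interface.

```python
def exchange_start_address(server, folders=(), tree="All Public Folders"):
    """Build an exch:// start address for a public folder, or for the whole tree.

    `folders` lists the folder names from the top of the tree down to the
    folder to crawl. An empty sequence yields the address for all public
    folders, which keeps the trailing slash mark.
    """
    for name in folders:
        if "/" in name:
            raise ValueError(f"folder name cannot contain a slash mark: {name!r}")
    base = f"exch://{server}/Public Folders/{tree}/"
    return base + "/".join(folders)


# A single folder called Company News.
print(exchange_start_address("ExchangeServer", ["Company News"]))

# All public folders: the address ends with the trailing slash mark.
print(exchange_start_address("ExchangeServer"))
```

Passing no folder names produces the all-folders form, so the trailing slash mark cannot be forgotten by accident.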
For Your Information
Setting up Site Server 3.0 Search to crawl Exchange 5.5 was very similar to
setting up SharePoint Portal Server to crawl an Exchange 5.5 content source.
However, Site Server required MSSearch to run in the context of the Exchange
Administrator account. With SharePoint Portal Server, the service runs as the
local system account and impersonates the Exchange account only when
crawling and performing security validations on search results.


Adding an Exchange 2000 Content Source
Index Exchange Public Folders
SharePoint Portal Server Indexes Any Items That Can be Read by the Access
Account Provided in Exchange 2000

SharePoint Portal Server can crawl the content of both Exchange 5.5 (no service
pack is required) and Exchange 2000 Server. For Exchange 5.5, SharePoint
Portal Server crawls only public folder items. For Exchange 2000, it crawls any
items that can be read by the access account provided. This means you can
search for content in private mailboxes in Exchange 2000, such as a shared
departmental mailbox.
Accessing Exchange Content
To access Exchange content that is returned in search results on the dashboard
site, click the Web link, which retrieves and displays the content by using
Outlook Web Access.
Indexing Office Attachments
On crawled messages, SharePoint Portal Server creates indexes of the following
attachments:

Office attachments. The metadata of an attachment is included in the index.

Custom properties of Office attachments. Unlike Site Server, the custom
properties of an Office attachment are included in the index if they match
SharePoint Portal Server properties, just as with documents inside a
SharePoint Portal Server Web folder.

Attachments that the Gatherer usually filters. For example, an .htm file is
included in the index. However, the search results for an attachment display
the subject and author of the message.


Note
For more information about installing and accessing Outlook Web
Access, see the Exchange Server documentation.

Topic Objective
To describe how to add an Exchange 2000 content source.
Lead-in
In this topic, we will explore how to add an Exchange 2000 content source.