Contents
Overview
Cluster Maintenance
Troubleshooting Cluster Service
Lab A: Cluster Maintenance
Review

Module 7: Server Cluster Maintenance and Troubleshooting


Information in this document is subject to change without notice. The names of companies,
products, people, characters, and/or data mentioned herein are fictitious and are in no way intended
to represent any real individual, company, product, or event, unless otherwise noted. Complying
with all applicable copyright laws is the responsibility of the user. No part of this document may
be reproduced or transmitted in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of Microsoft Corporation. If, however, your only
means of access is electronic, permission to print one copy is hereby granted.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual
property rights covering subject matter in this document. Except as expressly provided in any
written license agreement from Microsoft, the furnishing of this document does not give you any
license to these patents, trademarks, copyrights, or other intellectual property.

© 2000 Microsoft Corporation. All rights reserved.


Microsoft, Active Directory, BackOffice, JScript, PowerPoint, Visual Basic, Visual Studio, Win32,
Windows, Windows NT are either registered trademarks or trademarks of Microsoft Corporation
in the U.S.A. and/or other countries.

Other product and company names mentioned herein may be the trademarks of their respective
owners.

Program Manager: Don Thompson
Product Manager: Greg Bulette
Instructional Designers: April Andrien, Priscilla Johnston, Diana Jahrling
Subject Matter Experts: Jack Creasey, Jeff Johnson
Technical Contributor: James Cochran
Classroom Automation: Lorrin Smith-Bates
Graphic Designer: Andrea Heuston (Artitudes Layout & Design)
Editing Manager: Lynette Skinner
Editor: Elizabeth Reese
Copy Editor: Bill Jones (S&T Consulting)
Production Manager: Miracle Davis
Build Manager: Julie Challenger
Print Production: Irene Barnett (S&T Consulting)
CD Production: Eric Wagoner
Test Manager: Eric R. Myers
Test Lead: Robertson Lee (Volt Technical)
Creative Director: David Mahlmann
Media Consultation: Scott Serna
Illustration: Andrea Heuston (Artitudes Layout & Design)
Localization Manager: Rick Terek
Operations Coordinator: John Williams
Manufacturing Support: Laura King; Kathy Hershey
Lead Product Manager, Release Management: Bo Galford

Lead Technology Manager: Sid Benavente
Lead Product Manager, Content Development: Ken Rosen
Group Manager, Courseware Infrastructure: David Bramble
Group Product Manager, Content Development: Julie Truax
Director, Training & Certification Courseware Development: Dean Murray
General Manager: Robert Stewart



Instructor Notes
This module is intended to prepare the students to successfully back up and
restore a server cluster. Students need to know how to use the troubleshooting
tools available for troubleshooting server cluster problems. The module covers
common Cluster service problems and possible resolutions.
After completing this module, you will be able to:
• Perform the steps to successfully back up a server cluster.
• Perform the steps to successfully restore a server cluster.
• Evict a node from a server cluster.
• Identify the tools that are necessary to troubleshoot a cluster failure.
• Interpret the entries in the cluster log.
• Identify and troubleshoot common server cluster failures: network
communications, small computer system interface (SCSI) configuration
problems, group, resource, and quorum failures.

Materials and Preparation
This section provides the materials and preparation tasks that you need to teach
this module.
Required Materials
To teach this module, you need the Microsoft® PowerPoint® file 2087A_02.ppt

Preparation Tasks
To prepare for this module, you should:
• Read the materials for this module and anticipate questions students may
ask.
• Read Q224075, Q257892, Q248998, Q172951, Q266274, Q234767,
Q193890, Q245762, and “Interpreting MSCS Cluster Log,” on the Student
compact disk.
• Be familiar with the Resource Kit Utilities.
• Practice the labs.
• Study the review questions and prepare alternative answers for discussion.

Presentation: 45 Minutes
Lab: 15 Minutes

Module Strategy
Use the following strategy to present this module:
Because backing up the cluster is a key maintenance task, the first section
begins with information on how to back up the cluster configuration files. The
following pages cover the complete procedure for restoring an entire cluster in
case of catastrophic failure. You can also use each of the topics as a separate
procedure for performing a specific task.
The troubleshooting section lists the tools that are available for troubleshooting
Cluster service and gives common problems and suggested resolutions.
• Cluster Maintenance
Cluster service is self-tuning and requires no maintenance other than daily
backups.

    • Backup: Backing up the system state backs up the cluster configuration
      files; however, you also need to back up each node’s data and operating
      system and the cluster disks.
    • Restoring the First Node: The overall procedure for restoring a cluster is
      outlined on this page. The first step, restoring the operating system on
      the first node, is also covered. The remaining steps are covered in detail
      on the following pages.
    • Restoring Cluster Disks: Cluster service uses the disk signature file to
      identify the cluster disk. To replace this disk, you must write the disk
      signature file of the old disk onto the new disk.
    • Restoring the Second Node: Restoring the remaining nodes of the cluster
      is similar to restoring the first node, except that after it is restored, you
      need to test the failover capabilities of the cluster before putting the
      cluster back into the production environment.
    • Evicting a Node: Evicting a node is a manual process through Cluster
      Administrator. As always, it is important to have a good backup of the
      server prior to the eviction process.

• Troubleshooting Cluster Service
The key point of this section is to give the students the tools and techniques
that are useful in reducing the time it takes to find a root cause for common
Cluster service problems.

    • Troubleshooting Tools: The tools that are used to help troubleshoot a
      problem with Cluster service are the same tools that are used to help
      troubleshoot a server running Microsoft® Windows® 2000.
    • Examining the Cluster Log: Cluster service logs every configuration
      change and problem to the cluster log. It is important for the
      students to become familiar with the syntax of the log.
    • Troubleshooting Network Communications: Students need to know that
      there are different troubleshooting paths to follow depending on whether
      the network problem is a node-to-node or a client-to-node problem.
    • SCSI Configuration Problems: SCSI is less reliable than Fibre Channel.
      There can be problems with the SCSI controller, SCSI termination, and
      SCSI cabling.
    • Group and Resource Failures: Remind students to keep dependency trees
      vertical so that if a resource fails, it is easier to find a root cause as to
      which resource is causing the failure of the group.
    • Quorum Log Corruption: If Cluster service cannot write information to
      the quorum log, it will not start. You can attempt to reset the quorum
      log, or you can delete the quorum log and let Cluster service create a
      new log.


Instructor Setup for a Lab
Lab Strategy
This lab is designed to prepare the students to use Backup and Clusrest.exe to
perform the proper backup and restore procedures. Students will uninstall
Cluster service in preparation for the Network Load Balancing (NLB) portion
of the course. NLB and Cluster service cannot run on the same computer.
Lab A: Cluster Maintenance
To conduct this lab:
• Read through the lab carefully, paying close attention to the instructions and
details.
• Students will need the Clusrest utility from c:\moc\2087\labfiles\mscs.
• Students work in teams of two, grouped together by their shared bus.
• Help the students determine whether they are Node A or Node B. In these
exercises each node performs a specific task in the backup and restoration
procedures. Both nodes will uninstall Cluster service.


Overview
• Cluster Maintenance
• Troubleshooting Cluster Service

*****************************
ILLEGAL FOR NON-TRAINER USE******************************
Server cluster maintenance and troubleshooting are considered two separate
disciplines. Maintenance is continuous, whereas troubleshooting has a
beginning when the problem is discovered, and an end when the problem is
resolved. The two disciplines are complementary, however. When every
troubleshooting procedure that you follow fails, you will need to rebuild the
cluster from a backup tape that was generated during a maintenance procedure.
After completing this module, you will be able to:
• Perform the steps to successfully back up a server cluster.
• Perform the steps to successfully restore a server cluster.
• Evict a node from a server cluster.
• Identify the tools that are necessary to troubleshoot a cluster failure.
• Interpret the entries in the cluster log.
• Identify and troubleshoot common server cluster failures: network
communications, small computer system interface (SCSI) configuration
problems, group, resource, and quorum failures.

Topic Objective
To provide an overview of the module topics and objectives.

Lead-in
In this module, we will cover cluster maintenance in the form of backing up and
restoring a cluster, and troubleshooting Cluster service.



• Cluster Maintenance
• Backup
• Restoring the First Node
• Restoring Cluster Disks
• Restoring the Second Node
• Evicting a Node

Cluster service uses the self-tuning features of Microsoft® Windows® 2000 and
requires very little maintenance. The only day-to-day maintenance operation
that you need to perform is to back up the cluster.
Under special circumstances, a node in the cluster may need to be replaced, for
example, when your organization decides to perform a hardware upgrade. In
this situation, you need to evict a node from the cluster and add the upgraded
node to the cluster.
Topic Objective
To introduce the fundamental tasks for maintaining a server cluster.

Lead-in
The only maintenance performed on a cluster is backing up and restoring
Cluster service.

Backup
• Backing Up the System State
• Backing Up the Local Disk
• Backing Up the Cluster Disk

Backing up the cluster is no different from backing up Microsoft
Windows 2000 Advanced Server. It is recommended that you perform regular
backups by using the Windows 2000 Backup program (NTBackup), or other
compatible backup programs. Additional backup agents are still necessary to
back up applications running on the cluster, such as Microsoft SQL Server
and Microsoft Exchange.

Note: A cluster-aware backup program will be able to perform the same backup
operations as NTBackup, especially with regard to backing up the System State
and the cluster configuration database.

Backing Up the System State

The configuration information for the cluster is located in the registry of each
node (HKEY_LOCAL_MACHINE\Cluster). The Backup tool that is included
with Windows 2000 backs up the cluster database when you back up each
node’s system state.
NTBackup backs up the system state on each node. The system state includes:
• The quorum log.
• The local registry.
• The Cluster registry hive.

Topic Objective
To describe how to back up the system state, node, and cluster disks.

Lead-in
A backup of the cluster includes the system state, the node, and the cluster
disk.

Backing Up the Local Disk
Follow standard computer backup procedures to back up the operating system
and the data on the local drives. You must also back up key cluster files on the
local disks.
• On each node, back up the cluster database files:
%systemroot%\cluster\CLUSDB
%systemroot%\cluster\CLUSDB.LOG
• On each node, back up the Cluster service files:
%systemroot%\cluster\*.*
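
The paths above can be expanded for a given node before they are handed to a backup job. The following is a minimal sketch, assuming a hypothetical helper name (`cluster_backup_targets`) and an example system root of `D:\WINNT`; it only builds path strings and does not touch the file system or any backup program.

```python
from pathlib import PureWindowsPath


def cluster_backup_targets(systemroot):
    """Expand the %systemroot%\\cluster paths that each node should back up.

    systemroot is the node's Windows directory, for example D:\\WINNT.
    The helper name and the dictionary keys are illustrative assumptions.
    """
    cluster_dir = PureWindowsPath(systemroot) / "cluster"
    return {
        "database": str(cluster_dir / "CLUSDB"),
        "database_log": str(cluster_dir / "CLUSDB.LOG"),
        "service_files": str(cluster_dir / "*.*"),
    }


if __name__ == "__main__":
    for label, path in cluster_backup_targets(r"D:\WINNT").items():
        print(f"{label}: {path}")
```

Running the same helper with each node's own system root gives a per-node list that can be compared against what the backup job actually selected.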



Note: Backup is essential, but regular testing to make sure that backups and
restores actually work as expected is also necessary. A good practice is to
schedule test backup and restore operations frequently.

Backing Up the Cluster Disks
It is critical to back up cluster files on the quorum disk and data on the cluster
disks, because Cluster service will write information to files in the
\mscs directory on the quorum disk and cluster-aware applications will likely be
placing data on the cluster disk. Because either node of the cluster could own
the cluster disk resource at any time, it is possible for each node to back up the
data on the drive. However, having each node back up data would require you
to install backup hardware and software on each cluster node, which is not the
best solution.
One possibility is to identify a nonclustered server running Windows 2000
Server and schedule it to back up data remotely through a network connection
to the cluster disk’s administrative share or a hidden share that you create. For
example, you might create FBackup$, GBackup$, HBackup$, and WBackup$
file share resources on the virtual server for the root of drives F, G, H, and W.
F, G, and H would be cluster disks with data, and W would be the drive letter
for the quorum disk. Hidden shares would not appear in a browse list and you
could configure them to allow access only to members of the Backup Operators
group.
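
The naming scheme in this example can be sketched as a small helper that derives the hidden share name and the network path a remote backup server would use for each drive. This is only an illustration: the function name and the virtual server name `VSERVER1` are assumptions, and the code builds strings only; it does not create shares or set permissions.

```python
def hidden_backup_shares(virtual_server, drive_letters):
    """Map each cluster drive letter to a hidden backup share.

    A trailing "$" in the share name keeps the share out of browse lists.
    virtual_server is the network name of the virtual server (hypothetical here).
    """
    shares = {}
    for letter in drive_letters:
        letter = letter.upper()
        share_name = f"{letter}Backup$"
        shares[letter] = {
            "share_name": share_name,
            "local_path": f"{letter}:\\",  # root of the drive
            "unc_path": f"\\\\{virtual_server}\\{share_name}",
        }
    return shares


if __name__ == "__main__":
    # F, G, and H hold data; W is the quorum disk, as in the example above.
    for letter, info in hidden_backup_shares("VSERVER1", "FGHW").items():
        print(letter, info["local_path"], "->", info["unc_path"])
```

Because the paths go through the virtual server name, the backup server reaches the disks regardless of which node currently owns them.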

Restoring the First Node
Steps For Restoring a Server Cluster:
1. Restore the first node
2. Restore the cluster disks
3. Restore the second node
4. Perform node testing

The following sections describe the procedure for restoring a server cluster in
the event that both nodes and the cluster disk fail. It is possible that any one of
the components in the cluster could fail independently. In the case of a failed
component, you follow the same procedure for restoring that specific
component.
Performing a complete restore of a server cluster is a straightforward process.
1. Restore a node of the cluster.
2. Restore the cluster disks of the restored first node.
3. Restore the remaining node of the cluster.
4. Perform node testing.

Topic Objective
To list the steps for restoring a server cluster and describe how to restore the
first node.

Lead-in
In the event of a complete cluster failure, you first restore a node.

Delivery Tip
This page lists the four steps that are involved in restoring a complete cluster
and covers the first step, Restoring a Node. Details about the other three steps
follow on the next pages.

Restoring a Node of the Cluster
To restore a node in a server cluster, you follow the same procedure that you
would use in restoring a Windows 2000 operating system.
1. Install a fresh copy of Windows 2000 Advanced Server on the node to be
restored.
2. Log on as Administrator and restore the system and boot partition, system
state, and associated volumes from the backup. Make sure that you select
the option to restore the system state to the original location in the backup
program.
3. Restart the node.
4. Perform the steps for restoring the cluster disk. These steps follow in the
next section.


Note: The difference between the time of the backup and the time of the
restoration to the new computer may affect the computer account on the domain
controller. You may have to join a workgroup and then rejoin the domain.


Restoring Cluster Disks
• Restoring Disk Signature Files
• Restoring the Data on the Cluster Disk
• Restoring the Cluster Configuration Files

After you have restored a node in the cluster, you must restore the cluster disks.
Restoring the cluster disks involves restoring the disk signature file that the
cluster uses to identify the disk. You may also need to restore a cluster disk if
you are running out of disk space or if a disk failure is impending.
It can be costly to make mistakes while replacing a cluster disk; the
consequence can be the irrecoverable loss of all of the data on that disk. If the
disk is the quorum disk, the server cluster's configuration data is at risk.
Before restoring the cluster disks, stop Cluster service on all of the nodes of the
cluster. Stopping Cluster service ensures that it does not place a lock on the
disks during the restore.
Restoring Disk Signature Files
Because Cluster service relies on disk signatures to identify and mount
volumes, if a disk is replaced, or if the bus is re-enumerated, Cluster service
will not find the disk signatures that it is expecting and will not function.
You can run Dumpcfg.exe to extract the disk signature from the registry and
write it to the new disk. Cluster service will recognize the new disk and
successfully start the resource.

Note: Dumpcfg.exe is a Resource Kit utility that restores an old disk
signature to a new disk.

If the disk that you are replacing is the quorum disk, use Cluster Administrator
to move the quorum to a different disk, and proceed in the replacement of the
disk. After the disk is brought back online, you can move the quorum back to
the new disk.
Topic Objective
To describe how to restore the cluster disk by restoring signature files, data,
and cluster configuration files.

Lead-in
Restoring a cluster disk involves restoring the disk signature file.

For Your Information
Be familiar with Q224075, “Disk Replacement for Windows 2000 Server
Cluster,” found on the Student compact disk.

Restoring the Data on the Cluster Disk
Restoring the data on the cluster disk is the same as restoring a local disk.
Before restoring the data, make sure that you have associated each cluster disk
with the same drive letter as before the disaster or failure. When restoring, make
sure that you restore the data to the original location and verify the integrity
after you have completed the restore.
Restoring the Cluster Configuration Files
The cluster configuration files include the cluster database and the quorum log.
The cluster database is the configuration data (cluster objects and their
settings) that is pertinent to the cluster. This database is the product of
the cluster registry key checkpoint and the changes that are recorded in the
quorum log. Each node of the cluster maintains a local copy of this database in
the cluster hive of its local registry.

After you have restored the disk signature file and data, you can start the server
cluster. If the cluster files were not restored, or were corrupted, the following
procedure can restore the cluster database from the registry of the restored node.
Identify the node on which you will restore the database (in the case of a
disaster restore, this will be the first node that you have restored). Restore the
cluster database on the selected node by restoring the system state. Restoring
the system state creates a temporary folder under the %Systemroot%\Cluster
folder called Cluster_backup.
You use NTBackup to restore the cluster configuration files, which places them
on the node. You then restore the cluster database to the node’s registry by
using the Clusrest.exe tool. Clusrest.exe restores both the quorum log
(Quorum.log) file and the cluster database (Clusdb).

Note: The Clusrest.exe tool is available in the Windows 2000 Resource Kit.
This tool is a free download from www.microsoft.com.


Restoring the Second Node
• Restoring the Remaining Node(s) of a Cluster
• Performing Node Testing

After you complete the process of restoring a node of a cluster, and Cluster
service has started successfully on the newly restored node, you can start the
restore process on the other node of the cluster.
Restoring the Remaining Node(s) of the Cluster
The procedure for restoring the second node of a cluster is the same as the
procedure for restoring the first node, except that you do not have to restore
the cluster disks.
Performing Node Testing
Testing the failover and failback policy is recommended before putting the
cluster back into production.
1. Verify that the disk and cluster resources are available on the correct node.
2. Fail over each group and resource to verify that they can successfully start
on the other node of the cluster.
3. Test the failback policy of each resource by allowing the resource to fail
back to a preferred owner after the node has come back online.
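
The three checks above can be rehearsed on paper with a toy model before they are carried out in Cluster Administrator. The sketch below is only an illustration of the bookkeeping: the `Group` class, the node names, and the group names are hypothetical, and nothing here talks to Cluster service.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    preferred_owner: str
    owner: str
    failback: bool = True  # whether the group's failback policy allows failback


def fail_over(groups, failed_node, surviving_node):
    """Step 2: every group owned by the failed node moves to the survivor."""
    for group in groups:
        if group.owner == failed_node:
            group.owner = surviving_node


def fail_back(groups, recovered_node):
    """Step 3: groups that prefer the recovered node return to it."""
    for group in groups:
        if group.failback and group.preferred_owner == recovered_node:
            group.owner = recovered_node


def misplaced(groups):
    """Step 1: list groups that are not on their preferred owner."""
    return [group.name for group in groups if group.owner != group.preferred_owner]


if __name__ == "__main__":
    groups = [
        Group("Disk Group 1", preferred_owner="NODEA", owner="NODEA"),
        Group("Cluster Group", preferred_owner="NODEB", owner="NODEB"),
    ]
    fail_over(groups, failed_node="NODEA", surviving_node="NODEB")
    print("After failover, misplaced:", misplaced(groups))
    fail_back(groups, recovered_node="NODEA")
    print("After failback, misplaced:", misplaced(groups))
```

On a real cluster, the equivalent of `misplaced` is the list of groups shown under the wrong node in Cluster Administrator after each test.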

Topic Objective
To describe how to restore the second or remaining nodes of a cluster and test
the failover and failback policies.

Lead-in
The last step in restoring the cluster is to restore the second node and then
test the components of the cluster.

Evicting a Node
Steps for Evicting a Node
1. Back up both nodes
2. Verify backup
3. Move all groups to the remaining node
4. Stop Cluster service on the node to be removed
5. Evict the node
6. Unplug the server from the shared bus

If you need to change a node of a cluster, for example, to add a more powerful
server, you need to logically remove the node before physically removing the
node from the cluster. When you configure a new server with the shared bus,
and the public and private networks, you can then run the Cluster Installation
Wizard.
To remove a node from a cluster, in Cluster Administrator, right-click the
node to open the menu that contains the Stop Cluster and Evict Node options.
To evict a node:
1. Back up both nodes.
2. Verify backup.
3. Move all of the groups to the remaining node.
4. Stop Cluster service on the node that is to be removed.
5. Evict the node.
6. Unplug the server from the shared bus (if the shared bus is a SCSI bus, be
careful about termination).


Note: If a new server is to join the cluster later, run the Cluster Installation
Wizard and select Join a Cluster.

Topic Objective
To describe how to evict a node from a cluster.

Lead-in
You must first evict a node from the cluster to add a new node to the cluster.



• Troubleshooting Cluster Service
• Troubleshooting Tools
• Examining the Cluster Log
• Troubleshooting Network Communications
• SCSI Configuration Problems
• Group and Resource Failures
• Quorum Log Corruption

Troubleshooting a problem with Cluster service can be more complex than
troubleshooting a single server because of the virtual servers and the need for
intracluster communications. Virtual servers change ownership from one node
to another, which may cause network connectivity problems. Applications
running on the cluster are difficult to troubleshoot, because they are running on
a virtual server instead of a physical server. You could also have a node-to-node
communication problem because servers usually work independently of each
other and not together. You might experience hardware problems with the
shared bus and the cluster disk resources.
The most common failures are due to improper configurations within groups
and resources. Cluster service will fail if the quorum log becomes corrupt. It is
important to know how to repair the quorum log to restart the cluster.

You use the same tools to identify problems on the cluster as you would use to
identify problems on a physical server. The best resource for troubleshooting is
the cluster log because Cluster service records the activity of each node in the
cluster log. This log can help you identify problems on the node or in the
cluster.
Topic Objective
To introduce the topic of troubleshooting as it relates to server clusters.

Lead-in
This section provides an overview of the tools that are available for
troubleshooting problems with Cluster service.

Delivery Tip
Be familiar with Q266274, “How to Troubleshoot Cluster Service Startup
Issues,” on the Student compact disk.

Troubleshooting Tools
• Disk Manager
• Task Manager
• Performance Monitor
• Network Monitor
• Dr. Watson
• Services Snap-in


When troubleshooting Cluster service, you can use the same tools and
methodologies that you would when troubleshooting Windows 2000 Advanced
Server.
Cluster service writes logging information to the system log of every node in
the cluster. Cluster service also writes a more detailed log of cluster activity to
the cluster log on each node. Use these two sources to gather information when
you begin troubleshooting a problem. You will be able to determine whether
the problem is related to the network, to services or applications, or to physical
components in the cluster.

Note: Use Event Viewer to filter the system log on the event source ClusSvc. You
can view general events, such as “Microsoft Cluster service failed to join the
cluster on this node” and “Microsoft Cluster service successfully created a cluster
on this node.”

After you have determined the type of problem, you can use the following tools
to search for the source of the problem. You must check each node individually
when using any of these tools.
• Disk Manager. Use Disk Manager to check the health of the cluster
disk. You can check whether the operating system recognizes the disks, and
whether the cluster disks are basic or dynamic. You also need to verify
that the drive letters of the cluster disks are the same on both nodes.
• Task Manager. You can verify that Cluster service is running in Microsoft
Windows 2000 Task Manager. You can also use Task Manager as a
performance monitor, but it does not provide the level of detail that
Performance Monitor does. In Task Manager, you can check the
CPU utilization percentage and the memory resources on the node.

Topic Objective
To describe the tools that are used for troubleshooting Cluster service problems.

Lead-in
The tools that you use for troubleshooting a cluster are the same tools that you
use to troubleshoot a server.

• Performance Monitor. Microsoft Windows 2000 Performance Monitor is
the primary tool for finding bottlenecks on servers running Windows 2000.
It is recommended that you create a baseline before and after you add
cluster resources to the cluster. You also need to create a baseline on each
node during failover and failback of resources to check for potential
physical resource deficiencies. It is recommended that you configure a
computer to monitor the Cluster service property on every node of the
cluster, and send an e-mail message to an administrator when a node or the
cluster is offline.
• Network Monitor. You use Microsoft Windows 2000 Network Monitor to
troubleshoot any node-to-node and client-to-node communication. You
must configure Network Monitor to capture data on the private network to
see node-to-node communication.
• Dr. Watson. Dr. Watson is a user-mode debugging tool. If a clustered
application or the Cluster Administrator crashes, the debugging information
is found in the Dr. Watson log file.
• Services Snap-in. Cluster service runs as a service in Windows 2000. If
Cluster service is not running correctly, check the properties of the service
through the Services snap-in to ensure that the default properties have not
changed. Verify that Cluster service:
    • Is set to start automatically.
    • Is set to log on as the designated domain service account.
    • Is set to restart after a failure.
• Make sure that the following four services have started:
    • Network Connections (Network Connections has a Remote Procedure
      Call (RPC) dependency)
    • RPC
    • Windows Management Instrumentation Driver Extensions
    • Windows Time


Examining the Cluster Log
[Screenshot: a copy of the cluster log open in WordPad; lines are truncated at the right edge of the window]
000003b8.000003b4::2000/10/02-19:44:12.946 [CS] Cluster Service started – Cluster Node Vers
000003b8.000003b4::2000/10/02-19:44:12.946 OS Version 5.0.21
000003b8.000002f0::2000/10/02-19:44:12.957 [CS] Service Starting…
000003b8.000002f0::2000/10/02-19:44:13.007 [EP] Initialization…
000003b8.000002f0::2000/10/02-19:44:13.057 [DM]: Initialization
000003b8.000002f0::2000/10/02-19:44:13.097 [DM]: Loading cluster database form D:\WINNT\clu
000003b8.000002f0::2000/10/02-19:44:13.397 [DM] DmpStartFlusher: Entry
000003b8.000002f0::2000/10/02-19:44:13.397 [DM] DmpStartFlusher: thread created
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Initializing…
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Local node name = SERVER1.
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Local node ID = 1.
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Creating object for node 1 (SERVER1)
000003b8.000002f0::2000/10/02-19:44:13.437 [NM] Initializing networks.
000003b8.000002f0::2000/10/02-19:44:13.447 [NM] Initializing network interfaces.
000003b8.000002f0::2000/10/02-19:44:13.788 [NM] Initializing complete.
000003b8.000002f0::2000/10/02-19:44:13.848 [NM] Starting worker thread…
000003b8.000002f0::2000/10/02-19:44:13.848 [API] Initializing
000003b8.000002f0::2000/10/02-19:44:13.848 [FM] Worker thread running
000003b8.000002f0::2000/10/02-19:44:13.878 [LM] :LMInitialize Entry.
000003b8.000002f0::2000/10/02-19:44:13.878 [LM] :TimerActInitialize Entry.
000003b8.000002f0::2000/10/02-19:44:13.878 [CS] Service Domain Account = clusservice@mocmoc
000003b8.000002f0::2000/10/02-19:44:13.878 [CS] Initializing RPC server.
000003b8.000002f0::2000/10/02-19:44:14.038 [INIT] Attempting to join cluster MYCLUSTER
000003b8.000002f0::2000/10/02-19:44:14.048 [JOIN] Spawning thread to connect to sponsor 10.
000003b8.000002f0::2000/10/02-19:44:14.048 [JOIN] Spawning thread to connect to sponsor 169
Callouts: each entry begins with the IDs of the process and thread issuing the log
entry, followed by the timestamp and the event description.

The cluster log is a diagnostic log that is a more complete record of cluster
activity than the Microsoft Windows 2000 Event Log. The cluster log records
the Cluster service activity (Clussvc.exe and associated processes) that leads up
to the events that are recorded in the event log. Although the event log can point
you to a problem, the cluster log helps you to determine the source of the
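problem. For diagnosis, check the event log for general information and the
cluster log for specific details about the cluster status. If you see a problem in
the event log, note the timestamp and go to approximately the same timestamp
in the cluster log. Because cluster log timestamps are recorded in Greenwich
Mean Time (GMT), convert the local time shown in the event log to GMT first.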
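
The lookup by timestamp described above can be sketched in Python. This is a minimal illustration, not a Microsoft tool: the function name entries_near and the 30-second default window are assumptions, and the event time is expected to be already converted to GMT.

```python
from datetime import datetime, timedelta

# Timestamp layout used by cluster log entries: yyyy/mm/dd-hh:mm:ss.sss
STAMP_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def entries_near(log_lines, event_time_gmt, window_seconds=30):
    """Return the log lines stamped within window_seconds of event_time_gmt."""
    window = timedelta(seconds=window_seconds)
    nearby = []
    for line in log_lines:
        parts = line.split("::", 1)
        if len(parts) != 2:
            continue  # not a complete entry, for example a wrapped line
        stamp_text = parts[1].split(" ", 1)[0]
        try:
            stamp = datetime.strptime(stamp_text, STAMP_FORMAT)
        except ValueError:
            continue  # the text after "::" is not a timestamp
        if abs(stamp - event_time_gmt) <= window:
            nearby.append(line)
    return nearby
```

To use it, read Cluster.log into a list of lines and pass the GMT time of the event that you noted in the event log.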

The cluster log is enabled by default when you install Cluster service, but will
not start logging information until after the first restart of the node. Cluster log
output is written to %SystemRoot%\Cluster\Cluster.log, and you can view it
with Microsoft WordPad.
Setting the Logging Level
The cluster log supports four logging levels. The default level is 2, which logs
enough information for normal troubleshooting. To set a different logging level,
click Start, point to Settings, click Control Panel, and then double-click the
System icon. On the Advanced tab, click Environment Variables, and then create
a system environment variable called ClusterLogLevel with a value of 0, 1, 2,
or 3, where 0=no logging, 1=errors only, 2=errors and warnings, and
3=everything that happens.
Setting the Log File Size
The log file defaults to a maximum size of 8 megabytes (MB). When the log
file reaches 8 MB, new entries start overwriting the existing data in the file.
To specify a larger file size, add the registry entry ClusterLogSize under
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\
Parameters. ClusterLogSize is of type DWORD and specifies the maximum size
of the log file in MB. If this value is set to 0, logging is disabled.
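
As an illustration of the entry described above, the following Python sketch generates the text of a registry (.reg) file that sets ClusterLogSize. Delivering the setting as a .reg file and the function name cluster_log_size_reg are assumptions, not part of the course material; you can equally create the value by hand in a registry editor.

```python
# Registry key that holds the Cluster service parameters (from the text above).
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters"


def cluster_log_size_reg(size_mb):
    """Return .reg file text that sets ClusterLogSize to size_mb megabytes."""
    if not isinstance(size_mb, int) or not 0 <= size_mb <= 0xFFFFFFFF:
        raise ValueError("ClusterLogSize must fit in a DWORD (0 to 4294967295)")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY_PATH}]\n"
        # A DWORD is written in .reg files as eight hexadecimal digits.
        f'"ClusterLogSize"=dword:{size_mb:08x}\n'
    )
```

For example, cluster_log_size_reg(16) produces text for a 16 MB limit. Remember that a value of 0 disables logging.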
Topic Objective
To learn how to use the cluster log to troubleshoot Cluster service problems.
Lead-in
The cluster log is the best source of information that you have available to
troubleshoot a cluster.
Delivery Tip
For a detailed explanation of how to interpret the cluster log, read
“Interpreting MSCS Cluster Log” found on the Student compact disk.

Cluster Log Entries
There are two types of cluster log entries: Component Event Log entries and
Resource dynamic-link library (DLL) log entries. Cluster service is made up of
a number of components, such as the database manager and the global update
manager. The cluster log records the interactions of these components, making
it a powerful diagnostic tool. Because resource groups are the basic unit of
failover, resource DLL entries are essential to understanding cluster activity.
The first line in the body of a typical cluster log is:
378.32c::1999/06/09-18:00:18.874 Cluster service started -
Cluster Node Version 3.2051

The main elements of this line are common to every line of the log:
• The IDs of the process and thread issuing the log entry. These two IDs are
concatenated, separated by a period. In the previous example, the Process
ID is 378, and the Thread ID is 32c.
• Timestamp. The timestamp is recorded in the following format, in
Greenwich Mean Time (GMT):
yyyy/mm/dd-hh:mm:ss.sss
• Event description. One example of an event description is Cluster
service started.
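
The three elements listed above can be separated with a few lines of Python. This is a minimal sketch: the names LogEntry and parse_entry are hypothetical, and reading the two IDs as hexadecimal is an assumption based on values such as 32c in the sample line.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    process_id: int
    thread_id: int
    timestamp: datetime  # recorded in GMT
    description: str


def parse_entry(line):
    """Split one cluster log line into IDs, timestamp, and event description."""
    ids, rest = line.split("::", 1)
    process_hex, thread_hex = ids.split(".", 1)
    stamp_text, _, description = rest.partition(" ")
    return LogEntry(
        process_id=int(process_hex, 16),  # assumption: IDs are hexadecimal
        thread_id=int(thread_hex, 16),
        timestamp=datetime.strptime(stamp_text, "%Y/%m/%d-%H:%M:%S.%f"),
        description=description.strip(),
    )
```

Pass the function a complete entry; an entry that wraps across two lines in a viewer must be joined into one string first.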

Component Event Log Entries

In the following example, [NM] indicates the component that wrote the event to
the cluster log; in this case, NM stands for node manager.
378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.
Resource DLL Log Entries
The following example is a cluster log entry for a resource DLL event. This
example is one of the entries from the disk arbitration process.

15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>:
[DISKARB] Arbitration Parameters (1 9999).

Instead of listing an abbreviated component name between the timestamp and
event description as component log entries do, entries describing resource DLL
events list the following information:

• Resource type (Physical Disk)
• Resource name (<Disk D:>)

The event description in this example is [DISKARB] Arbitration Parameters
(1 9999).
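
The difference between the two entry types can be sketched as a small classifier that works on the text following the timestamp. The patterns and the function name classify are assumptions derived only from the two sample entries in this section; real logs may contain forms that these patterns do not cover.

```python
import re

# Component entries start with an abbreviated component name, such as [NM].
COMPONENT = re.compile(r"^\[(?P<component>[A-Za-z]+)\]\s*(?P<event>.*)$")
# Resource DLL entries start with "Resource type <Resource name>:".
RESOURCE = re.compile(r"^(?P<rtype>[^<\[]+?)\s*<(?P<rname>[^>]*)>:\s*(?P<event>.*)$")


def classify(description):
    """Classify the text that follows the timestamp of a cluster log entry."""
    match = COMPONENT.match(description)
    if match:
        return {"kind": "component",
                "component": match["component"],
                "event": match["event"]}
    match = RESOURCE.match(description)
    if match:
        return {"kind": "resource",
                "resource_type": match["rtype"],
                "resource_name": match["rname"],
                "event": match["event"]}
    return {"kind": "other", "event": description}
```

Applied to the disk arbitration example, the classifier reports the resource type Physical Disk and keeps [DISKARB] Arbitration Parameters (1 9999). as the event description.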

Troubleshooting Network Communications
• Troubleshooting Node-to-Node Communications
  • Verify RPC Communication
  • Verify Cluster Heartbeats
• Troubleshooting Client-to-Node Communications
  • Check NetBT Cache with Nbtstat
  • Ping IP Address
  • WINS Static Mappings


There are two types of cluster network communications that can fail: the client
may be unable to access the cluster or the nodes may be unable to communicate
with each other. When client communications are interrupted, there is a
problem with the public network. When the nodes are unable to communicate,
there is a problem with either the public or the private network.
Troubleshooting these two types of network-related problems requires different
approaches.
Troubleshooting Node-to-Node Communications
You can use Windows 2000 Network Monitor before installing Cluster service
to capture the trace of the ping between the nodes on the public and private
network. After Cluster service is installed, you use Network Monitor to verify
remote procedure call (RPC) communication and cluster heartbeats.

Note: You can also use RPC Ping, an RPC connectivity verification tool that is
a free download from www.microsoft.com. This tool verifies that
Windows 2000 Server services are responding to remote procedure call
requests between nodes.

Verifying RPC Communication
To verify that RPC communication is occurring between the nodes of a cluster,
use a network capture utility, such as Microsoft Network Monitor.
Windows 2000 Server includes a simple version of Network Monitor that you
can install by using Add/Remove Programs in Control Panel (Add/Remove
Windows Components, Management and Monitoring Tools).
To verify RPC communication, configure the Capture utility to capture all of
the traffic between the nodes of a cluster. After you have started a capture,
using Cluster Administrator to create a group or resource will result in RPC
traffic between the nodes.
Topic Objective
To describe how to troubleshoot node-to-node and client-to-node communication.
Lead-in
Depending on the symptom, you may have to troubleshoot node-to-node or
client-to-node communications.

Verifying Cluster Heartbeats
As with RPC communication, to verify that cluster heartbeats are occurring
between the nodes of a cluster, you must use a network capture utility.
Cluster service uses User Datagram Protocol (UDP) port 3343 to send
heartbeats on the network. Use Network Monitor to capture traffic on UDP port
3343 and verify that both nodes of the cluster are sending and receiving
cluster heartbeats.
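
Once a capture has been exported, the heartbeat check can be sketched as follows. The record layout (source IP, destination IP, protocol, destination port), the function names, and the node addresses you pass in are all hypothetical; Network Monitor does not produce this structure directly. The sketch also assumes that heartbeats are addressed to the peer node with destination port 3343.

```python
HEARTBEAT_PORT = 3343  # UDP port used for cluster heartbeats


def heartbeat_status(packets, nodes):
    """Report, per node, whether heartbeats were sent and received.

    packets: iterable of (source_ip, destination_ip, protocol, destination_port)
    nodes:   iterable of node IP addresses on the captured network
    """
    status = {node: {"sent": False, "received": False} for node in nodes}
    for source, destination, protocol, port in packets:
        if protocol.lower() != "udp" or port != HEARTBEAT_PORT:
            continue  # not heartbeat traffic
        if source in status:
            status[source]["sent"] = True
        if destination in status:
            status[destination]["received"] = True
    return status


def heartbeats_healthy(packets, nodes):
    """True only if every node both sent and received a heartbeat."""
    report = heartbeat_status(packets, nodes)
    return all(entry["sent"] and entry["received"] for entry in report.values())
```

A node that shows sent but not received (or the reverse) points to one-way traffic loss on that network.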
Troubleshooting Client-to-Node Communications
After a failover occurs, clients must still be able to gain access to a cluster, even
though they will be accessing a different node. Clients must be able to
resolve cluster network names so that they always connect to the node
on which the resources are online. If clients cannot connect to virtual servers,
verify that:
• The client is accessing the cluster by using the correct network name or IP
address.
• The client has Transmission Control Protocol/Internet Protocol (TCP/IP)
correctly installed and configured.


Check NetBT Cache with Nbtstat
Depending on the resource that is being accessed, the client can address the
cluster by specifying either the resource network name or the IP address. In the
case of the network name, you can verify proper name resolution by checking
the NetBT cache (using the Nbtstat.exe utility) to determine whether the name
has been previously resolved. Also, confirm proper Windows Internet Name
Service (WINS) configuration on the client and on the cluster nodes.
Ping IP Address Using Ping Utility
If the client is accessing the resource through a specific IP address, ping the IP
address of the cluster resource and cluster nodes from a command prompt.
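
The two client-side checks (listing the NetBT cache and pinging the addresses) can be scripted. The sketch below only wraps the standard nbtstat -c and ping -n commands on a Windows client; the function names and the addresses you supply are hypothetical.

```python
import subprocess


def diagnostic_commands(addresses, ping_count=4):
    """Build the commands to run on a Windows client, as argument lists."""
    commands = [["nbtstat", "-c"]]  # list the NetBT cache of resolved names
    for address in addresses:
        # -n sets the number of echo requests sent by the Windows ping utility
        commands.append(["ping", "-n", str(ping_count), address])
    return commands


def run_diagnostics(addresses):
    """Run each command and return (command, exit_code, output) tuples."""
    results = []
    for command in diagnostic_commands(addresses):
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((command, completed.returncode, completed.stdout))
    return results
```

Call run_diagnostics with the IP address of the cluster resource and the IP addresses of the cluster nodes, and review the output of each command.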
WINS Static Mappings
You should not create static network name to IP address mappings for any
cluster names in a WINS database. WINS is the only name resolution method
that will cause problems when using static mappings, because WINS static
mappings use the media access control (MAC) address of the network card as
part of the static mapping.
If clients are having a problem connecting to a virtual server, an administrator
might have created a WINS static mapping for a virtual server. The node for
which the mapping is created will be able to bring the network name resource
online and clients will be able to connect. However, if failover occurs, the
second node in the cluster will be able to bring the IP address online but not the
network name. When the second node attempts to bring the network name
online, WINS will return an error preventing it from registering the network
name. WINS prevents the network name from going online because the second
node does not have the same physical address as the one recorded in the static
mapping for the network name.

Note: For more WINS troubleshooting information, see “Recommended WINS
Configuration for Microsoft Cluster Server,” Q193890, on the Student compact
disk.



SCSI Configuration Problems
• SCSI Controllers
• SCSI Termination
• SCSI Cabling

If hardware fails, you may have to replace hardware components of the
cluster. If you replace components in the SCSI subsystem, make sure that the
new SCSI configuration conforms to the following guidelines.
SCSI Controllers

SCSI IDs: Each device on the shared SCSI bus must have a unique SCSI ID. Most
SCSI controllers default to SCSI ID 7. Therefore, you must change the SCSI ID
for one of the controllers on the shared SCSI bus to something other than ID 7.

Boot Time SCSI Bus Reset: Cluster service uses SCSI bus resets, but in a
controlled way during a membership regroup operation. Some SCSI controllers
reset the SCSI bus when they initialize at start time, before Windows 2000 is
loaded. If the SCSI controllers reset the SCSI bus, the bus reset can interrupt
any data transfers between the other node and drives on the shared SCSI bus.
Therefore, you should disable automatic SCSI bus resets, if possible, by using
the adapter configuration program accessible at computer start time.

Non-Compliant Controllers: It is important to verify that the SCSI controllers
that are being used are on the Cluster service Hardware Compatibility List
(HCL). For a SCSI controller to work with Cluster service, it must support the
SCSI reserve and release commands and bus resets.
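
The unique-ID rule for the shared bus can be checked against an inventory before you power on the nodes. This is a minimal sketch: the inventory is a hypothetical dictionary that you fill in by hand from the adapter and drive settings, and duplicate_scsi_ids is not an existing tool.

```python
from collections import Counter


def duplicate_scsi_ids(bus_devices):
    """Return {scsi_id: [device names]} for every ID used more than once.

    bus_devices maps a device label (controller or drive) to its SCSI ID.
    """
    counts = Counter(bus_devices.values())
    return {
        scsi_id: sorted(name for name, dev_id in bus_devices.items() if dev_id == scsi_id)
        for scsi_id, count in counts.items()
        if count > 1
    }
```

An empty result means that every device on the bus has its own ID. If both controllers were left at the default ID 7, they appear together under key 7.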

Topic Objective
To explain how to troubleshoot SCSI configuration problems.
Lead-in
Some of the most common SCSI problems are listed in the following table.

SCSI Termination

Active or Forced-Perfect Termination: There are three types of termination that
are used for terminating the SCSI bus: passive termination, active termination,
and forced perfect termination. Because both active and forced perfect
termination use electronics to provide termination, these types provide the
best termination. You should not use passive termination in a cluster, because
it can result in problems, such as unnecessary failover or inability to access
the quorum disk.

On-Card Termination: Many SCSI controllers provide on-card termination;
however, the on-card termination does not provide termination when the
computer is not turned on. On-card termination only becomes an issue when
external terminators are not used. When using external terminators, the on-card
termination should be disabled.

SCSI Cabling

Tri-Link or Y-cable SCSI Connectors: Attaching Y-cables or tri-link connectors
to the back of the SCSI controllers at each end of the bus is one method that
you can use to allow the SCSI bus to remain terminated even when one node is
turned off. These components allow you to use external terminators that will
continue to provide termination if a node is turned off. You must ensure that
the SCSI cards in the nodes are not providing termination when using these
connectors.

Long Cables: It is very common to have multiple external SCSI drives on the
shared SCSI bus. When configuring multiple external drives, it is very important
not to exceed the maximum combined cable length that the controller
manufacturer recommends. The SCSI specifications specify the maximum combined
cable length when using different types of cabling. If the manufacturer of the
controller recommends a shorter distance, be sure to follow the recommendation
of the manufacturer.
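
The cable-length rule amounts to comparing the combined length with the stricter of the two limits. The sketch below takes both limits as parameters because the actual figures depend on the cabling type and the controller documentation; the function name and the parameter names are hypothetical.

```python
def cable_length_ok(segment_lengths_m, spec_limit_m, manufacturer_limit_m=None):
    """Check the combined cable length against the applicable limit.

    segment_lengths_m:    length of each cable segment on the shared bus, in meters
    spec_limit_m:         maximum combined length from the SCSI specification
    manufacturer_limit_m: controller manufacturer's recommendation, if any
    """
    limit = spec_limit_m
    if manufacturer_limit_m is not None:
        # A shorter manufacturer recommendation takes precedence.
        limit = min(spec_limit_m, manufacturer_limit_m)
    return sum(segment_lengths_m) <= limit
```

Include every segment on the shared bus, including any cabling inside external drive enclosures, when you list the lengths.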
