
Proposal for:
an Interactive Grid Analysis Environment Service Architecture
Julian Bunn, Rick Cavanaugh, Iosif Legrand, Harvey Newman,
Suresh Singh, Conrad Steenberg, Michael Thomas, Frank van Lingen
1 Introduction
This document is a proposal that describes an Interactive Grid Analysis Service Architecture, with
a focus on High Energy Physics applications, including a set of work packages and milestones to
implement this architecture.
The ideas outlined in this architecture document are motivated by the lack of distributed environments dedicated to interactive Grid analysis, and by the RTAG activity ARDA (Architectural Roadmap towards Distributed Analysis) [1] within the LHC. The foundation of the architecture
described here will be based on web services and peer to peer systems. Web Services have been
identified as being suitable for Grid applications [2]. Peer to peer technologies are well suited for
environments composed of dynamic resource providers [3].
The goal of this document is to identify services that can be used within the GAE. Besides re-use
of existing services (e.g. components), this document describes characteristics of an interactive
Grid Analysis Environment (GAE) and identifies new Grid services and functionality that are
needed to enable distributed interactive analysis.
One of the big differences between a GAE and production- or batch-oriented analysis Grids is that the behavior of the users is much more unpredictable. This unpredictable behavior of a collection of users is too complex for humans to steer resource usage "manually". Instead, applications are needed that act on behalf of humans in an autonomous fashion to steer and optimize resource usage. Furthermore, this document identifies several components that can be used for policy management within the Grid. Policy management prevents certain users from allocating more resources than they are entitled to.
While this architecture is being developed to solve specific problems related to CMS data analysis
[4], many of the services described in this architecture can be used in other Grid systems. In
addition, while some of the services have already been developed, or are being developed, they have never been assembled into such an interactive system before.
A Grid as used within High Energy Physics consists of a heterogeneous collection of resources
and components. A secondary goal within this proposal is to define platform and language neutral
APIs in order to allow smooth operation between the heterogeneous Grid resources and
components and minimize dependencies between components. The secondary goal is closely
related to work done in the PPDG CS 11 work group [5] and is also stated in the HEPCAL II
document [6].
Section 2 discusses several characteristics of interactive analysis. Requirements and use cases are discussed in Section 3. Section 4 describes the GAE and the different components it is comprised of. Based on the description of Section 4, several scenarios are discussed in Section 5 that show the interaction between different GAE components in typical analysis situations. Sections 6 and 7 map the identified GAE components to use cases identified for interactive analysis in HEPCAL [7] and HEPCAL II [6] and to existing components and applications that can be used within the development of the GAE. Finally, Section 8 maps the components described in the architecture section to work packages and milestones.
The content of this proposal focuses on interactive analysis. However, components developed for
an interactive analysis environment could also be used in a batch or production environment and
vice versa (see section 2 and section 4).
The research problems, service components, and requirements addressed by this proposal are not
only important for interactive analysis; batch analysis will benefit as well. The work outlined in this
proposal will pave the way for a high energy physics data grid in which production, batch and
interactive behavior coexist.
2 Interactivity in a Grid Environment
Part of the motivation for this proposal is the difference between batch and interactive analysis and
the focus on batch analysis within Grid software development for High Energy Physics. This
section serves to explain the characteristics of interactive analysis and how it differs from batch
analysis. Additional information on interactive and batch analysis can be found in HEPCALII [6].
2.1 Interactive versus Batch Analysis
The structure of a batch analysis environment is well known within high energy physics [8]. A large number of computing jobs are split up into a number of processing steps arranged in a directed acyclic graph and are executed in parallel on a computing farm. A batch analysis session is fairly static in that the Directed Acyclic Graph (DAG) structure of the processing steps being executed is known in advance. The only interaction between the user and the batch job is the ability to see the progress of the computing job and the ability to restart processing steps that may have failed due to error.
A typical batch analysis session would involve the following operations: sign on to the Grid,
specification of requested datasets, generation and review of an execution plan (this includes the
resource allocation for executing the plan), plan submission, plan monitoring and steering, and
collection of analysis results.
A typical interactive analysis session is quite similar to a batch analysis session. The main
difference is that the execution plan for interactive analysis is an iterative loop that builds on the
results of previous commands. The structure of the job graph is not known in advance as it is with
batch processing. The user will submit a command to be executed, analyze the results, then
submit new commands that build upon the results of the previous commands.
Both interactive and batch analysis make use of “processing instructions”. For batch analysis,
these processing instructions are comprised of shared libraries, interpreted scripts, and command
parameters that are all known in advance, or can be automatically or programmatically determined
from the results of an earlier processing job. In an interactive analysis session, however, the
processing instructions are not known in advance. The end user will analyze the results of each
processing step and change the parameters, libraries, and scripts for succeeding processing steps.
Human intervention, as opposed to computer automation, determines the details of each processing step. Because humans are far less patient than computers, processing instructions in an interactive session will have a much shorter execution time than processing instructions in batch jobs. Low latency will be required between the submission of a processing instruction and the receipt of its results.
Interactive analysis and batch analysis can be seen as two extremes of an analysis continuum: it is possible to have an analysis session in which both batch-type and interactive-type analysis is done. Furthermore, it is difficult to assign a "batch" label to any particular analysis. If there is low enough latency and enough computing power, the response time of a batch analysis could be short enough that it becomes part of a larger interactive analysis.
The next subsections describe several issues that are important for interactive analysis, some of which also apply to a batch environment. However, none of the current Grid batch environments have addressed these issues in a way that would allow their use within an interactive environment.
2.2 Execution state
Execution state is the entire set of information needed to recreate the current analysis session.
This includes logical dataset names, processing scripts, shared libraries, processing steps already
taken, processing parameters, data provenance, and any temporary results (both logical and
physical). It is assumed that the size of the execution state will be small enough to be easily
replicated to other locations.
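As a concrete illustration, the sketch below (in Python; all field names are illustrative assumptions, not part of any GAE interface) groups the items listed above into a single serializable record. The working assumption that such a record stays small is what makes shipping it with every command, and replicating it to other locations, cheap.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExecutionState:
    """Hypothetical record of everything needed to recreate an interactive session."""
    logical_datasets: list[str] = field(default_factory=list)    # logical dataset names
    processing_scripts: list[str] = field(default_factory=list)  # interpreted scripts
    shared_libraries: list[str] = field(default_factory=list)    # shared libraries loaded so far
    steps_taken: list[str] = field(default_factory=list)         # processing steps already executed
    parameters: dict[str, str] = field(default_factory=dict)     # processing parameters
    provenance: list[str] = field(default_factory=list)          # data provenance records
    temp_results: list[str] = field(default_factory=list)        # logical/physical names of temporary results

    def serialize(self) -> bytes:
        """Serialize for shipping alongside every interactive command."""
        return json.dumps(asdict(self)).encode("utf-8")

# The small size of the serialized state is what makes per-command replication feasible.
state = ExecutionState(logical_datasets=["lfn:/cms/example-dataset"],
                       parameters={"pt_cut": "20"})
print(len(state.serialize()), "bytes")
```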

Once an interactive session is finished, all the execution steps and their order are captured in the execution state. As such, the “recorded” interactive session can be “replayed” as a batch session with minimal intervention from the user.
Batch and interactive analysis both involve the allocation of execution nodes on which the actual processing occurs. In batch analysis there is little advantage to reusing the same execution node multiple times, since the execution state and order are known in advance. Interactive analysis, however, benefits from the reuse of an execution node, since much of the state of the interactive session resides on that node.
The state for an interactive session cannot be completely stored on the execution node. Local scheduling policies and resource failures can make an execution node unavailable for the rest of an interactive session. As such, the interactive client must be able to transfer the state of the interactive session to another execution node with no required involvement from the user. The transition from one execution node to another (due to node failure) should be transparent to the user. The only reliable copy of the current interactive execution state is the one held by the client application (or, for resource-limited clients such as hand-held devices, by the grid service entry host). The state stored on an individual execution node must be treated as a cache of this canonical state held by the client application. This execution node state cache can be leveraged to provide lower execution latencies by reusing the same
execution node as much as possible during an interactive session. The state for an interactive
session is sent with every interactive command so that execution node failures can be handled
transparently to the client application (more on this below).
Figure 2.1. Execution state
Figure 2.1 shows how state is managed during an interactive session, as described in Scenario 1.
Scenario 1: State Information:
The Grid scheduler locates a suitable Job Processing and Execution Service (JPES) instance to handle the job. The JPES then sends the command to an execution node. The execution node executes the command and returns the new state back to the client (one possibility is to store the execution state in a CVS repository, which would track not only the state but also the state history). The execution node retains a local cache of the application state for future processing. This is shown as step 1 in Figure 2.1.
The client application then sends the second command to a Grid scheduler. The Grid scheduler
attempts to use the previous execution service to handle the command. When the execution
service receives the command, it attempts to use the previous execution node to handle the
command. If the node is still available, then it is used to handle the command. This is also shown
by step 1 in Figure 2.1.
If the previous execution node is not available (it was rescheduled for a higher priority task, or it
experienced some hardware failure) then the execution service returns a failure code back to the
Scheduler, which then attempts to locate another suitable execution service to handle the
command (execution service 2). A new execution node will be allocated (step 2 in the figure). The
new execution node uses the state that was sent along with the command to recreate the
environment of the previous execution node and begins to process new commands. This is shown
by step 2.1 in Figure 2.1.
If the execution service becomes unavailable, the Scheduler sends the command to another
available execution service instance (step 3 in the figure). As before, the state is sent along with
the command so that the new execution node can recreate the environment of the previous execution node.
By sending all interactive commands through a Grid Scheduler, the client application never needs
to know about execution node failures, or about execution service failures (except for monitoring
purposes). State information (job traceability) is also described in HEPCALII [6].
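The failover logic of Scenario 1 can be summarized in a short sketch. The scheduler and execution service interfaces used here (locate_execution_service, submit) are hypothetical stand-ins, not existing APIs; the point is only that the canonical state travels with every command, so a failed node or service is handled by resubmitting elsewhere without user involvement.

```python
# Sketch of the Scenario 1 failover loop (all interfaces are hypothetical).
class NodeUnavailable(Exception):
    """Raised when the previously used execution node or service cannot be reached."""

def run_command(scheduler, command, state, max_attempts=3):
    """Send one interactive command; retry on another execution service if needed.

    `scheduler` is assumed to expose locate_execution_service(), and each
    execution service to expose submit(command, state) returning new state.
    """
    for attempt in range(max_attempts):
        service = scheduler.locate_execution_service(prefer_previous=(attempt == 0))
        try:
            # The full session state is sent with every command, so any node
            # can recreate the environment of the previous one.
            new_state = service.submit(command, state)
            return new_state          # the client keeps the canonical copy
        except NodeUnavailable:
            # Execution node or service failed; the scheduler will pick another.
            continue
    raise RuntimeError("no execution service could run the command")
```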
2.3 Steering and Intelligent Agents
In order to satisfy users' demand for responsiveness in an interactive system, constant status and monitoring feedback must be provided. In any Grid system it must be possible for the user to redirect slow-running jobs to faster execution nodes, and to perform other redirections based on perceived optimizations. By giving users this control, certain patterns of behavior will start to appear and will give valuable information on how to tune the entire system for optimum performance.
As certain optimization strategies are discovered, it will be possible to write intelligent agents that
can perform these optimizations on behalf of the user. These agents will be able to reschedule
jobs and re-prioritize requests based on the currently available resources. Intelligent agents can
also be used to monitor resource usage profiles and match them to the profiles desired by the
greater Grid community.
A steering service would interoperate with grid schedulers. As users decide on constraints about where to run their jobs or what data to access, schedulers need to translate these into a concrete plan that can be executed.
2.4 Resource Policies
Some resources in a Grid environment are scarce (CPU time, storage space, bandwidth), while others are abundant (services). Not all users of the Grid will be given equal priority for all resources. This is true across the entire spectrum of interactive and batch systems.
In addition to policies on who can use which resource, there also needs to be an accounting mechanism that keeps track of who is using which resources and keeps a history of resource usage. One way of adjusting user priorities is by setting local policies. Managers of the scarce resources can set local policies that give preference to certain users and groups. This is analogous to UNIX process priorities, except that priorities are assigned to users instead of processes.
A more sophisticated way to manage priorities is through the use of a Grid Economy. Users and groups in the Grid are given some number of Grid Dollars (“Higgies”). These dollars can be spent to procure more resources for more important Grid tasks. Similarly, users and groups would “sell” their own resources to earn more Higgies to spend on their own Grid tasks. Large groups (such as research institutes) could distribute Higgies to individual researchers in the form of a Grid Salary, giving a higher salary to researchers who they believe will use their Higgies more wisely. Within this paper we think it is important to recognize the concept of a Grid Economy; however, it will not be addressed in the first phases of the GAE.
3 Requirements
Use cases and requirements for a GAE have been described in detail in HEPCAL [7], HEPCAL II [6], and the PPDG CS 11 requirements document [9]. Some of the requirements focus on specific components that will be used within the GAE. Section 6 associates GAE components with use cases that are relevant for interactive analysis. While requirements and use cases have been finalized in HEPCAL and PPDG CS 11, during the development of the GAE the requirements and use cases remain subject to change and improvement as developers of the system gain more understanding of the complexity of interactive analysis within a Grid environment.
One GAE requirement has been described in section 2.2: the interactive client must be able to
transfer the state of the interactive session to another execution environment (node or farm) with
no required involvement from the user.
The requirements in [7], [6], and [9] focus on physics analysis, mainly from a single user's point of view. This section lists several requirements that the overall GAE system needs to satisfy.
As stated in [7], section 2.5, physics analysis is less predictable than other physics processes such as reconstruction and simulation. Furthermore, physicists from all over the world will access and share resources, and these resources change, move, and disappear over time. In addition to the requirements stated in [7], [6], and [9], interactivity in a Grid imposes the following requirements:
Requirement 1, Fault tolerance: Data loss and loss of time because of failures of components
should be minimized. A corollary of this is that the web services must operate asynchronously so
that the application can continue to run in the face of a slow or unresponsive service.
Requirement 2, Adaptable: Access to data will be unpredictable, however patterns could be
discovered that can be exploited to optimize the overall performance of the system.

Requirement 3, Scalable: The system is able to support an ever-increasing number of resources
(CPU, network bandwidth, storage space) without any significant loss in responsiveness. The
system is also able to support increasing numbers of users. As more users enter the system, the
system will respond predictably and will not become unusable for everyone.
Requirement 4, Robust: The system should not crash if one site crashes (no chain reaction).
Requirement 5, Transparent: It should be possible for Grid clients to get feedback on the
estimated state of grid resources. This would allow clients to make better decisions on what
resources to use within the Grid.
Requirement 6, Traceability: If the performance of Grid resources drops, there should be diagnostic tools available to analyze these anomalies.
Requirement 7, Secure: Security within the GAE will be certificate based: users get a certificate/key pair from a certificate authority. This certificate allows a user to access services within the Grid using a proxy that acts on behalf of the user for authentication. Depending on the security settings and policies (e.g. VO management settings) of the different Grid sites and of the individual services of each site, a user can access the services offered by a particular site. Clarens [10] is one of the applications that support certificate-based security and VO management.
Requirement 8, Independence: The components described in the architecture will be deployed as “stand alone” components: there will be no specific dependency between any two components (for example, a requirement that in order to use replica manager “X” one must also deploy meta-data catalog “Y”). Stand alone components and the use of web services make it possible to “plug in” different components (e.g. different meta-data catalogs) that export the same interface as used within the GAE.
Requirement 9, Policies: It should not be possible for users or groups to allocate more resources on the Grid than they are allowed to. There should be a mechanism in the Grid that allows organizations to specify policies and that monitors the use of resources by users and groups based on these policies. Policies will describe a measure of “fair use of Grid resources” for all users and groups involved (provided all users agree on the policy). It should be noted that deciding on a policy cannot be solved by any piece of software; it is subject to discussion between the groups and users governed by that policy.

4 Architecture
The first sub section describes the high level view of the architecture, while the second sub section
identifies the different services that are needed for interactive analysis within such a high level
architecture.
4.1 Peer Groups
The LHC experiments' computing models were initially based on a hierarchical organization of a Tier 0, multiple Tier 1's, multiple Tier 2's, etc. Most of the data will “flow” from the Tier 0 to the Tier 1's and from the Tier 1's to the Tier 2's. Furthermore, the Tier 0 and Tier 1's are powerful in terms of CPU power, data storage, and bandwidth. This hierarchical model has been described in “Data Intensive Grids for High-Energy Physics” [11]. Data is organized such that the institutes (represented by tiers) will in most cases be physically close to the data they want to analyze.
The hierarchical model described in [11] is a good starting point, but there will be a certain dynamism within the LHC computing/data model. Although it is possible to make predictions on how to organize data, the analysis patterns of the end users will be unpredictable. Depending on how “hot” or “cold” data is, attention will shift to different data sets. Furthermore, multiple geographically dispersed users might be interested in data that is geographically dispersed across the different Tiers. Such a geographically dispersed user group for particular data can lead to data replication in order to prevent “hot spots” in data access.
Figure 4.1. Peers and Super Peers
Figure 4.1 shows a modification of the hierarchical model: a hierarchical peer to peer model, in which the different Tier-x centers act as peers. The thickness of the arrows represents the data movement between peers, while the size of a peer represents its power in terms of CPU and storage. The green part shows the resources that are idle, yellow the resources being used, and red the resources that are offline. Data are associated with every peer. Red colored data represents “hot” data that is accessed often; blue colored data represents “cold” data that is accessed less frequently.
The hierarchical model is the basis on which data will be distributed; however, the unpredictable behavior of physics analysis as a whole will lead to data and jobs being moved around between different peers. These data and job movements outside the hierarchical model, although relatively small compared to the data movement within the hierarchical model, will still be substantial. When users submit jobs to a peer, middleware will discover what resources are available on other peers to execute the job request. Although a large number of job requests will follow the hierarchical model, other job requests will not. The more powerful peers (super peers) will receive more job requests and data requests and will host a wider variety of services. In Figure 4.1 the T0, the T1's, and one T2 act as super peers.
It is not unlikely that certain Tier 2's could be more “powerful” than a Tier 1 in terms of this measure, and over time the relations between “tier power” can change due to future hardware and software upgrades. As such, the hierarchical model will form the basis of the peer to peer model, but this model is not fixed and can change over time due to the self organizing capabilities of the Grid. A peer to peer architecture addresses requirements 3 and 4 on scalability and robustness.
The services developed within the GAE should be robust and scalable enough to be deployed in the (hierarchical) peer to peer model.
4.2 Services
Based on the discussion of interactive analysis (section 2), the use cases mentioned in HEPCAL, and the requirements above, it is possible to identify a set of core services for the GAE.
Within this section the word "service" is used to describe the different components of the architecture, because these components will be implemented as web services within the GAE. While they have been selected due to their necessity for an interactive Grid environment, many of these services could be used in other Grid environments. In order to satisfy the fault tolerance requirement, all services must be able to operate asynchronously.
The following services (components) have been identified for the GAE:
Sign-on and Virtual Organization: These services provide authentication, authorization, and
access control for data and computing resources. Virtual Organization management services are
also provided to assist in maintaining and modifying the membership of virtual organizations.
The virtual organization services relate to requirements 7 and 9 on security and policy.
Data Catalog / Virtual Data: These services provide data lookup capabilities. They allow datasets to be looked up based on Physics Expressions or other meta data. These services return lists of Logical Filenames. The various HEP experiments will have their own implementations of this service to handle the experiment-specific meta-data and dataset naming.
Lookup: Lookup services are the main entry point to access the Grid. Locations of the various
services are never known in advance; they must be located using a lookup service. The lookup
services allow dynamic lookup of the Grid services, eliminating the need to know the service
locations in advance. Decentralized peer to peer style technologies are key to the operation of the
lookup services to keep their service listings accurate and up to date.
Three possible architectures for lookup service implementations are described below. Each architecture builds upon the previous one, which makes it possible to implement the simplest solution first and build upon it in future implementations.
The first architecture is a centralized lookup service: all clients would contact the centralized service to locate existing service implementations. This would not fit with the decentralized approach within the GAE.

A second architecture is similar to the first, except that the lookup service information is replicated
across many lookup service instances. When a new service instance is registered on one lookup
service, the lookup service will propagate the registration to all other known lookup services. New
lookup service hosts are added to the system just as any service instance is added. Lookup
service hosts will have to ping other service hosts periodically to ensure that their list of services is
kept up to date in case of a failure in the propagation of a service registration. An advantage to this
architecture is that service lookups should be very fast, as the lookup happens locally. A drawback
is that as the number of available service instances grows, so does the amount of data replicated
on each lookup service host.

A third architecture involves distributed queries and distributed service location storage. When a new service instance is registered with a lookup service, the registration stays local; it is not propagated to other lookup service hosts. When a lookup request is received by a lookup service, it can no longer respond authoritatively to the request. Instead, it has to propagate the request on to other lookup services, collect responses, and return them to the requester. This architecture is more fully distributed than the first two architectures. An advantage is that the local service information storage remains small. But the disadvantage is that the response time for a lookup request increases as multiple lookup services are queried.
It has yet to be determined whether the second or the third architecture is preferable. Measurements of the responsiveness of the system with respect to the number of services and the number of service hosts need to be taken before a decision can be made. Lookup services create a more transparent Grid environment, as users and applications are able to locate and identify different services within a dynamic distributed environment, as mentioned in requirement 5.
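The toy model below contrasts the second and third architectures under the simplifying assumption of an in-memory registry per lookup host (a stand-in for the real peer to peer transport): in the replicated variant a registration is pushed to every peer, while in the distributed variant a query fans out to the peers and aggregates their local answers, trading lookup latency for small local storage.

```python
# Toy model of lookup-service architectures two and three (in-memory stand-ins
# for real peer-to-peer transport; all names and URLs are illustrative only).

class LookupHost:
    def __init__(self, name):
        self.name = name
        self.local = {}        # service type -> list of service locations
        self.peers = []        # other known lookup hosts

    # Architecture 2: registrations are replicated to every lookup host.
    def register_replicated(self, service_type, location):
        for host in [self] + self.peers:
            host.local.setdefault(service_type, []).append(location)

    # Architecture 3: registrations stay local; queries are propagated.
    def register_local(self, service_type, location):
        self.local.setdefault(service_type, []).append(location)

    def lookup_distributed(self, service_type):
        results = list(self.local.get(service_type, []))
        for host in self.peers:                      # fan-out adds latency...
            results.extend(host.local.get(service_type, []))
        return results                               # ...but keeps local storage small

a, b = LookupHost("lookup-a"), LookupHost("lookup-b")
a.peers, b.peers = [b], [a]
b.register_local("Scheduler", "http://tier2.example.org/scheduler")  # hypothetical URL
print(a.lookup_distributed("Scheduler"))
```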
Processing Wrapper: The details of using many of the services together should be hidden from
the application. When submitting a processing job request, for example, the application should not
have to contact the Meta data Catalog, Job Scheduler, and the Job/Process Execution services
directly. To simplify these steps, a Processing Wrapper service will be used. This service collects
all of the inputs that would normally be given to the Meta Data Catalog, Job Scheduler, and JPES
services. It then contacts these other services on behalf of the application.
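A minimal sketch of the Processing Wrapper's role, assuming hypothetical client objects for the Meta Data Catalog, Scheduler, and JPES (none of these calls correspond to an existing API): the application hands the wrapper one request, and the wrapper performs the individual service calls on its behalf.

```python
def submit_via_wrapper(catalog, scheduler, jpes, metadata_query, abstract_plan):
    """Hypothetical Processing Wrapper call sequence (interfaces assumed, not real APIs).

    The application supplies everything once; the wrapper contacts the
    Meta Data Catalog, Scheduler, and JPES so that the client does not have to.
    """
    logical_files = catalog.resolve(metadata_query)           # meta-data -> logical filenames
    concrete_plan = scheduler.resolve(abstract_plan, logical_files)
    return jpes.execute(concrete_plan)                        # handle used for monitoring/steering
```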
Scheduler: This service resolves an abstract job plan into a concrete job execution plan for Grid
applications. It also submits the concrete plan to the grid for execution. The abstract job plan is a
breakdown of the individual actions and the order in which they will occur. The concrete execution plan maps the tasks in the abstract plan to specific execution nodes and contains rules for
replicating data to the execution nodes. When an abstract job plan is received by the scheduler, it
first returns a concrete job plan to the user for review. The concrete job plan is then resubmitted to
the scheduler for execution. At this point the scheduler returns back to the client both the concrete
job plan and a handle to a steering service instance. This steering service instance is then used to
monitor the state of the plan execution and to make changes during execution. Note that the job
plan for an interactive job may be as simple as “allocate n interactive execution nodes”. The
abstract and concrete plans used by the scheduler must be usable by any scheduler on the system
for execution; they are not specific to the scheduler that produced them. This reduces the severity
of a failure of a scheduling server.
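The distinction between abstract and concrete plans might be captured by structures like the following sketch; the field names and the site/replica selection hooks are assumptions for illustration, and the real plan representation would be defined by the Scheduler service itself.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractTask:
    """One step of an abstract job plan: what to run, not yet where."""
    name: str
    depends_on: list[str] = field(default_factory=list)
    logical_inputs: list[str] = field(default_factory=list)

@dataclass
class ConcreteTask:
    """The same step after scheduling: bound to a site, with chosen input replicas."""
    name: str
    execution_node: str
    input_replicas: list[str]
    depends_on: list[str] = field(default_factory=list)

def resolve(abstract_plan, pick_site, pick_replica):
    """Map an abstract plan to a concrete one; site/replica selection is supplied by the caller."""
    return [ConcreteTask(name=t.name,
                         execution_node=pick_site(t),
                         input_replicas=[pick_replica(lfn) for lfn in t.logical_inputs],
                         depends_on=t.depends_on)
            for t in abstract_plan]
```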
Replica Management: These services provide the capability to copy and move data around the Grid. One particular use would be to replicate remote data to a local staging location for more efficient processing. The Replica Management service makes use of lower level Replica Location, Meta data, and Replica Optimization services to control the movement of data around the Grid. Replica management will prevent bottlenecks in data access on the grid and lowers the chance of losing valuable data. As such, replica management adds to Grid scalability and robustness (requirements 3 and 4).
Replica Selection: This service is used to locate the optimum replica to use for processing. It is
used primarily by the Scheduling service in conjunction with the Execution Location service to
determine the optimum locations for staging and processing data.
Steering/Optimization: The client application obtains a handle to a steering service once the
concrete job plan is submitted to the scheduler for execution. The steering service provides a set
of APIs for sending commands to interactive execution nodes and tuning concrete job processes.
These APIs will allow jobs to be transferred to different execution nodes and allow data to be
replicated between hosts. The steering service acts as a proxy between the client and the
scheduler. The steering service also provides a message subscription API (such as Java Message
Service [12] ) to allow the client to subscribe to specific monitoring and job update notices. The
steering service is responsible for routing the appropriate messages from the various monitors
back to the client. Intelligent agents can also use the steering service to make decisions on behalf
of the user, or the greater Grid community. If a failure in the Steering service occurs then the client
application can submit the concrete job plan to another Steering service instance. The new
steering service will contact any active job tasks (as indicated by the concrete job plan) and
resubscribe the client to the monitoring information.
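The message subscription API could behave roughly like the in-process stand-in below; the SteeringProxy class and topic names are illustrative only, and a real deployment would use a JMS-style broker (the Java Message Service [12] mentioned above) rather than local callbacks.

```python
# In-process stand-in for the publish/subscribe API the steering service would
# expose to clients (topic names are illustrative).
from collections import defaultdict

class SteeringProxy:
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> registered callbacks

    def subscribe(self, topic, callback):
        """Client registers interest in specific monitoring or job-update notices."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Called by monitors/JPES; the proxy routes messages back to subscribed clients."""
        for callback in self._subscribers[topic]:
            callback(message)

steering = SteeringProxy()
steering.subscribe("job.status", lambda msg: print("status update:", msg))
steering.publish("job.status", {"task": "filter-step-1", "progress": 0.4})
```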
Figure 4.2. Steering service.
Steering and optimization allow job requests to adapt to the dynamic environment of the Grid.
Adaptability was discussed in requirement 2.
Monitor/Workflow: These services provide information on the current state of the job plan. A publish/subscribe messaging API (such as JMS [12]) is used to register listeners to monitoring data. The Steering service uses Monitors on the execution nodes to send job status feedback to the user. A monitoring service should not only keep track of the state of the job, but also of the state of the resources on which jobs are submitted (the "Grid weather"). This information can then be used by steering services to make decisions about moving jobs around (self organization). Monitoring enables diagnosing and analyzing Grid resources and provides transparency (requirements 5 and 6).
Quota/Accounting:
The system needs to be able to satisfy all jobs by allocating storage, computer and network
resources in a fashion that satisfies both global and local constraints. Global constraints include
community-wide policies governing how resources should be prioritized and allocated. Local
constraints include site-specific control as to when or how much external users are allowed to use
local resources. For example, individual resource providers (as well as a Virtual Organization as a
whole) may assign usage quotas to intended users and job monitoring mechanisms may be used
to tally accounting information for the user. As such, the architecture requires services for storing
quota and accounting information so that other services (such as a scheduling service or a steering
service) may mine that information for enforcement strategies. Due to the distributed nature of the
problem, enforcement of global constraints would necessarily be "soft" and on a best-effort basis,
whilst enforcement of local constraints would be enforced by the local site itself and thus be "hard."
Accounting enables “fair” access to resources as specified by policies (requirement 9).
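As a sketch of how a scheduling or steering service might mine this information, the hypothetical admission check below combines hard local quotas with soft, best-effort global constraints; the quota structures and thresholds are assumptions, not a defined GAE interface.

```python
def admit(job_cost, user, local_quota, global_usage, global_target):
    """Hypothetical admission check combining hard local quotas with soft global policy.

    Local constraints are enforced strictly by the site; global constraints are
    best-effort, so exceeding them only lowers priority rather than rejecting.
    """
    if local_quota.get(user, 0.0) < job_cost:
        return "reject"                      # hard: site-level quota exhausted
    if global_usage.get(user, 0.0) > global_target.get(user, float("inf")):
        return "accept-low-priority"         # soft: over the community-wide share
    return "accept"

print(admit(10.0, "alice",
            local_quota={"alice": 50.0},
            global_usage={"alice": 120.0},
            global_target={"alice": 100.0}))
```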
Estimators: Estimators let the application know, in advance, how much resource (time, space,
etc.) a particular action might take, and provide a “progress meter” to the application. An estimator
service can use past performance as a predictive measure of future performance and can
extrapolate from the current action's execution status. Most services will use estimators to give
feedback to users, applications, and agents. Estimators are not necessarily a service unto
themselves; they may be part of the service API of the other services. For example, the Replica
Management Service may contain an estimator as part of its API for moving data around the Grid.
Estimators relate to requirement 5.
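A minimal sketch of such an estimator, using past performance and current progress to extrapolate the remaining time (the blend of historical and current rates is an illustrative choice, not a prescribed algorithm):

```python
def estimate_remaining(history_rates, done_fraction, elapsed):
    """Estimate remaining time from past throughput and current progress.

    `history_rates` are fractions-of-work-per-second observed in earlier, similar
    actions (past performance as a predictive measure); the current run's own rate
    is mixed in once some progress has been made.
    """
    prior = sum(history_rates) / len(history_rates) if history_rates else 0.0
    current = done_fraction / elapsed if elapsed > 0 and done_fraction > 0 else prior
    rate = 0.5 * prior + 0.5 * current if prior else current
    if rate <= 0:
        return None                     # no basis for an estimate yet
    return (1.0 - done_fraction) / rate

# e.g. a "progress meter": 40% done after 120 s, with two comparable past runs
print(estimate_remaining([0.004, 0.005], done_fraction=0.4, elapsed=120.0))
```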
Job/Process Execution: Executes a set of jobs as part of a concrete job plan. The service
provides job progress and estimate updates to the steering service. The Steering service collects
the information from all of the JPES involved in the user's session and sends the collection of
updates back to the user.
Data Collection: Provides a way for the application to obtain the final result from the execution of the job/process execution service, and combines the data that has been generated at multiple
locations. The result will be either a logical dataset name for large result sets or the actual result
data for small result sets (as requested by the client). The result should be a Logical Dataset which
can then be mapped to a physical dataset using the Replica Management service. The data
collection service also collects intermediate results from the individual job/process plan steps.
Figure 4.3. GAE services.
Figure 4.3 shows the set of services discussed in this section, embedded with several basic grid services. Some of these services are designated “high-level” services, as they build on existing middleware services.
4.3 Self organization
Self organization means that components within the GAE make decisions based on monitoring information without human intervention. Autonomous behavior can be inserted into several GAE components to create a “self organizing” Grid (a sketch of one such decision follows the list below):
• Steering/optimization: Monitor the Grid weather and decide if a job can be moved to other
computing nodes based on policies for this job.
• Scheduling: Keep track of history of scheduling decisions and how “good” these decisions were.
• Replica manager: Keep track of data access patterns and base the replication strategy on them.
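As an example of the replica manager bullet above, the sketch below derives replication candidates purely from observed access counts; the log format and threshold are illustrative assumptions.

```python
from collections import Counter

def replication_candidates(access_log, threshold=100):
    """Pick "hot" datasets to replicate, based only on observed access counts.

    `access_log` is a sequence of (logical_dataset, requesting_site) pairs as a
    monitoring service might report them; the threshold is illustrative.
    """
    counts = Counter(lfn for lfn, _site in access_log)
    return [lfn for lfn, n in counts.items() if n >= threshold]

log = [("lfn:/cms/hot-dataset", "T2_A")] * 150 + [("lfn:/cms/cold-dataset", "T2_B")] * 3
print(replication_candidates(log))   # -> ['lfn:/cms/hot-dataset']
```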
5 Scenarios
Based on the use cases in [7], [6], and [9], and the requirements and components described above, it is possible to formulate several scenarios that show the interaction of the GAE components. The intention is not to give an exhaustive list of scenarios, but rather a few that identify core component interactions. One scenario related to state information within the GAE has been described in section 2.2.
5.1 Scenario 2: A simple interactive session from a single user's point of view
This scenario is based on a description of the analysis activity described in section 3 of the
HEPCALII document [6]:
1) Application contacts the Virtual Organization Service to authenticate. An authentication token is returned and passed to other Grid services to determine authorization of service use at a local level.
2) Application uses a Lookup Service to locate a suitable Process Wrapper service on the Grid.
3) Application sends an abstract job plan to the Process Wrapper. This abstract plan can be as
abstract or concrete as necessary. If the user knows in advance where data is stored or which
execution nodes must be used, then that information can be included in the plan. Or the plan
can leave these as abstract entries to be resolved by the scheduler. The Process Wrapper acts
as an intermediary between the scheduler, the execution service, and the metadata service,
eliminating the need for the application to have to communicate with these services directly.
4) Process Wrapper uses a lookup service to locate Scheduling and Metadata services on the
Grid. A Metadata catalog service is used to map any meta data in the abstract plan into logical
data filenames.
5) The process wrapper sends the abstract plan to the scheduler. The scheduler further resolves the abstract plan into a concrete plan by locating logical file replicas and available execution sites.
6) The scheduler contacts the accounting service to estimate how many resources this user/group already has in use, and contacts estimator services to get an estimate of how many resources this job will use and for how long.
7) The scheduler sends the concrete job plan back to the user for approval. This gives the user a
chance to verify the plan and view the estimate of resource usages.
8) The scheduler sends a copy of the job plan to a steering service, and sends a handle to the
steering service back to the user.
9) The scheduler initiates any data movement on the grid and starts submitting jobs to the local
execution services. Data processing begins.
10)As processing is happening, monitoring services return status information back to the
application (via a steering service) so that the user gets real time feedback on the state of the
processing. Steering services are used to modify the plan, resubmit processing tasks, and fine-
tune the processing. The Job/Process Execution Service returns partial data results to a Data
Collection Service. New interactive tasks are submitted iteratively.
11) When processing is completed, the Data Collection Service returns the final Logical Dataset Filename to the application, including a state description of the job that can be used for further processing.
12)User reviews the results and submits a new set of processing instructions for further
processing.
Within the GAE, hundreds of users would perform the above scenario several times per interactive session, in a decentralized and unpredictable manner; a client-side sketch of this flow follows below.
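The client-side flow of scenario 2 might look roughly like the following sketch; every service interface used here (lookup.find, sign_on, steering_handle, and so on) is a hypothetical placeholder, and the step numbers in the comments refer to the scenario above.

```python
def interactive_session(lookup, abstract_plan, metadata_query):
    """Client-side sketch of Scenario 2 (all service interfaces are assumptions)."""
    vo = lookup.find("VirtualOrganization")
    token = vo.sign_on()                                            # step 1
    wrapper = lookup.find("ProcessWrapper")                         # step 2
    session = wrapper.submit(token, abstract_plan, metadata_query)  # steps 3-6
    if not session.approve_plan():                                  # step 7: user reviews plan/estimates
        return None
    steering = session.steering_handle()                            # step 8
    steering.subscribe("job.status", print)                         # step 10: real-time feedback
    session.start()                                                 # step 9: data movement and job submission
    return session.collect_results()                                # steps 11-12: logical dataset name
```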
Figure 5.1 shows the set of interacting web services that comprise the GAE as described above, combined with a typical execution flow within the system. The numbers in figure 5.1 correspond to the step numbers of scenario 2 in this section. The colors in figure 5.1 highlight the different service classes as shown in figure 4.3. Within figure 5.1, services can reside on different grid service hosts; the user is not aware of how these services are distributed. Figure 5.1 is a snapshot in time, as services can disappear (failure, obsolescence) or be duplicated at other grid service hosts dynamically over time. The grid service hosts act as peers and super peers within the grid environment.
Figure 5.1. Interaction of the different services in the GAE
5.2 Scenario 3: A Complex Interactive Session in a Multi-User Environment
This scenario is based on scenario 1 (section 2.2) and scenario 2 (section 5.1). Instead of
describing all the steps in detail it will refer to scenario 1 and scenario 2 to prevent a long and
unclear scenario description:
1) User requests several GB of data on which to perform a coarse filter. This involves steps 1-4 of
scenario 2.
2) Steps 5-11 of scenario 2 are executed. In steps 9 and 10 of scenario 2, several batch processes will be interrupted, as these filtering jobs are estimated to be relatively short.
3) Data processing starts at multiple places.
4) During processing an execution node fails; this is fed back to the JPES (scenario 1 kicks in).
5) Data has been filtered and the results are combined by the data collection service. As the result is still relatively large, the data collection service sends logical names back to the user.
6) User sends another collection of parallel jobs to the scheduler to do finer granularity filtering on
the data associated with the logical name returned in step 5. Each of these parallel jobs will only

access a part of the data associated with the logical name returned in step 5 (these parts are
complementary and non overlapping).
7) Steps 5-11 of scenario 2 are executed. In steps 9 and 10 of scenario 2, several batch processes will be interrupted, as these filtering jobs are estimated to be relatively short.
8) Data processing starts at multiple places.
9) The user monitors the progress of the jobs and notices that one job slows down considerably. He decides to reschedule this job by sending a “reschedule and make it go faster” command to the steering service.
10) During processing one execution node fails; this is fed back to the JPES, which reschedules the job on the same farm (scenario 1 kicks in).
11) Data has been filtered and the results are combined by the data collection service. As the result is still relatively large, the data collection service sends logical names back to the user.
12)For the final step a neural net algorithm will be used. The “right” parameter settings of a neural
net algorithm are not known in advance and several runs of the neural net algorithm are
necessary.
13) The user submits the multiple neural net algorithms (each with different parameters) to the Scheduler. All these algorithm instances will access the data associated with the second logical name that was returned in step 11.
14) Steps 5-11 of scenario 2 are executed. In steps 9 and 10 of scenario 2, several batch processes will be interrupted, as these neural net jobs are estimated to be relatively short.
15)Data processing starts at multiple places.
16) Data has been filtered and the results are combined by the data collection service. As the result is relatively small, the data collection service sends the complete data set to the user.
17) The user visualizes the results, but is not content with them. He decides to apply a different, finer granularity; this process restarts at step 6 of this scenario.
6 Associating GAE components to HEPCAL use cases
Below is a list that associates use cases from the HEPCAL documents to GAE components. The
use cases can be found at the end of the HEPCAL documents:
1. “Grid Login” [7]: Sign on/Virtual Organization

2. “DS Meta-Data update” [7]: Catalog /Virtual data
3. “DS Meta-Data access” [7]: Catalog/Virtual data
4. “Dataset registration” [7]: Catalog/Virtual Data
5. “Virtual Dataset Declaration” [7]: Catalog/ Virtual Data
6. “(Virtual) Dataset Access” [7]: Catalog/Virtual Data
7. “Dataset Access Cost Evaluation” [7]: Estimator
8. “Data Set Replication” [7]: Replica management, Replica Selection, Catalog
9. “Data Set Deletion, Browsing, Update” [7]: Catalog
10.“Job Submission” [7]: Sign on, Catalog, Steering, Scheduler, Job Execution, Monitoring
11.“Job Output Access or Retrieval” [7]: Data Collection
12.“Job Control” [7]: Steering service
13.“Steer Job Submission” [7]: Steering service
14.“Job Resource Estimation” [7]: Estimators
15.“Analysis 1 “[7] :Sign on, Catalog, Steering, Scheduler, Job Execution, Monitoring, Data
collection
16.“Data Transformation” [7]: Sign on, Catalog, Steering, Scheduler, Job Execution, Monitoring,
Data collection
17.“Job Monitoring” [7]: Monitoring
18.“VO wide Resource Reservation” [7]: Job Scheduler
19.“VO wide Resource allocation to Users” [7]: VO management
20.“End User Analysis” [6]: Sign on, Catalog, Steering, Scheduler, Job Execution, Monitoring, Data
collection;
7 Related Work associated with GAE Components
As mentioned in section 1, the GAE described in this paper is not a “stand alone” effort but relies on ongoing work within Grid research. It is not the intention to give an exhaustive overview in this section, but to highlight some of the Grid related research and associate it with the different architecture components of the GAE. This related research can serve as a basis for the implementation of the different components within the GAE, but it does not imply that it is always possible to take the existing components and “plug” them into the GAE. It is not unlikely that certain GAE components need to be built from “scratch” where existing components do not meet the requirements related to interactive analysis. However, it needs to be investigated which components can be used.
Several GAE components cannot currently be associated with ongoing Grid research: steering & optimization, work flow, estimators, data collection, and accounting.
Sign-on/Virtual organization
The Clarens [10], [13] Grid-enabled web services framework is a system that allows new web services to be registered and deployed within a wide area network. It combines powerful virtual organization (VO) management with architectural simplicity, and is available in Python and Java implementations. Clarens can act as the “backbone” of the GAE and can host the VO services and lookup services. For other components (monitoring, job planning, meta data catalog, etc.) that are not developed within the Clarens environment, Clarens interfaces can be developed that would allow these components to act as web services within a wide area network, using the VO management and authentication of Clarens. As such, Clarens can not only be used for Virtual Organization management within the GAE but can also provide wrappers for other components and provide interoperability between these components.
Data Catalog /Virtual Data
POOL (Pool Of persistent Objects for LHC) [14], [15] is a persistency framework for physics applications at the LHC that contains (among other things) a meta data service. It is often useful to provide an efficient means for sub-selection of objects within a given collection based upon a manageable number of properties or attributes that may be readily queried without navigation into or restoration of the objects themselves. The POOL project is exploring the treatment of such attribute lists as object-level meta-data. The purpose of the POOL meta-data component is to provide the infrastructure needed to support this view in a way that is consistent with meta-data services at other levels of a persistence architecture.
The Chimera Virtual Data System (VDS) [16] provides a catalog that can be used by application
environments to describe a set of application programs ("transformations"), and then track all the
data files produced by executing those applications ("derivations"). Chimera contains the
mechanism to locate the "recipe" to produce a given logical file, in the form of an abstract program execution graph. These abstract graphs are then turned into an executable DAG.

Components of both POOL and Chimera could be used as a basis for the meta-data catalog
service.
Lookup
JXTA [17] is a set of protocols for developing peer to peer systems. It provides peer to peer
middleware capabilities such as discovery, lookup, and communication channels. JXTA is a
language-neutral set of APIs. In fact, the JXTA project provides Java, C, Perl, and limited Python
libraries. More importantly, two peers in the JXTA network can be written in different languages
but still communicate through the language-agnostic protocols. While JXTA provides a means for locating peers and services on the JXTA network, it does not enforce any specific service invocation standard. SOAP [18], XML-RPC [19], or RMI can be used to invoke JXTA services as long as both
peers can communicate through the same protocol.
Jini [20] is a Java RMI based system. Services are registered with a Jini lookup service using the
Jini API. RMI is used to invoke remote services. This use of RMI enables services to be run
remotely, and it enables services to be downloaded and run locally. However, RMI by nature is a
Java-only transport. Jini does not use SOAP or XMLRPC as a transport protocol. Due to its
language specific nature and lack of support for Web Service standards, Jini is not suitable for the
larger GAE environment. Jini can be used in some areas of the GAE environment, however. For
example, the MonALISA [21] monitoring system is a Jini based network of monitoring agents that
also provides a web service interface to allow access to the monitoring information from non-Java hosts.
UDDI [22] is a web service discovery specification for SOAP-based web services. The standard is
currently only implemented as part of some commercial packages. It has not seen wide adoption
outside of these products. Apart from this restriction, the standard itself also lacks specific support
for secure registration, including secure third party registration of services.
Replica Management/Selection
Within the Grid community there is ongoing research into replica location and replica management.
Reptor [23] is a replica management package developed within the European DataGrid project.
Giggle [24] is a framework in which a wide range of replica location services can be defined. Within
CMS.

Scheduler/Execution Location
Sphinx: The central components of the Sphinx system are the scheduling servers. Within a VO, resources are assigned to the domain of a Sphinx server, possibly dynamically. While the VO may allow shared access to these resources, this impacts the optimality of decisions made by the Sphinx server. Within its domain of the grid, the server performs several functions. First, the server
monitors the status of its resources. Second, it maintains catalogs of data, executables and their
replicas. Third, it tracks the progress of all the jobs that are currently executing and have executed.
Fourth, it provides estimates for the completion time of the requests on these resources. Fifth, it
decides how best to allocate those resources to complete the requests. The scheduling servers
can be replicated to improve robustness and response time, and different VOs can use servers
with customized algorithms or interfaces for VO specific systems. Additionally, the scheduling
servers use a database infrastructure to maintain the state of the current scheduling process. This
not only simplifies development, but also provides a fault tolerant system for inter-process
communication, and a robust recovery system that can detect internal component failure. The
architecture also allows for the addition of new modules without necessarily affecting the logical
structure of already written modules. These could be provenance tracking or other high level
modules, for example. Currently, such planning modules are predetermined at runtime; however
one interesting future direction of research is the possibility of dynamically loading planning
modules.
The Sphinx scheduling client is an agent for processing user scheduling and execution requests
and communicates with the Sphinx scheduling server using the Clarens web-service (based on
XML-RPC). The scheduling client interacts with both a remote scheduling server, which allocates resources for task execution, and a grid execution system (e.g. DAGMan/Condor-G). Because of
this close connection to external components, it is often important that the client be as lightweight
as feasible. In this way, the client can be easily modified if external components change. In
addition, the Clarens web-service allows controlled access to the Sphinx Data Warehouse enabling
customized planning modules to also be located on a Sphinx Client.
Monitoring
The MonALISA [21] framework provides a distributed monitoring service system using Jini/Java and WSDL/SOAP [25] technologies. Each MonALISA server acts as a dynamic service system and provides the functionality to be discovered and used by any other services or clients that require such information. The goal of MonALISA is to provide the monitoring information from large and distributed systems to a set of loosely coupled "higher level services" in a flexible, self describing way. Within the GAE context these “higher level services” are the steering service, the job scheduler, and the end user application.
Job/Process Execution
Condor-G [26] is the job management part of Condor: it lets you submit jobs into a queue, keeps a log detailing the life cycle of your jobs, and manages your input and output files. Condor-G uses the Globus Toolkit [27] to start the job on the remote machine. Once the process wrapper and the job scheduler within the GAE have determined what resources, applications, and data are needed, they can contact a Condor-G instance to execute the job.
Other related work not directly associated to one specific GAE component
VDT
The Virtual Data Toolkit (VDT) [28] is a packaging of a set of fundamental grid middleware, decomposed into three primary packages: a server, a client, and an SDK. The server contains software which manages the resources at a grid site, while the client contains software one would use to submit jobs for execution at a remote grid site. The SDK contains libraries to develop new software.
+------------------------------+------------------------------+-----------+
| Server                       | Client                       | SDK       |
+------------------------------+------------------------------+-----------+
| Chimera Virtual Data System  | Chimera Virtual Data System  |           |
| Condor/Condor-G              | Condor/Condor-G              | ClassAds  |
| Fault Tolerant Shell         | Fault Tolerant Shell         |           |
| Globus 2                     | Globus 2                     | Globus 2  |
| Globus Replica Manager       |                              |           |
| CA signing policies          | CA signing policies          |           |
| Glue schema                  | Glue schema                  |           |
| EDG mkgridmap                | EDG mkgridmap                |           |
| EDG Cert. Revocation List    | EDG Cert. Revocation List    |           |
| GSI OpenSSH                  | GSI OpenSSH                  |           |
| Java JDK 1.1.4               | Java JDK 1.1.4               |           |
| KX509/KCA server             | KX509 client                 |           |
| MonALISA server              | MonALISA client              |           |
| MyProxy                      | MyProxy                      |           |
| PyGlobus                     | PyGlobus                     | PyGlobus  |
| RLS Server                   | RLS Client                   | RLS SDK   |
|                              |                              | NetLogger |
+------------------------------+------------------------------+-----------+
A short description of most of the VDT components follows: The Chimera Virtual Data System is a
tool for specifying how a data product is/was produced. Chimera consists of a Virtual Data
Language for specifying derived data dependencies in terms of a DAG, a database for persistently
storing the DAG representation, an abstract planner for planning a
workflow which is location and data existence independent, and a concrete planner for mapping an
abstract workflow onto specified grid resources. Condor represents a local scheduler for a
compute cluster at a grid site running the VDT Server, while Condor-G represents a manager for
submitting jobs to a remote grid site from a machine running the VDT Client. The Fault Tolerant
Shell represents a UNIX shell which automatically retries a GridFTP transfer a configurable number
of times before reporting failure. The VDT Server's Globus installation includes the gatekeeper, jobmanager, GridFTP server, etc. The VDT Client installation of Globus includes the libraries necessary to submit jobs and the GridFTP client, among other things. The CA signing policies
represent a list of trusted certificate issuing authorities. The EDG mkgridmap is a tool for
generating the Globus gridmap file from a remote VO server containing a list of user subject names
known by the VO. The EDG Certificate Revocation List contains a list of user subject names which
have been revoked. GSI OpenSSH represents a Grid Security Infrastructure modified version of
OpenSSH. MonALISA is a distributed monitoring system for computer, storage, and networking
resources and is described elsewhere in this document. PyGlobus represents a python binding to
Globus. And, finally, RLS represents the Replica Location Service, which consists of a Local Replica Catalogue (LRC) located at each grid site (VDT Server) and a Replica Location Index (RLI) located at specified service sites and machines running a VDT Client. To resolve a logical filename (LFN) to a physical filename (PFN), a query is made to an RLI, which returns a list of LRCs that likely contain the requested LFN to PFN mappings.
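A toy model of this two-level resolution, with purely illustrative catalogue contents and host names:

```python
# Toy model of RLS name resolution: an RLI points to the LRCs that are likely
# to hold a given LFN, and each LRC maps LFNs to PFNs (all data illustrative).
rli_index = {"lfn:/cms/evt-001": ["lrc.tier1.example.org", "lrc.tier2.example.org"]}
lrc_catalogues = {
    "lrc.tier1.example.org": {"lfn:/cms/evt-001": ["gsiftp://se1.tier1.example.org/evt-001"]},
    "lrc.tier2.example.org": {},
}

def resolve(lfn):
    """Return all PFNs for a logical filename by querying the suggested LRCs."""
    pfns = []
    for lrc in rli_index.get(lfn, []):
        pfns.extend(lrc_catalogues[lrc].get(lfn, []))
    return pfns

print(resolve("lfn:/cms/evt-001"))
```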
The VDT Server, Client, and SDK are currently installed using the Pacman packaging software
which also automatically configures each middleware component after installation.
Alien [29] is a Grid-like system that focuses on batch job execution management within the high energy physics community. Although Alien deals with physics analysis, it lacks several characteristics that will be addressed within the GAE:
• Centralized approach. Several components of Alien are centralized, such as the job scheduler.
This creates a single point of failure within a distributed system.
• Lack of monitoring information. Although Alien has some monitoring features, it does not actively use this monitoring information to (re)schedule jobs (e.g. via a steering service).
• Alien focuses on batch analysis. However components from Alien might be reused or adapted
within GAE.
Proof [30] is a package based on ROOT for distributed analysis using a master-slave approach. Although Proof offers interactive distributed analysis, it differs from the architecture and ideas presented in this paper. Proof is one of the analysis tools that will be used within a Grid analysis environment, next to other analysis tools such as ROOT (the single process counterpart of Proof). Proof does not address the multi-user analysis environment that is addressed within this proposal, but rather manages several processes for one user. A Grid analysis environment needs not only to schedule and handle the jobs of one user, but also to match the scarce resources available within the Grid against the resource requests of multiple users.
Distributed Interactive Analysis of Large Datasets (DIAL) [31] is a project that focuses on handling
of large data sets within an interactive analysis environment. Although this is not the primary focus
of the GAE (see section 1), it is an important problem that is addressed within DIAL. As such the
work within the DIAL project will be largely complementary to the GAE project, and both projects can benefit from each other's complementary results.
Figure 7.1 shows a mapping from some of the existing work to GAE services.

Figure 7.1. Mapping existing work to GAE services.
8 Work packages
This section gives a description of the different work packages and milestones of the GAE project.
These work packages and milestones are based on the components described in this architecture
document.
Within the work packages and milestone descriptions the word "client" refers to an entity that
accesses the Grid. A client can be a real person (user), or an application. The milestone is a
description of what (minimal) functionality the system should have at that point. Milestones will be
verified by one or more scenarios. Scalability is an implicit assumption within every milestone
description. Scalability refers to multiple clients accessing the GAE system and using multiple resources within the GAE system to accomplish their tasks. At every milestone the scalability of the GAE system will be tested.
Projects such as GAE are executed within a distributed environment of multiple collaborating
