

Grids, P2P and Services Computing


Frédéric Desprez • Vladimir Getov
Thierry Priol • Ramin Yahyapour
Editors

Grids, P2P and Services
Computing



Editors
Frédéric Desprez
INRIA Grenoble Rhône-Alpes
LIP ENS Lyon
69364 Lyon Cedex 07
France

Vladimir Getov
University of Westminster
School of Electronics and
Computer Science
HA1 3TP London
United Kingdom


Thierry Priol
INRIA Rennes - Bretagne Atlantique
Campus universitaire de Beaulieu
35042 Rennes Cedex
France

Ramin Yahyapour
TU Dortmund University
IT & Media Center
44221 Dortmund
Germany


ISBN 978-1-4419-6793-0
e-ISBN 978-1-4419-6794-7
DOI 10.1007/978-1-4419-6794-7
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010930599
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY
10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by
similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)


Preface


The symposium was organised by the ERCIM (European Research Consortium for Informatics and Mathematics) CoreGRID Working Group (WG), funded by ERCIM and INRIA. This Working Group, sponsored by ERCIM, has been established with two main objectives: to ensure the sustainability of the CoreGRID Network of Excellence, which is requested by both the European Commission and the CoreGRID members who want to continue and extend their successful co-operation, and to establish a forum to foster collaboration between the research communities that are now involved in the area of Service Computing: namely high-performance computing, distributed systems and software engineering.
CoreGRID officially started in September 2004 as a European research Network of Excellence to develop the foundations, software infrastructures and applications for large-scale, distributed Grid and Peer-to-Peer technologies. Since then, the Network has achieved outstanding results in terms of integration, working as a team to address research challenges, and producing high-quality research results. Although the main objective was to solve research challenges in the area of Grid and Peer-to-Peer technologies, the Network has adapted its research roadmap to also include the new challenges related to service-oriented infrastructures, which are very relevant to European industry, as illustrated by the NESSI (Networked European Software and Services Initiative) effort to develop the European Technology Platform on Software and Services. Currently, the CoreGRID WG is conducting research in the area of the emerging Internet of Services, with direct relevance to the Future Internet Assembly. The Grid research community has not only embraced but has also contributed to the development of the service-oriented paradigm, to build interoperable Grid middleware and to benefit from the progress made by the services research community.





The goal of this one-day workshop, organized within the frame of the Euro-Par 2009 conference, was to gather participants of the Working Group, to present the topics chosen for the first year, and to attract new participants.
The program was built upon several interesting papers presenting innovative results for a wide range of topics, ranging from low-level optimizations of grid operating systems to high-level programming approaches.
Grid operating systems have a bright future, simplifying access to large-scale resources. XtreemOS is one of them, and it was presented in an invited paper by Kielmann, Pierre, and Morin.
Seamless access to data at a large scale is offered by Grid file systems such as BlobSeer, described in a paper from Tran, Antoniu, Nicolae, Bougé, and Tatebe.
Failures and faults are among the main issues of large-scale production grids. A paper from Andrzejak, Zeinalipour-Yazti, and Dikaiakos presents an analysis and prediction of faults in the EGEE grid.
A paper from Cesario, De Caria, Mastroianni, and Talia presents the architecture of a decentralized peer-to-peer system applied to data mining.
Monitoring distributed grid systems allows researchers to understand the internal
behavior of middleware systems and applications. The paper from Funika, Caromel,
Koperek, and Kupisz presents a semantic approach chosen for the ProActive software suite.
Resource discovery in large-scale systems deserves a distributed approach. The paper from Papadakis, Trunfio, Talia, and Fragopoulou presents an approach combining dynamic querying with a distributed hash table.
A paper from Carlini, Coppola, Laforenza, and Ricci proposes a scalable approach for resource discovery that allows range queries while minimizing network traffic.
Skeleton programming is a promising approach for high-level programming in distributed environments. The paper from Aldinucci, Danelutto, and Kilpatrick describes a methodology that allows multiple non-functional concerns to be managed in an autonomic way.
In their paper, Moca and Silaghi describe several decision models for resource aggregation within peer-to-peer architectures, allowing different classes of decision aids to be taken into account.
Workflow management and scheduling have received considerable attention from the grid community. The paper from Sakellariou, Zhao, and Deelman describes several mapping strategies for an astronomy workflow called Montage.
Access control is an important issue that needs to be efficiently solved to allow the wide-scale adoption of grid technologies. The paper from Colombo, Lazouski, Martinelli, and Mori presents a new flexible policy language called U-XACML that improves the XACML language in several directions.
The paper from Fragopoulou, Mastroianni, Montero, Andrzejak, and Kondo describes several research areas investigated within the Self-* and Adaptive Mechanisms topic of the Working Group.


Several research issues around network monitoring, in particular in relation to network virtualization, are presented in the paper from Ciuffoletti.
Research challenges for large-scale desktop computing platforms are described in the paper from Fedak.
Finally, a paper from Rana and Ziegler presents the research areas addressed
within the Service Level Agreement topic of the Working Group.
The Programme Committee who made the selection of papers included:
Alvaro Arenas, STFC Rutherford Appleton Laboratory, UK
Christophe Cérin, Université de Paris Nord, LIPN, France
Augusto Ciuffoletti, University of Pisa, Italy
Frédéric Desprez, INRIA, France
Gilles Fedak, INRIA, France
Paraskevi Fragopoulou, FORTH-ICS, Greece
Vladimir Getov, University of Westminster, UK
Radek Januszewski, Poznan Supercomputing and Networking Center, Poland
Pierre Massonet, CETIC, Belgium
Thierry Priol, INRIA, France
Norbert Meyer, Poznan Supercomputing and Networking Center, Poland
Omer Rana, Cardiff University, UK
Ramin Yahyapour, University of Dortmund, Germany
Wolfgang Ziegler, Fraunhofer Institute SCAI, Germany
All papers in this volume were additionally reviewed by the following external
reviewers whose help we gratefully acknowledge:
Gabriel Antoniu
Alessandro Basso
Eddy Caron
Haiwu He
Syed Naqvi
Christian Perez
Pierre Riteau
Thomas Röblitz
Bing Tang
Special thanks are due to the authors of all submitted papers, the members of the
Programme Committee and the Organising Committee, and to all reviewers, for
their contribution to the success of this event.
Delft, the Netherlands,
August 2009

Frédéric Desprez
Vladimir Getov
Thierry Priol
Ramin Yahyapour



Contents

XtreemOS: a Sound Foundation for Cloud Infrastructure and Federations . . . . . . 1
Thilo Kielmann, Guillaume Pierre, Christine Morin

Towards a Grid File System Based on a Large-Scale BLOB Management Service . . . . . . 7
Viet-Trung Tran, Gabriel Antoniu, Bogdan Nicolae, Luc Bougé, Osamu Tatebe

Improving the Dependability of Grids via Short-Term Failure
Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Artur Andrzejak and Demetrios Zeinalipour-Yazti and Marios D. Dikaiakos
Distributed Data Mining using a Public Resource Computing
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Eugenio Cesario, Nicola De Caria, Carlo Mastroianni and Domenico Talia
Integration of the ProActive Suite and the semantic-oriented monitoring
tool SemMon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Wlodzimierz Funika, Denis Caromel, Pawel Koperek, and Mateusz Kupisz
An Experimental Evaluation of the DQ-DHT Algorithm in a Grid
Information Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Harris Papadakis, Paolo Trunfio, Domenico Talia and Paraskevi Fragopoulou
Reducing traffic in DHT-based discovery protocols for dynamic
resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Emanuele Carlini, Massimo Coppola, Domenico Laforenza and Laura Ricci
Autonomic management of multiple non-functional concerns in behavioural skeletons . . . . . . 89
Marco Aldinucci, Marco Danelutto and Peter Kilpatrick




Decision Models for Resource Aggregation in Peer-to-Peer Architectures . 105
Mircea Moca and Gheorghe Cosmin Silaghi
Mapping Workflows on Grid Resources: Experiments with the Montage
Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Rizos Sakellariou and Henan Zhao and Ewa Deelman
A Proposal on Enhancing XACML with Continuous Usage Control
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Maurizio Colombo, Aliaksandr Lazouski, Fabio Martinelli, and Paolo Mori
Self-* and Adaptive Mechanisms for Large Scale Distributed Systems . . . 147
P. Fragopoulou, C. Mastroianni, R. Montero, A. Andrzejak, D. Kondo
Network Monitoring in the age of the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 157
Augusto Ciuffoletti
Recent Advances and Research Challenges in Desktop Grid and
Volunteer Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Gilles Fedak
Research Challenges in Managing and Using Service Level Agreements . . 187
Omer Rana, Wolfgang Ziegler


List of Contributors


Marco Aldinucci
Dept. Computer Science, University of Torino, Italy

Artur Andrzejak
Zuse Institute Berlin (ZIB), Takustraße 7, 14195 Berlin, Germany

Gabriel Antoniu
INRIA, Centre Rennes - Bretagne Atlantique, IRISA, Rennes, France

Luc Bougé
ENS Cachan/Brittany, IRISA, France, e-mail: luc.bouge@bretagne.ens-cachan.fr

Emanuele Carlini
Institute of Information Science and Technologies CNR-ISTI “A. Faedo”, Pisa, Italy, and Institutions Markets Technologies IMT, Lucca, Italy

Denis Caromel
INRIA - CNRS - University of Nice Sophia-Antipolis, 2004, Route des Lucioles - BP93 - 06902 Sophia Antipolis Cedex, France

Eugenio Cesario
ICAR-CNR, Rende, Italy

Augusto Ciuffoletti
Dipartimento di Informatica, Università di Pisa

Maurizio Colombo
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, via G. Moruzzi 1, Pisa, Italy

Massimo Coppola
Institute of Information Science and Technologies CNR-ISTI, Pisa, Italy

Nicola De Caria
DEIS - University of Calabria, Rende, Italy

Marco Danelutto
Dept. Computer Science, University of Pisa, Italy
Ewa Deelman
USC Information Sciences Institute, 4676 Admiralty Way, Marina Del Rey, CA 90292, USA

Marios D. Dikaiakos
Department of Computer Science, University of Cyprus, CY-1678, Nicosia, Cyprus

Gilles Fedak
LIP/INRIA Rhône-Alpes

Paraskevi Fragopoulou
FORTH-ICS, N. Plastira 100, Vassilika Vouton, GR 71003 Heraklion-Crete, Greece

Wlodzimierz Funika
Institute of Computer Science AGH-UST, al. Mickiewicza 30, 30-059 Kraków, Poland

Thilo Kielmann
Vrije Universiteit, Amsterdam, The Netherlands

Peter Kilpatrick
Dept. Computer Science, Queen’s University Belfast, UK

Derrick Kondo
Laboratoire LIG, ENSIMAG - antenne de Montbonnot, ZIRST 51, Av. Jean Kuntzmann, 38330 Montbonnot Saint Martin, France

Pawel Koperek
Institute of Computer Science AGH-UST, al. Mickiewicza 30, 30-059 Kraków, Poland

Mateusz Kupisz
Institute of Computer Science AGH-UST, al. Mickiewicza 30, 30-059 Kraków, Poland
Domenico Laforenza
Institute of Information Science and Technologies CNR-ISTI and Institute of Informatics and Telematics CNR-IIT, Pisa, Italy

Aliaksandr Lazouski
Università di Pisa, via B. Pontecorvo 3, Pisa, Italy, e-mail: lazouski@di.unipi.it

Fabio Martinelli
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, via G. Moruzzi 1, Pisa, Italy

Carlo Mastroianni
ICAR-CNR, Via P. Bucci 41C, 87036 Rende (CS), Italy

Mircea Moca
Babeş-Bolyai University of Cluj-Napoca, Str. Theodor Mihali, nr. 58-60, Cluj-Napoca, Romania

Ruben Montero
Departamento de Arquitectura de Computadores y Automática, Universidad Complutense, 28040 Madrid, Spain

Paolo Mori
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, via G. Moruzzi 1, Pisa, Italy

Christine Morin
INRIA, Centre Rennes - Bretagne Atlantique, Rennes, France

Bogdan Nicolae
University of Rennes 1, IRISA, Rennes, France, e-mail: bogdan.nicolae@irisa.fr

Harris Papadakis
Foundation for Research and Technology-Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Greece

Guillaume Pierre
Vrije Universiteit, Amsterdam, The Netherlands

Omer Rana
School of Computer Science/Welsh eScience Centre, Cardiff University, UK

Laura Ricci
Università di Pisa, Pisa, Italy


Rizos Sakellariou
School of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom

Gheorghe Cosmin Silaghi
Babeş-Bolyai University of Cluj-Napoca, Str. Theodor Mihali, nr. 58-60, Cluj-Napoca, Romania

Domenico Talia
Institute of High Performance Computing and Networking, Italian National Research Council (ICAR-CNR) and Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende, Italy

Osamu Tatebe
University of Tsukuba, Tsukuba, Japan

Viet-Trung Tran
ENS Cachan/Brittany, IRISA, France

Paolo Trunfio
Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende, Italy

Demetrios Zeinalipour-Yazti
Department of Computer Science, University of Cyprus, CY-1678, Nicosia, Cyprus

Henan Zhao
School of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom

Wolfgang Ziegler
Fraunhofer Institute SCAI, Germany, e-mail: Wolfgang.Ziegler@scai.fraunhofer.de



XtreemOS: a Sound Foundation for Cloud
Infrastructure and Federations
Thilo Kielmann, Guillaume Pierre, Christine Morin

Abstract XtreemOS is a Linux-based operating system with native support for virtual organizations (VO’s), for building large-scale resource federations. XtreemOS
has been designed as a grid operating system, supporting the model of resource
sharing among independent administrative domains. We argue, however, that the
VO concept can be used to establish either resource sharing or resource isolation,
or even both at the same time. We outline XtreemOS’ fundamental properties and
how its native VO support can be used to implement cloud infrastructure and cloud
federations.

Thilo Kielmann and Guillaume Pierre
Vrije Universiteit, Amsterdam, The Netherlands, e-mail: gpierre@cs.vu.nl

Christine Morin
INRIA, Centre Rennes - Bretagne Atlantique, Rennes, France, e-mail: Christine.Morin@irisa.fr

1 XtreemOS
Developing and deploying applications for traditional (single computer) operating
systems is well understood. Federated resources like in grid environments, however,
are generally perceived as highly complex and difficult to use. The difference lies in
the underlying system achitecture. Operating systems provide a well-integrated set
of services like processes, files, memory, sockets, user accounts and access rights.
Grids, in contrast, add a more or less heterogeneous middleware layer on top of the
operating systems of the federated resources. This lack of integration has lead to a
lot of complexity, for both users and administrators.
To remedy this situation, XtreemOS [7] has been designed as a grid operating system. While being based on Linux, it provides a comprehensive set of services as well as a stable interface for wide-area, dynamic, distributed infrastructures composed of heterogeneous resources spanning multiple administrative domains. The fundamental issues addressed by XtreemOS are scalability and transparency.
Scalability. Wide-area, distributed infrastructures like grids easily consist of thousands of nodes and users. Along with this scale comes heterogeneity of (compute and file) resources, networks, and administrative policies, as well as churn of resources and users. XtreemOS addresses these issues by its integrated view on resources, along with its built-in support for virtual organizations (VO's) that provide the scoping for resource provisioning and access. For sustained operation, XtreemOS provides an infrastructure for highly-available services, to support both its own critical services and user-defined application services.

Transparency. Vital for managing the complexity of grid-like infrastructures is providing transparency for the distributed nature of the environment, by maintaining a common look-and-feel for the user and by exposing distribution and federation only as much as necessary. To the user, XtreemOS provides single sign-on access, Linux look-and-feel via grid-aware shell tools, and API's that are based on both POSIX and the Simple API for Grid Applications (SAGA). For the administrators of VO's and site resources, XtreemOS provides easy-to-use services for all management tasks.

[Figure 1 shows the XtreemOS system architecture as a layered stack: extensions to Linux for VO support and checkpointing at the bottom; an infrastructure for highly available and scalable services above it; then AEM, VOM and XtreemFS/OSS; topped by the XtreemOS API (based on SAGA and POSIX). Security spans all layers, and the stack runs on stand-alone PCs, SSI clusters, and mobile devices.]

Fig. 1 The XtreemOS system architecture

Figure 1 summarizes the XtreemOS system architecture. XtreemOS comes in three flavours: one for stand-alone nodes (PC's), one for clusters providing a single-system image (SSI), and one for mobile devices. Common to all three flavours are the Linux extensions for VO support, providing VO-based user accounts via kernel modules [1]. The PC and cluster flavours also share support for grid-wide, kernel-level job checkpointing.

The infrastructure for highly available and scalable services consists of implementations of distributed servers and of virtual nodes [6]. The distributed servers form a transparent group of machines that provide their services through a shared (mobile IPv6) address. Within the group, load balancing and fault tolerance are implemented transparently to the clients. The virtual nodes provide fault-tolerant service replication via a Java container, transparently to the service implementation itself.



Central to VO-wide operation are the services AEM, VOM, the XtreemFS file system, and the OSS mechanism for sharing volatile application objects. The VO management services (VOM) provide authentication, authorization, and accounting for VO users and resources. VO's can be managed dynamically through their whole life cycle, while user access is organized with flexible policies, providing customizable isolation, access control, and auditing. The VO management services, together with the kernel modules enforcing local accounts and policies, provide a security infrastructure underlying all XtreemOS functionality.
The Application Execution Management (AEM) relies on the Scalaris [4] peer-to-peer overlay among the compute nodes of a VO, which makes it possible to discover, select, and allocate resources to applications. It provides POSIX-style job control to launch, monitor, and control applications.
The XtreemFS grid file system [2] provides users with a global, location-independent view of their data. XtreemFS provides a standard POSIX interface, accommodating multiple VO's across different administrative domains. It provides autonomous data management with self-organized replication and distribution. The Object Sharing Service (OSS) provides access to volatile, shared objects in main memory segments.

The XtreemOS API's accommodate existing Linux and grid applications, while adding support for XtreemOS' unique features. POSIX interfaces support Linux applications; grid-aware shell tools seamlessly integrate compute nodes within a VO. Grid applications find their support via the OGF-standardized Simple API for Grid Applications (SAGA) [5]. API's for XtreemOS-specific functionality (XtreemOS credentials, AEM's resource reservation, XtreemFS URL's, OSS shared segments, etc.) are provided as SAGA extension packages, commonly referred to as the XOSAGA API.

2 Cloud Infrastructure and Federations
Grid infrastructures operate by sharing physical resources among the users of a VO; sharing and isolation are managed by the site-local operating systems and the VO-wide (middleware) services. Although cloud computing as such is still in its infancy, the Infrastructure-as-a-Service (IaaS) paradigm has gained importance. Here, virtualized resources are rented to cloud users; sharing and isolation are managed by the Virtual Machine Managers (VMM's). What makes this model attractive is that users get full control over the virtual machines, while the underlying IaaS infrastructure remains in charge of resource sharing and management. An important drawback of this model is that it provides only isolated machines, rather than integrated clusters with secure and fast local networks, integrated user management and file systems.

This is where XtreemOS provides added value to IaaS clouds [3]. Figure 2 shows how XtreemOS can integrate resources from one or more IaaS providers to form a clustered resource collection for a given user. Within a single IaaS platform, XtreemOS integrates multiple virtual machines, similar to its SSI cluster version, to form a cloud cluster with integrated access control based on its VO-management mechanisms, here applied to a user-defined, dynamic VO. Across multiple IaaS platforms, the same VO management mechanisms allow the federation of multiple cloud clusters to a user's VO. In combination with the XtreemFS file system, such IaaS federations provide flexibly allocated resources that match a user's requirements, while giving full control over the virtualized resources.

[Figure 2 depicts a cloud federation built from XtreemOS instances running on top of the virtualization layers of multiple IaaS providers.]

Fig. 2 XtreemOS integrating IaaS resources
XtreemOS extends Linux by its integrated support for VO's. Within grid computing environments, VO's enable sharing of physical resources. Within IaaS clouds, VO's enable proper isolation between clustered resources, thus making it possible to form unified environments tailored to their users.

Acknowledgements
This work has been supported by the EU IST program as part of the XtreemOS
project (contract FP6-033576).


References
1. M. Coppola, Y. Jégou, B. Matthews, Ch. Morin, L.P. Prieto, O.D. Sánchez, E.Y. Yang, H. Yu: Virtual Organization Support within a Grid-Wide Operating System. IEEE Internet Computing, Vol. 12, No. 2, 2008.
2. F. Hupfeld, T. Cortes, B. Kolbeck, J. Stender, E. Focht, M. Hess, J. Malo, J. Marti, E. Cesario: The XtreemFS Architecture—a Case for Object-based File Systems in Grids. Concurrency and Computation: Practice and Experience, Vol. 20, No. 17, 2008.
3. Ch. Morin, Y. Jégou, J. Gallard, P. Riteau: Clouds: a New Playground for the XtreemOS Grid Operating System. Parallel Processing Letters, Vol. 19, No. 3, 2009.
4. T. Schütt, F. Schintke, A. Reinefeld: Scalaris: Reliable Transactional P2P Key/Value Store – Web 2.0 Hosting with Erlang and Java. 7th ACM SIGPLAN Erlang Workshop, Victoria, September 2008.
5. Ch. Smith, T. Kielmann, S. Newhouse, M. Humphrey: The HPC Basic Profile and SAGA: Standardizing Compute Grid Access in the Open Grid Forum. Concurrency and Computation: Practice and Experience, Vol. 21, No. 8, 2009.
6. M. Szymaniak, G. Pierre, M. Simons-Nikolova, M. van Steen: Enabling Service Adaptability with Versatile Anycast. Concurrency and Computation: Practice and Experience, Vol. 19, No. 13, 2007.
7. XtreemOS: www.xtreemos.eu


Towards a Grid File System Based on a Large-Scale BLOB Management Service

Viet-Trung Tran, Gabriel Antoniu, Bogdan Nicolae, Luc Bougé, Osamu Tatebe


Abstract This paper addresses the problem of building a grid file system for applications that need to manipulate huge data, distributed and concurrently accessed at a very large scale. We explore how this goal could be reached through a cooperation between the Gfarm grid file system and BlobSeer, a distributed object management system specifically designed for huge data management under heavy concurrency. The resulting BLOB-based grid file system exhibits scalable file access performance in scenarios where huge files are subject to massive, concurrent, fine-grain accesses. This is demonstrated through preliminary experiments with our prototype, conducted on the Grid'5000 testbed.

Viet-Trung Tran and Luc Bougé
ENS Cachan/Brittany, IRISA, France, e-mail: luc.bouge@bretagne.ens-cachan.fr

Gabriel Antoniu
INRIA, Centre Rennes - Bretagne Atlantique, IRISA, Rennes, France

Bogdan Nicolae
University of Rennes 1, IRISA, Rennes, France

Osamu Tatebe
University of Tsukuba, Tsukuba, Japan

1 Introduction

The need for transparent grid data management

As more and more applications in many areas (nuclear physics, health, cosmology, etc.) generate larger and larger volumes of data that are geographically distributed, appropriate mechanisms for storing and accessing data at a global scale become increasingly necessary. Grid file systems (such as LegionFS [16], Gfarm [14], etc.)
prove their utility in this context, as they provide a means to federate a very large
number of large-scale distributed storage resources and offer a large storage capacity and a good persistence achieved through file-based storage. Beyond these
properties, grid file systems have the important advantage of offering a transparent access to data through the abstraction of a shared file namespace, in contrast
to explicit data transfer schemes (e.g. GridFTP-based [3], IBP [4]) currently used
on some production grids. Transparent access greatly simplifies data management
by applications, which no longer need to explicitly locate and transfer data across
various sites, as data can be accessed the same way from anywhere, based on globally shared identifiers. Implementing transparent access at a global scale naturally leads, however, to a number of challenges related to scalability and performance, as the file system is put under pressure by a very large number of concurrent, largely distributed accesses.

From block-based to object-based distributed file systems

Recent research [7] emphasizes a clear move currently in progress from a block-based interface to an object-based interface in storage architectures, with the goal of enabling scalable, self-managed storage networks by moving low-level functionalities, such as space management, to storage devices or storage servers, accessed through a standard object interface. This move has a direct impact on the design of today's distributed file systems: object-based file systems store data as objects rather than as unstructured data blocks. According to [7], this move may eliminate nearly 90% of the management workload, which has been the major obstacle limiting file systems' scalability and performance.

Two approaches exploit this idea. In the first approach, the data objects are stored and manipulated directly by a new type of storage device called object-based storage device (OSD). This approach requires an evolution of the hardware, in order to allow high-level object operations to be delegated to the storage device. The standard OSD interface was defined in the Storage Networking Industry Association (SNIA) OSD working group. The protocol is embodied over SCSI and defines a new set of SCSI commands. Recently, a second generation of the command set, Object-Based Storage Devices 2 (OSD-2), has been defined. The distributed file systems taking the OSD approach assume the presence of such OSDs in the near future and currently rely on a software module simulating their behavior. Examples of parallel/distributed file systems following this approach are Lustre [13] and Ceph [15]. Recently, research efforts [6] have explored the feasibility and the possible benefits of integrating OSDs into parallel file systems, such as PVFS [5].

The second approach does not rely on the presence of OSDs, but still tries to benefit from an object-based approach to improve performance and scalability: files are structured as a set of objects that are stored on storage servers. Google File System [8] and HDFS (Hadoop File System) [9] illustrate this approach.



Large-scale distributed object storage for massive data

Beyond the above developments in the area of parallel and distributed file systems, other efforts rely on objects for large-scale data management, without exposing a file system interface. BlobSeer [11, 10] is such a BLOB (binary large object) management service, specifically designed to deal with large-scale distributed applications which need to store massive data objects and to efficiently access (read, update) them at a fine grain. In this context, the system should be able to support a large number of BLOBs, each of which might reach a size in the order of TB. BlobSeer employs a powerful concurrency management scheme enabling a large number of clients to efficiently read and update the same BLOB simultaneously in a lock-free manner.

A two-layer architecture

Most object-based file systems exhibit a decoupled architecture that generally consists of two layers: a low-level object management service and a high-level file system metadata management layer. In this paper we propose to explore how this two-layer approach could be used to build an object-based grid file system for applications that need to manipulate huge data, distributed and concurrently accessed at a very large scale. We investigate this approach by experimenting with how the Gfarm grid file system could leverage the properties of the BlobSeer distributed object management service, specifically designed for huge data management under heavy concurrency. We thus couple Gfarm's powerful file metadata capabilities with BlobSeer's efficient and transparent low-level distributed object storage. We expect the resulting BLOB-based grid file system to exhibit scalable file access performance in scenarios where huge files are subject to massive, concurrent, fine-grain accesses. We intend to deploy a BlobSeer instance at each Gfarm storage node, to handle object storage. The benefits are mutual: by delegating object management to BlobSeer, Gfarm can expose efficient fine-grain access to huge files (TB size) and benefit from transparent file striping. On the other hand, BlobSeer benefits from the file system interface on top of its current API.

The remainder of this paper is structured as follows. Section 2 introduces the two components of our object-based file system: BlobSeer and Gfarm, whose coupling is explained in Section 3. Section 4 presents our preliminary experiments on the Grid'5000 testbed. Finally, Section 5 summarizes the contribution and discusses future directions.



2 The building blocks: Gfarm and BlobSeer

Our object-based grid file system consists of two layers: a high-level file metadata layer, provided by the Gfarm file system, and a low-level storage layer based on the BlobSeer BLOB management service.

2.1 The Gfarm grid file system

The Grid Datafarm (Gfarm) [14] is a distributed file system designed for high-performance data access and reliable file sharing in large-scale environments, including grids of clusters. To facilitate file sharing, Gfarm manages a global namespace which allows applications to access files using the same path regardless of file location. It federates the available storage spaces of Grid nodes to provide a single file system image. We have used Gfarm v2.1.0 in our experiments.

2.1.1 Overview of Gfarm's architecture

Gfarm consists of a set of communicating components, each of which fulfills a particular role.

Gfarm's metadata server: the gfmd daemon. The metadata server stores and manages the namespace hierarchy together with file metadata, user-related metadata, as well as file location information allowing clients to physically locate the files.

Gfarm file system nodes: the gfsd daemons. They are responsible for physically storing full Gfarm files on their local storage. Gfarm does not implement file striping, and this is where BlobSeer can bring its contribution, through transparent file fragmentation and distribution.

Gfarm clients: Gfarm API and FUSE access interface for Gfarm. Gfarm provides users with a specific API and several command-line tools to access the Gfarm file system. To facilitate data access, the Gfarm team developed Gfarm2fs: a POSIX file system interface based on the FUSE library [17]. Basically, Gfarm2fs transparently maps all standard file I/O to the corresponding routines of the Gfarm API. Thus, existing applications handling files need not be modified in order to work with the Gfarm file system.
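Because Gfarm2fs maps standard file I/O onto the Gfarm API, an unmodified program sees Gfarm files as ordinary local files. A minimal sketch in Python, assuming a hypothetical mount point /gfarm:

    # Plain POSIX-style I/O; Gfarm2fs (FUSE) forwards these calls to the
    # Gfarm API transparently. The /gfarm mount point is an assumption.
    with open("/gfarm/shared/input.dat", "rb") as f:
        header = f.read(512)  # served by a (possibly remote) gfsd node

    with open("/gfarm/shared/output.log", "a") as f:
        f.write("processed 512 header bytes\n")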



2.2 The BlobSeer BLOB management service

2.2.1 BlobSeer at a glance

BlobSeer [11, 10] addresses the problem of storing and efficiently accessing very large, unstructured data objects in a distributed environment. It focuses on heavy access concurrency, where data is huge, mutable and potentially accessed by a very large number of concurrent, distributed processes. To cope with very large data BLOBs, BlobSeer uses striping: each BLOB is cut into fixed-size pages, which are distributed among data providers. BLOB metadata facilitates access to a range (offset, size) for any existing version of a BLOB snapshot, by associating such a range with the physical nodes where the corresponding pages are located. Metadata are organized as a segment-tree-like structure (see [11] for details) and are scattered across the system using a Distributed Hash Table (DHT). Distributing data and metadata is the key choice in our design: it enables high performance through parallel, direct-access I/O paths, as demonstrated in [12]. Further, BlobSeer provides concurrent clients with efficient fine-grained access to BLOBs, without locking. To deal with mutable data, BlobSeer introduces a versioning scheme which allows clients not only to roll back data changes when desired, but also to access multiple versions of the same BLOB within the same computation.
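To make the striping scheme concrete, the sketch below shows the page arithmetic a client performs for an (offset, size) access; the function and variable names are ours, for illustration, and do not reflect BlobSeer's actual API.

    # Indices of the fixed-size pages covering [offset, offset + size).
    def pages_for_range(offset, size, page_size):
        first = offset // page_size
        last = (offset + size - 1) // page_size
        return list(range(first, last + 1))

    # Example: a 10 KB read at offset 60 KB on a BLOB striped into 64 KB
    # pages touches pages 0 and 1, which may live on different providers.
    print(pages_for_range(offset=60 * 1024, size=10 * 1024, page_size=64 * 1024))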

2.2.2 Overview of BlobSeer's architecture

The system consists of distributed processes that communicate through remote procedure calls (RPCs). A physical node can run one or more processes and, at the same time, may play several of the roles mentioned below.

Clients. Clients may issue CREATE, WRITE, APPEND and READ requests. There may be multiple concurrent clients. Their number varies dynamically over time without the system being notified.

Data providers. Data providers physically store and manage the pages generated by WRITE and APPEND requests. New data providers are free to join and leave the system in a dynamic way.

The provider manager. The provider manager keeps information about the available data providers and schedules the placement of newly generated pages according to a load-balancing strategy.

Metadata providers. Metadata providers physically store the metadata, allowing clients to find the pages corresponding to the various BLOB versions. Metadata providers are distributed, to allow efficient concurrent access to metadata.

The version manager. The version manager is the key actor of the system. It registers update requests (APPEND and WRITE), assigning a BLOB version number to each of them. The version manager eventually publishes these updates, guaranteeing total ordering and atomicity.



Accessing data in BlobSeer

To READ data, the client contacts the version manager: it needs to provide a BLOB id, a specific version of that BLOB, and a range, specified by an offset and a size. If the specified version is available, the client queries the metadata providers to retrieve the metadata indicating the location of the pages for the requested range. Finally, the client contacts in parallel the data providers that store the corresponding pages.

For a WRITE request, the client contacts the provider manager to obtain a list of providers, one for each page of the BLOB segment that needs to be written. Then, the client contacts the providers in the list in parallel and requests them to store the pages. Each provider executes the request and sends an acknowledgment to the client. When the client has received all the acknowledgments, it contacts the version manager and requests a new version number. This version number is then used by the client to generate the corresponding new metadata. Finally, the client notifies the version manager of success, and returns successfully to the user. At this point, the version manager is responsible for eventually publishing the new version of the BLOB. The APPEND operation is a particular case of WRITE, where the offset is implicitly the size of the previously published snapshot version. The detailed algorithms for READ, WRITE and APPEND are given in [11].
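The WRITE protocol just described can be summarized in client-side pseudocode. All names below (the RPC stubs and manager handles) are illustrative only; the real BlobSeer client library defines its own interfaces.

    # Illustrative client-side sketch of the BlobSeer WRITE protocol.
    def blob_write(blob_id, offset, data, page_size,
                   provider_manager, version_manager, metadata_providers):
        # Cut the segment to be written into fixed-size pages.
        pages = [data[i:i + page_size] for i in range(0, len(data), page_size)]

        # 1. Ask the provider manager for one data provider per page.
        providers = provider_manager.allocate(blob_id, len(pages))

        # 2. Send the pages to the providers (in parallel in the real system;
        #    shown sequentially here) and collect their acknowledgments.
        acks = [p.store(blob_id, page) for p, page in zip(providers, pages)]
        assert all(acks)

        # 3. With all acknowledgments in hand, request a new version number.
        version = version_manager.new_version(blob_id)

        # 4. Generate the new metadata mapping the written range to its pages.
        metadata_providers.publish(blob_id, version, offset, len(data), providers)

        # 5. Notify the version manager of success; it eventually publishes
        #    the new version, guaranteeing total ordering and atomicity.
        version_manager.commit(blob_id, version)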

2.3 Why combine Gfarm and BlobSeer?

Gfarm does not rely on autonomous, self-managing object-based storage, like the file systems mentioned in Section 1. Each Gfarm file is fully stored on a single file system node, or fully replicated to multiple file system nodes. If a large number of clients concurrently access small parts of the same copy of a huge file, this can lead to a bottleneck both for reading and for writing. Second, Gfarm's file sizes are limited by the storage capabilities of the machines used as file system nodes in the Gfarm deployment. However, some powerful features, including user management, authentication and single sign-on (based on GSI: Grid Security Infrastructure [1]), are present in Gfarm's current implementation. Moreover, thanks to Gfarm's FUSE access interface, data can be accessed in a transparent manner via the POSIX file system API.

BlobSeer brings different benefits: it handles huge data, which is transparently fragmented and distributed at a large scale. Thanks to its distributed metadata scheme, it sustains a high bandwidth even when the BLOB grows to large sizes and faces heavy concurrent access [12]. BlobSeer is mostly suitable for massive data processing, fine-grained access, and versioning in a large-scale distributed environment. But BlobSeer lacks a file system interface that would let existing applications use it directly. As explained above, such an interface is provided by Gfarm, together with the associated file system metadata management. It then clearly appears that making Gfarm cooperate with BlobSeer would enhance their respective functionalities and would lead to an object-based file system with better properties: huge file support (TBs), fine-grain access under heavy concurrency, versioning, and user and GSI-compliant security management. In this paper we focus on providing enhanced concurrency support. Exposing multi-versioning to the file system user is currently under study and will not be addressed in this paper.

3 Towards an object-based file system based on Gfarm and BlobSeer

3.1 How to couple Gfarm and BlobSeer?

Since each gfsd daemon running on Gfarm's file system nodes is responsible for physically storing Gfarm's data on its local file system, our first approach aims at integrating BlobSeer calls into the gfsd daemon. The main idea is to trap all requests to the local file system and map them to the corresponding BlobSeer API, in order to leave the job of storing Gfarm's data to BlobSeer. A Gfarm file is no longer directly stored as a file on the local system; it is stored as a BLOB in BlobSeer. This way, file fragmentation and striping are introduced transparently for Gfarm at the gfsd level.

Nevertheless, this way of integrating BlobSeer into the gfsd daemon clearly does not fully exploit BlobSeer's capability of efficiently handling concurrency, in which multiple clients simultaneously access the same BLOB. The gfsd daemon always acts as an intermediary for data transfer between Gfarm clients and BlobSeer data providers, which may limit the data transfer throughput. For this reason, we propose a second approach. Currently, Gfarm defines two modes for data access: local access mode and remote access mode. The local access mode is the mode in which the client and the gfsd daemon involved in a data transaction are on the same physical node, allowing the client to directly access its local disk. In contrast, the remote access mode is the mode in which a client accesses data through a remote gfsd daemon.
Our second approach consists in introducing into Gfarm a new access mode, called BlobSeer direct access mode, allowing Gfarm clients to directly access BlobSeer. In this mode, as explained in Section 2.2, clients benefit from a better throughput, as they access the distributed BLOB pages in parallel. During data accesses, the risk of creating a bottleneck at the gfsd level is then reduced, since the gfsd daemon no longer acts as an intermediary for accessing data; its task now is simply to establish the mapping between Gfarm logical files and BlobSeer's corresponding BLOB ids. Keeping the management of this mapping at the gfsd level is important since, this way, no change is required on Gfarm's metadata server (gfmd), which is not aware of the use of BlobSeer.
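The mapping kept at the gfsd level could look like the following sketch; the BlobSeer client calls and the mapping store are hypothetical names used only to illustrate the idea.

    # Hypothetical sketch of the gfsd-level mapping between Gfarm logical
    # files and BlobSeer BLOB ids; all names are illustrative.
    class BlobMapping:
        def __init__(self, blobseer_client, mapping_store):
            self.bs = blobseer_client   # client handle to a BlobSeer deployment
            self.ids = mapping_store    # persistent {gfarm_path: blob_id} table

        def blob_for(self, gfarm_path):
            # Create the backing BLOB lazily on first access.
            if gfarm_path not in self.ids:
                self.ids[gfarm_path] = self.bs.create_blob()
            return self.ids[gfarm_path]

    # In BlobSeer direct access mode, a Gfarm client asks gfsd only for the
    # BLOB id, then reads and writes the distributed pages itself:
    #   blob_id = gfsd.blob_for("/gfarm/huge-file")
    #   data = blobseer_client.read(blob_id, offset, size)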

