CONTENT DESCRIPTION MODEL AND FRAMEWORK
FOR EFFICIENT CONTENT DISTRIBUTION

ZHANG SHUTAO
(B. Eng. (Hons.) NUS)
HT00-6864A

A THESIS SUBMITTED
FOR THE DEGREE OF
MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2005




Acknowledgement

I owe my deepest gratitude and appreciation to my thesis supervisor, Dr. Chi Chi-Hung,
for giving me the opportunity to work with him and my lab mates. I thank him for his
continued guidance, insight, patience, encouragement, and above all, his confidence in
me, without which this thesis would not have been possible. I am grateful to him for all
the time and effort he has spent in helping me improve my research and this document. I
would also like to thank him for giving me advice on how to choose my career path at
this important stage of life.

I sincerely thank all my lab mates for offering me much needed assistance and for sharing
their invaluable insights during my research. Special thanks to my dear friend Wang
Hong-Guang for his sincere help and encouragement during the most difficult time of my
research. Also I want to thank Yuan Jun-Li and Li Qi-Ming for sharing their valuable
advice on my research experiment.

Finally, I would like to express my immeasurable appreciation to my wife, my parents
and my parents-in-law for their love, trust, inspiration and understanding.


Contents

Summary
List of Figures

Chapter 1 Introduction

Chapter 2 Related Works
2.1 Framework for Customized Content Delivery
2.2 Content Description Model
2.3 Client Descriptions
2.4 Server Side Approaches
2.5 Existing Software Tools
2.6 Summary

Chapter 3 A General Content Description Model
3.1 General Settings
3.2 Proposed Content Description Model
3.2.1 Web Objects
3.2.2 Object Description Scheme
3.2.3 Discussion

Chapter 4 A Framework for Efficient Content Distribution
4.1 Design Objectives
4.2 Overall Architecture
4.3 Server Operations
4.4 Proxy Operations
4.4.1 Mapping User Descriptions to Content Descriptions
4.4.2 Managing Local Content Descriptions
4.5 User Operations
4.6 Summary

Chapter 5 A Case Study on the Framework
5.1 Simulation Setup
5.2 Web Object Size
5.3 Web Object Latency
5.4 XHTML Page Latency
5.5 Summary

Chapter 6 Conclusion

Reference

Summary
Today, the Web has become a highly heterogeneous environment. Users access
information on the Web pervasively through heterogeneous end points with
different capabilities. To accommodate heterogeneous user preferences
and device capabilities, web intermediaries, called proxies, have started to perform
various functions, including Web content caching and image transcoding, on the Web
content before it is distributed to the users. As different functions require different
content semantic information, which we refer to as content descriptions, web servers
host a large number of content descriptions to help proxies perform these functions.


In this heterogeneous environment, efficient content distribution has become a
problem because of several challenging issues. First, it is not clear how a proxy should
decide which functions to perform given arbitrary user preferences and device
capabilities: it is not easy, if possible at all, for every proxy to understand every type
of device and user, and users may not know all the functions provided by proxies
either. If this is not properly handled, we may end up delivering unacceptable
content to users. Second, to provide semantic information about different attributes of
Web content, the server may need to store a large number of content descriptions.
Delivering all the descriptions of a Web page to a proxy whenever the page is
requested may be highly inefficient, because the proxy may need only a small fraction of
the content descriptions to perform the desired functions. Third, repeatedly delivering
the same content descriptions to the same proxy is unnecessary. So far, however, there is
no mechanism for a proxy to properly cache and reuse the content descriptions that it
has already retrieved.

In this thesis, we propose a content description model and framework for efficient content
distribution. The content description model employs ideas from Resource Description
Framework [3] and External Annotation [2], which allow flexible descriptions for Web
content. The model also allows a server to efficiently select any subset of the descriptions
of any Web page and deliver them to a proxy. The framework consists of several
algorithms: for the proxies to map user preferences and device capabilities to a set of
functions to be performed, for the server to select and deliver the necessary content
descriptions to the proxy, and for the proxy to efficiently cache and reuse the content
descriptions.

To evaluate the performance of our framework, we conduct a simulation study
with certain simplifications (the details are given in Chapter 5). We employ real-world
Web objects identified from network traces, and study how our content description model
and framework reduce the size of the Web objects, the delay in retrieving Web objects,
the number of chunks in HTTP responses, and the delay of entire Web pages. We present
some preliminary results and discuss them.


List of Figures
2.1. ICAP Response Modification
2.2. ICAP Request Modification
2.3. InfoPyramid Model
3.1. General Settings
3.2. Description for a Simple XHTML Page
4.1. The Framework Overview
4.2. Mapping User Descriptions to Functions
4.3. Mapping Functions to Set of Attribute Descriptions
4.4. Caching and Validation for Content Descriptions
4.5. Managing Local Attribute Descriptions
5.1. A Sample Content Selection Flow
5.2. Web Object Size Reduction
5.3. Web Object Latency Reductions
5.4. Chunk Number Distribution for Web Objects
5.5. HTML Chunk Number Reduction
5.6. XHTML Page Latency
5.7. Effect of Different Parallel Connections with User Description D1
5.8. Effect of Different Parallel Connections with User Description D2
5.9. Effect of Different Parallel Connections with User Description D3
5.10. Effect of Different Parallel Connections with User Description D4
5.11. Effect of Different Parallel Connections with User Description D5



Chapter 1
Introduction
According to recent surveys [7, 8, 9, 18], the Internet continues to grow rapidly. The
World Wide Web (or Web for short), which is based on the Hypertext Transfer Protocol
(HTTP), has become the main platform for information distribution on the Internet.
Thompson et al. [9] conducted a study on InternetMCI’s backbone and found that Web
traffic accounted for more than half of the total Internet traffic.

Today, the Web has become a highly heterogeneous environment. Users are
accessing information on the Web pervasively through heterogeneous end points,
including personal computers and workstations on traditional wired networks, and
devices based on more recent wireless technologies.

Wireless devices such as smart phones, palm-top devices, and laptop computers
are playing a very important role on the Internet. All these Web accessing devices have
various capabilities due to their widely diversified hardware computation power (e.g.,
processor speed, memory size, I/O capability), software configuration (e.g., operating
system, Web browser, audio-visual applications), and network access methods
(communication media and bandwidth).

Besides devices being heterogeneous in their capabilities, users may also have
different preferences for Web access, which may vary in aspects such as privacy,
advertising, latency, and so on. Consequently, different users may require different
treatment of the Web content, based on their own preferences and device capabilities.

To accommodate the needs of heterogeneous users and devices, network nodes between
servers and end users have started to perform various functions on the Web content
before it is distributed to the users. These network nodes are often referred to as active
web intermediaries or proxies; in the rest of this thesis, we call them proxies. Below are
some examples of the functions that are widely supported by proxies.

Web content caching [39, 40, 41]
To achieve fast access to Web content for users who are spread out in a large
range of different networks, one can employ a Web caching proxy on a subnet that
temporarily stores copies of selected content provided by some server, so that a local user
can obtain the content quickly from the local proxy instead of the remote server. These
proxies have become very important to speed up Web content distribution.


Content adaptation [2, 3, 4, 6]
Different device capabilities and user preferences pose different constraints on
what kind of content is acceptable to the users. To deliver only acceptable content to
users, proxies can perform content adaptations, which include image transcoding [41],
content transformation [4], content filtering [42], and so on.
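
The Web content caching function described above can be illustrated with a minimal sketch. It assumes a single fixed freshness lifetime for every object, and all names (`CachingProxy`, `fetch_from_server`, `ttl_seconds`) are hypothetical; real caching proxies follow the HTTP caching rules rather than this simplification.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Toy caching proxy: serves a local copy while it is still fresh."""

    def __init__(self, fetch_from_server: Callable[[str], bytes],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_from_server   # hypothetical fetch from the remote server
        self._ttl = ttl_seconds           # assumed fixed freshness lifetime
        self._clock = clock               # injectable clock, useful for testing
        self._store: Dict[str, Tuple[float, bytes]] = {}
        self.server_hits = 0              # how often the remote server was contacted

    def get(self, url: str) -> bytes:
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]              # fresh local copy: no remote round trip
        body = self._fetch(url)           # miss or stale copy: go to the remote server
        self.server_hits += 1
        self._store[url] = (now, body)
        return body
```

A local user who requests the same URL twice within the lifetime is served from the proxy the second time, which is where the speed-up comes from.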

To perform these functions properly, a proxy usually requires some semantic
information about the Web content. We will refer to this kind of semantic information as
content descriptions in the rest of this thesis. To support various functions, many content
description models and frameworks have been proposed to provide semantic information
about different attributes of Web content. For example, the Edge Side Includes (ESI) [1]
language was proposed to describe attributes such as expected expiry time or
Time-To-Live (TTL) for Web content to support dynamic content caching. Extensible Device
Independent Markup Language (XDIME) was proposed by Volantis [43] to describe
content layout, image color and other attributes to support content adaptation for mobile
devices.

Under the heterogeneous environment, efficient content distribution has become a
problem. In the following, we will address the challenging issues related to this problem
one by one.



First, it is not clear how a proxy should decide which functions to perform
given arbitrary user preferences and device capabilities: it is not easy, if possible at
all, for every proxy to understand every type of device and user, and users may not
know all the functions provided by proxies either. If this is not properly
handled, we may end up delivering unacceptable content to users.

Second, to provide semantic information about different attributes of Web
content, the server may need to store a large number of content descriptions. Delivering
all the descriptions of a Web page to a proxy whenever the page is requested may be
highly inefficient, because the proxy may need only a small fraction of the content
descriptions to perform the desired functions.

Third, repeatedly delivering the same content descriptions to the same proxy is
unnecessary. So far, however, there is no mechanism for a proxy to properly cache and
reuse the content descriptions that it has already retrieved.

In this thesis, we propose a content description model and framework for efficient content
distribution. The content description model employs ideas from Resource Description
Framework [3] and External Annotation [2], which allow flexible descriptions for Web
content. The model also allows a server to efficiently select any subset of the descriptions
of any Web page and deliver them to a proxy. The framework consists of several
algorithms: for the proxies to map user preferences and device capabilities to a set of
functions to be performed, for the server to select and deliver the necessary content
descriptions to the proxy, and for the proxy to efficiently cache and reuse the content
descriptions.
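
The three roles named above (mapping a user description to functions, selecting a subset of descriptions on the server, and reusing descriptions on the proxy) can be pictured with the following rough sketch. It is only an illustration under simplifying assumptions and is not the algorithms of Chapter 4: the function names, the attribute names, and the rules in `functions_for` are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical mapping from proxy functions to the description attributes they need.
FUNCTION_NEEDS: Dict[str, Set[str]] = {
    "caching": {"ttl"},
    "image_transcoding": {"image_format", "image_size"},
    "content_filtering": {"category"},
}


def functions_for(user: Dict[str, object]) -> Set[str]:
    """Proxy side: map a user description to the functions to perform."""
    functions = {"caching"}                    # assumed to apply to every user
    if user.get("screen_width", 1024) < 480:   # assumed threshold for small screens
        functions.add("image_transcoding")
    if user.get("block_ads", False):
        functions.add("content_filtering")
    return functions


def select_descriptions(all_descriptions: Dict[str, object],
                        wanted: Set[str]) -> Dict[str, object]:
    """Server side: deliver only the requested subset of the descriptions."""
    return {k: v for k, v in all_descriptions.items() if k in wanted}


class DescriptionCache:
    """Proxy side: reuse descriptions that were already retrieved."""

    def __init__(self, server_descriptions: Dict[str, Dict[str, object]]) -> None:
        self._server = server_descriptions     # stands in for the remote server
        self._local: Dict[str, Dict[str, object]] = {}
        self.attributes_transferred = 0        # counts attributes sent by the server

    def descriptions(self, url: str, user: Dict[str, object]) -> Dict[str, object]:
        needed: Set[str] = set()
        for function in functions_for(user):
            needed |= FUNCTION_NEEDS[function]
        have = self._local.setdefault(url, {})
        missing = needed - set(have)           # only ask for what is not cached yet
        if missing:
            fetched = select_descriptions(self._server[url], missing)
            self.attributes_transferred += len(fetched)
            have.update(fetched)
        return {k: have[k] for k in needed if k in have}
```

In this sketch a second request for the same page transfers only the attributes that the proxy does not yet hold, which reflects the second and third issues discussed above.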


To evaluate the performance of our framework, we conduct a simulation study
with certain simplifications (the details are given in Chapter 5). We employ real-world
Web objects identified from network traces, and study how our content description model
and framework reduce the size of the Web objects, the delay in retrieving Web objects,
the number of chunks in HTTP responses, and the delay of entire Web pages. We present
some preliminary results and discuss them.

This thesis is organized as follows. In Chapter 2, we review existing content description
models and frameworks. In Chapter 3, we give a general content description model to
support various content descriptions. In Chapter 4, we propose a framework
to support efficient content distribution in a heterogeneous environment. In
Chapter 5, we conduct a simulation-based performance study to show the efficiency of
the model and framework. We conclude this thesis in Chapter 6.


Chapter 2
Related Works

The rapid growth in the number of personal devices used to access the Web has led to
increasing demand for customized Web content delivery. These devices include smart
phones, personal digital assistants (PDAs), game stations, notebooks and desktop PCs.
Given the wide variation in their computation power and display capabilities, together
with different user preferences, it is very important to adapt the Web content to suit the
needs of different users. Because of this, Web servers and proxies have started to perform
various functions on the content before delivering it to the users.

In the following, we will outline approaches from different aspects. There are general
frameworks for customized content distribution, content description models for providing
Web content descriptions, mechanisms to support descriptions for device capabilities and
user preferences, as well as existing software tools to do content adaptation.



2.1 Framework for Customized Content Delivery
There are many frameworks for Web content customization. In the following, we
introduce two well-known frameworks: the Internet Content Adaptation Protocol (ICAP) [14]
and Open Pluggable Edge Services (OPES) [36]. We discuss ICAP first, followed by OPES.

ICAP, the Internet Content Adaptation Protocol, is a protocol designed to provide simple
Web object based content vectoring for HTTP services. It is essentially a lightweight
protocol for executing a “remote procedure call” on HTTP messages. In other words,
ICAP clients can pass HTTP messages to ICAP servers for some kind of content
modification. The ICAP server executes its own processes on the messages and sends a
response back to the client, usually containing the modified messages. The modified
messages may be either HTTP requests or responses. Figures 2.1 and 2.2 show the flow of
HTTP messages under the ICAP protocol for response modification and request modification.

Figure 2.1. ICAP Response Modification


Figure 2.2. ICAP Request Modification

As the above diagrams show, the ICAP server is a dedicated server that off-loads specific
Internet-based content modification from the origin server, thereby freeing up
resources on the origin server and standardizing the way in which content modification
can be implemented.
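
The request and response modification flows described above can be sketched as follows. This is a conceptual simulation only: it does not implement the ICAP wire protocol, the two services are modeled as plain functions, and every name in it (`HttpMessage`, `rewrite_blocked_request`, `strip_ad_marker`, `Proxy`) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class HttpMessage:
    """A simplified HTTP request or response."""
    start_line: str
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


# A modification service is modeled as a function from message to message.
Service = Callable[[HttpMessage], HttpMessage]


def rewrite_blocked_request(request: HttpMessage) -> HttpMessage:
    """Hypothetical request modification: redirect requests for blocked URLs."""
    method, url, version = request.start_line.split(" ")
    if url.startswith("/blocked"):
        return replace(request, start_line=f"{method} /denied.html {version}")
    return request


def strip_ad_marker(response: HttpMessage) -> HttpMessage:
    """Hypothetical response modification: remove a marker from the body."""
    body = response.body.replace(b"<!-- ad -->", b"")
    headers = {**response.headers, "Content-Length": str(len(body))}
    return replace(response, headers=headers, body=body)


class Proxy:
    """Plays the ICAP client: hands messages to the modification services."""

    def __init__(self, reqmod: Service, respmod: Service, origin: Service) -> None:
        self._reqmod = reqmod      # stands in for a request modification service
        self._respmod = respmod    # stands in for a response modification service
        self._origin = origin      # stands in for the origin Web server

    def handle(self, request: HttpMessage) -> HttpMessage:
        adapted_request = self._reqmod(request)    # request modification flow
        response = self._origin(adapted_request)   # forward to the origin server
        return self._respmod(response)             # response modification flow
```

The proxy never needs to know what a service does; it only passes the message on and uses whatever comes back, which is the off-loading idea described above.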

Similar to ICAP, the OPES working group [36] is chartered to define a framework and
protocols to authorize and invoke services that perform functions on Web objects. It
extends the functionality of a caching proxy to provide additional services that mediate,
modify, and monitor object requests and responses.

In general, both frameworks are proposed to provide support for almost any web
service that modifies Web content. That means anyone can provide any function via these
frameworks. However, applying functions to Web content only helps to adapt the content
to the special needs of users; we cannot rely on such “special” functions to handle
issues related to the efficiency of content delivery in a heterogeneous environment. In the
next section, we will look at content description models that describe web content to
facilitate customized web content delivery.

2.2 Content Description Model
In this section, we review approaches to describing web content to facilitate customized
web content delivery. We again discuss two well-known content description
models, namely InfoPyramid [31] and Resource Description Framework (RDF) [3].

InfoPyramid is a representation scheme for handling Web content (text, image, audio and
video) hierarchically along the dimension of fidelity/resolution (in different quality but in
the same media type) and modality (in different media type). This representation scheme
is shown in Figure 2.3. The representation scheme includes methods for analyzing,
filtering, translating, and manipulating the Web content.

Figure 2.3. InfoPyramid Model


In the InfoPyramid model, the content is authored in XML [44], allowing the author to
provide more information to the system performing content modification, as only limited
information about the content can be deduced from an HTML page directly. The content
is later converted to HTML prior to delivery. The authored content is analyzed to
extract information that will be useful in adaptation. Two types of content analysis are
performed.

First, each component of the content is analyzed to determine its resource requirements.
These requirements are content size, display size, streaming bit-rate, color requirements,
compression formats, and hardware requirements.

Second, the semantics of the content are determined in the context of the entire document.
After all this information has been obtained, different modules can be chosen to convert
the content into different versions with various resolutions and modalities. This
conversion is done offline, at content creation time. The multiple versions of the content,
along with any associated meta-data, are then stored. When a request arrives, the web
server determines the user device capabilities, selects the best fidelity and/or modality,
and delivers the object in a suitable delivery format to the user.
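
The selection step described above can be sketched as a choice among pre-generated versions. This is a simplified illustration and not the InfoPyramid implementation: the `Version` and `Device` fields, the size constraint, and the modality preference order in `MODALITY_RANK` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Version:
    modality: str     # media type, e.g. "video", "image", "text"
    fidelity: int     # higher means better quality within the same modality
    size_bytes: int


@dataclass
class Device:
    modalities: Set[str]   # modalities the device can present
    max_bytes: int         # assumed limit on the size of a delivered object


# Assumed preference order between modalities (richer media preferred).
MODALITY_RANK = {"video": 3, "image": 2, "audio": 1, "text": 0}


def select_version(versions: List[Version], device: Device) -> Optional[Version]:
    """Pick the richest stored version that the device can handle."""
    candidates = [
        v for v in versions
        if v.modality in device.modalities and v.size_bytes <= device.max_bytes
    ]
    if not candidates:
        return None
    # Prefer the richer modality first, then the higher fidelity within it.
    return max(candidates, key=lambda v: (MODALITY_RANK.get(v.modality, -1), v.fidelity))
```

Because all versions are generated offline, the work at request time reduces to this kind of lookup.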

Resource Description Framework (RDF) is another general purpose content description
framework. This framework is based on XML and uses a collection of triples to provide
descriptions. A triple consists of a subject, a predicate and an object. The assertion of an
RDF triple says that some relationship, indicated by the predicate, holds between the
things denoted by subject and object of the triple. A set of such triples is called an RDF
graph. This can be illustrated by a node and directed-arc diagram, in which each triple is
represented as a node-arc-node link (hence the term "graph").

The assertion of an RDF graph amounts to asserting all the triples in it, so the meaning of
an RDF graph is the conjunction (logical AND) of the statements corresponding to all the
triples it contains. Note that the subject of a triple can be anything that can be
referenced by a URI. External Annotation [2], proposed by W3C, suggests a way to
reference any node of an XML document. A well-formed HTML page can be parsed into a
tree, and External Annotation can be used to create a URI for any node in the HTML parse
tree. Combining RDF and External Annotation therefore gives a very flexible approach to
providing any description of any node in a well-formed HTML Web page.
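
The triple and graph notions above can be made concrete with a small sketch. It models a graph as a plain set of tuples and does not use any RDF library or RDF/XML syntax; the URIs, the `ex:` predicate names, and the `#node-img-1` node reference are hypothetical and do not follow the actual External Annotation addressing syntax.

```python
from typing import List, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str     # anything that can be referenced by a URI
    predicate: str   # the relationship that holds
    obj: str         # the value or resource the subject is related to


def match(graph: Set[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Hypothetical URIs: one for a page, one for a single node inside that page.
PAGE = "http://example.org/page.html"
IMAGE_NODE = PAGE + "#node-img-1"

# The graph asserts all of its triples at once (logical AND).
GRAPH: Set[Triple] = {
    Triple(PAGE, "ex:ttl", "300"),
    Triple(IMAGE_NODE, "ex:format", "image/png"),
    Triple(IMAGE_NODE, "ex:width", "640"),
}
```

For example, `match(GRAPH, subject=IMAGE_NODE)` returns the two descriptions attached to the image node, without the page content itself being changed.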

Of the two content description models, the InfoPyramid approach provides a model to
generate and organize web content in different versions. This is a one-for-all approach:
it tries to handle all types of content in the container object (usually an HTML object),
including text, images, videos, etc. However, it relies on content descriptions (embedded
in XML format) to determine the resource requirements of the content, so it may
not work on HTML objects that have no extra content descriptions embedded. RDF is a
general and flexible framework for providing content descriptions. Combining RDF and
External Annotation is a very useful approach to providing arbitrary descriptions of
Web content without changing the content at all. Our new content description
model uses this idea to provide flexibility.

2.3 User Descriptions
To deliver the best-fit presentation of content to the users, we first need descriptions of
the user preferences and device capabilities. W3C has proposed the
Composite Capability and Preference Profile (CC/PP) [10] to achieve this goal. The Wireless
Application Protocol (WAP) Forum has proposed a similar approach named User Agent
Profile (UAProf) [37] to handle user descriptions. Both CC/PP and UAProf are based on
Resource Description Framework (RDF) [38] and aim at describing and managing
software and hardware profiles. In our framework for efficient content distribution, we
can use CC/PP or UAProf to provide descriptions of user preferences and device
capabilities.

2.4 Server Side Approaches

Besides descriptions about the clients, there are also approaches on the server side to
address the issue of customized content delivery. Approaches in this category fall into
two main streams: providing web content descriptions or giving instructions on how to
process web content from the web server. We will introduce examples in these two
streams in the following part of this section.


W3C has proposed a working draft on content selection for device
independence [17]. It specifies a processing model for general purpose selection. Selection
involves conditional processing of various parts of an XML information set according to
the results of the evaluation of expressions. These logical expressions are associated with
some parts of the information set and are processed at run time. Using this
mechanism, some parts of the information set can be selected for further processing and
others can be suppressed. The parts of the infoset affected and the
expressions that govern processing are specified by means of an XML-friendly syntax,
which includes elements, attributes and XPath [45] expressions. When this selection
mechanism is used with HTML objects, the logical expressions are embedded into the HTML
objects and evaluated at run time to determine which parts to include.
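
The conditional processing described above can be sketched as follows. This is not the syntax or processing model of the W3C draft: conditions are modeled as ordinary Python functions instead of embedded XPath expressions, and the context keys (`color_display`, `screen_width`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, object]                # delivery context, e.g. device properties
Condition = Callable[[Context], bool]      # stands in for an embedded expression
Part = Tuple[Optional[Condition], str]     # (condition or None, markup fragment)


def select_parts(parts: List[Part], context: Context) -> str:
    """Keep unconditional parts and parts whose condition holds at run time."""
    kept = [
        markup for condition, markup in parts
        if condition is None or condition(context)
    ]
    return "".join(kept)


# A page whose image is included only on color displays that are wide enough.
PAGE_PARTS: List[Part] = [
    (None, "<h1>News</h1>"),
    (lambda ctx: bool(ctx.get("color_display")) and ctx.get("screen_width", 0) >= 320,
     "<img src='photo.png'/>"),
    (lambda ctx: not ctx.get("color_display"), "<p>[photo omitted]</p>"),
    (None, "<p>Story text</p>"),
]
```

The same stored page thus yields different delivered content for different delivery contexts, with the suppressed parts never reaching the user.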

ESI [1] uses a mechanism similar to W3C’s content selection. Logical expressions are
embedded with ESI markup into HTML objects and evaluated at run time to determine which
fragments will be selected. However, the main purpose of ESI selection is dynamic content
assembly for different users.

Besides providing content descriptions on the server side, there are also approaches
in which web servers give explicit guidance that allows a proxy to make the best
choice when modifying web content. An example of this approach is server-directed
transcoding [33] by Mogul et al. They proposed new HTTP header directives, by which a
web server can give hints to a proxy on how to modify a web object. They also proposed
the use of applets (Java, Perl, etc.) to modify the web object according to the web
server’s guidance.

In the above approaches, the web server either gives instructions on how to customize
web content delivery, or provides descriptions of the content so that other
web intermediaries can perform the task. In the next section, we will introduce several
software tools that provide customized HTML content to clients.

2.5 Existing Software Tools
There are numerous software tools on the market that provide customized Web content
according to different clients’ needs. These software tools include WebSphere
Everyplace Mobile Portal [46], WebLogic Portal [47], etc. By examining the users’
hardware capabilities and preferences, they can transform HTML content into different
markup languages such as WML, change the page layout to suit different screen sizes,
filter out parts of HTML objects that clients are not interested in, and so on. In the
above-mentioned software, the content description is embedded in the content via special
markup. User preferences are stored locally on the server when a user registers with the
server. Device capabilities are retrieved from specialized external repositories such as
Volantis [48]. As we can see, these commercial software tools are able to support
certain content transformation functions, but their implementations are proprietary and
not easily extensible to support other functions.


Different software tools may provide different sets of options depending on the software
design. If a new type of content emerges, clients have to wait for an update of the
software before the new type of content can be handled. Thus extensibility and flexibility
are a problem for these existing software tools.

2.6 Summary
This chapter has reviewed, from different aspects, some of the approaches relevant to
customized content delivery to clients. ICAP provides a framework in which almost any
service for customized web content delivery can be implemented. The service is provided by
redirecting HTTP requests or responses to dedicated ICAP servers. OPES provides a similar
system. The InfoPyramid approach provides a model to generate and organize different
versions of content in HTML objects. Different versions can be selected when a client
sends a request for a particular HTML object. However, the generation of different
versions of content relies on content descriptions and specific modules.

To support content customization for different clients, we need descriptions of the
clients’ preferences and capabilities as well as of the content. On the client side, there
are frameworks such as CC/PP and UAProf to handle descriptions of clients. On the server
side, approaches like ESI provide content descriptions through their own markup languages,
but their focus is on dynamic content assembly and caching. Other approaches like
server-directed transcoding provide server guidance on how to perform content
customization. However, they emphasize the transformation of embedded objects in HTML
pages.

There are also existing software products like IBM WebSphere Mobile Portal and BEA
WebLogic Portal that provide content adaptation according to the device capabilities of
users. However, different software provides different sets of options to clients, and there
is no standard way to map all the clients’ preferences to the options provided by the
software. From the above, no direct solution in the literature addresses the efficiency
issue in content delivery. We can make use of existing frameworks such as ICAP and
CC/PP to support our model, but we need to add elements to our model to improve
efficiency. In the next chapter, we will explain our own content description model in
detail.


