QuantNet – A Database-Driven Online
Repository of Scientific Information
A Master’s Thesis Presented
by
Anton Andriyashin
(188779)
to
Prof. Dr. Wolfgang Härdle
CASE – Center of Applied Statistics and Economics
Humboldt University, Berlin
in partial fulfillment of the requirements
for the degree of
Master of Science
Berlin, June 20, 2007
Declaration of Authorship
I hereby confirm that I have authored this Master's thesis independently and without the use of
resources other than those indicated. All passages taken literally or in substance from
publications or other resources are marked as such.
Anton Andriyashin
Berlin, June 20, 2007
Contents
1 Introduction
1.1 Motivation
1.2 QuantNet: A Look Inside
1.3 An Online Repository of Information
1.4 What Is Wrong With Regular HTML Publishing?
2 Single Document Setup
2.1 Typical Structure of a Submitted ASCII File
2.2 What is XML?
2.3 XML and XSLT – A Single Document in HTML
2.4 ASCII to XML: Atox and XSLT
3 Multiple Documents Setup
3.1 From a Single Document to Multiple Documents
3.2 mySQL and PHP
3.3 Javascript, CSS and PHP
3.4 Putting Everything Together
4 Possibilities of QuantNet's Core
4.1 Scalability – User-defined Tags
4.2 Ease of Administration
4.3 Ways to Make QuantNet Even More Powerful
4.4 Concluding Remarks
1 Introduction
1.1 Motivation
Many sociologists consider the 21st century to be the true beginning of the new information era.
Given the amount of information generated every day and the share of that information
represented on the World Wide Web, it is no exaggeration to say that online media have
become at least as important as their paper-based counterparts.
Already in the 1980s the OECD recognized the importance of information as an asset in the global
economy [10; 11] and has been using the definition of Porat to characterize an information
economy [12]: one in which at least 50% of the GNP is produced in the so-called primary or
secondary information sectors, i.e. sectors that employ information goods and services directly in
production, distribution or information processing, or information services produced for internal
consumption by companies that do not produce information for sale, and by government [7].
Nowadays information is becoming one of the most valuable assets in the world economy.
New information technologies can broaden horizons and tackle traditional challenges in
unexpected ways; consider, for instance, the Hypertext Markup Language (HTML).
Its first published specification was drafted by Berners-Lee with Dan Connolly and was published
in 1993 by the IETF [5], and already in 2000 HTML became an international standard (ISO/IEC
15445:2000). This language offered a new way of navigating content: the reader can switch
quickly between author-defined parts of a document via so-called hyperlinks. Having realized the
many advantages of modern IT, many "offline" journals and magazines established an online presence
with ways of delivering information that paper-based editions lack. High-resolution photographic
materials, video content, audio podcasts, quick search, archives and links to other resources
are just a few of the comme il faut elements that almost any high-end online edition has to offer
today. In the last 15 years HTML has become the cornerstone of the whole Internet, which has
changed the lifestyle of the majority of people on the planet.
And HTML is just one example. Newer markup languages like the Extensible Markup Language
(XML) and its derivatives (e.g. VoiceXML) structure data only and do not contain direct
style instructions (as opposed to, for instance, font color and many other similar properties in
HTML). With Extensible Stylesheet Language Transformations (XSLT), an XML document can
be transformed into another XML document, an ASCII, HTML or even PDF file. It is therefore
no accident that many web sites employ a data-driven approach to construction and are
able to provide automatic updates based on new incoming information in real time. Imagine
an online weather forecast service that receives remote data (in XML format) from a research
center and is able to update the site with almost no delay. If HTML were used instead of XML
and XSLT, the update would take much longer because of the manual corrections required inside
the HTML code. And since XML is now supported by many prominent software applications, e.g.
Oracle, Informix, mySQL or Microsoft Office, not to mention the support of XML by many
programming languages like PHP, Python and Perl, the potential for employing XML on the
Internet is enormous.
A substantial portion of the scientific knowledge produced is later presented online to share new
ideas with colleagues all over the world. The last decade clearly showed the importance of Internet
presence, resulting in the emergence of different citation index systems like ISI, Scopus, CiteSeer,
RePEc, Google Scholar and others. But while the case for online presence is quite clear, it is
not always clear how to present that information, since substantial technical difficulties can arise
while establishing one's own online content system, e.g. in the framework of a research institution.
The aim of this work is to provide a semi-automated core called QuantNet that allows significant
amounts of scientific information to be published online in a situation where regular updates are
expected and, most importantly, where the authors of submitted materials are not assumed to know
any markup language, i.e. the materials can be submitted as ASCII files with the simplest
structure.
It is not the ultimate goal of this study to provide a ready-to-ship commercial web application.
Instead, the implementation of the core and the examination of its possibilities and limitations
are of particular interest. At the same time, only minor effort should later be needed to deploy a
full-scale online system based on the created core.
1.2 QuantNet: A Look Inside
Consider a hypothetical example of a project or a procedure that is about to be submitted online
via QuantNet. A typical ASCII file could look as follows:
@Area SFM
@Name Autocorrelation Plots
@Function_call SFMacfar2()
@Description Plots the autocorrelation function of AR(2) process
@Revision 1.2
@Author Christian Hafner, 2007-01-06

lag = 30 ; lag value
a1 = 0.5 ; value of alpha_1
a2 = 0.4 ; value of alpha_2
input = readvalue("alpha1"|"alpha2"|"lag", 0.5|0.4|30)

The first part of the file contains some general information about the project like its name,
author and so on, while the second part may contain a detailed description and/or computer code.
As can be seen from the listing, the ASCII file does not contain any language-specific markup
tags. The only tags employed are natural field descriptors, preceded by the @ symbol. The author
of the submitted document does not have to care about auxiliary properties like font size, color,
family and so on. The only requirement is to follow the sample structure.

But what is the next step? How is this ASCII file to be transformed into a well-formed HTML
file that will ultimately be rendered by the client's browser? Several steps have to be
taken, but the crucial one is to transform the data from the ASCII file into an XML file
that could look as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<quantlet>
  <name>Autocorrelation Plots</name>
  <area>SFM</area>
  <function_call>SFMacfar2()</function_call>
  <desc>
    Plots the autocorrelation function of AR(2) process
  </desc>
  <rev>1.2</rev>
  <author>Christian Hafner, 20070106</author>
</quantlet>
At the same time advanced users of QuantNet should be able to benefit from the full range of
possibilities offered by native HTML, XML and XSLT, so inline tags, if present, should
be processed adequately. For instance, if the <bold> tag is allowed in the ASCII file and
stands for its HTML counterpart <b>, then QuantNet should be able to process the following
ASCII file adequately:
@Area SFM
@Name Autocorrelation Plots
@Function_call SFMacfar2()
@Description Plots the <bold>autocorrelation function</bold>
of AR(2) process
@Revision 1.2
@Author Christian Hafner, 2007-01-06

lag = 30 ; lag value
a1 = 0.5 ; value of alpha_1
a2 = 0.4 ; value of alpha_2
input = readvalue("alpha1"|"alpha2"|"lag", 0.5|0.4|30)

The file is screened for the <bold> tag, which is replaced with the valid HTML <b> tag
for bold text style.
<?xml version="1.0" encoding="ISO-8859-1"?>
<quantlet>
  <name>Autocorrelation Plots</name>
  <area>SFM</area>
  <function_call>SFMacfar2()</function_call>
  <desc>
    Plots the <b>autocorrelation function</b> of AR(2) process
  </desc>
  <rev>1.2</rev>
  <author>Christian Hafner, 20070106</author>
</quantlet>
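The tag substitution described above can be illustrated with a minimal Python sketch. It is not the code used in QuantNet: the mapping table, the function name translate_tags and the <italic> entry are assumptions made for illustration, and only the <bold> to <b> pair comes from the text. The sketch also leaves open whether the substitution happens during string processing or later in XSLT.

```python
import re

# Hypothetical mapping from user-friendly tags to HTML tags. Only the
# <bold> -> <b> pair is taken from the text; <italic> is an assumption.
TAG_MAP = {"bold": "b", "italic": "i"}

# Matches simple opening and closing tags without attributes, e.g. <bold>, </bold>.
_TAG_RE = re.compile(r"<(/?)(\w+)>")


def translate_tags(text, tag_map=TAG_MAP):
    """Replace every mapped tag name; leave unmapped tags untouched."""
    def swap(match):
        slash, name = match.groups()
        return "<%s%s>" % (slash, tag_map.get(name, name))
    return _TAG_RE.sub(swap, text)


line = "Plots the <bold>autocorrelation function</bold> of AR(2) process"
print(translate_tags(line))
```

Because unmapped tags pass through unchanged, extending the set of allowed tags in such a design would only require a new entry in the mapping table.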
And, of course, extra tags are by no means limited to text formatting. In principle
even MathML, when supported by the browser (e.g. Mozilla Firefox), should be displayed
adequately inside, say, the <math> tag. That can be very handy for documents containing a lot
of formulas.
QuantNet is supposed to deliver such a degree of scalability that almost any HTML tag or
combination of tags could later be defined as a simpler and more user-friendly tag allowed in
input ASCII files.
There exist several solutions that may be helpful in this field, e.g. the lightweight markup language
Textile, which converts simple ASCII files into well-formed XHTML and allows some formatting
variations [4], or AsciiDoc, aimed at writing short documents in ASCII to be converted into HTML [1].
These tools can be good at solving some specific web-oriented tasks but are not sufficient to build
a complete and scalable content system like QuantNet.
In this work the presentation of the content is left solely to XSLT, while string manipulation
accounts only for the preparation of the necessary raw data files in XML. The logic of Textile, for
instance, is employed exactly at this stage, while creating XML files out of submitted ASCII files,
but with one key difference: no style options that appear later inside the HTML code are considered
at this stage.
Here is a short overview of the study. In the first part of the work, content system administration
issues are considered. Section 1.3 analyzes a setup typical for a research institution with multiple
departments and no common IT system. Section 1.4 considers the challenge from the technical
point of view, namely through the nature of the HTML language.
Part 2 focuses on implementation issues in the single document framework. Section 2.1
introduces the structure of ASCII files recommended by QuantNet. XML is presented in Section 2.2
in the context of a dynamic weather forecasting web application. Section 2.3 focuses on the
conjunction of XSLT and XML and discusses the issues that a multiple document web application
would face if HTML were the only language employed.
Part 3 primarily concerns the multiple document nature of QuantNet. Section 3.1 provides
a short overview of the implementation tools necessary to deploy a web application of this type.
Section 3.2 introduces mySQL, a popular database management system for online applications,
and PHP, a scripting language that is mostly used on the server side. Later, in Section 3.3, the
motivation for Javascript as a client-side scripting language in conjunction with PHP and CSS
is provided. A step-by-step overview of the implementation of QuantNet, available in Section 3.4,
concludes Part 3.
Finally, Part 4 focuses on the potential of QuantNet as a scalable web application. Section 4.1
concentrates on the ability of QuantNet to handle a potentially unlimited number of additional tags
in ASCII files. Several useful applications of this feature as well as the implementation logic are
provided there. In Section 4.2 the process of adding a new project to QuantNet or changing
the application's content structure is considered. Validation by means of XML schemas and
analytic grammar, based on Backus-Naur form, as well as scripting are discussed in Section 4.3.
1.3 An Online Repository of Information
A typical application of QuantNet could be an online interdisciplinary repository of research
materials submitted by various parties, from professional researchers to university students. These
materials could contain not only results and algorithm descriptions, which is the traditional form of
almost any publication, but also source code, when available, as well as other supplementary data
at the author's discretion.
If the target institution is a big organization like a leading university or a research center, then
in many cases different departments introduce their own web-publishing standards and may have
different hosting. That could create additional obstacles for an end-user of the content, who may be
unsure where to find this content and, more importantly, may wonder whether relevant
research results are available at some other department.
Therefore, the aim of QuantNet is to introduce a centralized system made up of
documents from different scientific areas submitted by various departments. Centralized content
management not only eases navigation for the end-user but also provides significant advantages
in terms of administration. All documents are located on the same server, they share the same
predefined structure, they are easy to catalog, and if there is a decision to change the structure
of QuantNet in some way, for instance, to introduce a new version of the layout, these changes can
be applied automatically, and they will not affect the original submitted ASCII files.
Every publishing entity like a journal has its own styling instructions. The same applies to
web-publishing. The aim of QuantNet is to avoid any prerequisites that come from the markup
field. Instead, QuantNet imposes only a few restrictions on the original ASCII data files with
submitted projects, so that each of them contains the author's name, the name of the project etc.;
refer to Table 2 for more details. And that is all! A researcher need not worry about what font
size to employ for a certain heading unless he or she knows the specific HTML tags and wants to
take advantage of them.
In this sense the submitted ASCII files are normally to contain only data and a minimum number of
(or no) markup tags. This is the fundamental feature of QuantNet: a user supplies a structured
data file, and QuantNet semi-automatically processes this file and incorporates it in the proper cell
of the system. The plain data ASCII file becomes a well-formed HTML document with adequate
graphic elements and navigation tools.
Structured fields like @Author, @Name or @Area inside ASCII files could potentially lead to
an effective search mechanism inside QuantNet. Although it is not a goal of this study to introduce
elements like that, this possibility is important to mention.
1.4 What Is Wrong With Regular HTML Publishing?
While the advantages of submitting a research study as a data ASCII file may be clear for a person
who does not know HTML for online publication, several administration aspects,
which may be not so obvious, are worth mentioning here.
Suppose the author of the project to be published online has the file in HTML format. Does
that automatically mean that this person knows HTML? Not necessarily. Even Microsoft Word,
one of the most widely used word processors, can save its output as an HTML file [8]. LaTeX2HTML
is another solution for those preferring LaTeX to Word.
So what is wrong with HTML as a submission format or even Microsoft Word that can be later
converted to HTML? If there is a single document to be published, nothing is probably wrong. If
there are style and/or document structure prerequisites, they can be matched. However, in the
multiple documents setup several problems do arise.
Imagine that, for instance, a new version of the graphical design of the web application is to be
introduced. If, say, there are 500 HTML documents in the system, each of them
must be changed one by one! Not to mention the problems of navigation across these individual
files and the difficulty of introducing content-driven dynamic functions like the automatic
generation of links to auxiliary materials given, for instance, the project name.
Figure 1: BBC Weather web page: navigation menu example
Would it not be better if the user intending to name the project had to type something
like:
@Name My First Project in QuantNet
instead of caring about all the necessary HTML tags, which may easily take the following form:
<p style="color: red; margin-left: 20px; font: normal oblique 16px">
My First Project in QuantNet</p>

And this is only one element – name. An HTML document with rich formatting contains dozens
of such elements. If one of them is to be changed, then all the documents in the system have to be
updated.
And what about the navigation? Assuming that HTML frames are not employed, following the
recommendations of leading web designers, there is no easy solution for navigation elements
in a multiple documents setup.
Figure 1 shows the BBC Weather web site as an example of a system that provides
navigation to elements created out of raw XML data in real time. It is important that the left part
of the page contains some fixed links like Weather Home, UK, World, Sports, Cost and Area, Climate
Change and others, connecting different documents into a coherent weather forecasting portal.
Working with a single page does not restrict the end-user in any way: the links to other areas
are always available. At the same time the current document does not contain any explicit links
to other parts of the portal because these links are created on the fly. If plain HTML were employed,
the left navigation pane would be unavailable unless every page were post-processed manually to add
these controls, a real nightmare for an administrator.
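The idea of links created on the fly can be sketched in a few lines of Python. This is only an illustration of the principle, not the BBC's or QuantNet's implementation; the function names, the URLs and the page layout are hypothetical.

```python
from html import escape


def render_nav(links):
    """Build the navigation pane once from a list of (title, url) pairs."""
    items = "".join(
        '<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(title))
        for title, url in links
    )
    return "<ul>%s</ul>" % items


def render_page(title, body, links):
    """Wrap one document in the shared layout; no page stores the links itself."""
    return "<html><body>%s<h2>%s</h2><p>%s</p></body></html>" % (
        render_nav(links), escape(title), escape(body))


# Hypothetical link list; changing it here changes every generated page.
links = [("Weather Home", "/weather/"), ("UK", "/weather/uk/")]
pages = [render_page(t, "forecast text", links) for t in ("London", "Berlin")]
print(pages[0])
```

Since the link list lives in one place, adding a navigation entry requires no change to any individual document, which is the property that plain HTML files lack.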
Fortunately XML, XSLT and PHP could provide a much more efficient solution in these terms.
Before addressing these areas in more detail, let us have a closer look at what QuantNet gets as
an input: an ASCII file with raw project information.
2 Single Document Setup
2.1 Typical Structure of a Submitted ASCII File
Since every XML file normally contains only raw structured information, as in a database, the
use of ASCII files with no markup elements fits XML perfectly in this sense.
The very basic structure of an ASCII file describing, say, some statistical procedure could look
as follows. The first file block refers to project cataloging, i.e. it contains relevant information
about authors, software platform, project stage and so on. The aim of this substructure is to present
summarized information about the project in a compact form when it is being viewed by the
end-user.
The second block contains the project itself with supplementary computer code for this
particular example. Most of the information is located in the @desc field, while @input and @output
refer solely to the algorithm implementation.
// head block
@project_name   // name of the project
@area           // project area
@short_desc     // short project description
@function_call  // function call for the attached computer procedure
@matlab         // Boolean (yes/no): indicates if procedure is implemented in MATLAB
@R              // Boolean (yes/no): indicates if procedure is implemented in R
@author         // author name and contact details
@revision       // project status

// main part
@input          // input arguments for the attached computer procedure
@output         // variables returned by the procedure
@desc           // detailed project description
Of course, QuantNet is not limited to exactly this setup; it is just an example of
the possibilities of QuantNet's core. For instance, social science projects may have no computer
code attached, in which case the @input and @output cells are either left empty or excluded from
the master design template. In general such a setup can adequately represent help- or
description-based content; consider, for instance, the help system of MATLAB as an example of
a structurally rigorous repository of computer algorithm descriptions.
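A small Python sketch may make the notion of filled, empty and excluded cells concrete. The field names follow the sample structure above; the choice of which fields are optional, as well as the function name check_fields, is an assumption made for this illustration only.

```python
# Field names follow the sample structure above; which of them are optional
# is an assumption made for this sketch only.
KNOWN_FIELDS = [
    "project_name", "area", "short_desc", "function_call", "matlab", "R",
    "author", "revision", "input", "output", "desc",
]
OPTIONAL_FIELDS = {"function_call", "input", "output"}


def check_fields(fields):
    """Return (missing required field names, unknown field names)."""
    # A field counts as filled only if it contains non-whitespace text.
    filled = {name for name, text in fields.items() if text.strip()}
    missing = [n for n in KNOWN_FIELDS
               if n not in OPTIONAL_FIELDS and n not in filled]
    unknown = sorted(set(fields) - set(KNOWN_FIELDS))
    return missing, unknown


# A project without computer code: @input is empty, @output is left out.
survey = {
    "project_name": "Survey", "area": "Sociology", "short_desc": "A survey",
    "matlab": "no", "R": "no", "author": "A. Author", "revision": "1.0",
    "desc": "Full description", "input": "",
}
print(check_fields(survey))
```

Under these assumptions a description without code passes the check, while a file lacking, say, @author would be reported before any XML is generated.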
While XSLT is a style template applied to a given XML file, the XML itself can be generated from
the submitted ASCII file. This important aspect is addressed in Section 2.4.
2.2 What is XML?
XML, the Extensible Markup Language, is a markup language with user-defined tags used for
information management. While HTML is another markup language and XML files are used to
create HTML output, there are several noticeable differences between these two languages.
Figure 2: BBC Weather web page: generation of dynamic HTML out of raw XML files with weather
forecast data
First and foremost, HTML consists of a fixed set of allowed tags that are in charge of the
proper representation of text and graphics on the web page. An XML file can contain any
user-defined tag; in fact there are no predefined tags at all. And how is that possible for a
language? Since the aim of XML is just to structure the information and not to display it, this
approach is quite natural, because it is impossible to make predefined templates for all or at
least the vast majority of different information sets.
Let us continue with the example of the BBC Weather web page. In Figure 2 one can see the
forecasts made for five days of the current week.
Figure 3: BBC Weather web page: Tuesday forecast
Since the object of main interest is the forecast, the root tag can have the same name,
<forecast>. Following the structure of the page in Figure 2, every forecast gets its own group tag
like <tuesday> or <wednesday>. Inside these tags all numerical information appearing on the
page can also be easily structured by imposing extra tags for each element, e.g. the forecasted sun
index or the wind speed; this information may then be stored inside the <sun_index> or <wind_sp>
tags. A part of the underlying XML file that is used to generate the forecasts could look as
follows.
<?xml version="1.0" encoding="ISO-8859-1"?>
<forecast>
  <!-- shared web page information, e.g. headers -->

  <tuesday>
    <sunrise_h>5</sunrise_h>
    <sunrise_m>12</sunrise_m>
    <sunset_h>20</sunset_h>
    <sunset_m>53</sunset_m>
    <p_weather>light_rain</p_weather>
    <max_day>13</max_day>
    <min_night>7</min_night>
    <wind_sp>10</wind_sp>
    <wind_dir>NE</wind_dir>
    <visibility>moderate</visibility>
    <pressure>1013</pressure>
    <r_humidity>85</r_humidity>
    <sun_index>2</sun_index>

    <!-- other elements that may refer to Tuesday forecast -->
  </tuesday>

  <wednesday>

  </wednesday>

  <saturday>

  </saturday>
  <!-- other shared web page information, e.g. footers -->

</forecast>
If the code in this segment is compared with the HTML output in Figure 3, most of the tags
employed are self-explanatory; see Table 1 for details.
XML tag name    Description
<sunrise_h>     Hours part of the sunrise time
<sunrise_m>     Minutes part of the sunrise time
<sunset_h>      Hours part of the sunset time
<sunset_m>      Minutes part of the sunset time
<p_weather>     Predominant weather
<max_day>       Maximum temperature during the day
<min_night>     Minimum temperature during the night
<wind_sp>       Wind speed
<wind_dir>      Wind direction
<r_humidity>    Relative humidity
Table 1: Description of some of the employed XML tags
Once again: unlike HTML, XML is not designed to present the stored content directly;
instead, the HTML code processed by the browser on the end-user's computer is generated on the fly
from XML. Although only part of the data from XML goes directly to the web page, e.g. the maximum
day temperature or the wind speed, other elements like the predominant weather category trigger
specific graphic objects to appear; consider once again Figure 2 and the forecasts for different
days of the week. There the predominant weather condition determines the image used to indicate
it in a more obvious form, so one can clearly distinguish between the weather states of Tuesday
and Thursday, for instance. Analogously, graphic objects for the wind direction are put on the
page according to the values stored in the <wind_dir> element.
However, how the final HTML output comes to look as it does in, for
instance, Figure 2 has not been described so far. For example, why is the sign of the forecasted
sun index, which is equal to two on Tuesday, green, while the same sign for Wednesday, where the
sun index is expected to be equal to four, is yellow?
Different mechanisms could be employed to process the relevant information from XML.
Naturally, one could assume a tool that establishes a correspondence between the forecasted sun
index and the graphical object to appear on the HTML page. While there exist different tools for
this situation, XSLT is the one that is examined in more detail next. PHP could also
serve as such a tool if server-side processing is allowed: XSLT can produce the HTML output
on a client machine, while PHP is mostly a server-side scripting language. PHP will be covered in
more detail in Section 3.2, but from a perspective different from the one for XSLT.
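The correspondence between a stored value and a graphical property can be sketched in Python as follows. The sketch is neither XSLT nor PHP and does not reproduce the logic of the BBC site: the thresholds and the third color are hypothetical and were chosen only so that the two cases mentioned in the text (index 2 is green, index 4 is yellow) come out as described.

```python
from xml.etree import ElementTree as ET


def sun_index_color(index):
    """Map a sun index to a sign color (thresholds are hypothetical)."""
    if index <= 2:
        return "green"
    if index <= 5:
        return "yellow"
    return "red"


def day_colors(xml_text):
    """Return {day tag: sign color} for every day that stores a <sun_index>."""
    root = ET.fromstring(xml_text)
    return {
        day.tag: sun_index_color(int(day.findtext("sun_index")))
        for day in root
        if day.find("sun_index") is not None
    }


forecast = (
    "<forecast>"
    "<tuesday><sun_index>2</sun_index></tuesday>"
    "<wednesday><sun_index>4</sun_index></wednesday>"
    "</forecast>"
)
print(day_colors(forecast))
```

The XML file itself stores only the number; the decision about the color is made entirely by the processing layer, which is the separation of data and presentation discussed here.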
Let us switch back to the setup of QuantNet as a system. Assuming that submitted ASCII files
can effectively be transformed into well-formed XML documents (this challenge is described
in Section 2.4), the analogy with the example presented in Figure 2 is straightforward. If XSLT is
the way to get the HTML output from XML, then even ASCII files with no markup information
can later be turned into richly formatted HTML documents!
Figure 4 summarizes this process. With the structure rules defined as in Section 2.1, one is
able, first, to transform the project or algorithm documentation into an ASCII file with intuitive
and clear fields and tags, if any. Later, when the ASCII file is represented as a well-formed XML
document, HTML output is produced by applying an XSLT template to this XML object; that
aspect is covered in more detail in the next section.
Figure 4: From the project documentation to the ﬁnal HTML output
2.3 XML and XSLT – A Single Document in HTML
As mentioned before, XSLT is a powerful means to transform structured information from
XML into a richly formatted representation like HTML or even PDF. XSLT is a set of stylesheet rules
applied to specific portions of an XML document, resulting in the creation of different style
elements that are frequently content-driven. For instance, if the processed portion of information
from XML is a heading, then the XSLT template may apply the <h3> HTML tag.
Let us consider the following code example to understand the architecture of XSLT.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">

  <h2 class="padded_text">
    <xsl:value-of select="quantlet/project_name"/>
  </h2>

  <!-- RECURSIVE TEMPLATE APPLICATION -->
  <xsl:apply-templates/>
  <!-- END OF RECURSIVE TEMPLATE APPLICATION -->

</xsl:template>
The part of the template presented here is the heading of the actual template employed in
QuantNet. It is important that it has a scalable structure: the third line starts to define one
global template that matches the root of any XML document and thus serves as the entry point for
processing all of its elements. The final HTML content is mostly
generated through the recursive template application <xsl:apply-templates/> [9]. These templates
are defined after the <xsl:template match="/"> element is closed.
For instance, a summary table element (refer to Figure 9) defined in QuantNet's XSLT file
looks as follows:
<xsl:template match="head">
  <!-- SUMMARY TABLE -->
  <table>
    <div class="box" id="boxContainer">
      <div class="box" id="boxContent">
        <p class="padded_table">
          <xsl:apply-templates/>
        </p>
      </div>
    </div>
  </table>
  <!-- END OF SUMMARY TABLE -->
</xsl:template>
<xsl:template match="head/*">
  <p class="padded_table">
    <b>
      <xsl:value-of select="@print"/>:
    </b>
    <xsl:apply-templates/>
  </p>
</xsl:template>
Figure 5: Production of HTML output via XML and XSLT
As one can see, this is a very general table definition. Table styles are defined with the help
of HTML/CSS, and the number of table elements is arbitrary. So how does XSLT know where to stop
building the table? The <xsl:template match="head/*"> tag tells the processor to put into the
table those elements of the XML file that are direct children of the <head> tag. In this way one
can maintain a truly flexible structure of QuantNet: if later some extra elements are to be added,
for instance, to the summary table environment, it suffices to ensure that these elements are
present in the XML file and properly nested.
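To make the effect of the head/* rule more tangible, the following Python sketch imitates it outside of XSLT. It is an analogy rather than the actual QuantNet code: the sample XML with a <head> element and print attributes is inferred from the template above, and the fallback to the tag name when the attribute is absent is an added assumption.

```python
from html import escape
from xml.etree import ElementTree as ET


def render_summary(xml_text):
    """Emit one '<b>label:</b> value' row per direct child of <head>."""
    head = ET.fromstring(xml_text).find("head")
    if head is None:
        return ""
    rows = []
    for child in head:  # same selection as match="head/*"
        # The template reads the print attribute; falling back to the
        # tag name is an assumption of this sketch.
        label = child.get("print", child.tag)
        value = (child.text or "").strip()
        rows.append("<p><b>%s:</b> %s</p>" % (escape(label), escape(value)))
    return "".join(rows)


# Hypothetical input document, shaped after the XSLT template above.
doc = (
    '<quantlet><head>'
    '<author print="Author">Christian Hafner</author>'
    '<area print="Area">SFM</area>'
    '</head></quantlet>'
)
print(render_summary(doc))
```

Adding a further child element to <head> in the input produces a further row without any change to the rendering code, which mirrors the flexibility claimed for the XSLT rule.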
It is important that a single general XSLT file can be applied to numerous XML documents,
which makes it possible to define one metastyle and render appropriate HTML content for every
project file independently.
It is still an open question how to link all the documents together; that issue is addressed
later in Part 3 of the work. So far, however, little has been said about the ASCII-XML
transformation. With the desired XML and XSLT structure in mind, the proper translation of
ASCII files ensures the smooth functioning of QuantNet in the single document setup.
2.4 ASCII to XML: Atox and XSLT
One of the very first processing steps in QuantNet is the transformation of the structured ASCII
files into well-formed XML ones. But what does "structured" mean? As mentioned before,
an ASCII file is supposed to contain the predefined fields or cells.
Field name        Description
@project_name     Project name
@area             Project area, e.g. Economics, Informatics etc.
@short_desc       Project short description, usually one or two sentences
@function_call    Function call of the submitted computer program if any
@matlab           Boolean: yes/no value, indicates if the project is implemented in MATLAB
@R                Boolean: yes/no value, indicates if the project is implemented in R
@author           Project author's name
@revision         Project state, revision number if any
@input            Computer procedure (if any) input variable(s) description
@output           Computer procedure (if any) output variable(s) description
@desc             Full project/computer algorithm description
Table 2: Fields employed by QuantNet in ASCII files
By design, QuantNet ASCII files should have as simple and natural a structure as possible.
Therefore, the choice of auxiliary elements, such as the markers of a field's start or end,
should be transparent as well. Then even an inexperienced user would have no difficulty
understanding how to replicate this structure for his/her own project description for online
publishing.
How does a text processing unit look for a particular element? First of all, it looks
for a set of symbols defining the beginning of a field. QuantNet assumes that a field begins
every time the symbol @ appears. So, for instance, @ABCD would mean that the field ABCD is
about to begin.
Figure 6: Structured ASCII file creation
Certainly, the end of a field could be defined analogously, i.e. by introducing some extra
auxiliary symbol like #. Instead, QuantNet uses the double new line character as the trigger
for the end of a field, so there is no need to introduce an additional visible character. In
ordinary text, different paragraphs are separated by new lines. Since QuantNet is assumed to
contain possibly longer text descriptions, a single new line character would not suffice.
Therefore, the double new line seems the most appropriate choice. It is also worth mentioning
that if for some reason these symbol choices do not suit the system administrator of QuantNet,
other arbitrary choices could easily be adopted.
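The two delimiter rules (a field opens at @ and closes at a double new line) can be sketched in a few lines of Python. This is only an illustration of the convention, not the Atox-based pipeline QuantNet actually uses; the names FIELD_PATTERN and parse_fields and the sample text are assumptions, and treating the end of the file as a field terminator is an added convenience.

```python
import re

# Assumption: a field starts with "@name" and ends at the first blank line
# (double new line) or, as a convenience, at the end of the text.
FIELD_PATTERN = re.compile(
    r"@(?P<name>\w+)[ \t]*(?P<value>.*?)(?=\n[ \t]*\n|\Z)",
    re.DOTALL,
)


def parse_fields(text):
    """Return a dict mapping each field name to its stripped content."""
    return {
        m.group("name"): m.group("value").strip()
        for m in FIELD_PATTERN.finditer(text)
    }


# Hypothetical structured ASCII file.
SAMPLE = """@project_name demo

@area Economics

@short_desc Computes a toy statistic.
It spans two lines.

"""

fields = parse_fields(SAMPLE)
for name, value in fields.items():
    print(name, "->", repr(value))
```

Because a single new line does not close a field, the two-line short description stays in one piece, which is the reason the double new line was preferred.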
Apart from predefined fields, ASCII files for QuantNet can carry a set of allowed tags for
advanced users. These tags are later translated via XSLT, while the ASCII-XML conversion must
ensure their "survival" in the XML representation. An example of such a construction is the
<b> HTML tag, which creates bold text. Should an experienced user want to override the master
style template, such formatting tags come in very handy. A more advanced application of this
feature could be the introduction of the <math> tag, with which QuantNet could gain MathML
support. This feature is not implemented at the moment; the only thing necessary is to change
the master XSLT template accordingly.
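One way to picture tag "survival" is a conversion step that escapes XML-special characters everywhere except inside a whitelist of allowed tags. The sketch below is purely hypothetical: the thesis does not describe how QuantNet achieves this, and the whitelist, the attribute-free tag pattern and the function name escape_keeping_tags are all assumptions.

```python
import re
from xml.sax.saxutils import escape

# Illustrative whitelist; only plain, attribute-free tags are recognised.
ALLOWED_TAGS = ("b", "math")
TAG_PATTERN = re.compile(r"</?(?:%s)>" % "|".join(ALLOWED_TAGS))


def escape_keeping_tags(text):
    """Escape &, < and > in text, but pass whitelisted tags through intact."""
    parts = []
    pos = 0
    for m in TAG_PATTERN.finditer(text):
        parts.append(escape(text[pos:m.start()]))  # ordinary text: escape it
        parts.append(m.group(0))                   # allowed tag: keep verbatim
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


result = escape_keeping_tags("x < y is <b>important</b> & <i>not</i> kept")
print(result)
```

Extending the whitelist is then a one-line change, in the same spirit as the remark that supporting <math> would only require adjusting the master template.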
Once the ASCII file is filled in with content following the aforementioned rules, how does
plain text become a well-formed XML document? One common way to implement this is to employ
Yacc/Lex software; refer to Section 4.3 for more details.
However, due to the relative complexity of Yacc/Lex, QuantNet does not employ this approach.
Instead, a combination of XSLT and the Atox text processing freeware is used. While Atox is
capable of translating ASCII data into a limited form of XML, XSLT adds the missing elements
and turns project files into full-fledged XML documents that are later transformed into HTML
output.
Atox is a freeware Python-based character management tool developed by Magnus Lie Hetland [2].
It employs an extra XML markup file with the target patterns to look for in an ASCII
file, and these rules are a mix of XML grammar and Python regular expressions.
    <ax:format xmlns:ax="http://hetland.org/atox"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <quantlet>
        <project_name ax:minOccur="1" ax:maxOccur="1">
          <ax:del>@project_name\s*</ax:del>(?=\s*\n\s*\n\s*)
        </project_name>

        <head>
          <area ax:minOccur="1" ax:maxOccur="1">
            <ax:del>@area\s*</ax:del>(?=\s*\n\s*\n\s*)
          </area>

        </head>

      </quantlet>
This listing is a part of the actual Atox markup file used for the ASCII-XML transformation.
The portion containing (?=\s*\n\s*\n\s*) illustrates the use of Python regular expressions in
Atox, while the overall structure of the file clearly resembles that of a typical XML document.
This particular regular expression, a lookahead, controls the end-of-field pattern, which is,
as mentioned before, the double new line character. The \s* elements absorb undesired white
space in the input ASCII file.
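The behaviour of the end-of-field expression can be checked directly with Python's re module. The snippet is a sketch of the regular expression alone, not of how Atox applies it internally; the surrounding pattern AREA and the sample text are assumptions introduced for the demonstration.

```python
import re

# The end-of-field lookahead quoted from the Atox markup file.
END_OF_FIELD = r"(?=\s*\n\s*\n\s*)"

# Hypothetical pattern: skip the "@area" marker, then capture the field
# content lazily up to the point where the blank line follows.
AREA = re.compile(r"@area\s*(.*?)" + END_OF_FIELD, re.DOTALL)

# Trailing spaces and a whitespace-only line are tolerated by the \s* parts.
text = "@area Economics  \n   \n@author A. Author\n\n"

match = AREA.search(text)
area = match.group(1)
rest = text[match.end():]

print(repr(area))
print(repr(rest))
```

Since a lookahead only tests what follows without consuming it, the match ends right after the field content, and the blank line remains available to whatever pattern is applied next.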