Professional XML 2nd Edition

Mark Birbeck
Jason Diamond
Jon Duckett
Oli Gauti Gudmundsson
Pete Kobak
Evan Lenz
Steven Livingstone
Daniel Marcus

Stephen Mohr
Nikola Ozu

Jon Pinnock
Keith Visco

Andrew Watt

Kevin Williams
Zoran Zaev

Wrox Press Ltd. 



Professional XML 2nd Edition
© 2001 Wrox Press

All rights reserved. No part of this book may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, without the prior written permission of the publisher,
except in the case of brief quotations embodied in critical articles or reviews.
The author and publisher have made every effort in the preparation of this book to ensure the
accuracy of the information. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, Wrox Press, nor its dealers or
distributors will be held liable for any damages caused or alleged to be caused either directly or
indirectly by this book.

Published by Wrox Press Ltd,
Arden House, 1102 Warwick Road, Acocks Green,
Birmingham, B27 6BH, UK
Printed in the United States
ISBN 1861005059


Trademark Acknowledgements
Wrox has endeavored to provide trademark information about all the companies and products
mentioned in this book by the appropriate use of capitals. However, Wrox cannot guarantee the
accuracy of this information.

Credits
Authors
Mark Birbeck

Jason Diamond
Jon Duckett
Oli Gauti Gudmundsson
Pete Kobak
Evan Lenz
Steven Livingstone
Daniel Marcus
Stephen Mohr
Nikola Ozu
Jon Pinnock
Keith Visco
Andrew Watt
Kevin Williams
Zoran Zaev

Category Managers
Dave Galloway
Sonia Mulineux

Technical Reviewers
Daniel Ayers
Martin Beaulieu
Arnaud Blandin
Maxime Bombadier
Joseph Bustos
David Carlisle
Pierre-Antoine Champin
Robert Chang
Michael Corning
Chris Crane

Steve Danielson
Chris Dix
Sébastien Gignoux
Tony Hong
Paul Houle
Craig McQueen
Thomas B. Passin
Dave Pawson
Gary L Peskin
Phil Powers DeGeorge
Eric Rajkovic
Gareth Reakes
Matthew Reynolds
David Schultz
Marc H. Simkin
Darshan Singh
Paul Warren
Karli Watson

Project Administrator
Beckie Stones

Production Co-ordinator
Pip Wonson

Author Agent
Marsha Collins

Indexers
Andrew Criddle

Bill Johncocks

Technical Architect
Timothy Briggs
Technical Editors
Phil Jackson
Simon Mackie
Chris Mills
Andrew Polshaw

Proof reader
Agnes Wiggers
Production Manager
Simon Hardware

Diagrams
Shabnam Hussain
Cover
Chris Morris


About the Authors
Mark Birbeck
Mark Birbeck is Technical Director of Parliamentary Communications Ltd., where he has been
responsible for the design and build of their political portal, ePolitix.com. He is also managing
director of XML consultancy x-port.net Ltd., responsible for the publishing system behind
spiked-online.com. Although involved in XML for a number of years, his special interests lie in
metadata, and in particular the use of RDF. He particularly welcomes Wrox's initiative in trying
to move these topics out of the shadows and into the mainstream.


Mark would particularly like to thank his long-suffering partner Jan for putting up with the
constant smell of midnight oil being burned. He offers the consolation that at least he will
already be up when their first child Louis demands attention during the small hours.

Jon Duckett
Jon has been using and writing about XML since 1998, when he co-authored and edited Wrox's
first XML publication. Having spent the past 3 years working for Wrox in the Birmingham UK
offices, Jon is currently working from Sydney, so that he can get a different view out of the
window while he is working and supping on a nice cup of tea...

Oli Gauti Gudmundsson
Oli is working for SALT, acting as one of two Chief System Architects of the SALT Systems, and
Development Director in New York. He is currently working on incorporating XML and XSL into
SALT’s web authoring and content management systems. He has acted as an instructor in the
Computer Science course (Java) at the University of Iceland, and Java is one of his greatest
strengths (and pleasures!). As a hobby he is trying to finish his BS degree in Computer Engineering.
His nationality is Icelandic, but he is currently situated in New York with his girlfriend Edda. He
can be reached at

Pete Kobak
Pete Kobak built and programmed his first computer from a kit in 1978, which featured 256
bytes of RAM and a single LED output. After a fling as an electrical engineer for IBM, Pete
gradually moved into software development to support mainframe manufacturing. He earned
geek programmer status in the late '80s when he helped to improve Burroughs' Fortran compiler
by introducing vectorization of DO loops. Justified by his desire to continue to pay his mortgage,
Pete left Burroughs in 1991 to put lives in jeopardy by developing medical laboratory software in
OS/2. In 1997, Pete somehow convinced The Vanguard Group to hire him to do Solaris web
development, even though he could barely spell “Unix”. He has helped to add new features to
their web site since then, specializing in secure web communication.
Pete's current interest is in web application security, trying to find the right techniques to enforce
the strong security needed by a serious financial institution while meeting their need to rapidly
extend business relationships. Pete is thankful to be able to introduce interesting web
technologies in the service of helping millions of people to reach for their financial dreams. He
can be contacted at


I'd like to dedicate my humble contribution to my wife Geraldine, and to my children Mary,
John, and Patricia. They have sacrificed my time and attention for me to be able to complete
this project. This chapter is a family effort.

Evan Lenz
Evan Lenz currently works as a software engineer for XYZFind Corp. in Seattle, WA. His
primary area of expertise is in XSLT, and he enjoys exploring new ways to utilize this technology
for various projects. His work at XYZFind includes everything from XSLT and Java
development to writing user's manuals, to designing the XML query language used in XYZFind's
XML database software. Wielding a professional music degree and a philosophy major, he hopes
to someday bring his varying interests together into one grand, masterful scheme.
Thanks to my precious wife, Lisa, and my baby son, Samuel, for putting up with Daddy's
long nights. And praise to my Lord and Savior, Jesus Christ, without whom none of this
would be possible or meaningful.

Steven Livingstone
Steven Livingstone is an IT Architect with IBM Global Services in Winnipeg, Canada. He has
contributed to numerous Wrox books and magazine articles, on subjects ranging from XML to
E-Commerce. Steven’s current interests include E-Commerce, ebXML, .NET, and Enterprise
Application Architectures.
Steven would like to thank everyone at Wrox, especially for the understanding as he emigrated
from Scotland to Canada (and that could be another book itself ;-) Most importantly he wants to
thank Loretito for putting up with him whilst writing – gracias mi tesoro.
Congratulations Celtic on winning the Treble :)


Daniel Marcus
Dr. Marcus has twenty years of experience in software architecture and design. He is co-founder,
President, and Chief Operating Officer at Speechwise Technologies, an applications software
company at the intersection of speech, wireless, and Internet technologies. Prior to starting
Speechwise, he was Director of E-Business Consulting at Xpedior, leading the strategy,
architecture, and deployment of e-business applications for Global 2000 and dot-com clients. Dr.
Marcus has been a Visiting Scholar at Princeton's Institute for Advanced Study, a research
scientist at the Lawrence Livermore National Laboratory, and is the author of over twenty papers
in computational science. He is a Sun-Certified Java Technology Architect and holds a Ph.D. in
Mechanical Engineering from the University of California, Berkeley.

Stephen Mohr
Stephen Mohr is a software systems architect with Omicron Consulting, Philadelphia, USA. He
has more than ten years' experience working with a variety of platforms and component
technologies. His research interests include distributed computing and artificial intelligence.
Stephen holds BS and MS degrees in computer science from Rensselaer Polytechnic Institute.
For my wife, Denise, and my sons James and Matthew.


Nikola Ozu
Nikola Ozu is an independent systems architect who lives in Wyoming at the end of a few miles
of dirt road – out where the virtual community is closer than town, but only flows at 24kb/s, and
still does not deliver pizza.
His current project involves bringing semantic databases, text searching, and multimedia
components together with XML – on the road to Xanadu. Other recent work has included the
usual web design consulting, some XML vocabularies, and an XML-based production and
full-text indexing system for a publisher of medical reference books and databases.
In the early 90s, Nik designed and developed a hypertext database called Health Reference Center;
followed by advanced versions of InfoTrac. Both of these were bibliographic and full-text databases,
delivered as monthly multi-disc CD-ROM subscriptions. Given the large text databases involved,
some involvement with SGML was unavoidable. His previous work has ranged from library
systems on mainframes to embedded micro systems (telecom equipment, industrial robots, toys,
arcade games, and videogame cartridges). In the early 70s, he was thrilled to learn programming
using patch boards, punch cards, paper tape, and printouts (and Teletypes, too).
When not surfing the 'net, he surfs crowds, the Tetons, and the Pacific; climbs wherever there is
rock; and tries to get more than a day's walk from the nearest road now and then. He enjoys
these even more when accompanied by his teenage son, who's old enough now to appreciate the
joy of mosh pits and sk8ing the Mission District after midnight.
To Noah: May we always think of the next (2^3 - 1) generations instead of just our own 2^0.
My thanks to the editors and illustrators at Wrox and my friend Deanna Bauder for their
help with this project. Also, thanks and apologies to my family and friends who endured my
disappearances into the WriterZone for days on end.

Jonathan Pinnock
Jonathan Pinnock started programming in Pal III assembler on his school's PDP 8/e, with a
massive 4K of memory, back in the days before Moore's Law reached the statute books. These
days he spends most of his time developing and extending the increasingly successful
PlatformOne product set that his company, JPA, markets to the financial services community. He
seems to spend the rest of his time writing for Wrox, although he occasionally surfaces to say
hello to his long-suffering wife and two children. JPA’s home page is www.jpassoc.co.uk.

Keith Visco
Keith Visco currently works for Intalio, Inc., the leader in Business Process Management, as a
manager and project leader for XML based technologies. Keith is the project leader for the open
source data-binding framework, Castor. He has been actively working on open source projects
since 1998, including the Mozilla project where he is the original author of Mozilla's XSLT
processor (donated by his previous employer, The MITRE Corporation) and is the current XSLT
module owner.
In all aspects of his life, Keith is most inspired after drinking a large Dunkin' Donuts Hazelnut
Coffee. Keith relieves what little stress his life does encounter by playing guitar or keyboards (he
apologizes to his neighbors). He is also a firm believer that life cannot exist without three basic
elements: music, coffee, and Red Sox baseball.


I would like to acknowledge Intalio, Inc. and The ExoLab Group for giving me the opportunity
to work on many industry-leading technologies. I would like to thank my team at Intalio,
specifically Arnaud Blandin and Sébastien Gignoux, for their hard work as well as their
invaluable feedback on this chapter. I would also like to thank my family for their
unconditional support and incessant input into all phases of my life. A special thanks to Cindy
Iturbe, whose encouragement means so much to me and for teaching me that with a little
patience and hard work all things are possible, no matter how distant things may seem.

Andrew Watt
Andrew Watt is an independent consultant who enjoys few things more than exploring the
technologies others have yet to sample. Since he wrote his first programs in 6502 Assembler and
BBC Basic in the mid 1980s, he has sampled Pascal, Prolog, and C++, among others. More
recently he has focused on the power of web-relevant technologies, including Lotus Domino,
Java and HTML. His current interest is in the various applications of the Extensible Markup
Meta Language, XMML, sometimes imprecisely and misleadingly called XML. The present
glimpse he has of the future of SVG, XSL-FO, XSLT, CSS, XLink, XPointer, etc., when they
actually work properly together is an exciting, if daunting, prospect. He has just begun to dabble
with XQuery. Such serial dabbling, so he is told, is called “life-long learning”.
In his spare time he sometimes ponders the impact of Web technologies on real people. What
will be the impact of a Semantic Web? How will those other than the knowledge-privileged fare?
To the God of Heaven who gives human beings the capacity to see, think and feel. To my
father who taught me much about life.
My heartfelt thanks go to Gail, who first suggested getting into writing, and now suffers the
consequences on a fairly regular basis, and to Mark and Rachel, who just suffer the consequences.

Kevin Williams
Kevin’s first experience with computers was at the age of 10 (in 1980) when he took a BASIC
class at a local community college on their PDP-9, and by the time he was 12, he stayed up for
four days straight hand-assembling 6502 code on his Atari 400. His professional career has been
focused on Windows development – first client-server, then onto Internet work. He’s done a little
bit of everything, from VB to Powerbuilder to Delphi to C/C++ to MASM to ISAPI, CGI, ASP,
HTML, XML, and any other acronym you might care to name; but these days, he’s focusing on
XML work. Kevin is a Senior System Architect for Equient, an information management
company located in Northern Virginia. He may be reached for comment at


Zoran Zaev
Zoran is a Sr. Web Solutions Architect with Hitachi Innovative Solutions, Corp. in the
Washington DC area. He has worked in technology since the time when 1 MHz CPUs and 48Kb
were considered a 'significant power', in the now distant 1980s. In the mid 1990s, Zoran became
involved in web applications development. Since then, he has helped large and small
clients alike leverage the power of web applications. His more recent emphasis has been web
applications and web services with XML, SOAP, and other related technologies. When he's not
programming, you'll find him traveling and exploring new learning opportunities.


I would like to thank my wife, Angela, for her support and encouragement, as well as
sharing some of her solid writing knowledge. And, you can never go wrong thanking your
parents, so 'fala' to my mom, Jelica and dad, Vanco. On the professional side, I would like
to thank Ellen Manetti for her strong project management example, and Pete Johnson,
founder of Virtualogic, Inc., for his vision-inspiring influence. Finally, thanks to Beckie and
Marsha from Wrox for their always-timely assistance and to Jan from "Images by Jan".
Zoran can be reached at





Introduction

eXtensible Markup Language (XML) has emerged as nothing less than a phenomenon in computing. It
is a concept elegant in its simplicity driving dramatic changes in the way Internet applications are
written. This book is a revision to the first edition to keep pace with this fast-changing technology as
many technologies have been superseded, and new ones have emerged.

What Does This Book Cover?

This book explains and demonstrates both the essential techniques for designing and using XML
documents, and many of the related technologies that are important today. Almost everything in this
book will be based around a specification provided by the World Wide Web Consortium (W3C).
These specifications are at various levels of completion and some of the technologies are nascent, but
we expect them to become very popular when their specifications are finalized because they are
useful or essential. The wider XML community is increasingly jumping in and offering new XML-related
ideas outside the control of the W3C, although the W3C is still central and important to the
development of XML.
The focus of this book is on learning how to use XML as an enabling technology in real-world
applications. It presents good design techniques, and shows how to interface XML-enabled applications
with web applications. Whether your requirements are oriented toward data exchange or presentation,
this book will cover all the relevant techniques in the XML community.
Most chapters contain a practical example (unless the technology is so new that there were no
working implementations at the time of writing). As XML is a platform-neutral technology, the
examples cover a variety of languages, parsers, and servers. All the techniques are relevant across all
the platforms, so you can get valuable insight from the examples even if they are not implemented
using your favorite platform.


Who Is This Book For?
This book is for experienced developers who already have some basic knowledge of XML and want to learn
how to build effective applications using this exciting but simple technology. Web site developers can
learn techniques, using XSLT stylesheets and other technologies, to take their sites to the next level of
sophistication. Other developers can learn where and how XML fits into their existing systems and how
they can use it to solve their application integration problems.
XML applications can be distributed and are usually web-oriented. This book focuses on this kind of
application, so we expect the reader to have some awareness of multi-tier architecture, preferably
from a web perspective. In case you have missed some of the XML fundamentals, we do go back over
XML itself, covering the full specification thoroughly but fairly quickly.
A variety of programming languages is used in this book, and we do not expect you to be
proficient in them all. The techniques taught in this book can be transferred to other programming
languages. As XML is a cross-platform technology, Java is one of the languages used in this book, especially
because it has a wealth of tools to manipulate XML. Other languages covered include JavaScript,
VBScript, VB, C#, and Perl. We expect the reader to be proficient in a programming language, but it
does not matter which one.

How is this Book Structured?
Although many authors have contributed towards this book, we have tied the chapters together under
unifying themes. As you will read below, the book has effectively been split into six sections. A standard
example using a toy company has been used in chapters where possible, so you can see how different
technologies can explain, describe, or transform the same data in different ways.
A small number of the chapters, e.g. Chapter 23, rely heavily on a previous chapter, but this will be
made clear. Most of the chapters will be relatively self-contained.

Learning Threads
XML is evolving into a large, wide-ranging field of related markup technologies. This growth is
powering XML applications. With growth comes divergence. Different readers will come to this book
with different expectations. XML is different things to different people.

Foundation
Chapter 1 introduces the XML world in general, discussing the technologies that are relevant today and
may be relevant tomorrow, but with very little code. Chapters 2 (Basic XML Syntax) and 3 (Advanced
XML Syntax) cover the fundamentals of XML 1.0. Chapter 2 gives you the basic syntax of an XML
document, while Chapter 3 covers slightly more advanced issues like namespaces. These chapters form
the irreducible minimum you need to understand XML and, depending on your experience, you may
want to skip these introductory chapters. Chapter 4 teaches you about the Infoset, a standard way of
describing XML, which provides an abstract representation for XML data.
In Chapter 5, we cover document validation using DTDs. As you will learn in the subsequent two
chapters, other schema-based validation languages exist that supersede DTDs. DTDs are not quite dead,
however: many more XML parsers validate with DTDs than with any other schema language, and DTDs are
relatively simple. Following this, in Chapter 6, we cover XML Schema and show how to validate your XML
documents using this new XML-based validation language specified by the W3C. Chapter 7 covers
other schema-based validation languages, including James Clark's TREX proposal and Schematron.


In Chapter 8, we explain the XPath specification – a method of referring to specific fragments of XML
that is relevant to and used by other XML technologies. These include XSLT, described in Chapter 9.
Here we teach you how to transform your XML documents into anything else, based on certain
stylesheet declarations. In Chapter 10, we show various linking technologies, such as XLink and
XPointer, and describe the XML Fragment Interchange specification.
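To give a flavor of what "referring to specific fragments of XML" means, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The toy catalog and all of its element and attribute names are invented for this illustration and do not come from the book's own examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy-company catalog; element and attribute names are invented.
CATALOG = """
<catalog>
  <toy id="t1" category="puzzle"><name>Cube</name><price>9.50</price></toy>
  <toy id="t2" category="plush"><name>Bear</name><price>14.00</price></toy>
  <toy id="t3" category="puzzle"><name>Maze</name><price>6.25</price></toy>
</catalog>
"""

root = ET.fromstring(CATALOG)

# Path expression: every <name> whose parent <toy> has category="puzzle".
puzzle_names = [n.text for n in root.findall("./toy[@category='puzzle']/name")]

# Path expression: the <price> of the toy whose id is "t2".
bear_price = float(root.find("./toy[@id='t2']/price").text)

print(puzzle_names)
print(bear_price)
```

Each path expression selects nodes by position in the tree and by attribute value, without any hand-written traversal code. A full XPath 1.0 processor offers far more (axes, functions, arithmetic) than this subset.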
These ten chapters are enough for you to learn about all of the immediately useful XML technologies –
for those who just use XML. You may already have a lot of experience of XML, in which case some of these
chapters will cover familiar ground, but everybody should be able to learn something
new, especially because XML Schema reached Proposed Recommendation status, the penultimate
stage for a W3C specification, just two months before this book was printed. Although a wealth of
XML techniques lie ahead, you will have a firm foundation upon which to build.
So the Foundation thread includes:


Chapter 1: Introducing XML
Chapter 2: Basic XML Syntax
Chapter 3: Advanced XML Syntax
Chapter 4: The XML Information Set
Chapter 5: Validating XML: DTDs
Chapter 6: Introducing XML Schema
Chapter 7: XML Schema Alternatives
Chapter 8: Navigating XML – XPath
Chapter 9: Transforming XML
Chapter 10: Fragments, XLink, and XPointer

XML Programming
XML is both machine and human readable and, not surprisingly, some standard APIs have been
created to manipulate XML data. These APIs are implemented in JavaScript, Java, Visual Basic, C++,
Perl, and many other languages. These provide a standard way of manipulating, and developing for,
XML documents.
In Chapter 11, we consider the first API, the DOM, which emerged from the HTML world. This has
been released as a specification from the W3C, and Level 2 of this specification has recently been
released. XML data can be thought of as hierarchical and object-oriented, and the DOM provides
methods and properties for retrieving and manipulating XML nodes. Chapter 12 discusses SAX, a
lightweight alternative to the DOM. When using the DOM, the entire document has to be read
into memory; SAX, however, reports the document to your code as a stream of events, so an
application needs to hold only as much data as is necessary to retrieve or work with a specific node.
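The contrast between the two APIs can be sketched in a few lines. The example below uses the DOM and SAX implementations from Python's standard library rather than the Java parsers used in the book. The order document and its sku attribute are made up for the illustration.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical order document; names are invented for this sketch.
DOC = "<order><item sku='a1'>Kite</item><item sku='b2'>Yo-yo</item></order>"

# DOM: parse the whole document into an in-memory tree, then navigate it.
dom = xml.dom.minidom.parseString(DOC)
dom_skus = [el.getAttribute("sku") for el in dom.getElementsByTagName("item")]


# SAX: register a handler; the parser calls it as each element streams past.
class SkuHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.skus = []

    def startElement(self, name, attrs):
        # Only the data we choose to keep is retained; no tree is built.
        if name == "item":
            self.skus.append(attrs.getValue("sku"))


handler = SkuHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_skus = handler.skus

print(dom_skus, sax_skus)
```

Both approaches recover the same values. The DOM version is shorter and allows random access to any node after parsing, while the SAX version keeps memory use independent of document size, at the cost of tracking state by hand.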
Chapter 13 is the last chapter in this section, and it covers Declarative Programming with XML. Most
programmers use procedural languages, but XML and the XML specifications don't care about how a
particular language or application performs a job, just that it does it according to the declarations made.
This chapter explains how to use schemas to design your applications.


The Programming thread therefore includes:


Chapter 11: The Document Object Model
Chapter 12: SAX 2
Chapter 13: Schema Based Programming

XML as Data
There are four chapters in this section, all targeted specifically at the storage, retrieval, and
manipulation of data as it relates to XML. Chapter 14, Data Modeling, explains how to plan your
project 'properly', so that you model your XML on your data and build better applications because of it.
Chapter 15 extends this concept by covering the binding of data to XML (and vice versa).
Chapter 16, Querying XML, covers a nascent technology known as XML Query. It aims to provide the
power of SQL in an XML format. This short chapter teaches you how to use the technology as it stands
at the time of writing.
The final chapter in this section, Chapter 17, is a case study that describes how to relate your
databases to your XML data and so integrate your XML and RDBMS in the best way possible.
This means that the Data thread contains:


Chapter 14: Data Modeling
Chapter 15: Data Binding
Chapter 16: Querying XML
Chapter 17: Case Study: XML and Databases

Presentation of XML
Chapter 18 covers an XML technology called SVG – Scalable Vector Graphics. This XML technology,
when coupled with an appropriate viewer (for example, Adobe SVG Viewer), allows quite detailed
graphics files to be displayed and manipulated. In Chapter 19, we describe VoiceXML, an XML
technology to allow voice recognition and processing on the Web. XML data can be converted to
VoiceXML and using the appropriate technology, can be spoken and interacted with over a telephone.
Chapter 20 covers the final technology in this section, XSL-FO. This is an emerging technology that
allows the layout of pages to be specified exactly, in much the same way as PDF does now. The main
difference is that XSL-FO is itself XML, and so can be manipulated using the same XML tools you may
be used to. Also, XSL-FO can be converted to PDF if necessary for users without XSL-FO viewers.
In the Presentation thread, therefore, we cover:

Chapter 18: Presenting XML Graphically
Chapter 19: VoiceXML
Chapter 20: XSL Formatting Objects: XSL-FO

XML as Metadata
In this thread, we discuss how XML can be used to represent metadata – that is, the meaning or
semantics of data, rather than the data itself. In Chapter 21, we cover the setting up of an index of XML
data. This chapter uses a Java indexing application, but the techniques are applicable to any indexing
tool. Chapter 22 is where we really get to the meat of the topic, where we talk about RDF – a language
to describe metadata. We cover the elements and syntax of this technology. In Chapter 23, we go over
some practical examples of RDF technology, before describing RDDL – a method of bundling resources
at the URL of a namespace, so that a RDDL-enabled application can learn what technology the
namespace refers to, and can access schemas and standard transforms for it.
In the Metadata thread, we cover:



Chapter 21: Case Study: Generating a Site Index
Chapter 22: RDF
Chapter 23: RDF Code Samples and RDDL

XML used for B2B
The final section of this book describes what is quite possibly the most important use of XML – B2B and
Web Services. In the past, the communication protocols for B2B (e.g. EDI) have been proprietary, and
expensive – both in terms of cost, and processor power. Using XML vocabularies, an open and
programmable model can be used for B2B transactions.
In Chapter 24, we describe the Simple Object Access Protocol (SOAP). SOAP began as a mostly Microsoft
initiative (although the W3C is developing the XML Protocol specification, which should be very similar to
SOAP), and it allows two applications to describe and invoke services using XML. We cover the intricacies of this
protocol, so that you can use it to web-enable any service you would care to mention.
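As a rough illustration of a service call expressed in XML, the sketch below wraps a request in a SOAP 1.1 envelope and reads it back, using Python's standard library. Only the envelope namespace is taken from SOAP 1.1. The GetPrice operation, the sku parameter, and the urn:example:toys namespace are hypothetical names invented for this example, and no network transport is shown.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:toys"  # hypothetical application namespace


def build_request(sku):
    """Wrap a hypothetical GetPrice call in a SOAP envelope and return it as text."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetPrice")
    ET.SubElement(call, f"{{{APP_NS}}}sku").text = sku
    return ET.tostring(envelope, encoding="unicode")


def read_request(message):
    """Return the operation's local name and the requested SKU from an envelope."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]  # first child of Body is the call
    sku = call.find(f"{{{APP_NS}}}sku").text
    return call.tag.split("}")[1], sku


message = build_request("t2")
print(message)
print(read_request(message))
```

The receiving side needs nothing beyond an XML parser to discover which operation was requested and with what arguments, which is what makes the approach language- and platform-neutral.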
Chapter 25 covers Microsoft's BizTalk Server. This server can control all B2B transactions, using the
open BizTalk framework. BizTalk is just one method of using SOAP to conduct business transactions,
but it is Microsoft's and is very popular. In Chapter 26, we have a case study discussing E-Business
integration using XML. There are a number of business standards for commerce, and this chapter
explains how you can integrate all of the standards, without having to write code for every possible B2B
transaction between competing standards.
We end in Chapter 27, with a discussion of the Web Services Description Language (WSDL), which allows
us to formalize other XML vocabularies by defining services that a SOAP, or other, client can connect to.
WSDL describes each service and what it does. In addition, in this chapter, we cover UDDI (Universal
Description, Discovery, and Integration), which is a way of automating the discovery of, and transactions
with, various services. In many cases, it should not be necessary for a human to intervene in order to
find a service, and using public registration services, UDDI makes this possible. Both of these
technologies are nascent, but their importance will grow as more and more businesses make use of them.
In summary, in the B2B thread, we describe in each chapter the following:


Chapter 24: SOAP
Chapter 25: B2B with Microsoft BizTalk Server
Chapter 26: E-Business Integration
Chapter 27: B2B Futures: WSDL and UDDI


What You Need to Use this Book
The book assumes that you have some knowledge of HTML, some procedural or object-oriented
programming languages (e.g. Java, VB, C++), and some minimal XML knowledge. For some of the
examples in this book, a Java Runtime Environment will need to be
installed on your system, and some other chapters require applications such as MS SQL Server, MS
Index Server, and BizTalk.
The complete source for larger portions of code from the book is available for download from:
More details are given in the section of this Introduction called "Support,
Errata, and P2P".

Conventions
To help you get the most from the text and keep track of what's happening, we've used a number of
conventions throughout the book.
For instance:

These boxes hold important, not-to-be forgotten information, which is directly
relevant to the surrounding text.

While this style is used for asides to the current discussion.
As for styles in the text:
When we introduce them, we highlight important words
We show keyboard strokes like this: Ctrl-A
We show filenames, and code within the text like so: doGet()
Text on user interfaces is shown as: File | Save
URLs are shown in a similar font, as so:
We present code in two different ways. Code that is important, and testable, is shown as so:
In our code examples, the code foreground style shows new, important,
pertinent code

Code that is an aside, shows examples of what not to do, or has been seen before is shown as so:
Code background shows code that's less important in the present context,
or has been seen before.

In addition, when something is to be typed at a command line interface (e.g. a DOS prompt), then we
use the following style to show what is typed, and what is output:



> java com.ibm.wsdl.Main -in Arithmetic.WSDL
>> Transforming WSDL to NASSL ..
>> Generating Schema to Java bindings ..
>> Generating serializers / deserializers ..
Interface 'wsdlns:ArithmeticSoapPort' not found.

Support, Errata, and P2P
The printing and selling of this book was just the start of our contact with you. If there are any
problems whatsoever with the code or the explanation in this book, we welcome input from you. A
mail to , should elicit a response within two to three days (depending on how busy
the support team are).
In addition to this, we also publish any errata online, so that if you have a problem, you can check on
the Wrox web site first to see if we have updated the text at all. First, pay a visit to www.wrox.com,
then, click on the Books | By Title(Z-A), or Books | By ISBN link on the left hand side of the page.
See below:

Navigate to this book (this ISBN is 1861005059, if you choose to navigate this way) and then click on it.
As well as giving some information about the book, it also provides options to download the code, view
errata, and ask for support. Just click on the relevant link. All errata that we discover will be added to
the site, so information on changes that have to be made to the code for newer versions of software
may also be included here – as well as corrections to any printing or code errors.

All of the code for this book can be downloaded from our site. It is included in a zip file, and all of the
code samples in this book can be found within, referenced by chapter number.
In addition, at p2p.wrox.com, we have our free "Programmer to Programmer" discussion lists. There
are a few relevant to this book, and any questions you post will be answered by either someone at
Wrox, or someone else in the developer community. Navigate to and
subscribe to a discussion list from there. All lists are moderated and so no fluff or spam should be
received in your Inbox.

Tell Us What You Think
We've worked hard to make this book as useful to you as possible, so we'd like to know what you think.
We're always keen to know what it is you want and need to know.
We appreciate feedback on our efforts and take both criticism and praise on board in our future
editorial efforts. If you've anything to say, let us know on:


Or via the feedback links on:


1
Introducing XML

In this chapter, we'll look at the origins of XML, the core technologies and specifications that are related
to XML, and an overview of some current, and future applications of XML. The later sections of this
introduction should also serve as something of a road map to the rest of the book.

Origins and Goals of XML

"XML", as we all know, is an acronym for Extensible Markup Language – but what is a markup
language? What is the history of markup languages, what are the goals of XML, and how does it
improve upon earlier markup?

Markup Languages

Ever since the invention of the printing press, writers have made notes on manuscripts to instruct the
printers on matters such as typesetting and other production issues. These notes were called "markup".
A collection of such notes that conform to a defined syntax and grammar can certainly be called a
"language". Proofreaders use a hand-written symbolic markup language to communicate corrections to
editors and printers. Even the modern use of punctuation is actually a form of markup that remains with
the text to advise the reader how to interpret that text.
These early markup languages use a distinct appearance to differentiate markup from the text to which
it refers. For example, proofreaders' marks consist of a combination of cursive handwriting and special
symbols to distinguish markup from the typeset text. Punctuation consists of special symbols that cannot
be confused with the alphabet and numbers that represent the textual content. These symbols are so
necessary to understanding printed English that they were included in the ASCII character set, and so
have become the foundation of modern programming language syntax.

The ASCII character set standard was the early basis for widespread data exchange between various
hardware and software systems. Whatever the internal representation of characters, conversion to
ASCII allowed these disparate systems to communicate with each other. In addition to text, ASCII also
defined a set of symbols, the C0 control characters (using the hexadecimal values 00 to 1F), which were
intended to be used to mark up the structure of data transmissions.
Only a few of these symbols found widespread acceptance, and their use was often inconsistent. The
most common example is the character(s) used to delimit the end of a line of text in a document.
Teletype machines used the physical motion-based character pair CR-LF (carriage-return, line-feed).
This was later used by both MS-DOS and MS-Windows; UNIX uses a single LF character; and the
MacOS uses a single CR character. Due to conflicting and non-standard uses of C0 control characters,
document interchange between different systems still often requires a translation step, since even a
simple text file cannot be shared without conversion.
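The translation step mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name normalize_newlines and the sample strings are our own assumptions, and the sketch simply maps the CR-LF and CR conventions onto a single LF.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR-LF pairs and lone CR characters to a single LF."""
    # Replace the two-character pair first, so that it is not
    # turned into two separate LF characters by the second step.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical sample data: the same two lines in each convention.
dos_text = "line one\r\nline two\r\n"   # MS-DOS / MS-Windows
mac_text = "line one\rline two\r"       # MacOS
unix_text = "line one\nline two\n"      # UNIX

print(normalize_newlines(dos_text) == unix_text)
print(normalize_newlines(mac_text) == unix_text)
```

The order of the two replacements matters: handling the lone CR first would split every CR-LF pair into two line breaks.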
Various forms of delimiters have been used to define the boundaries of containers for content, special
symbol glyphs, presentation style of the text, or other special features of a document. For example, the
C and C++ programming languages use the braces {} to delimit units of data or code. A typesetting
language, intended for manual human editing, might use strings that are more readable, like ".begin"
and ".end".


Markup is a method of conveying metadata (information about another dataset).

XML is a relatively new markup language, but it is a subset of, and is based upon, a mature markup
language called Standard Generalized Markup Language (SGML). The WWW's Hypertext Markup
Language (HTML) is also based upon SGML; indeed, it is an application of SGML. There is a new
version of HTML 4 that is called Extensible Hypertext Markup Language (XHTML), which is similarly
an application of XML. All of these markup languages are for metadata, but SGML and XML may be
further considered meta-languages, since they can be used to create other metadata languages. Just as
HTML was expressed in SGML, XHTML and others will use XML.

SGML-based markup languages all use literal strings of characters, called tags, to
delimit the major components of the metadata, called elements.

Tags represent object delimiters and other such markup, as opposed to the content they enclose (no matter whether
it's simple text or text that is program code). Of course, there has often been conflict between different
sets of tags and their interpretation. Without common delimiter vocabularies, or even common internal
data formats, it has been very difficult to convert data from one format to another, or otherwise share
data between applications and organizations.
For example, the following two markup excerpts (Chapter_01_01.html & Chapter_01_01.xml)
show familiar HTML and similar XML elements with their delimiting tags:
<HTML>
<HEAD>
<TITLE>Product Catalog (Toysco-only)</TITLE>
</HEAD>
<BODY>
<H1>Product Catalog (Internal-use only!)</H1>
<HR>

<H2>Product Descriptions</H2>
<HR WIDTH=33% ALIGN=LEFT>
<H3>Mega Wonder Widget</H3>
<P>The <EM>Mega Wonder Widget</EM> is a popular toy with a 20 oz. capacity. It
costs only $12.95 to make, whilst selling for $33.99 (plus $3.95 S&H).<BR>
<H3>Giga Wonder Widget</H3>
<P>The <EM>Giga Wonder Widget</EM> is even more popular, because of its
larger 55 oz. capacity. It has a similar profit margin (costs $19.95,
sells for $49.99).
...
<HR>
<P><I>Updated:</I> 2001-04-01 <I>by Webmaster Will</I>
</BODY>
</HTML>

This rather simplistic document uses the few structural tags that exist in HTML, such as <TITLE>, <H1>,
<H2>, and <H3> for headers, and <P> for paragraphs. This structure is limited to a very basic presentation
model of a document as a printed page. Other tags, such as <HR> and <EM>, are purely about the appearance
of the data. Indeed, most HTML tags are now used to describe the presentation of data, interactive logic for
user input and control, and external multimedia objects. These tags give us no idea what structured data
(names, prices, etc.) might appear within the text, or where it might be in that text.
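To make this limitation concrete, here is a minimal sketch, using Python's standard html.parser and re modules, of what a program has to do to recover prices from the HTML version. The class name TextExtractor, the function dollar_amounts, and the shortened fragment are our own assumptions for illustration. The best such code can do is guess from the "$" pattern in the text.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# A shortened copy of the Mega Wonder Widget paragraph.
html_fragment = (
    "<H3>Mega Wonder Widget</H3>"
    "<P>The <EM>Mega Wonder Widget</EM> is a popular toy with a 20 oz. "
    "capacity. It costs only $12.95 to make, whilst selling for $33.99 "
    "(plus $3.95 S&amp;H).<BR>"
)


def dollar_amounts(html):
    """Return every string that looks like a dollar amount in the text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return re.findall(r"\$(\d+\.\d{2})", text)


print(dollar_amounts(html_fragment))  # ['12.95', '33.99', '3.95']
```

The three amounts come back as an undifferentiated list. Nothing in the markup says which is the cost, which is the selling price, and which is the shipping fee; that knowledge lives only in the surrounding English prose.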
On the other hand, XML allows us to create a structural model of the data within the text. Presentation
cues can be embedded as with HTML tags, but the best XML practice is to separate the data structure
from presentation. An external style sheet can be used with XML to describe the data's presentation
model. So, we might convert – and extend – the above HTML example into the following XML data
file (Chapter_01_01.xml):

<?xml version="1.0" ?>
<!DOCTYPE ProductCatalog [
<!ELEMENT ProductCatalog (HEAD?, BODY?) >
<!ELEMENT HEAD (TITLE, Updated, Author+, Security*) >
<!ELEMENT BODY (H1, H2, (H3, Products)+ ) >
<!ELEMENT Products (Product+) >
<!ELEMENT Product (#PCDATA|Prodname|Capacity|Cost|Price|Shipfee)* >

<!ELEMENT H1 (#PCDATA) >
<!ELEMENT H2 (#PCDATA) >
<!ELEMENT H3 (#PCDATA) >
<!ELEMENT TITLE (#PCDATA) >

<!ELEMENT Updated (#PCDATA) >
<!ELEMENT Author (#PCDATA) >
<!ELEMENT Security (#PCDATA) >

<!ELEMENT Prodname (#PCDATA) >
<!ELEMENT Capacity (#PCDATA) >
<!ELEMENT Cost (#PCDATA) >
<!ELEMENT Price (#PCDATA) >
<!ELEMENT Shipfee (#PCDATA) >

<!ENTITY MWW "Mega Wonder Widget" >
<!ENTITY GWW "Giga Wonder Widget" >
]>
<ProductCatalog>
<HEAD>
<TITLE>Product Catalog</TITLE>
<Updated>2001-04-01</Updated>
<Author>Webmaster Will</Author>
<Security>Toysco-only (TRADE SECRET)</Security>
</HEAD>
<BODY>
<H1>Product Catalog</H1>
<H2>Product Descriptions</H2>
<Products>
<H3>&MWW;</H3>
<Product>
The <Prodname>&MWW;</Prodname> is a popular toy with a
<Capacity unit="oz.">20</Capacity> capacity. It costs only
<Cost currency="USD">12.95</Cost> to make, whilst selling for
<Price currency="USD">33.99</Price> (plus
<Shipfee currency="USD">3.95</Shipfee> S&amp;H).<BR/>
</Product>
<H3>&GWW;</H3>
<Product>
The <Prodname>&GWW;</Prodname> is even more popular, because of its
larger <Capacity unit="oz.">55</Capacity> capacity. It has a
similar profit margin (costs <Cost currency="USD">19.95</Cost>,
sells for <Price currency="USD">49.99</Price>).<BR/>
</Product>
...
</Products>
</BODY>
</ProductCatalog>

The XML document looks very similar to the HTML version, with comparable text content, and some
equivalent tags (as XHTML). XML goes far beyond HTML by allowing the use of custom tags (like
<Prodname> or <Capacity>) that preserve some structured data that is embedded within the text of the
description. We can't do this in HTML, since its set of tags is more or less fixed, changing slowly as
browser vendors embrace new features and markup. In contrast, anyone can add tags to their own XML
data. The use of tags to describe data structure allows easy conversion of XML to an arbitrary DBMS
format, or alternative presentations of the XML data such as in tabular form or via a voice synthesizer
connected to a telephone.
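As a rough illustration of such a conversion, the following Python sketch reads a trimmed copy of the catalog with the standard xml.etree.ElementTree module and turns each <Product> into a row of named fields, ready for a table or database insert. The names CATALOG_XML and product_rows are our own assumptions, the DTD is cut down to the two entity declarations, and the Giga price is taken from the HTML listing.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the catalog; only the entity declarations are kept
# from the DTD, since this sketch does not validate the document.
CATALOG_XML = """<?xml version="1.0" ?>
<!DOCTYPE ProductCatalog [
<!ENTITY MWW "Mega Wonder Widget" >
<!ENTITY GWW "Giga Wonder Widget" >
]>
<ProductCatalog>
<BODY>
<Products>
<Product>
The <Prodname>&MWW;</Prodname> is a popular toy with a
<Capacity unit="oz.">20</Capacity> capacity. It costs only
<Cost currency="USD">12.95</Cost> to make, whilst selling for
<Price currency="USD">33.99</Price> (plus
<Shipfee currency="USD">3.95</Shipfee> S&amp;H).
</Product>
<Product>
The <Prodname>&GWW;</Prodname> is even more popular, because of its
larger <Capacity unit="oz.">55</Capacity> capacity. It has a
similar profit margin (costs <Cost currency="USD">19.95</Cost>,
sells for <Price currency="USD">49.99</Price>).
</Product>
</Products>
</BODY>
</ProductCatalog>
"""


def product_rows(xml_text):
    """Return one dictionary per <Product>, keyed by field name."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.iter("Product"):
        rows.append({
            # The parser expands &MWW; and &GWW; before we see the text.
            "name": product.findtext("Prodname"),
            "capacity": float(product.findtext("Capacity")),
            "cost": float(product.findtext("Cost")),
            "price": float(product.findtext("Price")),
        })
    return rows


for row in product_rows(CATALOG_XML):
    print(row)
```

Because each value is found by the name of its element rather than by its position in a sentence, the surrounding prose can be reworded freely without breaking the extraction.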
We have also assumed that we will use a stylesheet to format the XML data for presentation. Therefore,
we are able to omit certain labels from our text (such as the $ sign in prices, and the "oz." after the
capacity value). We will then rely upon the formatting process to insert them in the output, as
appropriate. In a similar fashion, we have put the document update information in the header (where it
can be argued that it logically belongs). When we transform the data for output, this data can be
displayed as a footer with various string literals interspersed. In this way, it can appear to be identical to
the HTML version.
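The idea of letting the formatting process supply the omitted labels can be sketched as follows. Note that this is plain Python standing in for a stylesheet, not a stylesheet language; the names PRODUCT_XML, CURRENCY_SYMBOLS, and render, and the simplified <Product> fragment, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical <Product> fragment holding only data fields.
PRODUCT_XML = """<Product>
<Prodname>Mega Wonder Widget</Prodname>
<Capacity unit="oz.">20</Capacity>
<Price currency="USD">33.99</Price>
</Product>"""

# Hypothetical lookup from currency code to the symbol shown to readers.
CURRENCY_SYMBOLS = {"USD": "$"}


def render(product):
    """Build a display string, adding the labels the data leaves out."""
    name = product.findtext("Prodname")
    capacity = product.find("Capacity")
    price = product.find("Price")
    # The "$" sign and the "oz." label come from attributes, not the text.
    symbol = CURRENCY_SYMBOLS.get(price.get("currency"), "")
    return (
        f"{name}: {capacity.text} {capacity.get('unit')} capacity, "
        f"sells for {symbol}{price.text}"
    )


print(render(ET.fromstring(PRODUCT_XML)))
# Mega Wonder Widget: 20 oz. capacity, sells for $33.99
```

Keeping the number and its unit or currency apart means the same data could just as easily be rendered with different labels for another audience, without touching the XML itself.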
It should be obvious from the examples that HTML and XML are very similar in both overall structure
and syntax. Let's look at their common ancestor, before we move on to the goals of XML.
