Wrox beginning XML 2nd edition dec 2001 ISBN 0764543946 pdf

This document is created with the unregistered version of CHM2PDF Pilot

< Free Open Study >

Beginning XML, 2nd Edition: XML Schemas, SOAP, XSLT, DOM, and SAX 2.0
by David Hunter, Kurt Cagle, Chris Dix et al.
ISBN: 0764543946
Wrox Press © 2003 (784 pages)
This book teaches you all you need to know about XML: what it is, how it works, what technologies
surround it, and how it can best be used in a variety of situations, from simple data transfer to using
XML in your web pages.

Table of Contents
Beginning XML, 2nd Edition: XML Schemas, SOAP, XSLT, DOM, and SAX 2.0
Introduction
Chapter 1 - What is XML?
Chapter 2 - Well-Formed XML
Chapter 3 - XML Namespaces
Chapter 4 - XSLT
Chapter 5 - Document Type Definitions
Chapter 6 - XML Schemas
Chapter 7 - Advanced XML Schemas
Chapter 8 - The Document Object Model (DOM)
Chapter 9 - The Simple API for XML (SAX)
Chapter 10 - SOAP
Chapter 11 - Displaying XML
Chapter 12 - XML and Databases
Chapter 13 - Linking and Querying XML
Case Study 1 - Using XSLT to Build Interactive Web Applications
Case Study 2 - XML Web Services
Appendix A - The XML Document Object Model
Appendix B - XPath Reference
Appendix C - XSLT Reference
Appendix D - Schema Element and Attribute Reference
Appendix E - Schema Datatypes Reference
Appendix F - SAX 2.0: The Simple API for XML
Appendix G - Useful Web Resources
Index

Back Cover
Extensible Markup Language (XML) is a rapidly maturing technology with powerful real-world applications,
particularly for the management, display, and transport of data. Together with its many related technologies, it has
become the standard for data and document delivery on the Web.
This book teaches you all you need to know about XML—what it is, how it works, what technologies surround it,
and how it can best be used in a variety of situations, from simple data transfer to using XML in your web pages. It
builds on the strengths of the first edition, and provides new material to reflect the changes in the XML
landscape—notably SOAP and Web Services, and the publication of the XML Schemas Recommendation by the
W3C.
Who is this book for?
Beginning XML, 2nd Edition is for any developer who is interested in learning to use XML in web, e-commerce,
or data storage applications. Some knowledge of markup, scripting, and/or object-oriented programming languages
is advantageous, but not essential, as the basis of these techniques is explained as required.
What does this book cover?
• XML syntax and writing well-formed XML
• Using XML Namespaces
• Transforming XML into other formats with XSLT
• XPath and XPointer for locating specific XML data
• XML validation using DTDs and XML Schemas
• Manipulating XML documents with the DOM and SAX 2.0
• SOAP and Web Services
• Displaying XML using CSS and XSL
• Incorporating XML into traditional databases and n-tier architectures
• XLink for linking XML and non-XML resources

Beginning XML, 2nd Edition: XML Schemas, SOAP, XSLT, DOM, and SAX 2.0
Kurt Cagle
Chris Dix
David Hunter
Roger Kovack
Jon Pinnock
Jeff Rafter
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256

www.wiley.com
Copyright © 2003 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
Library of Congress Card Number: 2003107073
ISBN: 0-7645-4394-6
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
1B/RQ/QW/QT/IN
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections
107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or
authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood
Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8700. Requests to the Publisher for permission should
be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256,
(317) 572-3447, fax (317) 572-4447, E-Mail:
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHOR
HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK, THEY MAKE NO
REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS
OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES
OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE
CREATED OR EXTENDED BY SALES REPRESENTATIVES OR WRITTEN SALES MATERIALS. THE
ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR YOUR SITUATION.
YOU SHOULD CONSULT WITH A PROFESSIONAL WHERE APPROPRIATE. NEITHER THE
PUBLISHER NOR AUTHOR SHALL BE LIABLE FOR ANY LOSS OF PROFIT OR ANY OTHER
COMMERCIAL DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, INCIDENTAL,
CONSEQUENTIAL, OR OTHER DAMAGES.

For general information on our other products and services or to obtain technical support, please contact our
Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317) 572-3993 or fax (317)
572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic books.
Trademarks: Wiley, the Wiley Publishing logo, Wrox, the Wrox logo, the Wrox Programmer to Programmer logo
and related trade dress are trademarks or registered trademarks of Wiley in the United States and other countries,
and may not be used without written permission. All other trademarks are the property of their respective owners.
Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Trademark Acknowledgements
Wrox has endeavored to provide trademark information about all the companies and products mentioned in this
book by the appropriate use of capitals. However, Wrox cannot guarantee the accuracy of this information.
Credits
Authors
Kurt Cagle
Chris Dix
David Hunter
Roger Kovack
Jonathan Pinnock
Jeff Rafter
Technical Reviewers
Steve Baker
David Beauchemin
Martin Beaulieu
Natalia Bortniker
Oli Gauti Gudmundsson
Paul Houle
Graham Innocent
Sachin Kanna
Sing Li

Steven Livingstone
Nikola Ozu
Jeff Rafter
Gareth Reakes
Eddie Robertsson
David Schultz
Ian Stokes-Rees
Category Managers
Simon Cox
Dave Galloway
Technical Architect
Peter Morgan


Technical Editors
Sarah Larder
Simon Mackie
Indexers
Michael Brinkman
Fiona Murray
Production Manager
Liz Toy
Production Project Co-ordinator
Mark Burdett
Author Agent
Marsha Collins
Production Assistant
Abbie Forletta

Project Manager
Vicky Idiens
Cover
Dawn Chellingworth
Proof Reader
Keith Westmoreland
About the Authors
Kurt Cagle
Kurt Cagle is a writer and developer specializing in XML and Internet related issues. He has written eight books and
more than one hundred articles on topics ranging from Visual Basic programming to the impact of the Internet on
society, and has consulted for such companies as Microsoft, Nordstrom, AT&T and others. He also helped launch
Fawcette's XML Magazine and has been the DevX DHTML and XML Pro for nearly two years.
Kurt Cagle contributed Chapter 11 to this book.
Chris Dix
Chris Dix has been developing software for fun since he was 10 years old, and for a living for the past 8 years. He is
one of the authors of Professional XML Web Services, and he frequently writes and speaks on the topic of XML and
Web Services. Chris is Lead Developer for NavTraK, Inc., a leader in automatic vehicle location systems located in
Salisbury, Maryland, where he develops Web Services and system architecture. He can be reached at

"I would like to thank my wife Jennifer and my wonderful sons Alexander and Calvin for their love and
support. I would also like to thank the people at Wrox for this opportunity, and for their technical expertise in
helping make this possible."
Chris Dix contributed Case Study 2 to this book.


David Hunter
David Hunter is a Senior Architect for MobileQ, a leading mobile software solutions developer, and the first
company to ship an XML-based mobility server. David has extensive experience building scalable applications, and
provides training on XML. He also works closely with the team that develops MobileQ's flagship product,
XMLEdge, which delivers the ideal mobile user experience on a diverse number of mobile devices.
"First of all, I would like to thank God for the incredible opportunities he has given me to do something I love,
and even write books about it. I pray that the glory will go to him. I would also like to thank Wrox's editors; if
this book is helpful, easy to read, and easy to understand, it's because the editors made it that way.
And finally, I'd like to thank the person who gave me the most support, but probably doesn't even realize it.
Thank you, Andrea, for helping me through this."
David Hunter contributed Chapters 1, 2, 3, 4, 8, 10, 12, and 13 to this book.
Roger Kovack
Roger Kovack has more than 25 years of software development experience, which began with programming medical
research applications in Fortran on DEC machines at the University of California. More recently he has consulted for
Wells Fargo and Bank of America, developing departmental information systems on desktop and client/server
platforms. Bitten by Java and the web bug in the mid '90s, he developed web applications for Commerce One, a
major B2B software vendor, and for LookSmart.com, one of the best known and still operating web portals. He was
instrumental in bringing Java into those organizations to replace ASP and C++. Roger can be contacted on
.
"My deep thanks to my wife, Julie, for the encouragement and support for writing this chapter. I'm also
endlessly grateful for the help and attention the editorial team at Wrox Press provided. Their concern for
quality content can't be overstated.
Words can't express my sorrow and compassion for the innocent victims and their families whose lives were
shattered by the terrorist attacks on New York and Washington DC on September 11, 2001. The personal,
permanent wound that has caused makes me plead for world peace."
Roger Kovack contributed Case Study 1 to this book.
Jon Pinnock
Jonathan Pinnock started programming in Pal III assembler on his school's PDP 8/e, with a massive 4K of memory,
back in the days before Moore's Law reached the statute books. These days he spends most of his time developing
and extending the increasingly successful PlatformOne product set that his company, JPA, markets to the financial
services community. JPA's home page is at: www.jpassoc.co.uk
"My heartfelt thanks go to Gail, who first suggested getting into writing, and now suffers the consequences
on a fairly regular basis, and to Mark and Rachel, who just suffer the consequences."

Jon Pinnock contributed Chapter 9 to this book.
Jeff Rafter
Jeff Rafter currently resides in Iowa City, where he is studying Creative Writing at the University of Iowa. For the
past two years, he has worked with Standfacts Credit Services, a Los Angeles based company, developing XML
interfaces for use in the mortgage industry. He also leads the XML development for Defined Systems, a web hosting
company founded with his long time friend Dan English. In his free time, Jeff composes sonnets, plays chess in parks,
skateboards, and reminisces about the Commodore 64 video game industry of the late 1980s.


"I thank God for his love and grace in all things. I would also like to thank my beautiful wife Ali, who is the
embodiment of that love in countless ways. She graciously encouraged me to pursue my dreams at any cost.
Thanks also to Mike McKay who was first a servant and then a friend as I worked through the writing
process.
Finally, I would like to thank Vicky, Peter, Sarah, Simon, Victoria, Marsha and everyone at Wrox for the
opportunity and support. I would also like to express my gratitude to the invaluable reviewers."
Jeff Rafter contributed Chapters 5, 6, and 7 to this book.

Introduction
Welcome to Beginning XML, 2nd Edition, the book I wish I'd had when I was first learning the language!
When we wrote the 1st Edition of this book, XML was a relatively new language, but it was already gaining ground
fast and becoming more and more widely used in a vast range of applications. By the time we started the 2nd Edition,
XML had already proven itself to be more than a passing fad, and was in fact being used throughout the industry for
an incredibly wide range of purposes. There are also quite a number of specifications surrounding XML, which either
use XML or add functionality to the XML core specification, and which allow developers to do some pretty
powerful things.
So what is XML? It's a markup language, used to describe the structure of data in meaningful ways. Anywhere that
data is input, output, stored, or transmitted from one place to another is a potential fit for XML's capabilities. Perhaps
the best-known applications are web related (especially with the latest developments in handheld web access, for
which some of the technology is XML-based). But there are many other non-web-based applications where XML is
useful, for example as a replacement for (or complement to) traditional databases, or for the transfer of financial
information between businesses.
This book aims to teach you all you need to know about XML: what it is, how it works, what technologies surround
it, and how it can best be used in a variety of situations, from simple data transfer to using XML in your web pages. It
will answer the fundamental questions:

• What is XML?
• How do I use XML?
• How does it work?
• What can I use it for, anyway?

Who is this Book For?
This book is for people who know that it would be a pretty good idea to learn the language, but aren't 100% sure
why. You've heard the hype, but haven't seen enough substance to figure out what XML is, and what it can do. You
may already be somehow involved in web development, and probably even know the basics of HTML, although
neither of these qualifications is absolutely necessary for this book.
What you don't need is knowledge of SGML (XML's predecessor), or even markup languages in general. This book
assumes that you're new to the concept of markup languages, and we have tried to structure it in a way that will make
sense to the beginner, and yet quickly bring you to XML expert status.
The word "Beginning" in the title refers to the style of the book, rather than the reader's experience level. There are
two types of beginner for whom this book will be ideal:

• Programmers who are already familiar with some web programming or data exchange techniques. You will
  already be used to some of the concepts discussed here, but will learn how you can incorporate XML
  technologies to enhance those solutions you currently develop.
• Those working in a programming environment but with no substantial knowledge or experience of web
  development or data exchange applications. As well as learning how XML technologies can be applied to
  such applications, you will be introduced to some new concepts to help you understand how such systems
  work.

What's Covered in this Book?
I've tried to arrange the subjects covered in this book to take you from no knowledge to expert, in as logical a
manner as I could. We'll be using the following format:

• First, we'll be looking at what exactly XML is, and why the industry felt that a language like this was needed.
• After covering the why, the next logical step is the how, so we'll be seeing how to create well-formed XML.
• Once we understand the whys and hows of XML, we'll unleash the programmer within, and look at an
  XML-based programming language that we can use to transform XML documents from one format to
  another.
• Now that you're comfortable with XML, and have seen it in action, we'll go on to some more advanced
  things you can do when creating your XML documents, to make them not only well-formed, but valid. (And
  we'll talk about what "valid" really means.)
• XML wouldn't really be useful unless we could write programs to read the data in XML documents, and
  create new XML documents, so we'll get back to programming, and look at a couple of ways that we can do
  that. We'll also take a look at a technology that allows us to send messages across the Internet, which uses
  XML.
• Since we have all of this data in XML format, it would be great if we could easily display it to people, and it
  turns out we can. We'll look at a technology you may already have been using in conjunction with HTML
  documents that also works great with XML. We'll also look at how this data fits in with traditional databases,
  and even how we can link XML documents to one another.
• And finally, we'll finish off with some case studies, which should help to give you ideas on how XML can be
  used in real-life situations, and which could be used in your own applications.
This book builds on the strengths of the first edition, and provides new material to reflect the changes in the XML
landscape (notably SOAP and Web Services, and the publication of the XML Schemas Recommendation by the
W3C) since we first published it in June 2000.
The chapters are broken down as follows:

Chapter 1: What is XML?
Here we'll cover some basic concepts, introducing the fact that XML is a markup language (a bit like HTML) where
you can define your own elements, tags and attributes (known as a vocabulary). We'll see that tags have no
presentation meaning; they're just a way of describing the structure of data.
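To make the idea of a self-defined vocabulary a little more concrete, here is a minimal sketch in Python (our choice of language for illustration only; it uses the standard xml.etree.ElementTree module). The element names order and item, the sku attribute, and the describe_order function are all invented for this example and do not come from the chapter.

```python
import xml.etree.ElementTree as ET

# A hypothetical vocabulary: "order", "item", and "sku" are made-up names.
# XML attaches no display meaning to them; they only describe structure.
DOCUMENT = """
<order id="1001">
  <item sku="A-1">Widget</item>
  <item sku="B-2">Gadget</item>
</order>
"""


def describe_order(xml_text):
    """Return (order id, [(sku, item text), ...]) read from the XML text."""
    root = ET.fromstring(xml_text.strip())
    items = [(item.get("sku"), item.text) for item in root.findall("item")]
    return root.get("id"), items
```

Calling describe_order(DOCUMENT) gives back the order id together with the two (sku, text) pairs, which shows that a program can recover the structure without knowing anything about how the data should look on screen.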




Chapter 2: Well-Formed XML
As well as explaining what well-formed XML is, we'll take a look at the rules that exist (the XML 1.0
Recommendation) for naming and structuring elements; you need to comply with these rules if your XML is to be
well-formed.
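One rough way to see the difference between well-formed and malformed XML is to hand both to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is our own. It only checks well-formedness as that parser reports it and says nothing about validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

For example, is_well_formed("<name><first>John</first></name>") returns True, while is_well_formed("<name><first>John</name></first>") returns False because the closing tags overlap instead of nesting.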

Chapter 3: Namespaces
Because tags can be made up, we need to avoid name conflicts when sharing documents. Namespaces provide a
way to uniquely identify a group of tags, using a URI. This chapter explains how to use namespaces.
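As a small illustration of how a URI keeps two identically named tags apart, consider the following sketch. It assumes Python's standard xml.etree.ElementTree module, and both namespace URIs (and the pers and book prefixes) are hypothetical identifiers chosen for this example.

```python
import xml.etree.ElementTree as ET

# Both URIs are hypothetical; they act only as unique identifiers.
DOCUMENT = """
<doc xmlns:pers="http://example.com/person"
     xmlns:book="http://example.com/book">
  <pers:title>Dr.</pers:title>
  <book:title>Beginning XML</book:title>
</doc>
"""


def titles_by_namespace(xml_text):
    """Map each child's namespace-qualified tag to its text."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree expands each prefixed name to the form "{URI}localname".
    return {child.tag: child.text for child in root}
```

The two title elements come back under different keys, "{http://example.com/person}title" and "{http://example.com/book}title", so there is no conflict even though the local name is the same.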

Chapter 4: XSLT
XML can be transformed into other XML, HTML, and other formats using XSLT stylesheets, which are introduced
here. The XPath language, used to locate sections in the XML document, is also covered. This chapter has been
completely overhauled from the first edition.
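To give a flavor of locating sections of a document with a path expression, here is a hedged sketch. Python's standard xml.etree.ElementTree module is not an XSLT processor and supports only a limited subset of XPath, so this shows the path idea rather than a stylesheet. The catalog document and the titles_for_year function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up catalog used only to demonstrate a path expression.
CATALOG = """
<catalog>
  <book year="2000"><title>First Edition</title></book>
  <book year="2001"><title>Second Edition</title></book>
</catalog>
"""


def titles_for_year(xml_text, year):
    """Return the titles of all <book> children whose year attribute matches."""
    root = ET.fromstring(xml_text.strip())
    # Child step, attribute predicate, child step: part of ElementTree's
    # supported XPath subset.
    path = f"./book[@year='{year}']/title"
    return [title.text for title in root.findall(path)]
```

Here titles_for_year(CATALOG, "2001") returns ["Second Edition"], and a year with no matching book gives an empty list.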

Chapter 5: Document Type Definitions
We can specify how an XML document should be structured, and even give default values, using Document Type
Definitions (DTDs). If XML conforms to the associated DTD it is known as valid XML. This chapter covers the
basics of using DTDs. Though we have shifted our emphasis on validation technologies in this edition (by giving more
coverage to XML Schemas), we still recognise the importance of DTDs in XML programming, and have provided a
completely new and refocused chapter on the subject.

Chapter 6: XML Schemas
XML Schemas recently became a Recommendation by the W3C. They are a more powerful alternative to DTDs
and are explained here. This chapter and the Advanced Schemas chapter that follows are both new to this edition.

Chapter 7: Advanced Schemas
Some more advanced concepts of using XML Schemas are covered in this chapter.


Chapter 8: The Document Object Model (DOM)
Programmers can use a variety of programming languages to manipulate XML, using the Document Object Model's
objects, interfaces, methods, and properties, which are described here.
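The following is a minimal sketch of the DOM style of working, in which the whole document is loaded as a tree of objects that can then be changed. It assumes Python's xml.dom.minidom, one DOM implementation among many; the list and item element names and the add_item helper are hypothetical.

```python
from xml.dom.minidom import parseString


def add_item(xml_text, name):
    """Parse xml_text, append <item>name</item> to the root, return the XML."""
    doc = parseString(xml_text)
    item = doc.createElement("item")            # create a new element node
    item.appendChild(doc.createTextNode(name))  # give it a text child
    doc.documentElement.appendChild(item)       # attach it to the root element
    return doc.documentElement.toxml()
```

Calling add_item("<list><item>one</item></list>", "two") returns "<list><item>one</item><item>two</item></list>". The createElement, createTextNode, and appendChild calls are standard DOM methods, so similar code can be written in other languages that offer a DOM.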

Chapter 9: The Simple API for XML (SAX)
An alternative to the DOM for programmatically manipulating XML data is to use the Simple API for XML (SAX)
as an interface. This chapter shows how to use SAX, and has been updated from the first edition to focus on SAX
2.0.
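By contrast with the DOM, SAX reports a document as a stream of events while it is being read. The sketch below assumes Python's standard xml.sax package rather than any particular parser used later in the book; the ElementCounter class and the count_elements function are our own names.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Count how many times each element name starts in the document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # The parser calls this once for every opening tag it encounters.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_text):
    """Parse xml_text with SAX and return a {element name: count} dict."""
    handler = ElementCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts
```

For instance, count_elements("<a><b/><b/></a>") returns {"a": 1, "b": 2}. No tree is built in memory; the handler sees each tag as it goes by.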

Chapter 10: SOAP
The Simple Object Access Protocol (SOAP) is a specification for allowing cross-computer communications, and is
fundamental to XML Web Services. We can package up XML documents, and send them across the Internet to be
processed. This chapter explains SOAP and XML Web Services, and is new to this edition.
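As a rough picture of what packaging up an XML document means, the sketch below wraps a payload in an Envelope and Body. It assumes the SOAP 1.1 envelope namespace and Python's standard xml.etree.ElementTree module; the chapter itself may use a different version or toolkit. The GetPrice and item names are hypothetical, and nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the SOAP 1.1 envelope.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_envelope(payload):
    """Wrap an XML element in a SOAP 1.1 Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return envelope


def example_message():
    """Serialize a made-up request; "GetPrice" and "item" are hypothetical."""
    request = ET.Element("GetPrice")
    ET.SubElement(request, "item").text = "widget"
    return ET.tostring(build_envelope(request), encoding="unicode")
```

The string returned by example_message() is what would travel in the body of a request to a Web Service; sending it and handling the reply are separate steps not shown here.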


Chapter 11: Displaying XML
Web site designers have long been using Cascading Style Sheets (CSS) with their HTML to easily make changes to
a web site's presentation without having to touch the underlying HTML documents. This power is also available for
XML, allowing you to display XML documents right in the browser. Or, if you need a bit more flexibility with your
presentation, you can use XSLT to transform your XML to HTML or XHTML.

Chapter 12: XML and Databases
XML is perfect for structuring data, and some traditional databases are beginning to offer support for XML. These
are discussed, as well as a more general overview of how XML can be used in an n-tier architecture.

Chapter 13: Linking XML
We can locate specific parts of the XML document using XPath and XPointer. We can also link sections of
documents and other resources using XLink. Both XPointer and XLink are described in this chapter.


Case Studies 1 and 2
Throughout the book you'll gain an understanding of how XML is used in web, business to business (B2B), data
storage, and many other applications. These case studies cover some example applications and show how the theory
can be put into practice in real life situations. Both are new to this edition.

Appendices
These provide reference material that you may find useful as you begin to apply the knowledge gained throughout the
book in your own applications.

What You Need to Use this Book
Because XML is a text-based technology, all you really need to create XML documents is Notepad, or your
equivalent text editor. However, to really see some of these samples in action, you might want to have Internet
Explorer 5 or later, since this browser can natively read XML documents, and even provide error messages if
something is wrong. For readers without IE, there will be some screenshots throughout the book, so that you can see
what things would look like.
If you do have IE, you also have an implementation of the DOM, which you may find useful in the chapter on that
subject.
Some of the examples, and the case studies, require access to a web server, such as Microsoft's IIS (or PWS).
Throughout the book, other (freely available) XML tools will be used, and we'll give instructions for obtaining these
at the appropriate place.

Conventions
To help you understand what's going on, and in order to maintain consistency, we've used a number of conventions
throughout the book:
When we introduce new terms, we highlight them.
Important
These boxes hold important information.
Advice, hints, and background information come in an indented, italicized font like this.

Try It Out
After learning something new, we'll have a Try It Out section, which will demonstrate the concepts learned, and get
you working with the technology.

How It Works
After a Try It Out section, there will sometimes be a further explanation, to help you relate what you've done to what
you've just learned.
Words that appear on the screen in menus like the File or Window menu are in a similar font to what you see on
screen. URLs are also displayed in this font.
On the occasions when we'll be running the examples from the command line, we'll show the command and the
results like this:
>msxsl blah.xml blah.xsl
<root>
<results/>
</root>

Keys that you press on the keyboard, like Ctrl and Enter, are in italics.
We use two font styles for code. If it's a word that we're talking about in the text, for example, when discussing
functionNames(), <Elements>, and Attributes, it will be in a fixed pitch font. If it's a block of code that you can type in
and run, or part of such a block, then it's also in a gray box:
<html>
<head>
<title>Simple Example</title>
</head>
<body>
Very simple HTML.
</body>
</html>

Sometimes you'll see code in a mixture of styles, like this:
<html>
<head>
<title>Simple Example</title>
</head>
<body>
Very simple HTML.
</body>
</html>

In this case, we want you to consider the code with the gray background, for example to modify it. The code with a
white background is code we've already looked at, and that we don't wish to examine further.

Customer Support
We always value hearing from our readers, and we want to know what you think about this book: what you liked,
what you didn't like, and what you think we can do better next time. You can send us your comments, either by
returning the reply card in the back of the book, or by e-mail to Please be sure to mention the
book title in your message.

How to Download the Sample Code for the Book
When you visit the Wrox site, simply locate the title through our Search facility or by using
one of the title lists. Click on Download in the Code column, or on Download Code on the book's detail page.
The files that are available for download from our site have been archived using WinZip. When you have saved the
file to a folder on your hard-drive, you need to extract the files using a de-compression program such as WinZip or
PKUnzip. When you extract the files, the code is usually extracted into chapter folders. When you start the extraction
process, ensure your software is set to use folder names.

Errata
We've made every effort to make sure that there are no errors in the text or in the code in this book. However, no
one is perfect and mistakes do occur. If you find an error in one of our books, like a spelling mistake or a faulty piece
of code, we would be very grateful for feedback. By sending in errata you may save another reader hours of
frustration, and of course, you will be helping us provide even higher quality information. Simply e-mail the information
to , your information will be checked and if correct, posted to the errata page for that title and
used in subsequent editions of the book.
To see if there are any errata for this book on the web site, go to and simply locate the title
through our Search facility or title list. Click on the Book Errata link, which is below the cover graphic on the book's
detail page.

E-mail Support
If you wish to directly query a problem in the book with an expert who knows the book in detail, then e-mail
with the title of the book and the last four numbers of the ISBN in the subject field of the e-mail.
A typical e-mail should include the following things:
• The title of the book, last four digits of the ISBN, and page number of the problem in the Subject field.
• Your name, contact information, and the problem in the body of the message.
We won't send you junk mail. We need the details to save your time and ours. When you send an e-mail message, it
will go through the following chain of support:

• Customer Support: Your message is delivered to our customer support staff, who are the first people to read
  it. They have files on most frequently asked questions and will answer anything general about the book or the
  web site immediately.
• Editorial: Deeper queries are forwarded to the technical editor responsible for that book. They have
  experience with the programming language or particular product, and are able to answer detailed technical
  questions on the subject.
• The Authors: Finally, in the unlikely event that the editor cannot answer your problem, he or she will forward
  the request to the author. We do try to protect the author from any distractions to their writing; however, we
  are quite happy to forward specific requests to them. All Wrox authors help with the support on their books.
  They will e-mail the customer and the editor with their response, and again all readers should benefit.
The Wrox Support process can only offer support to issues that are directly pertinent to the content of our published
title. Support for questions that fall outside the scope of normal book support is provided via the community lists of
our forum.

p2p.wrox.com

For author and peer discussion join the P2P mailing lists. Our unique system provides programmer to
programmer contact on mailing lists, forums, and newsgroups, all in addition to our one-to-one e-mail support
system. If you post a query to P2P, you can be confident that it is being examined by the many Wrox authors and
other industry experts who are present on our mailing lists. At p2p.wrox.com you will find a number of different lists
that will help you, not only while you read this book, but also as you develop your own applications. Particularly
appropriate to this book are the XML and XSLT lists.
To subscribe to a mailing list just follow these steps:
1. Go to the P2P site.
2. Choose the appropriate category from the left menu bar (in this case, XML).
3. Click on the mailing list you wish to join.
4. Follow the instructions to subscribe and fill in your e-mail address and password.
5. Reply to the confirmation e-mail you receive.
6. Use the subscription manager to join more lists and set your e-mail preferences.

Why this System Offers the Best Support
You can choose to join the mailing lists or you can receive them as a weekly digest. If you don't have the time, or
facility, to receive the mailing list, then you can search our online archives. Junk and spam mails are deleted, and your
own e-mail address is protected by the unique Lyris system. Queries about joining or leaving lists, and any other
general queries about lists, should be sent to


Chapter 1: What is XML?

Overview
Extensible Markup Language (XML) is a buzzword you will see everywhere on the Internet, but it's also a
rapidly maturing technology with powerful real-world applications, particularly for the management, display and
organization of data. Together with its many related technologies, which will be covered in later chapters, it is an
essential technology for anyone using markup languages on the Web or internally. This chapter will introduce you to
some of the basics of XML, and begin to show you why it is so important to learn about it.
We will cover:

• The two major categories of computer file types (binary files and text files) and the advantages and
  disadvantages of each
• The history behind XML, including other markup languages: SGML and HTML
• How XML documents are structured as hierarchies of information
• A brief introduction to some of the other technologies surrounding XML, which you will be working with
  throughout the book
• A quick look at some areas where XML is proving to be useful
While there are some short examples of XML in this chapter, you aren't expected to understand what's going
on just yet. The idea is simply to introduce the important concepts behind the language, so that throughout
the book you can see not only how to use XML, but also why it works the way that it does.

Of Data, Files, and Text
XML is a technology concerned with the description and structuring of data, so before we can really delve into the
concepts behind XML, we need to understand how data is stored and accessed by computers. For our purposes,
there are two kinds of data files that are understood by computers: text files and binary files.

Binary Files
A binary file, at its simplest, is just a stream of bits (1's and 0's). It's up to the application that created a binary file
to understand what all of the bits mean. That's why binary files can only be read and produced by certain computer
programs, which have been specifically written to understand them.
For example, when a document is created with a word processor, the program creates a binary file in its own
proprietary format. The programmers who wrote the word processor decided to insert certain binary codes into the
document to denote bold text, other codes to denote page breaks, and many other codes for all of the information
that needs to go into these documents. When you open a document in the word processor it interprets those codes,
and displays the properly formatted text on the screen, or prints it to the printer.
The codes inserted into the document are meta data, or information about information. Examples could be "this
word should be in bold", "that sentence should be centered", etc. This meta data is really what differentiates one file
type from another; the different types of files use different kinds of meta data.
For example, a word processing document will have different meta data from a spreadsheet document, since they are
describing different things. Not so obviously, word processing documents from different word processing applications
will also have different meta data because the applications were written differently:

As the above diagram shows, a document created with one word processor cannot be assumed to be readable in or
used by another, because the companies who write word processors all have their own proprietary formats for their
data files. So Word documents open in Microsoft Word, and WordPerfect documents open in WordPerfect.
Luckily for us, most word processors come with translators, which can translate documents from other word
processors into formats that can be understood natively. Of course, many of us have seen the garbage that sometimes
occurs as a result of this translation; sometimes applications are not as good as we'd like them to be at converting the
information.
The advantage of binary file formats is that it is easy for computers to understand these binary codes, meaning that
they can be processed much faster, and they are very efficient for storing this meta data. There is also a disadvantage,
as we've seen, in that binary files are "proprietary". You might not be able to open binary files created by one
application in another application, or even in the same application running on another platform.
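
To make the idea of embedded binary codes concrete, here is a minimal sketch in Python. The two control codes and the render function are invented purely for this illustration and do not correspond to any real word processor's format. The sketch only shows that a program must already know what each code means before it can display the document properly; bold text is shown in upper case to keep the output plain.

```python
# Hypothetical control codes, invented for this sketch only.
BOLD_ON = 0x01
BOLD_OFF = 0x02


def render(data: bytes) -> str:
    """Interpret the invented format, showing bold text in upper case."""
    out = []
    bold = False
    for byte in data:
        if byte == BOLD_ON:
            bold = True          # start of a bold run
        elif byte == BOLD_OFF:
            bold = False         # end of a bold run
        else:
            ch = chr(byte)
            out.append(ch.upper() if bold else ch)
    return "".join(out)


# A "document" in the invented format: the word "bold" is wrapped in codes.
document = b"this word is \x01bold\x02."
rendered = render(document)
print(rendered)
```

A program that does not know about these two codes would have no way to tell that the bytes 0x01 and 0x02 are instructions rather than part of the text, which is the interchange problem described above.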

Text Files


Like binary files, text files are also streams of bits. However, in a text file these bits are grouped together in
standardized ways, so that they always form numbers. These numbers are then further mapped to characters. For
example, a text file might contain the bits:
1100001

This group of bits could be translated as the number "97", which would then be further translated into the letter "a".
This example makes a number of assumptions. A better description of how numbers are represented in text
files is given in the section on "Encoding" in Chapter 2.
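
The two-step mapping just described (bits to a number, number to a character) can be sketched in a few lines of Python. This is only an illustration and assumes the simple 7-bit ASCII mapping; the variable names are ours.

```python
# The seven bits from the text, written as a string of 0s and 1s.
bits = "1100001"

# Step 1: read the group of bits as a base-2 number.
number = int(bits, 2)

# Step 2: map that number to a character.
character = chr(number)

print(number, character)

# Going the other way: from a character back to its bits.
round_trip = format(ord(character), "b")
```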
Because of these standards, text files can be read by many applications, and can even be read by humans, using a
simple text editor. If I create a text document, anyone in the world can read it (as long as they understand English, of
course), in any text editor they wish. There are still some issues, like the fact that different operating systems treat line
ending characters differently, but it is much easier to share information with others than with binary formats.
The following diagram shows just some of the applications on my machine that are capable of opening text files.
Some of these programs will just allow me to view the text, while others will let me edit it as well.

In its beginning, the Internet was almost completely text-based, which allowed people to communicate with relative
ease. This contributed to the explosive rate at which the Internet was adopted, and to the ubiquity of applications like
e-mail, the World Wide Web, newsgroups, etc.

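The disadvantage of text files is that it's more difficult and bulky to add other information (our meta data, in other
words). For example, most word processors allow you to save documents in text form, but if you do then you can't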
mark a section of text as bold, or insert a binary picture file. You will simply get the words, with none of the
formatting.

A Brief History of Markup
We can see that there are advantages to binary file formats (easy to understand by a computer, compact), as well as
advantages to text files (universally interchangeable). Wouldn't it be ideal if there were a format that combined the
universality of text files with the efficiency and rich information storage capabilities of binary files?
This idea of a universal data format is not new. In fact, for as long as computers have been around, programmers
have been trying to find ways to exchange information between different computer programs. An early attempt to
combine a universally interchangeable data format with rich information storage capabilities was SGML (Standard
Generalized Markup Language). This is a text-based language that can be used to mark up data (that is, add
meta data) in a way which is self-describing. We'll see in a moment what self-describing means.
SGML was designed to be a standard way of marking up data for any purpose, and took off mostly in large
document management systems. It turns out that when it comes to huge amounts of complex data there are a lot of
considerations to take into account and, as a result, SGML is a very complicated language. With that complexity
comes power, though.
A very well-known language, based on the SGML work, is the HyperText Markup Language, or HTML.
HTML uses many of SGML's concepts to provide a universal markup language for the display of information, and the
linking of different pieces of information. The idea was that any HTML document (or web page) would be
presentable in any application that was capable of understanding HTML (termed a web browser).

Not only would that browser be able to display the document, but also if the page contained links (termed
hyperlinks) to other documents, the browser would be able to seamlessly retrieve them as well.
Furthermore, because HTML is text-based, anyone can create an HTML page using a simple text editor, or any
number of web page editors, some of which are shown below:

Even many word processors, such as WordPerfect and Word, allow you to save documents as HTML. Think about
the ramifications of these two diagrams: any HTML editor, including a simple text editor, can create an HTML file,
and that HTML file can then be viewed in any web browser on the Internet!

So What is XML?
Unfortunately, SGML is such a complicated language that it's not well suited for data interchange over the Web. And,
although HTML has been incredibly successful, it's also limited in its scope: it is only intended for displaying
documents in a browser. The tags it makes available do not provide any information about the content they
encompass, only instructions on how to display that content. This means that I could create an HTML document
which displays information about a person, but that's about all I could do with the document. I couldn't write a
program to figure out from that document which piece of information relates to the person's first name, for example,
because HTML doesn't have any facilities to describe this kind of specialized information. In fact, that program
wouldn't even know that the document was about a person at all.
Extensible Markup Language (XML) was created to address these issues.
Note that it's spelled "Extensible", not "eXtensible". Mixing these up is a common mistake.
XML is a subset of SGML, with the same goals (mark up of any type of data), but with as much of the complexity
eliminated as possible. XML was designed to be fully compatible with SGML, which means that any document which
follows XML's syntax rules is by definition also following SGML's syntax rules, and can therefore be read by existing
SGML tools. It doesn't go both ways though, so an SGML document is not necessarily an XML document.
It is important to realize, however, that XML is not really a "language" at all, but a standard for creating languages
that meet the XML criteria (we'll go into these rules for creating XML documents in Chapter 2). In other words,
XML describes a syntax that you use to create your own languages. For example, suppose I have data about a name,
and I want to be able to share that information with others and I also want to be able to use that information in a
computer program. Instead of just creating a text file like this:
John Doe

or an HTML file like this:
<html>
<head><title>Name</title></head>
<body>
John Doe
</body>
</html>

I might create an XML file like this:
<name>
<first>John</first>
<last>Doe</last>
</name>

Even from this simple example you can see why markup languages like SGML and XML are called "self-describing".
Looking at the data, you can easily tell that this is information about a <name>, and you can see that there is data
called <first> and more data called <last>. I could have given the tags any names I liked; however, if you're going to
use XML, you might as well use it right and give things meaningful names.
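
To see what "self-describing" buys a programmer, here is a minimal sketch of a program that picks out the first and last name from the XML above. It assumes Python and its standard xml.etree.ElementTree module, neither of which is part of this chapter; any XML parser would serve equally well.

```python
import xml.etree.ElementTree as ET

# The XML document from the text, held in a string for convenience.
xml_text = """<name>
<first>John</first>
<last>Doe</last>
</name>"""

# Parse the text; the result is the root element, <name>.
root = ET.fromstring(xml_text)

# Because the data is labeled, we can ask for each piece by name.
first = root.find("first").text
last = root.find("last").text

print(root.tag, first, last)
```

The same request cannot be made of the plain-text or HTML versions, because nothing in them says which word is the first name.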
You can also see that the XML version of this information is much larger than the plain-text version. Using XML to
mark up data will add to its size, sometimes enormously, but achieving small file sizes isn't one of the goals of XML;
it's only about making it easier to write software that accesses the information, by giving structure to data. However,
this larger file size should not deter you from using XML. The advantages of easier-to-write code far outweigh the
disadvantage of the extra bandwidth required. Also, if bandwidth is a critical issue for your applications, you can always
compress your XML documents before sending them across the network; compressing text files yields very good
results.
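
As a rough illustration of the compression point, the sketch below builds an XML document in memory and compresses it with Python's standard zlib module. The choice of Python and zlib is ours, as is the invented <names> wrapper element. The sample data is deliberately repetitive, so it compresses far better than typical real-world documents would; treat the result as an indication only.

```python
import zlib

# One record, repeated many times inside an invented <names> wrapper.
record = "<name>\n<first>John</first>\n<last>Doe</last>\n</name>\n"
document = ("<names>\n" + record * 1000 + "</names>\n").encode("utf-8")

# Compress before "sending", then decompress on the receiving side.
compressed = zlib.compress(document)
restored = zlib.decompress(compressed)

# Compare the sizes in bytes before and after compression.
print(len(document), len(compressed))
```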

Try It Out: Opening an XML File in Internet Explorer
If you're running IE 5 or later, our XML from above can be viewed in your browser.
1. Open up Notepad and type in the following XML:
<name>
<first>John</first>
<last>Doe</last>
</name>

2. Save the document to your hard drive as name.xml.
3. You can then open it up in IE 5 (for example by double-clicking on the file in Windows Explorer), where it
will look something like this:

How It Works
Although our XML file has no information concerning display, IE 5 formats it nicely for us, with our information in
bold, and our markup displayed in different colors. Also, <name> is collapsible, like your file folders in Windows
Explorer; try clicking on the - sign next to <name> in the browser window. For large XML documents, where you
only need to concentrate on a smaller subset of the data, this can be quite handy.
This is one reason why IE 5 can be so helpful when authoring XML: it has a default stylesheet built in, which applies
this default formatting to any XML document.
XML styling is accomplished through another document dedicated to the task, called a stylesheet. In a
stylesheet the designer specifies rules that determine the presentation of the data. The same stylesheet can
then be used with multiple documents to create a similar appearance among them. There are a variety of
languages that can be used to create stylesheets. In Chapter 4 we'll learn about a transformation stylesheet
language called Extensible Stylesheet Language Transformations (XSLT) and in Chapter 11 we'll be looking at