PART V

Programming
CHAPTER 11: Event-Driven Programming
CHAPTER 12: LINQ to XML

c11.indd 401

05/06/12 5:39 PM

11
Event-Driven Programming
WHAT YOU WILL LEARN IN THIS CHAPTER:

➤ The necessity of XML data access methods: SAX and .NET’s XmlReader
➤ Why SAX and XmlReader are considered event-driven methods
➤ How to use SAX and XmlReader
➤ The right time to choose one of these methods to process your XML

There are many ways to extract information from an XML document. You’ve already seen
how to use the Document Object Model (DOM) and XPath; both of these methods can be used to find
any relevant item of data. Additionally, in Chapter 12 you’ll meet LINQ to XML, Microsoft’s
latest attempt to incorporate XML data retrieval in its universal data access strategy.
Given the wide variety of methods already available, you may be wondering why you need
more, and in particular why you need event-driven methods. The main reason is memory
limitations. Other XML processing methods require that the whole XML document be loaded into
memory (that is, RAM) before any processing can take place. Because XML documents typically
use up to four times more RAM than the size of the file containing the document, some documents
can take up more RAM than is available on a computer; it is therefore necessary to find an
alternative method to extract data. This is where event-driven paradigms come into play.
Instead of loading the complete file into memory, the file is processed in sequence. There are
two ways to do this: SAX and .NET’s XmlReader. Both are covered in this chapter.

UNDERSTANDING SEQUENTIAL PROCESSING
There are two main ways of processing a file sequentially. The first relies on events being fired
whenever specific items are found; whether you respond to these events is up to you. For example, say an
event is fired when the opening tag of the root element is encountered, and the name of this element
is passed to the event handler. Any time textual content is found after this, another event is
fired. In this scenario there would also be events that capture the closing of any elements, with the
final event being fired when the closing tag of the root element is encountered.
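The event sequence just described can be sketched in Java with the SAX classes that ship with the JDK. This is a minimal illustration rather than one of the chapter’s own examples: the class name SaxEventSketch and the method collectEvents are hypothetical, and the handler simply records each event as a string instead of doing real work.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this illustration.
class SaxEventSketch {

    // Parses the XML text and returns the events in the order they were fired.
    static List<String> collectEvents(String xml) {
        final List<String> events = new ArrayList<String>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Fired for every opening tag; the element name is passed in.
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Fired for text content. A parser is allowed to deliver one
                // run of text in several chunks, so this may fire more than once.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Fired for every closing tag; the root's closing tag comes last.
                events.add("end:" + qName);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<document><data>Here is some data.</data></document>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

Any event you are not interested in can simply be left unhandled, because DefaultHandler supplies empty implementations for all of them.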
The second method is slightly different in that you tell the processor what sort of content you are
interested in. For example, you may want to read an attribute on the first child under the root
element. To do so, you instruct the XML reader to move to the root element and then to its first child.
You would then read the attributes until you get to the one you need. Both of these methods are
conceptually similar, and both cope admirably with the memory problem posed by the DOM, which
requires the whole XML document to be loaded into memory before being processed.
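The second style can also be sketched in Java. Note the substitution: .NET’s XmlReader is not available in Java, so this sketch uses the JDK’s StAX XMLStreamReader, which follows the same “you ask for the next item” idea. The class name PullReaderSketch and the method firstChildAttribute are hypothetical names chosen for the illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; StAX stands in here for .NET's XmlReader.
class PullReaderSketch {

    // Returns the value of the named attribute on the first child of the
    // root element, or null if there is no such child or attribute.
    static String firstChildAttribute(String xml, String attributeName) {
        XMLStreamReader reader = null;
        try {
            reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int elementsSeen = 0;
            while (reader.hasNext()) {
                // The caller drives the reader forward one item at a time.
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                elementsSeen++;
                // The first start tag is the root; the second start tag in
                // document order is always the root's first child.
                if (elementsSeen == 2) {
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        if (reader.getAttributeLocalName(i).equals(attributeName)) {
                            return reader.getAttributeValue(i);
                        }
                    }
                    return null; // Stop early: the rest is never read.
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Reading failed", e);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Nothing useful can be done if closing fails.
                }
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<document><data id=\"42\">first</data><data id=\"43\"/></document>";
        System.out.println(firstChildAttribute(xml, "id"));
    }
}
```

Because the method returns as soon as it has examined the first child, the remainder of the document is never processed, which is the main attraction of this style for large files.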
Processing files in a sequential fashion has two downsides, however. The first is that you
can’t revisit content. If you read an element and then move on to one of its siblings or children,
you can’t then go back and examine one of its attributes without starting from the beginning
again. You need to plan carefully what information you’ll need. The second problem is validation.
Imagine you receive the document shown here:
<document>
  <data>Here is some data.</data>
  <data>Here is some more data.</data>
</document>

This document is well-formed, but what if its schema states that after all <data> elements there
should be a <summary> element? The processor will report the elements and text content that it
encounters, but won’t complain that the document is not valid until it reaches the relevant point,
which in this case is the closing </document> tag.
You may not care about the missing <summary> element, in which case you can just extract whatever you need,
but if you want to validate before processing begins, this usually involves reading the document
twice. This is the price you pay for not needing to load the full document into memory.
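The late arrival of the validation error can be sketched as follows. Everything here is an assumption made for illustration: the XSD is an invented schema that matches the rule described above (one or more <data> elements followed by a <summary>), and LateValidationSketch and parseWithValidation are hypothetical names. With the parser bundled in the JDK, the handler would be expected to see both <data> elements before the error is reported; other parsers may order things differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this illustration.
class LateValidationSketch {

    // Invented schema: <data> one or more times, then a required <summary>.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='document'><xs:complexType><xs:sequence>"
      + "<xs:element name='data' type='xs:string' maxOccurs='unbounded'/>"
      + "<xs:element name='summary' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Parses with schema validation switched on and returns a log of
    // element starts and validation errors in the order they occurred.
    static List<String> parseWithValidation(String xml) {
        final List<String> log = new ArrayList<String>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void error(SAXParseException e) {
                // DefaultHandler ignores validation errors unless overridden.
                log.add("error");
            }
        };

        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // Required for schema validation.
            factory.setSchema(schema);
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<document><data>Here is some data.</data>"
                   + "<data>Here is some more data.</data></document>";
        System.out.println(parseWithValidation(xml));
    }
}
```

By the time the error appears in the log, any work the handler did for the earlier elements has already happened, which is why validating up front generally means a separate first pass over the document.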
In the following sections you’ll examine the two methods in more detail. The pure event-driven
method is called SAX and is commonly used with Java, although it can be used from any language
that supports events. The second is specific to .NET and uses the System.Xml.XmlReader class.

USING SAX IN SEQUENTIAL PROCESSING
SAX stands for the Simple API for XML, and arose out of discussions on the XML-DEV list in the
late 1990s.

NOTE The archives for the XML-DEV list are available at
.org/archives/xml-dev/. The list is still very active, and XML-related
questions usually receive a response within hours, if not minutes.

At the time, people were having problems because different parsers were incompatible with one
another. David Megginson took on the job of coordinating the process of specifying a new API with
the group. On May 11, 1998, the SAX 1.0 specification was completed. A whole series of SAX
1.0–compliant parsers then began to emerge, both from large corporations, such as IBM and Sun,
and from enterprising individuals, such as James Clark. All of these parsers were freely available
for public download.

Eventually, a number of shortcomings in the specification became apparent, and David Megginson
and his colleagues got back to work, finally producing the SAX 2.0 specification on May 5, 2000.
The improvements centered on added support for namespaces and tighter adherence to the XML
specification. Several other enhancements were made to expose additional information in the XML
document, but the core of SAX was very stable. On April 27, 2004, these changes were finalized and
released as version 2.0.2.
SAX is specified as a set of Java interfaces, which initially meant that if you were going to do any
serious work with it, you were looking at doing some Java programming using Java Development
Kit (JDK) 1.1 or later. Now, however, a wide variety of languages have their own version of SAX,
some of which you learn about later in the chapter. In deference to the SAX tradition, however, the
examples in this chapter are written in Java.
All the latest information about SAX is at www.saxproject.org. It remains a public domain, open
source project hosted by SourceForge. To download SAX, go to the homepage and browse
for the latest version, or go directly to the SourceForge project page at
http://sourceforge.net/projects/sax.
This is one of the extraordinary things about SAX — it isn’t owned by anyone. It doesn’t belong to
any consortium, standards body, company, or individual. In other words, it doesn’t survive because
some organization or government says that you must use it to comply with their standards, or
because a specific company supporting it is dominant in the marketplace. It survives because it’s
simple and it works.

Preparing to Run the Examples
The SAX specification does not limit which XML parser you use with your document. SAX simply sits
on top of the parser and reports the content that the parser finds. A number of different parsers
are available out in the wild, but these examples use the one that comes with the JDK.
If you don’t have the JDK already installed, perform the following steps to do so:

1. Go to />.html. Download the latest version under the SE section. These examples use 1.6, but 1.7 is
the latest available version and will work just as well.
