Welcome to part two of our little trip down PDF lane. While last month we focused primarily on understanding the structure of a PDF document, this time around we'll look at the problem of altering the contents of a PDF file from a more practical perspective.

The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour, because the file is not only not intended for human consumption, but also does not follow a top-down logic. In other words, as we discovered last month, when parsing a PDF file one doesn't start at the beginning and move down to the end of the file. In fact, the exact opposite is true: parsing begins at the end of the file.

Since we'll often find ourselves jumping to various, completely arbitrary positions in the document, the first decision that we need to make is how we're going to access the data. While it is tempting to just load the entire file in memory, that's usually not a good idea: a PDF can be of pretty much any size, so loading an entire document in memory risks tying up large chunks of RAM and limiting our server's ability to process a large number of requests.

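As a rough illustration of the alternative, the sketch below reads only the last few bytes of a file through a file handle instead of loading the whole document. The helper name pdf_read_tail() and the temporary test file are assumptions made for this example; they are not part of the article's library.

```php
<?php

/*
 * Returns (at most) the last $length bytes of a file without
 * loading the rest of it into memory. Hypothetical helper.
 */
function pdf_read_tail ($filename, $length = 50)
{
    $f = fopen ($filename, 'rb');

    if ($f === false) {
        return false;
    }

    // Never seek back further than the file is long
    $stat = fstat ($f);
    $length = min ($length, $stat['size']);

    if ($length < 1) {
        fclose ($f);
        return '';
    }

    fseek ($f, -$length, SEEK_END);
    $data = fread ($f, $length);
    fclose ($f);

    return $data;
}

// A stand-in document: lots of filler, then a startxref pointer
$tmp = tempnam (sys_get_temp_dir (), 'pdf');
file_put_contents ($tmp, str_repeat ('x', 100000) . "startxref\n53593\n%%EOF\n");

$tail = pdf_read_tail ($tmp, 22);
unlink ($tmp);
```

Only the requested 22 bytes are ever copied into a PHP string here, no matter how large the file is.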
Yet, seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, that you're accessing a PDF document via HTTP. In this case, you'd have to download the entire file before you could find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file. Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.

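One way to follow this advice is to spool the incoming stream into a temporary file and seek in the copy. The sketch below is only an illustration: pdf_spool_to_temp() is a hypothetical name, and a php://memory stream stands in for the remote document so that the example runs without a network connection.

```php
<?php

/*
 * Copies a stream that may not be seekable (for example, one
 * opened over HTTP) into a temporary file and returns a
 * seekable handle to the copy. Hypothetical helper.
 */
function pdf_spool_to_temp ($source)
{
    // tmpfile() creates a file that is removed automatically
    // when its handle is closed
    $local = tmpfile ();

    if ($local === false) {
        return false;
    }

    // The copy is performed by the streams layer, so the whole
    // document never has to sit in a PHP string
    stream_copy_to_stream ($source, $local);
    rewind ($local);

    return $local;
}

// Stand-in for a remote document; with allow_url_fopen enabled,
// $source could instead come from fopen() on an http:// URL
$source = fopen ('php://memory', 'w+b');
fwrite ($source, "%PDF-1.4\n...\nstartxref\n53593\n%%EOF\n");
rewind ($source);

$f = pdf_spool_to_temp ($source);

// The local copy can now be accessed at arbitrary positions
fseek ($f, -22, SEEK_END);
$tail = fread ($f, 22);

fclose ($f);
fclose ($source);
```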
The one notable exception to this rule is a special class of PDF documents known as "linearized PDF files". A linearized PDF document contains a dictionary at the beginning of the file that provides the necessary facilities for determining the location of the first page in the file without having to read through the cross-reference table first. The structure of linearized PDF files is beyond the scope of this article, but you can find out more about it directly from the PDF specification document published by Adobe.

Getting Started
The first thing we need to do in order to interpret the contents of a PDF document is to determine where the cross-reference table and trailer dictionary are. This is quite easy if you consider that the format of the startxref pointer is fixed. For example, in my document it looks like the following:

    startxref
    53593
    %%EOF

FEATURE
In the Belly of the Beast
Interpreting and Manipulating PDF Files
by Marco Tabini
PHP Architect, May 2004 (www.phparch.com)

In last month's issue, we examined the structure and contents of a PDF document in considerable detail. This month, we'll actually write a PHP library capable of opening one and modifying its contents.

REQUIREMENTS
PHP: 4.3.0+
OS: Any
Applications: A PDF Reader (for testing)
Code Directory: pdf

Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression. Note how the regex pattern ends with a dollar sign, indicating that the resulting match must be anchored to the end of the data stream. Even though we're only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference table pointer by mistake. If you're wondering why the cross-reference table pointer is not saved to the document using a fixed-width format (say, for example, using 10 digits for the offset, like the cross-reference entries themselves), you're not alone. This decision is a bit of a mystery, but it's something that we have to live with.

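To see what the end-of-data anchor buys us, consider the contrived fragment below, in which two pointers fit inside the data being searched. The sample data and variable names are assumptions made for this illustration; the pattern follows the same structure as the one discussed above.

```php
<?php

// A contrived tail holding two pointers: an older one (17)
// followed by the most recent one (53593)
$data = "startxref\n17\n%%EOF\nstartxref\n53593\n%%EOF\n";

// Any of the three newline conventions
$nl = '(?:\r\n|\r|\n)';
$core = 'startxref' . $nl . '(\d+)' . $nl . '%%EOF' . $nl;

// Without the anchor, the engine returns the first pointer
// it comes across
preg_match ('/' . $core . '/', $data, $unanchored);

// With the trailing $, only the pointer at the very end of
// the data can match
preg_match ('/' . $core . '$/', $data, $anchored);

$first = (int) $unanchored[1];
$last = (int) $anchored[1];
```

Here $first ends up as 17 and $last as 53593, which is the value we actually want.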
By the way, throughout the remainder of the article, you'll notice that I have created an individual include file for each of the functions that we will be writing. This is clearly not a good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you'll forgive me and that, if you decide to use any of the code in your own projects, you will not follow the same layout.

Reading the Cross-reference Table
Now that we know where to look for it, it's time to figure out how to read the cross-reference table itself. If we move to offset 53,593 of the file, we'll find the following (only the first few entries are shown):

    xref
    0 22
    0000000000 65535 f
    0000000017 00000 n
    0000005632 00000 n
    0000005659 00000 n
    0000006483 00000 n
    0000053169 00000 n
    0000006509 00000 n
    0000039936 00000 n

The word xref is followed by the number of the first object represented in the table (0 in this case) and the number of entries that follow (twenty-two); we'll call this the "header" of the table. Next come the entries themselves: for each line, we have the offset at which the object can be found (10 characters), followed by the generation number (5 characters) and the letter n for objects that are in use or f for objects that are free.

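Because the columns sit at fixed positions, a single entry can be taken apart with nothing more than substr(). The following is a minimal sketch; the function name and the shape of the returned array are assumptions made for illustration.

```php
<?php

/*
 * Splits one cross-reference entry into its three fields.
 * Hypothetical helper; it relies only on the fixed column
 * positions of the entry.
 */
function pdf_parse_xref_entry ($line)
{
    return array (
        'offset'     => (int) substr ($line, 0, 10),
        'generation' => (int) substr ($line, 11, 5),
        'in_use'     => substr ($line, 17, 1) === 'n',
    );
}

$used = pdf_parse_xref_entry ('0000005632 00000 n');
$free = pdf_parse_xref_entry ('0000000000 65535 f');
```

For the first entry, this yields offset 5632 and generation 0, with the object marked as in use; the second entry is free and carries generation 65535.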
There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it. However, you should keep in mind that the line endings in the cross-reference table follow the rules of the PDF format and the habits of the application that wrote the file, not those of the platform your script is running on: an entry may end in a carriage return/line feed pair, or in a lone carriage return or line feed (and the rest of the file does not necessarily use the same convention). Therefore, you must instruct the PHP interpreter to work out the line endings on its own. This can be accomplished by turning on the auto_detect_line_endings INI directive (which became available as of PHP 4.3.0). We can do this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring it to its original value. This sequence of operations is important, because other portions of our script may depend on the directive being in a different state than the one we need.

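If you would rather not depend on an INI directive at all (auto_detect_line_endings has, to my knowledge, been deprecated since PHP 8.1), a different technique is to read the line yourself and accept any of the three terminators. The sketch below is an alternative to the approach described above, not the method used in the listings, and pdf_read_line() is a hypothetical name.

```php
<?php

/*
 * Reads one line from $f, accepting CR, LF or CR+LF as the
 * terminator, and returns it without the terminator.
 * Returns false at the end of the file. Hypothetical helper.
 */
function pdf_read_line ($f)
{
    $line = '';

    while (($c = fgetc ($f)) !== false) {
        if ($c === "\n") {
            return $line;
        }

        if ($c === "\r") {
            // Swallow the LF of a CR+LF pair, if there is one;
            // otherwise, put the character back
            $next = fgetc ($f);

            if ($next !== false && $next !== "\n") {
                fseek ($f, -1, SEEK_CUR);
            }

            return $line;
        }

        $line .= $c;
    }

    return $line === '' ? false : $line;
}

// A table fragment that deliberately mixes all three conventions
$f = fopen ('php://memory', 'w+b');
fwrite ($f, "xref\r\n0 2\r0000000000 65535 f \n0000000017 00000 n\r\n");
rewind ($f);

$lines = array ();

while (($line = pdf_read_line ($f)) !== false) {
    $lines[] = $line;
}

fclose ($f);
```

Reading one character at a time is slower than fgets(), so this is best reserved for the small portion of the file that holds the cross-reference table.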
Another gotcha when reading the cross-reference table is that there may be more than one block of entries; that is, once you've read all the entries, you could find another header followed by a new set of entries, or you could find the trailer dictionary. If we didn't check for this possibility and simply assumed that the cross-reference table is always followed by a trailer, our code would be unable to read most documents that have been modified after their creation, since that's the situation in which partial cross-reference tables are most likely to be found.

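The check for "another header or the trailer" can be sketched with a simple loop. To keep the example short and self-contained, the hypothetical function below works on an array of lines rather than on a file, uses iteration instead of recursion, and assumes well-formed input; it is not the implementation used in the listings.

```php
<?php

/*
 * Reads every block of entries in one cross-reference table.
 * $lines holds the lines that follow the "xref" keyword.
 * Hypothetical helper; assumes well-formed input.
 */
function pdf_read_xref_blocks ($lines)
{
    $offsets = array ();
    $i = 0;

    // Keep reading blocks until the trailer keyword shows up
    while ($i < count ($lines) && trim ($lines[$i]) !== 'trailer') {

        // Block header: first object number and entry count
        list ($first, $count) = explode (' ', trim ($lines[$i++]));

        for ($n = 0; $n < (int) $count; $n++) {
            $entry = $lines[$i++];
            $offsets[(int) $first + $n] = (int) substr ($entry, 0, 10);
        }
    }

    return $offsets;
}

// Two blocks: one entry for object 0, then two entries
// starting at object 4
$offsets = pdf_read_xref_blocks (array (
    '0 1',
    '0000000000 65535 f',
    '4 2',
    '0000006483 00001 n',
    '0000053169 00000 n',
    'trailer',
));
```

The result maps objects 0, 4 and 5 to offsets 0, 6483 and 53169; a reader that expected the trailer right after the first block would have stopped at the line "4 2".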
As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise quite simple. It is written to take full advantage of the fact that the cross-reference table follows a very rigid layout, so that we can rely on the fastest and most convenient string functions provided by PHP.

Listing 1 (findxref.php)

    <?php

    /*
     * Returns the offset of the most recent
     * cross-reference table in the file
     */

    function pdf_find_xref ($f)
    {
        // First, seek to the end of the file,
        // allowing for 50 bytes just so that
        // we have enough data to look into.

        fseek ($f, -50, SEEK_END);

        // Next, try to find the proper sequence
        // of data. Note that the information can be
        // separated by a Windows-style, Mac-style
        // or Unix-style newline, and that not every
        // file has a newline after the final %%EOF

        $data = fread ($f, 50);

        if (!preg_match ('/startxref(?:\r\n|\r|\n)(\d+)(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?$/', $data, $matches)) {
            die ("Unable to find pointer to xref table");
        }

        // If we get here, then we have the offset
        // where the most recently introduced xref
        // table is.

        return (int) $matches[1];
    }

    ?>

The only aspect of pdf_read_xref() that we have not explored is the segment of code near the end of the function that is executed when the line following the entries is the trailer keyword. This is where our code reads the trailer dictionary; as you can see, it makes use of a few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information is retrieved aside for a moment, you'll notice that the trailer dictionary ends up in an associative array. If you remember from last month's article, files that have been modified usually contain more than one cross-reference table; this is indicated by the presence of a /Prev key/value pair in the trailer, which points to the beginning of the previous table. If this entry is present, the function simply recurses into itself until all the cross-reference tables present in the file have been read. Note that the information in the older tables and trailers is not allowed to overwrite the data contained in the newer ones: for the tables, by the simple stratagem of checking that an entry is not already set before storing it and, for the trailers, by merging the trailer arrays in a particular order.

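The two precedence rules for older and newer data can be seen in isolation in the sketch below. All of the values are made up for illustration, and the table is simplified to a plain object-to-offset map (the listing also keys entries by generation number).

```php
<?php

// Hypothetical contents of two trailers; the "newer" one
// belongs to the most recent update of the file
$newer_trailer = array ('/Size' => 25, '/Prev' => 53593);
$older_trailer = array ('/Size' => 22, '/Root' => '1 0 R');

// Later arguments win in array_merge(), so the newer
// trailer goes last
$trailer = array_merge ($older_trailer, $newer_trailer);

// Tables are read newest first, so an offset is stored
// only if the object has not been seen yet
$tables = array (
    array (5 => 60112, 22 => 60500),   // newer table
    array (5 => 53169, 6 => 6509),     // older table
);

$xref = array ();

foreach ($tables as $table) {
    foreach ($table as $object => $offset) {
        if (!isset ($xref[$object])) {
            $xref[$object] = $offset;
        }
    }
}
```

After this runs, /Size is 25 (from the newer trailer) while /Root survives from the older one, and object 5 points to 60112 rather than to its superseded location at 53169.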
Writing a PDF Lexer

Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it's time to try and read them. We could, in theory, write a series of ad-hoc functions that try to read from the file and interpret its contents, but things are much easier if we, instead, make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).

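To give a rough idea of what a lexer does, here is a deliberately incomplete sketch that breaks a fragment of PDF syntax into tokens with a single regular expression. It is not the lexer developed in this article: pdf_tokenize() is a hypothetical name, and strings, comments and streams are not handled at all.

```php
<?php

/*
 * Splits a fragment of PDF syntax into tokens. Hypothetical
 * and deliberately incomplete: strings, comments and streams
 * are not handled.
 */
function pdf_tokenize ($data)
{
    // In order: dictionary delimiters, array delimiters, names,
    // then any other run of regular characters (numbers and
    // keywords such as R)
    preg_match_all ('#<<|>>|[\[\]]|/[^\s<>\[\]/()]*|[^\s<>\[\]/()]+#', $data, $matches);

    return $matches[0];
}

$tokens = pdf_tokenize ('<< /Size 22 /Root 1 0 R /Prev 53593 >>');
```

The trailer-like input above comes back as ten tokens: the opening <<, the names /Size, /Root and /Prev, the values that go with them, and the closing >>. Code that interprets the file can then work on this list instead of on raw characters.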
Listing 2 (readxref.php)

    <?php

    /*
     * Reads a cross-reference table
     *
     * If $offset is provided and $start and $end are
     * set to null, the function seeks to $offset and
     * reads the table header it finds there; otherwise,
     * it reads entries from the current position in the
     * file. If more than one part of the xref table is
     * present, the function will recurse into itself as
     * many times as needed.
     */

    function pdf_read_xref ($f, &$result, $offset, $start = null, $end = null)
    {
        // Make sure that PHP keeps track of the line
        // endings properly: save the current setting,
        // then turn the directive on

        $old_ini = ini_get ('auto_detect_line_endings');
        ini_set ('auto_detect_line_endings', '1');

        // If we didn't get a start and end, we need
        // to get them from the document itself.

        if (is_null ($start) || is_null ($end)) {

            // Move to the start of the table

            fseek ($f, $offset);

            // Get a line of text from the file

            $data = trim (fgets ($f));

            // Make sure the xref marker is where we
            // expect it.

            if ($data !== 'xref') {
                die ("Unable to find xref table");
            }

            // Now get the next line and split
            // it across a single space character

            $data = explode (' ', trim (fgets ($f)));

            // Make sure the format is what we expected

            if (count ($data) != 2) {
                die ("Unexpected header in xref table");
            }

            // Calculate the start and end object
            // in the xref table

            $start = (int) $data[0];
            $end = $start + (int) $data[1];
        }

        if (!isset ($result['xref_location'])) {
            $result['xref_location'] = $offset;
        }

        if (!isset ($result['max_object']) || $end > $result['max_object']) {
            $result['max_object'] = $end;
        }

        // Now cycle through each object
        // pointer

        for (; $start < $end; $start++) {

            // Get a line of text from the
            // file and extract the proper
            // information out of there

            $data = trim (fgets ($f));

            $offset = substr ($data, 0, 10);
            $generation = substr ($data, 11, 5);

            // Entries read earlier come from newer
            // tables, so never overwrite them

            if (!isset ($result['xref'][$start][(int) $generation])) {
                $result['xref'][$start][(int) $generation] = (int) $offset;
            }
        }

        // Get the next line, which could either be the beginning
        // of the trailer dictionary or the header of another
        // xref section

        $data = trim (fgets ($f));

        if ($data === 'trailer') {

            // Read trailer dictionary

            $c = new pdf_context ($f);
            $trailer = pdf_read_value ($c);

            // Check whether there is a /Prev
            // entry, which indicates that there
            // is another xref table from before

            if (isset ($trailer['/Prev'])) {
                pdf_read_xref ($f, $result, $trailer['/Prev']);

                // The newer trailer goes last, so that
                // its values take precedence

                $result['trailer'] = array_merge ($result['trailer'], $trailer);
            } else {
                $result['trailer'] = $trailer;
            }

        } else {

            // We have another xref segment
            // to read. Extract the start
            // and length, and recurse into
            // this function

            $data = explode (' ', $data);
            pdf_read_xref ($f, $result, null, (int) $data[0], (int) $data[0] + (int) $data[1]);

        }

        // Restore the INI directive to its original value

        ini_set ('auto_detect_line_endings', $old_ini);
    }

    ?>
