Welcome to part two of our little trip down PDF
lane. While last month we focused primarily
on understanding what the structure of a PDF
document is, this time we’ll look at the problem of
altering the contents of a PDF file from a more practical
perspective.
The main thing to understand, before we move on to
anything else, is that parsing a PDF file is a complex—
but by no means complicated—endeavour because the
file is not only not intended for human consumption,
but it also does not follow a top-down logic. In other
words, as we also discovered last month, when parsing
a PDF file one doesn’t start at the beginning and move
down to the end of the file. In fact, the exact opposite
is true.
Since we’ll often find ourselves jumping to various
(and completely arbitrary) positions in the document,
the first decision that we need to make is how
we’re going to access the data. While it is tempting to
just load the entire file in memory, that’s usually not
such a good idea; a PDF can have pretty much any size,
so by loading an entire document in memory we risk
tying up large chunks of RAM, thus limiting our server’s
ability to process a large number of requests.
Yet, seeking to arbitrary locations in a document is
not always easy, or even possible. Imagine, for exam-
ple, if you’re accessing a PDF document via HTTP. In
this case, you’d have to download the entire file before
you could actually find out about any of its characteris-
tics, since the offset of the cross-reference table appears
at the end of the file. Even in this case, I would recom-
mend storing the document in a local file and then
accessing the data through the filesystem.
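As a rough illustration of accessing the data through the filesystem, the sketch below reads a small chunk at an arbitrary offset instead of loading the whole file. The helper name sketch_read_chunk() and the temporary file standing in for a local PDF copy are assumptions made for this example; they are not part of the article's library.

```php
<?php

// Hypothetical helper: returns $length bytes starting at $offset,
// or false if the seek fails.
function sketch_read_chunk($f, $offset, $length)
{
    if (fseek($f, $offset) !== 0) {
        return false;
    }

    return fread($f, $length);
}

// A temporary file stands in for a local copy of a PDF document
$f = tmpfile();
fwrite($f, "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n");

// Only the requested bytes are ever held in memory
$header = sketch_read_chunk($f, 0, 8);
$object = sketch_read_chunk($f, 9, 7);
```

Only the requested bytes are read, no matter how large the underlying file is.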
The one notable exception to this rule is a special
class of PDF documents known as “linearized PDF files”.
A linearized PDF document contains a dictionary at the
beginning of the file that provides the necessary facili-
ties for determining the location of the first page in the
file without having to read through the cross-reference
table first. The structure of linearized PDF files is beyond
the scope of this article, but you can find out more
about it directly from the PDF specification document
published by Adobe.
Getting Started
The first thing we need to do in order to be able to
interpret the contents of a PDF document is to determine
where the cross-reference table and trailer dictionary
are. This is quite easy if you consider that the
format of the startxref pointer is fixed. For example, in
my document it looks like the following:
startxref
53593
%%EOF
May 2004
●
PHP Architect
●
www.phparch.com
34
FEATURE
In the Belly of the Beast
Interpreting and Manipulating PDF Files
by Marco Tabini

In last month's issue, we examined the structure and contents
of a PDF document in considerable detail. This
month, we'll actually write a PHP library capable of opening
one and modifying its contents.

REQUIREMENTS
PHP: 4.3.0+
OS: Any
Applications: A PDF Reader (for testing)
Code Directory: pdf
Thus, all we need to do is move to the end of the file,
back up a few bytes and then find this sequence of
data. As you can see from Listing 1 (findxref.php), this
is readily accomplished by using a simple regular
expression. Note how the regex pattern specification
ends with a dollar sign, indicating that the resulting
match must be anchored to the end of the data stream.
Even though we’re only taking fifty characters from the
end of the file, I have added the anchor to prevent the
regex engine from picking up a previous cross-reference
table pointer by mistake. If you’re wondering why
the cross-reference table pointer is not saved to the
document using a fixed format (say, for example, using
10 digits for the offset like the cross-reference entries
themselves), you’re not alone. This decision is a bit of a
mystery, but it’s something that we have to live with.
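To see the effect of the end anchor without opening a file, here is a small sketch that applies the same kind of pattern to an in-memory string. The function name sketch_find_startxref() and the sample data are made up for the example, and the final newline after %%EOF is treated as optional, which is a small liberty compared with Listing 1.

```php
<?php

// Hypothetical helper: extracts the startxref offset from the tail
// of a document held in a string, or returns false if none is found.
function sketch_find_startxref($tail)
{
    $pattern = '/startxref(?:\r\n|\r|\n)(\d+)(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?$/';

    if (!preg_match($pattern, $tail, $matches)) {
        return false;
    }

    return (int) $matches[1];
}

// The tail of an updated file: an older pointer followed by the newest one
$tail = "startxref\n116\n%%EOF\n"
      . "...\n"
      . "startxref\r\n53593\r\n%%EOF\r\n";

$offset = sketch_find_startxref($tail);
```

Because the match is anchored to the end of the string, the older pointer is skipped and only the most recent one is returned.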
By the way, throughout the remainder of the article
you’ll notice that I have created an individual include
file for each of the functions that we will be writing.
This is clearly not a good design practice, but it fulfills
one important purpose: it keeps the listings in the article
short and to the point. Thus, in the interest of clarity,
I hope that you’ll forgive me and that, if you decide
to use any of the code in your own projects, you will
not follow the same layout.
Reading the Cross-reference Table
Now that we know where to look for it, it’s time to figure
out how to read the cross-reference table itself. If
we move to offset 53,593 of the file, we’ll find the
following:
xref
0 22
0000000000 65535 f
0000000017 00000 n
0000005632 00000 n
0000005659 00000 n
0000006483 00000 n
0000053169 00000 n
0000006509 00000 n
0000039936 00000 n
The word xref is followed by the first object represented
in the table (0 in this case) and the number of
entries that follow (twenty-two); we’ll call this the
“header” of the table. Next come the entries themselves:
for each line, we have the offset at which the
object can be found (10 characters), followed by the
generation number and the letter n for objects that are
in use or f for objects that are free.
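Since every field sits at a fixed position, a single entry can be picked apart with substr() alone. The following is only a sketch; the function name and the shape of the returned array are assumptions made for illustration.

```php
<?php

// Hypothetical helper: splits one cross-reference entry into its parts.
// Layout: 10-digit offset, space, 5-digit generation, space, 'n' or 'f'.
function sketch_parse_xref_entry($line)
{
    return array(
        'offset'     => (int) substr($line, 0, 10),
        'generation' => (int) substr($line, 11, 5),
        'in_use'     => substr($line, 17, 1) === 'n',
    );
}

$used = sketch_parse_xref_entry('0000005632 00000 n');
$free = sketch_parse_xref_entry('0000000000 65535 f');
```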
There are a few important things to notice here. First
of all, each set of data is conveniently laid out in a line
of text, so that we can use the fgets() function to
retrieve it. However, you should keep in mind that the
line endings used in the cross-reference table (and
elsewhere in the file) do not necessarily match the
convention of the platform your script is running on: a
line can end with a carriage return, a line feed or
both. Therefore, you must instruct the PHP interpreter
to recognize all of them, regardless of the platform.
This can be accomplished by turning on the
auto_detect_line_endings INI directive
(which became available as of PHP 4.3.0). We can do
this directly from the code by first reading the current
value, turning the directive on for the duration of our
file operations and then restoring it back to its original
value. This sequence of operations is important,
because it is possible that other portions of our script
may depend on the directive being in a different state
than the one we need it in.
Another gotcha when reading the cross-reference
table is that there may be more than one block of
entries. That is, once you’ve read out all the entries,
you could find another header followed by a new set of
entries, or you could find the trailer dictionary. If we
didn’t check for this possibility and simply assumed that
the cross-reference table is always followed by a trailer,
our code would be unable to read most documents
that have been modified after their creation, since
that’s the situation in which partial cross-reference
tables are most likely to be found.
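The loop below sketches this check on lines that have already been read into an array: after each block of entries, the next line is either the trailer keyword or the header of another block. The function name and the sample data are assumptions, and the code is a simplified stand-in for the file-based approach used in the listings.

```php
<?php

// Hypothetical helper: walks the lines of a cross-reference section
// and returns the offsets as $entries[object][generation].
function sketch_read_xref_lines($lines)
{
    $entries = array();
    $total = count($lines);
    $i = 0;

    if ($total == 0 || trim($lines[$i++]) !== 'xref') {
        return false;
    }

    while ($i < $total) {
        $line = trim($lines[$i++]);

        // The trailer keyword marks the end of the table
        if ($line === 'trailer') {
            break;
        }

        // Anything else must be the header of another block of entries
        $header = explode(' ', $line);
        if (count($header) != 2) {
            return false;
        }

        $start = (int) $header[0];
        $end = $start + (int) $header[1];

        for ($object = $start; $object < $end; $object++) {
            if ($i >= $total) {
                return false;
            }
            $data = trim($lines[$i++]);
            $generation = (int) substr($data, 11, 5);
            $entries[$object][$generation] = (int) substr($data, 0, 10);
        }
    }

    return $entries;
}

$entries = sketch_read_xref_lines(array(
    'xref',
    '0 1',
    '0000000000 65535 f',
    '3 2',
    '0000025325 00000 n',
    '0000025518 00002 n',
    'trailer',
));
```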
As you can see in Listing 2 (readxref.php), the
pdf_read_xref() function is a bit long, but otherwise
quite simple. It is written to exploit the fact that the
cross-reference table is formatted using a very stylized
layout, so that we can rely on the fastest and most
convenient string functions provided by PHP.
Listing 1
<?php

/*
 * Returns the offset of the most recent
 * cross-reference table in the file
 */

function pdf_find_xref ($f)
{
    // First, seek to the end of the file,
    // allowing for 50 bytes just so that
    // we have enough data to look into.

    fseek ($f, -50, SEEK_END);

    // Next, try to find the proper sequence
    // of data. Note that the information can be
    // separated by a Windows-style, Mac-style
    // or Unix-style newline

    $data = fread ($f, 50);

    if (!preg_match ('/startxref(?:\r|\n|\r\n)(\d+)(?:\r|\n|\r\n)%%EOF(?:\r|\n|\r\n)$/', $data, $matches)) {
        die ("Unable to find pointer to xref table");
    }

    // If we get here, then we have the offset
    // where the most recently introduced xref
    // table is.

    return (int) $matches[1];
}

?>

The only aspect of pdf_read_xref() that we have not
explored is the segment of code that starts at line
86 and ends at line 108 of Listing 2. This is where our
code reads the trailer dictionary; as you can see, it makes
use of a few elements that I have not yet introduced (the
pdf_context class and the pdf_read_value() function).
However, if you leave the mechanics of how the information
is retrieved aside for a moment, you’ll notice
that the trailer dictionary ends up in an associative
array. If you remember from last month’s article, files
that have been modified usually contain more than one
cross-reference table; this is indicated by the presence
of a /Prev key/value pair in the trailer, with a pointer to
the beginning of the previous table. If this entry is
present, the function simply recurses into itself until all
the cross-reference tables present in the file have been
read. Note that any information in the older tables and
trailers is not allowed to overwrite the data contained in
the newer ones: for the tables, the function simply checks
that an entry is not already set before storing it; for
the trailers, it merges the arrays in a particular order.
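The two precedence rules described for pdf_read_xref() can be tried out in isolation. In this sketch the helper name and the sample offsets are invented, and the trailer entries are plain values rather than the typed arrays used by the article's parser; the newest table is assumed to be read first, as in Listing 2.

```php
<?php

// Hypothetical helper: stores an entry only if a newer table
// has not already provided one for the same object/generation.
function sketch_add_entry(&$xref, $object, $generation, $offset)
{
    if (!isset($xref[$object][$generation])) {
        $xref[$object][$generation] = $offset;
    }
}

$xref = array();

// Entry from the newest table, which is read first
sketch_add_entry($xref, 5, 0, 60120);

// Entries from an older table, read afterwards
sketch_add_entry($xref, 5, 0, 53169);
sketch_add_entry($xref, 6, 0, 6509);

// With string keys, array_merge() lets the later argument win,
// so the newer trailer is passed last.
$older_trailer = array('/Size' => 22, '/Root' => '12 0 R');
$newer_trailer = array('/Size' => 25, '/Root' => '12 0 R', '/Prev' => 53593);

$trailer = array_merge($older_trailer, $newer_trailer);
```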
Writing a PDF Lexer
Now that we know where the objects are (the
cross-reference table gives us the location of every
object in the file), it’s time to try and read them. We
could, in theory, write a series of ad-hoc functions that
try to read from the file and interpret its contents, but
things are much easier if we, instead, make use of that
wonderful computer science concept known as the lexer
(also known as a tokenizer).
Listing 2
1 <?php
2
3 /*
4  * Reads a cross-reference table
5  *
6  * if $offset is provided and $start and $end are
7  * set to Null, the function will start reading the
8  * xref table from the current position in the file.
9  * If more than one parts of xref table are present,
10  * the function will recurse onto itself as many times
11  * as needed.
12  */
13
14 function pdf_read_xref ($f, &$result, $offset, $start = null, $end = null)
15 {
16     // If we didn't get a start and end, we need
17     // to get them from the document itself.
18
19     if (is_null ($start) || is_null ($end)) {
20
21         // Move to the start of the table
22
23         fseek ($f, $offset);
24
25         // Make sure that PHP keeps track of
26         // the line endings properly
27
28         $old_ini = ini_get ('auto_detect_line_endings');
29
30         // Get a line of text from the file
31
32         $data = trim (fgets ($f));
33
34         // Make sure the xref marker is where we
35         // expect it.
36
37         if ($data !== 'xref') {
38             die ("Unable to find xref table");
39         }
40
41         // Now get the next line and split
42         // it across a single space character
43
44         $data = explode (' ', trim (fgets ($f)));
45
46         // Make sure the format is what we expected
47
48         if (count ($data) != 2) {
49             die ("Unexpected header in xref table");
50         }
51
52         // Calculate the start and end object
53         // in the xref table
54
55         $start = $data[0];
56         $end = $start + $data[1];
57     }
58
59     if (!isset ($result['xref_location'])) {
60         $result['xref_location'] = $offset;
61     }
62
63     if (!isset ($result['max_object']) || $end > $result['max_object']) {
64         $result['max_object'] = $end;
65     }
66
67     // Now cycle through each object
68     // pointer
69
70     for (; $start < $end; $start++) {
71
72         // Get a line of text from the
73         // file and extract the proper
74         // information out of there
75
76         $data = trim (fgets ($f));
77
78         $offset = substr ($data, 0, 10);
79         $generation = substr ($data, 11, 5);
80
81         if (!isset ($result['xref'][$start][(int) $generation])) {
82             $result['xref'][$start][(int) $generation] = (int) $offset;
83         }
84     }
85
86     // Get the next line, which could either be the beginning
87     // of the trailer dictionary or the header of another
88     // xref section
89
90     $data = trim (fgets ($f));
91
92     if ($data === 'trailer') {
93
94         // Read trailer dictionary
95
96         $c = new pdf_context ($f);
97         $trailer = pdf_read_value ($c);
98
99         // Check whether there is a /Prev
100         // entry, which indicates that there
101         // is another xref table from before
102
103         if (isset ($trailer['/Prev'])) {
104             pdf_read_xref ($f, $result, $trailer['/Prev']);
105             $result['trailer'] = array_merge ($result['trailer'], $trailer);
106         } else {
107             $result['trailer'] = $trailer;
108         }
109
110     } else {
111
112         // We have another xref segment
113         // to read. Extract the start
114         // and length, and recurse into
115         // this function
116
117         $data = explode (' ', $data);
118         pdf_read_xref ($f, $result, null, $data[0], $data[0] + $data[1]);
119
120     }
121 }
122
123 ?>
Our lexer will take the input from the PDF file and
split it into individual tokens according to a particular
set of rules. For example, if we were writing a lexer for
reducing the contents of this article to a series of words
(with every grammatical element representing a
token), we would establish that a token is either a set of
characters or a punctuation mark, assuming that
whitespace and paragraph markers are of no importance
to us.
Identifying tokens in a PDF file is quite simple in theory,
although in practical terms you have to watch out
for a few potential pitfalls. First the basics: the simplest
form of delimiter is whitespace, which has no
semantic value (meaning that it is used only for the
purpose of delimiting tokens). For our purposes,
whitespace is composed of space characters, carriage
returns and line feeds.
This would be enough to cover most situations, but
in some cases you’ll find that tokens are not always
delimited using whitespaces. When some applications
(including some of Adobe’s own) “optimize” a PDF file
to reduce its size as much as possible, they remove
whitespace characters where the distinction between
two tokens is made obvious in another way. For exam-
ple, consider the following snippet of PDF code that
shows the beginning of a dictionary:
<< /Entry (Value) >>
The whitespace between << and /Entry is made unnecessary
by the fact that the two tokens are made up of
two completely different classes of characters. Since <<
could only appear outside of a literal string to indicate
the beginning of a dictionary, the lexer should stop at
the second open angular bracket and delimit a token
before the next character, whatever that is. Therefore,
the snippet above could be rewritten as follows:
<</Entry (Value)>>
Clearly, whitespace isn’t enough to delimit a token; we
must also keep in mind all the other possible character
classes that can be used for the same purpose. Listing 3
(tokenizer.php) shows our lexer, the pdf_read_token()
function, which looks a lot more complicated than it
really is.
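To make the idea of character classes concrete, here is a stripped-down tokenizer that works on a string held entirely in memory. It is a sketch of the technique rather than the article's pdf_read_token(): the function name is invented, there is no file buffering, and a tab is added to the whitespace characters as an extra precaution.

```php
<?php

// Hypothetical helper: splits a snippet of PDF code into tokens,
// using both whitespace and delimiter characters as boundaries.
function sketch_tokenize($data)
{
    $tokens = array();
    $offset = 0;
    $length = strlen($data);

    while (true) {
        // Skip any whitespace in front of the next token
        $offset += strspn($data, " \t\r\n", $offset);

        if ($offset >= $length) {
            break;
        }

        $char = $data[$offset];

        if (strpos('[]()', $char) !== false) {
            // Array and literal string delimiters stand on their own
            $tokens[] = $char;
            $offset++;
        } elseif ($char === '<' || $char === '>') {
            // One more character tells a dictionary from a hex string
            if ($offset + 1 < $length && $data[$offset + 1] === $char) {
                $tokens[] = $char . $char;
                $offset += 2;
            } else {
                $tokens[] = $char;
                $offset++;
            }
        } else {
            // Any other token runs until the next whitespace or delimiter
            $span = 1 + strcspn($data, " \t\r\n[]<>()/", $offset + 1);
            $tokens[] = substr($data, $offset, $span);
            $offset += $span;
        }
    }

    return $tokens;
}

$spaced  = sketch_tokenize('<< /Entry (Value) >>');
$compact = sketch_tokenize('<</Entry (Value)>>');
```

Both forms of the dictionary produce the same sequence of tokens, which is exactly what an "optimized" file relies on.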
This file also contains the pdf_context class that we
mentioned earlier, which the tokenizer also makes use
of. The pdf_context class is used to create a wrapper
around a file pointer that makes it possible to:
• Create a memory-based buffer for the file’s
contents.
• Keep track of the current pointer in the file
and of the length of the buffer
• Maintain a stack of tokens that have been
read from the file but not yet used
The necessity of creating a buffer here arises from the
fact that we don’t want our tokenizer to read one single
character at a time out of the file. By reading a fixed
amount at a time and then accessing the data directly
in memory, we can save ourselves a few expensive
function calls. The token stack is actually used by the
portion of the system that is responsible for interpreting
the meaning of the tokens; more about that later.
Note that there is no compelling reason to store this
information in a class, other than the convenience of
PHP’s object syntax. You could just as easily store
everything in an array and avoid OOP altogether,
although, in my opinion, that would significantly
complicate your code and make it easier to introduce
bugs that would be tough to find and fix.
Going back to the pdf_read_token() function for a
moment, you can see that it works in a very simple
way: first, it removes any whitespace that is at the current
offset in the file buffer. Next, it tries to determine
the type of token that it is dealing with by looking at
the first character. The procedure used to then find the
end of the token varies depending on the character
class it belongs to: for array and literal string delimiters,
a single character is all we need, whereas for hex string
and dictionary delimiters we need to check one more
character, since they both share the same initial open
angular bracket. For all the other types of tokens, we
simply scan the file until we end up in a different character
class.
Parsing the Data
Next in the list, we need to be able to understand the
meaning of each token in the context of the PDF file—
and this is the job of another great computer science
construct: the parser.
Parsers can be very complicated, and are usually not
coded by hand—in most cases, a developer would use
a “parser generator” like YACC or Bison. These reduce
the parser to a relatively complex finite-state machine
that is flexible enough to accommodate certain types of
languages. In our case, however, the parsing of a PDF
file is simple enough that the entire process can be
coded in just about 150 lines’ worth of PHP.
Before introducing another listing, however, let’s consider
the types of data that we need to deal with. For
the most part, they are simple to handle: for direct values,
for example, we read as many tokens as we need
from the file and store them in the appropriate data
structures. In two cases, however, we need to make a
distinction: strings and indirect objects.
Listing 3
<?php

/*
 * This class is used to
 * read data from the input
 * file in a buffered way
 * and to store unused tokens
 */

class pdf_context
{
    var $file;
    var $buffer;
    var $offset;
    var $length;

    var $stack;

    // Constructor

    function pdf_context ($f)
    {
        $this->file = $f;
        $this->reset();
    }

    // Optionally move the file
    // pointer to a new location
    // and reset the buffered data

    function reset($pos = null)
    {
        if (!is_null ($pos)) {
            fseek ($this->file, $pos);
        }

        $this->buffer = fread ($this->file, 100);
        $this->offset = 0;
        $this->length = strlen ($this->buffer);
        $this->stack = array();
    }

    // Make sure that there is at least one
    // character beyond the current offset in
    // the buffer to prevent the tokenizer
    // from attempting to access data that does
    // not exist

    function ensure_content()
    {
        if ($this->offset >= $this->length - 1) {
            return $this->increase_length();
        } else {
            return true;
        }
    }

    // Forcefully read more data into the buffer

    function increase_length()
    {
        if (feof ($this->file)) {
            return false;
        } else {
            $this->buffer .= fread ($this->file, 100);
            $this->length = strlen ($this->buffer);
            return true;
        }
    }
}

/*
 * Reads a token from the file
 */

function pdf_read_token (&$c)
{
    // If there is a token available
    // on the stack, pop it out and
    // return it.

    if (count ($c->stack)) {
        return array_pop($c->stack);
    }

    // Strip away any whitespace

    do {
        if (!$c->ensure_content()) {
            return false;
        }
        $c->offset += strspn ($c->buffer, " \n\r", $c->offset);
    } while ($c->offset >= $c->length - 1);

    // Get the first character in the stream

    $char = $c->buffer[$c->offset++];

    switch ($char) {

        case '[' :
        case ']' :
        case '(' :
        case ')' :

            // This is either an array or literal string
            // delimiter. Return it

            return $char;

        case '<' :
        case '>' :

            // This could either be a hex string or
            // dictionary delimiter. Determine the
            // appropriate case and return the token

            if ($c->buffer[$c->offset] == $char) {
                if (!$c->ensure_content()) {
                    return false;
                }
                $c->offset++;
                return $char . $char;
            } else {
                return $char;
            }

        default :

            // This is "another" type of token (probably
            // a dictionary entry or a numeric value)
            // Find the end and return it.

            if (!$c->ensure_content()) {
                return false;
            }

            while(1) {

                // Determine the length of the token

                $pos = strcspn ($c->buffer, " []<>()\r\n/", $c->offset);

                if ($c->offset + $pos < $c->length - 1) {
                    break;
                } else {
                    // If the script reaches this point,
                    // the token may span beyond the end
                    // of the current buffer. Therefore,
                    // we increase the size of the buffer
                    // and try again, just to be safe. If
                    // there is no more data to read, the
                    // token ends with the buffer, so we
                    // stop here instead of looping forever.

                    if (!$c->increase_length()) {
                        break;
                    }
                }
            }

            $result = substr ($c->buffer, $c->offset - 1, $pos + 1);

            $c->offset += $pos;
            return $result;
    }
}

?>

The problem with strings (and, particularly, with literal
strings) is that they change the rules that our lexer
has to follow in order to find the end of the token,
because a closed parenthesis could be escaped by a
backslash and, therefore, its presence alone does not
indicate the end of the string. In a “traditional” lexer,
this problem is taken care of by switching the machine
to a new context in which a different set of rules applies.
We could, in fact, do the very same thing to our lexer
by creating a special case in the switch statement that
is part of pdf_read_token() in Listing 3 and writing
some additional code that looks for a parenthesis
preceded by an even number of backslashes (including
none at all). Why an even number? Because the
backslashes themselves can be escaped by prefixing them
with another backslash. Therefore, an even number of
backslashes means that they are all escaped and should
be interpreted as literal characters, so that the last one
does not escape the parenthesis, which becomes the
string delimiter. The last in an odd number of backslashes
right before a parenthesis becomes an “orphan” and
escapes the parenthesis, thus preventing it from
terminating the string.
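The even/odd rule is easy to express in code. The sketch below counts the backslashes that immediately precede each closed parenthesis and stops at the first one that is not escaped; the function names are invented, the data is assumed to be entirely in memory, and nested unescaped parentheses (which PDF also permits when they are balanced) are deliberately ignored.

```php
<?php

// Hypothetical helper: counts the backslashes immediately before $pos
function sketch_count_backslashes($data, $pos)
{
    $count = 0;

    for ($i = $pos - 1; $i >= 0 && $data[$i] === '\\'; $i--) {
        $count++;
    }

    return $count;
}

// Hypothetical helper: returns the position of the first closed
// parenthesis that is not escaped, or false if there is none.
function sketch_find_string_end($data, $start = 0)
{
    $pos = $start;

    while (($pos = strpos($data, ')', $pos)) !== false) {
        // An even number of backslashes (zero included) means that
        // the parenthesis itself is not escaped.
        if (sketch_count_backslashes($data, $pos) % 2 === 0) {
            return $pos;
        }
        $pos++;
    }

    return false;
}

$plain   = sketch_find_string_end('Value)');
$escaped = sketch_find_string_end('a\\)b)');    // the data is: a\)b)
$doubled = sketch_find_string_end('a\\\\)b');   // the data is: a\\)b
```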
Given that we only have a limited amount of space
and I really wanted to keep things as simple as possible,
however, I chose to implement the string parsing func-
tionality inside the parser itself. When an open paren-
thesis token is returned by the tokenizer, the code sim-
ply keeps scanning the input file until it finds an
unescaped closed parenthesis.
The other problematic data elements are, as I mentioned
above, indirect objects. Both object declarations
and references are made up of three tokens. Therefore,
once our parser encounters a numeric value, it won’t be
able to tell whether it is part of a larger element until it
has read at least one more token, and potentially two.
The problem here is not with reading the tokens; it’s
with what to do with them if, by any chance, the
numeric value turns out to be… just a numeric value.
We could, in theory, put the extra tokens “back in the
buffer” by rolling back the offset pointer in the buffer
to the beginning of the second token, but that would
be difficult to do, since we don’t really know how many
whitespace characters were between the tokens to start
with.
Therefore, we use a completely different approach:
unused tokens are stored in a stack, which is part of the
file context. When a new token is requested,
pdf_read_token() checks whether anything is present
in the stack and, if something is in there, it pops it out
and returns it, without even reading one character from
the file buffer.
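The push-back idea can be sketched independently of the file handling. In the example below, a small class hands out tokens from a prepared list and accepts tokens that the caller wants to return; the class, its methods and the reader function are all hypothetical names, and the reader assumes that the first token it sees is numeric.

```php
<?php

// Hypothetical token source with a push-back stack
class sketch_token_source
{
    public $pending;
    public $stack = array();

    public function __construct($tokens)
    {
        $this->pending = $tokens;
    }

    public function read()
    {
        // Tokens that were pushed back take precedence
        if (count($this->stack)) {
            return array_pop($this->stack);
        }
        if (count($this->pending)) {
            return array_shift($this->pending);
        }
        return false;
    }

    public function unread($token)
    {
        $this->stack[] = $token;
    }
}

// Reads either a plain number or an object reference ("12 0 R")
function sketch_read_number_or_reference($source)
{
    $first = $source->read();
    $second = $source->read();

    if ($second === false) {
        return array('number', (int) $first);
    }

    if (!is_numeric($second)) {
        $source->unread($second);
        return array('number', (int) $first);
    }

    $third = $source->read();

    if ($third === 'R') {
        return array('reference', (int) $first, (int) $second);
    }

    // Push the tokens back in reverse order, so that they
    // are popped in the order in which they were read.
    if ($third !== false) {
        $source->unread($third);
    }
    $source->unread($second);

    return array('number', (int) $first);
}

$source = new sketch_token_source(array('12', '0', 'R', '7', '/Type'));

$reference = sketch_read_number_or_reference($source);
$number    = sketch_read_number_or_reference($source);
$next      = $source->read();
```

Pushing the tokens back in reverse order matters: a stack returns the most recently stored item first.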
You can see the end result of all our tribulations in
Listing 4 (readvalue.php), which contains the
pdf_read_value() function. You will also notice a number
of constant definitions that look suspiciously like
data types, and they are. Since we’ll be reading and
writing data back and forth, we’ll need to keep track of
the object types as we read them from the stream. To
do so, each object is encapsulated in an array whose
zeroth element indicates the type, while element 1
contains the actual value, which varies depending on the
nature of the data. Thus, for example, the trailer
dictionary could look like this:
array (
    PDF_TYPE_DICTIONARY,
    array (
        '/Size' => array (
            PDF_TYPE_NUMERIC,
            22
        ),
        '/Root' => array (
            PDF_TYPE_OBJREF,
            12,
            0
        ),
        '/Prev' => array (
            PDF_TYPE_NUMERIC,
            54655
        )
    )
);
Not unlike some of its predecessors, pdf_read_value()
looks a lot scarier than it actually is. The code is quite
heavily commented, so I will limit myself to noting that
each value is actually stored in an array whose zeroth
element contains its type. This makes identifying the
data type of a value practically immediate, which will
turn out to be very important later on when we’ll need
to write objects back to the file.
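As a quick illustration of why the type tag is handy, the sketch below switches on the zeroth element to decide how to treat a value. The constant values match those in Listing 4, but the sketch_describe() function and its output strings are invented for the example.

```php
<?php

// Same values as the corresponding constants in Listing 4
define('PDF_TYPE_NUMERIC', 1);
define('PDF_TYPE_DICTIONARY', 5);
define('PDF_TYPE_OBJREF', 8);

// Hypothetical helper: describes a value based on its type tag
function sketch_describe($value)
{
    switch ($value[0]) {
        case PDF_TYPE_NUMERIC:
            return 'number ' . $value[1];
        case PDF_TYPE_OBJREF:
            return 'reference to object ' . $value[1] . ', generation ' . $value[2];
        case PDF_TYPE_DICTIONARY:
            return 'dictionary with ' . count($value[1]) . ' entries';
        default:
            return 'unknown';
    }
}

$trailer = array(PDF_TYPE_DICTIONARY, array(
    '/Size' => array(PDF_TYPE_NUMERIC, 22),
    '/Root' => array(PDF_TYPE_OBJREF, 12, 0),
));

$summary = sketch_describe($trailer);
$root    = sketch_describe($trailer[1]['/Root']);
$size    = sketch_describe($trailer[1]['/Size']);
```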
Before moving on to the next step, note that we
make no provision in our lexer for reading stream data.
This is because we are not intent on interpreting every-
thing that is stored in a PDF file—but only those ele-
ments that allow us to modify its contents. However,
adding support for streams shouldn’t be too much of a
problem—all you need is the ability to resolve object
references, which we’ll add shortly, since the length of
a stream is often expressed in that way.
Getting to the Root of the Problem
All the pieces are finally in place—we should now be
able to read through the PDF file and interpret its con-
tents, at least to the extent that we need in order to be
able to append data to it. In order to demonstrate how
the PDF functionality that we have built works, our goal
is to open a PDF file and add a textual element to its
first page.
Listing 5 (index.php) is our main script and, unfortunately,
it’s too large to show here; you will, however,
find it in the code associated with this article, so you
will hopefully be able to follow me there.
Once we have declared a few variables that we we’ll
end up using throughout the script, we read the cross-
reference table from the file, then immediately attempt
to retrieve the Root object from it. Because the
//RRoooott
entry inside the file trailer has to be an indirect object
reference, we must find a way to retrieve the actual
May 2004
●
PHP Architect
●
www.phparch.com
39
FFEEAATTUURREE
In the Belly of the Beast
May 2004
●
PHP Architect
●
www.phparch.com
40
FFEEAATTUURREE
In the Belly of the Beast
Continued on page 41...
Listing 4
1 <?php
2
3 // Define various data types
4 // that we use throughout the system
5
6 define (‘PDF_TYPE_NULL’, 0);
7 define (‘PDF_TYPE_NUMERIC’, 1);
8 define (‘PDF_TYPE_TOKEN’, 2);
9 define (‘PDF_TYPE_HEX’, 3);
10 define (‘PDF_TYPE_STRING’, 4);
11 define (‘PDF_TYPE_DICTIONARY’, 5);
12 define (‘PDF_TYPE_ARRAY’, 6);
13 define (‘PDF_TYPE_OBJDEC’, 7);
14 define (‘PDF_TYPE_OBJREF’, 8);
15 define (‘PDF_TYPE_OBJECT’, 9);
16 define (‘PDF_TYPE_STREAM’, 10);
17
18 /*
19 * Reads a value from the current
20 * data stream
21 */
22
23 function pdf_read_value (&$c, $token = null)
24 {
25 // Get a token from the stream.
26
27 if (is_null ($token)) {
28 $token = pdf_read_token ($c);
29 }
30
31 if ($token === false) {
32 return false;
33 }
34
35 switch ($token) {
36
37 case ‘<’ :
38
39 // This is a hex string.
40 // Read the value, then the terminator
41
42 $s = pdf_read_token ($c);
43
44 if ($s === false) {
45 return false;
46 }
47
48 $term = pdf_read_token ($c);
49
50 if ($term !== ‘>’) {
51 die (“Unexpected data after hex string”);
52 }
53
            return array (PDF_TYPE_HEX, $s);

            break;

        case '<<' :

            // This is a dictionary.

            $result = array();

            // Recurse into this function until we reach
            // the end of the dictionary.

            while (($key = pdf_read_token ($c)) !== '>>') {
                if ($key === false) {
                    return false;
                }

                if (($value = pdf_read_value ($c)) === false) {
                    return false;
                }

                $result[$key] = $value;
            }

            return array (PDF_TYPE_DICTIONARY, $result);

        case '[' :

            // This is an array.

            $result = array();

            // Recurse into this function until we reach
            // the end of the array.

            while (($token = pdf_read_token ($c)) !== ']') {
                if ($token === false) {
                    return false;
                }

                if (($value = pdf_read_value ($c, $token)) === false) {
                    return false;
                }

                $result[] = $value;
            }

            return array (PDF_TYPE_ARRAY, $result);

        case '(' :

            // This is a string

            $pos = $c->offset;

            while (1) {

                // Start by finding the next closed
                // parenthesis

                $next = strpos ($c->buffer, ')', $pos);

                // If you can't find it, try reading more data
                // from the stream and search again. Note that
                // strpos() returns false, not -1, on failure.

                if ($next === false) {
                    if (!$c->increase_length()) {
                        return false;
                    }

                    continue;
                }

                $pos = $next;

                // Make sure that there is no backslash before
                // the parenthesis. If there is, move on.
                // Otherwise, return the string.

                if ($c->buffer[$pos - 1] !== '\\') {
                    $result = substr ($c->buffer, $c->offset, $pos - $c->offset + 1);
                    $c->offset = $pos + 1;
                    return array (PDF_TYPE_STRING, $result);
                } else {
                    $pos++;

                    if ($pos > $c->offset + $c->length) {
                        $c->increase_length();
                    }
                }
            }

        default :

            if (is_numeric ($token)) {

                // A numeric token. Make sure that it is not
                // part of something else.

                if (($tok2 = pdf_read_token ($c)) !== false) {
                    if (is_numeric ($tok2)) {

                        // Two numeric tokens in a row. In this case,
                        // we're probably in front of either an object
                        // reference or an object specification.
                        // Determine the case and return the data

                        if (($tok3 = pdf_read_token ($c)) !== false) {
                            switch ($tok3) {

                                case 'obj' :

                                    return array (PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);

                                case 'R' :

                                    return array (PDF_TYPE_OBJREF, (int) $token, (int) $tok2);
                            }
object data, as the reference itself won’t help us much. This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6 as part of the objects.php include file. The function can actually be used to determine whether any object is an indirect reference and resolve it to the actual object data, something that will come in handy at pretty much every step of the way.
As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object reference. If it isn’t, the function has nothing to do other than return right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine its position and starts reading it. The $encapsulate parameter determines how the object is returned to the caller; if it is set to true, pdf_resolve_object() stores the object ID and generation number in the array, so that effectively the object’s data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object’s ID and generation number is lost. Both types of return values have their uses: if you want to retrieve an object with the intention of modifying it, you will probably want it encapsulated, so that you can later rewrite it back to the stream. If, on the other hand, you’re just trying to retrieve a value, as you would, for example, if you were reading a stream object and you wanted to determine its length, the non-encapsulated version will be easier to handle.
Speaking of retrieving streams, even though my code doesn’t perform that function (since I’m not writing a PDF reader), if you intend to add it, the pdf_resolve_object() function is safe to use because it saves the file pointer’s current position before reading the object and restores it afterwards. If the function didn’t do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the file, and you would be unable to read the rest of the stream.
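The save-and-restore pattern is easy to see in isolation. The sketch below is not part of the article's code: the peek_at() helper and the in-memory stream are assumptions made purely for illustration. It only shows how ftell() and fseek() keep a side lookup from disturbing the caller's position.

```php
<?php
// Hypothetical helper: read $length bytes at $offset without
// disturbing the caller's current position in the file.
function peek_at($fh, $offset, $length)
{
    $old_pos = ftell($fh);          // remember where the caller was

    fseek($fh, $offset);            // jump to the side lookup
    $data = fread($fh, $length);

    fseek($fh, $old_pos);           // put the pointer back

    return $data;
}

// An in-memory stream stands in for the PDF file.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "stream-data 42 endobj");
fseek($fh, 6);                          // pretend we are in the middle of a stream

$length_value = peek_at($fh, 12, 2);    // look up a value stored elsewhere
$rest = fread($fh, 5);                  // carry on reading from byte 6
```

Because the position is restored, the read that follows the lookup continues exactly where the caller left off.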
Let’s go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php).
The reason why we have a separate function just to read through the /Pages element of the root object is that, as I mentioned in last month’s article, the pages could be nested in an arbitrary combination of /Page and /Pages dictionaries, so that we may need to recurse into the function several times in order to end up with an array that contains only page elements. It is important to understand that the order in which the pages are resolved by using this method doesn’t necessarily correspond to the logical order in which they will appear to the user; that is, the first page in the list is not necessarily the first page of the document. The PDF specification provides a different set of facilities for determining the logical page order but, technically speaking, you should only be interested in that if you want to display the contents of a document. In practical terms, I have never found an occasion in which the logical and physical page order didn’t coincide. At most, there might be a fixed discrepancy because the document is an excerpt that starts from, say, page 25, but the order of the pages should usually be the same.
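The recursion is easier to follow on a simplified structure. The sketch below is an illustration only: it uses plain nested arrays with made-up 'Type', 'Kids' and 'Name' keys instead of the tagged values that the article's reader produces, and the flatten_pages() name is hypothetical.

```php
<?php
// Simplified stand-in for a page tree: every node is a plain array
// with a 'Type' and, for /Pages nodes, a list of 'Kids'.
function flatten_pages(array $node, array &$result)
{
    foreach ($node['Kids'] as $kid) {
        if ($kid['Type'] === '/Pages') {
            // Intermediate node: recurse into it
            flatten_pages($kid, $result);
        } else {
            // Leaf node: an actual page
            $result[] = $kid;
        }
    }
}

$tree = array(
    'Type' => '/Pages',
    'Kids' => array(
        array('Type' => '/Page', 'Name' => 'A'),
        array('Type' => '/Pages', 'Kids' => array(
            array('Type' => '/Page', 'Name' => 'B'),
            array('Type' => '/Page', 'Name' => 'C'),
        )),
        array('Type' => '/Page', 'Name' => 'D'),
    ),
);

$pages = array();
flatten_pages($tree, $pages);
```

After the call, $pages holds only leaf nodes, however deeply they were nested in the tree.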
In our sample script, we only take into consideration page 1 (which is the zeroth element of the resulting pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the resources associated with the page. Here, again, we need a dedicated function because, as you may remember, the resource dictionary is an inheritable
Listing 4 (continued from page 40)

                            // If we get to this point, that numeric value up
                            // there was just a numeric value. Push the extra
                            // tokens back into the stack and return the value.

                            array_push ($c->stack, $tok3);
                        }
                    }

                    array_push ($c->stack, $tok2);
                }

                return array (PDF_TYPE_NUMERIC, $token);
            } else {
                // Just a token. Return it.

                return array (PDF_TYPE_TOKEN, $token);
            }

    }
}

?>
resource, so that if there isn’t one associated with the page itself, there may be one associated with its parent, or with its parent’s parent, and so on. In the case of the sample file that we were looking at last month, the same resource object is actually associated explicitly with every page and with their parent (the /Pages dictionary). This is rather redundant, and makes for poor optimization, but it is perfectly acceptable (for the record, the PDF was created on Linux by exporting an OpenOffice.org 1.1 file).
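Inheritance of this kind boils down to walking up the chain of parents until the attribute is found. Here is a minimal sketch of that walk, using a hypothetical table of plain arrays keyed by made-up object names rather than the article's resolved objects:

```php
<?php
// Hypothetical, simplified object table: nodes are plain arrays
// and refer to their parent by key.
function find_inherited(array $objects, $id, $key)
{
    while ($id !== null) {
        $node = $objects[$id];

        if (isset($node[$key])) {
            return $node[$key];      // found on this node
        }

        // Not here: climb to the parent, if there is one
        $id = isset($node['Parent']) ? $node['Parent'] : null;
    }

    return false;                    // reached the root without a match
}

$objects = array(
    'root'  => array('Resources' => 'shared-resources'),
    'group' => array('Parent' => 'root'),
    'page1' => array('Parent' => 'group'),
    'page2' => array('Parent' => 'group', 'Resources' => 'own-resources'),
);
```

In this table, page1 picks up the resources of the root node two levels up, while page2 uses its own.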
Next, we need to find the font resources, so that we can append our own to the existing ones. Finding the font resource dictionary is actually a lot simpler than any of its predecessors so far, since it’s either there, in which case we piggyback on it, or it isn’t, in which case we create our own and add it to the resources associated with the page. The only difficulty here is in finding a name for the font resource that doesn’t conflict with one that already exists. The approach that I have taken is to simply run through all the resources available and look at those called /Fx, where x is a numerical value. The font resource we create is the next highest available; for example, if the highest font resource currently used is /F10, ours will be /F11. Note that this choice is entirely arbitrary: you can choose whatever combination you like, as long as it starts with a letter and not a digit.
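The naming scheme can be sketched in a few lines. The next_font_name() function below is a hypothetical illustration that works on a flat list of names; the article's own code operates on the resolved font dictionary instead.

```php
<?php
// Given the names already present in a font resource dictionary,
// pick the next free /Fx name.
function next_font_name(array $names)
{
    $highest = 0;

    foreach ($names as $name) {
        // Only consider names of the form /F<digits>
        if (preg_match('#^/F(\d+)$#', $name, $m)) {
            $highest = max($highest, (int) $m[1]);
        }
    }

    return '/F' . ($highest + 1);
}
```

Names that do not follow the /Fx pattern are simply skipped, so they cannot cause a conflict with the name that is returned.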
The font resource that we create and add to the font
dictionary is the simplest possible one: it uses the
Helvetica font, which must be supported by every PDF
reader and, therefore, doesn’t need to be embedded in
the document itself.
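For reference, a font dictionary of this kind is very short. The helper below is only a sketch (its name is made up, and the article's writer may lay the dictionary out differently); it returns the text of a minimal Type 1 font dictionary for a standard font.

```php
<?php
// Builds the text of a minimal Type 1 font dictionary for one of the
// standard fonts. The function name is an illustration only.
function simple_font_dictionary($base_font = 'Helvetica')
{
    return "<< /Type /Font /Subtype /Type1 /BaseFont /$base_font >>";
}
```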
Graffiti on the Wall
We’ve now come to the part where we actually need to
“write” some text on the document. Unfortunately, this
involves a few steps.
First, the concept of drawing pretty much anything
Listing 6

<?php

/*
 * Resolves an object reference,
 * ensuring that the result value
 * is always a direct object
 */

function pdf_resolve_object (&$c, $obj_spec, $encapsulate = true)
{
    global $xref_data;

    // Exit if we get invalid data

    if (!is_array ($obj_spec)) {
        return false;
    }

    if ($obj_spec[0] == PDF_TYPE_OBJREF) {

        // This is a reference, resolve it

        if (isset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]])) {

            // Save current file position
            // This is needed if you want to resolve
            // references while you're reading another object
            // (e.g.: if you need to determine the length
            // of a stream)

            $old_pos = ftell ($c->file);

            // Reposition the file pointer and
            // load the object header.

            $c->reset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]]);
            $header = pdf_read_value ($c);

            if ($header[0] != PDF_TYPE_OBJDEC || $header[1] != $obj_spec[1] || $header[2] != $obj_spec[2]) {
                die ("Unable to find object ({$obj_spec[1]}, {$obj_spec[2]}) at expected location");
            }

            // If we're being asked to store all the information
            // about the object, we add the object ID and generation
            // number for later use

            if ($encapsulate) {
                $result = array (
                    PDF_TYPE_OBJECT,
                    'obj' => $obj_spec[1],
                    'gen' => $obj_spec[2]
                );
            } else {
                $result = array();
            }

            // Now simply read the object data until
            // we encounter an end-of-object marker

            while (1) {
                $value = pdf_read_value ($c);

                if ($value === false) {
                    return false;
                }

                if ($value[0] == PDF_TYPE_TOKEN && $value[1] === 'endobj') {
                    break;
                }

                $result[] = $value;
            }

            $c->reset ($old_pos);

            return $result;
        }
    } else {
        return $obj_spec;
    }
}

/*
 * Generates a new object container
 * with the proper object ID and
 * a generation number of zero
 */

function pdf_new_object()
{
    global $xref_data;

    return array (
        PDF_TYPE_OBJECT,
        'obj' => $xref_data['max_object']++,
        'gen' => 0
    );
}

?>
on a page requires a series of commands that PDF borrows from PostScript. In order for the reader to recognize them, we’ll have to encapsulate them in a stream, and add that stream to the contents of the page.
When drawing text on the screen, a certain number of transformations can be applied to it: translation (so that you can move the text to the location of your choice), rotation and scaling. In our case, we will only deal with the first two.
The transformations are applied using a simple matrix; unfortunately, we do not have enough space here to go into detail about how the matrix works, but the PDF specification document does a pretty good job of that, so I’ll refer you to it. Instead, let us focus on the commands used to apply the transformation itself; here’s an example:
Ma Mb Mc Md x y Tm
Looks cryptic, doesn’t it? The first four elements of the matrix (Ma, Mb, Mc and Md) are used, in our case, to express the rotation that should be applied to the text. They can also be used to determine the scale, but, as I mentioned, that is beyond the scope of this article. The x and y parameters, on the other hand, indicate the coordinates at which we want the text to appear. Finally, Tm is the command itself, which tells the PDF interpreter to apply these values to the text transformation matrix.
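For a pure rotation by an angle θ, the PDF specification gives the first four values as cos θ, sin θ, −sin θ and cos θ. The sketch below builds a complete Tm command from an angle and a position; the function names and the number formatting are my own assumptions, not the article's code.

```php
<?php
// Formats a number for a content stream: at most four decimals,
// no trailing zeros, and no negative zero.
function pdf_num($v)
{
    $v = round($v, 4);

    if ($v == 0) {
        $v = 0.0;                   // turns -0.0 into plain 0
    }

    return rtrim(rtrim(sprintf('%.4F', $v), '0'), '.');
}

// Builds "Ma Mb Mc Md x y Tm" for a counterclockwise rotation of
// $degrees and a translation to ($x, $y).
function text_matrix($degrees, $x, $y)
{
    $r = deg2rad($degrees);
    $parts = array(cos($r), sin($r), -sin($r), cos($r), $x, $y);

    return implode(' ', array_map('pdf_num', $parts)) . ' Tm';
}
```

With no rotation this yields the identity matrix plus the translation, for example `1 0 0 1 100 200 Tm`.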
As you have probably noticed, the format of this function call is the exact opposite of what we are used to in PHP, where we write function (param1, param2, …): here the parameters come first and the command comes last. This format is called “Reverse Polish Notation” and is often
used in machines that use a stack to store their param-
Listing 7

<?php

// Creates a list of all the pages
// that are present in a document

function pdf_read_pages (&$c, &$pages, &$result)
{
    // Get the kids array

    $kids = pdf_resolve_object ($c, $pages[1][1]['/Kids']);

    foreach ($kids[1] as $v) {
        $pg = pdf_resolve_object ($c, $v);

        // Inspect the resolved object ($pg), not the reference ($v).
        // The /Type value is assumed to be stored as a
        // (PDF_TYPE_TOKEN, '/Pages') pair, like any other token
        // returned by pdf_read_value().

        if ($pg[1][1]['/Type'][1] === '/Pages') {

            // If one of the kids is an embedded
            // /Pages dictionary, resolve it as well.

            pdf_read_pages ($c, $pg, $result);
        } else {
            $result[] = $pg;
        }
    }
}

?>
Listing 8

<?php

/*
 * Finds the resources associated with a page
 */

function pdf_find_resources (&$c, $obj)
{
    $obj = pdf_resolve_object ($c, $obj);

    // If the current object has a resources
    // dictionary associated with it, we use
    // it. Otherwise, we move back to its
    // parent object.

    if (isset ($obj[1][1]['/Resources'])) {
        return pdf_resolve_object ($c, $obj[1][1]['/Resources']);
    } else {
        if (!isset ($obj[1][1]['/Parent'])) {
            return false;
        } else {
            // The context must be passed along when recursing
            return pdf_find_resources ($c, $obj[1][1]['/Parent']);
        }
    }
}

?>
eters, such as the PostScript virtual machine on which
the PDF specification is based.
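To make the stack idea concrete, here is a small sketch that groups the tokens of a content stream fragment into commands. It is deliberately simplified and is not part of the article's code: it assumes whitespace-separated tokens and treats anything that is neither a number nor a /Name as a command.

```php
<?php
// Splits a content-stream fragment into (command, parameters) pairs.
function group_operators($source)
{
    $stack = array();
    $commands = array();

    foreach (preg_split('/\s+/', $source, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (is_numeric($token) || $token[0] === '/') {
            $stack[] = $token;               // parameter: push it
        } else {
            // command: it consumes everything pushed so far
            $commands[] = array($token, $stack);
            $stack = array();
        }
    }

    return $commands;
}

$commands = group_operators('1 0 0 1 100 200 Tm /F11 10 Tf 5 TL');
```

Each parameter is pushed as it is read, and the command that follows takes whatever is waiting on the stack.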
Next, we’ll select a font that will be used to draw the text:
/F11 10 Tf
The Tf command sets the current font resource to /F11, with a size of 10 points. Note that the font size can be a floating-point value, so that you could have text in size 12.5.
Before writing the text itself, we need to set the spacing between one line of text and the next. This is not as easy to determine as you may think, because it depends on where the baseline of the font resides and how the font itself is designed. From a practical perspective, I find that half the size of the font is a good empirical default that works on most occasions. The TL command below sets the line spacing to five points:
5 TL
Finally, we can actually draw the text! This is done by using a combination of two commands. The text is actually drawn using the ' command (no, that’s not a mistake: the command is a single quotation mark). However, if a newline character is present in the text, it is simply ignored. So, we simply replace every occurrence of a newline character with the T* command, which causes the drawing pointer to be reset to the next line.
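Putting the commands together, a content stream for a block of text might be assembled as in the sketch below. This is one possible arrangement, not the article's code: the function names are made up, the matrix carries a position only, and instead of inserting T* the sketch relies on the fact that the ' operator itself moves to the next line before showing its string, so each input line becomes one command. Parentheses and backslashes have to be escaped inside the string.

```php
<?php
// Escapes the characters that are special inside a PDF literal string.
function pdf_escape_string($s)
{
    return strtr($s, array('\\' => '\\\\', '(' => '\\(', ')' => '\\)'));
}

// Builds the operators that draw $text starting from ($x, $y).
function text_operators($text, $font, $size, $leading, $x, $y)
{
    $ops = array(
        'BT',                       // begin text object
        "1 0 0 1 $x $y Tm",         // position only, no rotation
        "$font $size Tf",           // font resource and size
        "$leading TL",              // spacing between lines
    );

    // One "(...) '" command per line of input
    foreach (explode("\n", $text) as $line) {
        $ops[] = '(' . pdf_escape_string($line) . ") '";
    }

    $ops[] = 'ET';                  // end text object

    return implode("\n", $ops);
}

$stream = text_operators("Hello (world)\nSecond line", '/F11', 10, 5, 72, 720);
```

Note that, because ' advances before drawing, the first line lands one line spacing below the position set by Tm.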
Finally, all we need to do is update the page’s /Contents array with a reference to our stream. Once again, we need to determine if there already is an array and what it contains, and act accordingly, so that we can add our own data to it.
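The three cases (no contents, a single stream, an existing array) can be sketched as follows. The type tags and the append_content() function are hypothetical stand-ins written in the style of the article's constants, not the actual code.

```php
<?php
// Hypothetical type tags, mirroring the style of the article's constants.
define('TYPE_ARRAY', 'array');
define('TYPE_OBJREF', 'objref');

// Returns a /Contents value that also includes $new_ref.
// $contents is null (no contents yet), a single reference,
// or an array of references.
function append_content($contents, $new_ref)
{
    if ($contents === null) {
        return array(TYPE_ARRAY, array($new_ref));
    }

    if ($contents[0] === TYPE_OBJREF) {
        // A single stream: wrap it in an array first
        return array(TYPE_ARRAY, array($contents, $new_ref));
    }

    // Already an array: add our stream at the end
    $contents[1][] = $new_ref;

    return $contents;
}
```

Appending at the end means that our text is drawn after, and therefore on top of, whatever the page already contains.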
Writing it All Back
The final step before we can call it a day consists of actually writing our changes back to the file so that they can be applied to the document. To do so, we first of all open the output file, which was defined at the beginning of the main script (Listing 5). Next, we call the pdf_write_objects() function to rewrite the objects that we modified back to the file. If you take a look at Listing 9 (writer.php), you’ll notice that this function is, essentially, the reverse of pdf_read_value(), since it first builds an indirect object “shell” and then fills it with the appropriate value.
There are two things here that are worth mentioning. First, the information that we write back to the file is not a “true” delta: the resources dictionary may not change at all, but we write it back anyway. This is not
Figure 1
very optimized, but it will do if you’re only making small changes to a document, and it beats having to build a system that “remembers” what was changed. Also, you have probably noticed that, whenever a new object is written to the stream, pdf_write_objects() “makes a note” of the file pointer’s current position. This comes in handy afterwards, when we rebuild the cross-reference table by calling pdf_write_xref(). Here, we create the proper entries one at a time. This process could be optimized by grouping those entries belonging to objects with IDs in sequence, but, again, if you’re dealing only with small changes it’s hardly worth the trouble.
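As an illustration of the "one entry at a time" approach, the sketch below writes a cross-reference section in which every object gets its own one-entry subsection. The xref_section() function and its input format are assumptions of mine; the entry layout (a 10-digit offset, a 5-digit generation number, the n flag and a two-byte line ending, 20 bytes in all) follows the PDF specification.

```php
<?php
// Writes one cross-reference subsection per object.
// $offsets maps object ID => array(generation, byte offset).
function xref_section(array $offsets)
{
    $out = "xref\n";

    foreach ($offsets as $id => $info) {
        list($gen, $offset) = $info;

        // Subsection header: first object ID and number of entries
        $out .= "$id 1\n";

        // The entry itself: offset, generation, "n" (in use),
        // then a space and a newline as the two-byte line ending
        $out .= sprintf("%010d %05d n \n", $offset, $gen);
    }

    return $out;
}

$xref = xref_section(array(4 => array(0, 1234), 9 => array(0, 56789)));
```

Grouping consecutive IDs would only change the subsection headers; the entries themselves stay the same.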
Once the cross-reference table is in the file, pdf_write_xref() terminates by writing the trailer dictionary. In our case, it contains a pointer to the root object, which has not changed but must be there nonetheless, as well as a numeric value that declares the number of objects stored in the file and a pointer to the previous cross-reference table.
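A trailer of this kind might look like the output of the sketch below. The function and its parameter names are hypothetical; the keys (/Size, /Root and /Prev) and the closing startxref and %%EOF lines are the ones defined by the PDF specification for an incremental update.

```php
<?php
// Builds the trailer that closes an incremental update.
//   $size      total number of objects declared in the file
//   $root_id   object ID of the (unchanged) root object
//   $root_gen  generation number of the root object
//   $prev_xref byte offset of the previous cross-reference table
//   $new_xref  byte offset of the table we have just written
function trailer_section($size, $root_id, $root_gen, $prev_xref, $new_xref)
{
    return "trailer\n"
         . "<< /Size $size /Root $root_id $root_gen R /Prev $prev_xref >>\n"
         . "startxref\n"
         . "$new_xref\n"
         . "%%EOF\n";
}
```

The /Prev entry is what lets a reader chain back from the new cross-reference table to the original one.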
Where to Go From Here
That’s it! As you can see, once one figures out how things work it doesn’t take too long to actually open and modify the contents of a PDF file programmatically, which makes the impression, apparently shared by many people, that PDF is a non-modifiable format seem quite strange.
Although the end result of our sample script is relatively simple, the foundation on which it is built is quite solid and can be expanded upon to provide additional functionality. If you run the script against the sample file that I included in last month’s code (and which is once again included in this month’s for your convenience) and open the resulting PDF file, you should see something like the output in Figure 1.
Before parting ways, I just want to share one final tidbit of information with you. Working with PDF files can be very frustrating, particularly if you work on Windows, because the Acrobat PDF viewer is about as useful for debugging as testing whether your house’s electrical circuit is working by sticking your fingers in the power outlet. However, there are ways around that. First, you can actually get Acrobat to provide you with more useful error messages by pressing the Control key while clicking on the OK button in the error window that appears when you try to load a corrupted file. Second, you can use a free PDF decoder (such as the one available online at www.planetpdf.com/mainpage.asp?webpageid=3463) to visually inspect the contents of your file and determine what is wrong with it.
Sometimes, however, it will be hard to figure bugs out. While I was writing this article, I lost lots of time debugging a problem that turned out to be just a spelling mistake but that caused Acrobat to crash without any useful error. This brings us to the last tool you’ll need plenty of: patience!
About the Author
Marco is the Publisher of (and a frequent contributor to) php|architect. When not posting bogus bugs to the PHP website just for the heck of it, he can be found trying to hack his computer into submission. You can write to him at marcot@phparch.com.