MARCH 2004
VOLUME III - ISSUE 3
www.phparch.com
The Magazine For PHP Professionals
Plus:
Tips & Tricks, Security Corner, Product Reviews and much more
Explore your HTML code with Tidy
Testing Automation With PHP
Using the Amazon.com API
through PHP and XML-RPC
PHP And WAP: Past, Present & Future
Matchmaker, Matchmaker Make Me a Match
PHP Ahoy!
A Look at: php|Cruise Bahamas 2004

TABLE OF CONTENTS

Features
Connecting to Amazon.com Web Services with NuSOAP
by Alessandro Sfondrini
Matchmaker, Matchmaker Make Me A Match: An Introduction to Regular Expressions
by George Schlossnagle
Automated Testing For PHP Applications
by Dr. James McCaffrey
PHP Ahoy! A look at php|cruise
by Marco Tabini
WAP: Past, Present and Future
by Andrea Trasatti
Tidying up your HTML in PHP5
by John Coggeshall

Departments
Editorial
What’s New!
Book Review
Flash MX 2004 for Rich Internet Applications
Product Review
Mambo Open Source: Content Management System
Security Corner
Shared Hosting
by Chris Shiflett
Tips & Tricks
By John W. Holmes
exit(0);
I Am Jack's Total Lack of Linux Support
By Marco Tabini
EDITORIAL RANTS
php|architect
Volume III - Issue 3
March, 2004
Publisher
Marco Tabini
Editorial Team
Arbi Arzoumani
Peter MacIntyre
Eddie Peloke
Graphics & Layout
Arbi Arzoumani
Managing Editor
Emanuela Corso
Director of Marketing
J. Scott Johnson

Account Executive
Shelley Johnston

Authors
John Coggeshall, John Holmes,
Dr. James McCaffrey, George Schlossnagle, Alessandro
Sfondrini, Chris Shiflett, Andrea Trasatti
php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini &
Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada.
Although all possible care has been placed in assuring the accuracy of the contents of this
magazine, including all associated source code, listings and figures, the publisher assumes
no responsibility with regard to the use of the information contained herein or in all
associated material.
Contact Information:
General mailbox:
Editorial:
Subscriptions:
Sales & advertising:
Technical support:
Copyright © 2003-2004 Marco Tabini & Associates, Inc.
— All Rights Reserved
I'm sure you're familiar with the Chinese proverb "may
you live in interesting times." Even though I rarely
think of my professional life as dull and boring, the
last month has been particularly exciting. As promised
in my exit(0) column from last month's issue, if you
look through the middle of the magazine you'll find a
full report (in colour!) on the best conference I have
ever attended—our very own php|cruise (forgive me
for a bit of professional pride: eight months of prep
work will do that to you). Things went so well that
we're working on another cruise—this time going to
Alaska in the fall—and plan on making php|c an annu-
al event for many years to come.
All good things come to an end, of course, and, once
back from the cruise, it's back to work. Luckily for us,
work means bringing you yet another great issue of

php|architect—and I personally consider that another
good thing. Like every month, we've got some great
content waiting for you in the following pages.
The one I'm most proud of is George Schlossnagle's
regular expressions article. Regexes are something that
pretty much every programmer has to deal with, but
that very few among us really know how to use. In fact,
I've seen developers write extremely complicated code
with the explicit purpose of getting around having to
use a regular expression—and that is just plain wrong.
After all, using the best solution for each problem is
what being a programmer is all about.
Thus, I approached George about writing an article
on regular expressions—and it became quickly evident
that one article would not even come close to covering
the complexity of regex. Now, everyone knows that I
always try my best to stay away from multi-part articles
for a multitude of reasons, but in this case I felt that the
topic more than deserved our attention over multiple
issues and, therefore, George's article is the first in a
series of three. Over the next three months, he will take
you for a ride from the basics (which are covered in this
issue) to the more complex and exotic aspects of regu-
lar expressions, thus hopefully providing the PHP world
with a definitive guide to this topic.
If regular expressions are not your bag, one of the
other topics covered in this month's issue is certain to
tickle your fancy. For example, you may want to read
Alessandro Sfondrini's excellent article on using the
Amazon.com API directly from your PHP website, or

Andrea Trasatti's look at the world of WAP. As you can
probably imagine, both Andrea and Alessandro hail
from my native Italy—and that alone makes their arti-
cles more than worth reading. There, my monthly her-
itage tax is now paid up!
As I'm sure you've noticed, in the past few months
we've been publishing material about testing practices
quite frequently. As larger and larger projects are devel-
EDITORIAL
Continued on page 8
NEW STUFF
PHP 5.0 Beta 4
PHP.net has announced the release of PHP 5.0 Beta 4.
This fourth beta of PHP 5 is also scheduled to be the
last one (barring unexpected surprises, that did occur
with beta 3). This beta incorporates dozens of bug fixes
since Beta 3, rewritten exceptions support, improved
interfaces support, new experimental SOAP support, as
well as lots of other improvements, some of which are
documented in the ChangeLog. Some of the key fea-
tures of PHP 5 include:
• PHP 5 features the Zend Engine 2.
• XML support has been completely redone in
PHP 5, all extensions are now focused around
the excellent libxml2 library
(http://www.xmlsoft.org/).
• SQLite has been bundled with PHP. For more
information on SQLite, please visit their web-
site.
• A new SimpleXML extension for easily access-
ing and manipulating XML as PHP objects. It
can also interface with the DOM extension
and vice-versa.
• Streams have been greatly improved, includ-
ing the ability to access low-level socket oper-
ations on streams.
PHP.net also announced the release of PHP 4.3.5 RC
3. This will be the last release candidate prior to the
final release, so please test it as much as possible.
For more information visit http://www.php.net/.
ZEND Optimizer 2.5.1
Zend has announced the release of Zend Optimizer
2.5.1.
Zend.com describes the Optimizer as: "a free applica-
tion that runs the files encoded by the Zend Encoder
and Zend SafeGuard Suite, while enhancing the run-
ning speed of PHP applications.
Benefits:
• Enables users to run files encoded by the Zend
Encoder

• Increases runtime performance up to 40%."
Get more information from Zend.com.
What’s New!
Zend Launches New PHP5 In-Depth
Articles Section
Zend Technologies have launched a new version of
their Developer's Corner on the zend.com website.
PHP5 In-depth showcases articles from many well-known
PHP authors on the new features of PHP. For more
information, check out
http://www.zend.com/php/in-depth.php
DEV Web Management System
Dev is a small but powerful and very flexible content
management system for web portals. The system is licensed
as freeware under the terms of the GNU GPL. It is
absolutely free for non-commercial and commercial
use. It is based on PHP 4 and MySQL technology.
This project allows the user to publish articles, evaluate
articles by taking a poll, publish short news and
create back-ends in XML format, manage download
lists, manage advertisements on your site, be informed
about events on your site, create system reports and
export them into MS Excel or XML format, and much
more.
For more information visit:
http://dev-wms.sourceforge.net/.
PhpMyAdmin 2.5.6
Phpmyadmin.net has released their latest version of
phpMyAdmin. PHPMyAdmin is a tool written in PHP
intended to handle the administration of MySQL over
the Web.
"Welcome to this new version, aimed at stabilization of
the 2.5 branch. Meanwhile, work is continuing on the new
2.6 branch. PhpMyAdmin is a tool written in PHP intend-
ed to handle the administration of MySQL over the Web.
Currently it can create and drop databases,
create/drop/alter tables, delete/edit/add fields, execute
any SQL statement, manage keys on fields."
For more information visit:
www.phpmyadmin.net.
PhpSQLiteAdmin 0.2
PhpSQLiteAdmin is a Web interface for the administra-
tion of SQLite databases.
Version 0.2 comes with some new features and a lot
of internal cleanups and refactoring. PhpSQLiteAdmin
is still in an early stage of development. It comes free of
charge and without warranty.

For more information visit:
www.phpsqliteadmin.net.
phpMyEdit 5.4
phpMyEdit generates PHP code for displaying/editing
MySQL tables in HTML. All you need to do is to write a
simple calling program (a utility to do this is included).
Looking for a new PHP Extension? Check out some of the latest offerings from PECL.
ps 1.1.0
ps is an extension similar to the pdf extension but for creating PostScript files. Its api is mod-
eled after the pdf extension.
Memcache 0.2
Memcached is a caching daemon designed especially for dynamic web applications to decrease
database load by storing objects in memory. This extension allows you to work with
memcached through a handy OO interface.
POP3 1.0
The POP3 extension makes it possible for a PHP script to connect to and interact with a POP3
mail server. It is based on the PHP streams interface and requires no external library.
Fileinfo 0.1
This extension allows retrieval of information regarding the vast majority of file types. This
information may include dimensions, quality, length etc. Additionally, it can be used to retrieve
the MIME type for a particular file and, for text files, the proper language encoding.
It includes a huge set of table manipulation functions
(record addition, change, view, copy, and remove), table
sorting, filtering, table lookups, and more.
Several minor bugs were fixed. A few new options
were added. Major features include tabs support, the
ability to specify SQL expressions for fields when writing
to the database, the ability to define new triggers,
and more. All eval() calls were removed due to security
and performance reasons. Some code was optimized.
Several parts of the documentation were updated. A lot
of new language files were added and updated.
For more information visit:
http://platon.sk/projects/phpMyEdit/.
ionCube Releases New Encoder
UK-based ionCube has released a new version of their
compiled code PHP encoding tools. New features
include a choice of ASCII or binary encoded file formats
and optional support for OpenSource extensions such
as mmcache.
Prices start at a special price of $159 in their March
20% off sale.
For further information, please visit the homepage of
the Encoder:
http://www.ioncube.com/sa_encoder.php
oped using PHP, serious testing processes are going to
become an integral part of every good developer's
arsenal of programming tools. What we never quite
considered is that PHP is a great testing platform even

for those projects that are not written using it.
Thankfully, James McCaffrey came to the rescue and
provided us with a wonderful article on the subject.
Our final article this month is about the new Tidy
extension, which author John Coggeshall has recently
introduced in PHP. You may have already heard about
the Tidy project, which provides a series of libraries
capable of parsing and automatically repairing
documents written in markup languages like HTML or XML.
Tidy brings an important set of capabilities to PHP, and
I'm happy to have the author of the extension intro-
duce us to it.
That's it for this month—time for me to go tend to
my sunburn while I start working on the next issue.
Until then, happy readings!
Editorial: Continued from page 5
php|a
Check out some of the hottest new releases from PEAR.
Mail_Queue 1.1
Class to handle mail queue management. Wrapper for PEAR::Mail and PEAR::DB (or
PEAR::MDB). It can load, save and send saved mails in the background and also back up some mails.
The Mail_Queue class puts mails in a temporary container waiting to be fed to the MTA (Mail
Transport Agent) and sends them later (e.g. every few minutes) via crontab or in some other way.
XML_Transformer 0.9.1
With the XML/Transformer class one can easily bind PHP functionality to XML tags, thus trans-
forming the input XML tree into an output XML tree without the need for XSLT.
Net_LMTP 0.7.0
Provides an implementation of the RFC2033 LMTP using PEAR's Net_Socket and Auth_SASL
class.
Text_Wiki 0.8.3

Abstracts parsing and rendering rules for Wiki markup in structured plain text.
FEATURE
Connecting to Amazon.com Web Services with NuSOAP
by Alessandro Sfondrini

Have you ever wanted to add an online shop to your
website but gave up on the idea because you lack the
expertise and resources to run it? Using SOAP, you can
connect to Amazon Web Services and create a PHP
application to remotely browse and search products, add
them to Amazon shopping carts or wish lists and, yes,
you can even earn money on every purchase performed
from your site.

REQUIREMENTS
PHP: 4.1 and higher
OS: Any
Other software: NuSOAP 0.6.4
Code Directory: webs-nusoap

In the article "Exploring the Google API with SOAP,"
which appeared in the January issue of php|a, I
showed you what SOAP is and how it can be used
together with PHP. We used a SOAP-encoded document
to perform a search using the Google Engine,
then we parsed the response to display the results on
our website. To perform these operations, we wrote an
application from scratch; this approach can be great to
understand how SOAP works, but when a customer
asks you to implement a SOAP-based feature in an
application, you can't waste your time in that way.
In this case, there are some libraries that will make
your coding quicker and easier: one of these is
NuSOAP, which allows you to send Remote Procedure
Calls (RPCs) over HTTP.
This article will show you how we can use the
Amazon.com API with NuSOAP to perform searches
and display product details, without having to sort
through a lot of SOAP syntax: if you have had an
opportunity to read my previous article, you will notice
how much shorter an application written this way is,
and how much time can actually be saved by using this
method.
What are Amazon Web Services?
Amazon.com is one of the most widely known on-line
shops. You can find and buy almost everything, from
books to toys to power tools. Several years ago,
Amazon launched a very successful affiliate program,
which they later expanded in their Web Services
program.
Why would you want to use Amazon Web Services
(AWS)? For instance, if your website is about Literature,
you may want to allow your users to look for books in
the (huge) Amazon database directly from your pages,
without redirecting them to Amazon.com. You can
provide them with a detailed description of each book and,
when they decide to buy one, you can add it directly to
their Amazon shopping cart. When the time comes to
complete the purchase, you can redirect the user
directly to the Amazon website, where the checkout
process actually takes place and you receive credit for
your affiliate referral.
It is important to understand that AWS are designed
only to retrieve information about products and create,
as well as populate, shopping carts, not to perform
payments: this must be done directly on the Amazon
website, the reason being, of course, one of security for the
customer's personal information. In any case, a significant
portion of the transaction is performed from your
website. This results in a benefit both for you and for
your users, since you can offer your customers a nearly
seamless user experience and collect your referral fees.
Access to AWS, as well as to the affiliate program,
requires you to register with the Amazon Associates
Program and obtain an Associates ID, which will identify
each purchase sent through our website.
Getting started
Before we start coding, I recommend you download
the AWS Software Developer's Kit from
http://www.amazon.com/gp/browse.html/?node=3434641.
It contains the License Agreement, a guide (you should
have a look at it to familiarize yourself with the concepts
associated with the program) and some code samples,
including a few written in PHP!
As I mentioned earlier, you will also have to apply for
your Developer's token, an alphanumerical string needed
for performing searches and purchases: to do so,
you have to visit
https://associates.amazon.com/exec/panama/associates/join/developer/application.html
and accept the AWS terms and conditions.
To write our application, we will take advantage of a
PHP library called NuSOAP, which is really just a group
of "userland" classes written in PHP and designed to
allow developers to manage SOAP web services, which
will speed up our coding by allowing us to focus on
functionality rather than on the
communication protocols. NuSOAP is distributed
under the LGPL license, and can be downloaded here:
http://dietrich.ganx4.com/nusoap/.
To add NuSOAP support to our project, we simply
have to include nusoap.php to our PHP scripts using
require(). Performing a Remote Procedure Call (RPC) is
simple; look at this example:
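require("nusoap.php");                                // Loads the NuSOAP library
$params = array('name' => 'value');                   // Parameters for the remote method
$s = new soapclient("http://server/file.wsdl", true); // true: the address is a WSDL document
$result = $s -> call('method', $params);              // Executes the RPC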
First of all, we include NuSOAP and we store the
parameters we will use for the RPC in the $params
associative array. We then create a new soapclient object,
passing two arguments to the constructor: the SOAP
server address and a boolean value that indicates
whether the server uses a WSDL document. WSDL
(Web Services Description Language) documents
contain information about a web service, as well as its
methods and properties. They are often used by web
service providers, including Amazon.
Once we have created the object, all we have to do
is to actually execute the RPC by invoking the call()
method and specifying the remote method name and
the parameters to be passed (contained in $params in
our case). NuSOAP automatically fetches the results of
the call and stores them in the $result array.
Since we are working with a WSDL-based server,
NuSOAP can actually create a "proxy" PHP class
capable of providing a better interface to our scripts. Once
we have instantiated $s, we can also invoke a remote
method in this way:
$proxy = $s -> getProxy();
$result = $proxy -> method($params);
Figure 1
Parameter Name   Type     Description
keyword          String   The keyword on which the search should be performed.
page             String   The page number. AWS returns ten results per page, so page 1 will contain results 1 through 10, page 2 results 11 through 20, and so on.
mode             String   Specifies the ID of the store to browse. Each Amazon store has its unique ID, which indicates what kind of products it sells (e.g.: books, music, dvd, vhs, etc.). You can find a complete list of all the IDs available in the AWS documentation.
tag              String   Your Associate ID. If you don't have one, you can use the generic ID webservices-20.
type             String   Determines the type of search results. lite indicates a simpler result set, while heavy provides a richer set of information about each item returned. We'll use lite for our example.
devtag           String   The Developer Token you have received from Amazon.
Figure 2
Result Datum     Type     Description
Url              String   The URL of the product page for this item on Amazon
Asin             String   The Amazon.com Standard Item Number for this product
ProductName      String   The name of the product (in our case, the title of the book)
Catalog          String   The category of the product (e.g.: books)
Authors          String   The name(s) of the author(s)
ReleaseDate      String   The release date, in human-readable format (e.g.: "23 February, 1976").
Manufacturer     String   The name of the product's manufacturer (the publisher in our case)
ImageUrlSmall    String   A pointer to the product's "small" image on the Amazon website
ImageUrlMedium   String   Same as above, for a slightly larger image
ImageUrlLarge    String   Same as above, but for an even larger image
ListPrice        String   The product's list price, including the currency symbol (e.g.: "$20.55")
OurPrice         String   The product's selling price on Amazon, including the currency symbol
UsedPrice        String   The product's price for used copies.
This can be useful to simplify our code: first, we
create a proxy client, $proxy; any subsequent RPCs to
methods specified in the WSDL can be performed using
the proxy, without having to use the NuSOAP call()
method again. In our application, we will use proxies to
work with AWS.
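To make the idea of a proxy more concrete, here is a minimal sketch of how method calls can be forwarded to a generic call() method. It is not NuSOAP's implementation: StubClient and StubProxy are hypothetical classes invented for this illustration, the stub never touches the network, and the code assumes PHP 7.4 or later rather than the PHP 4 targeted by the article.

```php
<?php
// Hypothetical stand-in for a SOAP client: call() simply reports what it was
// asked to do, so the sketch runs without NuSOAP or a network connection.
class StubClient
{
    public function call(string $method, array $params): array
    {
        return ['method' => $method, 'params' => $params];
    }
}

// A proxy turns $proxy->someMethod($params) into $client->call('someMethod', $params).
class StubProxy
{
    private StubClient $client;

    public function __construct(StubClient $client)
    {
        $this->client = $client;
    }

    // PHP invokes __call() for any method that is not defined on this class.
    public function __call(string $name, array $arguments): array
    {
        $params = $arguments[0] ?? [];
        return $this->client->call($name, $params);
    }
}

$client = new StubClient();
$proxy  = new StubProxy($client);

$direct   = $client->call('method', ['name' => 'value']); // explicit call()
$viaProxy = $proxy->method(['name' => 'value']);          // same request through the proxy
```

Both variables end up holding the same data, which is the point of the proxy: the calling code reads like an ordinary method call while the request is still routed through call().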
Designing the application
Now that we've laid down some ground rules, it's time
to decide in detail what the goals of our application are
going to be. Since we're all PHP fans, our example web-
site will be about PHP and, therefore, we'll want to
allow our users to buy books on this topic from
Amazon.
The first thing that we need is a search page: users
will be able to search for a particular keyword (or for a
set of keywords) and the page will display some basic

information about each book that matches the criteria,
such as its title, an image, the publishing company,
author or authors and price. We also have to provide a
way to browse the results, since AWS calls only return
ten results per call.
The search page should also contain a link for each
product to another page on our website that will con-
tain a detailed description of the book, including any
user reviews and comments. From here, the users will
be able to continue their purchase on Amazon.com or
add the product to their wish lists.
The search page
If you have had an opportunity to read through the
AWS documentation, you have probably discovered
that searches by keyword can be performed using the
KeywordSearchRequest() method, which requires the
parameters shown in Figure 1.
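As a rough illustration of the parameter set in Figure 1, the sketch below assembles the array for a keyword search and works out which results a given page covers. The helper names buildKeywordSearchParams() and pageRange() and the constant RESULTS_PER_PAGE are assumptions made for this example, not part of AWS or NuSOAP, and the code assumes a current PHP version.

```php
<?php
// Number of results AWS returns per page, as stated in Figure 1.
const RESULTS_PER_PAGE = 10;

// Hypothetical helper: assembles the six parameters listed in Figure 1.
function buildKeywordSearchParams(
    string $keyword,
    int $page,
    string $devToken,
    string $associateId = 'webservices-20'
): array {
    return [
        'keyword' => $keyword,
        'page'    => (string) max(1, $page), // Figure 1 lists every parameter as a String
        'mode'    => 'books',
        'tag'     => $associateId,
        'type'    => 'lite',
        'devtag'  => $devToken,
    ];
}

// Returns the first and last result numbers covered by a page.
function pageRange(int $page): array
{
    $first = ($page - 1) * RESULTS_PER_PAGE + 1;
    return [$first, $first + RESULTS_PER_PAGE - 1];
}

$params = buildKeywordSearchParams('php', 2, 'YOUR-DEV-TOKEN');
$range  = pageRange(2); // page 2 covers results 11 through 20
```

Keeping the constant values (mode, tag, type) inside one helper means the search page only has to supply the two things that change between requests: the keyword and the page number.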
Assuming that the call will be successful, the server
will return an array containing several items:
• The TotalResults element, which indicates
the number of total results returned by the
query.
• The TotalPages element, which provides the
number of pages available in the search
result.
• The Details sub-array, which contains a set
of data about each search result matching
our search criteria that is included in the
page we have requested. Given that a search
only returns a maximum of ten items per
page, you can expect that this array will
contain no more than ten elements. The
lite search mode returns the data shown in
Figure 2.
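The shape of the response can be sketched with a hand-written array. The sample values below are invented for illustration and do not come from Amazon, summarize() is a hypothetical helper, and the second item deliberately lacks an Authors element on the assumption that not every product carries one.

```php
<?php
// Hand-written sample shaped like the response described above.
$results = [
    'TotalResults' => 12,
    'TotalPages'   => 2,
    'Details'      => [
        [
            'Asin'        => 'EXAMPLE001',
            'ProductName' => 'Sample PHP Book',
            'Authors'     => ['A. Author', 'B. Writer'],
        ],
        [
            'Asin'        => 'EXAMPLE002',
            'ProductName' => 'Another Sample Book', // no Authors element
        ],
    ],
];

// Builds one plain-text summary line per item on the current page.
function summarize(array $results): array
{
    $lines = [];
    foreach ($results['Details'] ?? [] as $item) {
        $authors = implode(', ', $item['Authors'] ?? []);
        $lines[] = $item['ProductName'] . ($authors !== '' ? " by $authors" : '');
    }
    return $lines;
}

$lines = summarize($results);
```

Treating a missing Details or Authors element as an empty array keeps the loop free of special cases, which is useful when a search returns nothing at all.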
<form action="<?= htmlspecialchars($_SERVER["PHP_SELF"]) ?>" method="GET">
<input type="text" name="keyword" value="" />
<input type="hidden" name="page" value="1" />
<input type="submit" name="button" value="Search!" />
</form>
<?php
if (empty($_GET["keyword"])) // If the form hasn't been submitted
    exit; // Stops the execution

require("nusoap.php");

// "..." stands for the address of the AWS WSDL file (see the AWS documentation)
$client = new soapclient("...", true);
$proxy = $client -> getProxy(); // Creates a WSDL client and a proxy

$param = array(
    'keyword' => $_GET["keyword"],
    'page' => $_GET["page"],
    'mode' => 'books',
    'tag' => 'webservices-20',
    'type' => 'lite',
    'devtag' => 'YOUR-DEV-TOKEN'
);

$results = $proxy -> KeywordSearchRequest($param); // Calls the method

// Escapes user input before it is echoed back into HTML or URLs
$self = htmlspecialchars($_SERVER["PHP_SELF"]);
$keyword = htmlspecialchars($_GET["keyword"]);
$query = urlencode($_GET["keyword"]);
$page = (int) $_GET["page"];

if (empty($results["Details"])) // Checks whether there are results
    die("<h3>No results found for &quot;" . $keyword . "&quot;.</h3>");

echo "<h3>Searched Amazon.com for &quot;" . $keyword . "&quot; - page "
    . $page . " of " . $results["TotalPages"] . "</h3>";

foreach ($results["Details"] as $res) // Prints each product details
    echo "<img src='" . $res["ImageUrlMedium"] . "' align='left' /><br/>\n"
        . "<a href='details.php?asin=" . $res["Asin"] . "'><b>" . $res["ProductName"] . "</b></a><br /><br />\n"
        . "<b>Authors</b>: " . @implode(', ', $res["Authors"]) . "<br />\n"
        . "<b>Publishing Company</b>: " . $res["Manufacturer"] . "<br />"
        . "<b>List Price</b>: " . $res["ListPrice"] . " - <b>Our Price</b>: "
        . $res["OurPrice"] . " - <b>Used Price</b>: " . $res["UsedPrice"] . "<br /><br /><br />\n\n";

if ($page > 1) // Prints a link to prev. page if any
    echo "<a href='$self?keyword=$query&amp;page=" . ($page - 1) . "'>Previous Page</a>&nbsp;\n";
if ($page < $results["TotalPages"]) // Prints a link to next page if any
    echo "&nbsp;&nbsp;<a href='$self?keyword=$query&amp;page=" . ($page + 1) . "'>Next Page</a>";
?>
Listing 1
As you can see, the KeywordSearchRequest() method
returns quite a few pieces of information for every
result item, although, of course, we don't have to
output all of them on our site. If you look at Listing 1 (the
source for our search page), you'll see that the very first
part of the file is nothing more than a simple HTML
form, which contains an input text box for the keyword
and a hidden field that forces the page number to 1;
this way, a new search will automatically start from the
first page of results.
The form uses the GET method because we need to
use links for the "Next Page" and "Previous Page"
operations (something like page.php?keyword=blah&page=2).
Naturally, you could also use POST, but in that case it
would be much more difficult for someone to create a
direct link to your search results, which could, in
theory, prevent you from completing some sales.
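The link-based navigation described here can be sketched as follows. This is one possible way to build such links, not the article's own code: pageLink() and pagerLinks() are hypothetical helpers, and the sketch relies on http_build_query(), which only exists in PHP 5 and later.

```php
<?php
// Hypothetical helper: builds a bookmarkable link to one page of results.
function pageLink(string $script, string $keyword, int $page, string $label): string
{
    // http_build_query() URL-encodes the values; htmlspecialchars() then makes
    // the finished URL safe to place inside an HTML attribute.
    $url = $script . '?' . http_build_query(['keyword' => $keyword, 'page' => $page], '', '&');
    return '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($label) . '</a>';
}

// Returns the "Previous Page" and "Next Page" links that apply to the current page.
function pagerLinks(string $script, string $keyword, int $page, int $totalPages): array
{
    $links = [];
    if ($page > 1) {
        $links[] = pageLink($script, $keyword, $page - 1, 'Previous Page');
    }
    if ($page < $totalPages) {
        $links[] = pageLink($script, $keyword, $page + 1, 'Next Page');
    }
    return $links;
}

$links = pagerLinks('page.php', 'php & mysql', 2, 3);
```

Because the keyword and page number travel in the query string, each link is a complete address for one page of results and can be bookmarked or shared, which is the advantage of GET over POST mentioned above.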
The second part of the script contains the actual PHP
code. First of all, an if-then-else control block stops the
execution of the script if $_GET["keyword"] is empty.
Otherwise, we include NuSOAP and create a SOAP
client by passing the URI of the *.wsdl file for Amazon
(which is provided in AWS documentation) and the
boolean true to indicate to the constructor of the
soapclient() class that the SOAP client features WSDL
support. We also create a proxy to call AWS methods
directly as we have seen in the first part of the article.
The parameters needed to invoke
KeywordSearchRequest() are stored in the $param array;
the first two (the keyword and the page number) are to
be found in the $_GET superglobal, since they change
each time we perform or browse a search, while the
others are constant and, therefore, we hardcode them
in our script. Remember to insert your developer token
in $param["devtag"].
Once we have invoked the method and stored the
search results in $results, we have to display the latter
in a format that is comprehensible to the user. First, we
check whether there are any results to begin with. If
the search returned no data, the program displays a
warning and exits. Otherwise, we print a short
summary of the search: the keyword, the current page
number and total page count, followed by details about
each product in the current result page. These are
actually produced by a simple foreach loop, which browses
the $results["Details"] array, echoing the title of
each book, a medium-size image, its authors,
publishing company and prices. We will also provide a link to
another page, details.php, which contains further
information on each book. The link contains a
reference to the product's ASIN (the Amazon identifier for
each product) in order to make the application able to
retrieve the correct product from Amazon's catalogue
with another RPC.
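The display step for a single product might look like the sketch below, which returns the HTML for one entry of the Details array and escapes every value on the way out. The helper names e() and renderItem() and the sample values are assumptions made for this illustration, and the code assumes a current PHP version.

```php
<?php
// Shorthand for escaping a value before it is placed in HTML.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES);
}

// Hypothetical helper: renders one entry of the Details array as HTML.
function renderItem(array $item): string
{
    // The ASIN travels in the link so that details.php can look the product up.
    $detailsUrl = 'details.php?asin=' . urlencode($item['Asin']);

    return '<a href="' . e($detailsUrl) . '"><b>' . e($item['ProductName']) . '</b></a><br />'
        . '<b>Authors</b>: ' . e(implode(', ', $item['Authors'] ?? [])) . '<br />'
        . '<b>Publishing Company</b>: ' . e($item['Manufacturer'] ?? '') . '<br />';
}

// Invented sample values, for illustration only.
$html = renderItem([
    'Asin'         => 'EXAMPLE001',
    'ProductName'  => 'PHP <Tips> & Tricks',
    'Authors'      => ['A. Author'],
    'Manufacturer' => 'Sample Press',
]);
```

Returning a string instead of echoing directly keeps the function easy to test, and escaping at the point of output means a product name containing markup cannot alter the page.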
March 2004
PHP Architect
www.phparch.com
12
FEATURE
Connecting to Amazon.com Web Services with NuSOAP

Parameter   Type     Description
asin        String   The product's ASIN (which, in our case, can be retrieved from $_GET['asin'])
tag         String   The Associate ID, or webservices-20 if you want to use a generic one
type        String   The type of search. In this case, we'll choose heavy, since we want all the information available on a particular book
devtag      String   Your Developer Token
Figure 4

Result Datum      Type               Description
SalesRank         Integer            The product's sales ranking
Lists             Array of Strings   The names of the ListMania lists that contain the product
BrowseList        Array of Arrays    Indicates the product categories in which the product can be found. Its contents look like this:
                                     BrowseList => Array ( [0] => Array ( BrowseName => PHP ) )
Media             String             The type of medium on which the product is distributed (e.g. paperback or hardcover for books)
Isbn              String             The ISBN code of the product (books only)
Availability      String             Indicates how long the product takes to be shipped
Reviews           Array              This array contains information about the customer reviews associated with the product. It includes three elements: AvgCustomerRating, which indicates the average customer rating for the product; TotalCustomerReviews, which contains the number of customer reviews available; and CustomerReviews, which is an array that contains the three most recent reviews (you can find the contents of this array in Figure 6).
SimilarProducts   Array of Strings   Contains the ASINs of products that are similar to this one.
Figure 5

Element   Type      Description
Rating    Integer   The rating of the product in this review
Summary   String    A summary of the review
Comment   String    The full review itself
Figure 6

The last part of this page allows the user to browse the results: if the current page isn't the first one (Page 1), the script prints a link to the previous one and, if it isn't the last page (based on the information returned by our AWS call), it prints a link to the next one.
Figure 3 shows our search page at work.
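The previous/next navigation described above can be sketched on its own, without any SOAP call. This is only an illustration: the helper name page_links(), the script name search.php and the query parameter page are assumptions made for this example (only the keyword parameter appears in the article), and the total page count is expected to come from the AWS response.

```php
<?php
// Hypothetical helper: build "previous"/"next" links for a result page.
// The script name search.php and the parameter name "page" are assumptions.
function page_links(string $keyword, int $page, int $totalPages): array
{
    $links = [];
    $base  = 'search.php?keyword=' . urlencode($keyword) . '&page=';

    if ($page > 1) {                 // not the first page
        $links['previous'] = $base . ($page - 1);
    }
    if ($page < $totalPages) {       // not the last page
        $links['next'] = $base . ($page + 1);
    }
    return $links;
}

print_r(page_links('php', 2, 5));
```

When these URLs are written into an HTML attribute, they should be passed through htmlspecialchars() so that the & separator is emitted as &amp;.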
The Product Detail Page

Now that we are done with the first part of the application, it's time to move on to the product detail page, which will show advanced information about a particular book. The AWS method we need in this case is AsinSearchRequest(), which needs the parameters shown in Figure 4. Just like before, the response that we get back from Amazon is an array of arrays, except that, in this case, we will simply concern ourselves with the first result set, since the ASIN uniquely identifies one product. Our data, therefore, will be stored in $results['Details'][0], which, in turn, will contain the information shown in Figure 5. As you can see, some of the values returned are the same as the results of the KeywordSearchRequest() call that we used in Listing 1, while some others, like the customer reviews, are more appropriate for a detailed product page.
Speaking of the product page, Listing 2 contains the code for details.php. First, we check $_GET["asin"]; if it is empty, the program displays a warning and exits. In a more complete application, you may want a slightly more verbose explanation of what went wrong, or perhaps an automatic redirection to the search page.

If we have an ASIN, we include the NuSOAP library, then create a SOAP client and proxy as we did in the previous page. Please note that we have to use sprintf() to transform the ASIN into a ten-character string, since AWS requires it to be submitted in that format (as an alternative, you could use str_pad() to ensure that the string is ten characters long).
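The padding step can be tried in isolation. The sketch below is only an illustration: the helper name pad_asin() is made up for this example, and it assumes the ASIN reached the script with its leading zeros stripped. It also hints at why str_pad() may be the more forgiving choice, since it does not convert the value to a number and an ISBN-10 can end in the check character X.

```php
<?php
// Hypothetical helper (not part of NuSOAP or AWS): left-pad an ASIN with
// zeros to ten characters without converting it to a number.
function pad_asin(string $asin): string
{
    return str_pad($asin, 10, '0', STR_PAD_LEFT);
}

// A numeric ASIN that lost its leading zero: both approaches agree.
$viaSprintf = sprintf('%010d', '596005601');
$viaStrPad  = pad_asin('596005601');

// An identifier ending in a letter is padded by str_pad() without being altered.
$withLetter = pad_asin('59600560X');

echo $viaSprintf, "\n", $viaStrPad, "\n", $withLetter, "\n";
```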
This time, we only need to pass the ASIN and specify heavy as the search type. Once the RPC has been executed, we retrieve the results and print them out, using a foreach loop to cycle through the user reviews.
The final touch in our application consists of provid-
ing a link back to the Amazon website in order to make
it possible for our users to purchase a product—you
can't do much selling by just showing which products
are available!
The AWS documentation specifies that an HTTP form must be set up for the purpose of submitting the purchase information over to Amazon.com. This form (you can look at the one in Listing 2 for an example) uses the POST method, and its action attribute is really nothing more than a page on Amazon.com that contains the ASIN of the product that must be added to the user's shopping basket. A few additional hidden fields provide the ASIN, the Associates ID and the Developer's token. The form supports two different buttons: one adds the product to the user's basket, while the other adds it to the user's wish list.

<?php
if(empty($_GET["asin"]))
    die("<h3>No ASIN specified</h3>");

require("nusoap.php");
$_GET["asin"] = sprintf("%010d", $_GET["asin"]);

$client = new soapclient(" true);
$proxy = $client -> getProxy(); // Creates a WSDL client and a proxy

$param = array(
    'asin' => $_GET["asin"],
    'tag' => 'webservices-20',
    'type' => 'heavy',
    'devtag' => 'YOUR-DEV-TOKEN'
);

$results = $proxy -> AsinSearchRequest($param); // Calls the method
?>
<h1><?=$results["Details"][0]["ProductName"] ?></h1>
<img src="<?=$results["Details"][0]["ImageUrlLarge"] ?>" align="left" height="350" />
<b>Authors:</b> <?=@implode(', ', $results["Details"][0]["Authors"])?><br /><br />
<b>Published by</b> <?=$results["Details"][0]["Manufacturer"]?>
<b> on</b> <?=$results["Details"][0]["ReleaseDate"]?><br /><br />
<b>List Price</b>: <?=$results["Details"][0]["ListPrice"] ?> -
<b>Our Price</b>: <?=$results["Details"][0]["OurPrice"] ?> -
<b>Used Price</b>: <?=$results["Details"][0]["UsedPrice"] ?><br /><br /><br />
<!-- Form to purchase on Amazon.com -->
<form method="POST" action=" ?>">
<input type="hidden" name="asin.<?=$_GET["asin"] ?>" value="1">
<input type="hidden" name="tag-value" value="webservices-20">
<input type="hidden" name="tag_value" value="webservices-20">
<input type="hidden" name="dev-tag-value" value="YOUR-DEV-TOKEN">
<input type="submit" name="submit.add-to-cart" value="Buy From Amazon.com">&nbsp;&nbsp;
<input type="submit" name="submit.add-to-registry.wishlist" value="Add to Wish List">
</form>
<!-- End Form -->
<b>ISBN:</b> <?=$results["Details"][0]["Isbn"]?><br /><br />
<b>Availability:</b> <?=$results["Details"][0]["Availability"]?><br /><br /><br />
<b>Sales Ranking:</b> <?=$results["Details"][0]["SalesRank"]?><br /><br />
<b>Average customer rating:</b> <?=$results["Details"][0]["Reviews"]["AvgCustomerRating"]?>
<br /><br /><h2>Read user reviews:</h2>
<?php
foreach($results["Details"][0]["Reviews"]["CustomerReviews"] as $res)
    echo "<h3>".$res["Summary"]."</h3>"
        ."<b>Rating: </b>".$res["Rating"]."<br /><br />".$res["Comment"]."<br /><hr />";
?>
Listing 2
Further Improvements
As you have probably noticed, writing a SOAP-based application using a library like NuSOAP is much faster than developing your own SOAP classes; if you have read my article about the Google API that appeared in the January issue of php|a, you probably know what I am talking about. This means that you can develop rather complex applications without having to waste time dealing with the nitty-gritty details of the underlying protocol; in fact, we didn't even write any SOAP code for our Amazon application: NuSOAP did it all for us.
Naturally, the code that I have introduced here is very basic and could stand to gain from some improvements. For instance, Amazon Web Services allows you to manage a remote shopping cart or wish list by adding items to it and removing items from it. The very last part of the purchase (the one where money changes hands) must still take place on Amazon.com, but you can let the user perform most of the normal operations associated with an e-commerce website without leaving your website. However, do keep in mind that if you choose to manage the user's shopping cart remotely, you can't change it once you've submitted it to Amazon; this is done to protect the end user from fraudulent transactions. You can check out the AWS documentation for more details on this topic; you'll find that it's not complicated at all.

Depending on your needs, you may choose to perform a different kind of search operation on your website: by similar products, by author, by ISBN, by manufacturer, and so on. You may also want to browse a "node", or product category (e.g. "programming", "web", etc.) directly, without performing a search. It goes without saying that all this depends on what your goals are.
If your Amazon-based shop becomes very popular, you may decide to join the Amazon Associates Program, an affiliate system that pays you commissions on every sale. Keep in mind, however, that your application must not send more than one request per second to Amazon; even if you provide an error handling system, you must not immediately retry a request if the previous one has failed.
You should also provide a caching system, in order to store the data needed by your site without going back and forth to AWS for every request. You can check out Bruno Pedro's excellent article in the February 2004 issue of php|a for more ideas on caching data from your PHP scripts. If you choose to do so, don't forget that you can't keep your data cached for more than twenty-four hours.
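The two constraints above (at most one request per second, and cached data no older than twenty-four hours) could be approached along the following lines. This is a minimal sketch under stated assumptions, not the article's code: the function names cached_fetch() and throttle() are invented for this example, the cache is a plain file in the system temporary directory, and the throttle only works within a single PHP process, so a busy site would need shared state instead.

```php
<?php
// Hypothetical helper: return the cached value for $key, calling $fetch
// (for instance a closure wrapping the NuSOAP proxy call) only when the
// cached copy is missing or older than $ttl seconds (24 hours by default).
function cached_fetch(string $key, callable $fetch, int $ttl = 86400, ?string $dir = null)
{
    $dir  = $dir ?? sys_get_temp_dir();
    $file = $dir . '/aws_cache_' . md5($key);

    clearstatcache(true, $file);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }

    $data = $fetch();
    file_put_contents($file, serialize($data), LOCK_EX);
    return $data;
}

// Hypothetical helper: pause so that successive calls within this process
// are at least $minInterval seconds apart.
function throttle(float $minInterval = 1.0): void
{
    static $last = 0.0;
    $wait = $minInterval - (microtime(true) - $last);
    if ($wait > 0) {
        usleep((int) ($wait * 1000000));
    }
    $last = microtime(true);
}
```

A caller would invoke throttle() just before the remote call inside the closure passed to cached_fetch(), so that pages served from the cache are never delayed.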
Finally, please keep in mind that in the examples shown in this article we always referred to Amazon.com, the American website. AWS is also available for Amazon.co.uk, Amazon.de and Amazon.co.jp, but you have to modify the URIs in the script, changing the specifications in the WSDL document from soap.amazon.com/ to soap-eu.amazon.com/, and so on. You will also have to add the locale parameter to your RPC invocations; its value can be set to uk, de or jp, depending on which Amazon website you are referring to.
Figure 3
I'm Outta Here
Amazon.com Web Services is a powerful tool that you can use to add e-commerce functionality to your site without going to the expense of developing an online store of your own and stocking all the merchandise. Even if you can't create a complete on-line shop using AWS (because the purchase must be completed on the Amazon website), you can still give your users a customized shopping experience that relies on the practically limitless resources of one of the world's most popular e-commerce websites.
The sample application that I showed you in this article is quite simple: if you plan to use it in a production environment, especially if your site has a lot of traffic, you should probably consider implementing features like error handling and caching in order to prevent problems with the Amazon servers. Adding these elements to your application may require some extra work, but it could all pay off if you enjoy decent traffic and join the Amazon Associates Program.
Perhaps most importantly, I hope to have given you a good idea of how much a SOAP library (in this article we have chosen NuSOAP, but there are other packages, like PEAR::SOAP) can simplify the creation of a complex application: write a few lines of code to perform a Remote Procedure Call and you're practically done.
If you want to extend our sample application and create a "complete" on-line shop using AWS, have a look at the documentation: there you will find a detailed description of every method that's available for use. If you want to learn more about SOAP, you can check out the World Wide Web Consortium's notes about the protocol at http://www.w3.org/TR/SOAP or, if you missed it, read the article "Exploring the Google API with SOAP" published in the January 2004 issue of php|a.

About the Author
Alessandro Sfondrini is a young Italian PHP programmer from Como. He has already written some on-line PHP tutorials and published scripts on the most important Italian web portals. You can contact him at giu_ale2@hotmail.com.
FEATURE
Matchmaker, Matchmaker Make Me A Match
An Introduction to Regular Expressions
by George Schlossnagle

A quick search for the words "hate" and "regular expressions" on your favourite search engine is likely to bring up thousands upon thousands of hits. While most developers recognize the usefulness of regular expressions (and many can't do without them once they have figured out how regexes work), their use remains something of a black-magic art, right up there with hypnosis and session management. Despite looking complicated, however, regular expressions are much easier to work with than most people are willing to admit.

REQUIREMENTS
PHP: Any
OS: Any
Applications: N/A
Code Directory: match-regex

Regular expressions (commonly known as regexes) are a powerful tool for pattern matching and text manipulation. A typical problem that pulls people into learning regular expressions is text munging: you have a string of text and you need to replace portions of it based on certain rules. For instance, you might want to obfuscate all the email addresses in a block of text so that email addresses like george@example.com get translated to the form george [at] example [dot] com. Regular expressions are the tool for the job, and provide a powerful and deep syntax for handling tasks like these.

A Few Myths about Regexes
Before we get started, we should dispel a few popular myths about regexes:
Myth: Regular expressions are slow.
Truth: Regular expressions can be slow, but they don't need to be. The main regular expression library used by PHP (called PCRE and consisting of the preg_ family of functions) is quite fast and also quite powerful. This power means that it is easy to write a short regular expression that performs a lot of work, and performing a lot of work with any tool can be slow.
Myth: You should use basic string functions instead of regular expressions.
Truth: Regular string functions (for example strstr or strtok) are (marginally) faster than a regular expression that accomplishes the same task. That having been noted, this myth often leads to people implementing complicated string parsers using string matching functions where a single regular expression would do the trick. The PCRE library will always match complex patterns faster than implementing a parser on your own.

Alternatives to the PCRE Functions
PHP supplies some alternatives to the PCRE functions. The most direct competitor is the POSIX regular expression library that consists of ereg, ereg_replace and others. We won't be looking at the POSIX regular expression functions because the PCRE library provides a broader pattern-matching facility than its POSIX counterpart and the PCRE library is about 30% faster on average. The other option is to perform string matching with the standard string functions. As noted above, the string functions are faster on the tasks they were designed for (finding specific characters or substrings), but are not an appropriate fit for anything but the simplest patterns.
Your First Regex
The simplest regex is a match against a static string. To
determine if the string '' is pres-
ent in a piece of text, we can use the following code
fragment:
if(preg_match("/george@example\.com/", $text)) {
print "Matches";

} else {
print "Does not match";
}
Despite its simplicity, this example illustrates the basic syntax of a regex match. The regex itself is the first parameter, and is contained within slashes (/). The second parameter is the text you want to test the pattern against. The preg_match function returns 1 (which PHP treats as true) if the match succeeds, and 0 (treated as false) if it fails. Using slashes to delimit regular expressions is a convention (taken from the UNIX utility awk), but is not necessary: you can actually use any non-alphanumeric character other than a backslash or whitespace. Alternative delimiters are convenient if your pattern itself contains slashes. For instance, when dealing with file paths or URLs (both of which contain numerous slashes), it is common to use a different delimiter.
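To make the point about delimiters concrete, here is a small sketch that matches the same path fragment twice, once with slashes and once with # as the delimiter. The URL and the variable names are made up for the example.

```php
<?php
$url = 'http://www.example.com/articles/2004/03/regex';

// With slashes as delimiters, every slash inside the pattern must be escaped.
$withSlashes = preg_match('/\/articles\/2004\//', $url);

// With another delimiter, such as #, the pattern stays readable.
$withHash = preg_match('#/articles/2004/#', $url);

var_dump($withSlashes, $withHash);
```

Both calls report a match; the only difference is how much escaping the pattern needs.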
We can also perform substitutions with PCREs. To substitute 'george at nospam.example.com' for my address (a common anti-spam technique), you can use
$text = preg_replace("/george@example\.com/", "george [at] nospam.example.com", $text);
The other PCRE functions are:
• preg_grep(string pattern, array subjects [, int flag]): preg_grep applies the specified pattern to every element of subjects, returning an array consisting of those that matched. If the optional flag is set to PREG_GREP_INVERT, only those elements that did not match will be returned.
• preg_match_all(string pattern, string subject [, array matches [, int flags]]): preg_match returns only the first match found in its subject text. preg_match_all matches as many times as possible, filling the matches array with all the matches. I will discuss this function in more detail later in the article.
• preg_replace_callback: This function makes it possible to perform very complex operations on a per-match basis through the use of callback functions. We will cover it in a future article, but some of its functionality overlaps with evaluated replacements, which are discussed in this article.
• preg_quote(string text): When using input text in a pattern, you may want to sanitize it to ensure it does not contain any regex metacharacters. preg_quote escapes all regex metacharacters in a string.
• preg_split(string pattern, string subject [, int limit [, int flags]]): preg_split performs similarly to explode, allowing us to break up the string subject into at most limit parts. Instead of splitting on a specific delimiter, preg_split allows the string to be broken based on a regex.
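A few of the functions listed above can be tried out directly. The snippet below is a small illustration; the sample strings and variable names are invented for the example.

```php
<?php
$lines = ['apple pie', 'banana split', 'cherry tart'];

// preg_grep keeps the elements that match (original keys are preserved)...
$withAn = preg_grep('/an/', $lines);
// ...or, with PREG_GREP_INVERT, the ones that do not.
$withoutAn = preg_grep('/an/', $lines, PREG_GREP_INVERT);

// preg_quote escapes the characters that have a special meaning in a regex.
$quoted = preg_quote('3.50 (USD)');

// preg_split breaks a string on a pattern; here commas with optional
// surrounding whitespace, stopping after three parts.
$parts = preg_split('/\s*,\s*/', 'a , b,c ,  d', 3);

print_r($withAn);
print_r($withoutAn);
echo $quoted, "\n";
print_r($parts);
```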
Regex Basics
Of course, we can (and should) perform the previous simple match using strstr(), which is faster than any regex function. What if, however, we want to match all email addresses in a string, rather than a specific one? What if we want to change text only if it appears in a particular position within the string?
The power of regular expressions is in matching complex patterns that cannot be identified using straightforward text-search functions like strstr(). The basic components of a regular expression pattern are:
• Character Classes: Patterns rarely consist only of specific letters; more often they consist of classes of letters. For example 'any number' instead of a particular number, or 'any letter' instead of a particular letter.
• Grouping: Grouping allows for changing the precedence of operations as well as providing a means to extract the text you matched with a pattern.
• Enumerations: Enumerators allow you to specify how many times a character class or sub-pattern appears. This allows for convenient expression of fixed length patterns like 'a US zipcode is 5 digits' as well as variable length patterns such as 'a domain is a number of alphanumeric characters separated by dots'.
• Alternations: Alternations allow for multiple patterns to be combined. Unlike character classes, which allow for a position to match multiple characters, alternations allow for entire patterns to be alternatively matched. For example, a valid workday can be Monday, Tuesday, Wednesday, Thursday or Friday.
• Positional Anchors: Anchors allow you to require your pattern to start matching at a specific location in the search text, for example at the beginning or end of a line.
• Global Pattern Modifiers: Global pattern modifiers allow you to change the basic behavior of a regular expression, for example rendering it case-insensitive.
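As a preview of how these components combine, the following sketch uses one of each in a single pattern. The input format ("a workday, a colon, a five-digit zipcode") is invented purely for illustration; each piece is covered properly in the sections that follow.

```php
<?php
// ^ and $          positional anchors (start and end of the text)
// (...|...)        grouping combined with alternation
// \d               a character class (any digit)
// {5}              an enumeration (exactly five times)
// i                a global pattern modifier (case-insensitive)
$pattern = '/^(mon|tues|wednes|thurs|fri)day: (\d{5})$/i';

$ok  = preg_match($pattern, 'Tuesday: 21044', $m);
$bad = preg_match($pattern, 'Sunday: 21044');

print_r($m);
```

Here $m holds the whole match followed by the two captured groups, while the Sunday input is rejected because none of the alternatives fits.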
Character Classes
While it's usually easy to find a particular substring within a larger string (for example, my e-mail address in a message), it's not always easy to find a particular type of substring, like any e-mail address. To do this, you need to be able to match against a more generic pattern and not just against a static string. PCRE supplies character classes to allow you to do this; a character class allows a specific character in a search text to be matched against a range of possible characters.
For example, a US phone number is composed of a three digit area code, a three digit exchange, and a four digit line number, commonly delimited by a '-'. To match this pattern, you could use the following regular expression:
/\d\d\d-\d\d\d-\d\d\d\d/
The \d specifier is a built-in PCRE character class that consists of all the digits. There are a couple of things you should note about the pattern above. The first is that we have many \d's. In regular expressions, any character or character class matches only a single character unless you use an enumerator (which we'll cover later) to attach a quantity to it. Second, if you test this pattern you will find the following results.
• 555-123-4567 matches. This is correct.
• 5555-123-45678 matches. This is not correct.
The second example does not represent a valid phone number (the area code and line number are too long), but it matches because the pattern fits as shown in Figure 1.
There are a couple of ways to combat this problem. If you know that your search text should be exactly a phone number (with no leading or trailing text), you can use positional anchors to force the pattern to start at the beginning of the text and end at the end, as we'll see later on.
If the phone number might be contained in text, on the other hand, you might try to fix the pattern by requiring at least one whitespace character before and after the number, using a pattern like:
/\s\d\d\d-\d\d\d-\d\d\d\d\s/
The \s specifier is another character class for all whitespace (spaces, tabs, newlines, etc.). This pattern does not work in all situations, though, since if the text begins with the phone number you will be unable to match the leading \s. To handle this case, PCRE supports \b, a boundary condition that matches at the border (or boundary) between a 'word' and a 'non-word' (these are words in the C programming language sense: letters, numbers and underscores only). \b is actually not a character class, but what is known as a 'zero-width assertion'; this means that the \b specifier does not actually match the character on the other side of the boundary, but only ensures that such a boundary exists. Putting that into our pattern we can refine it to:
/\b\d\d\d-\d\d\d-\d\d\d\d\b/
Continuing the testing, we find that "077-xxx-yyyy" matches. US and Canadian area codes and exchanges cannot begin with 0 or 1 (these are reserved for long distance and operator-assisted or international services). To be able to restrict the leading numbers to the allowed set, we need to be able to create our own character classes. In PCRE, these are constructed by filling a set of brackets ([ ]) with the characters we want to match. To match 2-9, we can use the character class [23456789], which is commonly shortened via a range operator to [2-9]. To use a custom character class in a pattern, you use it exactly as you would a regular character or character class. Here is the phone number pattern reworked to employ this:
/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/
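The successive refinements of the phone-number pattern can be compared side by side. The sketch below runs three of the patterns from this section against three sample strings; the sample numbers and the labels used as array keys are made up for the example.

```php
<?php
$patterns = [
    'digits only'      => '/\d\d\d-\d\d\d-\d\d\d\d/',
    'word boundaries'  => '/\b\d\d\d-\d\d\d-\d\d\d\d\b/',
    'restricted start' => '/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/',
];
$samples = ['410-555-1212', '5555-123-45678', '077-555-1212'];

$results = [];
foreach ($patterns as $name => $pattern) {
    foreach ($samples as $sample) {
        // 1 means the pattern matched somewhere in the sample, 0 that it did not.
        $results[$name][$sample] = preg_match($pattern, $sample);
    }
}
print_r($results);
```

The first pattern accepts all three samples, the word boundaries reject the over-long number, and only the last pattern also rejects the area code that starts with 0.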
Regex doesn't always work the way you expect
8  8  7  7  -  x  x  x  -  y  y  y  y  y
   \d \d \d -  \d \d \d -  \d \d \d \d
Figure 1
PCRE provides six commonly used built-in character classes, described in Figure 2. Additionally, PCRE provides POSIX-style character classes for compatibility with POSIX-style regular expressions. These classes are described in Figure 3. POSIX character classes aren't used much in real-life code, which is a shame because they are often a perfect fit for problems that programmers encounter in their day-to-day work.
You can negate a POSIX character class by adding a ^ after the first colon. For instance, to match all non-letter characters, you could use the class [:^alpha:]. Note that POSIX classes are only recognized inside a custom character class, so in a pattern this is written [[:^alpha:]].
Negations are also available in custom character classes: for example, to match anything that is not the greater-than character (>), you can use the custom character class [^>]. Negations are very useful when you are creating regular expressions that extract quoted text or if you want to manually parse XML or HTML.
Since '-', '^' and '[ ]' have special meanings in custom character classes, if you want those actual characters to be elements of the class, you should escape them with a backslash (\). The two exceptions are the range operator -, which can appear un-escaped as the last character in a class, since that is unambiguous, and the negation character ^, which can appear un-escaped in any position but the first.
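Negated classes are easiest to appreciate on a concrete string. The following sketch pulls a tag and a quoted value out of a scrap of HTML and strips the non-digits from a phone number; the HTML fragment, the phone number and the variable names are invented for the example.

```php
<?php
$html = '<p>Visit <a href="http://www.example.com/">our site</a>.</p>';

// [^>]* consumes everything up to the tag's closing bracket.
preg_match('/<a[^>]*>/', $html, $tag);

// [^"]* consumes everything up to the closing quote.
preg_match('/"[^"]*"/', $html, $quoted);

// A negated POSIX class, written inside a custom class: remove non-digits.
$digits = preg_replace('/[[:^digit:]]/', '', '(410) 555-1212');

echo $tag[0], "\n", $quoted[0], "\n", $digits, "\n";
```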
Grouping and Sub-Patterns
Usually, you will not only want to match a pattern, but
extract data from it as well. To extract a specific part of
a pattern, you surround it within parentheses. For
example, to capture each part of the phone number
pattern, you would add parentheses as follows:
/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/
Basic Character Classes
.    Matches any character (except a newline, by default)
\w   An alphanumeric character or the underscore character.
\W   Anything not a \w.
\d   A digit.
\D   A non-digit.
\s   Any whitespace. This includes spaces, tabs, newlines, carriage returns and form feeds.
\S   A non-whitespace character.
Figure 2
Figure 4
POSIX Style Classes
[:alpha:]    Any letter.
[:alnum:]    Any alphanumeric character.
[:ascii:]    Any ASCII character.
[:cntrl:]    Any control character.
[:digit:]    Any digit (same as \d).
[:graph:]    Any alphanumeric or punctuation character.
[:lower:]    Any lowercase letter.
[:print:]    Any printable character.
[:space:]    Any whitespace character (same as \s).
[:upper:]    Any uppercase letter.
[:xdigit:]   Any hexadecimal 'digit'.
Figure 3
Pattern fragments grouped in this fashion are called sub-patterns. To see what they capture, you need to pass a third argument to preg_match. This argument is set by the function to an array holding the captured sub-pattern results. The zeroth element of the array is the text matched by the pattern as a whole, while the sub-pattern captures are at the offset of their pattern number. Patterns are numbered left-to-right and outside-to-inside. So in the pattern above the entire phone number is offset 0, the area code is sub-pattern 1, the exchange is sub-pattern 2, and the line number is sub-pattern 3.
Here you can see a sample phone number being run
through the regular expression.
$text = 'My phone number is 555-321-1212';
preg_match("/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/",
$text, $matches);
print_r($matches);
Executing that code yields the following results, just
as we predicted:
Array
(
[0] => 555-321-1212
[1] => 555
[2] => 321
[3] => 1212
)
We can also nest patterns. If we wanted to capture the entire local part of the phone number, in addition to its component parts, the regex could be modified to be:
/\b([2-9]\d\d)-(([2-9]\d\d)-(\d\d\d\d))\b/
When we nest patterns, we move left to right and, when we hit a nested pattern, we take the outermost part first, then recursively parse its contents following the same rules. With the above pattern, the patterns are numbered as shown in Figure 4.
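The numbering rule for nested sub-patterns can be checked by running the nested pattern and inspecting the match array. The sample sentence below is made up for the example.

```php
<?php
$text = 'My phone number is 410-555-1212';
preg_match('/\b([2-9]\d\d)-(([2-9]\d\d)-(\d\d\d\d))\b/', $text, $matches);

// $matches[0] the whole number
// $matches[1] the area code
// $matches[2] the local part (the outer group is numbered first)
// $matches[3] the exchange
// $matches[4] the line number
print_r($matches);
```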
Sub-patterns are also extremely useful in substitutions, since they allow us access to the matched sub-patterns when performing the replacement. A captured sub-pattern can be accessed in the preg_replace replacement text by referencing its offset as \N (where N is the sub-pattern number). Here is an example that sanitizes phone numbers by obscuring their line number:
preg_replace("/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/", '\1-\2-XXXX', $text);
If we run this on the text 'My phone number is 410-555-1212.', it returns 'My phone number is 410-555-XXXX.'.
Note that the replacement string in the above example is single-quoted. If we were to double quote it, we would have to double escape our sub-pattern references as "\\1-\\2-XXXX". This may seem mysterious but the reasoning is this: the PCRE library needs to be passed the sub-pattern references as \1, but when we double-quote a string, PHP attempts to interpret the escaped characters for us. Single-quoting performs no such interpretation and leaves your references untouched. This is the same process by which "\n" becomes a newline, but '\n' remains literally '\n'.
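The quoting rule is easy to verify. The sketch below compares the single-quoted and double-quoted forms of the replacement string and then runs the substitution; the variable names are chosen for the example.

```php
<?php
$pattern = '/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/';
$text    = 'My phone number is 410-555-1212.';

$single = '\1-\2-XXXX';     // reaches PCRE unchanged
$double = "\\1-\\2-XXXX";   // PHP turns each \\ into \, giving the same string

$sameString  = ($single === $double);
// Without the extra backslash, PHP reads "\1" as an octal escape (character 1).
$octalEscape = ("\1" === chr(1));

$result = preg_replace($pattern, $single, $text);
echo $result, "\n";
```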
We can reference sub-patterns in matches as well,
using the same rules. A fun example of this is finding
all 6-letter palindromes. A palindrome is a word that
is spelled the same forward and backward, for exam-
ple 'noon' or 'deed'. To spot a six-letter palindrome,
we match 3 characters and require that we see them
immediately in reverse order. Here is the pattern:
/\b(\w)(\w)(\w)\3\2\1\b/
When we run this pattern against a palindrome like 'hallah', it matches
as shown in Figure 5.

Figure 5
Matching a palindrome
h    \w (captured as \1)
a    \w (captured as \2)
l    \w (captured as \3)
l    \3
a    \2
h    \1

Notice that you need to use \b to make sure you don't misidentify words
that contain palindrome substrings. If you are running on a UNIX system,
Listing 1 is a code block that will find all the six-letter palindromes
in the dictionary file /usr/share/dict/words.

$fp = fopen('/usr/share/dict/words', 'r');
if (!$fp) {
    print "dictionary file not found\n";
    exit;
}
while (($line = fgets($fp)) !== false) {
    // \b on both sides means the six characters must form a whole word
    if (preg_match('/\b(\w)(\w)(\w)\3\2\1\b/', $line)) {
        // $line still ends with the newline read by fgets()
        print "palindrome: $line";
    }
}
fclose($fp);
Listing 1

This isn't the full story on RFC-compliant email addresses. Because the
specification allows for addresses to contain descriptions as well, a
completely accurate email address validator is actually quite complex.
An example can be found at the end of Mastering Regular Expressions in
Perl; the regex presented there is X characters long! For most purposes,
the email regex developed in this article is completely sufficient.
Enumeration modifiers can also be used to compress patterns with long
repetitive parts. For instance, the phone-number pattern can be
compressed to:
/\b[2-9]\d{2}-[2-9]\d{2}-\d{4}\b/
or, by noting that the area code and exchange match the same pattern, we
can compress it even further, as follows:
/\b([2-9]\d{2}-){2}\d{4}\b/
Note 2
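For readers without a dictionary file at hand, the same pattern can be tried against an in-memory list. This is only a sketch: the word list is made up for illustration, and preg_grep() is used here as a convenient stand-in for the fgets() loop.

```php
<?php
// Hypothetical word list standing in for /usr/share/dict/words.
$words = array('hallah', 'redder', 'shallah', 'banana', 'noon');

// preg_grep() returns the elements of $words that match the pattern;
// array_values() renumbers the result from zero.
$pattern     = '/\b(\w)(\w)(\w)\3\2\1\b/';
$palindromes = array_values(preg_grep($pattern, $words));

print_r($palindromes);
```

Only 'hallah' and 'redder' survive: 'shallah' contains a palindrome substring but is rejected by the leading \b, and 'noon' is too short for the six-character pattern.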
When we use preg_match_all with sub-patterns, we have two choices of how
we want the data returned to us. The default behavior is for the match
array to contain an array for each sub-pattern, where that array
contains the capture for the nth search match as its nth element. If
that's confusing, here is how it looks when matching all the phone
numbers in a text:

<?php
$text = 'Work: 877-555-1212, Fax: 888-555-1212';
preg_match_all('/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/',
               $text, $matches);
print_r($matches);
?>
Executing that script returns the following:
Array
(
    [0] => Array
        (
            [0] => 877-555-1212
            [1] => 888-555-1212
        )

    [1] => Array
        (
            [0] => 877
            [1] => 888
        )

    [2] => Array
        (
            [0] => 555
            [1] => 555
        )

    [3] => Array
        (
            [0] => 1212
            [1] => 1212
        )

)
The alternative is to pass the optional flag PREG_SET_ORDER. With this
flag set, the match array is transposed: it contains one element for
each match found in the search text, and each element is an array
holding the full match followed by its sub-pattern captures. If we are
looking to replicate the Perl idiom
while($text =~ /$regex/g) {
  # perform work on one set of matches at a time
}
we can accomplish it with this PHP:
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
  // perform work on one set of matches at a time
}
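To make the difference between the two orderings concrete, here is a small sketch that runs the phone-number pattern over the sample text both ways. The variable names $byPattern and $bySet are chosen for illustration only.

```php
<?php
$text  = 'Work: 877-555-1212, Fax: 888-555-1212';
$regex = '/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/';

// Default (PREG_PATTERN_ORDER): $byPattern[1] holds every area code.
preg_match_all($regex, $text, $byPattern);

// PREG_SET_ORDER: $bySet[0] holds everything captured for the first number.
preg_match_all($regex, $text, $bySet, PREG_SET_ORDER);

foreach ($bySet as $match) {
    // $match[0] is the whole number; $match[1] to $match[3] are its parts.
    print "$match[0]: area code $match[1], exchange $match[2], line $match[3]\n";
}
```

Pattern order is handy when only one capture is wanted across all matches; set order reads more naturally when each match is processed as a unit.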
Enumerations
Another important feature in pattern matching is the
ability to match variable-length patterns. In the phone
number example, even though the digits of the num-
ber were unknown, the length of the pattern was
fixed—it is always a three digit area code, three digit
exchange and four digit line number. On the other
hand, if we are matching email addresses, we don't a
priori know the length of the address.
To handle this, PCRE supplies enumeration modifiers.
The most basic description of an email address is a
number of non-whitespace characters, followed by an
'@', followed by more non-whitespace characters.

\S is the character class for all non-whitespace characters, so using
that we can write this simplistic email-matching pattern as:
/\S+@\S+/
+ is a PCRE enumerator that instructs the regex engine to match one or
more instances of the character or character class it applies to. PCRE
supports a number of enumeration methods for specifying that a character
or character class should be matched multiple times, as you can see in
Figure 6.
The + and * modifiers are both greedy. This means they will always match
as long a sub-pattern as possible. This is not always the way you want
your patterns to behave, but I will leave the details of when we might
want a greedy or non-greedy match to a later article.
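A quick sketch shows both the convenience and the bluntness of the simplistic pattern. The sample sentence is hypothetical.

```php
<?php
// Hypothetical sample sentence.
$text = 'Mail bob@example.com, or alice@example.org.';

preg_match_all('/\S+@\S+/', $text, $matches);

// \S+ is greedy and keeps consuming until it reaches whitespace, so the
// trailing punctuation is swallowed along with each address.
print_r($matches[0]);
```

The two matches are 'bob@example.com,' and 'alice@example.org.', punctuation included, which is one reason a tighter character class is worth the effort.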
Enumeration modifiers can be applied not only to
characters and character classes, but to sub-patterns as
well. This allows for some pretty complex pattern gen-
eration, which is, after all, one of the best features of
regular expressions (at least when you can understand
what they do).
For example, we can use enumeration modifiers to
significantly improve our email-address pattern.

Figure 6
Enumeration Modifiers
*       Match 0 or more times.
+       Match 1 or more times.
?       Match 0 or 1 times.
{m}     Match exactly m times.
{m,n}   Match between m and n times.
{m,}    Match at least m times.
{0,n}   Match between 0 and n times. (The short form {,n} is only
        treated as a quantifier by recent PCRE2 releases; older
        versions match it as literal text.)
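As a sanity check on the {m} forms, the sketch below runs three spellings of the phone-number pattern against the same inputs. The test inputs are made up; the patterns are the spelled-out version and the two compressed versions from Note 2.

```php
<?php
$patterns = array(
    '/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/',  // every digit spelled out
    '/\b[2-9]\d{2}-[2-9]\d{2}-\d{4}\b/',   // {m}: exactly m repetitions
    '/\b([2-9]\d{2}-){2}\d{4}\b/',         // {m} applied to a sub-pattern
);
// Hypothetical test inputs: one valid number and two invalid ones.
$samples = array('877-555-1212', '177-555-1212', '877-555-121');

$table = array();
foreach ($samples as $sample) {
    foreach ($patterns as $pattern) {
        // preg_match() returns 1 on a match and 0 otherwise.
        $table[$sample][] = preg_match($pattern, $sample);
    }
    print $sample . ': ' . implode(' ', $table[$sample]) . "\n";
}
```

All three patterns agree on every input, which is what we would expect if the compressed forms are faithful rewrites of the long one.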
According to RFC 2822, which defines the "official" valid email address
syntax, an email address is composed of a localpart, an '@' and a
domain. The localpart is one or more characters drawn (roughly) from the
set [\w!#$%"*+\/=?`{}|~^-], while a domain is a dot-separated list of
parts composed of \w characters and hyphens. The pattern for the local
part has the same shape as \S+, with our character class in place of \S:
/[\w!#$%"*+\/=?`{}|~^-]+/

The pattern for domains is more complex. First, we
need to identify elements in the string. These are given
by
/[\w-]+/
If we only have two such elements, the domain pat-
tern would look like this:
/[\w-]+\.[\w-]+/
Note that since '.' is a special regex character (the wild-card
character class), we must escape it to have it match just the '.'
character. Since we can have an arbitrary number of dot-separated
segments, we will encapsulate the first part of the pattern in a
sub-pattern and use the '+' enumerator to specify that it must occur one
or more times:
/([\w-]+\.)+[\w-]+/
Creating a sub-pattern simply involves placing it
inside parentheses. Combining the local and domain
patterns together, we arrive at a decent regular expres-
sion for matching valid email addresses:
/[\w!#$%"*+\/=?`{}|~^-]+@([\w-]+\.)+[\w-]+/
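The combined pattern can be exercised against a few inputs. The candidate strings below are hypothetical, and the pattern is unanchored, so it reports whether an address appears anywhere in the string rather than whether the whole string is an address.

```php
<?php
$email = '/[\w!#$%"*+\/=?`{}|~^-]+@([\w-]+\.)+[\w-]+/';

// Hypothetical inputs.
$candidates = array(
    'george@example.com',          // localpart, '@', two domain segments
    'php+list@mail.example.org',   // '+' is allowed in the localpart
    'george@localhost',            // rejected: the domain has no dot
    'not an address',
);

foreach ($candidates as $candidate) {
    if (preg_match($email, $candidate, $m)) {
        print "$candidate => matched '$m[0]'\n";
    } else {
        print "$candidate => no match\n";
    }
}
```

Note the consequence of requiring at least one dot: single-label hosts such as localhost are not recognized by this pattern.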
We can use this regular expression to perform the
anti-spam rewriting we illustrated at the beginning of
the article.
function obscure_emails($text) {
  $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@(([\w-]+\.)+[\w-]+)/';
  // preg_replace returns the rewritten text; it does not modify $text
  return preg_replace($regex, '\\1 [at] nospam.\\2', $text);
}
Alternation

The last of the basic regular expression syntactical ele-
ments is alternation. Where character classes let us
match a single character against a set of allowed char-
acters, alternations allow for matching a string against
multiple sub-patterns. For example, we might want to
identify all HTTP and FTP addresses in a document for
auto-linking or indexing purposes. We could do this
with two regular expressions:
#https?://\S+#
#ftp://\S+#
but this will require the document to be completely scanned twice. Note
that we are using # as a delimiter and not /, since our pattern contains
slashes and we would rather not have to escape them. A more elegant
approach is to combine them using an alternation, as follows:
#(https?|ftp)://\S+#
The alternation operator | means that the sub-pattern (https?|ftp)
matches either https? ('http' with an optional 's') or ftp. To use this
to automatically create anchor tags for all linked content, we can use a
replacement like this:
preg_replace('#((https?|ftp)://\S+)#',
'<a href="\1">\1</a>', $text);
Running this over a sample text, we notice that any
preexisting anchor tags will become munged. For
example:
Come visit us at <a
href="">phpa.com</a>.
Becomes
Come visit us at <a href="<a href=
"">phpa.com</a>
.">">phpa.com</a>.</a>
Solving this in a completely robust manner involves using look-behind
assertions, which will be covered in a future article, but we can do a
decent job by noting that the href value must be enclosed in quotes.
Thus, if we require the URL to not be preceded by a quote, we should
catch most cases. The revised regular expression is:
preg_replace('#([^\'"])((https?|ftp)://\S+)#',
             '\1<a href="\2">\2</a>', $text);
Note here that we need to capture the non-quote character ([^\'"]) that
we match before the URL and return it in the substitution to avoid
losing it, and that we have to escape the single quote, since the entire
pattern is part of a single-quoted string.
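The quote guard can be tried on a short sample. This is a sketch under stated assumptions: the sample text and URLs are hypothetical, and the pattern contains only the quote guard and the URL sub-pattern described in the text.

```php
<?php
// Hypothetical sample: one bare URL and one URL already inside an href.
$text = 'See http://example.com/docs and <a href="http://example.org/">this</a>';

$linked = preg_replace(
    '#([^\'"])((https?|ftp)://\S+)#',   // \1 = guard character, \2 = URL
    '\1<a href="\2">\2</a>',
    $text
);

print $linked . "\n";
```

The bare URL is wrapped in an anchor tag, while the URL that follows a double quote is left alone. One limitation worth knowing: because the guard must consume a character, a URL at the very start of the text is not matched.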
Positional Anchors
In the example of matching valid US phone numbers, the regular
expression we had was good for spotting phone numbers in a block of
text, but not for validating that a block of text is a phone number. To
do that, we need to ensure that the phone number is the only element in
the search text, with no leading or trailing components. Anchors help
solve this problem. To mandate that our phone number match starts at the
beginning of the search text and ends at the end of it, we can modify
our regex as follows:
/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/
The leading ^ anchors the match at the beginning of the text, meaning
that the match will only succeed if it begins there. The trailing $
anchors the match at the end of the text, meaning that the match will
only succeed if the pattern terminates on the final character of the
text to be matched against.
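The difference between spotting and validating is easy to see side by side. In this sketch the variable names $spot and $validate and the two inputs are illustrative only.

```php
<?php
$spot     = '/\b([2-9]\d{2})-([2-9]\d{2})-(\d{4})\b/';  // find a number in text
$validate = '/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/';    // text must be a number

// Hypothetical inputs: a bare number and a number embedded in a sentence.
$inputs = array('410-555-1212', 'Call 410-555-1212 today');

foreach ($inputs as $input) {
    printf("%-25s spot=%d validate=%d\n", $input,
        preg_match($spot, $input), preg_match($validate, $input));
}
```

Both patterns accept the bare number, but only the unanchored one matches the sentence, which is exactly the behavior we want from a validator.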
Here we use a slightly modified version of the anchored pattern to make
a function useful for validating user-inputted data. If the phone number
is valid, it will return an array of its components. If not, it will
return false. The regex has been made a bit more robust by allowing the
delimiter (previously '-') to be replaced by an optional '.' or
whitespace.
function validate_us_phone($phone)
{
  // Each delimiter is optional and may be a dot, whitespace or a dash
  $regex = '/^([2-9]\d{2})[.\s-]?([2-9]\d{2})[.\s-]?(\d{4})$/';
  if(preg_match($regex, $phone, $matches)) {
    return array( 'area_code' => $matches[1],
                  'exchange' => $matches[2],
                  'line_number' => $matches[3]);
  }
  return false;
}
Don't confuse the anchor operator ^ with the negated character class
operator [^]. Because an anchor is not a character class (in fact it's a
special zero-length look-behind assertion, but that's a topic for a
later article), it has no anchoring meaning inside a character class.
Anchors are also useful for extracting information
near the beginning or end of a string. For example, a
line from an Apache Common Log Format logfile looks
like the following:
10.80.117.254 - - [13/Feb/2004:14:53:01 -0500]
"GET /~george/blog/ HTTP/1.1" 200 43489
This says that on February 13, 2004 a request for "/~george/blog/" was
made from the IP address 10.80.117.254. This request was successful (it
returned a 200 OK response code), and the amount of data returned was
43489 bytes. Writing a full parser for this log line is not too
difficult (we will do so in the cookbook section at the end of the
article), but many queries do not require parsing the entire log. For
instance, if we want to count the number of occurrences of each response
code, the expression to use is quite simple. Looking at the log format,
we see that the last two fields are numbers, and we want the next to
last one. Expressed as a regex, that pattern looks like this:
/(\d+) \d+$/
Working backwards, this says we first match the end of the line ($),
then a number (which we don't bother to capture), then a number which we
do want to capture (the response code). We can wrap this into a quick
script to determine the frequency of various responses as shown in
Listing 2. When we don't need to parse an entire text string, especially
if its format is complex, anchors can make our life much easier.
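Before reading a whole logfile, the pattern can be tried on the single sample line quoted earlier. This is a minimal sketch that only extracts the response code.

```php
<?php
// The sample Common Log Format line from the text, split for readability.
$line = '10.80.117.254 - - [13/Feb/2004:14:53:01 -0500] '
      . '"GET /~george/blog/ HTTP/1.1" 200 43489';

// The $ anchor pins the match to the last two numeric fields of the line.
if (preg_match('/(\d+) \d+$/', $line, $matches)) {
    print "Response code: $matches[1]\n";
}
```

Everything to the left of the last two fields is ignored, so the pattern keeps working even if the request string contains quotes, spaces or other awkward characters.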
Global Pattern Modifiers
The final regular expression syntactical elements we are going to
discuss in this article are global pattern modifiers. As their name
implies, global pattern modifiers change the overall behavior of the
pattern. By far the most common of these is the case insensitivity
modifier, i. Global modifiers are implemented in the Perl style,
directly following the pattern they apply to. Here is a function which
uses a regex to extract all addresses under a specified domain from a
subject text, regardless of the casing of the domain (domains are case
insensitive).
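function extract_addresses($domain, $text)
{
  // Escape regex metacharacters (and the / delimiter) in the domain
  $domain = preg_quote($domain, '/');
  // Variables are not interpolated in single quotes, so concatenate
  $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@' . $domain . '/i';
  if(preg_match_all($regex, $text, $matches, PREG_PATTERN_ORDER)) {
    return $matches[1];
  }
  return false;
}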
Notice here that, in addition to using the i modifier, we also use
preg_quote to sanitize $domain. Data that can potentially come from an
untrusted source (such as a user) should always be quoted to prevent the
accidental or malicious inclusion of regex characters. Also, we use the
PREG_PATTERN_ORDER flag so that all the sub-pattern \1 matches are
stored in $matches[1]. Otherwise we would need to iterate over $matches
and manually build the result set.
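The effect of quoting is easiest to see with a domain that contains a dot. The sketch below uses a hypothetical domain and address; passing '/' as the second argument to preg_quote() also escapes the pattern delimiter.

```php
<?php
// Hypothetical user-supplied domain.
$domain = 'example.com';

$unsafe = '/@' . $domain . '$/i';                   // "." matches any character
$safe   = '/@' . preg_quote($domain, '/') . '$/i';  // "." matches only a dot

// Hypothetical address that should NOT be accepted for example.com.
$address = 'mallory@exampleXcom';

print preg_match($unsafe, $address) . "\n";  // 1: the unquoted dot matched "X"
print preg_match($safe, $address) . "\n";    // 0
print preg_quote($domain, '/') . "\n";       // example\.com
```

An accidental wildcard like this one is the mild case; a value containing characters such as ( or | could change the structure of the pattern entirely.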
<?php
// Count how often each response code appears in a Common Log Format file
$logfile = isset($_SERVER['argv'][1]) ? $_SERVER['argv'][1] : null;
if(!$logfile) {
  print "Please specify a logfile to parse\n";
  exit(1);
}
if(($fp = fopen($logfile, 'r')) === false) {
  print "Error opening $logfile\n";
  exit(1);
}
$regex = '/(\d+) \d+$/';
$frequency = array();
while(($line = fgets($fp)) !== false) {
  if(preg_match($regex, $line, $matches)) {
    $code = $matches[1];
    // Initialize the counter the first time a code is seen
    $frequency[$code] = isset($frequency[$code]) ? $frequency[$code] + 1 : 1;
  }
}
fclose($fp);
print "Code\tOccurrences\n";
foreach ($frequency as $code => $occurrences) {
  print "$code\t$occurrences\n";
}
?>
Listing 2

The other possible pattern modifiers are as follows:

m (treat as multiline). By default, PCRE assumes that we intend our
search text to be processed as one big string, and ^ and $ will match
only the beginning and ending of the search text, respectively. When the
m modifier is used, ^ and $ will match at the beginning and ending of
every line in the search text (the search text is considered to be
broken into lines by any newline characters).

s (treat as single line for wildcards). By default, the wildcard
character (.) will not match a newline. If . should match newlines as
well, add the s modifier to the pattern.

x (extended legibility). By default, any whitespace in a pattern is
considered part of the pattern. With the x modifier, unescaped
whitespace outside character classes is ignored, and everything from a #
to the end of the line is treated as a comment. Allowing whitespace in a
pattern can be helpful for readability and inline comments. Compare the
following two regular expressions:
/([2-9]\d{2})[.\s-]?([2-9]\d{2})[.\s-]?(\d{4})/
and
/([2-9]\d{2})  # Match the area code (200-999) as subpattern 1
 [.\s-]?       # An optional delimiter - dot, dash or ws
 ([2-9]\d{2})  # Match the exchange as subpattern 2
 [.\s-]?       # An optional delimiter - dot, dash or ws
 (\d{4})       # Match the line number as subpattern 3
/x
More information on creating readable patterns will be covered in a
future article.

A (start anchored). This modifier is equivalent to putting a ^ at the
start of our pattern: it anchors the pattern at the start of the search
text. Thus the following two regular expressions are equivalent:
/^Subject: (.*)/
/Subject: (.*)/A
There are no benefits of using this method over manually anchoring a
pattern with ^ (other than, perhaps, moving the anchor character from
the beginning of your pattern to its end).

D (dollar end-only). If this modifier is set, the dollar end-anchor $
will match only at the very end of the string. By default, $ will also
match just before the final character if that character is a newline.
This modifier is ignored if the m modifier is also used.

S (study). If we are going to execute a pattern a number of times, we
can use this flag to instruct PCRE to take extra time 'studying' the
pattern to improve its efficiency.

U (ungreedy). By default, all matches in PCRE are greedy; that is, a
pattern will attempt to match the longest possible piece of the search
text. The U modifier reverses this behavior, asking PCRE to find the
shortest possible match for the pattern. More on greedy versus
non-greedy matching will be covered in a future article.

u (UTF-8). This modifier instructs PCRE to treat patterns and search
texts as UTF-8 characters instead of just single-byte characters. UTF-8
support is still new and should be used with some caution as it may be
incomplete.

e (evaluated replacements). This causes the replacement string in a
preg_replace call to be evaluated as PHP. Back-references are expanded
and the resulting expression is executed via eval. The result of the
evaluation is used as the final replacement text. Let's try an example
of how to use this by writing Wiki-style links to documents. In Wikis,
putting so-called CamelCaps text in a document will link it to the wiki
page of that name. Doing this blindly with a regex can be achieved with
the following replacement:
$text = preg_replace('/\b(([A-Z]\w+){2,})\b/',
'<a href="/wiki/\1.html">\1</a>', $text);
This might result in a number of non-existent documents being linked to,
though. If
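function is_wiki_page($token)
{
  $page = $_SERVER['DOCUMENT_ROOT'] . "/wiki/$token.php";
  return file_exists($page);
}
// The e modifier evaluates the replacement as PHP code; it was
// deprecated in PHP 5.5 and removed in PHP 7.0. The back-reference is
// quoted so that the evaluated code receives a string.
$text = preg_replace('/\b(([A-Z]\w+){2,})\b/e',
  'is_wiki_page("\1") ? "<a href=\"/wiki/\1\">\1</a>" : "\1"',
  $text);
Listing 3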
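On PHP 7.0 and later the e modifier no longer exists, so the same conditional linking has to be written with preg_replace_callback(). The following is a sketch, not the article's code: is_wiki_page() is replaced by a hypothetical stand-in that consults a fixed list instead of the filesystem, and link_wiki_token() is a helper name invented for this example.

```php
<?php
// Hypothetical stand-in for is_wiki_page(): a fixed list of known pages
// replaces the file_exists() check so that the sketch is self-contained.
function is_wiki_page($token)
{
    return in_array($token, array('FrontPage', 'RecentChanges'), true);
}

// The callback receives the match array; $m[1] is the CamelCaps token.
function link_wiki_token($m)
{
    if (is_wiki_page($m[1])) {
        return '<a href="/wiki/' . $m[1] . '">' . $m[1] . '</a>';
    }
    return $m[1];  // leave unknown tokens untouched
}

// Hypothetical input text.
$text = 'See FrontPage, not MissingPage.';
$text = preg_replace_callback('/\b(([A-Z]\w+){2,})\b/', 'link_wiki_token', $text);

print $text . "\n";
```

Besides working on current PHP versions, the callback form avoids building PHP source code out of matched text, which removes a class of quoting and code-injection problems that evaluated replacements invite.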
“As with most tools, the way to really learn regexes is to use them in
practical situations.”
