O'Reilly, Mastering Regular Expressions, 2nd Edition, July 2002, ISBN 0596002890

8.3 Packages, Packages, Packages
There are many regex packages for Java; the list that follows
has a few words about those that I investigated while
researching this book. (See this book's web page, regex.info/,
for links.) The table below gives a superficial overview of some
of the differences among their flavors.

Sun
java.util.regex: Sun's own regex package, finally
standard as of Java 1.4. It's a solid, actively maintained
package that provides a rich Perl-like flavor. It has the best
Unicode support of these packages. It provides all the basic
functionality you might need, but has only minimal
convenience functions. It matches against CharSequence
objects, and so is extremely flexible in that respect. Its
documentation is clear and complete. It is the all-around
fastest of the engines listed here. This package is described
in detail later in this chapter.
Version Tested: 1.4.0.
License: comes as part of Sun's JRE. Source code is
available under SCSL (Sun Community Source Licensing).

IBM
com.ibm.regex: This is IBM's commercial regex package
(although it's said to be similar to the
org.apache.xerces.utils.regex package, which I did not
investigate). It's actively maintained, and provides a rich
Perl-like flavor, although it is somewhat buggy in certain
areas. It has very good Unicode support. It can match
against char[], CharacterIterator, and String. Overall, it is
not quite as fast as Sun's package, but it is the only other
package that's in the same class.
Version Tested: 1.0.0.
License: commercial product

Table 8-1. Superficial Overview of Some Java Package Flavor Differences
Packages compared: Sun, IBM, ORO, JRegex, Pat, GNU, Regexp

Basic Functionality
  Engine type: NFA (Sun, IBM, ORO, JRegex, Pat, Regexp); POSIX NFA (GNU)
  Deeply-nested parens
  dot doesn't match
  \s includes [•\t\r\n\f]
  \w includes underscore
  Class set operators
  POSIX [[:···:]]

Metacharacter Support
  \A, \z, \Z
  \G
  (?#···)
  Octal escapes
  2-, 4-, 6-digit hex escapes
  Lazy quantifiers
  Atomic grouping
  Possessive quantifiers
  Word boundaries
  Non-word boundaries
  \Q···\E
  (if then|else) conditional
  Non-capturing parens
  Lookahead
  Lookbehind
  (?mod)
  (?-mod:···)
  (?mod:···)

Unicode-Aware Metacharacters
  Unicode properties
  Unicode blocks
  dot, ^, $
  \d
  \s
  \w
  Word boundaries

Key: supported; partial support; supported, but buggy.
(For version info, see Section 8.3.)

ORO
org.apache.oro.text.regex: The Apache Jakarta project
has two unrelated regex packages, one of which is "Jakarta ORO."
It actually contains multiple regex engines, each
targeting a different application. I looked at one engine, the
very popular Perl5Compiler matcher. It's actively
maintained, and solid, although its version of a Perl-like
flavor is much less rich than either the Sun or the IBM
packages. It has minimal Unicode support. Overall, the
regex engine is notably slower than most other packages.
Its \G is broken. It can match against char[] and String.
One of its strongest points is that it has a vast, modular
structure that exposes almost all of the mechanics that
surround the engine (the transmission, search-and-replace
mechanics, etc.) so advanced users can tune it to suit their
needs, but it also comes replete with a fantastic set of
convenience functions that makes it one of the easiest
packages to work with, particularly for those coming from a
Perl background (or for those having read Chapter 2 of this
book). This is discussed in more detail later in this chapter.
Version Tested: 2.0.6.
License: ASL (Apache Software License)

JRegex
jregex: Has the same object model as Sun's package, with
a fairly rich Perl-like feature set. It has good Unicode
support. Its speed places it in the middle of the pack.
Version Tested: v1.01
License: GNU-like

Pat
com.stevesoft.pat: It has a fairly rich Perl-like flavor, but
no Unicode support. Very haphazard interface. It has
provisions for modifying the regex flavor on the fly. Its speed
puts it on the high end of the middle of the pack.
Version Tested: 1.5.3
License: GNU LGPL (GNU Lesser General Public License)

GNU
gnu.regexp: The more advanced of the two "GNU regex
packages" for Java. (The other, gnu.rex, is a very small
package providing only the most bare-bones regex flavor
and support, and is not covered in this book.) It has some
Perl-like features, and minimal Unicode support. It's very
slow. It's the only package with a POSIX NFA (although its
POSIXness is a bit buggy at times).
Version Tested: 1.1.4
License: GNU LGPL (GNU Lesser General Public License)

Regexp
org.apache.regexp: This is the other regex package under
the umbrella of the Apache Jakarta project. It's somewhat
popular, but quite buggy. It has the fewest features of the
packages listed here. Its overall speed is on par with ORO.
Not actively maintained. Minimal Unicode support.
Version Tested: 1.2
License: ASL (Apache Software License)

8.3.1 Why So Many "Perl5" Flavors?
The list mentions "Perl-like" fairly often; the packages
themselves advertise "Perl5 support." When version 5 of Perl
was released in 1994 (see Section 3.1.1.7), it introduced a new
level of regular-expression innovation that others, including
Java regex developers, could well appreciate. Perl's regex flavor
is powerful, and its adoption by a wide variety of packages and
languages has made it somewhat of a de facto standard.

However, of the many packages, programs, and languages that
claim to be "Perl5 compliant," none truly are. Even Perl itself
differs from version to version as new features are added and
bugs are fixed. Some of the innovations new with early 5.x
versions of Perl were non-capturing parentheses, lazy
quantifiers, lookahead, inline mode modifiers like (?i), and
the /x free-spacing mode (all discussed in Chapter 3). Packages
supporting only these features claim a "Perl5" flavor, but miss
out on later innovations, such as lookbehind, atomic grouping,
and conditionals.
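
To make the early Perl 5 innovations listed above concrete, here is a minimal sketch using Python's re module, which is one of the "Perl-like" flavors and is used here only as a convenient stand-in (it is not one of the Java packages reviewed in this section). The sample strings and variable names are illustrative assumptions.

```python
import re

# Non-capturing parentheses: group for alternation without capturing.
m = re.search(r"(?:Jan|Feb)\s+(\d+)", "Feb 14")
day = m.group(1)  # the only capturing group is the day number

# Lazy quantifier: stop at the first closing tag rather than the last.
html = "<b>one</b> and <b>two</b>"
greedy = re.search(r"<b>.*</b>", html).group()
lazy = re.search(r"<b>.*?</b>", html).group()

# Lookahead: require a following "s" at a word end without consuming it.
fixed = re.sub(r"\bJeff(?=s\b)", "Jeff'", "Jeffs book")

# Inline mode modifier (?i) and free-spacing mode (Perl's /x).
inline = re.search(r"(?i)perl", "PERL 5") is not None
spaced = re.compile(r"""
    \b (\d{4})   # year
    -  (\d{2})   # month
    """, re.VERBOSE)
year, month = spaced.search("released 1994-10").groups()

print(day, lazy, fixed, inline, year, month)
```

A package that handles all five of these constructs would cover the "early 5.x" feature set described above, which says nothing about whether it also supports the later additions.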


There are also times when a package doesn't limit itself to only
"Perl5" enhancements. Sun's package, for example, supports
possessive quantifiers, and both Sun and IBM support character
class set operations. Pat offers an innovative way to do
lookbehind, and a way to allow matching of simple arbitrarily
nested constructs.

8.3.2 Lies, Damn Lies, and Benchmarks
It's probably a common twist on Sam Clemens' famous "lies,
damn lies, and statistics" quote, but when I saw its use with
"benchmarks" in a paper from Sun while doing research for this
chapter, I knew it was an appropriate introduction for this
section. In researching these seven packages, I've run literally
thousands of benchmarks, but the only fact that's clearly
emerged is that there are no clear conclusions.

There are several things that cloud regex benchmarking with
Java. First, there are language issues. Recall the benchmarking
discussion from Chapter 6 (see Section 6.3.2), and the special
issues that make benchmarking Java a slippery science at best
(primarily, the effects of the Just-In-Time or
Better-Late-Than-Never compiler). In doing these benchmarks,
I've made sure to use a server VM that was "warmed up" for the
benchmark (see "BLTN," Section 6.3.2), to show the truest results.

Then there are regex issues. Due to the complex interactions of
the myriad of optimizations like those discussed in Chapter 6, a
seemingly inconsequential change while trying to test one
feature might tickle the optimization of an unrelated feature,
anonymously skewing the results one way or the other. I did
many (many!) very specific tests, usually approaching an issue
from multiple directions, and so I believe I've been able to get
meaningful results... but one never truly knows.


8.3.2.1 Warning: Benchmark results can cause drowsiness!
Just to show how slippery this all can be, recall that I judged
the two Jakarta packages (ORO and Regexp) to be roughly
comparable in speed. Indeed, they finished equally in some of
the many benchmarks I ran, but for the most part, one
generally ran at least twice the speed of the other (sometimes
10x or 20x the speed). But which was "one" and which "the
other" changed depending upon the test.

For example, I targeted the speed of greedy and lazy
quantifiers by applying ^.*: and ^.*?: to a very long
string like '···xxx:x'. I expected the greedy one to be faster
than the lazy one with this type of string, and indeed, it's that
way for every package, program, and language I tested...
except one. For whatever reason, Jakarta's Regexp's ^.*:
performed 70% slower than its ^.*?: did. I then applied the
same expressions to a similarly long string, but this time one
like 'x:xxx···' where the ':' is near the beginning. This should
give the lazy quantifier an edge, and indeed, with Regexp, the
expression with the lazy quantifier finished 670x faster than the
greedy. To gain more insight, I applied ^[^:]*: to each
string. This should be in the same ballpark, I thought, as the
lazy version, but highly contingent upon certain optimizations
that may or may not be included in the engine. With Regexp, it
finished the test a bit slower than the lazy version, for both
strings.
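
The six tests just described (three expressions against two long strings) can be sketched as a small timing harness. This is only an illustration in Python's re module, not the Java harness used for the book's measurements; the string length N, the repeat count, and the helper name time_all are assumptions chosen so the script runs quickly.

```python
import re
import timeit

N = 20_000
colon_at_end = "x" * N + ":x"      # like '···xxx:x'
colon_at_front = "x:" + "x" * N    # like 'x:xxx···'

patterns = {
    "greedy": re.compile(r"^.*:"),
    "lazy": re.compile(r"^.*?:"),
    "negated class": re.compile(r"^[^:]*:"),
}

def time_all(target, repeat=20):
    """Return the seconds each pattern takes for `repeat` match attempts."""
    return {
        name: timeit.timeit(lambda p=p: p.match(target), number=repeat)
        for name, p in patterns.items()
    }

for label, target in (("':' near end", colon_at_end),
                      ("':' near front", colon_at_front)):
    for name, seconds in time_all(target).items():
        print(f"{label:16} {name:14} {seconds:.6f}s")
```

All three expressions match the same text on a given string, so any difference in the printed times comes from how the engine gets there. The absolute numbers depend on the machine and the engine version, which is exactly the slipperiness this section describes.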
Does the previous paragraph make your eyes glaze over a bit?
Well, it discusses just six tests, and for only one regex package;
we haven't even started to compare these Regexp results
against ORO or any of the other packages. When compared
against ORO, it turns out that Regexp is about 10x slower with
four of the tests, but about 20x faster with the other two! It's
faster with ^.*?: and ^[^:]*: applied to the long string with
':' at the front, so it seems that Regexp does poorly (or ORO
does well) when the engine must walk through a lot of string,
and that the speeds are reversed when the match is found
quickly.

Are your eyes completely glazed over yet? Let's try the same set
of six tests, but this time on short strings instead of very long
ones. It turns out that Regexp is three to ten times faster
than ORO for all of them. Okay, so what does this tell us?
Perhaps that ORO has a lot of clunky overhead that
overshadows the actual match time when the matches are
found quickly. Or perhaps it means that Regexp is generally
much faster, but has an inefficient mechanism for accessing the
target string. Or perhaps it's something else altogether. I don't
know.

Another test involved an "exponential match" (see Section
6.1.4) on a short string, which tests the basic churning of an
engine as it tracks and backtracks. These tests took a long
time, yet Regexp tended to finish in half the time of ORO. There
just seems to be no rhyme nor reason to the results. Such is
often the case when benchmarking something as complex as a
regex engine.

8.3.2.2 And the winner is...
The mind-numbing statistics just discussed take into account
only a small fraction of the many, varied tests I did. In looking
at them all for Regexp and ORO, one package does not stand
out as being faster overall. Rather, the good points and bad
points seem to be distributed fairly evenly between the two, so
I (perhaps somewhat arbitrarily) judge them to be about equal.

Adding the benchmarks from the five other packages into the
mix results in a lot of drowsiness for your author, and no
obviously clear winner, but overall, Sun's package seems to be
the fastest, followed closely by IBM's. Following in a group
somewhat behind are Pat, JRegex, Regexp, and ORO. The GNU
package is clearly the slowest.

The overall difference between Sun and IBM is not so obviously
clear that another equally comprehensive benchmark suite
wouldn't show the opposite order if the suite happened to be
tweaked slightly differently than mine. Or, for that matter, it's
entirely possible that someone looking at all my benchmark
data would reach a different conclusion. And, of course, the
results could change drastically with the next release of any of
the packages or virtual machines (and may well have, by the
time you read this). It's a slippery science.

In general, Sun did most things very well, but it's missing a few
key optimizations, and some constructs (such as character
classes) are much slower than one would expect. Over time,
these will likely be addressed by Sun (and in fact, the slowness
of character classes is slated to be fixed in Java 1.4.2). The
source code is available if you'd like to hack on it as well; I'm
sure Sun would appreciate ideas and patches that improve it.

8.3.3 Recommendations
There are many reasons one might choose one package over
another, but Sun's java.util.regex package, with its high
quality, speed, good Unicode support, advanced features, and
future ubiquity, is a good recommendation. It comes integrated
as part of Java 1.4: String.matches(), for example, checks to
see whether the string can be completely matched by a given
regex.
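
The "completely matched" behavior of String.matches() is worth a small illustration, because it differs from the more common "find a match somewhere" test. The sketch below shows the same distinction with Python's re.fullmatch and re.search; it is an analogy in a different language, not Java code, and the two helper names are hypothetical.

```python
import re

def matches_completely(regex, text):
    """True only if the regex can match the entire string."""
    return re.fullmatch(regex, text) is not None

def matches_somewhere(regex, text):
    """True if the regex matches any substring of the text."""
    return re.search(regex, text) is not None

print(matches_completely(r"\d+", "1234"))        # whole string is digits
print(matches_completely(r"\d+", "1234 items"))  # trailing text remains
print(matches_somewhere(r"\d+", "1234 items"))   # digits appear somewhere
```

A whole-string test of this kind behaves as though the regex were wrapped in start and end anchors, so it suits validation better than searching.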
java.util.regex's strengths lie in its core engine, but it
doesn't have a good set of "convenience functions," a layer that
hides much of the drudgery of bit-shuffling behind the scenes.
ORO, on the other hand, while its core engine isn't as strong,
does have a strong support layer. It provides a very convenient
set of functions for casual use, as well as the core interface for
specialized needs. ORO is designed to allow multiple regex core
engines to be plugged in, so the combination of
java.util.regex with ORO sounds very appealing. I've talked
to the ORO developer, and it seems likely that this will happen,
so the rest of this chapter looks at Sun's java.util.regex and
ORO's interface.



Chapter 2. Extended Introductory Examples
Remember the doubled-word problem from the first chapter? I
said that a full solution could be written in just a few lines in a
language like Perl. Such a solution might look like:
$/ = ".\n";
while (<>) {
  next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
  s/^(?:[^\e]*\n)+//mg;   # Remove any unmarked lines.
  s/^/$ARGV: /mg;         # Ensure lines begin with filename.
  print;
}
Yup, that's the whole program.

Even if you're familiar with Perl, I don't expect you to
understand it (yet!). Rather, I wanted to show an example
beyond what egrep can allow, and to whet your appetite for the
real power of regular expressions.

Most of this program's work revolves around its three regular
expressions:

\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)
^(?:[^\e]*\n)+
^

Though this is a Perl example, these three regular expressions
can be used verbatim (or with only a few changes) in many
other languages, including Python, Java, Visual Basic .NET, Tcl,
and more.

Now, looking at these, that last ^ is certainly recognizable, but
the other expressions have items unfamiliar to our egrep-only
experience. This is because Perl's regex flavor is not the same
as egrep's. Some of the notations are different, and Perl (as
well as most modern tools) tends to provide a much richer set of
metacharacters than egrep. We'll see many examples
throughout this chapter.
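
As a small check of the claim that these expressions carry over to other languages, the sketch below applies the first of the three, unchanged, with Python's re module. The function names and the bracket-style highlighting are illustrative assumptions; the Perl program marks matches with terminal escape sequences instead, and Python's re does not recognize the \e escape used in the second expression.

```python
import re

# The doubled-word expression from the Perl program, used verbatim.
DOUBLED = re.compile(r"\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)", re.IGNORECASE)

def find_doubled_words(text):
    """Return the first word of every doubled pair found in text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

def highlight(text):
    """Wrap both copies of each doubled word in square brackets."""
    return DOUBLED.sub(r"[\1]\2[\3]", text)

sample = "This is is a test of the <b>the</b> thing"
print(find_doubled_words(sample))
print(highlight(sample))
```

The middle group allows whitespace or HTML-style tags between the two copies, which is why 'the <b>the' is caught, while the leading \b prevents the 'is' inside 'This' from counting as the first copy.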


3.1 A Casual Stroll Across the Regex Landscape
I'd like to start with the story about the evolution of some
regular expression flavors and their associated programs. So,
grab a hot cup (or frosty mug) of your favorite brewed
beverage and relax as we look at the sometimes wacky history
behind the regular expressions we have today. The idea is to
add color to our regex understanding, and to develop a feeling
as to why "the way things are" are the way things are. There
are some footnotes for those that are interested, but for the
most part, this should be read as a light story for enjoyment.

3.1.1 The Origins of Regular Expressions
The seeds of regular expressions were planted in the early
1940s by two neurophysiologists, Warren McCulloch and Walter
Pitts, who developed models of how they believed the nervous
system worked at the neuron level.[1] Regular expressions
became a reality several years later when mathematician
Stephen Kleene formally described these models in an algebra
he called regular sets. He devised a simple notation to express
these regular sets, and called them regular expressions.

[1] "A logical calculus of the ideas immanent in nervous activity," first published in Bulletin of
Math. Biophysics 5 (1943) and later reprinted in Embodiments of Mind (MIT Press, 1965). The
article begins with an interesting summary of how neurons behave (did you know that
intra-neuron impulse speeds can range from 1 all the way to 150 meters per second?), and then
descends into a pit of formulae that is, literally, all Greek to me.

Through the 1950s and 1960s, regular expressions enjoyed a
rich study in theoretical mathematics circles. Robert Constable
has written a good summary[2] for the mathematically inclined.

[2] Robert L. Constable, "The Role of Finite Automata in the Development of Modern Computing
Theory," in The Kleene Symposium, Eds. Barwise, Keisler, and Kunen (North-Holland Publishing
Company, 1980), 61-83.


Although there is evidence of earlier work, the first published
computational use of regular expressions I have actually been
able to find is Ken Thompson's 1968 article Regular Expression
Search Algorithm[3] in which he describes a regular expression
compiler that produced IBM 7094 object code. This led to his
work on qed, an editor that formed the basis for the Unix editor
ed.

[3] Communications of the ACM, Vol. 11, No. 6, June 1968.

ed's regular expressions were not as advanced as those in qed,
but they were the first to gain widespread use in non-technical
fields. ed had a command to display lines of the edited file that
matched a given regular expression. The command,
"g/Regular Expression/p", was read "Global Regular Expression
Print." This particular function was so useful that it was made
into its own utility, grep (after which egrep, extended grep, was
later modeled).


3.1.1.1 Grep's metacharacters
The regular expressions supported by grep and other early tools
were quite limited when compared to egrep's. The
metacharacter * was supported, but + and ? were not (the
latter's absence being a particularly strong drawback). grep's
capturing metacharacters were \(···\), with unescaped
parentheses representing literal text.[4] grep supported line
anchors, but in a limited way. If ^ appeared at the beginning of
the regex, it was a metacharacter matching the beginning of
the line. Otherwise, it wasn't a metacharacter at all and just
matched a literal circumflex (also called a "caret"). Similarly, $
was the end-of-line metacharacter only at the end of the regex.
The upshot was that you couldn't do something like
end$|^start. But that's okay, since alternation wasn't
supported either!

[4] Historical trivia: ed (and hence grep) used escaped parentheses rather than unadorned
parentheses as delimiters because Ken Thompson felt regular expressions would be used to work
primarily with C code, where needing to match raw parentheses would be more common than
backreferencing.

The way metacharacters interact is also important. For
example, perhaps grep's largest shortcoming was that star
could not be applied to a parenthesized expression, but only to
a literal character, a character class, or dot. So, in grep,
parentheses were useful only for capturing matched text, and
not for general grouping. In fact, some early versions of grep
didn't even allow nested parentheses.


3.1.1.2 Grep evolves
Although many systems have grep today, you'll note that I've
been using past tense. The past tense refers to the flavor of the
old versions, now upwards of 30 years old. Over time, as
technology advances, older programs are sometimes retrofitted
with additional features, and grep has been no exception.

Along the way, AT&T Bell Labs added some new features, such
as incorporating the \{min,max\} notation from the program
lex. They also fixed the -y option, which in early versions was
supposed to allow case-insensitive matches but worked only
sporadically. Around the same time, people at Berkeley added
start- and end-of-word metacharacters and renamed -y to -i.
Unfortunately, you still couldn't apply star or the other
quantifiers to a parenthesized expression.

3.1.1.3 Egrep evolves
By this time, Alfred Aho (also at AT&T Bell Labs) had written
egrep, which provided most of the richer set of metacharacters
described in Chapter 1. More importantly, he implemented them
in a completely different (and generally better) way. Not only
were + and ? added, but they could be applied to
parenthesized expressions, greatly increasing egrep's expressive
power.

Alternation was added as well, and the line anchors were
upgraded to "first-class" status so that you could use them
almost anywhere in your regex. However, egrep had problems
as well: sometimes it would find a match but not display the
result, and it didn't have some useful features that are now
popular. Nevertheless, it was a vastly more useful tool.

3.1.1.4 Other species evolve
At the same time, other programs such as awk, lex, and sed
were growing and changing at their own pace. Often,
developers who liked a feature from one program tried to add it
to another. Sometimes, the result wasn't pretty. For example, if
support for plus was added to grep, + by itself couldn't be used
because grep had a long history of a raw '+' not being a
metacharacter, and suddenly making it one would have
surprised users. Since '\+' was probably not something a grep
user would have otherwise normally typed, it could safely be
subsumed as the "one or more" metacharacter.

Sometimes new bugs were introduced as features were added.
Other times, added features were later removed. There was
little to no documentation for the many subtle points that round
out a tool's flavor, so new tools either made up their own style,
or attempted to mimic "what seemed to work" with other tools.
Multiply that by the passage of time and numerous
programmers, and the result is general confusion (particularly
when you try to deal with everything at once).[5]

[5] Such as when writing a book about regular expressions: ask me, I know!


3.1.1.5 POSIX: An attempt at standardization
POSIX, short for Portable Operating System Interface, is a
wide-ranging standard put forth in 1986 to ensure portability
across operating systems. Several parts of this standard deal
with regular expressions and the traditional tools that use them,
so it's of some interest to us. None of the flavors covered in this
book, however, strictly adhere to all the relevant parts. In an
effort to reorganize the mess that regular expressions had
become, POSIX distills the various common flavors into just two
classes of regex flavor, Basic Regular Expressions (BREs), and
Extended Regular Expressions (EREs). POSIX programs then
support one flavor or the other. Table 3-1 below summarizes the
metacharacters in the two flavors.

One important feature of the POSIX standard is the notion of a
locale, a collection of settings that describe language and
cultural conventions for such things as the display of dates,
times, and monetary values, the interpretation of characters in
the active encoding, and so on. Locales aim to allow programs
to be internationalized. They are not a regex-specific concept,
although they can affect regular-expression use. For example,
when working with a locale that describes the Latin-1
(ISO-8859-1) encoding, à and À (characters with ordinal values 224
and 192, respectively) are considered "letters," and any
application of a regex that ignores capitalization would know to
treat them as identical.

Table 3-1. Overview of POSIX Regex Flavors
Regex feature                           BREs             EREs
dot, ^, $, [···], [^···]                yes              yes
"any number" quantifier                 *                *
+ and ? quantifiers                     no               + ?
range quantifier                        \{min,max\}      {min,max}
grouping                                \(···\)          (···)
can apply quantifiers to parentheses    yes              yes
backreferences                          \1 through \9    no
alternation                             no               yes


Another example is \w, commonly provided as a shorthand for
a "word-constituent character" (ostensibly, the same as
[a-zA-Z0-9_] in many flavors). This feature is not required by POSIX,
but it is allowed. If supported, \w would know to allow all
letters and digits defined in the locale, not just those in ASCII.
Note, however, that the need for this aspect of locales is mostly
alleviated when working with tools that support Unicode.
Unicode is discussed further beginning in Section 3.3.2.2.
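
The difference between an ASCII-only \w and one that knows about other letters can be seen in a short sketch. It uses Python 3's re module, whose \w is Unicode-aware by default for text patterns; this stands in for, and is not the same mechanism as, a POSIX locale. The sample word is an arbitrary assumption.

```python
import re

word = "déjà_vu2"

# Default (Unicode-aware) \w accepts accented letters, digits, underscore.
unicode_aware = re.fullmatch(r"\w+", word) is not None

# With re.ASCII, \w shrinks back to [a-zA-Z0-9_].
ascii_only = re.fullmatch(r"\w+", word, re.ASCII) is not None

# A case-insensitive match treats the two forms of the letter as the same.
same_letter = re.fullmatch("à", "À", re.IGNORECASE) is not None

print(unicode_aware, ascii_only, same_letter)
```

The same expression therefore accepts or rejects the same input depending on which notion of "letter" the tool applies, which is the portability concern raised above.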

3.1.1.6 Henry Spencer's regex package
Also first appearing in 1986, and perhaps of more importance,
was the release by Henry Spencer of a regex package, written
in C, which could be freely incorporated by others into their own
programs, a first at the time. Every program that used Henry's
package (and there were many) provided the same consistent
regex flavor unless the program's author went to the explicit
trouble to change it.

3.1.1.7 Perl evolves
At about the same time, Larry Wall started developing a tool
that would later become the language Perl. He had already
greatly enhanced distributed software development with his
patch program, but Perl was destined to have a truly
monumental impact.

Larry released Perl Version 1 in December 1987. Perl was an
immediate hit because it blended so many useful features of
other languages, and combined them with the explicit goal of
being, in a day-to-day practical sense, useful.

One immediately notable feature was a set of regular
expression operators in the tradition of the specialty tools sed
and awk, a first for a general scripting language. For the regular
expression engine, Larry borrowed code from an earlier project,
his news reader rn (which based its regular expression code on
that in James Gosling's Emacs).[6] The regex flavor was
considered powerful by the day's standards, but was not nearly
as full-featured as it is today. Its major drawbacks were that it
supported at most nine sets of parentheses, and at most nine
alternatives with |, and worst of all, | was not allowed within
parentheses. It did not support case-insensitive matching, nor
allow \w within a class (it didn't support \s or \d anywhere).
It didn't support the {min,max} range quantifier.

[6] James Gosling would later go on to develop his own language, Java, which somewhat
ironically does not natively support regular expressions. Java 1.4, however, does include a
wonderful regular expression package, covered in depth in Chapter 8.

Perl 2 was released in June 1988. Larry had replaced the regex
code entirely, this time using a greatly enhanced version of the
Henry Spencer package mentioned in the previous section. You
could still have at most nine sets of parentheses, but now you
could use | inside them. Support for \d and \s was added,
and support for \w was changed to include an underscore,
since then it would match what characters were allowed in a
Perl variable name. Furthermore, these metacharacters were
now allowed inside classes. (Their opposites, \D, \W, and \S,
were also newly supported, but weren't allowed within a class,
and in any case sometimes didn't work correctly.) Importantly,
the /i modifier was added, so you could now do
case-insensitive matching.

Perl 3 came out more than a year later, in October 1989. It
added the /e modifier, which greatly increased the power of the
replacement operator, and fixed some backreference-related
bugs from the previous version. It added the {min,max} range
quantifiers, although unfortunately, they didn't always work
quite right. Worse still, with Version 3, the regular expression
engine couldn't always work with 8-bit data, yielding
unpredictable results with non-ASCII input.

Perl 4 was released a year and a half later, in March 1991, and over
the next two years, it was improved until its last update in
February 1993. By this time, the bugs were fixed and
restrictions expanded (you could use \D and such within
character classes, and a regular expression could have virtually
unlimited sets of parentheses). Work also went into optimizing
how the regex engine went about its task, but the real
breakthrough wouldn't happen until 1994.
Perl 5 was officially released in October 1994. Overall, Perl had
undergone a massive overhaul, and the result was a vastly
superior language in every respect. On the regular-expression
side, it had more internal optimizations, and a few
metacharacters were added (including \G, which increased the
power of iterative matches; see Section 3.4.3.3), non-capturing
parentheses (see Section 2.2.3.1), lazy quantifiers (see Section
3.4.5.9), lookahead (see Section 2.3.5.1), and the /x modifier[7]
(see Section 2.3.6.4).

[7] My claim to fame is that Larry added the /x modifier after seeing a note from me discussing
a long and complex regex. In the note, I had "pretty printed" the regular expression for clarity.
Upon seeing it, he thought that it would be convenient to do so in Perl code as well, so he added
/x.

More important than just for their raw functionality, these
"outside the box" modifications made it clear that regular
expressions could really be a powerful programming language
unto themselves, and were still ripe for further development.
The newly-added non-capturing parentheses and lookahead
constructs required a way to be expressed. None of the
grouping pairs (···), [···], <···>, or {···} were available
to be used for these new features, so Larry came up with the
various '(?' notations we use today. He chose this unsightly
sequence because it previously would have been an illegal
combination in a Perl regex, so he was free to give it meaning.
One important consideration Larry had the foresight to
recognize was that there would likely be additional functionality
in the future, so by restricting what was allowed after the '(?'
sequences, he was able to reserve them for future
enhancements.

Subsequent versions of Perl grew more robust, with fewer bugs,
more internal optimizations, and new features. I like to believe
that the first edition of this book played some small part in this,
for as I researched and tested regex-related features, I would
send my results to Larry and the Perl Porters group, which
helped give some direction as to where improvements might be
made.

New regex features added over the years include limited
lookbehind (see Section 2.3.5.1), "atomic" grouping (see
Section 3.4.5.4), and Unicode support. Regular expressions
were brought to the next level by the addition of conditional
constructs (see Section 3.4.5.6), allowing you to make
if-then-else decisions right there as part of the regular expression. And
if that wasn't enough, there are now constructs that allow you
to intermingle Perl code within a regular expression, which
takes things full circle (see Section 7.8). The version of Perl
covered in this book is 5.8.
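
Two of the later additions mentioned above, conditionals and lookbehind, can be tried in any flavor that adopted them. The sketch below uses Python's re module as one such flavor, not Perl itself; the pattern names and sample strings are illustrative assumptions.

```python
import re

# Conditional: demand a closing '>' only if an opening '<' was matched.
# (?(1)>) reads "if group 1 participated in the match, then match '>'".
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")

# Lookbehind: match digits only when a dollar sign comes just before them.
price = re.compile(r"(?<=\$)\d+")

for candidate in ("<word>", "word", "<word", "word>"):
    print(candidate, bool(bracketed.match(candidate)))

print(price.search("cost: $42 for 3 items").group())
```

The conditional makes the if-then decision inside the expression itself, so no surrounding program logic is needed to reject the two unbalanced candidates.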

3.1.1.8 A partial consolidation of flavors
The advances seen in Perl 5 were perfectly timed for the World
Wide Web revolution. Perl was built for text processing, and the
building of web pages is just that, so Perl quickly became the
language for web development. Perl became vastly more
popular, and with it, its powerful regular expression flavor did as
well.

Developers of other languages were not blind to this power, and
eventually regular expression packages that were "Perl
compatible" to one extent or another were created. Among
these were packages for Tcl, Python, Microsoft's .NET suite of
languages, Ruby, PHP, C/C++, and many packages for Java.

3.1.1.9 Versions as of this book
Table 3-2 shows a few of the version numbers for programs and
libraries that I talk about in the book. Older versions may well
have fewer features and more bugs, while newer versions may
have additional features and bug fixes (and new bugs of their
own).

Because Java did not originally come with regex support,
numerous regex libraries have been developed over the years,
so anyone wishing to use regular expressions in Java needed to
find them, evaluate them, and ultimately select one to use.
Chapter 8 looks at seven such packages, and ways to evaluate
them. For reasons discussed there, the regex package that Sun
eventually came up with (their java.util.regex, now standard
as of Java 1.4) is what I use for most of the Java examples in
this book.

Table 3-2. Versions of Some Tools Mentioned in This Book
GNU awk 3.1
GNU egrep/grep 2.4.2
GNU Emacs 21.2.1
flex 2.5.4
java.util.regex (Java 1.4.0)
MySQL 3.23.49
.NET Framework 2002 (1.0.3705)
PCRE 3.8
Perl 5.8
PHP (preg routines) 4.0.6
Procmail 3.22
Python 2.2.1
Ruby 1.6.7
GNU sed 3.02
Tcl 8.4

3.1.2 At a Glance
A chart showing just a few aspects of some common tools gives
a good clue to how different things still are. Table 3-3 provides
a very superficial look at a few aspects of the regex flavors of a
few tools.
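
Table 3-3. A (Very) Superficial Look at the Flavor of a Few Common Tools
Tools compared: Modern grep, Modern egrep, GNU Emacs, Tcl, Perl, .NET, Sun's Java package
Features compared: *, ^, $, [···]; ? + |; grouping; (?:···); word boundary; \w, \W; backreferences

? + |           grep: \? \+ \|    egrep: ? + |    GNU Emacs: ? + \|    Tcl, Perl, .NET, Java: ? + |
grouping        grep: \(···\)     egrep: (···)    GNU Emacs: \(···\)   Tcl, Perl, .NET, Java: (···)
word boundary   egrep: \< \>    GNU Emacs: \< \> \b, \B    Tcl: \m, \M, \y    Perl, .NET, Java: \b, \B
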

A chart like Table 3-3 is often found in other books to show the
differences among tools. But this chart is only the tip of the
iceberg: for every feature shown, there are a dozen important
issues that are overlooked.

Foremost is that programs change over time. For example, Tcl
used to not support backreferences and word boundaries, but
now does. It first supported word boundaries with the
ungainly-looking [:<:] and [:>:], and still does, although such use is
deprecated in favor of its more-recently supported \m, \M,
and \y (start of word boundary, end of word boundary, or
either).

Along the same lines, programs such as grep and egrep, which
aren't from a single provider but rather can be provided by
anyone who wants to create them, can have whatever flavor
the individual author of the program wishes. Human nature
being what it is, each tends to have its own features and
peculiarities. (The GNU versions of many common tools, for
example, are often more powerful and robust than other
versions.)

And perhaps as important as the easily visible features are the
many subtle (and some not-so-subtle) differences among
flavors. Looking at the table, one might think that regular
expressions are exactly the same in Perl, .NET, and Java, which
is certainly not true. Just a few of the questions one might ask
when looking at something like Table 3-3 are:

• Are star and friends allowed to quantify something wrapped
  in parentheses?

• Does dot match a newline? Do negated character classes
  match it? Do either match the null character?

• Are the line anchors really line anchors (i.e., do they
  recognize newlines that might be embedded within the
  target string)? Are they first-class metacharacters, or are
  they valid only in certain parts of the regex?

• Are escapes recognized in character classes? What else is or
  isn't allowed within character classes?

• Are parentheses allowed to be nested? If so, how deeply
  (and how many parentheses are even allowed in the first
  place)?

• If backreferences are allowed, when a case-insensitive
  match is requested, do backreferences match
  appropriately? Do backreferences "behave" reasonably in
  fringe situations?

• Are octal escapes such as \123 allowed? If so, how do they
  reconcile the syntactic conflict with backreferences? What
  about hexadecimal escapes? Is it really the regex engine
  that supports octal and hexadecimal escapes, or is it some
  other part of the utility?

• Does \w match only alphanumerics, or additional
  characters as well? (Among the programs shown supporting
  \w in Table 3-3, there are several different interpretations.)
  Does \w agree with the various word-boundary
  metacharacters on what does and doesn't constitute a
  "word character"? Do they respect the locale, or understand
  Unicode?
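
Questions like these are usually quickest to answer by experiment. The sketch below probes a handful of them against Python's re module, chosen only as a convenient flavor to test; the probe labels are my own wording, and the answers it prints apply to that one flavor with its default settings.

```python
import re

def found(regex, text):
    """True if the regex matches somewhere in text."""
    return re.search(regex, text) is not None

probes = {
    "dot matches newline": found(r".", "\n"),
    "negated class matches newline": found(r"[^x]", "\n"),
    "dot matches the null character": found(r".", "\0"),
    "^ matches after an embedded newline": found(r"^b", "a\nb"),
    "nested parentheses allowed": found(r"((a)(b(c)))", "abc"),
    "backreference honors case-insensitivity": found(r"(?i)(a)\1", "aA"),
    "three-digit \\123 is an octal escape": found(r"\123", "S"),
    "\\w matches underscore": found(r"\w", "_"),
}

for question, answer in probes.items():
    print(f"{question:42} {answer}")
```

Running the same probes against another tool, after translating them into its syntax, gives a quick picture of where the two flavors part ways.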


Many issues must be kept in mind, even with a tidy little
summary like Table 3-3 as a superficial guide. (As another
example, peek ahead to Table 8-1 for a look at a chart showing
some differences among Java packages.) If you realize that
there's a lot of dirty laundry behind that nice façade, it's not too
difficult to keep your wits about you and deal with it.

As mentioned at the start of the chapter, much of this is just
superficial syntax, but many issues go deeper. For example,
once you understand that something such as (Jul|July) in
egrep needs to be written as \(Jul\|July\) for GNU Emacs,
you might think that everything is the same from there, but
that's not always the case. The differences in the semantics of
how a match is attempted (or, at least, how it appears to be
attempted) is an extremely important issue that is often
overlooked, yet it explains why these two apparently identical
examples would actually end up matching differently: one
always matches 'Jul', even when applied to 'July'. Those very
same semantics also explain why the opposite, (July|Jul)
and \(July\|Jul\), do match the same text. Again, the
entire next chapter is devoted to understanding this.
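
The 'Jul' versus 'July' difference is easy to observe in a backtracking engine that tries alternatives in order. The sketch below uses Python's re module, which behaves like the "always matches 'Jul'" case described above; a tool that reports the longest alternative would show 'July' for both orderings. The helper name first_match is hypothetical.

```python
import re

def first_match(regex, text):
    """Return the text matched by regex in text, or None."""
    m = re.search(regex, text)
    return m.group() if m else None

print(first_match(r"(Jul|July)", "July 4"))   # first alternative wins
print(first_match(r"(July|Jul)", "July 4"))   # longer alternative listed first
print(first_match(r"(July|Jul)", "Jul 4"))    # falls back to the second
```

With ordered alternation, the order in which the alternatives are written is part of the meaning of the expression, not just a matter of style.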

Of course, what a tool can do with a regular expression is often
more important than the flavor of its regular expressions. For
example, even if Perl's expressions were less powerful than
egrep's, Perl's flexible use of regexes provides for more raw
usefulness. We'll look at a lot of individual features in this
chapter, and in depth at a few languages in later chapters.

