8.3 Packages, Packages, Packages
There are many regex packages for Java; the list that follows
has a few words about those that I investigated while
researching this book. (See this book's web page, regex.info/,
for links.) The table below gives a superficial overview of some
of the differences among their flavors.
Sun
java.util.regex  Sun's own regex package, finally
standard as of Java 1.4. It's a solid, actively maintained
package that provides a rich Perl-like flavor. It has the best
Unicode support of these packages. It provides all the basic
functionality you might need, but has only minimal
convenience functions. It matches against CharSequence
objects, and so is extremely flexible in that respect. Its
documentation is clear and complete. It is the all-around
fastest of the engines listed here. This package is described
in detail later in this chapter.
Version Tested: 1.4.0.
License: comes as part of Sun's JRE. Source code is
available under SCSL (Sun Community Source Licensing).
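To make the CharSequence point concrete, here is a minimal sketch. The class and method names (CharSequenceDemo, firstNumber) are hypothetical and chosen only for illustration; StringBuilder postdates Java 1.4, where StringBuffer would play the same role.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharSequenceDemo {
    /** Returns the first run of digits found in any CharSequence, or null if there is none. */
    static String firstNumber(CharSequence text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The same method accepts a String, a StringBuilder, or a wrapped char[].
        System.out.println(firstNumber("Java 1.4"));
        System.out.println(firstNumber(new StringBuilder("build 42")));
        System.out.println(firstNumber(CharBuffer.wrap("x7y".toCharArray())));
    }
}
```

Because the matcher only needs the CharSequence interface, no copy into a String is required before matching.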
IBM
com.ibm.regex  This is IBM's commercial regex package
(although it's said to be similar to the
org.apache.xerces.utils.regex package, which I did not
investigate). It's actively maintained, and provides a rich
Perl-like flavor, although it is somewhat buggy in certain
areas. It has very good Unicode support. It can match
against char[], CharacterIterator, and String. Overall, it's
not quite as fast as Sun's package, but it's the only other
package that's in the same class.
Version Tested: 1.0.0.
License: commercial product
Table 8-1. Superficial Overview of Some Java Package Flavor Differences
(Version info: see Section 8.3)

Feature             Sun      IBM      ORO   JRegex  Pat   GNU        Regexp
Engine type         NFA      NFA      NFA   NFA     NFA   POSIX NFA  NFA
dot doesn't match:  various  various  \n    \n,\r   \n    \r\n       \n

Features compared, by group:
Basic Functionality: Engine type; Deeply-nested parens;
  dot doesn't match; \s includes [•\t\r\n\f]; \w includes
  underscore; Class set operators; POSIX [[:···:]]
Metacharacter Support: \A,\z,\Z; \G; (?#···); Octal escapes;
  2-, 4-, 6-digit hex escapes; Lazy quantifiers; Atomic grouping;
  Possessive quantifiers; Word boundaries; Non-word boundaries;
  \Q···\E; (if then|else) conditional; Non-capturing parens;
  Lookahead; Lookbehind; (?mod); (?-mod:···); (?mod:···)
Unicode-Aware Metacharacters: Unicode properties; Unicode blocks;
  dot, ^, $; \d; \s; \w; Word boundaries

Entries printed for other rows, in the order they appear:
\A,\z,\Z:        \A,\Z   \A,\z,\Z   \A,\z,\Z   \A,\z,\Z   \A,\Z   \A,\Z
Hex escapes:     2,4   2,4,6   2   2,4,6   2   2,4
Word boundaries: \b   \b   \<\b\>   \b   \<\>   \b
Unicode-aware metacharacters: partial   partial   partial

Key to cell marks: supported; partial support; supported, but buggy
ORO
org.apache.oro.text.regex  The Apache Jakarta project
has two unrelated regex packages, one of which is "Jakarta
ORO." It actually contains multiple regex engines, each
targeting a different application. I looked at one engine, the
very popular Perl5Compiler matcher. It's actively
maintained, and solid, although its version of a Perl-like
flavor is much less rich than either the Sun or the IBM
packages. It has minimal Unicode support. Overall, the
regex engine is notably slower than most other packages.
Its \G is broken. It can match against char[] and String.
One of its strongest points is that it has a vast, modular
structure that exposes almost all of the mechanics that
surround the engine (the transmission, search-and-replace
mechanics, etc.) so advanced users can tune it to suit their
needs, but it also comes replete with a fantastic set of
convenience functions that makes it one of the easiest
packages to work with, particularly for those coming from a
Perl background (or for those having read Chapter 2 of this
book). This is discussed in more detail later in this chapter.
Version Tested: 2.0.6.
License: ASL (Apache Software License)
JRegex
jregex  Has the same object model as Sun's package, with
a fairly rich Perl-like feature set. It has good Unicode
support. Its speed places it in the middle of the pack.
Version Tested: v1.01
License: GNU-like
Pat
com.stevesoft.pat  It has a fairly rich Perl-like flavor, but
no Unicode support. Very haphazard interface. It has
provisions for modifying the regex flavor on the fly. Its
speed puts it on the high end of the middle of the pack.
Version Tested: 1.5.3
License: GNU LGPL (GNU Lesser General Public License)
GNU
gnu.regexp  The more advanced of the two "GNU regex
packages" for Java. (The other, gnu.rex, is a very small
package providing only the most bare-bones regex flavor
and support, and is not covered in this book.) It has some
Perl-like features, and minimal Unicode support. It's very
slow. It's the only package with a POSIX NFA (although its
POSIXness is a bit buggy at times).
Version Tested: 1.1.4
License: GNU LGPL (GNU Lesser General Public License)
Regexp
org.apache.regexp  This is the other regex package under
the umbrella of the Apache Jakarta project. It's somewhat
popular, but quite buggy. It has the fewest features of the
packages listed here. Its overall speed is on par with ORO.
Not actively maintained. Minimal Unicode support.
Version Tested: 1.2
License: ASL (Apache Software License)
8.3.1 Why So Many "Perl5" Flavors?
The list mentions "Perl-like" fairly often; the packages
themselves advertise "Perl5 support." When version 5 of Perl
was released in 1994 (see Section 3.1.1.7), it introduced a new
level of regular-expression innovation that others, including
Java regex developers, could well appreciate. Perl's regex flavor
is powerful, and its adoption by a wide variety of packages and
languages has made it somewhat of a de facto standard.
However, of the many packages, programs, and languages that
claim to be "Perl5 compliant," none truly are. Even Perl itself
differs from version to version as new features are added and
bugs are fixed. Some of the innovations new with early 5.x
versions of Perl were non-capturing parentheses, lazy
quantifiers, lookahead, inline mode modifiers like (?i) , and
the /x free-spacing mode (all discussed in Chapter 3). Packages
supporting only these features claim a "Perl5" flavor, but miss
out on later innovations, such as lookbehind, atomic grouping,
and conditionals.
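As a rough illustration of those early Perl 5 features, the following sketch uses each of them in one java.util.regex pattern. The class name, the pattern, and the tag-matching task are assumptions made up for this example; free-spacing mode is spelled (?x) inline here (Pattern.COMMENTS as a flag).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EarlyPerl5Features {
    // (?i)   inline case-insensitive mode    (?x)   free-spacing mode with # comments
    // (?:..) non-capturing parentheses       +?     lazy quantifier
    // (?=..) lookahead
    static final Pattern TAG = Pattern.compile(
        "(?ix)            # case-insensitive, free-spacing\n" +
        " < (?:b|i) >     # opening b or i tag, grouped but not captured\n" +
        " (.+?)           # lazily capture the enclosed text\n" +
        " (?= </ )        # stop where a closing tag begins, without consuming it\n");

    /** Returns the text inside the first <b> or <i> element, or null if there is none. */
    static String firstTagBody(String html) {
        Matcher m = TAG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTagBody("<B>bold</B> and <i>it</i>"));
    }
}
```

Only the lazily matched text is captured as group 1, because the alternation for the tag name sits in non-capturing parentheses.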
There are also times when a package doesn't limit itself to only
"Perl5" enhancements. Sun's package, for example, supports
possessive quantifiers, and both Sun and IBM support character
class set operations. Pat offers an innovative way to do
lookbehind, and a way to allow matching of simple arbitrarily
nested constructs.
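A small sketch of the two Sun extensions just mentioned, using java.util.regex. The class and field names are hypothetical, and the patterns are invented purely to show the syntax.

```java
import java.util.regex.Pattern;

final class BeyondPerl5 {
    // Class set operation (intersection): lowercase ASCII letters that are not vowels.
    static final Pattern CONSONANTS = Pattern.compile("[a-z&&[^aeiou]]+");

    // Greedy: \d+ can give back a digit so that the trailing 5 can match.
    static final Pattern GREEDY = Pattern.compile("\\d+5");

    // Possessive: \d++ never gives back what it has matched, so the trailing 5 can't match.
    static final Pattern POSSESSIVE = Pattern.compile("\\d++5");

    public static void main(String[] args) {
        System.out.println(CONSONANTS.matcher("rhythm").matches());
        System.out.println(GREEDY.matcher("12345").matches());
        System.out.println(POSSESSIVE.matcher("12345").find());
    }
}
```

The possessive version fails on "12345" at every starting position, since the digits it grabs always include the final 5.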
8.3.2 Lies, Damn Lies, and Benchmarks
It's probably a common twist on Sam Clemens' famous "lies,
damn lies, and statistics" quote, but when I saw its use with
"benchmarks" in a paper from Sun while doing research for this
chapter, I knew it was an appropriate introduction for this
section. In researching these seven packages, I've run literally
thousands of benchmarks, but the only fact that's clearly
emerged is that there are no clear conclusions.
There are several things that cloud regex benchmarking with
Java. First, there are language issues. Recall the benchmarking
discussion from Chapter 6 (see Section 6.3.2), and the special
issues that make benchmarking Java a slippery science at best
(primarily, the effects of the Just-In-Time or
Better-Late-Than-Never compiler). In doing these benchmarks,
I've made sure to use a server VM that was "warmed up" for the
benchmark (see "BLTN," Section 6.3.2), to show the truest results.
Then there are regex issues. Due to the complex interactions of
the myriad of optimizations like those discussed in Chapter 6, a
seemingly inconsequential change while trying to test one
feature might tickle the optimization of an unrelated feature,
anonymously skewing the results one way or the other. I did
many (many!) very specific tests, usually approaching an issue
from multiple directions, and so I believe I've been able to get
meaningful results... but one never truly knows.
8.3.2.1 Warning: Benchmark results can cause drowsiness!
Just to show how slippery this all can be, recall that I judged
the two Jakarta packages (ORO and Regexp) to be roughly
comparable in speed. Indeed, they finished equally in some of
the many benchmarks I ran, but for the most part, one
generally ran at least twice the speed of the other (sometimes
10x or 20x the speed). But which was "one" and which "the
other" changed depending upon the test.
For example, I targeted the speed of greedy and lazy
quantifiers by applying ^.*: and ^.*?: to a very long
string like '···xxx:x'. I expected the greedy one to be faster
than the lazy one with this type of string, and indeed, it's that
way for every package, program, and language I tested...
except one. For whatever reason, Jakarta's Regexp's ^.*:
performed 70% slower than its ^.*?: . I then applied the
same expressions to a similarly long string, but this time one
like 'x:xxx···' where the ':' is near the beginning. This should
give the lazy quantifier an edge, and indeed, with Regexp, the
expression with the lazy quantifier finished 670x faster than the
greedy. To gain more insight, I applied ^[^:]*: to each
string. This should be in the same ballpark, I thought, as the
lazy version, but highly contingent upon certain optimizations
that may or may not be included in the engine. With Regexp, it
finished the test a bit slower than the lazy version, for both
strings.
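For readers who want to try this kind of comparison themselves, here is a minimal timing sketch for java.util.regex. The class and helper names, string length, and iteration counts are all assumptions invented for illustration, and this is not the harness used for the results reported in this section. It includes a warm-up phase so the measured loop is less affected by JIT compilation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierTiming {
    /** Builds a string of n 'x' characters with a single ':' inserted at index colonAt. */
    static String subject(int n, int colonAt) {
        StringBuilder sb = new StringBuilder(n + 1);
        for (int i = 0; i < n; i++) sb.append('x');
        sb.insert(colonAt, ':');
        return sb.toString();
    }

    /** Returns the end offset of the first match, or -1 if there is none. */
    static int matchLength(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.end() : -1;
    }

    /** Average nanoseconds per match attempt, measured after a warm-up phase. */
    static long nanosPerMatch(Pattern p, String text, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) matchLength(p, text);
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) matchLength(p, text);
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        String colonLate = subject(10000, 9999);  // like '...xxx:x'
        String colonEarly = subject(10000, 1);    // like 'x:xxx...'
        for (String regex : new String[] {"^.*:", "^.*?:", "^[^:]*:"}) {
            Pattern p = Pattern.compile(regex);
            System.out.println(regex
                + "  colon late: " + nanosPerMatch(p, colonLate, 2000, 2000) + " ns"
                + "  colon early: " + nanosPerMatch(p, colonEarly, 2000, 2000) + " ns");
        }
    }
}
```

All three expressions match the same text on these strings (there is only one colon), so any difference in the numbers reflects how the engine gets there, not what it finds. Expect the figures to vary from run to run and from VM to VM.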
Does the previous paragraph make your eyes glaze over a bit?
Well, it discusses just six tests, and for only one regex package;
we haven't even started to compare these Regexp results
against ORO or any of the other packages. When compared
against ORO, it turns out that Regexp is about 10x slower with
four of the tests, but about 20x faster with the other two! It's
faster with ^.*?: and ^[^:]*: applied to the long string with
':' at the front, so it seems that Regexp does poorly (or ORO
does well) when the engine must walk through a lot of string,
and that the speeds are reversed when the match is found
quickly.
Are your eyes completely glazed over yet? Let's try the same set
of six tests, but this time on short strings instead of very long
ones. It turns out that Regexp is three to ten times faster
than ORO for all of them. Okay, so what does this tell us?
Perhaps that ORO has a lot of clunky overhead that
overshadows the actual match time when the matches are
found quickly. Or perhaps it means that Regexp is generally
much faster, but has an inefficient mechanism for accessing the
target string. Or perhaps it's something else altogether. I don't
know.
Another test involved an "exponential match" (see Section
6.1.4) on a short string, which tests the basic churning of an
engine as it tracks and backtracks. These tests took a long
time, yet Regexp tended to finish in half the time of ORO. There
just seems to be no rhyme nor reason to the results. Such is
often the case when benchmarking something as complex as a
regex engine.
8.3.2.2 And the winner is...
The mind-numbing statistics just discussed take into account
only a small fraction of the many, varied tests I did. In looking
at them all for Regexp and ORO, one package does not stand
out as being faster overall. Rather, the good points and bad
points seem to be distributed fairly evenly between the two, so
I (perhaps somewhat arbitrarily) judge them to be about equal.
Adding the benchmarks from the five other packages into the
mix results in a lot of drowsiness for your author, and no
obviously clear winner, but overall, Sun's package seems to be
the fastest, followed closely by IBM's. Following in a group
somewhat behind are Pat, JRegex, Regexp, and ORO. The GNU
package is clearly the slowest.
The overall difference between Sun and IBM is not so obviously
clear that another equally comprehensive benchmark suite
wouldn't show the opposite order if the suite happened to be
tweaked slightly differently than mine. Or, for that matter, it's
entirely possible that someone looking at all my benchmark
data would reach a different conclusion. And, of course, the
results could change drastically with the next release of any of
the packages or virtual machines (and may well have, by the
time you read this). It's a slippery science.
In general, Sun did most things very well, but it's missing a few
key optimizations, and some constructs (such as character
classes) are much slower than one would expect. Over time,
these will likely be addressed by Sun (and in fact, the slowness
of character classes is slated to be fixed in Java 1.4.2). The
source code is available if you'd like to hack on it as well; I'm
sure Sun would appreciate ideas and patches that improve it.
8.3.3 Recommendations
There are many reasons one might choose one package over
another, but Sun's java.util.regex package, with its high
quality, speed, good Unicode support, advanced features, and
future ubiquity, is a good recommendation. It comes integrated
as part of Java 1.4: String.matches(), for example, checks to
see whether the string can be completely matched by a given
regex.
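The "completely matched" part is easy to overlook, so here is a brief sketch contrasting String.matches() with a Matcher's find(). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class WholeStringMatch {
    /** True only if the entire string consists of digits. */
    static boolean isWholeNumber(String s) {
        return s.matches("\\d+");
    }

    /** True if digits appear anywhere in the string. */
    static boolean containsNumber(String s) {
        return Pattern.compile("\\d+").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isWholeNumber("2002"));       // whole string is digits
        System.out.println(isWholeNumber("Java 1.4"));   // digits present, but not the whole string
        System.out.println(containsNumber("Java 1.4"));  // a search rather than a full match
    }
}
```

In other words, String.matches() behaves as if the regex were wrapped in anchors at both ends, which differs from the "find it anywhere" behavior of Perl's match operator.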
java.util.regex's strengths lie in its core engine, but it
doesn't have a good set of "convenience functions," a layer that
hides much of the drudgery of bit-shuffling behind the scenes.
ORO, on the other hand, while its core engine isn't as strong,
does have a strong support layer. It provides a very convenient
set of functions for casual use, as well as the core interface for
specialized needs. ORO is designed to allow multiple regex core
engines to be plugged in, so the combination of
java.util.regex with ORO sounds very appealing. I've talked
to the ORO developer, and it seems likely that this will happen,
so the rest of this chapter looks at Sun's java.util.regex and
ORO's interface.
Chapter 2. Extended Introductory Examples
Remember the doubled-word problem from the first chapter? I
said that a full solution could be written in just a few lines in a
language like Perl. Such a solution might look like:
$/ = ".\n";
while (<>) {
  next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[
  s/^(?:[^\e]*\n)+//mg;   # Remove any unmarked lines.
  s/^/$ARGV:/mg;          # Ensure lines begin with filena
  print;
}
Yup, that's the whole program.
Even if you're familiar with Perl, I don't expect you to
understand it (yet!). Rather, I wanted to show an example
beyond what egrep can allow, and to whet your appetite for the
real power of regular expressions.
Most of this program's work revolves around its three regular
expressions:
\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)
^(?:[^\e]*\n)+
^
Though this is a Perl example, these three regular expressions
can be used verbatim (or with only a few changes) in many
other languages, including Python, Java, Visual Basic .NET, Tcl,
and more.
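As one illustration of that portability, the first of the three expressions can be dropped into Java almost unchanged; only the backslashes need doubling inside the string literal. This is a sketch, not a port of the Perl program: the class name, the mark() method, and the square-bracket markers (in place of the terminal escape sequences) are assumptions made for the example.

```java
import java.util.regex.Pattern;

final class DoubledWords {
    // The first regex from the Perl program; CASE_INSENSITIVE is an assumed mode for this sketch.
    static final Pattern DOUBLED = Pattern.compile(
        "\\b([a-z]+)((?:\\s|<[^>]+>)+)(\\1\\b)", Pattern.CASE_INSENSITIVE);

    /** Wraps both words of each doubled pair in square brackets. */
    static String mark(String text) {
        return DOUBLED.matcher(text).replaceAll("[$1]$2[$3]");
    }

    public static void main(String[] args) {
        System.out.println(mark("this is the the test"));
        System.out.println(mark("The <b>the</b> end"));
    }
}
```

Note that "this is" is left alone: the leading \b keeps the "is" at the end of "this" from pairing with the following word.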
Now, looking at these, that last ^ is certainly recognizable, but
the other expressions have items unfamiliar to our egrep-only
experience. This is because Perl's regex flavor is not the same
as egrep's. Some of the notations are different, and Perl (as
well as most modern tools) tends to provide a much richer set of
metacharacters than egrep. We'll see many examples
throughout this chapter.
3.1 A Casual Stroll Across the Regex Landscape
I'd like to start with the story about the evolution of some
regular expression flavors and their associated programs. So,
grab a hot cup (or frosty mug) of your favorite brewed
beverage and relax as we look at the sometimes wacky history
behind the regular expressions we have today. The idea is to
add color to our regex understanding, and to develop a feeling
as to why "the way things are" are the way things are. There
are some footnotes for those that are interested, but for the
most part, this should be read as a light story for enjoyment.
3.1.1 The Origins of Regular Expressions
The seeds of regular expressions were planted in the early
1940s by two neurophysiologists, Warren McCulloch and Walter
Pitts, who developed models of how they believed the nervous
system worked at the neuron level. [1] Regular expressions
became a reality several years later when mathematician
Stephen Kleene formally described these models in an algebra
he called regular sets. He devised a simple notation to express
these regular sets, and called them regular expressions.
[1] "A logical calculus of the ideas immanent in nervous activity," first published in Bulletin of
Math. Biophysics 5 (1943) and later reprinted in Embodiments of Mind (MIT Press, 1965). The
article begins with an interesting summary of how neurons behave (did you know that
intra-neuron impulse speeds can range from 1 all the way to 150 meters per second?), and then
descends into a pit of formulae that is, literally, all Greek to me.
Through the 1950s and 1960s, regular expressions enjoyed a
rich study in theoretical mathematics circles. Robert Constable
has written a good summary [2] for the mathematically inclined.
[2] Robert L. Constable, "The Role of Finite Automata in the Development of Modern Computing
Theory," in The Kleene Symposium, Eds. Barwise, Keisler, and Kunen (North-Holland Publishing
Company, 1980), 61-83.
Although there is evidence of earlier work, the first published
computational use of regular expressions I have actually been
able to find is Ken Thompson's 1968 article Regular Expression
Search Algorithm [3] in which he describes a regular expression
compiler that produced IBM 7094 object code. This led to his
work on qed, an editor that formed the basis for the Unix editor
ed.
[3] Communications of the ACM, Vol. 11, No. 6, June 1968.
ed's regular expressions were not as advanced as those in qed,
but they were the first to gain widespread use in non-technical
fields. ed had a command to display lines of the edited file that
matched a given regular expression. The command,
"g/Regular Expression/p", was read "Global Regular Expression
Print." This particular function was so useful that it was made
into its own utility, grep (after which egrep, extended grep, was
later modeled).
3.1.1.1 Grep's metacharacters
The regular expressions supported by grep and other early tools
were quite limited when compared to egrep's. The
metacharacter * was supported, but + and ? were not (the
latter's absence being a particularly strong drawback). grep's
capturing metacharacters were \(···\), with unescaped
parentheses representing literal text. [4] grep supported line
anchors, but in a limited way. If ^ appeared at the beginning of
the regex, it was a metacharacter matching the beginning of
the line. Otherwise, it wasn't a metacharacter at all and just
matched a literal circumflex (also called a "caret"). Similarly, $
was the end-of-line metacharacter only at the end of the regex.
The upshot was that you couldn't do something like
end$|^start . But that's okay, since alternation wasn't
supported either!
[4] Historical trivia: ed (and hence grep) used escaped parentheses rather than unadorned
parentheses as delimiters because Ken Thompson felt regular expressions would be used to work
primarily with C code, where needing to match raw parentheses would be more common than
backreferencing.
The way metacharacters interact is also important. For
example, perhaps grep's largest shortcoming was that star
could not be applied to a parenthesized expression, but only to
a literal character, a character class, or dot. So, in grep,
parentheses were useful only for capturing matched text, and
not for general grouping. In fact, some early versions of grep
didn't even allow nested parentheses.
3.1.1.2 Grep evolves
Although many systems have grep today, you'll note that I've
been using past tense. The past tense refers to the flavor of the
old versions, now upwards of 30 years old. Over time, as
technology advances, older programs are sometimes retrofitted
with additional features, and grep has been no exception.
Along the way, AT&T Bell Labs added some new features, such
as incorporating the \{min,max\} notation from the program
lex. They also fixed the -y option, which in early versions was
supposed to allow case-insensitive matches but worked only
sporadically. Around the same time, people at Berkeley added
start- and end-of-word metacharacters and renamed -y to -i.
Unfortunately, you still couldn't apply star or the other
quantifiers to a parenthesized expression.
3.1.1.3 Egrep evolves
By this time, Alfred Aho (also at AT&T Bell Labs) had written
egrep, which provided most of the richer set of metacharacters
described in Chapter 1. More importantly, he implemented them
in a completely different (and generally better) way. Not only
were + and ? added, but they could be applied to
parenthesized expressions, greatly increasing egrep's expressive
power.
Alternation was added as well, and the line anchors were
upgraded to "first-class" status so that you could use them
almost anywhere in your regex. However, egrep had problems
as well: sometimes it would find a match but not display the
result, and it didn't have some useful features that are now
popular. Nevertheless, it was a vastly more useful tool.
3.1.1.4 Other species evolve
At the same time, other programs such as awk, lex, and sed
were growing and changing at their own pace. Often,
developers who liked a feature from one program tried to add it
to another. Sometimes, the result wasn't pretty. For example, if
support for plus was added to grep, + by itself couldn't be used
because grep had a long history of a raw '+' not being a
metacharacter, and suddenly making it one would have
surprised users. Since '\+' was probably not something a grep
user would have otherwise normally typed, it could safely be
subsumed as the "one or more" metacharacter.
Sometimes new bugs were introduced as features were added.
Other times, added features were later removed. There was
little to no documentation for the many subtle points that round
out a tool's flavor, so new tools either made up their own style,
or attempted to mimic "what seemed to work" with other tools.
Multiply that by the passage of time and numerous
programmers, and the result is general confusion (particularly
when you try to deal with everything at once). [5]
[5] Such as when writing a book about regular expressions; ask me, I know!
3.1.1.5 POSIX: An attempt at standardization
POSIX, short for Portable Operating System Interface, is a
wide-ranging standard put forth in 1986 to ensure portability
across operating systems. Several parts of this standard deal
with regular expressions and the traditional tools that use them,
so it's of some interest to us. None of the flavors covered in this
book, however, strictly adhere to all the relevant parts. In an
effort to reorganize the mess that regular expressions had
become, POSIX distills the various common flavors into just two
classes of regex flavor, Basic Regular Expressions (BREs), and
Extended Regular Expressions (EREs). POSIX programs then
support one flavor or the other. Table 3-1 below summarizes the
metacharacters in the two flavors.
One important feature of the POSIX standard is the notion of a
locale, a collection of settings that describe language and
cultural conventions for such things as the display of dates,
times, and monetary values, the interpretation of characters in
the active encoding, and so on. Locales aim to allow programs
to be internationalized. They are not a regex-specific concept,
although they can affect regular-expression use. For example,
when working with a locale that describes the Latin-1
(ISO-8859-1) encoding, à and À (characters with ordinal values
224 and 192, respectively) are considered "letters," and any
application of a regex that ignores capitalization would know to
treat them as identical.
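java.util.regex works from Unicode rather than POSIX locales, but the same à/À question comes up there. The sketch below is only a loose analogue of the locale behavior described above; the class and method names are hypothetical. By default, Java's case-insensitive matching considers only US-ASCII letters, and the UNICODE_CASE flag extends it to characters such as these.

```java
import java.util.regex.Pattern;

final class AccentedCase {
    /** Case-insensitive whole-string match, optionally using Unicode-aware case folding. */
    static boolean sameIgnoringCase(String regex, String text, boolean unicodeAware) {
        int flags = Pattern.CASE_INSENSITIVE | (unicodeAware ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E0";  // à (ordinal 224)
        String upper = "\u00C0";  // À (ordinal 192)
        System.out.println(sameIgnoringCase(lower, upper, false));  // ASCII-only case folding
        System.out.println(sameIgnoringCase(lower, upper, true));   // Unicode-aware case folding
    }
}
```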
Table 3-1. Overview of POSIX Regex Flavors
Regex feature                          BREs            EREs
dot, ^, $, [···], [^···]               yes             yes
"any number" quantifier                *               *
+ and ? quantifiers                                    + ?
range quantifier                       \{min,max\}     {min,max}
grouping                               \(···\)         (···)
can apply quantifiers to parentheses   yes             yes
backreferences                         \1 through \9
alternation                                            yes
Another example is \w , commonly provided as a shorthand for
a "word-constituent character" (ostensibly, the same as
[a-zA-Z0-9_] in many flavors). This feature is not required by
POSIX, but it is allowed. If supported, \w would know to allow all
letters and digits defined in the locale, not just those in ASCII.
Note, however, that the need for this aspect of locales is mostly
alleviated when working with tools that support Unicode.
Unicode is discussed further beginning in Section 3.3.2.2.
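To see what "not just those in ASCII" means in practice, here is a small java.util.regex sketch. The names are hypothetical; the UNICODE_CHARACTER_CLASS flag used here was added in Java 7, so it is well beyond the Java 1.4 package otherwise discussed in this book. Without the flag, Java's \w is the ASCII set [a-zA-Z_0-9].

```java
import java.util.regex.Pattern;

final class WordChars {
    /** True if the whole string consists of \w characters under the chosen interpretation. */
    static boolean isWord(String s, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9";  // "café"
        System.out.println(isWord("snake_case1", false));  // ASCII letters, digit, underscore
        System.out.println(isWord(cafe, false));           // é is outside the ASCII \w
        System.out.println(isWord(cafe, true));            // é counts as a letter with the flag
    }
}
```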
3.1.1.6 Henry Spencer's regex package
Also first appearing in 1986, and perhaps of more importance,
was the release by Henry Spencer of a regex package, written
in C, which could be freely incorporated by others into their own
programs, a first at the time. Every program that used Henry's
package (and there were many) provided the same consistent
regex flavor unless the program's author went to the explicit
trouble to change it.
3.1.1.7 Perl evolves
At about the same time, Larry Wall started developing a tool
that would later become the language Perl. He had already
greatly enhanced distributed software development with his
patch program, but Perl was destined to have a truly
monumental impact.
Larry released Perl Version 1 in December 1987. Perl was an
immediate hit because it blended so many useful features of
other languages, and combined them with the explicit goal of
being, in a day-to-day practical sense, useful.
One immediately notable feature was a set of regular
expression operators in the tradition of the specialty tools sed
and awk, a first for a general scripting language. For the regular
expression engine, Larry borrowed code from an earlier project,
his news reader rn (which based its regular expression code on
that in James Gosling's Emacs). [6] The regex flavor was
considered powerful by the day's standards, but was not nearly
as full-featured as it is today. Its major drawbacks were that it
supported at most nine sets of parentheses, and at most nine
alternatives with | , and worst of all, | was not allowed within
parentheses. It did not support case-insensitive matching, nor
allow \w within a class (it didn't support \s or \d anywhere).
It didn't support the {min,max} range quantifier.
[6] James Gosling would later go on to develop his own language, Java, which somewhat
ironically does not natively support regular expressions. Java 1.4, however, does include a
wonderful regular expression package, covered in depth in Chapter 8.
Perl 2 was released in June 1988. Larry had replaced the regex
code entirely, this time using a greatly enhanced version of the
Henry Spencer package mentioned in the previous section. You
could still have at most nine sets of parentheses, but now you
could use | inside them. Support for \d and \s was added,
and support for \w was changed to include an underscore,
since then it would match what characters were allowed in a
Perl variable name. Furthermore, these metacharacters were
now allowed inside classes. (Their opposites, \D , \W , and \S ,
were also newly supported, but weren't allowed within a class,
and in any case sometimes didn't work correctly.) Importantly,
the /i modifier was added, so you could now do
case-insensitive matching.
Perl 3 came out more than a year later, in October 1989. It
added the /e modifier, which greatly increased the power of the
replacement operator, and fixed some backreference-related
bugs from the previous version. It added the {min,max} range
quantifiers, although unfortunately, they didn't always work
quite right. Worse still, with Version 3, the regular expression
engine couldn't always work with 8-bit data, yielding
unpredictable results with non-ASCII input.
Perl 4 was released a year and a half later, in March 1991, and
over the next two years, it was improved until its last update in
February 1993. By this time, the bugs were fixed and
restrictions relaxed (you could use \D and such within
character classes, and a regular expression could have virtually
unlimited sets of parentheses). Work also went into optimizing
how the regex engine went about its task, but the real
breakthrough wouldn't happen until 1994.
Perl 5 was officially released in October 1994. Overall, Perl had
undergone a massive overhaul, and the result was a vastly
superior language in every respect. On the regular-expression
side, it had more internal optimizations, and a few
metacharacters were added (including \G , which increased the
power of iterative matches; see Section 3.4.3.3), non-capturing
parentheses (see Section 2.2.3.1), lazy quantifiers (see Section
3.4.5.9), lookahead (see Section 2.3.5.1), and the /x modifier [7]
(see Section 2.3.6.4).
[7] My claim to fame is that Larry added the /x modifier after seeing a note from me discussing
a long and complex regex. In the note, I had "pretty printed" the regular expression for clarity.
Upon seeing it, he thought that it would be convenient to do so in Perl code as well, so he added
/x.
More important than just for their raw functionality, these
"outside the box" modifications made it clear that regular
expressions could really be a powerful programming language
unto themselves, and were still ripe for further development.
The newly-added non-capturing parentheses and lookahead
constructs required a way to be expressed. None of the
grouping pairs (···), [···], <···>, or {···} were available
to be used for these new features, so Larry came up with the
various '(?' notations we use today. He chose this unsightly
sequence because it previously would have been an illegal
combination in a Perl regex, so he was free to give it meaning.
One important consideration Larry had the foresight to
recognize was that there would likely be additional functionality
in the future, so by restricting what was allowed after the '(?'
sequences, he was able to reserve them for future
enhancements.
Subsequent versions of Perl grew more robust, with fewer bugs,
more internal optimizations, and new features. I like to believe
that the first edition of this book played some small part in this,
for as I researched and tested regex-related features, I would
send my results to Larry and the Perl Porters group, which
helped give some direction as to where improvements might be
made.
New regex features added over the years include limited
lookbehind (see Section 2.3.5.1), "atomic" grouping (see
Section 3.4.5.4), and Unicode support. Regular expressions
were brought to the next level by the addition of conditional
constructs (see Section 3.4.5.6), allowing you to make
if-then-else decisions right there as part of the regular
expression. And if that wasn't enough, there are now constructs
that allow you to intermingle Perl code within a regular
expression, which takes things full circle (see Section 7.8). The
version of Perl covered in this book is 5.8.
3.1.1.8 A partial consolidation of flavors
The advances seen in Perl 5 were perfectly timed for the World
Wide Web revolution. Perl was built for text processing, and the
building of web pages is just that, so Perl quickly became the
language for web development. Perl became vastly more
popular, and with it, its powerful regular expression flavor did as
well.
Developers of other languages were not blind to this power, and
eventually regular expression packages that were "Perl
compatible" to one extent or another were created. Among
these were packages for Tcl, Python, Microsoft's .NET suite of
languages, Ruby, PHP, C/C++, and many packages for Java.
3.1.1.9 Versions as of this book
Table 3-2 shows a few of the version numbers for programs and
libraries that I talk about in the book. Older versions may well
have fewer features and more bugs, while newer versions may
have additional features and bug fixes (and new bugs of their
own).
Because Java did not originally come with regex support,
numerous regex libraries have been developed over the years,
so anyone wishing to use regular expressions in Java needed to
find them, evaluate them, and ultimately select one to use.
Chapter 8 looks at seven such packages, and ways to evaluate
them. For reasons discussed there, the regex package that Sun
eventually came up with (their java.util.regex, now standard
as of Java 1.4) is what I use for most of the Java examples in
this book.
Table 3-2. Versions of Some Tools Mentioned in This Book
GNU awk 3.1
GNU egrep/grep 2.4.2
GNU Emacs 21.2.1
flex 2.5.4
java.util.regex (Java 1.4.0)
MySQL 3.23.49
.NET Framework 2002 (1.0.3705)
PCRE 3.8
Perl 5.8
PHP (preg routines) 4.0.6
Procmail 3.22
Python 2.2.1
Ruby 1.6.7
GNU sed 3.02
Tcl 8.4
3.1.2 At a Glance
A chart showing just a few aspects of some common tools gives
a good clue to how different things still are. Table 3-3 provides
a very superficial look at a few aspects of the regex flavors of a
few tools.
Table 3-3. A (Very) Superficial Look at the Flavor of a Few Common Tools
Feature        Modern grep  Modern egrep  GNU Emacs     Tcl          Perl    .NET    Sun's Java package
? + |          \? \+ \|     ? + |         ? + \|        ? + |        ? + |   ? + |   ? + |
grouping       \(···\)      (···)         \(···\)       (···)        (···)   (···)   (···)
word boundary               \< \>         \< \> \b, \B  \m, \M, \y   \b, \B  \b, \B  \b, \B
Other rows (marked per tool as supported): *, ^, $, [···]; (?:···); \w, \W; backreferences
A chart like Table 3-3 is often found in other books to show the
differences among tools. But, this chart is only the tip of the
iceberg; for every feature shown, there are a dozen important
issues that are overlooked.
Foremost is that programs change over time. For example, Tcl
used to not support backreferences and word boundaries, but
now does. It first supported word boundaries with the
ungainly-looking [:<:] and [:>:] , and still does, although such
use is deprecated in favor of its more-recently supported \m , \M ,
and \y (start of word boundary, end of word boundary, or
either).
Along the same lines, programs such as grep and egrep, which
aren't from a single provider but rather can be provided by
anyone who wants to create them, can have whatever flavor
the individual author of the program wishes. Human nature
being what it is, each tends to have its own features and
peculiarities. (The GNU versions of many common tools, for
example, are often more powerful and robust than other
versions.)
And perhaps as important as the easily visible features are the
many subtle (and some not-so-subtle) differences among
flavors. Looking at the table, one might think that regular
expressions are exactly the same in Perl, .NET, and Java, which
is certainly not true. Just a few of the questions one might ask
when looking at something like Table 3-3 are:
Are star and friends allowed to quantify something wrapped
  in parentheses?
Does dot match a newline? Do negated character classes
  match it? Do either match the null character?
Are the line anchors really line anchors (i.e., do they
  recognize newlines that might be embedded within the
  target string)? Are they first-class metacharacters, or are
  they valid only in certain parts of the regex?
Are escapes recognized in character classes? What else is or
  isn't allowed within character classes?
Are parentheses allowed to be nested? If so, how deeply
  (and how many parentheses are even allowed in the first
  place)?
If backreferences are allowed, when a case-insensitive
  match is requested, do backreferences match
  appropriately? Do backreferences "behave" reasonably in
  fringe situations?
Are octal escapes such as \123 allowed? If so, how do they
  reconcile the syntactic conflict with backreferences? What
  about hexadecimal escapes? Is it really the regex engine
  that supports octal and hexadecimal escapes, or is it some
  other part of the utility?
Does \w match only alphanumerics, or additional
  characters as well? (Among the programs shown supporting
  \w in Table 3-3, there are several different interpretations.)
  Does \w agree with the various word-boundary
  metacharacters on what does and doesn't constitute a
  "word character"? Do they respect the locale, or understand
  Unicode?
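Questions like these are often quickest to settle by experiment. The following sketch probes a few of them for java.util.regex only; the class and helper names are hypothetical, and other tools can answer the same questions differently.

```java
import java.util.regex.Pattern;

final class FlavorQuestions {
    /** True if regex, compiled with the given flags, matches somewhere in text. */
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";
        // Does dot match a newline? Only when asked to.
        System.out.println("dot, default:       " + found("a.b", 0, twoLines));
        System.out.println("dot, DOTALL:        " + found("a.b", Pattern.DOTALL, twoLines));
        // Does a negated character class match a newline?
        System.out.println("negated class:      " + found("a[^x]b", 0, twoLines));
        // Do line anchors recognize embedded newlines?
        System.out.println("^ default:          " + found("^b", 0, twoLines));
        System.out.println("^ MULTILINE:        " + found("^b", Pattern.MULTILINE, twoLines));
        // Do backreferences honor a case-insensitive match?
        System.out.println("backref, ignorecase: "
            + found("(a)\\1", Pattern.CASE_INSENSITIVE, "aA"));
    }
}
```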
Many issues must be kept in mind, even with a tidy little
summary like Table 3-3 as a superficial guide. (As another
example, peek ahead to Table 8-1 for a look at a chart showing
some differences among Java packages.) If you realize that
there's a lot of dirty laundry behind that nice façade, it's not too
difficult to keep your wits about you and deal with it.
As mentioned at the start of the chapter, much of this is just
superficial syntax, but many issues go deeper. For example,
once you understand that something such as (Jul|July) in
egrep needs to be written as \(Jul\|July\) for GNU Emacs,
you might think that everything is the same from there, but
that's not always the case. The difference in the semantics of
how a match is attempted (or, at least, how it appears to be
attempted) is an extremely important issue that is often
overlooked, yet it explains why these two apparently identical
examples would actually end up matching differently: one
always matches 'Jul', even when applied to 'July'. Those very
same semantics also explain why the opposite, (July|Jul)
and \(July\|Jul\) , do match the same text. Again, the
entire next chapter is devoted to understanding this.
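The ordered-alternation side of this can be observed directly with java.util.regex, which (like GNU Emacs in the example above) tries alternatives left to right and stops at the first that works. The sketch below uses hypothetical names and says nothing about egrep, which would need to be checked with egrep itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrder {
    /** Returns the text of the first match of regex in text, or null if there is none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("(Jul|July)", "July 4"));  // the first alternative wins
        System.out.println(firstMatch("(July|Jul)", "July 4"));  // the longer one is tried first
    }
}
```

With the alternatives in the first order, the match stops at "Jul" even though "July" was available; reversing them yields "July".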
Of course, what a tool can do with a regular expression is often
more important than the flavor of its regular expressions. For
example, even if Perl's expressions were less powerful than
egrep's, Perl's flexible use of regexes provides for more raw
usefulness. We'll look at a lot of individual features in this
chapter, and in depth at a few languages in later chapters.