Addison Wesley, Text Processing in Python, June 2003, ISBN 0321112547

A.3 Datatypes
Python has a rich collection of basic datatypes. All of Python's
collection types allow you to hold heterogeneous elements
inside them, including other collection types (with minor
limitations). It is straightforward, therefore, to build complex
data structures in Python.
Unlike many languages, Python datatypes come in two
varieties: mutable and immutable. All of the atomic datatypes
are immutable, as is the collection type tuple. The collections
list and dict are mutable, as are class instances. The
mutability of a datatype is simply a question of whether objects
of that type can be changed "in place"; an immutable object can
only be created and destroyed, but never altered during its
existence. One upshot of this distinction is that immutable
objects may act as dictionary keys, but mutable objects may
not. Another upshot is that when you want a data
structure (especially a large one) that will be modified frequently
during program operation, you should choose a mutable
datatype (usually a list).
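The two upshots just described can be sketched briefly. This is only an illustration, written for a current Python 3 interpreter (an assumption; the text itself describes Python 2.x), and the names point, lookup, and log are made up for the example:

```python
# Immutable objects (here a tuple) are hashable and can serve as dict keys.
point = (3, 4)
lookup = {point: "corner"}

# Mutable objects (here a list) are rejected as dict keys.
try:
    lookup[[3, 4]] = "corner"
    list_key_ok = True
except TypeError:
    list_key_ok = False

# A list can be changed in place: the object keeps its identity.
log = ["start"]
before = id(log)
log.append("stop")
same_object = (id(log) == before)
```

Note that the tuple key works only because its elements are themselves immutable.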
Most of the time, if you want to convert values between
different Python datatypes, an explicit conversion/encoding call
is required, but numeric types contain promotion rules to allow
numeric expressions over a mixture of types. The built-in
datatypes are listed below with discussions of each. The built-in
function type() can be used to check the datatype of an object.

A.3.1 Simple Types
bool
Python 2.3+ supports a Boolean datatype with the possible
values True and False. In earlier versions of Python, these
values are typically called 1 and 0; even in Python 2.3+, the
Boolean values behave like numbers in numeric contexts. Some
earlier micro-releases of Python (e.g., 2.2.1) include the names
True and False, but not the Boolean datatype.
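A small sketch of what "behave like numbers in numeric contexts" means in practice. It assumes a current Python 3 interpreter, where the behavior is the same as in 2.3+; the variable names are illustrative only:

```python
# True and False act as the integers 1 and 0 in arithmetic.
total = True + True

# Summing a list of Booleans therefore counts the true values.
flags = [True, False, True, True]
count_true = sum(flags)

# A Boolean can even be used as a sequence index.
picked = ["no", "yes"][True]
```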

int
A signed integer in the range indicated by the register size of
the interpreter's CPU/OS platform. For most current platforms,
integers range from (2**31)-1 down to -(2**31). You can
find the size on your platform by examining sys.maxint; the
most negative int is -sys.maxint-1.
Integers are the bottom numeric type in terms of promotions;
nothing gets promoted to an integer, but integers are
sometimes promoted to other numeric types. A float, long, or
string may be explicitly converted to an int using the int()
function.
SEE ALSO: int 18;

long
An (almost) unlimited size integral number. A long literal is
indicated by an integer followed by an l or L (e.g., 34L,
9876543210l). In Python 2.2+, operations on ints that overflow
sys.maxint are automatically promoted to longs. An int, float,
or string may be explicitly converted to a long using the long()
function.

float
An IEEE754 floating point number. A literal floating point
number is distinguished from an int or long by containing a
decimal point and/or exponent notation (e.g., 1.0, 1e3, 37.,
.453e-12). A numeric expression that involves both int/long
types and float types promotes all component types to floats
before performing the computation. An int, long, or string may
be explicitly converted to a float using the float() function.
SEE ALSO: float 19;

complex
An object containing two floats, representing real and imaginary
components of a number. A numeric expression that involves
both int/long/float types and complex types promotes all
component types to complex before performing the
computation. There is no way to spell a literal complex in
Python, but an addition such as 1.1+2j is the usual way of
computing a complex value. A j or J following a float or int
literal indicates an imaginary number. An int, long, or string
may be explicitly converted to a complex using the complex()
function. If two float/int arguments are passed to complex(),
the second is the imaginary component of the constructed
number (e.g., complex(1.1,2)).
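The promotion rules for the numeric types above can be seen in a few lines. This is a sketch that assumes a current Python 3 interpreter (where the separate long type no longer exists); the variable names are illustrative only:

```python
# Mixed-type arithmetic promotes toward the "wider" type.
i, f, c = 2, 1.5, 1.1 + 2j

int_plus_float = i + f          # the int is promoted to float
float_plus_complex = f + c      # the float is promoted to complex

# complex(real, imag) builds the same kind of value explicitly.
built = complex(1.1, 2)

# Going the other way always takes an explicit conversion call.
truncated = int(3.99)           # conversion truncates toward zero
parsed = float("37.")           # strings are never promoted implicitly
```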

string
An immutable sequence of 8-bit character values. Unlike in
many programming languages, there is no "character" type in
Python, merely strings that happen to have length one. String
objects have a variety of methods to modify strings, but such
methods always return a new string object rather than modify
the initial object itself. The built-in chr() function will return a
length-one string whose ordinal value is the passed integer. The
str() function will return a string representation of a passed in
object. For example:
>>> ord('a')
97
>>> chr(97)
'a'
>>> str(97)
'97'
SEE ALSO: string 129;

unicode
An immutable sequence of Unicode characters. There is no
datatype for a single Unicode character, but Unicode strings of
length-one contain a single character. Unicode strings contain a
similar collection of methods to string objects, and like the
latter, Unicode methods return new Unicode objects rather than
modify the initial object. See Chapter 2 and Appendix C for
additional discussion of Unicode.

A.3.2 String Interpolation
Literal strings and Unicode strings may contain embedded
format codes. When a string contains format codes, values may
be interpolated into the string using the % operator and a tuple
or dictionary giving the values to substitute in.
Strings that contain format codes may follow either of two
patterns. The simpler pattern uses format codes with the syntax
%[flags][len[.precision]]<type>. Interpolating a string with
format codes on this pattern requires % combination with a
tuple of matching length and content datatypes. If only one
value is being interpolated, you may give the bare item rather
than a tuple of length one. For example:

>>> "float %3.1f, int %+d, hex %06x" % (1.234, 1234, 1234)
'float 1.2, int +1234, hex 0004d2'
>>> '%e' % 1234
'1.234000e+03'
>>> '%e' % (1234,)
'1.234000e+03'
The (slightly) more complex pattern for format codes embeds a
name within the format code, which is then used as a string key
to an interpolation dictionary. The syntax of this pattern is
%(key)[flags][len[.precision]]<type>. Interpolating a string
with this style of format codes requires % combination with a
dictionary that contains all the named keys, and whose
corresponding values contain acceptable datatypes. For
example:

>>> dct = {'ratio':1.234, 'count':1234, 'offset':1234}
>>> "float %(ratio)3.1f, int %(count)+d, hex %(offset)06x" % dct
'float 1.2, int +1234, hex 0004d2'
You may not mix tuple interpolation and dictionary interpolation
within the same string.
I mentioned that datatypes must match format codes. Different
format codes accept a different range of datatypes, but the
rules are almost always what you would expect. Generally,
numeric data will be promoted or demoted as necessary, but
strings and complex types cannot be used for numbers.

One useful style of using dictionary interpolation is against the
global and/or local namespace dictionary. Regular bound names
defined in scope can be interpolated into strings.
>>> s = "float %(ratio)3.1f, int %(count)+d, hex %(offset)06x"
>>> ratio = 1.234
>>> count = 1234
>>> offset = 1234
>>> s % globals()
'float 1.2, int +1234, hex 0004d2'
If you want to look for names across scope, you can create an
ad hoc dictionary with both local and global names:
>>> vardct = {}
>>> vardct.update(globals())
>>> vardct.update(locals())
>>> interpolated = somestring % vardct
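As a sketch of why the update order matters, the following uses the same ad hoc dictionary inside a function. It assumes a current Python 3 interpreter; the names unit and report are hypothetical and exist only for this example:

```python
# Hypothetical module-level name referenced by the template below.
unit = "MB"

def report(count, ratio):
    # Locals are applied last, so they would shadow globals of the same name.
    names = {}
    names.update(globals())
    names.update(locals())
    return "%(count)d files, %(ratio).1f %(unit)s" % names

line = report(12, 3.14159)
```

Here count and ratio come from the local namespace while unit is found among the globals.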
The flags for format codes consist of the following:
0          Pad to length with leading zeros
-          Align the value to the left within its length
(space)    Leave a blank before a positive number (where a sign would go)
+          Explicitly indicate the sign of positive values
When a length is included, it specifies the minimum length of
the interpolated formatting. Numbers that will not fit within a
length simply occupy more bytes than specified. When a
precision is included, the length of those digits to the right of
the decimal are included in the total length:
>>> '[%f]' % 1.234
'[1.234000]'
>>> '[%5f]' % 1.234
'[1.234000]'
>>> '[%.1f]' % 1.234
'[1.2]'
>>> '[%5.1f]' % 1.234
'[  1.2]'
>>> '[%05.1f]' % 1.234
'[001.2]'
The formatting types consist of the following:
d    Signed integer decimal
i    Signed integer decimal
o    Unsigned octal
u    Unsigned decimal
x    Lowercase unsigned hexadecimal
X    Uppercase unsigned hexadecimal
e    Lowercase exponential format floating point
E    Uppercase exponential format floating point
f    Floating point decimal format
g    Floating point: exponential format if the exponent is less
     than -4 or not less than the precision, decimal otherwise
G    Uppercase version of 'g'
c    Single character: integer for chr(i) or length-one string
r    Converts any Python object using repr()
s    Converts any Python object using str()
%    The '%' character, e.g.: '%%%d' % (1) --> '%1'
One more special format code style allows the use of a * in
place of a length. In this case, the interpolated tuple must
contain an extra element for the formatted length of each
format code, preceding the value to format. For example:
>>> "%0*d # %0*.2f" % (4, 123, 4, 1.23)
'0123 # 1.23'
>>> "%0*d # %0*.2f" % (6, 123, 6, 1.23)
'000123 # 001.23'

A.3.3 Printing
The least-sophisticated form of textual output in Python is
writing to open files. In particular, the STDOUT and STDERR
streams can be accessed using the pseudo-files sys.stdout
and sys.stderr. Writing to these is just like writing to any
other file; for example:
>>> import sys
>>> try:
...     # some fragile action
...     sys.stdout.write('result of action\n')
... except:
...     sys.stderr.write('could not complete action\n')
...
result of action
You cannot seek within STDOUT or STDERR; generally you should
consider these as pure sequential outputs.
Writing to STDOUT and STDERR is fairly inflexible, and most of
the time the print statement accomplishes the same purpose
more flexibly. In particular, methods like sys.stdout.write()
only accept a single string as an argument, while print can
handle any number of arguments of any type. Each argument is
coerced to a string using the equivalent of str(obj). For
example:
>>> print "Pi: %.3f" % 3.1415, 27+11, {3:4,1:2}, (1,2,3)
Pi: 3.142 38 {1: 2, 3: 4} (1, 2, 3)
Each argument to the print statement is evaluated before it is
printed, just as when an argument is passed to a function. As a
consequence, the canonical representation of an object is
printed, rather than the exact form passed as an argument. In
my example, the dictionary prints in a different order than it
was defined in, and the spacing of the tuple and dictionary is
slightly different. String interpolation is also performed and is a
very common means of defining an output format precisely.
There are a few things to watch for with the print statement. A
space is printed between each argument to the statement. If
you want to print several objects without a separating space,
you will need to use string concatenation or string interpolation
to get the right result. For example:
>>> numerator, denominator = 3, 7
>>> print repr(numerator)+"/"+repr(denominator)
3/7
>>> print "%d/%d" % (numerator, denominator)
3/7
By default, a print statement adds a linefeed to the end of its
output. You may eliminate the linefeed by adding a trailing
comma to the statement, but you still wind up with a space
added to the end:
>>> letlist = ('a','B','Z','r','w')
>>> for c in letlist: print c,   # inserts spaces
...
a B Z r w
Assuming these spaces are unwanted, you must either use
sys.stdout.write() or otherwise calculate the space-free
string you want:
>>> for c in letlist+('\n',):    # no spaces
...     sys.stdout.write(c)
...
aBZrw
>>> print ''.join(letlist)
aBZrw
There is a special form of the print statement that redirects its
output somewhere other than STDOUT. The print statement
itself can be followed by two greater-than signs, then a writable
file-like object, then a comma, then the remainder of the
(printed) arguments. For example:
>>> print >> open('test','w'), "Pi: %.3f" % 3.1415, 27+11
>>> open('test').read()
'Pi: 3.142 38\n'
Some Python programmers (including your author) consider
this special form overly "noisy," but it is occasionally useful for
quick configuration of output destinations.
If you want a function that would do the same thing as a print
statement, the following one does so, but without any facility to
eliminate the trailing linefeed or redirect output:
def print_func(*args):
    import sys
    # print converts with str() and separates arguments with a space
    sys.stdout.write(' '.join(map(str,args))+'\n')
Readers could enhance this to add the missing capabilities, but
using print as a statement is the clearest approach, generally.
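One possible enhancement along those lines is sketched here. It is not from the text: it assumes a current Python 3 interpreter, converts arguments with str() as print itself does, and the keyword parameters end and file are choices made for this illustration:

```python
import io
import sys

def print_func(*args, end="\n", file=None):
    """Write str() of each argument, space-separated, followed by `end`."""
    if file is None:
        file = sys.stdout          # looked up at call time, not at definition
    file.write(" ".join(map(str, args)) + end)

# Usage: send output to an in-memory file instead of STDOUT,
# suppressing the linefeed on the first call.
buffer = io.StringIO()
print_func("Pi: %.3f" % 3.14159, 27 + 11, end="", file=buffer)
print_func(" done", file=buffer)
captured = buffer.getvalue()
```

Any object with a .write() method would serve as the file argument, which is the same requirement the print >> form places on its target.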
SEE ALSO: sys.stderr 50; sys.stdout 51;


A.3.4 Container Types
tuple
An immutable sequence of (heterogeneous) objects. Being
immutable, the membership and length of a tuple cannot be
modified after creation. However, tuple elements and
subsequences can be accessed by subscripting and slicing, and
new tuples can be constructed from such elements and slices.
Tuples are similar to "records" in some other programming
languages.
The constructor syntax for a tuple is commas between listed
items; in many contexts, parentheses around the listed items
are required to disambiguate a tuple from other constructs such
as function arguments, but it is the commas, not the
parentheses, that construct a tuple. Some examples:
>>> tup = 'spam','eggs','bacon','sausage'
>>> newtup = tup[1:3] + (1,2,3) + (tup[3],)
>>> newtup
('eggs', 'bacon', 1, 2, 3, 'sausage')
The function tuple() may also be used to construct a tuple
from another sequence type (either a list or custom sequence
type).
SEE ALSO: tuple 28;

list
A mutable sequence of objects. Like a tuple, list elements can
be accessed by subscripting and slicing; unlike a tuple, list
methods and index and slice assignments can modify the length
and membership of a list object.
The constructor syntax for a list is surrounding square brackets.
An empty list may be constructed with no objects between the
brackets; a length-one list can contain simply an object name;
longer lists separate each element object with commas.
Indexing and slices, of course, also use square brackets, but the
syntactic contexts are different in the Python grammar (and
common sense usually points out the difference). Some
examples:
>>> lst = ['spam', (1,2,3), 'eggs', 3.1415]
>>> lst[:2]
['spam', (1, 2, 3)]
The function list() may also be used to construct a list from
another sequence type (either a tuple or custom sequence
type).
SEE ALSO: list 28;

dict
A mutable mapping between immutable keys and object values.
At most one entry in a dict exists for a given key; adding the
same key to a dictionary a second time overrides the previous
entry (much as with binding a name in a namespace). Dicts are
unordered, and entries are accessed either by key as index; by
creating lists of contained objects using the methods .keys(),
.values(), and .items(); or, in recent Python versions, with the
.popitem() method. All the dict methods generate contained
objects in an unspecified order.
The constructor syntax for a dict is surrounding curly brackets.
An empty dict may be constructed with no objects between the
brackets. Each key/value pair entered into a dict is separated by
a colon, and successive pairs are separated by commas. For
example:
>>> dct = {1:2, 3.14:(1+2j), 'spam':'eggs'}
>>> dct['spam']
'eggs'
>>> dct['a'] = 'b'     # add item to dict
>>> dct.items()
[('a', 'b'), (1, 2), ('spam', 'eggs'), (3.14, (1+2j))]
>>> dct.popitem()
('a', 'b')
>>> dct
{1: 2, 'spam': 'eggs', 3.14: (1+2j)}
In Python 2.2+, the function dict() may also be used to
construct a dict from a sequence of pairs or from a custom
mapping type. For example:
>>> d1 = dict([('a','b'), (1,2), ('spam','eggs')])
>>> d1
{'a': 'b', 1: 2, 'spam': 'eggs'}
>>> d2 = dict(zip([1,2,3],['a','b','c']))
>>> d2
{1: 'a', 2: 'b', 3: 'c'}
SEE ALSO: dict 24;


sets.Set
Python 2.3+ includes a standard module that implements a set
datatype. For earlier Python versions, a number of developers
have created third-party implementations of sets. If you have at
least Python 2.2, you can download and use the sets module
(or browse the Python CVS); you
will need to add the definition True,False=1,0 to your local
version, though.
A set is an unordered collection of hashable objects. Unlike a
list, no object can occur in a set more than once; a set
resembles a dict that has only keys but no values. Sets utilize
bitwise and Boolean syntax to perform basic set-theoretic
operations; a subset test does not have a special syntactic
form, instead using the .issubset() and .issuperset()
methods. You may also loop through set members in an
unspecified order. Some examples illustrate the type:
>>> from sets import Set
>>> x = Set([1,2,3])
>>> y = Set((3,4,4,6,6,2))   # init with any seq
>>> print x, '//', y         # make sure dups removed
Set([1, 2, 3]) // Set([2, 3, 4, 6])
>>> print x | y              # union of sets
Set([1, 2, 3, 4, 6])
>>> print x & y              # intersection of sets
Set([2, 3])
>>> print y-x                # difference of sets
Set([4, 6])
>>> print x ^ y              # symmetric difference
Set([1, 4, 6])
You can also check membership and iterate over set members:
>>> 4 in y                   # membership check
1
>>> x.issubset(y)            # subset check
0
>>> for i in y:
...     print i+10,
...
12 13 14 16
>>> from operator import add
>>> plus_ten = Set(map(add, y, [10]*len(y)))
>>> plus_ten
Set([16, 12, 13, 14])
sets.Set also supports in-place modification of sets;
sets.ImmutableSet, naturally, does not allow modification.
>>> x = Set([1,2,3])
>>> x |= Set([4,5,6])
>>> x
Set([1, 2, 3, 4, 5, 6])
>>> x &= Set([4,5,6])
>>> x
Set([4, 5, 6])
>>> x ^= Set([4,5])
>>> x
Set([6])
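For readers on a current Python 3 interpreter, where the sets module described above is no longer available, the same operations can be sketched with the built-in set and frozenset types. Treat this as an illustration under that assumption rather than as part of the Python 2.x material; the variable names are made up for the example:

```python
x = set([1, 2, 3])
y = set((3, 4, 4, 6, 6, 2))     # duplicates are dropped

union = x | y
intersection = x & y
difference = y - x
symmetric = x ^ y

# Subset tests are available as methods, as with sets.Set.
small_in_x = set([1, 2]).issubset(x)
x_in_y = x.issubset(y)

# frozenset plays the role of an immutable set; being hashable,
# it can act as a dictionary key.
frozen = frozenset(x)
index = {frozen: "first three"}
```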

A.3.5 Compound Types
class instance
A class instance defines a namespace, but this namespace's
main purpose is usually to act as a data container (but a
container that also knows how to perform actions; i.e., has
methods). A class instance (or any namespace) acts very much
like a dict in terms of creating a mapping between names and
values. Attributes of a class instance may be set or modified
using standard qualified names and may also be set within class
methods by qualifying with the namespace of the first (implicit)
method argument, conventionally called self. For example:
>>> class Klass:
...     def setfoo(self, val):
...         self.foo = val
...
>>> obj = Klass()
>>> obj.bar = 'BAR'
>>> obj.setfoo(['this','that','other'])
>>> obj.bar, obj.foo
('BAR', ['this', 'that', 'other'])
>>> obj.__dict__
{'foo': ['this', 'that', 'other'], 'bar': 'BAR'}
Instance attributes often dereference to other class instances,
thereby allowing hierarchically organized namespace
qualification to indicate a data structure. Moreover, a number
of "magic" methods named with leading and trailing
double-underscores provide optional syntactic conveniences for working
with instance data. The most common of these magic methods
is .__init__(), which initializes an instance (often utilizing
arguments). For example:
>>> class Klass2:
...     def __init__(self, *args, **kw):
...         self.listargs = args
...         for key, val in kw.items():
...             setattr(self, key, val)
...
>>> obj = Klass2(1, 2, 3, foo='F00', bar=Klass2(baz='BAZ'))
>>> obj.bar.blam = 'BLAM'
>>> obj.listargs, obj.foo, obj.bar.baz, obj.bar.blam
((1, 2, 3), 'F00', 'BAZ', 'BLAM')
There are quite a few additional "magic" methods that Python
classes may define. Many of these methods let class instances
behave more like basic datatypes (while still maintaining special
class behaviors). For example, the .__str__() and
.__repr__() methods control the string representation of an
instance; the .__getitem__() and .__setitem__() methods
allow indexed access to instance data (either dict-like named
indices, or list-like numbered indices); methods like
.__add__(), .__mul__(), .__pow__(), and .__abs__() allow
instances to behave in number-like ways. The Python Reference
Manual discusses magic methods in detail.
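A compact sketch of several of these magic methods working together follows. The class Vector2 is hypothetical, invented only for this illustration, and the code assumes a current Python 3 interpreter:

```python
class Vector2:
    """Hypothetical two-component vector, used only to show magic methods."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):              # controls repr(obj) and interactive echo
        return "Vector2(%r, %r)" % (self.x, self.y)

    def __getitem__(self, index):    # list-like numbered access: v[0], v[1]
        return (self.x, self.y)[index]

    def __add__(self, other):        # number-like behavior: v + w
        return Vector2(self.x + other.x, self.y + other.y)

    def __abs__(self):               # abs(v) gives the Euclidean length
        return (self.x ** 2 + self.y ** 2) ** 0.5

total = Vector2(1, 2) + Vector2(2, 2)
```

With these definitions, total can be indexed, added, and passed to abs() much as a built-in value could.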
In Python 2.2 and above, you can also let instances behave
more like basic datatypes by inheriting classes from these
built-in types. For example, suppose you need a datatype whose
"shape" contains both a mutable sequence of elements and a
.foo attribute. Two ways to define this datatype are:
>>> class FooList(list):      # works only in Python 2.2+
...     def __init__(self, lst=[], foo=None):
...         list.__init__(self, lst)
...         self.foo = foo
...
>>> foolist = FooList([1,2,3], 'F00')
>>> foolist[1], foolist.foo
(2, 'F00')
>>> class oldFooList:         # works in older Pythons
...     def __init__(self, lst=[], foo=None):
...         self._lst, self.foo = lst, foo
...     def append(self, item):
...         self._lst.append(item)
...     def __getitem__(self, item):
...         return self._lst[item]
...     def __setitem__(self, item, val):
...         self._lst[item] = val
...     def __delitem__(self, item):
...         del self._lst[item]
...
>>> foolst2 = oldFooList([1,2,3], 'F00')
>>> foolst2[1], foolst2.foo
(2, 'F00')
If you need more complex datatypes than the basic types, or
even than an instance whose class has magic methods, often
these can be constructed by using instances whose attributes
are bound in link-like fashion to other instances. Such bindings
can be constructed according to various topologies, including
circular ones (such as for modeling graphs). As a simple
example, you can construct a binary tree in Python using the
following node class:

>>> class Node:
...     def __init__(self, left=None, value=None, right=None):
...         self.left, self.value, self.right = left, value, right
...     def __repr__(self):
...         return self.value
...
>>> tree = Node(Node(value="Left Leaf"),
...             "Tree Root",
...             Node(left=Node(value="Right Left Leaf"),
...                  right=Node(value="Right Right Leaf") ))
>>> tree,tree.left,tree.left.left,tree.right.left,tree.right.right
(Tree Root, Left Leaf, None, Right Left Leaf, Right Right Leaf)
In practice, you would probably bind intermediate nodes to
names, in order to allow easy pruning and rearrangement.
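To show how such a linked structure might be walked, here is a sketch of an in-order traversal over the same kind of node. The function in_order is not from the text; it is an illustrative helper written for a current Python 3 interpreter:

```python
class Node:
    def __init__(self, left=None, value=None, right=None):
        self.left, self.value, self.right = left, value, right

def in_order(node):
    """Yield node values left-to-right; empty subtrees are skipped."""
    if node is None:
        return
    yield from in_order(node.left)
    yield node.value
    yield from in_order(node.right)

tree = Node(Node(value="Left Leaf"),
            "Tree Root",
            Node(left=Node(value="Right Left Leaf"),
                 right=Node(value="Right Right Leaf")))
values = list(in_order(tree))
```

The unnamed intermediate node on the right contributes its default value of None to the result, which is one reason to bind such nodes to names and give them values.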
SEE ALSO: int 18; float 19; list 28; string 129; tuple 28;
UserDict 24; UserList 28; UserString 33;


Chapter 2. Basic String Operations
The cheapest, fastest and most reliable components of a
computer system are those that aren't there.
Gordon Bell, Encore Computer Corporation
If you are writing programs in Python to accomplish text
processing tasks, most of what you need to know is in this
chapter. Sure, you will probably need to know how to do some
basic things with pipes, files, and arguments to get your text to
process (covered in Chapter 1); but for actually processing the
text you have gotten, the string module and string
methods (and Python's basic data structures) do most all of what
you need done, almost all the time. To a lesser extent, the
various custom modules to perform encodings, encryptions, and
compressions are handy to have around (and you certainly do
not want the work of implementing them yourself). But at the
heart of text processing are basic transformations of bits of
text. That's what string functions and string methods do.

There are a lot of interesting techniques elsewhere in this book.
I wouldn't have written about them if I did not find them
important. But be cautious before doing interesting things.
Specifically, given a fixed task in mind, before cracking this
book open to any of the other chapters, consider very carefully
whether your problem can be solved using the techniques in
this chapter. If you can answer this question affirmatively, you
should usually eschew the complications of using the
higher-level modules and techniques that other chapters discuss. By all
means read all of this book for the insight and edification that I
hope it provides; but still focus on the "Zen of Python," and
prefer simple to complex when simple is enough.
This chapter does several things. Section 2.1 looks at a number
of common problems in text processing that can (and should)
be solved using (predominantly) the techniques documented in
this chapter. Each of these "Problems" presents working
solutions that can often be adopted with little change to real-life
jobs. But a larger goal is to provide readers with a starting point
for adaptation of the examples. It is not my goal to provide
mere collections of packaged utilities and modules; plenty of
those exist on the Web, and resources like the Vaults of
Parnassus and the Python Cookbook are
worth investigating as part of any project/task (and new and
better utilities will be written between the time I write this and
when you read it). It is better for readers to receive a solid
foundation and starting point from which to develop the
functionality they need for their own projects and tasks. And
even better than spurring adaptation, these examples aim to
encourage contemplation. In presenting examples, this book
tries to embody a way of thinking about problems and an
attitude towards solving them. More than any individual
technique, such ideas are what I would most like to share with
readers.
Section 2.2 is a "reference with commentary" on the Python
standard library modules for doing basic text manipulations.
The discussions interspersed with each module try to give some
guidance on why you would want to use a given module or
function, and the reference documentation tries to contain more
examples of actual typical usage than does a plain reference. In
many cases, the examples and discussion of individual functions
address common and productive design patterns in Python.
The cross-references are intended to contextualize a given
function (or other thing) in terms of related ones (and to help
you decide which is right for you). The actual listing of
functions, constants, classes, and the like is in alphabetical
order within type of thing.


Section 2.3 in many ways continues Section 2.1, but also
provides some aids for using this book in a learning context.
The problems and solutions presented in Section 2.3 are
somewhat more open-ended than those in Section 2.1. As well,
each section labeled as "Discussion" is followed by one labeled
"Questions." These questions are ones that could be assigned
by a teacher to students; but they are also intended to be
issues that general readers will enjoy and benefit from
contemplating. In many cases, the questions point to limitations
of the approaches initially presented, and ask readers to think
about ways to address or move beyond these limitations, exactly
what readers need to do when writing their own custom code to
accomplish outside tasks. However, each Discussion in Section
2.3 should stand on its own, even if the Questions are skipped
over by the reader.


Appendix C. Understanding Unicode
Section C.1. Some Background on Characters
Section C.2. What Is Unicode?
Section C.3. Encodings
Section C.4. Declarations
Section C.5. Finding Codepoints
Section C.6. Resources


Chapter 1. Python Basics
This chapter discusses Python capabilities that are likely to be
used in text processing applications. For an introduction to
Python syntax and semantics per se, readers might want to skip
ahead to Appendix A (A Selective and Impressionistic Short
Review of Python); Guido van Rossum's Python Tutorial is also
excellent. The focus here occupies a somewhat higher level: not
the Python language narrowly, but also not yet specific to text
processing.
In Section 1.1, I look at some programming techniques that
flow out of the Python language itself, but that are usually not
obvious to Python beginners and are sometimes not obvious
even to intermediate Python programmers. The programming
techniques that are discussed are ones that tend to be
applicable to text processing contexts; other programming tasks
are likely to have their own tricks and idioms that are not
explicitly documented in this book.
In Section 1.2, I document modules in the Python standard
library that you will probably use in your text processing
application, or at the very least want to keep in the back of your
mind. A number of other Python standard library modules are
far enough afield of text processing that you are unlikely to use
them in this type of application. Such remaining modules are
documented very briefly with one- or two-line descriptions.
More details on each module can be found with Python's
standard documentation.


2.1 Some Common Tasks
2.1.1 Problem: Quickly sorting lines on custom
criteria
Sorting is one of the real meat-and-potatoes algorithms of text
processing and, in fact, of most programming. Fortunately for
Python developers, the native [].sort method is
extraordinarily fast. Moreover, Python lists with almost any
heterogeneous objects as elements can be sorted; Python cannot
rely on the uniform arrays of a language like C (an unfortunate
exception to this general power was introduced in recent Python
versions where comparisons of complex numbers raise a
TypeError; and [1+1j,2+2j].sort() dies for the same
reason; Unicode strings in lists can cause similar problems).
SEE ALSO: complex 22;

The list sort method is wonderful when you want to sort items
in their "natural" order, or in the order that Python considers
natural, in the case of items of varying types. Unfortunately, a
lot of times, you want to sort things in "unnatural" orders. For
lines of text, in particular, any order that is not simple
alphabetization of the lines is "unnatural." But often text lines
contain meaningful bits of information in positions other than
the first character position: A last name may occur as the
second word of a list of people (for example, with first name as
the first word); an IP address may occur several fields into a
server log file; a money total may occur at position 70 of each
line; and so on. What if you want to sort lines based on this
style of meaningful order that Python doesn't quite understand?
The list sort method [].sort() supports an optional custom
comparison function argument. The job this function has is to
return -1 if the first thing should come first, return 0 if the two
things are equal order-wise, and return 1 if the first thing
should come second. The built-in function cmp() does this in a
manner identical to the default [].sort() (except in terms of
speed, lst.sort() is much faster than lst.sort(cmp)). For
short lists and quick solutions, a custom comparison function is
probably the best thing. In a lot of cases, you can even get by
with an in-line lambda function as the custom comparison
function, which is a pleasant and handy idiom.
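As a sketch of the comparison-function contract, the following sorts names on their second word. It assumes a current Python 3 interpreter, where the cmp() built-in and the positional comparison argument described above are gone, so functools.cmp_to_key adapts the function instead; by_second_word and the sample data are made up for the illustration:

```python
from functools import cmp_to_key

def by_second_word(line1, line2):
    """Return -1, 0, or 1 by comparing the second word of each line."""
    w1, w2 = line1.split()[1], line2.split()[1]
    return (w1 > w2) - (w1 < w2)   # stand-in for the old cmp() built-in

people = ["Mary Jones", "Alan Smith", "Zoe Adams"]
by_cmp = sorted(people, key=cmp_to_key(by_second_word))
```

In the Python 2.x of the text, the equivalent call would simply pass the comparison function to the list's .sort() method.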
When it comes to speed, however, use of custom comparison
functions is fairly awful. Part of the problem is Python's function
call overhead, but a lot of other factors contribute to the
slowness. Fortunately, a technique called "Schwartzian
Transforms" can make for much faster custom sorts.
Schwartzian Transforms are named after Randal Schwartz, who
proposed the technique for working with Perl; but the technique
is equally applicable to Python.
The pattern involved in the Schwartzian Transform technique
consists of three steps (these can more precisely be called the
Guttman-Rosler Transform, which is based on the Schwartzian
Transform):
1. Transform the list in a reversible way into one that
   sorts "naturally."
2. Call Python's native [].sort() method.
3. Reverse the transformation in (1) to restore the original list
   items (in new sorted order).
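The three steps can be sketched in a few lines for a "sort on fourth word" task. This is an illustration written for a current Python 3 interpreter, not the author's timing script; the function name sort_on_fourth_word and the sample lines are assumptions, and ties between long lines are broken by the whole line:

```python
def sort_on_fourth_word(lines):
    """Sort lines on their fourth and later words; short lines go last."""
    decorated = []
    for line in lines:                     # step 1: decorate
        words = line.split()
        if len(words) >= 4:
            key = (0, words[3:], line)     # long lines sort first
        else:
            key = (1, [], line)            # short lines in "natural" order
        decorated.append(key)
    decorated.sort()                       # step 2: native sort
    return [item[-1] for item in decorated]   # step 3: undecorate

sample = ["a b c zebra", "short line", "a b c apple pie", "x y z apple"]
result = sort_on_fourth_word(sample)
```

Because each decorated item carries the original line as its last element, reversing the transformation is just a matter of picking that element back out.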
The reason this technique works is that, for a list of size N, it
only requires O(2N) transformation operations, which is easy to
amortize over the necessary O(N log N) compare/flip operations
for large lists. The sort dominates computational time, so
anything that makes the sort more efficient is a win in the limit
case (this limit is reached quickly).
Below is an example of a simple, but plausible, custom sorting
algorithm. The sort is on the fourth and subsequent words of a
list of input lines. Lines that are shorter than four words sort to
the bottom. Running the test against a file with about 20,000
lines (about 1 megabyte) performed the Schwartzian Transform
sort in less than 2 seconds, while taking over 12 seconds for the
custom comparison function sort (outputs were verified as
identical). Any number of factors will change the exact relative
timings, but a better than six times gain can generally be
expected.

schwartzian_sort.py
# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
#   lexicographically on the 4th, 5th, etc.. words.
#   Any line with fewer than four words will be sorted to
#   the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write
# naive custom sort
def fourth_word(ln1, ln2):
    lst1 = string.split(ln1)
    lst2 = string.split(ln2)
    #-- Compare "long" lines
    if len(lst1) >= 4 and len(lst2) >= 4:
        return cmp(lst1[3:], lst2[3:])
    #-- Long lines before short lines
    elif len(lst1) >= 4 and len(lst2) < 4:
        return -1
    #-- Short lines after long lines
    elif len(lst1) < 4 and len(lst2) >= 4:

