A.3 Datatypes
Python has a rich collection of basic datatypes. All of Python's
collection types allow you to hold heterogeneous elements
inside them, including other collection types (with minor
limitations). It is straightforward, therefore, to build complex
data structures in Python.
Unlike many languages, Python datatypes come in two
varieties: mutable and immutable. All of the atomic datatypes
are immutable, as is the collection type tuple. The collections
list and dict are mutable, as are class instances. The
mutability of a datatype is simply a question of whether objects
of that type can be changed "in place"; an immutable object can
only be created and destroyed, but never altered during its
existence. One upshot of this distinction is that immutable
objects may act as dictionary keys, but mutable objects may
not. Another upshot is that when you want a data
structure (especially a large one) that will be modified frequently
during program operation, you should choose a mutable
datatype (usually a list).
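The practical difference between the two varieties can be sketched in a few lines. This is only an illustration, written for a current Python 3 interpreter rather than the Python 2 releases discussed in the text; the names point, distances, and log are invented for the example.

```python
# A minimal sketch (assumes Python 3; all names are illustrative only).
point = (3, 4)               # a tuple is immutable, hence hashable
distances = {point: 5.0}     # ...so it may act as a dictionary key

try:
    bad = {[3, 4]: 5.0}      # a list is mutable, so it is rejected as a key
except TypeError as err:
    key_error = str(err)     # the message mentions an "unhashable type"

log = []                     # a list is changed "in place"
log.append('first')
alias = log                  # both names refer to the same mutable object
alias.append('second')       # the change is visible through either name
```

A frequently modified structure such as log grows without creating a new object on each change, which is the reason the text recommends a mutable type for that job.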
Most of the time, if you want to convert values between
different Python datatypes, an explicit conversion/encoding call
is required, but numeric types contain promotion rules to allow
numeric expressions over a mixture of types. The built-in
datatypes are listed below with discussions of each. The built-in
function type() can be used to check the datatype of an object.
A.3.1 Simple Types
bool
Python 2.3+ supports a Boolean datatype with the possible
values True and False. In earlier versions of Python, these
values are typically called 1 and 0; even in Python 2.3+, the
Boolean values behave like numbers in numeric contexts. Some
earlier micro-releases of Python (e.g., 2.2.1) include the names
True and False, but not the Boolean datatype.
int
A signed integer in the range indicated by the register size of
the interpreter's CPU/OS platform. For most current platforms,
integers range from -(2**31) to (2**31)-1. You can
find the size on your platform by examining sys.maxint.
Integers are the bottom numeric type in terms of promotions;
nothing gets promoted to an integer, but integers are
sometimes promoted to other numeric types. A float, long, or
string may be explicitly converted to an int using the int()
function.
SEE ALSO: int 18;
long
An (almost) unlimited size integral number. A long literal is
indicated by an integer followed by an l or L (e.g., 34L,
9876543210l). In Python 2.2+, operations on ints that overflow
sys.maxint are automatically promoted to longs. An int, float,
or string may be explicitly converted to a long using the long()
function.
float
An IEEE754 floating point number. A literal floating point
number is distinguished from an int or long by containing a
decimal point and/or exponent notation (e.g., 1.0, 1e3, 37.,
.453e-12). A numeric expression that involves both int/long
types and float types promotes all component types to floats
before performing the computation. An int, long, or string may
be explicitly converted to a float using the float() function.
SEE ALSO: float 19;
complex
An object containing two floats, representing real and imaginary
components of a number. A numeric expression that involves
both int/long/float types and complex types promotes all
component types to complex before performing the
computation. There is no way to spell a literal complex in
Python, but an addition such as 1.1+2j is the usual way of
computing a complex value. A j or J following a float or int
literal indicates an imaginary number. An int, long, or string
may be explicitly converted to a complex using the complex()
function. If two float/int arguments are passed to complex(),
the second is the imaginary component of the constructed
number (e.g., complex(1.1,2)).
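The promotion and conversion rules for the numeric types can be checked directly. The following is a small sketch, assuming a Python 3 interpreter (where the separate long type described above has been merged into int); the variable names are illustrative only.

```python
# A minimal sketch of numeric promotion and explicit conversion
# (assumes Python 3; variable names are illustrative only).
int_plus_float = 1 + 2.5                 # the int is promoted to float
float_plus_complex = 2.5 + (1.1 + 2j)    # the float is promoted to complex

# Explicit conversions from strings use the constructor functions.
from_string = (int("42"), float("1e3"), complex("1.1+2j"))

# With two arguments, the second is the imaginary component.
two_arg = complex(1.1, 2)
```

Promotion only ever moves "upward" (int to float to complex); going the other way requires an explicit call such as int().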
string
An immutable sequence of 8-bit character values. Unlike in
many programming languages, there is no "character" type in
Python, merely strings that happen to have length one. String
objects have a variety of methods to modify strings, but such
methods always return a new string object rather than modify
the initial object itself. The built-in chr() function will return a
length-one string whose ordinal value is the passed integer. The
str() function will return a string representation of a passed in
object. For example:
>>> ord('a')
97
>>> chr(97)
'a'
>>> str(97)
'97'
SEE ALSO: string 129;
unicode
An immutable sequence of Unicode characters. There is no
datatype for a single Unicode character, but Unicode strings of
length-one contain a single character. Unicode strings contain a
similar collection of methods to string objects, and like the
latter, Unicode methods return new Unicode objects rather than
modify the initial object. See Chapter 2 and Appendix C for
additional discussion of Unicode.
A.3.2 String Interpolation
Literal strings and Unicode strings may contain embedded
format codes. When a string contains format codes, values may
be interpolated into the string using the % operator and a tuple
or dictionary giving the values to substitute in.
Strings that contain format codes may follow either of two
patterns. The simpler pattern uses format codes with the syntax
%[flags][len[.precision]]<type>. Interpolating a string with
format codes on this pattern requires % combination with a
tuple of matching length and content datatypes. If only one
value is being interpolated, you may give the bare item rather
than a tuple of length one. For example:
>>> "float %3.1f, int %+d, hex %06x" % (1.234, 1234, 1234)
'float 1.2, int +1234, hex 0004d2'
>>> '%e' % 1234
'1.234000e+03'
>>> '%e' % (1234,)
'1.234000e+03'
The (slightly) more complex pattern for format codes embeds a
name within the format code, which is then used as a string key
to an interpolation dictionary. The syntax of this pattern is
%(key)[flags][len[.precision]]<type>. Interpolating a string
with this style of format codes requires % combination with a
dictionary that contains all the named keys, and whose
corresponding values contain acceptable datatypes. For
example:
>>> dct = {'ratio':1.234, 'count':1234, 'offset':1234}
>>> "float %(ratio)3.1f, int %(count)+d, hex %(offset)06x" % dct
'float 1.2, int +1234, hex 0004d2'
You may not mix tuple interpolation and dictionary interpolation
within the same string.
I mentioned that datatypes must match format codes. Different
format codes accept a different range of datatypes, but the
rules are almost always what you would expect. Generally,
numeric data will be promoted or demoted as necessary, but
strings and complex types cannot be used for numbers.
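A short experiment makes the matching rules concrete. This sketch assumes a Python 3 interpreter, where the % operator behaves the same way for these cases; the helper name fails_to_format is hypothetical and exists only for the illustration.

```python
# A minimal sketch of which datatypes a numeric format code accepts
# (assumes Python 3; fails_to_format is a hypothetical helper).
demoted = "%d" % 3.9      # a float is demoted (truncated) for %d
promoted = "%f" % 3       # an int is promoted for %f

def fails_to_format(template, value):
    """Return True if interpolating value raises TypeError."""
    try:
        template % value
    except TypeError:
        return True
    return False

string_for_number = fails_to_format("%d", "3")        # strings are refused
complex_for_number = fails_to_format("%f", 1 + 2j)    # so are complex values
```

The %s and %r codes, by contrast, accept any object, since they go through str() and repr().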
One useful style of using dictionary interpolation is against the
global and/or local namespace dictionary. Regular bound names
defined in scope can be interpolated into strings.
>>> s = "float %(ratio)3.1f, int %(count)+d, hex %(offset)06x"
>>> ratio = 1.234
>>> count = 1234
>>> offset = 1234
>>> s % globals()
'float 1.2, int +1234, hex 0004d2'
If you want to look for names across scope, you can create an
ad hoc dictionary with both local and global names:
>>> vardct = {}
>>> vardct.update(globals())
>>> vardct.update(locals())
>>> interpolated = somestring % vardct
The flags for format codes consist of the following:
0  Pad to length with leading zeros
-  Align the value to the left within its length
   (space) Leave a blank before positive values
+  Explicitly indicate the sign of positive values
When a length is included, it specifies the minimum length of
the interpolated formatting. Numbers that will not fit within a
length simply occupy more bytes than specified. When a
precision is included, the digits to the right of the decimal
point are counted in the total length:
>>> '[%f]' % 1.234
'[1.234000]'
>>> '[%5f]' % 1.234
'[1.234000]'
>>> '[%.1f]' % 1.234
'[1.2]'
>>> '[%5.1f]' % 1.234
'[  1.2]'
>>> '[%05.1f]' % 1.234
'[001.2]'
The formatting types consist of the following:
d  Signed integer decimal
i  Signed integer decimal
o  Unsigned octal
u  Unsigned decimal
x  Lowercase unsigned hexadecimal
X  Uppercase unsigned hexadecimal
e  Lowercase exponential format floating point
E  Uppercase exponential format floating point
f  Floating point decimal format
g  Floating point: exponential format if the exponent is less than -4 or not less than the precision, decimal format otherwise
G  Uppercase version of 'g'
c  Single character: integer for chr(i) or length-one string
r  Converts any Python object using repr()
s  Converts any Python object using str()
%  The '%' character, e.g.: '%%%d' % (1) --> '%1'
One more special format code style allows the use of a * in
place of a length. In this case, the interpolated tuple must
contain an extra element for the formatted length of each
format code, preceding the value to format. For example:
>>> "%0*d # %0*.2f" % (4, 123, 4, 1.23)
'0123 # 1.23'
>>> "%0*d # %0*.2f" % (6, 123, 6, 1.23)
'000123 # 001.23'
A.3.3 Printing
The least-sophisticated form of textual output in Python is
writing to open files. In particular, the STDOUT and STDERR
streams can be accessed using the pseudo-files sys.stdout
and sys.stderr. Writing to these is just like writing to any
other file; for example:
>>> import sys
>>> try:
...     # some fragile action
...     sys.stdout.write('result of action\n')
... except:
...     sys.stderr.write('could not complete action\n')
...
result of action
You cannot seek within STDOUT or STDERR; generally you should
consider these as pure sequential outputs.
Writing to STDOUT and STDERR is fairly inflexible, and most of
the time the print statement accomplishes the same purpose
more flexibly. In particular, methods like sys.stdout.write()
only accept a single string as an argument, while print can
handle any number of arguments of any type. Each argument is
coerced to a string using the equivalent of str(obj). For
example:
>>> print "Pi: %.3f" % 3.1415, 27+11, {3:4,1:2}, (1,2,3)
Pi: 3.142 38 {1: 2, 3: 4} (1, 2, 3)
Each argument to the print statement is evaluated before it is
printed, just as when an argument is passed to a function. As a
consequence, the canonical representation of an object is
printed, rather than the exact form passed as an argument. In
my example, the dictionary prints in a different order than it
was defined in, and the spacing of the tuple and dictionary is
slightly different. String interpolation is also performed and is a
very common means of defining an output format precisely.
There are a few things to watch for with the print statement. A
space is printed between each argument to the statement. If
you want to print several objects without a separating space,
you will need to use string concatenation or string interpolation
to get the right result. For example:
>>> numerator, denominator = 3, 7
>>> print repr(numerator)+"/"+repr(denominator)
3/7
>>> print "%d/%d" % (numerator, denominator)
3/7
By default, a print statement adds a linefeed to the end of its
output. You may eliminate the linefeed by adding a trailing
comma to the statement, but you still wind up with a space
added to the end:
>>> letlist = ('a','B','Z','r','w')
>>> for c in letlist: print c,   # inserts spaces
...
a B Z r w
Assuming these spaces are unwanted, you must either use
sys.stdout.write() or otherwise calculate the space-free
string you want:
>>> for c in letlist+('\n',):    # no spaces
...     sys.stdout.write(c)
...
aBZrw
>>> print ''.join(letlist)
aBZrw
There is a special form of the print statement that redirects its
output somewhere other than STDOUT. The print statement
itself can be followed by two greater-than signs, then a writable
file-like object, then a comma, then the remainder of the
(printed) arguments. For example:
>>> print >> open('test','w'), "Pi: %.3f" % 3.1415, 27+11
>>> open('test').read()
'Pi: 3.142 38\n'
Some Python programmers (including your author) consider
this special form overly "noisy," but it is occasionally useful for
quick configuration of output destinations.
If you want a function that would do the same thing as a print
statement, the following one does so, but without any facility to
eliminate the trailing linefeed or redirect output:
def print_func(*args):
    import sys
    sys.stdout.write(' '.join(map(str,args))+'\n')
Readers could enhance this to add the missing capabilities, but
using print as a statement is the clearest approach, generally.
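One possible enhancement is sketched below. It is not a definitive implementation: the keyword names end and file are assumptions chosen for this illustration, and the code is written for a Python 3 interpreter, where the io.StringIO object used in the demonstration is available.

```python
# A hedged sketch of an enhanced print function (assumes Python 3).
# The keyword names 'end' and 'file' are assumptions for illustration.
import io
import sys

def print_func(*args, **kw):
    """Write the arguments, space-separated, to a file-like object."""
    end = kw.get('end', '\n')          # pass end='' to drop the linefeed
    out = kw.get('file', sys.stdout)   # any object with a .write() method
    out.write(' '.join(map(str, args)) + end)

# Demonstration: redirect into an in-memory buffer instead of STDOUT.
buf = io.StringIO()
print_func("Pi: %.3f" % 3.1415, 27 + 11, file=buf)
print_func('no', 'linefeed', end='', file=buf)
captured = buf.getvalue()
```

Because the destination is just an object with a .write() method, the same function works for open files, sys.stderr, or an in-memory buffer.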
SEE ALSO: sys.stderr 50; sys.stdout 51;
A.3.4 Container Types
tuple
An immutable sequence of (heterogeneous) objects. Being
immutable, the membership and length of a tuple cannot be
modified after creation. However, tuple elements and
subsequences can be accessed by subscripting and slicing, and
new tuples can be constructed from such elements and slices.
Tuples are similar to "records" in some other programming
languages.
The constructor syntax for a tuple is commas between listed
items; in many contexts, parentheses around a constructed list
are required to disambiguate a tuple from other constructs such
as function arguments, but it is the commas, not the
parentheses, that construct a tuple. Some examples:
>>> tup = 'spam','eggs','bacon','sausage'
>>> newtup = tup[1:3] + (1,2,3) + (tup[3],)
>>> newtup
('eggs', 'bacon', 1, 2, 3, 'sausage')
The function tuple() may also be used to construct a tuple
from another sequence type (either a list or custom sequence
type).
SEE ALSO: tuple 28;
list
A mutable sequence of objects. Like a tuple, list elements can
be accessed by subscripting and slicing; unlike a tuple, list
methods and index and slice assignments can modify the length
and membership of a list object.
The constructor syntax for a list is surrounding square brackets.
An empty list may be constructed with no objects between the
brackets; a length-one list can contain simply an object name;
longer lists separate each element object with commas.
Indexing and slices, of course, also use square brackets, but the
syntactic contexts are different in the Python grammar (and
common sense usually points out the difference). Some
examples:
>>> lst = ['spam', (1,2,3), 'eggs', 3.1415]
>>> lst[:2]
['spam', (1, 2, 3)]
The function list() may also be used to construct a list from
another sequence type (either a tuple or custom sequence
type).
SEE ALSO: list 28;
dict
A mutable mapping between immutable keys and object values.
At most one entry in a dict exists for a given key; adding the
same key to a dictionary a second time overrides the previous
entry (much as with binding a name in a namespace). Dicts are
unordered, and entries are accessed either by key as index; by
creating lists of contained objects using the methods .keys(),
.values(), and .items(); or in recent Python versions with the
.popitem() method. All the dict methods generate contained
objects in an unspecified order.
The constructor syntax for a dict is surrounding curly brackets.
An empty dict may be constructed with no objects between the
brackets. Each key/value pair entered into a dict is separated by
a colon, and successive pairs are separated by commas. For
example:
>>> dct = {1:2, 3.14:(1+2j), 'spam':'eggs'}
>>> dct['spam']
'eggs'
>>> dct['a'] = 'b'    # add item to dict
>>> dct.items()
[('a', 'b'), (1, 2), ('spam', 'eggs'), (3.14, (1+2j))]
>>> dct.popitem()
('a', 'b')
>>> dct
{1: 2, 'spam': 'eggs', 3.14: (1+2j)}
In Python 2.2+, the function dict() may also be used to
construct a dict from a sequence of pairs or from a custom
mapping type. For example:
>>> d1 = dict([('a','b'), (1,2), ('spam','eggs')])
>>> d1
{'a': 'b', 1: 2, 'spam': 'eggs'}
>>> d2 = dict(zip([1,2,3],['a','b','c']))
>>> d2
{1: 'a', 2: 'b', 3: 'c'}
SEE ALSO: dict 24;
sets.Set
Python 2.3+ includes a standard module that implements a set
datatype. For earlier Python versions, a number of developers
have created third-party implementations of sets. If you have at
least Python 2.2, you can download and use the sets module
(or browse the Python CVS); you
will need to add the definition True,False=1,0 to your local
version, though.
A set is an unordered collection of hashable objects. Unlike a
list, no object can occur in a set more than once; a set
resembles a dict that has only keys but no values. Sets utilize
bitwise and Boolean syntax to perform basic set-theoretic
operations; a subset test does not have a special syntactic
form, instead using the .issubset() and .issuperset()
methods. You may also loop through set members in an
unspecified order. Some examples illustrate the type:
>>> from sets import Set
>>> x = Set([1,2,3])
>>> y = Set((3,4,4,6,6,2))    # init with any seq
>>> print x, '//', y          # make sure dups removed
Set([1, 2, 3]) // Set([2, 3, 4, 6])
>>> print x | y               # union of sets
Set([1, 2, 3, 4, 6])
>>> print x & y               # intersection of sets
Set([2, 3])
>>> print y-x                 # difference of sets
Set([4, 6])
>>> print x ^ y               # symmetric difference
Set([1, 4, 6])
You can also check membership and iterate over set members:
>>> 4 in y                    # membership check
1
>>> x.issubset(y)             # subset check
0
>>> for i in y:
...     print i+10,
...
12 13 14 16
>>> from operator import add
>>> plus_ten = Set(map(add, y, [10]*len(y)))
>>> plus_ten
Set([16, 12, 13, 14])
sets.Set also supports in-place modification of sets;
sets.ImmutableSet, naturally, does not allow modification.
>>> x = Set([1,2,3])
>>> x |= Set([4,5,6])
>>> x
Set([1, 2, 3, 4, 5, 6])
>>> x &= Set([4,5,6])
>>> x
Set([4, 5, 6])
>>> x ^= Set([4,5])
>>> x
Set([6])
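Readers on later interpreters may find the following sketch useful. Python versions after those discussed here provide built-in set and frozenset types with the same operators (and Python 3 no longer ships the sets module), so the examples above carry over almost unchanged. The sketch assumes Python 3; the variable names are illustrative only.

```python
# A minimal sketch using the built-in set types (assumes Python 3,
# where the sets module no longer exists; names are illustrative).
x = set([1, 2, 3])
y = set((3, 4, 4, 6, 6, 2))      # duplicates are removed

union = x | y
common = x & y
difference = y - x
symmetric = x ^ y

is_subset = x.issubset(y)        # the method form described in the text
plus_ten = set(i + 10 for i in y)

x |= set([4, 5, 6])              # in-place modification
frozen = frozenset(x)            # immutable counterpart: no .add() method
```

The immutable frozenset plays the role of sets.ImmutableSet and, being hashable, may itself be a set member or a dictionary key.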
A.3.5 Compound Types
class instance
A class instance defines a namespace, but this namespace's
main purpose is usually to act as a data container (but a
container that also knows how to perform actions; i.e., has
methods). A class instance (or any namespace) acts very much
like a dict in terms of creating a mapping between names and
values. Attributes of a class instance may be set or modified
using standard qualified names and may also be set within class
methods by qualifying with the namespace of the first (implicit)
method argument, conventionally called self. For example:
>>> class Klass:
...     def setfoo(self, val):
...         self.foo = val
...
>>> obj = Klass()
>>> obj.bar = 'BAR'
>>> obj.setfoo(['this','that','other'])
>>> obj.bar, obj.foo
('BAR', ['this', 'that', 'other'])
>>> obj.__dict__
{'foo': ['this', 'that', 'other'], 'bar': 'BAR'}
Instance attributes often dereference to other class instances,
thereby allowing hierarchically organized namespace
qualification to indicate a data structure. Moreover, a number
of "magic" methods (named with leading and trailing double
underscores) provide optional syntactic conveniences for working
with instance data. The most common of these magic methods
is .__init__(), which initializes an instance (often utilizing
arguments). For example:
>>> class Klass2:
...     def __init__(self, *args, **kw):
...         self.listargs = args
...         for key, val in kw.items():
...             setattr(self, key, val)
...
>>> obj = Klass2(1, 2, 3, foo='FOO', bar=Klass2(baz='BAZ'))
>>> obj.bar.blam = 'BLAM'
>>> obj.listargs, obj.foo, obj.bar.baz, obj.bar.blam
((1, 2, 3), 'FOO', 'BAZ', 'BLAM')
There are quite a few additional "magic" methods that Python
classes may define. Many of these methods let class instances
behave more like basic datatypes (while still maintaining special
class behaviors). For example, the .__str__() and
.__repr__() methods control the string representation of an
instance; the .__getitem__() and .__setitem__() methods
allow indexed access to instance data (either dict-like named
indices, or list-like numbered indices); methods like
.__add__(), .__mul__(), .__pow__(), and .__abs__() allow
instances to behave in number-like ways. The Python Reference
Manual discusses magic methods in detail.
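To make the role of these methods concrete, here is a small sketch of a class that combines several of them. The class name Vec and its behavior are invented purely for illustration, and the code assumes a Python 3 interpreter.

```python
# A hedged sketch of several "magic" methods working together
# (assumes Python 3; the Vec class is hypothetical).
class Vec:
    def __init__(self, *coords):
        self.coords = list(coords)
    def __repr__(self):                  # controls the string representation
        return 'Vec(%s)' % ', '.join(map(repr, self.coords))
    def __getitem__(self, i):            # list-like indexed read access
        return self.coords[i]
    def __setitem__(self, i, val):       # list-like indexed assignment
        self.coords[i] = val
    def __add__(self, other):            # number-like addition
        return Vec(*[a + b for a, b in zip(self.coords, other.coords)])
    def __abs__(self):                   # number-like magnitude
        return sum([c * c for c in self.coords]) ** 0.5

v = Vec(3, 4)
v[0] = 6                 # calls v.__setitem__(0, 6)
w = v + Vec(0, 4)        # calls v.__add__(...), giving Vec(6, 8)
length = abs(w)          # calls w.__abs__()
```

None of the syntax (indexing, +, abs()) knows anything about Vec in advance; Python simply looks up the corresponding magic method on the instance's class.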
In Python 2.2 and above, you can also let instances behave
more like basic datatypes by inheriting classes from these
built-in types. For example, suppose you need a datatype whose
"shape" contains both a mutable sequence of elements and a
.foo attribute. Two ways to define this datatype are:
>>> class FooList(list):       # works only in Python 2.2+
...     def __init__(self, lst=[], foo=None):
...         list.__init__(self, lst)
...         self.foo = foo
...
>>> foolist = FooList([1,2,3], 'FOO')
>>> foolist[1], foolist.foo
(2, 'FOO')
>>> class oldFooList:          # works in older Pythons
...     def __init__(self, lst=[], foo=None):
...         self._lst, self.foo = lst, foo
...     def append(self, item):
...         self._lst.append(item)
...     def __getitem__(self, item):
...         return self._lst[item]
...     def __setitem__(self, item, val):
...         self._lst[item] = val
...     def __delitem__(self, item):
...         del self._lst[item]
...
>>> foolst2 = oldFooList([1,2,3], 'FOO')
>>> foolst2[1], foolst2.foo
(2, 'FOO')
If you need more complex datatypes than the basic types, or
even than an instance whose class has magic methods, often
these can be constructed by using instances whose attributes
are bound in link-like fashion to other instances. Such bindings
can be constructed according to various topologies, including
circular ones (such as for modeling graphs). As a simple
example, you can construct a binary tree in Python using the
following node class:
>>> class Node:
...     def __init__(self, left=None, value=None, right=None):
...         self.left, self.value, self.right = left, value, right
...     def __repr__(self):
...         return self.value
...
>>> tree = Node(Node(value="Left Leaf"),
...             "Tree Root",
...             Node(left=Node(value="Right Left Leaf"),
...                  right=Node(value="Right Right Leaf")))
>>> tree,tree.left,tree.left.left,tree.right.left,tree.right.right
(Tree Root, Left Leaf, None, Right Left Leaf, Right Right Leaf)
In practice, you would probably bind intermediate nodes to
names, in order to allow easy pruning and rearrangement.
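Once nodes are linked this way, ordinary recursive functions can walk the structure. The sketch below repeats the node class so that it is self-contained and adds a hypothetical inorder() helper, which is not part of the original example; it assumes a Python 3 interpreter.

```python
# A minimal sketch of walking the linked structure (assumes Python 3).
# inorder() is a hypothetical helper added for illustration.
class Node:
    def __init__(self, left=None, value=None, right=None):
        self.left, self.value, self.right = left, value, right

def inorder(node):
    """Return node values as a list: left subtree, node, right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

# Bind an intermediate node to a name so it can be pruned later.
right_branch = Node(left=Node(value="Right Left Leaf"),
                    right=Node(value="Right Right Leaf"))
tree = Node(Node(value="Left Leaf"), "Tree Root", right_branch)

before = inorder(tree)
right_branch.left = None     # prune one leaf through the bound name
after = inorder(tree)
```

Note that the intermediate node was created without a value, so None shows up in the traversal at its position.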
SEE ALSO: int 18; float 19; list 28; string 129; tuple 28;
UserDict 24; UserList 28; UserString 33;
Chapter 2. Basic String Operations
The cheapest, fastest and most reliable components of a
computer system are those that aren't there.
Gordon Bell, Encore Computer Corporation
If you are writing programs in Python to accomplish text
processing tasks, most of what you need to know is in this
chapter. Sure, you will probably need to know how to do some
basic things with pipes, files, and arguments to get your text to
process (covered in Chapter 1); but for actually processing the
text you have gotten, the string module and string
methods (and Python's basic data structures) do most all of what
you need done, almost all the time. To a lesser extent, the
various custom modules to perform encodings, encryptions, and
compressions are handy to have around (and you certainly do
not want the work of implementing them yourself). But at the
heart of text processing are basic transformations of bits of
text. That's what string functions and string methods do.
There are a lot of interesting techniques elsewhere in this book.
I wouldn't have written about them if I did not find them
important. But be cautious before doing interesting things.
Specifically, given a fixed task in mind, before cracking this
book open to any of the other chapters, consider very carefully
whether your problem can be solved using the techniques in
this chapter. If you can answer this question affirmatively, you
should usually eschew the complications of using the
higher-level modules and techniques that other chapters discuss. By all
means read all of this book for the insight and edification that I
hope it provides; but still focus on the "Zen of Python," and
prefer simple to complex when simple is enough.
This chapter does several things. Section 2.1 looks at a number
of common problems in text processing that can (and should)
be solved using (predominantly) the techniques documented in
this chapter. Each of these "Problems" presents working
solutions that can often be adopted with little change to real-life
jobs. But a larger goal is to provide readers with a starting point
for adaptation of the examples. It is not my goal to provide
mere collections of packaged utilities and modules; plenty of
those exist on the Web, and resources like the Vaults of
Parnassus and the Python Cookbook are
worth investigating as part of any project/task (and new and
better utilities will be written between the time I write this and
when you read it). It is better for readers to receive a solid
foundation and starting point from which to develop the
functionality they need for their own projects and tasks. And
even better than spurring adaptation, these examples aim to
encourage contemplation. In presenting examples, this book
tries to embody a way of thinking about problems and an
attitude towards solving them. More than any individual
technique, such ideas are what I would most like to share with
readers.
Section 2.2 is a "reference with commentary" on the Python
standard library modules for doing basic text manipulations.
The discussions interspersed with each module try to give some
guidance on why you would want to use a given module or
function, and the reference documentation tries to contain more
examples of actual typical usage than does a plain reference. In
many cases, the examples and discussion of individual functions
address common and productive design patterns in Python.
The cross-references are intended to contextualize a given
function (or other thing) in terms of related ones (and to help
you decide which is right for you). The actual listing of
functions, constants, classes, and the like is in alphabetical
order within type of thing.
Section 2.3 in many ways continues Section 2.1, but also
provides some aids for using this book in a learning context.
The problems and solutions presented in Section 2.3 are
somewhat more open-ended than those in Section 2.1. As well,
each section labeled as "Discussion" is followed by one labeled
"Questions." These questions are ones that could be assigned
by a teacher to students; but they are also intended to be
issues that general readers will enjoy and benefit from
contemplating. In many cases, the questions point to limitations
of the approaches initially presented, and ask readers to think
about ways to address or move beyond these limitations, which is
exactly what readers need to do when writing their own custom code to
accomplish outside tasks. However, each Discussion in Section
2.3 should stand on its own, even if the Questions are skipped
over by the reader.
Appendix C. Understanding Unicode
Section C.1. Some Background on Characters
Section C.2. What Is Unicode?
Section C.3. Encodings
Section C.4. Declarations
Section C.5. Finding Codepoints
Section C.6. Resources
Chapter 1. Python Basics
This chapter discusses Python capabilities that are likely to be
used in text processing applications. For an introduction to
Python syntax and semantics per se, readers might want to skip
ahead to Appendix A (A Selective and Impressionistic Short
Review of Python); Guido van Rossum's Python Tutorial is also
excellent. The focus here occupies a somewhat higher level: not
the Python language narrowly, but also not yet specific to text
processing.
In Section 1.1, I look at some programming techniques that
flow out of the Python language itself, but that are usually not
obvious to Python beginners and are sometimes not obvious
even to intermediate Python programmers. The programming
techniques that are discussed are ones that tend to be
applicable to text processing contexts; other programming tasks
are likely to have their own tricks and idioms that are not
explicitly documented in this book.
In Section 1.2, I document modules in the Python standard
library that you will probably use in your text processing
application, or at the very least want to keep in the back of your
mind. A number of other Python standard library modules are
far enough afield of text processing that you are unlikely to use
them in this type of application. Such remaining modules are
documented very briefly with one- or two-line descriptions.
More details on each module can be found with Python's
standard documentation.
2.1 Some Common Tasks
2.1.1 Problem: Quickly sorting lines on custom
criteria
Sorting is one of the real meat-and-potatoes algorithms of text
processing and, in fact, of most programming. Fortunately for
Python developers, the native [].sort method is
extraordinarily fast. Moreover, Python lists with almost any
heterogeneous objects as elements can be sorted; Python cannot
rely on the uniform arrays of a language like C (an unfortunate
exception to this general power was introduced in recent Python
versions where comparisons of complex numbers raise a
TypeError; and [1+1j,2+2j].sort() dies for the same
reason; Unicode strings in lists can cause similar problems).
SEE ALSO: complex 22;
The list sort method is wonderful when you want to sort items
in their "natural" order, or in the order that Python considers
natural, in the case of items of varying types. Unfortunately, a
lot of times, you want to sort things in "unnatural" orders. For
lines of text, in particular, any order that is not simple
alphabetization of the lines is "unnatural." But often text lines
contain meaningful bits of information in positions other than
the first character position: A last name may occur as the
second word of a list of people (for example, with first name as
the first word); an IP address may occur several fields into a
server log file; a money total may occur at position 70 of each
line; and so on. What if you want to sort lines based on this
style of meaningful order that Python doesn't quite understand?
The list sort method [].sort() supports an optional custom
comparison function argument. The job this function has is to
return -1 if the first thing should come first, return 0 if the two
things are equal order-wise, and return 1 if the first thing
should come second. The built-in function cmp() does this in a
manner identical to the default [].sort() (except in terms of
speed, lst.sort() is much faster than lst.sort(cmp)). For
short lists and quick solutions, a custom comparison function is
probably the best thing. In a lot of cases, you can even get by
with an in-line lambda function as the custom comparison
function, which is a pleasant and handy idiom.
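For readers on a current interpreter, the following sketch shows the same idea. It assumes Python 3, where [].sort() no longer accepts a comparison function directly and cmp() is no longer built in; the standard functools.cmp_to_key adapter and a hand-written cmp() stand in for them. The function by_last_name and the sample data are invented for illustration.

```python
# A hedged sketch of custom comparison functions (assumes Python 3).
# cmp() is re-created by hand; by_last_name and names are illustrative.
from functools import cmp_to_key

def cmp(a, b):
    """Return -1, 0, or 1, like the Python 2 built-in of the same name."""
    return (a > b) - (a < b)

def by_last_name(ln1, ln2):
    # Compare on the second word of each line.
    return cmp(ln1.split()[1:2], ln2.split()[1:2])

names = ["Mary Smith", "John Adams", "Ann Jones"]
by_name = sorted(names, key=cmp_to_key(by_last_name))

# An in-line lambda works as the comparison function, too.
by_length = sorted(names, key=cmp_to_key(lambda a, b: cmp(len(a), len(b))))
```

Under Python 2, as the text describes, the comparison function would be passed directly, as in names.sort(by_last_name).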
When it comes to speed, however, use of custom comparison
functions is fairly awful. Part of the problem is Python's function
call overhead, but a lot of other factors contribute to the
slowness. Fortunately, a technique called "Schwartzian
Transforms" can make for much faster custom sorts.
Schwartzian Transforms are named after Randal Schwartz, who
proposed the technique for working with Perl; but the technique
is equally applicable to Python.
The pattern involved in the Schwartzian Transform technique
consists of three steps (these can more precisely be called the
Guttman-Rosler Transform, which is based on the Schwartzian
Transform):
1. Transform the list in a reversible way into one that
   sorts "naturally."
2. Call Python's native [].sort() method.
3. Reverse the transformation in (1) to restore the original list
   items (in new sorted order).
The reason this technique works is that, for a list of size N, it
only requires O(2N) transformation operations, which is easy to
amortize over the necessary O(N log N) compare/flip operations
for large lists. The sort dominates computational time, so
anything that makes the sort more efficient is a win in the limit
case (this limit is reached quickly).
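The three steps can be sketched as follows. This is only an illustration of the pattern, not the book's own timing script: the function name fourth_word_sort and the sample lines are invented, and the code assumes a Python 3 interpreter. It sorts lines on their fourth and subsequent words, with shorter lines sorted to the end.

```python
# A minimal sketch of the three-step transform (assumes Python 3).
# fourth_word_sort and the sample data are hypothetical.
def fourth_word_sort(lines):
    # Step 1: transform each line into a tuple that sorts "naturally."
    # The flag puts long lines (0) ahead of short ones (1); the index n
    # breaks ties, so the original line itself is never compared.
    decorated = []
    for n, line in enumerate(lines):
        words = line.split()
        if len(words) >= 4:
            decorated.append((0, words[3:], n, line))
        else:
            decorated.append((1, [line], n, line))
    # Step 2: call the native sort, with no comparison function.
    decorated.sort()
    # Step 3: reverse the transformation to recover the original lines.
    return [item[-1] for item in decorated]

lines = ["a b c zebra x", "short line", "a b c apple y", "another short"]
result = fourth_word_sort(lines)
```

Each line is split exactly once in step 1, whereas a comparison function would split both of its arguments on every one of the O(N log N) comparisons.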
Below is an example of a simple, but plausible, custom sorting
algorithm. The sort is on the fourth and subsequent words of a
list of input lines. Lines that are shorter than four words sort to
the bottom. Running the test against a file with about 20,000
lines (about 1 megabyte) performed the Schwartzian Transform
sort in less than 2 seconds, while taking over 12 seconds for the
custom comparison function sort (outputs were verified as
identical). Any number of factors will change the exact relative
timings, but a better than six times gain can generally be
expected.
schwartzian_sort.py
# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
#   lexicographically on the 4th, 5th, etc. words.
#   Any line with fewer than four words will be sorted to
#   the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write
# naive custom sort
def fourth_word(ln1, ln2):
    lst1 = string.split(ln1)
    lst2 = string.split(ln2)
    #-- Compare "long" lines
    if len(lst1) >= 4 and len(lst2) >= 4:
        return cmp(lst1[3:], lst2[3:])
    #-- Long lines before short lines
    elif len(lst1) >= 4 and len(lst2) < 4:
        return -1
    #-- Short lines after long lines
    elif len(lst1) < 4 and len(lst2) >= 4: