
Addison Wesley Unicode Demystified A Practical Programmers Guide To The Encoding Standard Sep 2002 ISBN 0201700522


General Category
After the code point value and the name, the next most
important property that a Unicode character has is its general
category. Seven primary categories exist: letter, number,
punctuation, symbol, mark, separator, and miscellaneous. Each
is subdivided into additional categories.
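
As a rough illustration of the two-level scheme, here is a minimal Python sketch that uses the standard `unicodedata` module. The `PRIMARY` table and the `describe` helper are names made up for this example; the results reflect whatever Unicode version your Python build bundles.

```python
import unicodedata

# Map the first letter of a two-letter general category code to the
# primary category named in the text. "C" covers the miscellaneous group.
PRIMARY = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "miscellaneous",
}

def describe(ch: str) -> tuple:
    """Return (two-letter category code, primary category name)."""
    code = unicodedata.category(ch)
    return code, PRIMARY[code[0]]

for ch in "A1,+ ":
    print(repr(ch), describe(ch))
```

The first letter of the code identifies the primary category and the second letter identifies the subdivision.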

Letters
The Unicode standard uses the term "letter" rather loosely in
assigning things to this general category. Whatever counts as
the basic unit of meaning in a particular writing system,
whether it represents a phoneme, a syllable, or a whole word or
idea, is assigned to the "letter" category. The major exception
to this rule comprises marks that combine typographically with
other characters, which are categorized as "marks" instead of
"letters." They include not only diacritical marks and tone
marks, but also vowel signs in those consonantal writing
systems where the vowels are written as marks applied to the
consonants.
Some writing systems, such as the Latin, Greek, and Cyrillic
alphabets, also have the concept of "case." That is, two series
of letter forms are used together, with one series, the
"uppercase," used for the first letter of a sentence or a proper
name, or for emphasis, and the other series, the "lowercase,"
used for most other letters.

Uppercase Letter (Lu)
In cased writing systems, the uppercase letters are placed in
this category.

Lowercase Letter (Ll)
In cased writing systems, the lowercase letters are placed in
this category.

Titlecase Letter (Lt)
Titlecase is reserved for a few special characters in Unicode.
These characters are basically examples of compatibility
characters: characters that were included for round-trip
compatibility with some other standard. Every titlecase letter is
actually a glyph representing two letters, the first of which is
uppercase and the second of which is lowercase. For example,
the Serbian letter nje (њ) can be thought of as a ligature of the
Cyrillic letter n (н) and the Cyrillic soft sign (ь). When Serbian
is written using the Latin alphabet (as is done in Croatian,
which is almost the same language), this letter is written using
the letters nj. Existing Serbian and Croatian standards were
designed to provide a one-to-one mapping between every
Cyrillic character used in Serbian and the corresponding Latin
character used in Croatian. This approach required using a
single character code to represent the nj digraph in Croatian,
and Unicode carries that character forward. Capital Nje in
Cyrillic (Њ) thus can convert to either NJ or Nj in Latin
depending on the context. The fully uppercase form, NJ, is
U+01CA LATIN CAPITAL LETTER NJ, and the combined upper-lower
form, U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J, is
considered a "titlecase" letter. Three Serbian
characters have a titlecase Latin form: љ (lje, which converts to
lj), њ (nje, which converts to nj), and џ (dzhe, which converts
to dž). These characters were the only three titlecase letters in
Unicode 2.x.
Unicode 3.0 added several Greek letters to this category. Some
early Greek texts represented certain diphthongs by writing a
small letter iota underneath the other vowel rather than after it.
For example, you'd see "ai" written as ᾳ. If you just capitalized
the alpha ("Ai"), you'd get the titlecase version: ᾼ. In the fully
uppercase version ("AI"), the small iota becomes a regular iota
again: ΑΙ. These characters are all in the Extended Greek
section of the standard and are used only in writing ancient
Greek texts. In modern Greek, these diphthongs are written
using a regular iota; for example, "ai" is written as αι.
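
To see the three case forms of the nj digraph side by side, you could try something like the following sketch. It relies on Python's built-in `str.upper` and `str.title`, which apply the Unicode case mappings; the variable names are just for illustration.

```python
import unicodedata

small_nj = "\u01cc"          # LATIN SMALL LETTER NJ

upper = small_nj.upper()     # fully uppercase digraph
title = small_nj.title()     # titlecase digraph

for label, ch in [("lower", small_nj), ("title", title), ("upper", upper)]:
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# The Greek iota-subscript forms added in Unicode 3.0 are also Lt.
greek_title = "\u1fbc"       # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
print(f"U+{ord(greek_title):04X}", unicodedata.category(greek_title))
```

Note that titlecasing and uppercasing give different code points here, which is the whole reason the Lt category exists.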

Modifier Letter (Lm)
Just as some things you might conceptually think of as "letters"
(vowel signs in various languages) are classified as "marks" in
Unicode, the opposite also occurs. The modifier letters are
independent forms that don't combine typographically with the
characters around them, which is why Unicode doesn't classify
them as "marks" (Unicode marks, by definition, combine
typographically with their neighbors). Instead of carrying their
own sounds, the modifier letters generally modify the sounds of
their neighbors. In other words, conceptually they're diacritical
marks. Because they occur in the middle of words, most
text-analysis processes treat them as letters, so they're
classified as letters.
The Unicode modifier letters are generally either International
Phonetic Alphabet characters or characters that are used to
transliterate certain "real" letters in non-Latin writing systems
that don't seem to correspond to a regular Latin letter. For
example, U+02BC MODIFIER LETTER APOSTROPHE is typically
used to represent the glottal stop, the sound made by (or, more
accurately, the absence of sound represented by) the Arabic
letter alef, so the Arabic letter is often transliterated as this
character. Likewise, U+02B2 MODIFIER LETTER SMALL J is used
to represent palatalization, and thus is sometimes used in
transliteration as the counterpart of the Cyrillic soft sign.
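
The practical effect of classifying these characters as letters shows up in word tests. The sketch below compares the ASCII apostrophe with U+02BC, using Python's `str.isalpha`, which is documented as checking for the letter categories; the sample word is only an illustration.

```python
import unicodedata

apostrophe_punct = "'"          # U+0027 APOSTROPHE
apostrophe_letter = "\u02bc"    # U+02BC MODIFIER LETTER APOSTROPHE

print(unicodedata.category(apostrophe_punct))
print(unicodedata.category(apostrophe_letter))

# With the modifier letter, the transliterated word is one alphabetic
# token; with the punctuation apostrophe, it is not.
as_letter = "Qur" + apostrophe_letter + "an"
as_punct = "Qur" + apostrophe_punct + "an"
print(as_letter.isalpha(), as_punct.isalpha())
```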


Other Letter (Lo)
This catch-all category includes everything that's conceptually a
"letter," but that doesn't fit into one of the other "letter"
categories. Letters from uncased alphabets such as Arabic and
Hebrew fall into this category, as do syllables from syllabic
writing systems like Kana and Hangul and the Han ideographs.
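
A quick way to convince yourself that these scripts really are uncased is to check that case conversion leaves the characters alone. This is a small sketch; the `samples` table and the `is_uncased_letter` helper are invented for the example.

```python
import unicodedata

samples = {
    "Arabic alef": "\u0627",
    "Hebrew alef": "\u05d0",
    "Hiragana a": "\u3042",
    "Hangul syllable ga": "\uac00",
    "Han ideograph": "\u6f22",
}

def is_uncased_letter(ch: str) -> bool:
    """True when ch is category Lo and case conversion leaves it unchanged."""
    return unicodedata.category(ch) == "Lo" and ch.upper() == ch == ch.lower()

for label, ch in samples.items():
    print(label, unicodedata.category(ch), is_uncased_letter(ch))
```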

Marks
Like letters, marks are part of words and carry linguistic
information. Unlike letters, marks combine typographically with
other characters. For example, U+0308 COMBINING DIAERESIS
may look like ¨ when shown alone, but is usually drawn on top
of the letter that precedes it. That is, U+0061 LATIN SMALL
LETTER A followed by U+0308 COMBINING DIAERESIS isn't
drawn as "a¨", but rather as "ä". All of the Unicode combining
marks do this kind of thing.
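
In code, the letter-plus-mark sequence is two code points even though it displays as one glyph. The following sketch shows this with `unicodedata.normalize`, which can fold the pair into the precomposed character when one exists.

```python
import unicodedata

decomposed = "a\u0308"       # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), [f"U+{ord(c):04X}" for c in decomposed])
print(len(composed), [f"U+{ord(c):04X}" for c in composed])

# The mark itself has category Mn and a nonzero combining class.
print(unicodedata.category("\u0308"), unicodedata.combining("\u0308"))
```

Rendering is the job of the font and layout engine; normalization only changes which code points represent the text.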

Non-spacing Mark (Mn)
Most of the Unicode combining marks fall into this category.
Non-spacing marks don't take up any horizontal space along a
line of text; they combine completely with the character that
precedes them and fit entirely into that character's space. The
various diacritical marks used in European languages, such as
the acute and grave accents, the circumflex, the diaeresis, and
the cedilla, fall into this category.

Combining Spacing Mark (Mc)
Spacing combining marks interact typographically with their
neighbors, but still take up horizontal space along a line of text.
All of these characters are vowel signs or other diacritical marks
in the various Indian and Southeast Asian writing systems. For
example, U+093F DEVANAGARI VOWEL SIGN I (ि) is a spacing
combining mark. Thus U+0915 DEVANAGARI LETTER KA
followed by U+093F DEVANAGARI VOWEL SIGN I is drawn as
कि: the vowel sign attaches to the left-hand side of the
consonant.
Not all spacing combining marks reorder, however: U+0940
DEVANAGARI VOWEL SIGN II (ी) is also a combining spacing
mark. When it follows U+0915 DEVANAGARI LETTER KA, you
get की: the vowel attaches to the right-hand side of the
consonant, but the two combine typographically.

Enclosing Mark (Me)
Enclosing marks completely surround the characters they
modify. For example, U+20DD COMBINING ENCLOSING CIRCLE
is drawn as a ring around the character that precedes it. These
ten characters are generally used to create symbols.
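
Because every mark attaches to the character before it, a process that walks text "one displayed unit at a time" has to keep marks with their base characters. Here is a deliberately simplified sketch of that idea; the `clusters` function is a made-up helper and is not the full grapheme-cluster algorithm that the Unicode standard defines.

```python
import unicodedata

def clusters(text: str) -> list:
    """Group each base character with the marks (Mn, Mc, Me) that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch        # attach the mark to the preceding cluster
        else:
            out.append(ch)       # start a new cluster
    return out

ka_i = "\u0915\u093f"    # KA + VOWEL SIGN I  (Mc, drawn to the left)
ka_ii = "\u0915\u0940"   # KA + VOWEL SIGN II (Mc, drawn to the right)
circled = "A\u20dd"      # A + COMBINING ENCLOSING CIRCLE (Me)

print(clusters(ka_i + ka_ii + circled + "e\u0301"))
```

Note that the stored order is always base first, mark second, even when the mark is drawn to the left of the base.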

Numbers
The Unicode characters that represent numeric quantities are
given the "number" property (technically, it should be called the
"numeral" property, but that's life). The characters in these
categories have additional properties that govern their
interpretation as numerals. This category is subdivided as
follows.

Decimal-Digit Number (Nd)
The characters in this category can be used as decimal digits.
This category includes not only the digits with which we're all
familiar ("0123456789"), but similar sets of digits used with
other writing systems, such as the Thai digits
("๐๑๒๓๔๕๖๗๘๙").

Letter Number (Nl)
The characters in this category can be either letters or
numerals. Many are compatibility composites whose
decompositions consist of letters. The Roman numerals and the
Hangzhou numerals are the only characters in this category.

Other Number (No)
All of the characters that belong in the "number" category, but
not in one of the other subcategories, fall into this one. This
category includes various numeric presentation forms, such as
superscripts, subscripts, and circled numbers; various fractions;
and numerals used in various numeration systems other than
the Arabic positional notation used in the West.
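
The "additional properties" mentioned above are exposed in Python as `unicodedata.decimal` and `unicodedata.numeric`. The sketch below (with a hypothetical `numeric_info` helper) shows how the three number subcategories differ: only Nd characters have a decimal-digit value, while Nl and No characters have only a numeric value.

```python
import unicodedata

# Nd: decimal digits from any script can be used positionally.
thai_42 = "\u0e54\u0e52"     # THAI DIGIT FOUR, THAI DIGIT TWO
print(int(thai_42))

def numeric_info(ch: str) -> tuple:
    """Return (category, decimal value or None, numeric value or None)."""
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )

# ASCII digit, Thai digit, Roman numeral twelve, Hangzhou numeral one,
# superscript two, vulgar fraction one half, circled digit one.
for ch in ["7", "\u0e54", "\u216b", "\u3021", "\u00b2", "\u00bd", "\u2460"]:
    print(f"U+{ord(ch):04X}", numeric_info(ch))
```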

Punctuation
This category attempts to make sense of the various
punctuation characters in Unicode. It breaks down as follows.

Opening Punctuation (Ps)
For punctuation marks, such as parentheses and brackets, that
occur in opening-closing pairs, the "opening" characters in
these pairs are assigned to this category.

Closing Punctuation (Pe)
For punctuation marks, such as parentheses and brackets, that
occur in opening-closing pairs, the "closing" characters in these
pairs are assigned to this category.

Initial-Quote Punctuation (Pi)
Quotation marks occur in opening-closing pairs, just like
parentheses do. The problem is that which is which depends on
the language. For example, both French and Russian use
quotation marks that look like this: «», but they use them
differently.
«In French, a quotation is set off like this.»
»But in Russian, a quotation is set off like this.«
This category is equivalent to either Ps or Pe, depending on the
language.

Final-Quote Punctuation (Pf)
The counterpart to the Pi category, Pf is also used with
quotation marks whose usage varies depending on language.
It's equivalent to either Ps or Pe depending on language. It's
always the opposite of Pi.
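
For code that matches up paired punctuation, the consequence is that Ps and Pe can be trusted on their own, while Pi and Pf need outside knowledge of the language. A minimal sketch of that distinction follows; the `pairing_role` function and its return strings are invented for this example.

```python
import unicodedata

def pairing_role(ch: str) -> str:
    """Describe how a character takes part in opening-closing pairs."""
    cat = unicodedata.category(ch)
    if cat == "Ps":
        return "opens"
    if cat == "Pe":
        return "closes"
    if cat in ("Pi", "Pf"):
        return "opens or closes, depending on language"
    return "not paired punctuation"

for ch in ["(", ")", "[", "]", "\u00ab", "\u00bb", "."]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), pairing_role(ch))
```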


Dash Punctuation (Pd)
This category is self-explanatory. It includes all hyphens and
dashes.


Connector Punctuation (Pc)
Characters in this category, such as the middle dot and the
underscore, get treated as part of the word in which they
appear. That is, they "connect" series of letters together into
single words: This_is_all_one_word. An important example is
U+30FB KATAKANA MIDDLE DOT, which is used like a hyphen in
Japanese.

Other Punctuation (Po)
Punctuation marks that don't fit into any of the other
subcategories, including obvious things like the period, comma,
and question mark, fall into this category.
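
To make the "connector" idea concrete, here is a sketch of a naive word splitter that keeps letters, marks, numbers, and Pc characters together and breaks on everything else. The `words` function is a hypothetical helper, not a standard word-boundary algorithm, and which characters carry the Pc category depends on the Unicode version bundled with your Python build.

```python
import unicodedata

def words(text: str) -> list:
    """Split text into runs of letters, marks, numbers, and connectors."""
    out, current = [], ""
    for ch in text:
        cat = unicodedata.category(ch)
        if cat[0] in "LMN" or cat == "Pc":
            current += ch            # still inside a word
        else:
            if current:
                out.append(current)  # any other character ends the word
            current = ""
    if current:
        out.append(current)
    return out

print(words("This_is_all_one_word, but these-are-not."))
```

The underscore (Pc) holds its word together, while the hyphen (Pd), comma, and period (Po) all act as boundaries.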

Symbols
This group of categories contains various symbols.

Currency Symbol (Sc)
Self-explanatory.

Mathematical Symbol (Sm)
Mathematical operators.

Modifier Symbol (Sk)
This category contains two main types of characters: the
"spacing" versions of the combining marks and a few other
symbols whose purpose is to modify the meaning of the
preceding character in some way. Unlike modifier letters,
modifier symbols don't necessarily modify the meanings of
letters, and they don't necessarily get counted as parts of
words.
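
The link between a "spacing" diacritic and its combining counterpart is visible in the character's compatibility decomposition. As a small sketch, U+00A8 DIAERESIS (category Sk) decomposes under NFKD into a space followed by U+0308 COMBINING DIAERESIS:

```python
import unicodedata

spacing_diaeresis = "\u00a8"     # DIAERESIS, the spacing form

print(unicodedata.category(spacing_diaeresis))
print(unicodedata.decomposition(spacing_diaeresis))

# Compatibility decomposition yields SPACE + COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFKD", spacing_diaeresis)
print([f"U+{ord(c):04X}" for c in decomposed])

# Unlike a modifier letter, a modifier symbol does not count as a letter.
print(spacing_diaeresis.isalpha())
```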

Other Symbol (So)
This category contains all symbols that didn't fit into one of the
other categories.

Separators
These characters mark the boundaries between units of text.

Space Separator (Zs)
This category includes all of the space characters (yes, there's
more than one space character).

Paragraph Separator (Zp)
There is exactly one character in this category: the Unicode
paragraph separator (U+2029). As its name suggests, it marks
the boundary between paragraphs.

Line Separator (Zl)
There's also only one character in this category: the Unicode
line separator (U+2028). As its name suggests, it forces a line
break without ending a paragraph.

Even though the ASCII carriage-return and line-feed characters
are often used as line and paragraph separators, they're not
placed in either of these categories. Likewise, the ASCII tab
character isn't considered a Unicode space character, even
though it probably should be. They're all put in the "Cc"
category.
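
The gap between category and behavior is easy to observe. In the sketch below, Python's `str.splitlines` breaks on U+2028, U+2029, and the ASCII line feed alike, even though only the first two are in the Z categories; `str.isspace` likewise accepts the tab despite its Cc category.

```python
import unicodedata

text = "one\u2028two\u2029three\nfour"
print(text.splitlines())

# Space, no-break space, ideographic space, line separator,
# paragraph separator, tab, line feed, carriage return.
for ch in [" ", "\u00a0", "\u3000", "\u2028", "\u2029", "\t", "\n", "\r"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())
```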

Miscellaneous
A number of special character categories don't really fit in with
the others.

Control Characters (Cc)
The codes corresponding to the C0 and C1 control characters
from the ISO 2022 standard appear in this category. The
Unicode standard doesn't officially assign any semantics to
these characters (which include the ASCII control characters),
but most systems that use Unicode text treat these characters
the same way as they treat their counterparts in the source
standards. For example, most processes treat the ASCII
line-feed character as a line or paragraph separator.
The original idea was to leave the definitions of these code
points open, as ISO 2022 does. Over time, however, various
Unicode processes and algorithms have attached semantics to
these code points, effectively nailing the ISO 6429 semantics to
many of them.
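
You can list the Cc code points in the first 256 values directly. In this sketch they turn out to be the C0 range (U+0000 to U+001F), DEL (U+007F), and the C1 range (U+0080 to U+009F):

```python
import unicodedata

control_points = [
    cp for cp in range(0x100)
    if unicodedata.category(chr(cp)) == "Cc"
]

print(len(control_points))
print(f"first: U+{control_points[0]:04X}, last: U+{control_points[-1]:04X}")
```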

Formatting Characters (Cf)
Unicode includes some "control" characters of its own:
characters with no visual representation of their own that are
used to control how the characters around them are drawn or
handled by various processes. These characters are assigned to
this category.
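
Because formatting characters are invisible, they can make two strings that look identical compare as different. One possible response, sketched below with a hypothetical `strip_format_chars` helper, is to drop Cf characters when building a comparison key. This is an illustration only: removing them from text that will be displayed can change how it is drawn.

```python
import unicodedata

def strip_format_chars(text: str) -> str:
    """Return text with every category-Cf character removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# ZERO WIDTH JOINER and LEFT-TO-RIGHT MARK hidden inside "abc".
hidden = "a\u200db\u200ec"
print(len(hidden), len(strip_format_chars(hidden)))
print(strip_format_chars(hidden) == "abc")
```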

Surrogates (Cs)
The code points in the UTF-16 surrogate range belong to this
category. Technically, the code points in the surrogate range are
treated as unassigned and reserved, but Unicode
implementations based on UTF-16 often treat them as
characters, handling surrogate pairs the same way as
combining character sequences are handled.
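
For reference, the arithmetic that maps a supplementary code point to a surrogate pair is short enough to sketch here. The `to_surrogates` function is an illustrative helper; the example checks its result against Python's own UTF-16 encoder.

```python
import unicodedata

def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if cp not in range(0x10000, 0x110000):
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"U+{high:04X} U+{low:04X}")

# Both halves carry the Cs category when looked at as code points.
print(unicodedata.category(chr(high)), unicodedata.category(chr(low)))

# Cross-check against the built-in encoder (big-endian, no byte-order mark).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```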

Private-Use Characters (Co)
The code points in the private-use ranges are assigned to this
category.

Unassigned Code Points (Cn)
All unassigned and noncharacter code points, other than those
in the surrogate range, are given this category. These code
points aren't listed in the Unicode Character Database (their
omission gives them this category), but they are listed explicitly
in DerivedGeneralCategory.txt.
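
A lookup library has to supply the Cn default itself, since there is no database entry to read. The sketch below shows how Python's `unicodedata` reports a noncharacter and a private-use code point: both get a category, but neither has a name. The `lookup` helper is invented for the example.

```python
import unicodedata

def lookup(cp: int) -> tuple:
    """Return (category, name or None) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch, None)

print(lookup(0xFFFF))      # noncharacter
print(lookup(0xE000))      # start of the BMP private-use range
print(lookup(0x0041))      # an ordinary assigned character, for contrast
```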


