General Category
After the code point value and the name, the next most
important property that a Unicode character has is its general
category. Seven primary categories exist: letter, number,
punctuation, symbol, mark, separator, and miscellaneous. Each
is subdivided into additional categories.
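As a rough illustration, Python's standard unicodedata module reports the two-letter general category of a character, and the first letter of that code identifies the primary category. The PRIMARY table and the describe helper are names made up for this sketch, and the results reflect whichever Unicode version the Python build ships with.

```python
import unicodedata

# Hypothetical lookup table for this sketch: first letter of the
# two-letter category code -> primary category named in the text.
PRIMARY = {
    "L": "letter",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "M": "mark",
    "Z": "separator",
    "C": "miscellaneous",
}

def describe(ch):
    """Return (two-letter code, primary category name) for one character."""
    code = unicodedata.category(ch)
    return code, PRIMARY[code[0]]

for ch in ["A", "7", "!", "+", "\u0301", " ", "\n"]:
    code, primary = describe(ch)
    print(f"U+{ord(ch):04X}  {code}  {primary}")
```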
Letters
The Unicode standard uses the term "letter" rather loosely in
assigning things to this general category. Whatever counts as
the basic unit of meaning in a particular writing system,
whether it represents a phoneme, a syllable, or a whole word or
idea, is assigned to the "letter" category. The major exception
to this rule comprises marks that combine typographically with
other characters, which are categorized as "marks" instead of
"letters." They include not only diacritical marks and tone
marks, but also vowel signs in those consonantal writing
systems where the vowels are written as marks applied to the
consonants.
Some writing systems, such as the Latin, Greek, and Cyrillic
alphabets, also have the concept of "case." That is, two series
of letter forms are used together, with one series, the
"uppercase," used for the first letter of a sentence or a proper
name, or for emphasis, and the other series, the "lowercase,"
used for most other letters.
Uppercase Letter (Lu)
In cased writing systems, the uppercase letters are placed in
this category.
Lowercase Letter (Ll)
In cased writing systems, the lowercase letters are placed in
this category.
Titlecase Letter (Lt)
Titlecase is reserved for a few special characters in Unicode.
These characters are basically examples of compatibility
characters: characters that were included for round-trip
compatibility with some other standard. Every titlecase letter is
actually a glyph representing two letters, the first of which is
uppercase and the second of which is lowercase. For example,
the Serbian letter nje (њ) can be thought of as a ligature of the
Cyrillic letter n (н) and the Cyrillic soft sign (ь). When Serbian
is written using the Latin alphabet (as is done in Croatian,
which is almost the same language), this letter is written using
the letters nj. Existing Serbian and Croatian standards were
designed to provide a one-to-one mapping between every
Cyrillic character used in Serbian and the corresponding Latin
character used in Croatian. This approach required using a
single character code to represent the nj digraph in Croatian,
and Unicode carries that character forward. Capital Nje in
Cyrillic (Њ) thus can convert to either NJ or Nj in Latin
depending on the context. The fully uppercase form, NJ, is
U+01CA LATIN CAPITAL LETTER NJ, and the combined upper-lower
form, U+01CB LATIN CAPITAL LETTER N WITH SMALL
LETTER J, is considered a "titlecase" letter. Three Serbian
characters have a titlecase Latin form: љ (lje, which converts to
lj), њ (nje, which converts to nj), and џ (dzhe, which converts
to dž). These characters were the only three titlecase letters in
Unicode 2.x.
Unicode 3.0 added several Greek letters to this category. Some
early Greek texts represented certain diphthongs by writing a
small letter iota underneath the other vowel rather than after it.
For example, you'd see "ai" written as ᾳ. If you just capitalized
the alpha ("Ai"), you'd get the titlecase version: ᾼ. In the fully
uppercase version ("AI"), the small iota becomes a regular iota
again: ΑΙ. These characters are all in the Extended Greek
section of the standard and are used only in writing ancient
Greek texts. In modern Greek, these diphthongs are written
using a regular iota; for example, "ai" is written as αι.
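The three-way case distinction can be observed with Python's built-in string methods, which apply the Unicode case mappings. This is only a sketch: the variable names are invented here, and the behavior shown is that of Python's str.upper and str.title, not of any particular Serbian or Greek text-processing system.

```python
import unicodedata

# The three Latin code points for the nj digraph: NJ, Nj, nj.
for ch in ("\u01CA", "\u01CB", "\u01CC"):
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

small_nj = "\u01CC"            # LATIN SMALL LETTER NJ
all_caps = small_nj.upper()    # fully uppercase form
title = small_nj.title()       # titlecase form, as used at the start of a word

# Greek small alpha with iota subscript and its titlecase counterpart.
small_alpha_iota = "\u1FB3"
title_alpha_iota = "\u1FBC"
print(unicodedata.category(small_alpha_iota), unicodedata.category(title_alpha_iota))

# Uppercasing expands the single character to two: capital alpha, capital iota.
print([f"U+{ord(c):04X}" for c in small_alpha_iota.upper()])
```

Note that uppercasing the Greek character changes the length of the string, so code that assumes case conversion is one-to-one can go wrong here.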
Modifier Letter (Lm)
Just as some things you might conceptually think of as "letters"
(vowel signs in various languages) are classified as "marks" in
Unicode, the opposite also occurs. The modifier letters are
independent forms that don't combine typographically with the
characters around them, which is why Unicode doesn't classify
them as "marks" (Unicode marks, by definition, combine
typographically with their neighbors). Instead of carrying their
own sounds, the modifier letters generally modify the sounds of
their neighbors. In other words, conceptually they're diacritical
marks. Because they occur in the middle of words, most
text-analysis processes treat them as letters, so they're
classified as letters.
The Unicode modifier letters are generally either International
Phonetic Alphabet characters or characters that are used to
transliterate certain "real" letters in non-Latin writing systems
that don't seem to correspond to a regular Latin letter. For
example, U+02BC MODIFIER LETTER APOSTROPHE is typically
used to represent the glottal stop, the sound made by (or, more
accurately, the absence of sound represented by) the Arabic
letter alef, so the Arabic letter is often transliterated as this
character. Likewise, U+02B2 MODIFIER LETTER SMALL J is used
to represent palatalization, and thus is sometimes used in
transliteration as the counterpart of the Cyrillic soft sign.
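One practical consequence of the "letter" classification can be sketched with Python's re module, whose \w class accepts letters (including modifier letters) but not ordinary punctuation. The is_single_word helper and the sample transliteration are assumptions made for this example.

```python
import re
import unicodedata

MODIFIER_APOSTROPHE = "\u02BC"   # MODIFIER LETTER APOSTROPHE, category Lm
ASCII_APOSTROPHE = "\u0027"      # APOSTROPHE, category Po

def is_single_word(text):
    """True if the regex engine sees the whole string as one run of word characters."""
    return re.fullmatch(r"\w+", text) is not None

for mark in (MODIFIER_APOSTROPHE, ASCII_APOSTROPHE):
    word = "Qur" + mark + "an"
    print(unicodedata.category(mark), is_single_word(word))
```

With the modifier letter, the transliterated word stays in one piece; with the ASCII apostrophe, a word-based tokenizer would split it in two.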
Other Letter (Lo)
This catch-all category includes everything that's conceptually a
"letter," but that doesn't fit into one of the other "letter"
categories. Letters from uncased alphabets such as Arabic and
Hebrew fall into this category, as do syllables from syllabic
writing systems like Kana and Hangul and the Han ideographs.
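A quick check, assuming Python's unicodedata module, shows one sample character from each of the writing systems just mentioned landing in Lo. The sample characters were picked for this sketch and are not taken from the text.

```python
import unicodedata

# One arbitrarily chosen character per writing system.
samples = [
    "\u0627",  # Arabic letter alef
    "\u05D0",  # Hebrew letter alef
    "\u3042",  # Hiragana letter a
    "\uD55C",  # Hangul syllable han
    "\u6F22",  # a Han ideograph
]

for ch in samples:
    # Uncased letters are left unchanged by case conversion.
    unchanged = ch.upper() == ch == ch.lower()
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  uncased={unchanged}")
```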
Marks
Like letters, marks are part of words and carry linguistic
information. Unlike letters, marks combine typographically with
other characters. For example, U+0308 COMBINING DIAERESIS
may look like ¨ when shown alone, but is usually drawn on top
of the letter that precedes it. That is, U+0061 LATIN SMALL
LETTER A followed by U+0308 COMBINING DIAERESIS isn't
drawn as "a¨", but rather as "ä". All of the Unicode combining
marks do this kind of thing.
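The visual combination is the renderer's job, but the same pairing can be seen at the data level. The sketch below uses Unicode normalization (form NFC) from Python's unicodedata module to fold the two-code-point sequence into the single precomposed character; normalization is a related mechanism chosen for illustration, not something the passage above describes.

```python
import unicodedata

decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(f"U+{ord(composed):04X}", unicodedata.name(composed))

# The mark itself is a non-spacing mark with a nonzero combining class.
print(unicodedata.category("\u0308"), unicodedata.combining("\u0308"))
```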
Non-spacing Mark (Mn)
Most of the Unicode combining marks fall into this category.
Non-spacing marks don't take up any horizontal space along a
line of text; they combine completely with the character that
precedes them and fit entirely into that character's space. The
various diacritical marks used in European languages, such as
the acute and grave accents, the circumflex, the diaeresis, and
the cedilla, fall into this category.
Combining Spacing Mark (Mc)
Spacing combining marks interact typographically with their
neighbors, but still take up horizontal space along a line of text.
All of these characters are vowel signs or other diacritical marks
in the various Indian and Southeast Asian writing systems. For
example, U+093F DEVANAGARI VOWEL SIGN I (ि) is a spacing
combining mark. Thus U+0915 DEVANAGARI LETTER KA
followed by U+093F DEVANAGARI VOWEL SIGN I is drawn as
कि: the vowel sign attaches to the left-hand side of the
consonant.
Not all spacing combining marks reorder, however: U+0940
DEVANAGARI VOWEL SIGN II (ी) is also a combining spacing
mark. When it follows U+0915 DEVANAGARI LETTER KA, you
get की: the vowel attaches to the right-hand side of the
consonant, but the two combine typographically.
Enclosing Mark (Me)
Enclosing marks completely surround the characters they
modify. For example, U+20DD COMBINING ENCLOSING CIRCLE
is drawn as a ring around the character that precedes it. These
ten characters are generally used to create symbols.
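The three mark subcategories can be told apart programmatically. In this sketch the mark_kind helper and its labels are invented for illustration; the category codes come from Python's unicodedata module.

```python
import unicodedata

# Hypothetical labels for the three mark subcategories.
KINDS = {"Mn": "non-spacing", "Mc": "spacing combining", "Me": "enclosing"}

def mark_kind(ch):
    """Label a character by its mark subcategory, or say it is not a mark."""
    return KINDS.get(unicodedata.category(ch), "not a mark")

for ch in ["\u0301", "\u0308", "\u093F", "\u0940", "\u20DD", "a"]:
    print(f"U+{ord(ch):04X}  {mark_kind(ch)}")
```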
Numbers
The Unicode characters that represent numeric quantities are
given the "number" property (technically, it should be called the
"numeral" property, but that's life). The characters in these
categories have additional properties that govern their
interpretation as numerals. This category is subdivided as
follows.
Decimal-Digit Number (Nd)
The characters in this category can be used as decimal digits.
This category includes not only the digits with which we're all
familiar ("0123456789"), but similar sets of digits used with
other writing systems, such as the Thai digits
("๐๑๒๓๔๕๖๗๘๙").
Letter Number (Nl)
The characters in this category can be either letters or
numerals. Many are compatibility composites whose
decompositions consist of letters. The Roman numerals and the
Hangzhou numerals are the only characters in this category.
Other Number (No)
All of the characters that belong in the "number" category, but
not in one of the other subcategories, fall into this one. This
category includes various numeric presentation forms, such as
superscripts, subscripts, and circled numbers; various fractions;
and numerals used in various numeration systems other than
the Arabic positional notation used in the West.
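The "additional properties that govern their interpretation as numerals" can be inspected for all three number subcategories. The sketch below assumes Python's unicodedata module; the sample characters are chosen for illustration. It also relies on the fact that Python's int() accepts any run of Nd digits, which is a property of Python rather than of the Unicode standard.

```python
import unicodedata

# Nd digits from another script parse as ordinary decimal numbers in Python.
thai = "\u0E51\u0E52\u0E53"      # Thai digits one, two, three
value = int(thai)
print(value)

samples = [
    "7",        # DIGIT SEVEN
    "\u0E53",   # THAI DIGIT THREE
    "\u216B",   # ROMAN NUMERAL TWELVE
    "\u3021",   # HANGZHOU NUMERAL ONE
    "\u00BD",   # VULGAR FRACTION ONE HALF
    "\u00B2",   # SUPERSCRIPT TWO
    "\u2460",   # CIRCLED DIGIT ONE
]
for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.numeric(ch)}")
```

Only the Nd characters work as positional digits: unicodedata.decimal succeeds for them, while a character such as the superscript two has a numeric value but is rejected by int().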
Punctuation
This category attempts to make sense of the various
punctuation characters in Unicode. It breaks down as follows.
Opening Punctuation (Ps)
For punctuation marks, such as parentheses and brackets, that
occur in opening-closing pairs, the "opening" characters in
these pairs are assigned to this category.
Closing Punctuation (Pe)
For punctuation marks, such as parentheses and brackets, that
occur in opening-closing pairs, the "closing" characters in these
pairs are assigned to this category.
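Because the pairing information is carried by the category, a nesting check does not need a hard-coded list of bracket characters. The nesting_is_balanced function below is a minimal sketch invented for this example; it only counts depth and deliberately does not verify that each closer matches the kind of its opener.

```python
import unicodedata

def nesting_is_balanced(text):
    """Track nesting depth using only the Ps and Pe general categories."""
    depth = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Ps":
            depth += 1
        elif category == "Pe":
            depth -= 1
            if depth < 0:          # a closer appeared before any opener
                return False
    return depth == 0

print(nesting_is_balanced("f(a[0])"))
print(nesting_is_balanced("\u300Cquote\u300D"))   # CJK corner brackets
print(nesting_is_balanced(")("))
```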
Initial-Quote Punctuation (Pi)
Quotation marks occur in opening-closing pairs, just like
parentheses do. The problem is that which is which depends on
the language. For example, both French and German use
quotation marks that look like this: «», but they use them
differently.
« In French, a quotation is set off like this. »
»But in German, a quotation is often set off like this.«
This category is equivalent to either Ps or Pe, depending on the
language.
Final-Quote Punctuation (Pf)
The counterpart to the Pi category, Pf is also used with
quotation marks whose usage varies depending on language.
It's equivalent to either Ps or Pe depending on language. It's
always the opposite of Pi.
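The character data assigns each quotation mark a fixed category, whatever the language of the surrounding text; deciding which mark actually opens a quotation is left to the application. A small sketch, assuming Python's unicodedata module:

```python
import unicodedata

quotes = [
    "\u00AB",   # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00BB",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u201C",   # LEFT DOUBLE QUOTATION MARK
    "\u201D",   # RIGHT DOUBLE QUOTATION MARK
    "\u2018",   # LEFT SINGLE QUOTATION MARK
    "\u2019",   # RIGHT SINGLE QUOTATION MARK
]
for ch in quotes:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```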
Dash Punctuation (Pd)
This category is self-explanatory. It includes all hyphens and
dashes.
Connector Punctuation (Pc)
Characters in this category, such as the middle dot and the
underscore, get treated as part of the word in which they
appear. That is, they "connect" series of letters together into
single words: This_is_all_one_word. An important example is
U+30FB KATAKANA MIDDLE DOT, which is used like a hyphen in
Japanese.
Other Punctuation (Po)
Punctuation marks that don't fit into any of the other
subcategories, including obvious things like the period, comma,
and question mark, fall into this category.
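The dash, connector, and "other" subcategories can be looked up the same way. The sketch below groups a few sample characters by category; the samples and the by_category name are chosen for this example. Category assignments for individual characters can change between Unicode versions, so the sketch queries the data rather than hard-coding a list.

```python
import unicodedata
from collections import defaultdict

samples = ["-", "\u2013", "\u2014", "_", "\u203F", ".", ",", "?"]

# Group the sample characters by their two-letter category code.
by_category = defaultdict(list)
for ch in samples:
    by_category[unicodedata.category(ch)].append(ch)

for category in sorted(by_category):
    names = [unicodedata.name(ch) for ch in by_category[category]]
    print(category, names)
```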
Symbols
This group of categories contains various symbols.
Currency Symbol (Sc)
Self-explanatory.
Mathematical Symbol (Sm)
Mathematical operators.
Modifier Symbol (Sk)
This category contains two main types of characters: the
"spacing" versions of the combining marks and a few other
symbols whose purpose is to modify the meaning of the
preceding character in some way. Unlike modifier letters,
modifier symbols don't necessarily modify the meanings of
letters, and they don't necessarily get counted as parts of
words.
Other Symbol (So)
This category contains all symbols that didn't fit into one of the
other categories.
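A few representative symbols, looked up with Python's unicodedata module, show how the four symbol subcategories divide. The sample characters are assumptions made for this sketch.

```python
import unicodedata

symbols = {
    "$": "dollar sign",
    "\u20AC": "euro sign",
    "+": "plus sign",
    "\u2211": "n-ary summation",
    "^": "circumflex accent (spacing)",
    "\u00A8": "diaeresis (spacing)",
    "\u00A9": "copyright sign",
    "\u2603": "snowman",
}
for ch, label in symbols.items():
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {label}")
```

The spacing circumflex and diaeresis come out as Sk, in contrast to their combining counterparts, which are marks.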
Separators
These characters mark the boundaries between units of text.
Space Separator (Zs)
This category includes all of the space characters (yes, there's
more than one space character).
Paragraph Separator (Zp)
There is exactly one character in this category: the Unicode
paragraph separator (U+2029). As its name suggests, it marks
the boundary between paragraphs.
Line Separator (Zl)
There's also only one character in this category: the Unicode
line separator (U+2028). As its name suggests, it forces a line
break without ending a paragraph.
Even though the ASCII carriage-return and line-feed characters
are often used as line and paragraph separators, they're not
placed in either of these categories. Likewise, the ASCII tab
character isn't considered a Unicode space character, even
though it probably should be. They're all put in the "Cc"
category.
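The split between the separator categories and the ASCII control characters is easy to confirm. This sketch, assuming Python's unicodedata module, collects the Zs characters in the Basic Multilingual Plane and prints the categories of the line-breaking characters; the variable names are invented here.

```python
import unicodedata

# Every space separator up to U+FFFF in this Python's Unicode data.
spaces = [chr(cp) for cp in range(0x10000)
          if unicodedata.category(chr(cp)) == "Zs"]
print(len(spaces), [f"U+{ord(ch):04X}" for ch in spaces])

# The dedicated separators versus the ASCII controls often used instead.
for ch in ["\u2028", "\u2029", "\r", "\n", "\t"]:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}")
```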
Miscellaneous
A number of special character categories don't really fit in with
the others.
Control Characters (Cc)
The codes corresponding to the C0 and C1 control characters
from the ISO 2022 standard appear in this category. The
Unicode standard doesn't officially assign any semantics to
these characters (which include the ASCII control characters),
but most systems that use Unicode text treat these characters
the same way as they treat their counterparts in the source
standards. For example, most processes treat the ASCII
line-feed character as a line or paragraph separator.
The original idea was to leave the definitions of these code
points open, as ISO 2022 does. Over time, however, various
Unicode processes and algorithms have attached semantics to
these code points, effectively nailing the ISO 6429 semantics to
many of them.
Formatting Characters (Cf)
Unicode includes some "control" characters of its own:
characters with no visual representation of their own that are
used to control how the characters around them are drawn or
handled by various processes. These characters are assigned to
this category.
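Because control and formatting characters are invisible, it is often useful to detect them by category. The strip_format_characters helper below is a hypothetical example, not a recommended cleanup step: removing Cf characters can change how text is shaped or ordered, so treat it purely as an illustration of category-based filtering.

```python
import unicodedata

def strip_format_characters(text):
    """Drop every character whose general category is Cf."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# C0 and C1 controls versus Unicode's own formatting characters.
for ch in ["\n", "\u0085", "\u00AD", "\u200D", "\u200E", "\uFEFF"]:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}")

# A soft hyphen (U+00AD) hidden inside a word.
print(strip_format_characters("co\u00ADoperate"))
```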
Surrogates (Cs)
The code points in the UTF-16 surrogate range belong to this
category. Technically, the code points in the surrogate range are
treated as unassigned and reserved, but Unicode
implementations based on UTF-16 often treat them as
characters, handling surrogate pairs the same way as
combining character sequences are handled.
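For readers who have not met surrogate pairs, the arithmetic that UTF-16 uses to split a supplementary code point into two surrogate code units can be sketched as follows. The to_surrogate_pair function is a name made up for this example, and U+1F600 is an arbitrary supplementary code point.

```python
import unicodedata

def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000            # 20 significant bits
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Both halves fall in the surrogate range and carry the Cs category.
print(unicodedata.category(chr(high)), unicodedata.category(chr(low)))

# Cross-check against Python's own UTF-16 encoder.
print("\U0001F600".encode("utf-16-be").hex())
```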
Private-Use Characters (Co)
The code points in the private-use ranges are assigned to this
category.
Unassigned Code Points (Cn)
All unassigned and noncharacter code points, other than those
in the surrogate range, are given this category. These code
points aren't listed in the main Unicode Character Database
file, UnicodeData.txt (their omission gives them this category),
but are listed explicitly in DerivedGeneralCategory.txt.
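Both categories can be observed from Python, which returns a category even for code points that have no name. In this sketch the noncharacters U+FFFF and U+FDD0 stand in for the Cn category, because they can never be assigned and so give the same answer in every Unicode version.

```python
import unicodedata

private_use = ["\uE000", "\U000F0000", "\U0010FFFD"]
noncharacters = ["\uFFFF", "\uFDD0"]

for ch in private_use + noncharacters:
    # name() takes a default so that unnamed code points don't raise ValueError.
    name = unicodedata.name(ch, None)
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  name={name}")
```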