ISO/IEC 8859

From Knowino


ISO/IEC 8859 is a collection of fifteen different 8-bit character encodings. By definition, an 8-bit character encoding assigns a unique number between 0 and 255 to each character it covers. The first ISO/IEC 8859 encodings were designed in the mid-1980s by the European Computer Manufacturers Association (ECMA)^[1] and endorsed by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC).

The ISO/IEC 8859 collection consists of numbered parts: ISO/IEC 8859-1 through ISO/IEC 8859-16. Each part serves languages that share a set of letters; for example, part 6 covers most characters of the Arabic language (see Table 1 for an overview). Part 12, ISO/IEC 8859-12, was intended for Latin/Devanagari but was abandoned, which is why the collection has fifteen parts rather than sixteen.
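As a rough illustration of the "one number per character" idea, the sketch below decodes the same byte value under several parts. It assumes Python's built-in codecs named iso8859_N, which are a convenience of that language and not part of the standard; the helper name decode_byte is hypothetical.

```python
# The fifteen published parts; part 12 was never issued.
PARTS = [n for n in range(1, 17) if n != 12]


def decode_byte(value: int, part: int) -> str:
    """Decode a single byte under ISO/IEC 8859-<part>.

    Positions that Python's codec treats as unassigned come back as
    U+FFFD, the Unicode replacement character.
    """
    return bytes([value]).decode(f"iso8859_{part}", errors="replace")


if __name__ == "__main__":
    # One number (228, hex E4), four different characters.
    for part in (1, 5, 6, 7):
        print(f"part {part}: {decode_byte(0xE4, part)}")
```

The same number therefore means a different letter in each part, which is why a document has to state which part it uses.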

The ASCII codes (0 through 127) are often regarded as part of ISO/IEC 8859. The first 32 ASCII codes are control characters; they form control character set 0, referred to as C0. The codes from 128 through 159 (hexadecimal: 0x80 – 0x9F) constitute control set C1 of ISO/IEC 8859. The Windows Latin character set (Windows code page 1252) uses many of the positions in control set C1 for printable characters. Thus, the Windows encoding of positions 128 through 159 is completely different from the Latin-1 (ISO/IEC 8859-1) encoding. From character 160 (non-breaking space) through 255 (ÿ), however, Windows code page 1252 is identical to Latin-1.^[2] The extended ASCII set used by DOS, on the other hand, is completely different between 128 and 255, but coincides again with ASCII for the characters below 128.
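The relationship between these encodings can be checked with a short sketch. It assumes Python's built-in "latin-1", "cp1252" and "cp437" codecs; taking code page 437 as "the" DOS set is an assumption made for the example, since DOS shipped several code pages.

```python
# Positions 160..255: Windows-1252 and Latin-1 agree byte for byte.
high = bytes(range(0xA0, 0x100))
same_high = high.decode("cp1252") == high.decode("latin-1")

# Position 128 lies in control set C1 for Latin-1,
# but Windows-1252 puts a printable character there.
as_latin1 = b"\x80".decode("latin-1")   # control character U+0080
as_cp1252 = b"\x80".decode("cp1252")    # euro sign

# DOS code page 437 (assumed here) matches ASCII below 128 ...
low = bytes(range(0x80))
dos_matches_ascii = low.decode("cp437") == low.decode("ascii")
# ... but assigns other characters above it, e.g. position 129.
dos_129 = b"\x81".decode("cp437")

if __name__ == "__main__":
    print(same_high, repr(as_latin1), as_cp1252, dos_matches_ascii, dos_129)
```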

ISO and IEC are also responsible for ISO 10646 (UCS, Universal Character Set), a much more ambitious and elaborate character encoding than ISO/IEC 8859. UCS is kept synchronized with Unicode, the standard of the Unicode Consortium. Latin-1 (ISO/IEC 8859-1) has been adopted as the first 256 code points of ISO 10646 and Unicode.^[3]
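A small check of the Latin-1/Unicode correspondence, sketched with Python's built-in "latin-1" and "utf-8" codecs (an assumption about tooling): every Latin-1 byte value equals the Unicode code point of the character it encodes, although the UTF-8 byte sequence differs above 127.

```python
# Each Latin-1 byte value b decodes to the character with code point b.
latin1_is_unicode_prefix = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(256)
)

# The code point is shared, the stored bytes are not:
# Latin-1 uses one byte for U+00FF, UTF-8 uses two.
y_latin1 = "ÿ".encode("latin-1")
y_utf8 = "ÿ".encode("utf-8")

if __name__ == "__main__":
    print(latin1_is_unicode_prefix, y_latin1, y_utf8)
```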

On the World Wide Web, use of Unicode UTF-8 has grown almost exponentially.^[4] As of 2011, ISO/IEC 8859-1 is still important but is declining on the Web.^[5]

Table 1. Parts of ISO/IEC 8859
Part 1	Latin-1 Western European	Covers most Western European languages, and also Albanian, Indonesian, Afrikaans, and Swahili. The missing euro sign and capital Ÿ are in the revised version ISO/IEC 8859-15 (positions 164 and 190). The IANA character set ISO-8859-1 is the default encoding for documents received via HTTP when the document's media type is "text" (as in "text/html").
Part 2	Latin-2 Central European	Supports those Central and Eastern European languages that use the Latin alphabet.
Part 3	Latin-3 South European	Turkish, Maltese, and Esperanto. Largely superseded by ISO/IEC 8859-9 for Turkish and Unicode for Esperanto.
Part 4	Latin-4 North European	Estonian, Latvian, Lithuanian, Greenlandic, and Sami.
Part 5	Latin/Cyrillic	Covers mostly Slavic languages that use a Cyrillic alphabet.
Part 6	Latin/Arabic	Covers the most common Arabic language characters. Does not support other languages using the Arabic script.
Part 7	Latin/Greek	Covers the modern Greek language. Can also be used for Ancient Greek written without accents.
Part 8	Latin/Hebrew	Covers the modern Hebrew alphabet as used in Israel.
Part 9	Latin-5 Turkish	Largely the same as ISO/IEC 8859-1, replacing the rarely used Icelandic letters with Turkish ones.
Part 10	Latin-6 Nordic	A rearrangement of Latin-4, considered more useful for Nordic languages. Baltic languages prefer Latin-4.
Part 11	Latin/Thai	Contains characters needed for the Thai language. Virtually identical to TIS 620.
Part 13	Latin-7 Baltic Rim	Added some characters for Baltic languages which were missing from Latin-4 and Latin-6.
Part 14	Latin-8 Celtic	Covers Celtic languages such as Gaelic and the Breton language.
Part 15	Latin-9	A revision of 8859-1 that removes some little-used symbols, replacing them with the euro sign € and the letters Š, š, Ž, ž, Œ, œ, and Ÿ.
Part 16	Latin-10 South-Eastern European	Intended for Albanian, Croatian, Hungarian, Italian, Polish, Romanian and Slovene.
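The remarks in Table 1 about parts 9 and 15 being close relatives of part 1 can be made concrete by listing the positions where they differ. This is a sketch assuming Python's built-in iso8859_N codecs; the helper name differing_positions is hypothetical.

```python
def differing_positions(codec_a: str, codec_b: str) -> list[int]:
    """Return the positions in 160..255 that two codecs decode differently."""
    diffs = []
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        # errors="replace" keeps unassigned positions from raising.
        if raw.decode(codec_a, errors="replace") != raw.decode(codec_b, errors="replace"):
            diffs.append(value)
    return diffs


# Latin-9 (part 15) versus Latin-1: eight replaced symbols.
latin9_changes = differing_positions("iso8859_1", "iso8859_15")
# Latin-5 (part 9) versus Latin-1: the six Icelandic letters.
turkish_changes = differing_positions("iso8859_1", "iso8859_9")

if __name__ == "__main__":
    print([hex(v) for v in latin9_changes])
    print([hex(v) for v in turkish_changes])
    print("€".encode("iso8859_15")[0], "Ÿ".encode("iso8859_15")[0])
```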

Table 2 lists all the characters in the different parts. The columns are organized such that it is relatively easy to switch between character sets. For example, the German umlauts ä, ö, and ü and the scharfes S ß are found at exactly the same positions in Latin-1, Latin-2, Latin-3, Latin-4, Latin-5 (column 9), and Latin-6 (column 10). Thus one can write German/Polish with Latin-2 or German/Turkish with Latin-5.
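The shared positions of the German letters can be verified with a brief sketch, again assuming Python's built-in iso8859_N codecs. The sample strings are made up for the example.

```python
GERMAN = "ÄÖÜäöüß"

# Latin-N name -> Python codec for the corresponding part.
LATIN_PARTS = {
    "Latin-1": "iso8859_1",
    "Latin-2": "iso8859_2",
    "Latin-3": "iso8859_3",
    "Latin-4": "iso8859_4",
    "Latin-5": "iso8859_9",
    "Latin-6": "iso8859_10",
}

encoded = {name: GERMAN.encode(codec) for name, codec in LATIN_PARTS.items()}
# True when every part yields the same byte sequence for the German letters.
all_same = len(set(encoded.values())) == 1

# Mixed German/Polish fits in Latin-2, mixed German/Turkish in Latin-5.
german_polish = "Grüße aus Łódź".encode("iso8859_2")
german_turkish = "Grüße aus İstanbul".encode("iso8859_9")

if __name__ == "__main__":
    print(all_same, encoded["Latin-1"].hex(" "))
```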

The HTML version of Table 2 is prepared in Unicode UTF-8. Two examples: the Latin-3 character H with stroke (column 3, row 161, U+0126) can be written as a numeric character reference such as &#x0126;, which renders as Ħ.^[6] The Thai digit 8 (column 11, row 248, U+0E58) can be written as &#x0E58;, which renders as ๘.^[7]
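The two examples can be reproduced with a minimal sketch: decode the byte from the relevant part, then format its Unicode code point as an HTML numeric character reference. Python's built-in codecs and its html module are assumed; the helper name char_ref is hypothetical.

```python
import html

# Column 3, row 161 and column 11, row 248 of Table 2.
h_stroke = b"\xa1".decode("iso8859_3")
thai_eight = b"\xf8".decode("iso8859_11")


def char_ref(ch: str) -> str:
    """Format one character as a hexadecimal numeric character reference."""
    return f"&#x{ord(ch):04X};"


if __name__ == "__main__":
    for ch in (h_stroke, thai_eight):
        ref = char_ref(ch)
        # html.unescape turns the reference back into the character.
        print(ref, html.unescape(ref))
```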

Table 2. Comparison of the various parts of ISO/IEC 8859
Dec	Hex	Binary	1	2	3	4	5	6	7	8	9	10	11	13	14	15	16
160	A0	1010 0000	Non-breaking space (NBSP)
161	A1	1010 0001	¡	Ą	Ħ	Ą	Ё		‘		¡	Ą	ก	”	Ḃ	¡	Ą
162	A2	1010 0010	¢	˘		ĸ	Ђ		’	¢	¢	Ē	ข	¢	ḃ	¢	ą
163	A3	1010 0011	£	Ł	£	Ŗ	Ѓ		£			Ģ	ฃ	£			Ł
164	A4	1010 0100	¤				Є	¤	€	¤		Ī	ค	¤	Ċ	€
165	A5	1010 0101	¥	Ľ		Ĩ	Ѕ		₯	¥		Ĩ	ฅ	„	ċ	¥	„
166	A6	1010 0110	¦	Ś	Ĥ	Ļ	І		¦			Ķ	ฆ	¦	Ḋ	Š
167	A7	1010 0111	§				Ї		§				ง	§
168	A8	1010 1000	¨				Ј		¨			Ļ	จ	Ø	Ẁ	š
169	A9	1010 1001	©	Š	İ	Š	Љ		©			Đ	ฉ	©
170	AA	1010 1010	ª	Ş		Ē	Њ		ͺ	×	ª	Š	ช	Ŗ	Ẃ	ª	Ș
171	AB	1010 1011	«	Ť	Ğ	Ģ	Ћ		«			Ŧ	ซ	«	ḋ	«
172	AC	1010 1100	¬	Ź	Ĵ	Ŧ	Ќ	،	¬			Ž	ฌ	¬	Ỳ	¬	Ź
173	AD	1010 1101	soft hyphen (SHY)										ญ	SHY
174	AE	1010 1110	®	Ž		Ž	Ў			®		Ū	ฎ	®			ź
175	AF	1010 1111	¯	Ż		¯	Џ		―	¯		Ŋ	ฏ	Æ	Ÿ	¯	Ż
176	B0	1011 0000	°				А		°				ฐ	°	Ḟ	°
177	B1	1011 0001	±	ą	ħ	ą	Б		±			ą	ฑ	±	ḟ	±
178	B2	1011 0010	²	˛	²	˛	В		²			ē	ฒ	²	Ġ	²	Č
179	B3	1011 0011	³	ł	³	ŗ	Г		³			ģ	ณ	³	ġ	³	ł
180	B4	1011 0100	´				Д		΄	´		ī	ด	“	Ṁ	Ž
181	B5	1011 0101	µ	ľ	µ	ĩ	Е		΅	µ		ĩ	ต	µ	ṁ	µ	”
182	B6	1011 0110	¶	ś	ĥ	ļ	Ж		Ά	¶		ķ	ถ	¶
183	B7	1011 0111	·	ˇ	·	ˇ	З		·				ท	·	Ṗ	·
184	B8	1011 1000	¸				И		Έ	¸		ļ	ธ	ø	ẁ	ž
185	B9	1011 1001	¹	š	ı	š	Й		Ή	¹		đ	น	¹	ṗ	¹	č
186	BA	1011 1010	º	ş		ē	К		Ί	÷	º	š	บ	ŗ	ẃ	º	ș
187	BB	1011 1011	»	ť	ğ	ģ	Л	؛	»			ŧ	ป	»	Ṡ	»
188	BC	1011 1100	¼	ź	ĵ	ŧ	М		Ό	¼		ž	ผ	¼	ỳ	Œ
189	BD	1011 1101	½	˝	½	Ŋ	Н		½			―	ฝ	½	Ẅ	œ
190	BE	1011 1110	¾	ž		ž	О		Ύ	¾		ū	พ	¾	ẅ	Ÿ
191	BF	1011 1111	¿	ż		ŋ	П	؟	Ώ		¿	ŋ	ฟ	æ	ṡ	¿	ż
192	C0	1100 0000	À	Ŕ	À	Ā	Р		ΐ		À	Ā	ภ	Ą	À
193	C1	1100 0001	Á				С	ء	Α		Á		ม	Į	Á
194	C2	1100 0010	Â				Т	آ	Β		Â		ย	Ā	Â
195	C3	1100 0011	Ã	Ă		Ã	У	أ	Γ		Ã		ร	Ć	Ã		Ă
196	C4	1100 0100	Ä				Ф	ؤ	Δ		Ä		ฤ	Ä
197	C5	1100 0101	Å	Ĺ	Ċ	Å	Х	إ	Ε		Å		ล	Å			Ć
198	C6	1100 0110	Æ	Ć	Ĉ	Æ	Ц	ئ	Ζ		Æ		ฦ	Ę	Æ
199	C7	1100 0111	Ç			Į	Ч	ا	Η		Ç	Į	ว	Ē	Ç
200	C8	1100 1000	È	Č	È	Č	Ш	ب	Θ		È	Č	ศ	Č	È
201	C9	1100 1001	É				Щ	ة	Ι		É		ษ	É
202	CA	1100 1010	Ê	Ę	Ê	Ę	Ъ	ت	Κ		Ê	Ę	ส	Ź	Ê
203	CB	1100 1011	Ë				Ы	ث	Λ		Ë		ห	Ė	Ë
204	CC	1100 1100	Ì	Ě	Ì	Ė	Ь	ج	Μ		Ì	Ė	ฬ	Ģ	Ì
205	CD	1100 1101	Í				Э	ح	Ν		Í		อ	Ķ	Í
206	CE	1100 1110	Î				Ю	خ	Ξ		Î		ฮ	Ī	Î
207	CF	1100 1111	Ï	Ď	Ï	Ī	Я	د	Ο		Ï		ฯ	Ļ	Ï
208	D0	1101 0000	Ð	Đ		Đ	а	ذ	Π		Ğ	Ð	ะ	Š	Ŵ	Ð
209	D1	1101 0001	Ñ	Ń	Ñ	Ņ	б	ر	Ρ		Ñ	Ņ	ั	Ń	Ñ		Ń
210	D2	1101 0010	Ò	Ň	Ò	Ō	в	ز			Ò	Ō	า	Ņ	Ò
211	D3	1101 0011	Ó			Ķ	г	س	Σ		Ó		ำ	Ó
212	D4	1101 0100	Ô				д	ش	Τ		Ô		ิ	Ō	Ô
213	D5	1101 0101	Õ	Ő	Ġ	Õ	е	ص	Υ		Õ		ี	Ő
214	D6	1101 0110	Ö				ж	ض	Φ		Ö		ึ	Ö
215	D7	1101 0111	×				з	ط	Χ		×	Ũ	ื	×	Ṫ	×	Ś
216	D8	1101 1000	Ø	Ř	Ĝ	Ø	и	ظ	Ψ		Ø		ุ	Ų	Ø		Ű
217	D9	1101 1001	Ù	Ů	Ù	Ų	й	ع	Ω		Ù	Ų	ู	Ł	Ù
218	DA	1101 1010	Ú				к	غ	Ϊ		Ú		ฺ	Ś	Ú
219	DB	1101 1011	Û	Ű	Û		л		Ϋ		Û			Ū	Û
220	DC	1101 1100	Ü				м		ά		Ü			Ü
221	DD	1101 1101	Ý		Ŭ	Ũ	н		έ		İ	Ý		Ż	Ý		Ę
222	DE	1101 1110	Þ	Ţ	Ŝ	Ū	о		ή		Ş	Þ		Ž	Ŷ	Þ	Ț
223	DF	1101 1111	ß				п		ί	‗	ß		฿	ß
224	E0	1110 0000	à	ŕ	à	ā	р	ـ	ΰ	א	à	ā	เ	ą	à
225	E1	1110 0001	á				с	ف	α	ב	á		แ	į	á
226	E2	1110 0010	â				т	ق	β	ג	â		โ	ā	â
227	E3	1110 0011	ã	ă		ã	у	ك	γ	ד	ã		ใ	ć	ã		ă
228	E4	1110 0100	ä				ф	ل	δ	ה	ä		ไ	ä
229	E5	1110 0101	å	ĺ	ċ	å	х	م	ε	ו	å		ๅ	å			ć
230	E6	1110 0110	æ	ć	ĉ	æ	ц	ن	ζ	ז	æ		ๆ	ę	æ
231	E7	1110 0111	ç			į	ч	ه	η	ח	ç	į	็	ē	ç
232	E8	1110 1000	è	č	è	č	ш	و	θ	ט	è	č	่	č	è
233	E9	1110 1001	é				щ	ى	ι	י	é		้	é
234	EA	1110 1010	ê	ę	ê	ę	ъ	ي	κ	ך	ê	ę	๊	ź	ê
235	EB	1110 1011	ë				ы	ً	λ	כ	ë		๋	ė	ë
236	EC	1110 1100	ì	ě	ì	ė	ь	ٌ	μ	ל	ì	ė	์	ģ	ì
237	ED	1110 1101	í				э	ٍ	ν	ם	í		ํ	ķ	í
238	EE	1110 1110	î				ю	َ	ξ	מ	î		๎	ī	î
239	EF	1110 1111	ï	ď	ï	ī	я	ُ	ο	ן	ï		๏	ļ	ï
240	F0	1111 0000	ð	đ		đ	№	ِ	π	נ	ğ	ð	๐	š	ŵ	ð	đ
241	F1	1111 0001	ñ	ń	ñ	ņ	ё	ّ	ρ	ס	ñ	ņ	๑	ń	ñ		ń
242	F2	1111 0010	ò	ň	ò	ō	ђ	ْ	ς	ע	ò	ō	๒	ņ	ò
243	F3	1111 0011	ó			ķ	ѓ		σ	ף	ó		๓	ó
244	F4	1111 0100	ô				є		τ	פ	ô		๔	ō	ô
245	F5	1111 0101	õ	ő	ġ	õ	ѕ		υ	ץ	õ		๕	ő
246	F6	1111 0110	ö				і		φ	צ	ö		๖	ö
247	F7	1111 0111	÷				ї		χ	ק	÷	ũ	๗	÷	ṫ	÷	ś
248	F8	1111 1000	ø	ř	ĝ	ø	ј		ψ	ר	ø		๘	ų	ø		ű
249	F9	1111 1001	ù	ů	ù	ų	љ		ω	ש	ù	ų	๙	ł	ù
250	FA	1111 1010	ú				њ		ϊ	ת	ú		๚	ś	ú
251	FB	1111 1011	û	ű	û		ћ		ϋ		û		๛	ū	û
252	FC	1111 1100	ü				ќ		ό		ü			ü
253	FD	1111 1101	ý		ŭ	ũ	§		ύ	LRM	ı	ý		ż	ý		ę
254	FE	1111 1110	þ	ţ	ŝ	ū	ў		ώ	RLM	ş	þ		ž	ŷ	þ	ț
255	FF	1111 1111	ÿ	˙			џ				ÿ	ĸ		’	ÿ

Row 160 gives the non-breaking space (HTML: &nbsp;) and row 173 gives, except in column 11 (Thai), the soft hyphen (HTML: &shy;), which is shown only at line breaks. Other empty fields are unassigned.

LRM stands for left-to-right mark (U+200E) and RLM stands for right-to-left mark (U+200F).
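These notes on rows 160, 173, 253 and 254 can be checked against Python's built-in iso8859_N codecs (an assumption about tooling; the codecs follow the published mapping tables):

```python
# The fifteen published parts; part 12 was never issued.
PARTS = [n for n in range(1, 17) if n != 12]

# Row 160: non-breaking space (U+00A0) in every part.
nbsp_everywhere = all(
    b"\xa0".decode(f"iso8859_{n}") == "\u00a0" for n in PARTS
)

# Row 173: soft hyphen (U+00AD) in every part except the Thai one.
shy_exceptions = [
    n for n in PARTS if b"\xad".decode(f"iso8859_{n}") != "\u00ad"
]

# Rows 253 and 254 of column 8 (Hebrew): the two directional marks.
lrm, rlm = b"\xfd\xfe".decode("iso8859_8")

if __name__ == "__main__":
    print(nbsp_everywhere, shy_exceptions, hex(ord(lrm)), hex(ord(rlm)))
```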

References

[1] March 1985
[2] Although the Windows Western character set is often called "ANSI character set" (code page 1252), it has not been approved by the American National Standards Institute.
[3] Code chart U0000.pdf Latin (ASCII) and Code chart U0080.pdf Latin-1 Supplement
[4] Google blog 1/28/2010
[5] Trends July 2011
[6] Code chart U0100.pdf Latin Extended-A
[7] Code chart U0E00.pdf Thai

Some content on this page previously appeared on Wikipedia.
