ISO/IEC 8859-1

ISO/IEC 8859-1:1998, Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. It is informally referred to as Latin-1. It is generally intended for “Western European” languages (see below for a list).

ISO-8859-1 is the IANA preferred charset name for this standard when supplemented with the control codes from ISO/IEC 6429 for the C0 (0x00-0x1F) and C1 (0x80-0x9F) parts. Escape sequences (from ISO/IEC 6429 or ISO/IEC 2022) are not to be interpreted.

The Windows-1252 codepage coincides with ISO-8859-1 in the code ranges 0x00 to 0x7F and 0xA0 to 0xFF, but not for the range 0x80 to 0x9F.
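The overlap described above can be checked mechanically. The following is a minimal sketch, assuming that Python's built-in `latin-1` and `cp1252` codecs faithfully implement ISO-8859-1 and Windows-1252; the helper name `differing_bytes` is invented for this illustration.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list[int]:
    """Return the byte values that two single-byte codecs decode differently."""
    result = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" turns bytes that a codec leaves undefined into U+FFFD,
        # so they are counted as differences instead of raising an exception.
        if raw.decode(codec_a, errors="replace") != raw.decode(codec_b, errors="replace"):
            result.append(value)
    return result


diff = differing_bytes("latin-1", "cp1252")
print(f"{len(diff)} differing bytes, from 0x{diff[0]:02X} to 0x{diff[-1]:02X}")
# prints: 32 differing bytes, from 0x80 to 0x9F
```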

Coverage

ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East Asian languages.

Each character is encoded as a single eight-bit code value. These code values can be used in almost any data interchange system to communicate in the following European languages (with a few exceptions due to missing characters, as noted):

Modern languages with complete coverage of their alphabet
Languages commonly supported with nearly complete coverage of their alphabet
  • Dutch (missing the single-character forms Ĳ, ĳ, but these should always be represented as the two letters IJ or ij in electronic form)
  • Estonian (missing Š, š, Ž, ž for loan words)
    • Note that Windows-1252 and ISO-8859-15 do contain these
  • Old English and French (missing Œ, œ and the very rare Ÿ; they are generally replaced by 'OE' and 'oe' without the normally required ligature, and 'Y' without the diaeresis)
    • Note that Windows-1252 and ISO-8859-15 do contain these
  • Finnish (missing Š, š, Ž, ž for loan words)
    • Note that Windows-1252 and ISO-8859-15 do contain these
  • Hungarian (missing Ő, ő, Ű, ű)
  • Welsh (missing Ŵ, ŵ, Ŷ, ŷ)
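The gaps listed above can be reproduced with a short script. This is only a sketch: it assumes Python's `latin-1`, `iso8859-15`, and `cp1252` codecs as stand-ins for the three encodings, and the names `SAMPLES`, `CODECS`, and `encodable` are made up for the example.

```python
# Characters named in the list above, written as escapes so the source stays ASCII.
SAMPLES = {
    "Estonian/Finnish": "\u0160\u0161\u017d\u017e",  # S and Z with caron, both cases
    "French": "\u0152\u0153\u0178",                  # OE ligatures, Y with diaeresis
    "Hungarian": "\u0150\u0151\u0170\u0171",         # O and U with double acute
    "Welsh": "\u0174\u0175\u0176\u0177",             # W and Y with circumflex
}
CODECS = ("latin-1", "iso8859-15", "cp1252")


def encodable(text: str, codec: str) -> bool:
    """True if every character of text has a code value in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


coverage = {
    language: {codec: encodable(text, codec) for codec in CODECS}
    for language, text in SAMPLES.items()
}
for language, row in coverage.items():
    print(language, row)
```

Under these assumptions, the Estonian, Finnish, and French samples fail only for `latin-1`, while the Hungarian and Welsh samples fail for all three codecs.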
Coverage of punctuation signs and apostrophes

For some of the languages listed above, the correct typographical quotation marks are missing: only « », " ", and ' ' are included.

This encoding also lacks the correct characters for the apostrophe and the oriented single high quotation marks. Some texts instead use the spacing grave accent and spacing acute accent, both of which are part of ISO 8859-1, in place of the 6-shaped and 9-shaped quotation marks or apostrophes. This works reliably with some font styles in which all of these characters are displayed as slanted wedge glyphs.

See also: Alphabets derived from the Latin

History

ISO 8859-1 was based on the Multinational Character Set used by Digital Equipment Corporation in the popular VT220 terminal. It was developed within ECMA, the European Computer Manufacturers Association, and published in March 1985 as ECMA-94, by which name it is still sometimes known. The second edition of ECMA-94 (June 1986) also included ISO 8859-2, ISO 8859-3, and ISO 8859-4 as part of the specification.

In 1985, Commodore officially adopted the ANSI/ISO 8859-1 layout as the code page for its new AmigaOS operating system and for all of its internal operations, preferring an internationally approved standard to the proprietary character sets then used by MS-DOS and Mac OS. The standard was therefore also used for the keyboard layout of the Amiga 1000, which was launched in July 1985. All versions of AmigaOS up to 3.1 used ISO 8859-1. After the demise of Commodore International in 1994, the later versions of AmigaOS (3.5, 3.9) kept the ISO 8859-1 code page, enhanced with the euro currency character. Without a leading firm able to impose official standards, however, neither AmigaOS nor its clone variants (MorphOS, AROS) officially moved to ISO 8859-15, and they did not follow a common approach when the euro character was introduced in 2001. MorphOS 2.0 and later versions are Unicode (UTF-8) compliant.

Relationship to ISO/IEC 8859-15

Although ISO/IEC 8859-1 has enough characters for most French text, it is missing a few letters that are less common. It is also missing a single-glyph representation for the letter IJ, two Finnish letters used for transcription of some foreign names and in a few loanwords (Š and Ž), typographic quotation marks and dashes, and common symbols such as the euro sign (€) and dagger (†).

In order to provide some of these characters, ISO/IEC 8859-15 was developed as an update of ISO/IEC 8859-1. This required, however, the removal of some infrequently-used characters from ISO/IEC 8859-1, including fraction symbols and letter-free diacritics: ¤, ¦, ¨, ´, ¸, ¼, ½, and ¾.
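The replaced positions can be listed by decoding every byte with both encodings and keeping the mismatches. This is a sketch that assumes Python's `latin-1` and `iso8859-15` codecs match the two standards; the variable name `changes` is illustrative.

```python
changes = []
for value in range(0xA0, 0x100):
    raw = bytes([value])
    old = raw.decode("latin-1")      # character in ISO/IEC 8859-1
    new = raw.decode("iso8859-15")   # character in ISO/IEC 8859-15
    if old != new:
        changes.append((value, old, new))

for value, old, new in changes:
    print(f"0x{value:02X}: {old} -> {new}")
```

With these codecs the loop reports eight positions (A4, A6, A8, B4, B8, BC, BD, BE), matching the eight removed characters named above.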

Codepage layout

Since all 191 characters encoded by ISO/IEC 8859-1 are 'graphic' (ISO's term for characters that are not control codes) and are compatible with most web browsers, they can be shown as glyphs in the following table. Since the space, no-break space, and soft hyphen characters would not normally be visible, they are represented by abbreviations for their names. All other characters are represented literally. Row and column headings indicate the hexadecimal digit combinations to produce the eight-bit code value; e.g., the letter L is at code value 4C.

Under each glyph, the numeric value of its codepoint is given, first in hexadecimal, then in decimal, and finally in octal.

ISO/IEC 8859-1 (Latin-1)
     —0   —1   —2   —3   —4   —5   —6   —7   —8   —9   —A   —B   —C   —D   —E   —F

0−   (unassigned)

1−   (unassigned)

2−   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
     0020 0021 0022 0023 0024 0025 0026 0027 0028 0029 002A 002B 002C 002D 002E 002F
     32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
     040  041  042  043  044  045  046  047  050  051  052  053  054  055  056  057

3−   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
     0030 0031 0032 0033 0034 0035 0036 0037 0038 0039 003A 003B 003C 003D 003E 003F
     48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
     060  061  062  063  064  065  066  067  070  071  072  073  074  075  076  077

4−   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
     0040 0041 0042 0043 0044 0045 0046 0047 0048 0049 004A 004B 004C 004D 004E 004F
     64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
     100  101  102  103  104  105  106  107  110  111  112  113  114  115  116  117

5−   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
     0050 0051 0052 0053 0054 0055 0056 0057 0058 0059 005A 005B 005C 005D 005E 005F
     80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
     120  121  122  123  124  125  126  127  130  131  132  133  134  135  136  137

6−   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
     0060 0061 0062 0063 0064 0065 0066 0067 0068 0069 006A 006B 006C 006D 006E 006F
     96   97   98   99   100  101  102  103  104  105  106  107  108  109  110  111
     140  141  142  143  144  145  146  147  150  151  152  153  154  155  156  157

7−   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~
     0070 0071 0072 0073 0074 0075 0076 0077 0078 0079 007A 007B 007C 007D 007E
     112  113  114  115  116  117  118  119  120  121  122  123  124  125  126
     160  161  162  163  164  165  166  167  170  171  172  173  174  175  176

8−   (unassigned)

9−   (unassigned)

A−   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
     00A0 00A1 00A2 00A3 00A4 00A5 00A6 00A7 00A8 00A9 00AA 00AB 00AC 00AD 00AE 00AF
     160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175
     240  241  242  243  244  245  246  247  250  251  252  253  254  255  256  257

B−   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
     00B0 00B1 00B2 00B3 00B4 00B5 00B6 00B7 00B8 00B9 00BA 00BB 00BC 00BD 00BE 00BF
     176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191
     260  261  262  263  264  265  266  267  270  271  272  273  274  275  276  277

C−   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
     00C0 00C1 00C2 00C3 00C4 00C5 00C6 00C7 00C8 00C9 00CA 00CB 00CC 00CD 00CE 00CF
     192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207
     300  301  302  303  304  305  306  307  310  311  312  313  314  315  316  317

D−   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
     00D0 00D1 00D2 00D3 00D4 00D5 00D6 00D7 00D8 00D9 00DA 00DB 00DC 00DD 00DE 00DF
     208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223
     320  321  322  323  324  325  326  327  330  331  332  333  334  335  336  337

E−   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
     00E0 00E1 00E2 00E3 00E4 00E5 00E6 00E7 00E8 00E9 00EA 00EB 00EC 00ED 00EE 00EF
     224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239
     340  341  342  343  344  345  346  347  350  351  352  353  354  355  356  357

F−   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
     00F0 00F1 00F2 00F3 00F4 00F5 00F6 00F7 00F8 00F9 00FA 00FB 00FC 00FD 00FE 00FF
     240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255
     360  361  362  363  364  365  366  367  370  371  372  373  374  375  376  377

Code values 00–1F, 7F–9F are not assigned to characters by ISO/IEC 8859-1.
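The three notations shown in each cell, and the total of 191 graphic characters, follow directly from the code values. A minimal sketch, assuming Python's `latin-1` codec for the glyph lookup; the function name `cell` and the list `graphic` are invented for the example.

```python
def cell(byte_value: int) -> tuple[str, str, str, str]:
    """Return (glyph, hexadecimal, decimal, octal) for one code value."""
    glyph = bytes([byte_value]).decode("latin-1")
    return glyph, f"{byte_value:04X}", str(byte_value), f"{byte_value:03o}"


# Code values assigned by ISO/IEC 8859-1: 20-7E and A0-FF.
graphic = [v for v in range(256) if 0x20 <= v <= 0x7E or 0xA0 <= v <= 0xFF]

print(len(graphic))   # 95 + 96 = 191
print(cell(0x4C))     # ('L', '004C', '76', '114')
```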

The lower range 20 to 7E (the G0 subset) maps exactly to the same coded G0 subset of the ISO 646 US variant (commonly known as ASCII), whose ISO 2022 standard switch sequence is "ESC ( B". The higher range A0 to FF (the G1 subset) maps exactly to the same subset initiated by the ISO 2022 standard switch sequence "ESC . A".

Related character maps

The ISO/IEC 8859-1 standard has long been the basis of a number of character maps, also known as character sets, charsets, or code pages, the most popular being ISO-8859-1 (note the extra hyphen) and Windows-1252. Both of these maps are supersets of ISO/IEC 8859-1: they supplement the standard's 191 character assignments by mapping additional characters to at least some portion of the code value ranges 00–1F, 7F, and 80–9F.

ISO-8859-1

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the code values 00–1F, 7F, and 80–9F. It thus provides for 256 characters via every possible 8-bit value.

ISO-8859-1 is (according to the standards at least) the default encoding of documents delivered via HTTP with a MIME type beginning with "text/". It is the default encoding of the values of certain descriptive HTTP headers, and is the standard encoding used by the X Window System on most Unix machines in locales which use that character set. It was also the basis of the repertoire of characters allowed in HTML 3.2 documents (HTML 4.0, however, is based on Unicode).

Escape sequences (from ISO/IEC 6429 or ISO/IEC 2022) are not to be interpreted in documents labeled as ISO-8859-1 encoded. As well as the canonical name and preferred MIME name mentioned above, the following other aliases are registered for ISO-8859-1: ISO_8859-1, ISO-8859-1, iso-ir-100, csISOLatin1, latin1, l1, IBM819, CP819. ISO-8859-1 was also incorporated as the first 256 code points of ISO/IEC 10646 and Unicode.
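As an illustration of the aliases and of the relationship to Unicode, the sketch below uses Python's `codecs` registry. It assumes that Python recognizes each alias spelled as in the list above; the names `ALIASES`, `resolved`, `identity`, and `round_trip` are illustrative only.

```python
import codecs

# Aliases as listed above; Python's lookup ignores case and treats '-' like '_'.
ALIASES = ["ISO_8859-1", "ISO-8859-1", "iso-ir-100", "csISOLatin1",
           "latin1", "l1", "IBM819", "CP819"]

# Every alias should resolve to one and the same codec.
resolved = {codecs.lookup(alias).name for alias in ALIASES}
print(resolved)

# All 256 byte values decode, and byte N becomes Unicode code point U+00NN.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
identity = [ord(ch) for ch in text] == list(range(256))
round_trip = text.encode("iso-8859-1") == all_bytes
print(identity, round_trip)
```

Because the mapping is an identity on the first 256 code points, decoding can never fail, which is also why mislabeled data passes through this codec without any error being raised.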

Code point Control character Abbreviation
00 Null NUL
01 Start Of Heading SOH
02 Start of Text STX
03 End of Text ETX
04 End Of Transmission EOT
05 Enquiry ENQ
06 Acknowledge ACK
07 Bell BEL
08 Backspace BS
09 Horizontal Tab HT
0A Line Feed LF
0B Vertical Tab VT
0C Form Feed FF
0D Carriage Return CR
0E Shift Out SO
0F Shift In SI
10 Data Link Escape DLE
11 Device Control 1 DC1
12 Device Control 2 DC2
13 Device Control 3 DC3
14 Device Control 4 DC4
15 Negative Acknowledge NAK
16 Synchronous idle SYN
17 End of Transmission Block ETB
18 Cancel CAN
19 End of Medium EM
1A Substitute (character) SUB
1B Escape character ESC
1C File separator FS
1D Group separator GS
1E Record separator RS
1F Unit separator US
7F Delete DEL
 
Code point Control character Abbreviation
80 Padding Character PAD
81 High Octet Preset HOP
82 Break Permitted Here BPH
83 No Break Here NBH
84 Index IND
85 Next Line NEL
86 Start of Selected Area SSA
87 End of Selected Area ESA
88 Character Tabulation Set HTS
89 Character Tabulation with Justification HTJ
8A Line Tabulation Set VTS
8B Partial Line Forward PLD
8C Partial Line Backward PLU
8D Reverse Line Feed RI
8E Single Shift 2 SS2
8F Single Shift 3 SS3
90 Device Control String DCS
91 Private Use 1 PU1
92 Private Use 2 PU2
93 Set Transmit State STS
94 Cancel Character CCH
95 Message Waiting MW
96 Start of Guarded Area SPA
97 End of Guarded Area EPA
98 Start of String SOS
99 Single Graphic Character Introducer SGCI
9A Single Character Introducer SCI
9B Control Sequence Introducer CSI
9C String Terminator ST
9D Operating System Command OSC
9E Privacy Message PM
9F Application Program Command APC
ISO-8859-1
     —0   —1   —2   —3   —4   —5   —6   —7   —8   —9   —A   —B   —C   —D   —E   —F

0−   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
     0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 000A 000B 000C 000D 000E 000F
     0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15

1−   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
     0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 001A 001B 001C 001D 001E 001F
     16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31

2−   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
     0020 0021 0022 0023 0024 0025 0026 0027 0028 0029 002A 002B 002C 002D 002E 002F
     32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47

3−   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
     0030 0031 0032 0033 0034 0035 0036 0037 0038 0039 003A 003B 003C 003D 003E 003F
     48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63

4−   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
     0040 0041 0042 0043 0044 0045 0046 0047 0048 0049 004A 004B 004C 004D 004E 004F
     64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79

5−   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
     0050 0051 0052 0053 0054 0055 0056 0057 0058 0059 005A 005B 005C 005D 005E 005F
     80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95

6−   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
     0060 0061 0062 0063 0064 0065 0066 0067 0068 0069 006A 006B 006C 006D 006E 006F
     96   97   98   99   100  101  102  103  104  105  106  107  108  109  110  111

7−   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
     0070 0071 0072 0073 0074 0075 0076 0077 0078 0079 007A 007B 007C 007D 007E 007F
     112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127

8−   PAD  HOP  BPH  NBH  IND  NEL  SSA  ESA  HTS  HTJ  VTS  PLD  PLU  RI   SS2  SS3
     0080 0081 0082 0083 0084 0085 0086 0087 0088 0089 008A 008B 008C 008D 008E 008F
     128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143

9−   DCS  PU1  PU2  STS  CCH  MW   SPA  EPA  SOS  SGCI SCI  CSI  ST   OSC  PM   APC
     0090 0091 0092 0093 0094 0095 0096 0097 0098 0099 009A 009B 009C 009D 009E 009F
     144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159

A−   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
     00A0 00A1 00A2 00A3 00A4 00A5 00A6 00A7 00A8 00A9 00AA 00AB 00AC 00AD 00AE 00AF
     160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175

B−   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
     00B0 00B1 00B2 00B3 00B4 00B5 00B6 00B7 00B8 00B9 00BA 00BB 00BC 00BD 00BE 00BF
     176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191

C−   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
     00C0 00C1 00C2 00C3 00C4 00C5 00C6 00C7 00C8 00C9 00CA 00CB 00CC 00CD 00CE 00CF
     192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207

D−   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
     00D0 00D1 00D2 00D3 00D4 00D5 00D6 00D7 00D8 00D9 00DA 00DB 00DC 00DD 00DE 00DF
     208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223

E−   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
     00E0 00E1 00E2 00E3 00E4 00E5 00E6 00E7 00E8 00E9 00EA 00EB 00EC 00ED 00EE 00EF
     224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239

F−   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
     00F0 00F1 00F2 00F3 00F4 00F5 00F6 00F7 00F8 00F9 00FA 00FB 00FC 00FD 00FE 00FF
     240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255

Most of these control characters are not intended for use in portable plain-text documents encoded in ISO-8859-1, but only within specific protocols or devices. The exceptions are a few characters whose behavior is standardized: TAB (09), LF (0A), CR (0D) and NEL (85). All but the first are used to encode line ends or to separate paragraphs, and TAB is often considered equivalent to whitespace. In addition, some applications that interpret plain-text documents accept FF (0C) as ignorable whitespace at the beginning of a line, marking the position of an explicit page break when printing.
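How these controls behave once decoded depends on the consuming software. As one hedged example, Python's `str.splitlines` treats LF, CR, CR LF, and NEL as line boundaries, and `str.split` treats TAB as whitespace; the sample bytes and variable names below are invented for the illustration.

```python
# ISO-8859-1 bytes using TAB (09), LF (0A), CR LF (0D 0A) and NEL (85).
raw = b"name\tvalue\nfirst\r\nsecond\x85third"

text = raw.decode("latin-1")
lines = text.splitlines()   # splits on LF, CR LF and NEL alike
fields = lines[0].split()   # TAB counts as whitespace

print(lines)    # ['name\tvalue', 'first', 'second', 'third']
print(fields)   # ['name', 'value']
```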

Some encodings also allow BS (08) to be used to create additional characters by emulating the superposition of multiple characters on printing devices.

Some ISO standards assign specific functions to some controls (for example in ISO 2022) where SO (0E), SI (0F), DLE (10), ESC (1B) and SS2 (8E) are used to control the encoding of characters after them or to switch between multiple encodings.

The NUL character (00) is commonly used as a string terminator in some programming languages, or as a filler in database records that must be ignored and is not part of the encoded text. STX (02) and ETX (03) are commonly used for delimiting frames in some transmission protocols. SUB (1A) is also commonly used as a replacement character to mark errors detected in input transmission streams, and it may be rendered graphically. DC1 (11) and DC3 (13) are commonly used in the XON/XOFF protocol for controlling the transmission speed. Finally, EM (19) or EOT (04) may be used as an end-of-file marker in some text file formats.

ISO-8859-1 and Windows-1252 confusion

It is very common to mislabel text data with the charset label ISO-8859-1, even though the data is really Windows-1252 encoded. In Windows-1252, codes between 0x80 and 0x9F are used for letters and punctuation, whereas they are control codes in ISO-8859-1. Many web browsers and e-mail clients will interpret ISO-8859-1 control codes as Windows-1252 characters in order to accommodate such mislabeling but it is not standard behaviour and care should be taken to avoid generating these characters in ISO-8859-1 labeled content. However, the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.[1]
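The effect of the mislabeling, and of the lenient handling described above, can be sketched as follows. The function `decode_labeled` and its small label set are hypothetical and do not reproduce any browser's actual label table. Note also that Python's `cp1252` codec leaves five byte values undefined, which this sketch replaces with U+FFFD.

```python
# Labels that this sketch treats as "really Windows-1252" (an assumption).
LATIN1_LABELS = {"iso-8859-1", "iso_8859-1", "latin1", "l1"}


def decode_labeled(data: bytes, label: str) -> str:
    """Decode data, reading an ISO-8859-1 label as Windows-1252."""
    codec = label.strip().lower()
    if codec in LATIN1_LABELS:
        codec = "cp1252"
    return data.decode(codec, errors="replace")


# Windows-1252 curly quotes (93, 94) and en dash (96), plus e-acute (E9).
raw = b"\x93Latin-1\x94 \x96 caf\xe9"

strict = raw.decode("latin-1")                # yields C1 control characters
lenient = decode_labeled(raw, "ISO-8859-1")   # yields the intended punctuation

print(repr(strict))
print(lenient)
```

Only the bytes in the range 80 to 9F are affected; the E9 byte decodes to the same character either way.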

Similar character sets

The Apple Macintosh computer introduced a character encoding called Mac Roman (also written Mac-Roman or MacRoman) in 1984. It was meant to be suitable for Western European desktop publishing. Like ISO-8859-1, it is a superset of ASCII, and it has most of the characters that are in ISO-8859-1, but in a totally different arrangement. A later version, registered with IANA as "Macintosh", replaced the generic currency sign ¤ with the euro sign €. The few printable characters that are in ISO 8859-1 but not in this set are often a source of trouble when editing text on websites using older Macintosh browsers (including the last version of Internet Explorer for Mac). However, the extra characters that Windows-1252 has in the C1 code point range are all supported in Mac Roman. Apart from the few missing ISO-8859-1 characters, a Macintosh can therefore send and receive files (and email) that are encoded or marked as ISO-8859-1 (with the C1 control characters) or Windows-1252 by remapping the code point numbers of the glyphs.

DOS had code page 850, which had all printable characters that ISO-8859-1 had (albeit in a totally different arrangement) plus the most widely used graphic characters from code page 437.
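Both claims about repertoire can be tested by trying to encode each graphic character of ISO-8859-1 with the other code pages. This sketch assumes Python's `cp850` and `mac_roman` codecs (the latter being the euro-sign variant of Mac Roman); `missing_from` is a name made up for the example.

```python
def missing_from(codec: str) -> list[str]:
    """List the ISO-8859-1 graphic characters that the given codec cannot encode."""
    missing = []
    for value in list(range(0x20, 0x7F)) + list(range(0xA0, 0x100)):
        ch = chr(value)  # ISO-8859-1 code values equal Unicode code points
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


missing_cp850 = missing_from("cp850")
missing_mac = missing_from("mac_roman")

print(len(missing_cp850), "characters missing from code page 850")
print("missing from Mac Roman:", " ".join(missing_mac))
```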
