Commit 5f0236a

Unicode and UTF-8 clarifications from toml-lang#924
1 parent 487b38c commit 5f0236a

2 files changed: +14 -7 lines changed

CHANGELOG.md (+1)

@@ -2,6 +2,7 @@

 ## unreleased

+- Clarify Unicode and UTF-8 references.
 - Allow newline after key/values in inline tables.
 - Allow trailing comma in inline tables.
 - Clarify where and how dotted keys define tables.

toml.md (+13 -7)

@@ -36,10 +36,15 @@ should be easy to parse into data structures in a wide variety of languages.

 ## Spec

+A TOML file must be a valid UTF-8 encoded Unicode document. Specifically this
+means that, should a file as a whole not form a
+[well-formed code-unit sequence](https://unicode.org/glossary/#well_formed_code_unit_sequence),
+the file must be rejected (preferably) or ill-formed byte sequences must be
+replaced with U+FFFD as per the Unicode spec.
+
 - TOML is case-sensitive.
-- A TOML file must be a valid UTF-8 encoded Unicode document.
-- Whitespace means tab (0x09) or space (0x20).
-- Newline means LF (0x0A) or CRLF (0x0D 0x0A).
+- Whitespace means tab (U+0009) or space (U+0020).
+- Newline means LF (U+000A) or CRLF (U+000D U+000A).

 ## Comment
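Note on the hunk above: the new Spec paragraph gives a parser two options for ill-formed UTF-8, rejecting the file (preferred) or substituting U+FFFD. A minimal Python sketch of both follows. The helper name `decode_toml_bytes` and its `strict` flag are hypothetical and do not come from any TOML library; the sketch relies only on the standard `bytes.decode` error handlers.

```python
def decode_toml_bytes(data: bytes, strict: bool = True) -> str:
    """Decode raw file bytes as UTF-8 (hypothetical helper, for illustration only)."""
    if strict:
        # Preferred behaviour in the hunk above: reject ill-formed input.
        # bytes.decode raises UnicodeDecodeError on any ill-formed sequence.
        return data.decode("utf-8")
    # Alternative behaviour: replace ill-formed sequences with U+FFFD.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_toml_bytes(b'name = "caf\xc3\xa9"\n'))
    print(repr(decode_toml_bytes(b"a\xffb", strict=False)))
```

Strict mode is the default here because the spec text marks rejection as the preferred outcome; the replacement path exists only for tools that must keep going on damaged input.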

@@ -265,7 +270,7 @@ The above TOML maps to the following JSON.
 ## String

 There are four ways to express strings: basic, multi-line basic, literal, and
-multi-line literal. All strings must contain only valid UTF-8 characters.
+multi-line literal. All strings must contain only Unicode characters.

 **Basic strings** are surrounded by quotation marks (`"`). Any Unicode character
 may be used except those that must be escaped: quotation mark, backslash, and
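Note on the String hunks in this commit: the escape-sequence hunk further down requires escape codes to be Unicode scalar values, meaning any code point from U+0000 to U+10FFFF except the surrogate range U+D800 to U+DFFF. A minimal Python sketch of that check follows. Both function names are hypothetical, and the sketch assumes the caller has already confirmed that the escape body consists only of hex digits.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+10FFFF, excluding the surrogates U+D800..U+DFFF."""
    if not 0 <= code_point <= 0x10FFFF:
        return False
    return not 0xD800 <= code_point <= 0xDFFF


def decode_escape_digits(hex_digits: str) -> str:
    """Hypothetical helper: map the hex digits of an escape to one character.

    Assumes hex_digits was already validated as hex digits by the caller.
    """
    code_point = int(hex_digits, 16)
    if not is_unicode_scalar_value(code_point):
        raise ValueError(f"U+{code_point:04X} is not a Unicode scalar value")
    return chr(code_point)


if __name__ == "__main__":
    print(decode_escape_digits("00E9"))
    print(is_unicode_scalar_value(0xD800))
```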
@@ -293,7 +298,7 @@ For convenience, some popular characters have a compact escape sequence.
 ```

 Any Unicode character may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH`
-forms. The escape codes must be valid Unicode
+forms. The escape codes must be Unicode
 [scalar values](https://unicode.org/glossary/#unicode_scalar_value).

 All other escape sequences not listed above are reserved; if they are used, TOML
@@ -417,8 +422,9 @@ str = ''''That,' she said, 'is still pointless.''''
 ```

 Control characters other than tab are not permitted in a literal string. Thus,
-for binary data, it is recommended that you use Base64 or another suitable ASCII
-or UTF-8 encoding. The handling of that encoding will be application-specific.
+for binary data, it is recommended that you use Base64 or another suitable
+binary-to-text encoding. The handling of that encoding will be
+application-specific.

 ## Integer
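Note on the hunk above: the reworded sentence recommends Base64 or another binary-to-text encoding for binary data and leaves the handling to the application. A minimal Python sketch of one such application-level convention follows. The helper names and the `blob` key are hypothetical; only the standard `base64` module is used.

```python
import base64


def encode_binary_for_toml(payload: bytes) -> str:
    """Hypothetical helper: turn raw bytes into Base64 text for a TOML string."""
    return base64.b64encode(payload).decode("ascii")


def decode_binary_from_toml(text: str) -> bytes:
    """Hypothetical helper: recover the bytes; validate=True rejects stray characters."""
    return base64.b64decode(text, validate=True)


if __name__ == "__main__":
    payload = bytes([0x00, 0x01, 0xFF, 0x10])  # includes control bytes
    encoded = encode_binary_for_toml(payload)
    # The standard Base64 alphabet has no apostrophe, so a literal string is safe.
    print(f"blob = '{encoded}'")
    assert decode_binary_from_toml(encoded) == payload
```

The encoded text contains only printable ASCII, so it satisfies the control-character restriction that the hunk describes for literal strings.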
