Select null locale as proposed solution & add Intl API specifics #18

Open
wants to merge 4 commits into
base: main
from


eemeli
Member

@eemeli eemeli commented Feb 5, 2025

Closes #2
Closes #7
Closes #8
Closes #17
Closes #19

I propose that we advance this proposal by choosing to define the behaviour of the 'zxx' or null locale, with support in most Intl APIs as defined here.
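For context, a minimal sketch of how `zxx` is treated in engines that have not implemented this proposal. It assumes a runtime with the standard `Intl` object; the fallback described in the comments is the existing locale negotiation, not the behaviour proposed here.

```javascript
// 'zxx' (ISO 639: "no linguistic content") is already a structurally valid
// BCP 47 language tag, so Intl accepts it without throwing.
const canonical = Intl.getCanonicalLocales('zxx');
console.log(canonical);

// Without this proposal, engines ship no data for 'zxx', so locale
// negotiation is expected to fall back to the runtime's default locale.
const resolved = new Intl.NumberFormat('zxx').resolvedOptions().locale;
console.log(resolved);
```

Under the proposal, the resolved locale and the formatted output would instead be defined by the spec text in this PR.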

The solution of adding corresponding ECMA-262 (rather than ECMA-402) functionality is dismissed as infeasible, as that would require introducing wholly new functions for:

  • duration serialization
  • list serialization
  • collation
  • segmentation

Collation and segmentation have a significant data dependency that's already internalized in 402, but not in 262.

The behaviour for the null locale in Intl.Collator and Intl.Segmenter is as proposed by @hsivonen in #13.

At least the following are left to be filled out in later PRs, but there's indubitably more:

  • A complete set of supported Intl.DateTimeFormat component options, and their detailed formatted output
  • Intl.Locale behaviour
  • Complete mapping of short unit identifiers for Intl.NumberFormat

Edit: I've prepared a presentation for the changes proposed here.
Edit 2: Updated following changes proposed by TG2

### Intl.Segmenter

When the `zxx` locale is used, [UAX #29](https://unicode.org/reports/tr29/) segmentation
with extended grapheme clusters is used, without tailorings
Member

This should probably say that 'grapheme' and 'sentence' have no tailorings. For 'word', however, saying that would turn off behaviors that are de facto on by default in a catch-all case.

How about:

The 'grapheme' mode shall use untailored UAX 29 extended grapheme cluster rules.

The 'sentence' mode shall use untailored UAX 29 default sentence boundary rules.

The 'word' mode shall use UAX 29 default word boundary rules with the tailorings that the implementation supports for scripts that do not use spaces between words. (Note: This is intended to enable word segmentation for e.g. Han, Thai, Lao, and Khmer scripts.) If the implementation supports more than one tailoring for a script that does not use spaces, the most broadly applicable one of the alternatives for a given script shall be used. (Note: The currently-known or expected implementations do not currently have multiple mutually-exclusive tailorings for scripts that don't use spaces.)
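The three modes above map onto the `granularity` option of `Intl.Segmenter`. A small sketch of exercising each one follows; the `segments` helper is hypothetical, and engines that have not implemented this proposal resolve `zxx` to their default locale, so this only illustrates the API shape rather than the proposed untailored behaviour.

```javascript
// Hypothetical helper: returns the segment strings for one granularity.
// 'zxx' is accepted as a well-formed tag; without the proposal it falls
// back to the default locale during negotiation.
function segments(text, granularity) {
  const segmenter = new Intl.Segmenter('zxx', { granularity });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT is one extended grapheme cluster.
const graphemes = segments('e\u0301a', 'grapheme');
const sentences = segments('One. Two.', 'sentence');
const words = segments('Hello world', 'word');

console.log(graphemes.length, sentences, words);
```

Text in scripts that do not put spaces between words (e.g. Thai) is where the 'word' mode depends on the implementation-supported tailorings that the suggested wording keeps enabled.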

CC @makotokato @aethanyc

@eemeli
Member Author

eemeli commented Feb 7, 2025

I've updated the PR following yesterday's TG2 discussion:

  • Behaviour for Intl.DisplayNames is defined, always experiencing fallback
  • Intl.DurationFormat and Intl.RelativeTimeFormat both return an ISO 8601-2 duration string, the latter with a + or - prefix.
  • Array.p.toLocaleString uses a comma (`,`) as separator
  • String.p.toLocale{Lower,Upper}Case use the Unicode Default Case Conversion algorithm
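As a rough illustration of the outputs listed above, here is a sketch using plain JavaScript stand-ins. `toIsoDuration` and `toSignedIsoDuration` are hypothetical helpers, not part of any API, and they assume non-negative field values and singular unit names such as `'day'`; the exact strings the proposal settles on may differ.

```javascript
// Hypothetical helper: serializes a duration record as an ISO 8601
// duration string. Zero-valued fields are omitted.
function toIsoDuration({ years = 0, months = 0, weeks = 0, days = 0,
                         hours = 0, minutes = 0, seconds = 0 }) {
  const join = (parts) =>
    parts.filter(([n]) => n !== 0).map(([n, u]) => `${n}${u}`).join('');
  const date = join([[years, 'Y'], [months, 'M'], [weeks, 'W'], [days, 'D']]);
  const time = join([[hours, 'H'], [minutes, 'M'], [seconds, 'S']]);
  if (!date && !time) return 'PT0S';
  return `P${date}${time ? 'T' + time : ''}`;
}

// Hypothetical helper: a relative time as a duration with an explicit sign.
function toSignedIsoDuration(value, unit) {
  const sign = value < 0 ? '-' : '+';
  return sign + toIsoDuration({ [`${unit}s`]: Math.abs(value) });
}

const duration = toIsoDuration({ hours: 1, minutes: 30 }); // 'PT1H30M'
const relative = toSignedIsoDuration(-3, 'day');           // '-P3D'

// Stand-ins for the other two bullets: a plain comma join, and the
// locale-insensitive default case conversion of toUpperCase().
const list = [1, 2, 3].join(',');  // '1,2,3'
const upper = 'i'.toUpperCase();   // 'I'

console.log(duration, relative, list, upper);
```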
