CLDR and SLDR
What is the CLDR?
From https://cldr.unicode.org/:
The Unicode Common Locale Data Repository (CLDR) provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes:
- Locale-specific patterns for formatting and parsing: dates, times, timezones, numbers and currency values, measurement units,…
- Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, time zones, cities, and time units, emoji characters and sequences (and search keywords),…
- Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences; keyboard layouts…
- Country information: language usage, currency information, calendar preference, week conventions,…
- Validity: Definitions, aliases, and validity information for Unicode locales, languages, scripts, regions, and extensions,…
CLDR uses the XML format provided by UTS #35: Unicode Locale Data Markup Language (LDML). LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft’s .NET.
What is the SLDR?
The SLDR is the SIL Locale Data Repository, a repo that builds upon the structure and data of the CLDR with locale data that might not yet meet the minimum requirements for CLDR inclusion.
Like the CLDR, all data within the SLDR uses LDML (Locale Data Markup Language). For more information, see the LDML page on this site.
The purpose of the SLDR is to gather information for publication on ScriptSource, to gather information for submission to the CLDR, and to serve the data to applications that need access to locales absent from the CLDR.
The SLDR imports all files from CLDR into its repository on a regular basis, and makes no changes to the content already within said files. The goal of the SLDR is to build upon data already within the CLDR, not to override it. All data within the CLDR is also located within the SLDR.
Information most commonly found within SLDR-only files includes:
- Autonyms
- Writing Direction
- Exemplars: Main, Auxiliary, Index, and Punctuation
- Collation
- Font and Keyboard Data
Font and keyboard data is unique to the SLDR and provides recommendations for the fonts and keyboards that best serve a locale. Since this information is not natively included in the CLDR, the SLDR appends this data to imported CLDR files as well.
Other data beyond the scope of the list above may also be included in an SLDR file if the information has been made available, but unless an effort is being made to bring a specific locale up to CLDR standards for submission, those other elements are not typically a priority.
SLDR data is sourced from manually curated research, data generated (with permission) from the contents of the Digital Bible Library, and external submissions via ScriptSource Contributions and GitHub Issues. While the SLDR strives to be as accurate as possible, the data within is not perfect and should not be treated as an unquestionable source of information. Corrections from external sources are extremely welcome.
How is the SLDR Used?
langtags.json
The langtags.json file is generated by the LangTags repository and is used to parse tag equivalence. This is explained in depth in the langtags documentation in the langtags repo.
In addition to using the data from its parent repository, langtags.json also pulls autonym data from the SLDR. Specifically, the autonym data from the SLDR is listed under the field “localname” in langtags.json. This should not be confused with the field “localnames”, which is an array featuring all of the names sourced from the Ethnologue.
Whenever new data is pushed to the SLDR, langtags.json is automatically rebuilt through GitHub Actions.
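The difference between “localname” and “localnames” can be illustrated with a short Python sketch. It assumes that langtags.json is a JSON array of objects and that each object carries a "tag" field alongside the two name fields; the helper name `autonyms_by_tag` and the two-entry sample are hypothetical and exist only for illustration.

```python
import json


def autonyms_by_tag(entries):
    """Map each tag to its SLDR-sourced autonym ("localname").

    Entries without a "localname" are skipped, even if they have
    Ethnologue-sourced names in "localnames".
    """
    result = {}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        tag = entry.get("tag")
        name = entry.get("localname")
        if tag and name:
            result[tag] = name
    return result


# Hypothetical sample in the assumed shape; not real langtags.json content.
sample = json.loads("""[
  {"tag": "xx", "localname": "Example Autonym", "localnames": ["Alt One", "Alt Two"]},
  {"tag": "yy", "localnames": ["Only Ethnologue Name"]}
]""")

print(autonyms_by_tag(sample))
```

Only the first entry appears in the result, because the second has Ethnologue names but no SLDR autonym.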
The LDML API
SLDR information is primarily accessed and utilized by applications via the LDML API. This API also utilizes and distributes langtags.json.
Here are some examples of how the LDML API is used:
- https://ldml.api.sil.org/lld will return the lld.xml file from the release version of the SLDR.
- https://ldml.api.sil.org/lld?staging=1 will return the lld.xml file from the yet-unreleased staging version of the SLDR.
  - This is used to test upcoming versions of the SLDR prior to a new release. Typically, developers are given notice at least 2 weeks prior to release via the SIL LangTech Slack channel.
- https://ldml.api.sil.org/langtags.json returns the entirety of langtags.json from the release branch, while https://ldml.api.sil.org/langtags.json?staging=1 returns the staging version.
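The URL patterns listed above can be wrapped in a small Python helper. This is a minimal sketch that uses only the endpoints and the `staging=1` parameter shown in the examples; the function names `ldml_url` and `fetch` are hypothetical, and no other API parameters are assumed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://ldml.api.sil.org"


def ldml_url(resource, staging=False):
    """Build a request URL for a locale tag (e.g. "lld") or "langtags.json"."""
    query = "?" + urlencode({"staging": 1}) if staging else ""
    return f"{BASE_URL}/{resource}{query}"


def fetch(resource, staging=False, timeout=30):
    """Download the raw response body (requires network access)."""
    with urlopen(ldml_url(resource, staging), timeout=timeout) as response:
        return response.read()


print(ldml_url("lld"))
print(ldml_url("lld", staging=True))
print(ldml_url("langtags.json", staging=True))
```

Keeping URL construction separate from the network call makes it easy to switch a test run between release and staging data with a single flag.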
Since langtags.json is an important element of the LDML API, it is good practice for new versions of the SLDR and LangTags repositories to be released simultaneously, in order to avoid conflicts between them in the output of the LDML API.
Examples of applications that use the SLDR via the LDML API include Bloom, Paratext, and Flex.
Language Font Finder API
The Language Font Finder API (LFF) is an API that returns recommended fonts for a specific language tag. The font recommendations are pulled from the font data located in the SLDR file for that locale. If there is no SLDR file for the passed language tag, or if the SLDR file does not contain any font data, a predefined fallback value is returned instead.
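The fallback behavior described above can be sketched in Python as follows. This is not the LFF implementation: the dictionary stands in for font data already extracted from SLDR files, and the function name, font names, tags, and fallback value are all hypothetical placeholders.

```python
# Hypothetical fallback value; the real predefined fallback is not stated here.
FALLBACK_FONTS = ["Example Fallback Font"]


def recommend_fonts(tag, sldr_fonts, fallback=FALLBACK_FONTS):
    """Return the fonts recorded for a tag, or the fallback.

    sldr_fonts maps a language tag to the list of fonts found in its SLDR
    file. A missing key models "no SLDR file"; an empty list models
    "SLDR file without font data". Both cases yield the fallback.
    """
    fonts = sldr_fonts.get(tag)
    if not fonts:
        return list(fallback)
    return list(fonts)


# Placeholder data for illustration only.
sldr_fonts = {"aa": ["Example Font A"], "bb": []}

print(recommend_fonts("aa", sldr_fonts))  # font data present
print(recommend_fonts("bb", sldr_fonts))  # file exists, no font data
print(recommend_fonts("cc", sldr_fonts))  # no SLDR file at all
```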
ScriptSource
The ScriptSource site uses the exemplar data of locales contained within the SLDR to populate the “Symbols & Characters” sections of the pages relating to said locales.
For example, the “Symbols & Characters” tab of the “Enga written with Latin script” page contains two lists of characters (main and auxiliary) that are pulled directly from the “main” and “auxiliary” exemplars of the enq.xml file in the SLDR.
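To show where these exemplars live in an LDML file, here is a minimal Python sketch that reads them with the standard library. It relies on the LDML convention that exemplar sets are stored in `exemplarCharacters` elements under `characters`, with the main set carrying no `type` attribute. The function name `exemplars` and the sample document are hypothetical; the sample is not the content of enq.xml.

```python
import xml.etree.ElementTree as ET


def exemplars(ldml_text):
    """Return a dict of exemplar sets keyed by type ("main" when untyped)."""
    root = ET.fromstring(ldml_text)
    result = {}
    for element in root.findall("./characters/exemplarCharacters"):
        kind = element.get("type", "main")
        result[kind] = (element.text or "").strip()
    return result


# Illustrative LDML fragment with placeholder characters.
sample = """<ldml>
  <identity><language type="und"/></identity>
  <characters>
    <exemplarCharacters>[a b c]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[x y z]</exemplarCharacters>
  </characters>
</ldml>"""

print(exemplars(sample))
```

The same function could be applied to the body returned by the LDML API for a given locale, since `ET.fromstring` accepts bytes as well as text.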
This is one of the most human-friendly ways that SLDR data can be accessed by the general public, as opposed to the data-driven formats of the SLDR itself and the aforementioned APIs. This is also why ScriptSource contributions are one of the most common methods used by individuals to submit corrections to the SLDR.
CLDR Submissions
If enough data is gathered in an SLDR file that it can fulfill the minimum requirements for CLDR inclusion, the locale will be submitted to the CLDR.
For more information on CLDR coverage levels and minimum data requirements, see these pages, linked from the CLDR Specifications page on cldr.unicode.org:
- Core Data for New Locales: A summary of the minimum requirements for a locale to be submitted to the CLDR.
- Coverage Levels: A summary of all of the coverage tiers within the CLDR, beyond the bare minimum.