# Tackling common issues

## Concepticon concept lists

It is highly recommended that the concept list(s) used to collect a lexibank dataset are submitted to Concepticon. If this is the case,

- `metadata.json` should list the IDs of the concept lists under the `"conceptlist"` key,
- the content of the CLDF dataset's `ParameterTable` can be assembled easily by calling `args.writer.add_concepts()`,
- data in additional columns of the concept list can be copied into the `ParameterTable` by assigning a special value to the `Dataset`'s `concept_class` attribute: `concept_class = pylexibank.CONCEPTICON_CONCEPTS` (see the sketch after this list).
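
A minimal sketch of how these pieces fit together in a dataset module might look as follows; the dataset `id` and `dir` are hypothetical, everything else uses the attributes and calls mentioned above.

```python
import pathlib

import pylexibank
from pylexibank import Dataset


class MyDataset(Dataset):
    # hypothetical dataset id and directory
    id = "mydataset"
    dir = pathlib.Path(__file__).parent

    # copy additional concept list columns into the ParameterTable
    concept_class = pylexibank.CONCEPTICON_CONCEPTS

    def cmd_makecldf(self, args):
        # assemble the ParameterTable from the Concepticon concept list(s)
        # referenced via the "conceptlist" key in metadata.json
        args.writer.add_concepts()
```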

## Orthography profiles

Data can be automatically segmented if a dataset provides an orthography profile (Moran and Cysouw 2018). In the simplest case, this would be a single file `etc/orthography.tsv`, specifying the orthography of the whole dataset.

Often, though, in particular for aggregated datasets, the orthographies vary considerably between the languages in the dataset. In this case, per-language profiles can be useful. This can be accomplished by providing a set of profile files in the directory `etc/orthography`, where each profile follows the file name convention `<Language_ID>.tsv`.

Sometimes even more flexibility is needed, e.g. when the orthographies used in a dataset vary per contributor rather than per language. In this case, `etc/orthography` may hold any number of profile files named `<Profile_ID>.tsv`, and the profile selection per form is controlled by passing a keyword argument `profile=<Profile_ID>` to calls of `LexibankWriter.add_lexemes` (see the example below).
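
As an illustration, such a call inside `cmd_makecldf` might look as follows; the language, parameter and value are hypothetical, and the sketch assumes a profile file `etc/orthography/contributor-a.tsv` exists in the dataset.

```python
# hypothetical call inside cmd_makecldf; only the profile keyword
# is prescribed by pylexibank, all IDs and values are made up
args.writer.add_lexemes(
    Language_ID="lang1",
    Parameter_ID="1277_hand",
    Value="n̥ak",
    profile="contributor-a",  # selects etc/orthography/contributor-a.tsv
)
```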

### Preparing initial orthography profiles with LingPy

In order to prepare an initial orthography profile from your data, you can use the `profile` command of `lingpy`, which is installed along with `pylexibank`. We assume that you have already created a first CLDF version of your dataset, with `Value` and `Form` columns in the `FormTable`. In this case, creating an orthography profile is as easy as typing:

```bash
$ lingpy profile --clts --cldf --column=form --context -i cldf/cldf-metadata.json -o etc/orthography.tsv
```

The resulting profile will try to normalize your data following the CLTS system; the command assumes that the data is provided in CLDF format, takes the entries in the column `form`, and distinguishes three different contexts in which graphemes may occur: the beginning of a word, marked by `^`, the end of a word, marked by `$`, and the rest.

### Caveats in the creation of orthography profiles with context

When correcting such a profile, you have to keep in mind that the orthography profile algorithm provided by the `segments` package matches greedily.

If you have a minimal profile like the following one, it will fail to parse the string `n̥ak` when the string is passed as a form in lexibank.

| Graphemes | IPA  |
| --------- | ---- |
| ^         | NULL |
| $         | NULL |
| ^n        | n    |
| n         | n    |
| a         | a    |
| k$        | k    |
| n̥         | n̥    |

The reason is that lexibank first converts the string into its context representation `^n̥ak$`. The tokenizer then searches for the longest matching grapheme at the beginning of the sequence, where it finds `^n`. This is used as the first match, leaving the diacritic `◌̥` unmapped. Keep this in mind, as the behavior can otherwise seem very surprising, as if the profile were not working correctly.
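
The behavior can also be reproduced programmatically with the `segments` package. The following sketch builds the profile above in code (using the package's `Grapheme` column name) and tokenizes the context-marked string; it assumes the default error handling, which marks unmatched characters with `�`.

```python
from segments import Profile, Tokenizer

# build the minimal profile from the table above
profile = Profile(
    {"Grapheme": "^", "IPA": "NULL"},
    {"Grapheme": "$", "IPA": "NULL"},
    {"Grapheme": "^n", "IPA": "n"},
    {"Grapheme": "n", "IPA": "n"},
    {"Grapheme": "a", "IPA": "a"},
    {"Grapheme": "k$", "IPA": "k"},
    {"Grapheme": "n̥", "IPA": "n̥"},
)
tokenizer = Tokenizer(profile=profile)

# ^n is matched greedily, so the combining ring below is left
# unmatched and surfaces as the replacement character
print(tokenizer("^n̥ak$", column="IPA"))
```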

### Testing orthography profiles interactively with SegmentsJS

To test your profile interactively, you can use the interactive implementation of orthography profiles provided by the SegmentsJS application. You can test the behavior directly by pasting the profile above into the application and then entering the word `^n̥ak$`. Remember to always add the context markers when checking a sequence that is parsed wrongly or surprisingly in your lexibank dataset.