There are three repositories in GitHub/HuygensING that contain data produced by the Translatin project:
- translatin-wemi: Metadata preparation and analysis to identify works, expressions, manifestations and items (wemi), in the FRBR sense.
- translatin-manif: Publication of a selection of manifestations as Text-Fabric files, with an annotation export to the publishing pipeline of TeamText of HuC-DI.
- translatin: Data production for the final published result of the project: a collection of 100+ medieval Latin dramas.
The most comprehensive information on Translatin, the project, the people involved, the data and the programs, is in translatin-manif.
The final result of the Translatin project is a website with a selection of 100+ medieval Latin dramas.
The source data consists of Word documents, collected by Jan Bloemendal and then curated with the help of Dirk Roorda.
The result has been sent through a pipeline of conversions into a stream of annotations that can be presented by TAV (Text Annoviz), the Team Text front end by which corpora can be published on the Web.
We curated the source documents in a number of steps in order to arrive at a set of well-defined and well-formed pieces of text with meaningful and handy file names.
The stages of this process can be inspected in the directory datasource/transcriptions. Here are the successive steps:
- docx: Every input Word document is a drama. We gave all documents a concise name, of the form
  aaa - www.docx
  We weeded out excessive formatting, which mostly derived from HTML origins. Jan went through every document and marked the front, main and back parts.
- mdOrig: The result of a mechanical conversion from Word to markdown, done by the Pandoc program.
- mdRefined: The result of applying heuristics to the markdown of the previous step. We detected section headings and captions for acts, scenes, choruses, etc. We detected line numbers, page numbers and folio references, and wrapped them in special markers. During this process, Dirk inspected every document and wrote regular expressions tailored to each work to extract the numbers and headings.
- teiSimple: The result of a mechanical conversion from markdown to TEI, done by the Pandoc program.
- tei: The end result of the curation, a set of TEI files in the tei directory. The main purpose of this step is to add appropriate metadata to the teiHeader parts of the documents. This metadata comes from a spreadsheet prepared by Jan, with metadata on works and authors.
Reports of what these steps encountered and did can be found in the report/trans directory.
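To give an impression of the kind of heuristics applied in the mdRefined step, here is a minimal sketch in Python. It is not the project's actual code: the heading keywords, the `[[line N]]` marker syntax and the function names `refine_line` and `refine` are assumptions made for illustration only. The real expressions were tailored to each work.

```python
import re

# Assumed heading keywords; the real lists differ per work.
HEADING = re.compile(r"^(ACTUS|SCENA|CHORUS)\b", re.IGNORECASE)

# A verse line that ends in a bare number, e.g. "Arma virumque cano 15".
LINE_NO = re.compile(r"^(?P<text>.*\S)\s+(?P<num>\d+)\s*$")


def refine_line(line: str) -> str:
    """Mark one markdown line as a heading or wrap its trailing line number."""
    stripped = line.strip()
    # Headings are tested first, so that "Scena 2" is not read as verse 2.
    if HEADING.match(stripped):
        return "## " + stripped
    numbered = LINE_NO.match(line)
    if numbered:
        # Hypothetical marker syntax, chosen only for this sketch.
        return f"{numbered['text']} [[line {numbered['num']}]]"
    return line


def refine(markdown: str) -> str:
    """Apply the line heuristics to a whole markdown document."""
    return "\n".join(refine_line(line) for line in markdown.splitlines())


if __name__ == "__main__":
    print(refine("Scena 2\nO tempora 5\nQuid agis?"))
```

Heuristics like these inevitably misfire on some lines (a verse that happens to start with "Chorus", for example), which is why every document still had to be inspected by hand.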
Another part of the curation was to select and customize a TEI schema that is geared to drama texts. We used the Roma tool to customize TEI's module for performance texts.
However, although this schema contains sophisticated elements to encode all aspects of drama texts that are worth marking up, we have not actually tried to use those elements, because that was one step too far within the boundaries of the current project. We used the schema, and all documents validate against it, but the documents are all marked up using only the more generic elements of the TEI.
It is possible to gradually up-convert the current TEI to versions that make more use of the dramatic markup, and it can be done with the present schema.
The curation results, the TEI documents with proper metadata, are then pushed through a publishing pipeline. Here are the steps.
- Validation: We validated each document against the schema. We also produced some reports on element usage, id-references, etc. See the report/tei directory.
- Text-Fabric: We converted the TEI documents to a Text-Fabric datasource. See the TF docs on how to install and use Text-Fabric. For this project, Text-Fabric is mainly used as a Swiss Army knife to untangle the TEI markup from the content and produce a stream of (web) annotations with the same information content.
- WATM: Web Annotation Text Model. This is a raw data format that encodes the information to be contained in web annotations. These annotations will be displayed on the web, and they constitute, together with the plain text, the documents as they can be browsed and searched on the web.
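The idea of untangling markup from content can be illustrated with a small Python sketch that uses only the standard library. It does not use Text-Fabric and does not produce the actual WATM format: the function name `untangle` and the shape of the annotation records (tag, attributes, character offsets) are assumptions for this example. It only shows the principle that the plain text plus a list of annotations carries the same information as the original markup.

```python
import xml.etree.ElementTree as ET


def untangle(source: str):
    """Split an XML fragment into plain text and stand-off annotations.

    Each annotation records the element name, its attributes, and the
    character offsets [start, end) of the element's content in the text.
    """
    root = ET.fromstring(source)
    chunks = []       # pieces of plain text, in document order
    annotations = []  # one record per element, children before parents
    pos = 0           # current character offset in the plain text

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Text after a child's end tag belongs to the parent element.
            if child.tail:
                chunks.append(child.tail)
                pos += len(child.tail)
        annotations.append(
            {"tag": elem.tag, "attrs": dict(elem.attrib), "start": start, "end": pos}
        )

    walk(root)
    return "".join(chunks), annotations


if __name__ == "__main__":
    text, anns = untangle('<p n="1">Salve <hi rend="italic">amice</hi>!</p>')
    print(text)  # Salve amice!
    for ann in anns:
        print(ann["tag"], ann["attrs"], text[ann["start"]:ann["end"]])
```

In this example the `hi` annotation covers offsets 6 to 11 ("amice") and the `p` annotation covers the whole text, so a front end can rebuild the formatting from the text and the annotations alone.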