---
title: "XRONOS: An Open Data Infrastructure for Archaeological Chronology"
short-title: XRONOS
author:
  - name: Joe Roe
    orcid: 0000-0002-1011-1244
    email: joeroe@hey.com
    corresponding: true
    affiliation:
    - ref: iaw
    roles: [writing, editing, curation, software, analysis]
  - name: Clemens Schmid
    orcid: 0000-0003-3448-5715
    affiliation:
    - ref: mpg
    roles: [curation, software, editing]
  - name: Setareh Ebrahimiabareghi
    orcid: 0000-0003-3749-3147
    affiliation:
    - ref: iaw
    roles: [curation, investigation, editing]
  - name: Caroline Heitz
    orcid: 0000-0001-7188-6775
    affiliation:
    - ref: iaw
    roles: [conceptualization, funding, editing]
  - name: Martin Hinz
    orcid: 0000-0002-9904-6548
    email: martin.hinz@unibe.ch
    corresponding: true
    affiliation:
    - ref: iaw
    roles: [conceptualization, funding, supervision, writing, editing, software]
affiliations:
  - id: iaw
    name: University of Bern
    department: Institute of Archaeological Sciences
    url: https://www.iaw.unibe.ch/
  - id: mpg
    name: Max Planck Institute for Geoanthropology
    url: https://www.shh.mpg.de/
abstract: |
  XRONOS (<https://xronos.ch>) is an open data infrastructure for the backbone of the archaeological record – chronology. It provides open access to published radiocarbon dates and other chronometric data from any period, anywhere in the world. By collating a large number of existing regional and global compilations of dates, XRONOS offers the most comprehensive radiocarbon database yet published, with over 350,000 radiocarbon and 75,000 site records. It also provides a foundation for expanding the systematic collection of chronometric information beyond radiocarbon, with support for typological and dendrochronological dates and a generalisable data model that can be adapted to other methods of absolute dating. Automated and semi-automated quality control processes ensure that data from diverse sources is continuously integrated and standardised, making it easier to find information of interest and reducing the need for manual data cleaning by end users. In this paper we describe the concept and implementation of XRONOS in relation to the state of the art in chronometric data-sharing, and evaluate its potential as a general-purpose open repository and curation platform for archaeological chronology.
keywords: 
  - open data
  - chronology
  - chronometry
  - radiocarbon dating
  - dendrochronology
  - typological dating
#published: "Manuscript for submission to the *Journal of Computer Applications in Archaeology*"
number-sections: true
bibliography: 
  - references.bib
  - data/c14_datasets.bib
nocite: | # workaround for lack of table references
  @Al-Abyadh-Balghelam-Dalma-JebelEtAl, @Alcantara2021, @AndersonEtAl2010, @BANADORA, @BarceloAlvarezEtAl2013, @Benz2010, @BirdEtAl2022, @BohnerSchyle2004, @BronkRamseyEtAl2009, @Capriles2023, @CapuzzoEtAl2014, @ClistEtAl2023, @CochraneEtAl2021, @CourtneyMustaphi2016, @CremaEtAl2016, @DeeEtAl2012, @DeeEtAl2013, @dErricoEtAl2011, @deSaulieuEtAl2017, @Diaz-RodriguezEtAl2023, @DouglassEtAl2019, @FlohrEtAl2016, @GajewskiEtAl2011, @GarnettEtAl2023, @GayoEtAl2015, @GillespieEtAl1984, @GregoriodeSouza2020, @GrossmannEtAl2023, @HoebeEtAl2023, @HoggarthEtAl2021, @HuetEtAl2022, @HuetEtAl2024, @IRDD, @KatsianisEtAl2020, @KellyEtAl2022, @KimEtAl2021, @Kudo2018, @KudoEtAl2023, @LinnenluckeEtAl2023, @Lipo2020, @LoftusEtAl2019, @LucariniEtAl2020, @ManningEtAl2016, @Martinez-GrauEtAl2021, @McFadgenEtAl2000, @MichczynskiEtAl1995, @PalmisanoEtAl2017, @PalmisanoEtAl2022, @PalmisanoEtAl2022a, @Pardo-GordoEtAl2023, @Perrin2019, @PetcheyEtAl2022, @Rademaker2024, @RademakerEtAl2013, @RADON, @RADONB, @RADONB2024, @Raetzel-Fabian1999, @RamseyEtAl2010, @ReingruberThissen2005, @ReingruberThissen2009, @ReingruberThissen2017, @Riris2021, @SeidenstickerSchmid2021, @TanindiErdogu2005, @UriarteGonzalezEtAl2017, @VanStrydonckDeRoock, @Vermeersch2024, @WangEtAl2014, @Weninger2022, @WilliamsEtAl2008, @WilliamsEtAl2014, @WilliamsSmith2012, @WilliamsSmith2013
format:
  elsevier-pdf:
    keep-tex: true    
    cite-method: citeproc
    journal:
      name: Journal of Computer Applications in Archaeology
      formatting: preprint
  # Pretty but broken:
  # https://github.com/andrewheiss/hikmah-academic-quarto/issues/7
  # hikmah-pdf:
  #   papersize: a4
  #   mainfont: "Linux Libertine O"
  #   sansfont: "URW Gothic"
  #   biblatex-chicago: true
  #   biblio-style: authordate
  #   biblatexoptions:
  #     # - backend=biber
  #     - autolang=hyphen
  #     - isbn=false
  #     - uniquename=false
#  html:
#    page-layout: article
#    toc: true
#    embed-resources: true
execute:
  echo: false
  cache: true
knitr:
  opts_chunk:
    fig.path: figures/
---

```{r setup, cache=FALSE}
#| include: false
library("countrycode")
library("cowplot")
library("dplyr", warn.conflicts = FALSE)
library("dm")
library("english")
library("giscoR")
library("ggplot2")
library("glue")
library("gt", warn.conflicts = FALSE)
library("here")
library("khroma")
library("magick")
library("patchwork", warn.conflicts = FALSE)
library("purrr")
library("readr")
library("RPostgres")
library("sf")
library("spatstat")
library("stars")
library("stringr")
library("tidyr")
library("webshot2")

#' Count number of rows in a filtered data frame
n_filter <- function(.data, ...) {
  nrow(dplyr::filter(.data, ...))
}
```

# Introduction

Chronology is the backbone of the archaeological record.
As a necessary prerequisite to understanding the context of any past event or process [@Lucas2004], it has unsurprisingly been at the forefront of methodological development in archaeology for as long as the discipline has existed: from putting finds and events in sequence [@Thomsen1836; @Worsaae1843; @Petrie1899; @Ford1962; @Harris1979] to an increasingly wide array of scientific methods that place them on an absolute timescale [@Douglass1929; @DanielsEtAl1953; @Libby1955; @EverndenEtAl1965; @BadaHelfman1975] and an increasingly sophisticated set of statistical tools to build them into chronologies [@Suess1967; @BuckEtAl1991; @Mischka2004; @LevyEtAl2021; @Crema2024].
If archaeology is to be an open science [@Lake2012], it is therefore critical that effective open access to chronological information be placed front and centre.

Over the last two decades, archaeologists have answered this call by publishing an increasing number of compilations of dates from archaeological contexts as open data.
These efforts have facilitated re-evaluations of chronologies themselves [e.g. @HighamEtAl2014; @LoftusEtAl2019; @PratesEtAl2020; @KatsianisEtAl2020] but also the development of novel ways of using chronological data [e.g. @Grove2011; @SilvaSteele2014; @Crema2022; @CremaEtAl2024; @RirisEtAl2024; @MaromWolkowski2024].
The focus has been overwhelmingly on radiocarbon dating and most compilations focus on a single region and/or period.
The profusion of open radiocarbon data in particular has prompted several initiatives towards a global synthesis [e.g. @SchmidEtAl2019; @BronkRamseyEtAl2019; @BirdEtAl2022].

At the same time, the broad range of other types of chronological information used in archaeology—from other radiometric methods to dendrochronology to typological dating and epigraphy—remains relatively difficult to access as open data.
Even when it comes to radiocarbon data, available compilations are patchy in both geographic and temporal coverage and of variable quality (see @sec-c14-compilation).
The publication of many overlapping, non-standardised and mostly static open data resources means that it is still difficult to obtain reliable and up-to-date chronological datasets, especially for applications that crosscut conventional geographic and temporal domains of research.
Initiatives towards synthesis have improved this situation, but the goal of a global dataset that is both comprehensive and up-to-date remains elusive.

XRONOS is a new open data infrastructure that aims to provide access to published radiocarbon dates and other chronometric data from any period, anywhere in the world.
It is our attempt to move the state of the art in open archaeological chronology beyond the publication of static, one-off resources ['uploading CSVs', @Batist2023, pp. 188-189], and towards a living digital infrastructure [@Kintigh2006] embedded in a transparent and sustainable collaborative network.
The core of XRONOS is a server application that ingests chronological data from diverse sources, facilitates semi-automated and manual curation of this data, and makes it available via both a web-based graphical user interface (GUI) and machine-readable application programming interface (API).
The web frontend can be accessed via <https://xronos.ch> and all components of the software are developed as free and open source software with source code available at <https://github.com/xronos-ch>.

In the remainder of this paper, we describe the concept and implementation of XRONOS in relation to the state of the art in open chronometric data in archaeology, and evaluate our progress in achieving these goals as of writing.
Since we envisage both XRONOS as a dataset and XRONOS as software to be continually developing resources, the description here should be read as a 'snapshot' of the project as of writing rather than its final state.

# State of the Art

## Compilations of radiocarbon dates {#sec-c14-compilation}

Though an *explicit* emphasis on 'open data' is a relatively recent phenomenon in archaeology [@Lake2012], the open publication of compiled radiocarbon dates has a substantial prehistory.
Arnold and Libby [-@ArnoldLibby1951] initiated the tradition of regularly publishing 'date lists', a practice that was subsequently continued by radiocarbon laboratories as supplements to journals such as *Radiocarbon* and *Archaeometry*.
However, as the number of labs and volume of radiocarbon dates being produced grew, this paper-based format became impractical and mostly disappeared [@BronkRamseyEtAl2019; cf. e.g. @NdeyeEtAl2022], without being replaced by another form of systematic data-sharing or dissemination.
Additionally, because date lists were sourced from radiocarbon laboratories directly—not from those who collected the sample—they typically included only very limited contextual information.
On the eve of the AMS revolution there was an effort to create a computerised 'International Radiocarbon Database' [@Kra1988]—already by 1989 described as a "much needed, long overdue enterprise" [@Kra1989, p. 1067]—but it never came to fruition.

Thus, even though radiocarbon data comes from a relatively limited number of sources [some 172 active labs, @RadiocarbonLabList] and has relatively standardised reporting conventions [@Millard2014; @Bayliss2015], in practice the only way to produce aggregated datasets in recent decades has been to manually search through relevant literature for dates reported secondarily by the submitter of the sample.
This already laborious process is further hampered by a significant inconsistency in how much authors adhere to reporting conventions for measurements and sample metadata, a lack of conventions on the reporting of *contextual* information, weak or nonexistent disciplinary norms regarding the responsibility to publish results openly in a timely fashion, and a range of other issues affecting data reuse [@MoodyEtAl2021].

```{r data-c14-datasets}
c14_datasets <- read_tsv("data/c14_datasets.tsv", show_col_types = FALSE)

n_c14_datasets <- nrow(c14_datasets)
n_c14_datasets_sans_publication_year <- n_filter(c14_datasets, is.na(publication_year))
```

```{r tbl-c14-datasets}
#| tbl-cap: Summary of published compilations of radiocarbon dates. For full data, see supplementary materials.
# References in tables broken by:
# https://github.com/quarto-dev/quarto-cli/issues/9342
c14_datasets |>
  transmute(
    database = glue("[{name}]({url})"),
    publication_year, n_dates, citations
  ) |>
  arrange(publication_year) |>
  gt() |>
  cols_label(
    publication_year = "published",
    citations = "references",
    n_dates = "dates"
    ) |>
  cols_label_with(everything(), str_to_sentence) |>
  sub_missing(everything()) |>
  fmt_markdown(c(database, citations)) |>
  fmt_number(c(n_dates), decimals = 0) |>
  cols_width(
    database ~ pct(40),
    publication_year ~ pct(15),
    n_dates ~ pct(15),
    citations ~ pct(30)
  ) |>
  cols_align(
    "left", c(database)
  ) |>
  tab_options(
    table.font.size = 13
  )
```

Despite these inefficiencies, there has been a profusion of published radiocarbon compilations since the decline of the date list.
Our review of the literature identified `r n_c14_datasets` published since 1994 (@tbl-c14-datasets and supplementary materials).
This is almost certainly an undercount, because our firsthand knowledge of regional literature is limited to Europe and West Asia and many resources only ever existed in 'grey' formats (e.g. websites that were not indexed and no longer exist).
We also restricted ourselves to structured datasets disseminated primarily in a digital format; 
'date lists' in printed periodicals and gazetteers were excluded.

```{r fig-c14-datasets-time}
#| fig-cap: Cumulative number of radiocarbon compilations published since 1995
c14_datasets |>
  drop_na(publication_year) |>
  count(publication_year, sort = TRUE) |>
  arrange(publication_year) |>
  mutate(cumn = cumsum(n)) |>
  ggplot(aes(publication_year, cumn)) +
  geom_line() +
  scale_x_continuous(breaks = scales::breaks_width(5)) +
  labs(x = NULL, y = "Datasets*",
       caption = glue("* Excludes {english(n_c14_datasets_sans_publication_year)} datasets with an unknown year of publication")) +
  theme_minimal_vgrid()
```

The number of available compilations has increased exponentially since around 1995 (@fig-c14-datasets-time).
The first generation came around the turn of the century and consists mostly of online databases with a web frontend.
These include some databases operated by radiocarbon labs, for example the Oxford Radiocarbon Lab (ORAU) and the Belgian Royal Institute for Cultural Heritage (KIK-IRPA), and essentially represent a continuation of their date lists in a digital format.
The majority, however, were compiled from the literature by individual researchers interested in a particular region and/or period.
Notable early examples include ANDES 14C in 1994 [Central Andes, @MichczynskiEtAl1995], CARD [Canada, @GajewskiEtAl2011] and RADON [Europe, @Raetzel-Fabian1999] in 1999, and CANEW in 2001 [Near East, @ReingruberThissen2005].
From 2010, coinciding with broader shifts in scientific publishing [@TenopirEtAl2011], it became more common to publish standalone 'open data' products in the form of journal supplements, archives in repositories and/or data papers;
the *[Journal of Open Archaeology Data](<https://openarchaeologydata.metajnl.com>)*, launched in 2012, has been a prominent venue for this latter category.
Most recently there has been a trend towards providing version-controlled plain text data via platforms such as [GitHub](https://github.com), reflecting the broader adoption of these tools amongst computational archaeologists over the last decade [@BatistRoe2024].
The shift from online databases towards more static but more preservable open data products is welcome, given how many databases from the first generation have subsequently ceased to be accessible.
Version-controlled repositories are particularly well-suited to data compilation projects because they allow for continued updates whilst still providing snapshot 'releases' that are citeable and can be archived in long-term repositories.

```{r data-basemap}
#| include: false
coast <- gisco_get_coastallines()
m49_regions <- gisco_get_countries() |>
  mutate(
    m49_macroregion = countrycode(ISO3_CODE, "iso3c", "un.region.name"),
    m49_subregion = countrycode(ISO3_CODE, "iso3c", "un.regionsub.name"),
    m49_intregion = countrycode(ISO3_CODE, "iso3c", "un.regionintermediate.name"),
    m49_region = coalesce(m49_intregion, m49_subregion)
  ) |>
  summarise(geometry = st_union(geometry), .by = m49_region)
```

```{r fig-c14-datasets-map, message=FALSE}
#| fig-cap: Geographic coverage of published regional radiocarbon compilations according to our survey (see Supplementary Material).
c14_datasets |>
  separate_longer_delim(m49_region, "; ") |>
  drop_na(m49_region) |>
  summarise(n_dates = sum(n_dates, na.rm = TRUE), .by = m49_region) |>
  right_join(m49_regions) |>
  mutate(n_dates = replace_na(n_dates, 0)) |>
  st_as_sf() |>
  ggplot() +
  geom_sf(aes(fill = n_dates), colour = NA) +
  geom_sf(data = coast, fill = NA) +
  scale_fill_turku(reverse = TRUE, labels = scales::comma) +
  labs(fill = "Dates") +
  coord_sf(crs = "+proj=moll") +
  theme_minimal_grid()
```

Although this body of work has greatly improved the accessibility of radiocarbon dates and supported significant methodological advances [@Crema2022; @CremaEtAl2024], some limitations are apparent.

The geographic coverage of regional radiocarbon compilations is markedly uneven (@fig-c14-datasets-map).
Europe and, especially, North America are over-represented [@ChaputGajewski2016; @AlcantaraPedrozainpress].
South America, West Asia, and East Asia are reasonably well-covered, but there are practically no systematically compiled dates from East or West Africa, Central or South Asia, or Mainland Southeast Asia.
This is probably explained in part by a lower volume of archaeological research and access to radiocarbon dating in these regions, but a lack of attention in compilation work must also be a factor.
For example, radiocarbon dating has been an established part of Indian archaeology since at least 1961 [@KusumgarEtAl1963], but we have not been able to locate a single systematic compilation of dates from South Asia.^[We would be very happy to be corrected on this point.]

```{r data-c14-datasets-maintainance}
n_c14_datasets_maintained <- n_filter(c14_datasets, maintained)
n_c14_datasets_recently_updated <- n_filter(c14_datasets, maintained, last_updated > 2022)
mean_c14_datasets_lifespan <- mean(c14_datasets$last_updated - c14_datasets$publication_year, na.rm = TRUE)
```

Datasets based on literature review also become out of date almost immediately upon publication, due to the constant production of new dates.
Unfortunately this also applies to many databases that are in theory continuously updated, as it is common to see them become unmaintained and/or unexpectedly unavailable.
Of the `r n_c14_datasets` published datasets we identified, `r n_c14_datasets_maintained` were intended to be continuously updated, but only `r n_c14_datasets_recently_updated` have received updates in the last two years.
The average 'lifespan' of a dataset from its publication to its last update is around `r english(round(mean_c14_datasets_lifespan))` years.
Most radiocarbon datasets we reviewed were compiled with a specific goal in mind (e.g. a particular analysis) and, even where there is the intention to keep them updated afterwards, the exigencies of scientific production combined with the labour-intensive nature of the process make that difficult to achieve in practice.

Laboratory databases solve the problem of currency, but tend to have more arbitrary coverage, since the inclusion of data is determined by who submits dates to that lab, not any form of principled curation.
There are also comparatively few of them – most active labs no longer directly publish dates that they produce (if they ever did).

Other outstanding problems with existing compilations include various systematic biases in data collection [@ClistEtAl2023] and a large degree of overlap and duplication between individual databases.
For example, we identified `r english(sum(str_detect(c14_datasets$m49_region, "Western Europe"), na.rm = TRUE))` different resources covering Western Europe but none covering South Asia.
The quality and accessibility of published compilations is also variable.
`r sum(!c14_datasets$open, na.rm = TRUE)` of the `r nrow(c14_datasets)` resources we reviewed are not 'open' according to the Open Knowledge Foundation's definition of data openness ["Open data and content can be freely used, modified, and shared by anyone for any purpose", @OpenKnowledgeFoundation], which both limits the access to and reuse potential of these datasets.
Even among those that are open, many are not currently available in readily machine-readable formats (e.g. plain text or database files rather than PDFs or hypertext).

The fragmentation of the radiocarbon record into regional datasets also hinders analysis at larger scales.
Although the core elements of a radiocarbon date—laboratory identifier, radiocarbon age, measurement error—are more or less standardised, there is no such consistency in contextual information on the sample or site.
Such contextual information is important not just for the interpretation of dates, but for filtering out unreliable dates based on sample information ['chronometric hygiene' sensu @PettittEtAl2003] and for correcting for known systematic errors such as the marine reservoir effect [@AlvesEtAl2018].
Most published datasets incorporate all or part of earlier compilations, meaning duplicate records are also very common, but deduplicating them is not a trivial problem due to format variations (see @sec-implementation-data).
These issues are by no means impossible to overcome, but they add a significant amount of data-cleaning effort to a process that would otherwise be very amenable to standardisation.
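
To illustrate why deduplication across compilations is not trivial, the following is a minimal sketch in base R.
The records are invented, the column names (`labnr`, `c14age`, `c14std`) are assumptions, and `normalise_labnr()` is a hypothetical helper rather than the procedure used by any particular database.

```r
# Illustrative records only (not real dates); column names are assumptions
dates <- data.frame(
  labnr  = c("OxA-1234", "oxa 1234", "OxA1234", "KIA-5678", "KIA-5678", "Beta-0042"),
  c14age = c(5000, 5000, 5000, 3200, 3210, 1500),
  c14std = c(40, 40, 40, 30, 30, 25),
  source = c("A", "B", "C", "A", "B", "C")
)

# Hypothetical helper: reduce a laboratory identifier to a comparable key by
# upper-casing it, removing separators and dropping leading zeros
normalise_labnr <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("[^A-Z0-9]+", "", x)            # strip spaces, hyphens, etc.
  sub("^([A-Z]+)0*([0-9]+)$", "\\1-\\2", x) # "OXA1234" becomes "OXA-1234"
}

dates$labnr_key <- normalise_labnr(dates$labnr)

# Exact duplicates (same key, age and error): keep the first occurrence
is_dup <- duplicated(dates[c("labnr_key", "c14age", "c14std")])
deduplicated <- dates[!is_dup, ]

# Conflicts (same key, differing ages): cannot be resolved automatically
n_variants <- tapply(deduplicated$c14age, deduplicated$labnr_key,
                     function(x) length(unique(x)))
conflicts <- names(n_variants)[n_variants > 1]
```

In this toy example the three spellings of the first identifier collapse into one record, while the two conflicting measurements reported under the same identifier are flagged for manual review instead of being merged.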

## Global radiocarbon compilations {#sec-global-compilations}

```{r data-fig-c14-global, include=FALSE}
c14_global <- read_tsv(here("data", "c14_global.tsv"))
c14_global_sum <- read_tsv(here("data", "c14_global_sum.tsv"))

ppp_window <- st_transform(st_union(gisco_get_countries()), "+proj=moll")

c14_global_density <- c14_global |>
  drop_na(longitude, latitude) |>
  nest(.by = source) |>
  mutate(
    sf = map(data, st_as_sf, coords = c("longitude", "latitude"), crs = 4326),
    sf = map(sf, st_transform, crs = "+proj=moll"),
    sf = map(sf, st_filter, ppp_window), # no dates in the sea, please
    n_sf = map_int(sf, nrow),
    ppp = map(sf, as.ppp, W = as.owin(ppp_window)),
    density = map(ppp, density, sigma = 50000, adjust = 10, eps = 50000),
    density = map(density, st_as_stars)
  )
```

```{r fig-c14-global, message=FALSE, warning=FALSE}
#| fig-cap: Geographic and temporal (sum calibration) distribution of georeferenced dates in XRONOS and other global radiocarbon compilations
#| fig-height: 6
plot_c14_density <- function(density, title, coastline = coast) {
  ggplot() +
    geom_stars(
      aes(fill = ecdf(v)(v)), # transform density to quantiles like plot.stars()
      data = density
    ) +
    geom_sf(data = coastline, fill = NA) +
    scale_fill_stepsn(
      name = "Density",
      n.breaks = 11,
      colours = colour("turku", reverse = TRUE)(11),
      labels = c("Low", rep("", 7), "High"),
      na.value = NA,
      guide = guide_none()
    ) +
    labs(title = title) +
    coord_sf(crs = "+proj=moll") +
    theme_minimal_grid() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.margin = unit(c(11, 5.5, 0, 5.5), "pt")
    )
}

fig_density <- map2(c14_global_density$density, c14_global_density$source, 
                    plot_c14_density)
names(fig_density) <- c14_global_density$source

# From https://stackoverflow.com/questions/11053899/how-to-get-a-reversed-log10-scale-in-ggplot2
transform_reverse_log <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  scales::trans_new(paste0("reverselog-", format(base)), trans, inv, 
                    scales::log_breaks(base = base), 
                    domain = c(1e-100, Inf))
}

plot_c14_sum <- function(data, n) {
  ggplot(data, aes(age, pdens)) +
    geom_area(fill = "#3C3B39") + # 'XRONOS grey'
    scale_x_continuous(name = NULL, breaks = c(50000, 5000, 500),
                       labels = c("50000", "5000", "500 BP"),
                       transform = transform_reverse_log()) +
    scale_y_continuous(name = NULL, labels = NULL) +
    labs(caption = paste0("N=", n)) +
    theme_minimal_vgrid() + 
    theme(
      axis.line.y = element_blank(),
      axis.ticks.y = element_blank(),
      plot.margin = unit(c(0, 5.5, 11, 5.5), "pt")
    )
}

fig_sum <- c14_global_sum |>
  nest(.by = source) |>
  pull(data) |>
  map2(c14_global |> nest(.by = source) |> pull(data) |> map(nrow), plot_c14_sum)
names(fig_sum) <- unique(c14_global_sum$source)

# Lay out with patchwork
fig_density[["c14bazAAR"]] + fig_density[["IntChron"]] +
  fig_sum[["c14bazAAR"]] + fig_sum[["IntChron"]] +
  fig_density[["p3k14c"]] + fig_density[["XRONOS"]] +
  fig_sum[["p3k14c"]] + fig_sum[["XRONOS"]] +
  plot_layout(ncol = 2, heights = rep(c(5, 1), 2))

# Alt. plot by country
# c14_global |>
#   count(country, source) |>
#   left_join(gisco_get_countries(), by = c("country" = "CNTR_ID")) |>
#   mutate(n = replace_na(n, 0)) |>
#   st_as_sf() |>
#   ggplot() +
#   facet_wrap(vars(source)) +
#   geom_sf(data = gisco_get_countries(), fill = "white", colour = NA) +
#   geom_sf(aes(fill = n), colour = NA) +
#   geom_sf(data = coast, fill = NA) +
#   scale_fill_turku(reverse = TRUE, labels = scales::comma) +
#   labs(fill = "Dates") +
#   coord_sf(crs = "+proj=moll") +
#   theme_minimal_grid()
```

The profusion of radiocarbon compilations over the last decade has naturally prompted many to think globally.
Three existing initiatives in particular share similar aims to XRONOS (at least as far as radiocarbon is concerned): c14bazAAR, IntChron, and p3k14c.

The first available synthetic radiocarbon database was c14bazAAR [@SchmidEtAl2019], an R package that provides an index of openly published radiocarbon databases and a common interface for retrieving them and performing basic data cleaning.
Because c14bazAAR downloads data from its original source repositories, rather than mirroring it, it only includes resources that have been published in a fully open and machine-accessible format.
Despite this limitation, it has global coverage and a large number of dates (@fig-c14-global), and was therefore our starting point for data collection for XRONOS.

Another indexical approach is taken by the IntChron project [@BronkRamseyEtAl2019], which indexes data from multiple sources and exposes it through a common JSON-based web interface.
The IntChron specification is open, meaning that radiocarbon labs or compilation projects can implement it independently and thereby allow end users to access their data through a common interface (though to our knowledge it has so far only been adopted by databases associated with the Oxford Radiocarbon Lab).
The JSON format also lends itself to the implementation of wrapper libraries; for example, the rIntChron package gives direct access to IntChron-indexed databases in R [@Roe2024].

p3k14c [@BirdEtAl2022] instead compiles multiple source databases into a single flat file dataset, with a similar level of coverage to c14bazAAR. 
The major advantage of this approach is that the data is made internally consistent and has been manually cleaned to an extent, which makes it particularly well-suited to global analyses. 
The downside is that without the continuous link to the source databases present in c14bazAAR and IntChron, it can only be kept up to date manually with periodic re-releases.
An accompanying package [@BirdEtAl2024] provides direct access to the p3k14c dataset in R.
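
The flat-file approach described above can be sketched in a few lines of base R.
This is an illustration under stated assumptions, not the procedure used by p3k14c: the two source tables are invented, and the common schema, the `schema` mapping and the `to_common()` helper are hypothetical.

```r
# Two toy source tables with differing column names (illustrative values only)
source_a <- data.frame(LabID = c("OxA-1", "OxA-2"), BP = c(4500, 4620), SD = c(35, 40))
source_b <- data.frame(lab_code = "KIA-9", age = 3100, error = 30, site = "Site X")

# Hypothetical mapping from a common schema to each source's column names
schema <- list(
  a = c(labnr = "LabID", c14age = "BP", c14std = "SD"),
  b = c(labnr = "lab_code", c14age = "age", c14std = "error", site = "site")
)

# Rename the mapped columns, fill fields that a source lacks with NA,
# and record which source each row came from
to_common <- function(data, mapping, source,
                      fields = c("labnr", "c14age", "c14std", "site")) {
  out <- data[unname(mapping)]
  names(out) <- names(mapping)
  for (f in setdiff(fields, names(out))) out[[f]] <- NA
  out$source <- source
  out[c("source", fields)]
}

# Bind the harmonised tables into a single flat table
flat <- rbind(
  to_common(source_a, schema$a, "a"),
  to_common(source_b, schema$b, "b")
)
```

Retaining a `source` column keeps each row traceable to its origin, but once the tables are bound the result is a snapshot: changes in the source databases only reach it through a new release.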

As of December 2024, c14bazAAR had `r format(nrow(filter(c14_global, source == "c14bazAAR")), big.mark = ",")` radiocarbon dates with unique laboratory identifiers (excluding those sourced from p3k14c and XRONOS), IntChron had `r format(nrow(filter(c14_global, source == "IntChron")), big.mark = ",")` (excluding those from non-archaeological contexts), and p3k14c had `r format(nrow(filter(c14_global, source == "p3k14c")), big.mark = ",")`.
The geographic distribution of dates from each is similar (@fig-c14-global), reflecting the large degree of overlap between the sources of each compilation.
IntChron, which in practice is currently only used to publish dates associated with the Oxford Radiocarbon Lab, has dates from more diverse contexts, but is an order of magnitude smaller.

## Beyond radiocarbon

Radiocarbon has been by far the most active area of open data compilation, but archaeological chronology incorporates a much more diverse range of sources of information [@Harding1999].
In periods beyond the practical limit of radiocarbon dating (c. 55,000 BP), other types of radiometric (K–Ar, U–Pb, etc.), chemical or luminescence dating offer an alternative [@Aitken1999].
Conversely, in historic periods, radiocarbon is often relatively underused compared to conventional typological dating (based on artefact characteristics), which in these periods can offer comparable or better temporal resolution, or direct dating based on epigraphy [@HermankovaEtAl2021], numismatics [@KemmersMyrberg2011] or historical sources.
In places where it is widely available, dendrochronology [@Baillie2014] also produces significantly better resolved chronologies and therefore tends to be the main source of chronometric data.
Other more application-specific chronological methods include shoreline dating [@Brogger1905; @Roalkvam2023], lichenometry [@Benedict2009] and rock weathering dating [@Whitley2012; @Bednarik2020].
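
The practical limit of radiocarbon dating mentioned above follows from simple decay arithmetic, sketched here in R.
The calculation uses the Libby mean-life of 8033 years on which conventional radiocarbon ages are based, and treats the c. 55,000 BP figure as a conventional radiocarbon age, which is a simplifying assumption; `fraction_remaining()` is a name introduced for illustration.

```r
# Conventional radiocarbon ages are calculated with the Libby mean-life of
# 8033 years (the Libby half-life of 5568 years divided by ln 2)
libby_mean_life <- 8033

# Fraction of the original 14C remaining in a sample of a given radiocarbon age
fraction_remaining <- function(c14age) exp(-c14age / libby_mean_life)

# Percentage of the original 14C left at a range of ages (years BP)
ages <- c(5000, 20000, 55000)
percent_remaining <- round(100 * fraction_remaining(ages), 2)
```

At 55,000 BP only around 0.1% of the original radiocarbon remains, so the signal becomes very difficult to distinguish from background and even slight contamination with modern carbon has a large effect on the measured age.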

Compared to radiocarbon, there are few examples of systematic, open compilations of any of these other types of data. This is most striking when it comes to other radiometric/scientific dating methods, as the data structures and publication modes are very similar to radiocarbon.
The 'Radiocarbon Palaeolithic Europe Database' [@Vermeersch2020], despite the name, includes a significant number of thermo- and optically stimulated luminescence, electron spin resonance, uranium–thorium and amino acid dates.
Similarly, the AustArch database [@WilliamsEtAl2014] includes luminescence dates alongside radiocarbon, but is limited to Australia and was last updated in 2013.
Apart from these and a few other exceptions where other scientific dates are collected alongside radiocarbon, we are not aware of any open compilations of them.

### Dendrochronology

<!-- TODO: add references? -->

With regard to tree-ring data, some databases provide valuable resources for dendrochronological studies in general but are not primarily intended for archaeological contexts. For instance, Dendro4Art specialises in dendrochronological data related to wooden art objects, such as sculptures and panel paintings. While this focus serves art historians and conservationists well, its utility for studying prehistoric datasets is minimal. Similarly, the Dendrochronological Picture Database, maintained by the Swiss Federal Institute for Snow and Avalanche Research (SLF), offers a visual archive of approximately 1,400 images documenting dendrochronological phenomena. Although valuable as an educational resource, it does not provide the raw data necessary for chronological or archaeological analyses. Additionally, the OLDLIST and Eastern OLDLIST databases focus on documenting the maximum ages of trees worldwide. Their emphasis on biological longevity, while significant for ecological research, limits their applicability to archaeological or prehistoric investigations.

Among databases that do provide dendrochronological data, the degree to which they support prehistoric research varies substantially. The NOAA International Tree-Ring Data Bank (ITRDB) serves as a global repository of tree-ring measurements. However, its focus remains predominantly on North America, with only 34 datasets representing European prehistoric contexts. This restricts its relevance for studying the European past. Similarly, the ADS database, maintained by the UK-based Vernacular Architecture Group, compiles dendrochronological data from the UK but is limited to medieval and later periods, making it unsuitable for prehistoric studies.

DendroDB, hosted by the Swiss Federal Institute for Forest, Snow and Landscape Research (WSL), emphasizes ecological and climate studies over archaeological wood material. While it claims a broad scope, the database remains non-functional, rendering it ineffective for research needs. The CFS-TRenD database, managed by the Canadian Forest Service, compiles over 4,600 datasets from Canadian forests, primarily focusing on boreal ecosystems. Despite its extensive coverage for North America, its geographical specificity and lack of open access restrict its utility for European prehistoric contexts. Similarly, the QUB Dendrochronology Database, managed by Queen’s University Belfast, offers valuable datasets for Ireland and the UK but lacks significant representation of prehistoric material, limiting its application in broader archaeological investigations. The Building Archaeology Research Database (BARD) contains over 24,000 records, including dendrochronological data from more than 2,700 buildings. However, its focus on medieval and post-medieval timber-frame construction further narrows its utility for studies involving prehistoric wood samples.

The Digital Collaboratory for Cultural Dendrochronology (DCCD) presents itself as a potentially valuable international platform for dendrochronology, particularly through its integration with archaeological data services such as ARIADNE. However, it remains heavily biased toward datasets from the Netherlands, which account for more than two-thirds of its entries, while only 0.08% of its *Quercus* data pertain to Switzerland and just 2.5% represent prehistoric datasets. Notably, these estimates date back to 2021, and an updated assessment is currently unattainable following the platform’s migration to DataverseNL, which now charges an annual fee of nearly €6,000 for access. Furthermore, database activity has declined significantly, from 3,846 new project records between 2010 and 2014 to only 83 by the end of 2019. Although 519 additional records have been reported since 2021, it remains unclear whether this figure includes revisions to pre-existing entries, potentially inflating the count. The database’s narrow focus and restrictive access model significantly limit its broader utility for prehistoric research.

Finally, the Strategic Environmental Archaeology Database (SEAD), hosted by Umeå University, integrates multiple environmental proxies, including dendrochronological data. However, the dendrochronological component is largely confined to Swedish data, with limited relevance to prehistoric contexts. While SEAD aims for broader applications, its dendro component has limited utility for studies outside Sweden.

The utility of dendrochronological databases for prehistoric research varies widely. Global resources such as NOAA’s ITRDB and DCCD offer substantial datasets but face significant limitations in practical geographical and temporal scope. Similarly, platforms like DendroDB and BARD primarily cater to historical studies, leaving critical gaps in prehistoric coverage. Specialized resources like OLDLIST, Dendro4Art, and the Dendrochronological Picture Database provide valuable contributions but lack the direct relevance necessary for archaeological tree-ring analysis. Consequently, researchers focused on prehistoric dendrochronology must navigate a fragmented landscape of databases, each offering distinct strengths and limitations. Addressing these gaps remains crucial for advancing the field.

### Typological dating

Typological dates (i.e. relative dates assigned from artefact characteristics by expert judgement or seriation) are ubiquitous in archaeological studies but rarely treated as a form of chronometric data in their own right.
For example, the majority of the radiocarbon datasets we reviewed (@sec-c14-compilation) included some form of typological chronological information in the form of a 'period' or 'culture' column.
This is also typically present in many other forms of systematic compilation work in archaeology, for example site gazetteers.
Aggregated typological information from such sources is often used in aoristic analysis and related methods [@Mischka2004; @Crema2024].
What is lacking in this presentation of typological dating is metadata on how the determination was made and how exactly it is to be understood.
Like any archaeological date, a typological date is derived from a physical sample – the object or set of objects from which a chronological estimate was derived.
Typological dates on one class of object may well clash with other classes of object, or for that matter with scientific dates – does one trust the date on pottery, the date on architecture, or the radiocarbon date?
Without additional metadata on e.g. who made the typological determination or what the radiocarbon date was obtained on, such inconsistencies are difficult to resolve.
Similarly, the absolute date range corresponding to a typological determination (e.g. "Late Neolithic") can be interpreted in multiple ways depending on the region and intentions of the expert making the determination.
PeriodO [@RabinowitzEtAl2016] is a linked open data infrastructure that includes a shared vocabulary of typological periods and corresponding calendar age estimates, and is an important step towards addressing the latter problem.
However, it remains to be systematically linked to compilations of typological dates [though there are some efforts in this direction e.g. @HannahEtAl2022].

---

What is missing to date is a general-purpose infrastructure for combining all of these types of chronometric information on a global scale.
This is the gap that XRONOS aims to fill, starting with three methods: radiocarbon, dendrochronology, and typological dating.
These were chosen because they are widely used and relatively advanced in terms of open data, but an important aim of the project is to develop a generalisable data model that can easily scale to any and all types of archaeological chronology (see @sec-data-model).

# Concept

XRONOS inherits its basic structure from RADON [@Raetzel-Fabian1999; @RADON; @RADONB; @RADONB2024], with a database-backed web application and a data model that separates radiocarbon dates, contextual information, and sites.
Our overall aim in developing XRONOS is to bring this model, which RADON has operated on for more than twenty years, up to date; to generalise it to other types of chronometric information; and to transform it from an online database into a data infrastructure that supports the continuous ingestion, curation, and open dissemination of archaeological chronologies from diverse sources.

## Design goals

XRONOS is our answer to Kintigh's call [-@Kintigh2006] for digital infrastructures that don't just provide access to chronological data but enable researchers to "archive, access, integrate, and mine disparate data sets".
It complements several similar open data infrastructures within and outwith archaeology, such as the Global Biodiversity Information Facility [GBIF, @CanhosEtAl2004], the Strategic Environmental Archaeology Database [SEAD, @Buckland2014], IMPACT for mummified human remains [@NelsonWade2015], Neotoma for palaeoecological data [@WilliamsEtAl2018], IsoArcH for stable isotope data [@PlompEtAl2022], the International Soil Radiocarbon Database [ISRaD, @LawrenceEtAl2020], and the 'Big Interdisciplinary Archaeological Database' (BIAD), an ambitious new initiative to combine many of these individual domains, including chronology [@ReiterEtAl2024].
To improve upon existing global syntheses of radiocarbon dates (see @sec-global-compilations), we aimed to develop a living infrastructure that both continually collected data from diverse sources and presented a seamless single database to the user.

Our principal goals for the software were therefore to:

1. Combine all available sources of radiocarbon and other chronometric data in a single database
2. Develop robust tools for the continuous ingestion, collation and curation of this data
3. Disseminate the collated and curated data as linked open data within a FAIR framework

Meeting these goals required the development of a) a conceptual data model, including links to other open data resources, that is flexible enough for all forms of chronometric data; and b) a software implementation that supports the main functions of ingesting, curating, and disseminating this data.
The individual components of this work are described in more depth below but, briefly, consist of a relational data model implemented in a PostgreSQL database; a Ruby application providing server-based tools for ingestion, curation and dissemination of data; and multiple graphical and programmatic interfaces to the resulting dataset.

## Data model {#sec-data-model}

```{r fig-data-model}
#| fig-cap: Simplified entity relationship diagram showing the XRONOS data model
# Requires a running XRONOS dev database
# See https://github.com/xronos-ch/xronos.rails for setup instructions
xronos_dev_db <- DBI::dbConnect(
  RPostgres::Postgres(),
  user= "xronos",
  password = "xronos",
  dbname = "xronos_dev",
  host = "localhost"
)

xronos_dm <- dm_from_con(xronos_dev_db, learn_keys = TRUE) # doesn't learn much

xronos_dm_svg <- xronos_dm |>
  dm_select_tbl(c("c14_labs", "c14s", "cals", "citations", "contexts",
                  "materials", "references", "samples", "site_names",
                  "site_types", "site_types_sites", "sites", "taxons", "typos", 
                  "users", "versions", "wikidata_links")) |>
  # Relations
  dm_add_fk("c14s", "c14_lab_id", "c14_labs") |>
  dm_add_fk("c14s", "sample_id", "samples") |>
  dm_add_fk("cals", c("c14_age", "c14_error"), "c14s", c("bp", "std")) |>
  dm_add_fk("citations", "reference_id", "references") |>
  dm_add_fk("citations", "citing_id", "sites") |>
  dm_add_fk("citations", "citing_id", "c14s") |>
  dm_add_fk("citations", "citing_id", "typos") |>
  dm_add_fk("contexts", "site_id", "sites") |>
  dm_add_fk("samples", "context_id", "contexts") |>
  dm_add_fk("samples", "material_id", "materials") |>
  dm_add_fk("samples", "taxon_id", "taxons") |>
  # dm_add_fk("site_names", "site_id", "sites") |> # autodetected
  dm_add_fk("site_types_sites", "site_id", "sites") |>
  dm_add_fk("site_types_sites", "site_type_id", "site_types") |>
  dm_add_fk("typos", "sample_id", "samples") |>
  dm_add_fk("wikidata_links", "wikidata_linkable_id", "sites") |>
  dm_add_fk("wikidata_links", "wikidata_linkable_id", "c14s") |>
  dm_add_fk("wikidata_links", "wikidata_linkable_id", "c14_labs") |>
  dm_add_fk("wikidata_links", "wikidata_linkable_id", "materials") |>
  dm_add_fk("wikidata_links", "wikidata_linkable_id", "taxons") |>
  dm_add_fk("versions", "whodunnit", "users", "id") |>
  dm_add_fk("versions", "item_id", "c14_labs") |>
  dm_add_fk("versions", "item_id", "c14s") |>
  dm_add_fk("versions", "item_id", "contexts") |>
  dm_add_fk("versions", "item_id", "materials") |>
  dm_add_fk("versions", "item_id", "references") |>
  dm_add_fk("versions", "item_id", "samples") |>
  dm_add_fk("versions", "item_id", "site_names") |>
  dm_add_fk("versions", "item_id", "site_types") |>
  dm_add_fk("versions", "item_id", "sites") |>
  dm_add_fk("versions", "item_id", "taxons") |>
  dm_add_fk("versions", "item_id", "typos") |>
  # Exclude self-references (primarily supersedable models) for readability, e.g.
  # dm_add_fk("contexts", "superseded_by", "contexts") |>
  dm_draw("BT", view_type = "title_only", column_types = TRUE) |>
  DiagrammeRsvg::export_svg()

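# export_svg() returns the SVG as a string, whereas draw_image() expects a file
# path or image object, so write it to a temporary file first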
xronos_dm_svg_file <- tempfile("xronos_dm_", fileext = ".svg")
write(xronos_dm_svg, xronos_dm_svg_file)
ggdraw() + draw_image(xronos_dm_svg_file)
```

At the base of the XRONOS data model (@fig-data-model) are sets of spatiotemporal coordinates or, as we call them, *chrons*.
In an archaeological context, we conceptualise a chron as an assertion linking human activity with a particular point in space and time.
Our data model currently encompasses three types of chron: radiocarbon dates, typological dates (e.g. 'Early Neolithic') and dendrochronological dates.
However, we anticipate that the concept will accommodate other types of absolute and relative dating techniques as the scope of the database expands.

Chrons are conceptually useful because they emphasise that different types of archaeological 'dates', drawn from different sources, have essentially the same information content: the location of an event in space and time.
We thereby avoid privileging certain sources of chronological data over others (as might be the case if, for example, we treated 'period' as a fixed attribute of a site) and can accommodate contradictory information (e.g. differences of opinion on typological classification).
This is important given that XRONOS aspires to be an authoritative 'backbone' with a global scope, so we cannot realistically impose a single chronological scheme or resolve conflicting information provided by specialists.
Chrons are also useful practically because they expose a common interface for attributes that all types of chronological information share, such as a *terminus post quem* (TPQ), *terminus ante quem* (TAQ), and midpoint estimate.
This allows applications that use XRONOS' data model (including XRONOS itself) to collate chronological data from multiple sources, without necessarily having to be aware of the peculiarities of each type of dating.

In order to unify chronological information in the form of a chron, we need a common chronological 'coordinate system'.
The natural choice is a *calendar probability distribution*, which expresses the probability that an event occurred as a function of time on a calendric scale.
Most archaeologists are familiar with working with this kind of representation in the form of calibrated radiocarbon dates, but it can be extended and generalised to essentially any kind of chronological information.
For example, in aoristic analysis [@Mischka2004], a periodic time estimate (e.g. the event occurred in the Neolithic) is conceptualised as a uniform probability distribution over the timespan between the known start and end dates of that period.
A similar model is used in OxCal [@BronkRamsey2009, a direct inspiration for our approach] to integrate prior chronological information from diverse sources.
In practical terms, this model means that the canonical representation of the time component of any chron in XRONOS, regardless of source, is a probability distribution over the set of calendar years (arbitrarily measured in years Before Present) in which it could have plausibly occurred.
Further statistics, e.g. a midpoint estimate or TPQ/TAQ range, can be derived from this distribution using well-known methods.
In this way, we can support many different types of date and much of the implementation of XRONOS can be agnostic to the source of chronological information.
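
To make this concrete, the following is a minimal sketch in R of how a typological period and a numerical age estimate might both be expressed as probability distributions over calendar years BP, and how a TPQ, TAQ and midpoint could then be derived in the same way from either. It is not XRONOS' implementation: the function names, the 95% interval, the period boundaries, and the use of a normal distribution as a stand-in for a properly calibrated radiocarbon date are all assumptions made for illustration.

```r
# Illustrative sketch only: names and conventions are assumptions, not XRONOS code.

# A uniform distribution over a period with known start and end (years BP),
# as in aoristic analysis
uniform_chron <- function(start_bp, end_bp) {
  years <- seq(start_bp, end_bp)  # start_bp > end_bp: counts down towards the present
  data.frame(cal_bp = years, p = rep(1 / length(years), length(years)))
}

# A stand-in for a calibrated date: a discretised normal distribution.
# (Real calibration against a calibration curve is not shown here.)
normal_chron <- function(mean_bp, sd, width = 4) {
  years <- seq(round(mean_bp + width * sd), round(mean_bp - width * sd))
  p <- dnorm(years, mean = mean_bp, sd = sd)
  data.frame(cal_bp = years, p = p / sum(p))
}

# Summary statistics shared by every chron, whatever its source
summarise_chron <- function(chron, level = 0.95) {
  chron <- chron[order(chron$cal_bp, decreasing = TRUE), ]  # oldest first
  cum_p <- cumsum(chron$p)
  tail_p <- (1 - level) / 2
  # First year at which the cumulative probability reaches q
  pick <- function(q) chron$cal_bp[which(cum_p >= q - 1e-9)[1]]
  c(tpq = pick(tail_p), midpoint = pick(0.5), taq = pick(1 - tail_p))
}

late_neolithic <- uniform_chron(5500, 4500)  # hypothetical period boundaries
measured       <- normal_chron(5000, 50)     # hypothetical age estimate

summarise_chron(late_neolithic)
summarise_chron(measured)
```

Because both objects share the same two-column layout, `summarise_chron()` does not need to know which dating method produced them, which is the property the chron concept is intended to provide.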

Chrons are located in space through association to a *sample* – the physical object from which a chronological determination was made.
The location of samples is represented with geographical coordinates and an associated coordinate reference system (CRS), though since in practice the precise location of single samples is rarely available, this property is usually inherited from the site.
We also record relevant metadata on the nature of the sample.
For radiocarbon dates, for example, we follow established conventions [@Millard2014] in recording the type (e.g. charcoal, charred seed) and, where applicable, taxonomic designation (e.g. *Quercus*, *Triticum dicoccum*) of the organic material used for dating.
For typological dates, an ideal scenario would be for the sample to represent the particular object from which an inference was made (e.g. 'Natufian' might be inferred from 'lunate-type microlith').
In practice, the best we can glean from most published datasets is the type of material used (e.g. 'pottery', 'lithics').
The same sample can be associated with multiple chrons, including different types of chron.
This is useful, for example, for representing replicate radiocarbon dates on the same sample, or radiocarbon dates and dendrochronological dates made on the same section of wood for wiggle-matching.

Further contextual information is associated with *contexts* and *sites*.
The site is the primary geographic container for chronological information.
As already mentioned, we typically record the spatial location of chrons using this entity, though it is possible to modify this by providing specific coordinates at the sample level.
Sites also have attributes describing their conventional name or names in different languages and are associated with a flexible 'site type' typology that combines information on their form and function.

A context represents the specific find-context of a sample, e.g. an architectural feature, stratigraphic unit, or phase.
Since the units and conventions for recording such information vary greatly between different regions and archaeological traditions---and XRONOS is designed with global data in mind---we leave the question of what a context precisely represents open, and only record an unstandardised, free text label for it.
Crucially, however, contexts can have a self-referential association to other contexts belonging to the same site.
This allows the data model to encode arbitrary relational structures between contexts, whether they be hierarchical (e.g. phases and sub-phases) or graphical (e.g. stratigraphic).
In this way, contexts can serve as a foundation for chronological modelling.
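
As a rough illustration of the hierarchical case, the sketch below stores a free-text label and an optional reference to a parent context, then walks up the chain to recover the nesting of a given context. The column names (in particular `parent_id`) and the example records are hypothetical and do not reflect the XRONOS schema.

```r
# Illustrative sketch only: column names and records are hypothetical.
contexts <- data.frame(
  id        = c(1, 2, 3, 4),
  site_id   = c(10, 10, 10, 10),
  label     = c("Phase II", "Phase IIa", "Pit 12", "Layer 3"),
  parent_id = c(NA, 1, 2, NA)  # self-referential association to another context
)

# Walk up the chain of parents to recover the full nesting of a context
context_path <- function(id, contexts) {
  path <- character(0)
  while (!is.na(id)) {
    row  <- contexts[contexts$id == id, ]
    path <- c(row$label, path)
    id   <- row$parent_id
  }
  paste(path, collapse = " > ")
}

context_path(3, contexts)
context_path(4, contexts)
```

A single parent column is only sufficient for tree-like structures; a stratigraphic graph, in which a context may relate to several others, would presumably need a separate table of pairwise relations.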

The series of relations `[chron] > sample > context > site` links the chronological and contextual sides of the XRONOS data model.
Each step is a many-to-one association, meaning for example that it is possible to attach multiple chrons to the same sample (e.g. replicated radiocarbon dates on the same material), or multiple *types* of chron to the same sample (e.g. radiocarbon dates on tree-rings for wiggle-matching).
Since this kind of information is rarely systematically recorded in our source databases, there are currently few actual records that make use of this feature of the data model.
However, we hope it will provide a foundation for more nuanced chronological modelling in the future.
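
The chain of associations can be sketched with a few toy tables in R. The tables, column names and values below are simplified and hypothetical; the example only shows how following the many-to-one links yields one flat row per chron, and how sample-level coordinates, where present, could take precedence over those inherited from the site.

```r
# Illustrative sketch only: simplified, hypothetical tables and column names.
sites <- data.frame(site_id = 1, site = "Example site", lat = 47.0, lng = 7.4)
contexts <- data.frame(context_id = c(1, 2), site_id = c(1, 1),
                       context = c("Layer 1", "Layer 2"))
samples <- data.frame(sample_id = c(1, 2), context_id = c(1, 2),
                      material = c("wood", "charcoal"),
                      sample_lat = c(NA, 47.01), sample_lng = c(NA, 7.41))
chrons <- data.frame(chron_id = 1:4, sample_id = c(1, 1, 1, 2),
                     type = c("c14", "c14", "dendro", "c14"))

# Follow the many-to-one associations from chron to site
flat <- merge(chrons, samples, by = "sample_id")
flat <- merge(flat, contexts, by = "context_id")
flat <- merge(flat, sites, by = "site_id")

# Sample-level coordinates take precedence; otherwise inherit from the site
flat$lat <- ifelse(is.na(flat$sample_lat), flat$lat, flat$sample_lat)
flat$lng <- ifelse(is.na(flat$sample_lng), flat$lng, flat$sample_lng)

flat[order(flat$chron_id),
     c("chron_id", "type", "material", "context", "site", "lat", "lng")]

# Number of chrons of each type attached to each sample
table(sample = chrons$sample_id, type = chrons$type)
```

Here sample 1 carries two replicate radiocarbon dates and one dendrochronological date, the situation that arises in wiggle-matching.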

Metadata is incorporated into XRONOS' data model at the level of the individual records (e.g. all records store their date of creation and last modification) and through two additional types of record: bibliographic *references* and *versions*.
Bibliographic references store information on the source of a record in BibTeX format and can be linked through many-to-many associations to sites or chrons.
Versions are a special type of record that are associated with all other records (including bibliographic references) and store the previous versions of those records as a series of changesets.
In this way all changes to data are recorded and can be reconstructed (or reversed) precisely.
This 'paper trail' also stores contextual metadata, e.g. who made the change and why.
It also means that records that are deleted can be reviewed or restored from their stored version history, which is never discarded.
Together these two systems provide a transparent record of where the data in XRONOS comes from and how it has been altered, which we view as essential in a scientific data infrastructure.
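
The principle of reconstructing earlier states from changesets can be sketched as follows. The layout of the `versions` table here (one changed field per version, with its old and new value) is a deliberate simplification for illustration and is an assumption, not the structure XRONOS actually stores.

```r
# Illustrative sketch only: a simplified changeset layout, not the XRONOS schema.
current <- list(id = 1, name = "Quercus", material = "wood")

# Each version records who changed which field, and the old and new values
versions <- data.frame(
  version   = c(1, 2),
  whodunnit = c("importer", "curator"),
  field     = c("material", "name"),
  old       = c(NA, "oak"),
  new       = c("wood", "Quercus")
)

# Undo changesets newest-first until the requested version is reached
record_at_version <- function(record, versions, version) {
  undo <- versions[versions$version > version, ]
  undo <- undo[order(undo$version, decreasing = TRUE), ]
  for (i in seq_len(nrow(undo))) {
    record[[undo$field[i]]] <- undo$old[i]
  }
  record
}

record_at_version(current, versions, 1)$name
```

Since only the differences between versions are stored, any earlier state can be recovered by replaying them in reverse, and the `whodunnit` column preserves who was responsible for each change.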

## Linked data

The XRONOS data model presents several opportunities to link to other resources as linked open data (LOD).
We use controlled vocabularies such as the GBIF Backbone Taxonomy [@GBIF2023, for taxonomic descriptions of samples] to both standardise fields and to link the two resources based on this shared concept.
There are concordances between XRONOS entities and several other specialised data infrastructures in archaeology, such as PeriodO [@RabinowitzEtAl2016, mapping to typological chrons] and gazetteers like Pleiades [@Pleiades] or Vici [@Voorburg2012, mapping to site names].
Moving beyond archaeology, Wikidata (<https://wikidata.org>) already includes many of the concepts represented in the XRONOS model (e.g. an 'archaeological site', <https://www.wikidata.org/wiki/Q839954>).
Linking to Wikidata is especially useful because it disseminates (and thereby preserves) the data compiled in XRONOS beyond the specialist/academic community.
It also allows us to enrich the database with contextual information that would otherwise be beyond our scope and resources, for example embedding multilingual descriptions of sites from Wikipedia articles on site records where the site has been linked to a Wikidata item.

Conversely, we encourage others to use XRONOS as a linked open data resource by providing stable URLs and a machine-readable interface for every record.
For example, a XRONOS URL could be used as the canonical representation of a single radiocarbon date (e.g. <https://xronos.ch/c14s/SMU-257> -> <https://xronos.ch/c14s/23410>).
We also plan to implement additional view formats to facilitate this, such as IntChron-format JSON, to allow it to be indexed through the IntChron service, or RDF, which is widely used by many linked open data resources in the digital humanities.

# Implementation

Following a short pilot project in 2019, the first phase of development of XRONOS was completed in 2021–2024.
The web interface (<https://xronos.ch>) has been publicly accessible since July 2021.
Though we envisage XRONOS as a continuously-developing open source project and 'living database', the following offers a snapshot of progress at the end of our first grant-funded implementation phase.
We do not aim to be comprehensive, but rather to describe some key elements of XRONOS' current implementation that illustrate how our concept has been realised in practice.

## Software architecture

<!-- Overview -->
The XRONOS data model is implemented as a relational database using the free and open source database management system PostgreSQL.
However, apart from backups and other routine maintenance procedures, all interaction with the database is via a web application, which thus forms the core of XRONOS' architecture.
The XRONOS web application is written in Ruby and uses CRUD (Create, Read, Update, Delete) and MVC (Model, View, Controller) patterns as implemented in the Ruby on Rails framework.

<!-- REST -->
The XRONOS web application exposes two distinct user interfaces: a graphical user interface accessed through a web browser; and an application programming interface (API). 
Both interfaces follow a REST (Representational State Transfer) pattern [@VerborghEtAl2015], where each resource (e.g. a single radiocarbon date, a single bibliographic reference, or a single user) is statelessly mapped to a single address.
Users can then interact with resources at these addresses using a predictable and uniform interface based on HTTP verbs.
For example, the radiocarbon date RTD-8904 is represented by the address <https://xronos.ch/c14s/156205>.
Users can view information on this resource by sending a GET request to that address, regardless of which interface they are using, and authorised users can modify it using POST, PATCH, or DELETE requests.
The bibliographic reference associated with this date [@RichterEtAl2017] is similarly represented at the address <https://xronos.ch/references/17778>, and can be accessed at that address using the same interface as the radiocarbon date.
These uniform REST interfaces are another example of a boring architectural choice that makes it easier for us to enrich the scientific contents of XRONOS by adding new types of modular resource that represent new scientific entities.

<!-- Actions and queries -->
This basic REST pattern is augmented by seven 'actions' (following the standard pattern in Rails applications) that express different ways of interacting with a resource: index, show, destroy, new, create, edit, and update.
The 'show' action represents interaction with a single resource, as described above. 
The 'index' action, which lists resources of a given type (e.g. <https://xronos.ch/c14s> for radiocarbon dates), is worth special mention because it is through this that the filtering logic at the core of XRONOS' two interfaces is implemented.
By passing a query as HTTP GET parameters to the index action of a resource, the list returned to the user is modified to only include records that match that query.
For example, <https://xronos.ch/sites?site[country_code]=CH> (the part of the URL after the `?` character encodes the SQL WHERE clause `country_code = 'CH'` as a GET parameter) lists sites in Switzerland.
More complex queries can be executed using nested parameters.
For example, <https://xronos.ch/c14s?sample[material][name]=charcoal> (encoding that the `c14` table should be joined to the `material` table via `sample`, followed by the WHERE clause `material.name = 'charcoal'`) lists radiocarbon dates obtained from charcoal samples.
Index actions can also respond with the result in a tabular data format (i.e. `.csv`).
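
The nested parameter syntax can be generated programmatically. The sketch below uses only base R to turn a nested list into the bracketed parameter names shown in the example URLs above; `flatten_params()` and `xronos_index_url()` are hypothetical helpers written for this illustration, and placing the `.csv` suffix directly after the resource name is an assumption based on common Rails conventions.

```r
# Illustrative sketch only: both functions are hypothetical helpers.

# Flatten a nested list into bracketed parameter names, so that
# list(sample = list(material = list(name = "charcoal")))
# becomes "sample[material][name]=charcoal"
flatten_params <- function(x, prefix = NULL) {
  out <- character(0)
  for (key in names(x)) {
    name <- if (is.null(prefix)) key else paste0(prefix, "[", key, "]")
    if (is.list(x[[key]])) {
      out <- c(out, flatten_params(x[[key]], name))
    } else {
      value <- utils::URLencode(as.character(x[[key]]), reserved = TRUE)
      out <- c(out, paste0(name, "=", value))
    }
  }
  out
}

# Build the address of an index action, optionally with a format suffix
xronos_index_url <- function(resource, params = list(), format = NULL,
                             base = "https://xronos.ch") {
  path <- paste0(base, "/", resource)
  if (!is.null(format)) path <- paste0(path, ".", format)
  query <- flatten_params(params)
  if (length(query) > 0) path <- paste0(path, "?", paste(query, collapse = "&"))
  path
}

xronos_index_url("sites", list(site = list(country_code = "CH")))
xronos_index_url("c14s", list(sample = list(material = list(name = "charcoal"))))

# Assumed form of a tabular request; not run here because it needs network access
# charcoal_dates <- read.csv(
#   xronos_index_url("c14s", list(sample = list(material = list(name = "charcoal"))),
#                    format = "csv")
# )
```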

## Data ingestion and curation {#sec-implementation-data}

The chronological data in XRONOS comes from a variety of sources, including published structured datasets in repositories and journal supplements, other online databases, literature review, and direct input from collaborators.
Our aim is not just to 'mirror' these sources as they are, but to integrate them into a single curated and continuously updated database.
For the purposes of ingestion, we classify data resources into three categories: static resources, such as supplementary data in published papers, which are imported once; versioned resources, updated on a periodic basis, which we import after each new version; and live resources, which are continuously updated and therefore continuously imported.
Records of each type are imported into XRONOS in as close to their original state as possible, i.e. without any corrections or standardisation applied.
This ensures that any subsequent changes (even immediate, automatic ones) are entered into the record's version history, so that the source of any deviations or potential errors can always be reconstructed.
The version history also records the direct source of the data for attribution purposes.
In addition, a bibliographic reference to the original resource is attached to ensure that the source is clearly attributed even if the record is merged with another one.

```{r tbl-issues}
#| tbl-cap: Automatically-recognised data quality issues currently implemented in XRONOS
tribble(
  ~entity,                    ~issue_code,            ~n,     ~issue_description,
  "Sites",                    "MISSING_COORDINATES",  4452,   "Missing geographic coordinates",
  "Sites",                    "INVALID_COORDINATES",  0,      "Geographic coordinates fall outside the earth's ellipsoid",
  "Sites",                    "MISSING_COUNTRY_CODE", 1221,   "Missing data on what country the site is in",
  "Samples",                  "MISSING_MATERIAL",     138177, "Missing data on the sample material",
  "Samples",                  "MISSING_TAXON",        138182, "Missing data on the sample taxon",
  "Samples",                  "MISSING_CRS",          0,      "Sample has coordinates but the coordinate reference system is not given",
  "Taxons",                   "UNKNOWN_TAXON",        9260,   "Sample taxon has not been matched to the GBIF Backbone Taxonomy",
  "Taxons",                   "LONG_TAXON",           509,    "Description of the sample taxon is implausibly long",
  "Radiocarbon dates",        "MISSING_C14_AGE",      447,    "Missing radiocarbon age",
  "Radiocarbon dates",        "VERY_OLD_C14",         76,     "Radiocarbon age older than the effective range of the method (50 ka)",
  "Radiocarbon dates",        "MISSING_C14_ERROR",    585,    "Missing measurement error",
  "Radiocarbon dates",        "MISSING_D13C",         238875, "Missing δ13C measurement",
  "Radiocarbon dates",        "MISSING_D13C_ERROR",   238875, "Missing δ13C measurement error",
  "Radiocarbon dates",        "MISSING_C14_METHOD",   242233, "Missing data on radiocarbon dating method (conventional, AMS, etc.)",
  "Radiocarbon dates",        "MISSING_C14_LAB_ID",   1233,   "Missing laboratory identifier",
  "Radiocarbon dates",        "INVALID_LAB_ID",       16053,  "Laboratory identifier does not match the standard format (e.g. 'Abc-1234')",
  "Radiocarbon dates",        "MISSING_C14_LAB",      350190, "Missing data on the radiocarbon laboratory",
  "Bibliographic references", "MIXED_REFERENCE",      21933,  "Bibliographic reference appears to combine multiple publications",
  "Bibliographic references", "MISSING_BIBTEX",       42109,  "Bibliographic reference without structured data in BibTeX format",
) |>
  group_by(entity) |>
  gt() |>
  cols_label(issue_code = "Issue", n = "N", issue_description = "Description") |>
  cols_width(
    issue_code ~ pct(30),
    n ~ pct(20),
    issue_description ~ pct(50)
  ) |>
  tab_options(
    table.font.size = 13
  )
```

Once data has been ingested, we apply a number of automated and semi-automated quality control processes to integrate it into the existing database.
Controlled vocabularies are used in a number of places in the data model (@fig-data-model), and we use thesauri to automatically standardise these fields as much as possible.
For example, as mentioned above, the taxonomic description of samples is controlled using GBIF's backbone taxonomy [@GBIF2023], and we also use a thesaurus service provided by GBIF to automatically change variant or obsolete taxonomic names to the canonical version.
If the system is not able to standardise a field using the available thesaurus, it is flagged for manual correction.
A wide variety of other potential data quality issues (e.g. missing data on what country a site is in) are also flagged for human review by this system [@tbl-issues], which can often be semi-automated (e.g. suggesting close matches in the thesaurus or the country indicated by the record's coordinates).
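
A rule-based flagging step of this kind might look roughly like the following. The issue codes are taken from the table of recognised issues, but the rules themselves, the column names, and especially the regular expression for a 'standard' laboratory identifier are assumptions for illustration rather than the checks XRONOS actually applies.

```r
# Illustrative sketch only: the rules and the lab ID pattern are assumptions
# loosely modelled on the issue descriptions, not XRONOS' own checks.
c14s <- data.frame(
  lab_id = c("Abc-1234", "abc1234", NA, "Xyz-42"),
  bp     = c(5000, 61000, 3200, NA),
  std    = c(40, 900, NA, 35)
)

flag_issues <- function(c14s) {
  checks <- list(
    MISSING_C14_AGE    = is.na(c14s$bp),
    VERY_OLD_C14       = !is.na(c14s$bp) & c14s$bp > 50000,
    MISSING_C14_ERROR  = is.na(c14s$std),
    MISSING_C14_LAB_ID = is.na(c14s$lab_id),
    INVALID_LAB_ID     = !is.na(c14s$lab_id) &
      !grepl("^[A-Za-z]+-[0-9]+$", c14s$lab_id)
  )
  # One comma-separated string of issue codes per record
  flags <- as.data.frame(checks)
  apply(flags, 1, function(row) paste(names(flags)[row], collapse = ", "))
}

c14s$issues <- flag_issues(c14s)
c14s
```

Keeping each check as a named logical vector means that new issues can be added without changing the code that reports them.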

A final critical component of XRONOS' data curation system is duplicate handling.
We import data from many overlapping resources (many of which incorporate each other either in whole or in part), so duplicate records are common [as recently discussed by @ReiterEtAl2024].
The end result of standardising and correcting a record is also often to create a duplicate: e.g. the same sample imported from one source as 'oak' but another as '*Quercus* sp.' will become a duplicate pair as '*Quercus*', and thus be recognised as a single sample.
Such exact duplicates can be merged automatically, with the oldest record becoming the authoritative version, but detecting fuzzier duplicated information (e.g. differences in the spelling of site names) has proved a more difficult problem.
As of writing there are therefore still many duplicate records in XRONOS that need to be manually resolved, but we hope to automate much more of this work in the future.
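
The 'oak' example can be traced through a small sketch: variant names are first mapped to a canonical form, after which records that have become identical are detected and the newer ones marked as superseded by the oldest. The thesaurus, the records, and the choice of fields used to define an exact duplicate are all hypothetical.

```r
# Illustrative sketch only: the thesaurus and records are hypothetical.
samples <- data.frame(
  id         = c(1, 2, 3),
  context_id = c(7, 7, 8),
  taxon      = c("oak", "Quercus sp.", "Quercus"),
  created_at = as.Date(c("2021-07-01", "2022-03-15", "2022-03-15"))
)

# Standardise variant names to a canonical form
thesaurus <- c("oak" = "Quercus", "Quercus sp." = "Quercus")
matched <- samples$taxon %in% names(thesaurus)
samples$taxon[matched] <- unname(thesaurus[samples$taxon[matched]])

# Records that now agree on context and taxon are treated as exact duplicates;
# the oldest is kept as the authoritative version
samples <- samples[order(samples$created_at, samples$id), ]
key <- paste(samples$context_id, samples$taxon, sep = "|")
samples$superseded_by <- ifelse(duplicated(key), samples$id[match(key, key)], NA)
samples
```

Exact matching on a key such as this cannot catch differently spelled site names, which would instead need some form of approximate string comparison.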

## User interfaces

::: {#fig-ui layout-nrow=2}
![Data browser](figures/xronos-ui-data-browser.png){#fig-ui-data-browser}

![Site record](figures/xronos-ui-show.png){#fig-ui-show}

![Record change log](figures/xronos-ui-papertrail.png){#fig-ui-papertrail}

![Curation interface](figures/xronos-ui-curation.png){#fig-ui-curation}

Views in the XRONOS web graphical user interface
:::

<!-- Public GUI -->
The graphical user interface (GUI) to XRONOS, accessed through a web browser (e.g. at <https://xronos.ch>), uses REST resources and actions as the building blocks for various interfaces through which users can browse, search, retrieve, and analyse chronometric data (@fig-ui).
Each action on each resource is represented by a page, though not all of these are publicly accessible.
The most important of these are the 'index' views, which list and summarise all instances of a resource and support filtering and sorting;
and 'show' views (e.g. @fig-ui-show for a site), which give more comprehensive information on an individual record along with visualisations, links to related records, external linked open data resources, and a log of changes made to the data since it was imported into XRONOS (@fig-ui-papertrail).
Pages representing REST resources directly are supplemented by a number of synthetic interfaces, for example the 'data browser' (<https://xronos.ch/data>, @fig-ui-data-browser), which facilitates more complex filtering, or the search interface (<https://xronos.ch/search>).
The GUI also includes several resources which are not part of XRONOS' scientific data model, for example documentation pages, user profiles and news articles; these are not exposed in the API.

<!-- Non-public GUI -->
Access to various 'backstage' interfaces for creating, editing, and deleting data, and monitoring data quality is managed using a user permissions system.
Currently only authorised users affiliated with the XRONOS project can access these, but in the future we intend to support open registration and expose editing interfaces to all authenticated users.
For this reason, there is no sharp division between 'public' and 'private' areas – viewing/querying data and editing/curating data share the same architecture and interface patterns.

<!-- API / R package -->
The XRONOS API uses the same addresses as the web-based GUI (with the exception of some of the synthetic interfaces mentioned above) but responds with machine-readable data in JSON format, rather than an HTML page.
This response can be triggered by appending `.json` to the address or by including an HTTP `content-format` header in the request.
Though users can make such requests manually and parse the data with one of several off-the-shelf tools, the primary intended use of this interface is to provide access for 1) programmatic clients to XRONOS and 2) other web services.
The XRONOS R package [@HinzRoe2024] is an example of a programmatic client; it uses the API to facilitate direct querying and retrieval of data from XRONOS in the R statistical programming language [@R2024].
Similar libraries could be developed in other programming environments used for scientific computing, such as Python or Haskell.
The API also provides the foundation for other web services to access XRONOS directly, to embed chronological information in other contexts or otherwise make use of its data resources.
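
As a rough illustration of what such a client involves, the following Python sketch derives the JSON address of a page by appending `.json` and parses the response using only the standard library. The function names are our own, and the `/sites/1` path used in the usage note is a hypothetical example rather than a documented endpoint.

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def json_address(address: str) -> str:
    """Turn the address of a GUI page into the address of its JSON
    representation by appending `.json` to the path, keeping any query."""
    parts = urlsplit(address)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit(parts._replace(path=path))


def fetch(address: str, timeout: float = 30.0):
    """Request the JSON representation of a page and parse it.
    Requires network access; the structure of the result is not assumed."""
    with urlopen(json_address(address), timeout=timeout) as response:
        return json.load(response)
```

For example, `json_address("https://xronos.ch/sites/1")` yields `https://xronos.ch/sites/1.json`, and `fetch()` would return the parsed response as ordinary Python lists and dictionaries.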

An overarching principle of this software architecture is that all interaction with XRONOS' data store, and as much of the data processing and 'business logic' of responding to REST requests as possible, is directed through the same server-side routines.
First and foremost, this allows us to provide multiple interfaces (i.e. the GUI and API, perhaps more in the future) without duplicating these elements of our codebase.
It also improves accessibility for users accessing XRONOS through devices with limited processing capability or through text-only browsers.
More broadly, avoiding reliance on client-side processing, e.g. with JavaScript or WebAssembly (WASM), which would be the other option, allows us to keep our client interfaces simple (in most cases plain, semantic HTML pages and self-contained stylesheets) and therefore, we hope, sustainable in the face of constantly-evolving client-side technologies and standards.
It does have the weakness that, in practical terms, XRONOS is difficult to run and relies on the continued existence of a maintained external server.
We have however tried to mitigate this by providing clearly-documented source code and regular data dumps so that, if our instance of XRONOS disappears, or if one simply does not want to use it, it is possible for others to host a XRONOS server of their own.
As the sustainability of scientific software and data infrastructures is a pressing problem, in the future it may be desirable to support further decentralisation through, for example, a federated server-server model.
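
The principle of directing every interface through the same server-side routines can be sketched schematically. The example below is written in Python for illustration only and does not reproduce the actual Ruby on Rails code: the in-memory `SITES` store, the record it contains, and all function names are invented.

```python
import json
from html import escape

# Stand-in for the data store; the record is invented for this example.
SITES = {1: {"id": 1, "name": "Exampleton", "country": "Nowhere"}}


def show_site(site_id: int) -> dict:
    """Shared routine: every interface obtains a record through this call."""
    return SITES[site_id]


def render_json(record: dict) -> str:
    """Machine-readable representation, as served to API clients."""
    return json.dumps(record, sort_keys=True)


def render_html(record: dict) -> str:
    """Plain, semantic HTML representation, as served to browsers."""
    rows = "".join(
        f"<dt>{escape(str(key))}</dt><dd>{escape(str(value))}</dd>"
        for key, value in record.items()
    )
    return f"<dl>{rows}</dl>"


RENDERERS = {"json": render_json, "html": render_html}


def respond(site_id: int, fmt: str = "html") -> str:
    """Answer a 'show' request: only the last step depends on the format."""
    record = show_site(site_id)
    return RENDERERS[fmt](record)
```

Adding a further interface under this design means adding another entry to `RENDERERS`; the data access and processing in `show_site()` are untouched, and no logic has to run on the client.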

# Evaluation

In this paper we have outlined the conceptual and technical infrastructure developed during the initial, SNSF-funded phases of development on XRONOS in 2019 and 2021–2024.
These include a generalised data model for radiocarbon and typological dates, extendible to other chronometric information, and associated contextual information;
an R- and Ruby-based pipeline for continuous ingestion of data from a variety of sources; 
continuous, semi-automated data cleaning protocols; 
a Ruby-on-Rails application providing a web-based frontend to the data and a REST API for programmatic access; 
and an R package for interfacing with the API. 
These systems are in production and publicly accessible at <https://xronos.ch>.

At the current stage of implementation, we argue XRONOS provides a framework for access to chronometric data that is more open, more reliable, and more comprehensive than the previously available global radiocarbon compilations.
XRONOS blends aspects of the three existing approaches (c14bazAAR, IntChron, and p3k14c; see @sec-global-compilations) to achieve the same aim of providing access to global radiocarbon data through a common interface.
Like c14bazAAR and IntChron, it is a 'metadatabase' that draws from existing data resources and maintains an explicit link to them.
But like p3k14c, it integrates these into a single database and applies data curation processes to harmonise them and improve the quality of the information.
It has a wider scope than c14bazAAR or IntChron, as it mirrors rather than directly retrieves the source data (allowing us to use resources that aren't openly published), and does not rely on the authors of these sources to implement a common specification.
It also goes beyond the functionality of p3k14c by providing systems for the continuous ingestion and curation of new data as it is published.
With that said, we see these approaches as complementary rather than competing.
A c14bazAAR parser for XRONOS is in development (<https://github.com/ropensci/c14bazAAR/pull/150>), and we also aim to provide an IntChron interface to XRONOS' data in the near future.
New data and corrections from p3k14c are incorporated into XRONOS as they are released.

From 2025, development of XRONOS will continue within the framework of 'ESTER', an ERC-funded research project on estimation of prehistoric population development from large, multiproxy datasets.
Our immediate development goals include the incorporation of dendrochronology into the database, further refinement of our data curation pipelines, and the public release of the editing interfaces.

Looking beyond the near term, we have endeavoured to create a sustainable infrastructure that can be maintained by a wider scholarly commons – though we must acknowledge that this is a difficult problem, and one that is as much organisational as technical.
The source code for all the software components of the system is available online (at <https://github.com/xronos-ch>) under open licenses.
The databases of the instance at <https://xronos.ch> are also archived with Zenodo (*link omitted for blind review*<!-- TODO: link -->).
With these two resources we reduce XRONOS' 'bus factor'; if we are not able to continue operating XRONOS, somebody else can fully recreate it.
Equally importantly, we enable the 'right to fork' should others wish to take the software and/or database in another direction.
But aside from these extreme scenarios, the long-term sustainability of XRONOS is contingent on the existence of a community of scholars that use and contribute back to it.

# Acknowledgements

The XRONOS project was carried out under the direction of Albert Hafner.
Individual contributors to the XRONOS database to date include Chiara Huwiler, Rivana Moser, Tomasz Chmielewski, and Stephanie Döppler. 
A complete and up-to-date list of acknowledgements for the project can be found at <https://xronos.ch/about/acknowledgements>.
<!-- TODO: peer reviewers -->

# Funding Statement

This work was funded by the Swiss National Science Foundation ([SNSF Project #198152](https://data.snf.ch/grants/grant/198153)) and the University of Bern (UniBE Initiator Grant 2019, Caroline Heitz, Project 'Time and Temporality').

# Data Accessibility Statement

The data and R code used to produce our analysis of the state of the art in radiocarbon compilation, including all the figures presented here, can be accessed via Zenodo at <https://doi.org/10.5281/zenodo.14282598>.

The database and software described in this paper are open source and can be accessed at <https://github.com/xronos-ch>.

# References