This containerized application runs workflows for ingesting dataset metadata into a Dataverse instance. The flows and tasks used in these workflows are built with Prefect. If you run the container locally, they can be monitored and run from the Prefect Orion UI at http://localhost:4200.
This project was originally funded under ODISSEI (https://odissei-data.nl/), as research funded by NWO (https://www.nwo.nl/). The exact grant number is unknown at present. The ingestion was delivered as part of the ODISSEI Portal section of the ODISSEI project, as are all related services. By name, Fjodor van Rijsselberg, Thomas van Erven, and Vyacheslav Tykhonov were involved in its conception, but thanks belong to the ODISSEI team as a whole.
Accordingly, the LICENSE file has been updated to honor their commitment and work.
Most flows start with an entry workflow that can be found in the entry_workflows directory. Here, the metadata is first harvested using OAI-PMH and uploaded to S3 storage. Afterwards, the metadata is fetched from that S3 storage and a provenance object is created for the ingested metadata. A settings dictionary specific to the data provider is also constructed here with Dynaconf.
Next, for every dataset's metadata, the entry workflow runs a sub-flow that handles the actual ingestion. These sub-flows can be found in the dataset_workflows directory. A dataset workflow uses simple tasks that each make an API call to a service. These services often transform, improve or otherwise alter the metadata.
In short, most ingestion workflows take the following steps:
- Harvest metadata and upload it to S3 storage.
- Fetch dataset metadata from S3 storage.
- Create a version object of all services that will be used for ingestion.
- For every dataset's metadata, run a dataset workflow.
- Use tasks that make API calls to different services to transform the metadata.
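A minimal sketch of this pattern with Prefect is shown below. All names are hypothetical; the real implementations live in the entry_workflows and dataset_workflows directories.

```python
# Minimal sketch of the entry-flow / dataset-sub-flow pattern. All names are
# hypothetical; the real implementations live in entry_workflows/ and
# dataset_workflows/.
from prefect import flow, task


@task
def harvest_and_upload(settings: dict) -> None:
    """Harvest metadata via OAI-PMH and upload it to the provider's S3 bucket."""


@task
def list_metadata_files(settings: dict) -> list[str]:
    """Return the keys of all metadata files in the provider's S3 bucket."""
    return []  # placeholder: the real task queries the S3/MinIO storage


@flow
def dataset_workflow(metadata_key: str, settings: dict, version: dict) -> None:
    """Sub-flow that ingests a single dataset by calling the API services."""


@flow
def entry_workflow(settings: dict, do_harvest: bool = True) -> None:
    if do_harvest:
        harvest_and_upload(settings)
    version = {}  # version object describing the services used for ingestion
    for key in list_metadata_files(settings):
        dataset_workflow(key, settings, version)  # one sub-flow per dataset
```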
This section shows the different API services used in the workflows. These services can be combined in different ways in a workflow, depending on the metadata provided by the data provider.
| Service Name | Description | Deployment URL | GitHub Repo |
|---|---|---|---|
| Dataverse Importer | This service imports metadata into Dataverse. | https://dataverse-importer.labs.dansdemo.nl/docs | GitHub |
| Publication Date Updater | Corrects the publication date of the imported metadata. | https://dataverse-date-updater.labs.dansdemo.nl/docs | GitHub |
| Metadata Fetcher | Fetches the metadata of a dataset from a Dataverse. | https://dataverse-fetcher.labs.dansdemo.nl/docs | GitHub |
| Dataverse Mapper | Maps any JSON to Dataverse's JSON format. | https://dataverse-mapper.labs.dansdemo.nl/docs | GitHub |
| Dans Transformer Service | Transforms from XML to JSON (or from/to other formats). | https://transformer.labs.dansdemo.nl/docs | GitHub |
| Metadata Refiner | Refines JSON metadata. | https://metadata-refiner.labs.dansdemo.nl/docs | GitHub |
| Metadata Enhancer | Enriches JSON metadata. | https://metadata-enhancer.labs.dansdemo.nl/docs | GitHub |
| Email Sanitizer | Removes all emails from the metadata. | https://emailsanitizer.labs.dansdemo.nl/docs | GitHub |
| Version Tracker | Stores JSON containing version information. | https://version-tracker.labs.dansdemo.nl/docs | GitHub |
| DOI Minter | Mints a DOI for a dataset. Use with CAUTION: with production settings this mints a permanent DOI. | https://dataciteminter.labs.dansdemo.nl/docs | GitHub |
| Semantic Enrichment | Enriches the SOLR index with ELSST translations of the keywords from the ELSST Skosmos. | | GitHub |
| OAI-PMH Harvester | Harvests the metadata from data providers using OAI-PMH. | | GitHub |
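Since the services are plain HTTP APIs, a workflow task typically boils down to a single POST request. The sketch below is only an illustration: the endpoint and payload shape are assumptions, and the authoritative request/response schemas are documented at each service's /docs URL.

```python
# Hypothetical example of a Prefect task that posts metadata to one of the
# services above. The endpoint and payload shape are assumptions; the real
# request/response schemas are documented at each service's /docs URL.
import requests
from prefect import task


@task
def call_service(metadata: dict, service_url: str) -> dict:
    response = requests.post(service_url, json=metadata, timeout=60)
    response.raise_for_status()
    return response.json()
```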
Here is a list of make commands that can be used for easy setup:
- `make build`: Build and start the project.
- `make start`: Start the project in non-detached mode.
- `make startbg`: Start the project in detached mode (background).
- `make down`: Stop the running project.
- `make dev-build`: Build and start the development setup.
- `make dev-down`: Stop the ingest services in development mode.
- `make deploy`: Deploy all ingestion workflows to the Prefect server.
- `make ingest`: Run a specific ingest flow in Prefect, with optional arguments for the target. You can also specify whether the metadata should be harvested; if not specified, the metadata will be harvested.
If you want to develop new flows for the Ingestion Workflow Orchestrator, you might want to run the services described above locally. This is possible by following these steps:
- `cp dot_env_example .env`
- `cp dot_env_development_example .env.development`
- `cp scripts/configuration/secrets_example.toml scripts/configuration/.secrets.toml`
- Add the necessary API tokens and credentials to the `.secrets.toml`
- Set `ENV_FOR_DYNACONF` in the `.env` to `development`
- `make dev-build`
This should set up the Prefect container and the services used during the ingestion workflows.
- `cp dot_env_example .env`
- `cp scripts/configuration/secrets_example.toml scripts/configuration/.secrets.toml`
- Add the necessary API tokens and credentials to the `.secrets.toml`
- Set `ENV_FOR_DYNACONF` in the `.env` to `staging`
- `make build`
- `make deploy`
- Go to localhost:4200/deployments
- Click the ellipsis icon of a workflow and select either **custom run** or **quick run**

If you selected **custom run**, you can optionally fill in a target URL and key argument to specify a different target Dataverse. If you select **quick run**, it will use the target from the settings in odissei_settings.toml and the key in .secrets.toml.
For the Dataverse ingestion pipeline, there is also a required argument, `settings_dict_name`. The options for ingesting with Dataverse as both the source and target use the following input:
- `'DANS'`: DANS datastation SSH, subset of only the social science datasets
- `'HSN'`: IISG's datasets
- Subverses of dataverse.nl: `'TWENTE'`, `'DELFT'`, `'AVANS'`, `'FONTYS'`, `'GRONINGEN'`, `'HANZE'`, `'HR'`, `'LEIDEN'`, `'MAASTRICHT'`, `'TILBURG'`, `'TRIMBOS'`, `'UMCU'`, `'UTRECHT'`, `'VU'`
`make ingest data_provider=CBS TARGET_URL=https://portal.example.odissei.nl TARGET_KEY=abcde123-11aa-22bb-3c4d-098765432abc DO_HARVEST=False`

- A prompt will appear confirming the target.
- Type yes to continue, or anything else to abort.
The `make ingest` command allows you to specify the URL and API key of a specific target Dataverse. If you do not provide them, it will use the target from the settings in odissei_settings.toml and the key in .secrets.toml. It also allows you to specify whether the pipeline should first harvest the metadata. This is useful for quick development iterations after the metadata has already been harvested, or to rerun the bucket with metadata files from failed dataset workflows.
This is the list of data providers that can be used in the `make ingest` command: `'TWENTE'`, `'DELFT'`, `'AVANS'`, `'FONTYS'`, `'GRONINGEN'`, `'HANZE'`, `'HR'`, `'LEIDEN'`, `'MAASTRICHT'`, `'TILBURG'`, `'TRIMBOS'`, `'UMCU'`, `'UTRECHT'`, `'VU'`, `'DANS'`, `'CBS'`, `'LISS'`, `'HSN'`, `'CID'`.
To debug the services listed in the services table, use the development project setup. Then remove the service that you want to debug; this can be done in your Docker interface or by stopping it with `docker-compose stop <container_name>`, replacing `<container_name>` with the name of the service you want to stop. Next, go to the GitHub repository specified in the table for that service, clone it, and follow the instructions in its README. Add the service to the ingest network with `make network-add network_name=ingest container_name=<container_name>`. Use a deployed flow or `make ingest` to test any changes made to the service.
When running a flow, it will produce logging information that can be viewed in the Prefect UI. If the flow is run from the command line, the logs are also shown in the terminal. If you want to add logging, first call `logger = get_run_logger()` in the context of a running flow or task and then use `logger.info()` to log any information.
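A minimal example (the flow itself is only an illustration):

```python
from prefect import flow, get_run_logger


@flow
def example_flow():
    # get_run_logger() only works in the context of a running flow or task
    logger = get_run_logger()
    logger.info("Starting ingestion")  # visible in the Prefect UI and the terminal
```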
If an ingestion pipeline workflow is run for a specific data provider, it will create a sub-flow for every dataset metadata file retrieved from S3 storage. One sub-flow ingests a single metadata file.
In case a sub-flow fails, a bucket is created named after the data provider and the id of the parent workflow (the ingestion pipeline workflow). The metadata file that the sub-flow was ingesting is stored in that bucket. Any sub-flows that fail after that also store their metadata file in this bucket.
This is done for two reasons:
- Isolation of the failed metadata files for easier investigation.
- Possibility to rerun only the metadata files of the failed dataset sub flows.
The second point requires the user to change the data provider's bucket name in the settings. These settings can be found in scripts/configuration/odissei_settings.toml.
Follow these steps to rerun the ingest for the failed metadata:
- Find the bucket created for the failed metadata in the logs.
- Change the `<data provider>_BUCKET_NAME` to that bucket name, where `<data provider>` is the data provider you ran the ingestion for.
- Run `make ingest DO_HARVEST=False`, so that you don't harvest the metadata from the data provider into the specified bucket.
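For example (the bucket name here is purely illustrative): if the logs show that the failed CBS metadata was stored in a bucket named cbs-metadata-1234, you would set `CBS_BUCKET_NAME="cbs-metadata-1234"` in odissei_settings.toml and then run `make ingest data_provider=CBS DO_HARVEST=False`.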
The metadata used by the workflows is stored in S3 buckets. The key, id and server URL of the S3 storage should be set in the .secrets.toml as `AWS_SECRET_ACCESS_KEY`, `AWS_ACCESS_KEY_ID` and `MINIO_SERVER_URL` respectively. For a specific data provider, a `BUCKET_NAME` should be added for that provider. The bucket in S3 storage that contains the metadata for the provider should use the same name as the `BUCKET_NAME` for that provider.
Example in odissei_settings.toml:
HSN_BUCKET_NAME="hsn-metadata"
HSN={"ALIAS"="HSN_NL", "BUCKET_NAME"="@format {this.HSN_BUCKET_NAME}", "SOURCE_DATAVERSE_URL"="@format {this.IISG_URL}", "DESTINATION_DATAVERSE_URL"="@format {this.ODISSEI_URL}", "DESTINATION_DATAVERSE_API_KEY"="@format {this.ODISSEI_API_KEY}", "REFINER_ENDPOINT"="@format {this.HSN_REFINER_ENDPOINT}"}
In this example, HSN contains all settings specific to ingesting the HSN metadata. The BUCKET_NAME set in the HSN dictionary can be used generically in the code wherever a bucket name is needed. It is set to HSN_BUCKET_NAME, which declares the specific bucket name for HSN. Further explanation of the settings can be found in the Settings files section.
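In the flow code, such a provider dictionary can then be read from the Dynaconf settings object. A minimal sketch, assuming config.py exposes the settings object under the name settings (the import path below is illustrative):

```python
# Minimal sketch, assuming config.py exposes a Dynaconf `settings` object
# (the import path below is illustrative).
from scripts.configuration.config import settings

hsn = settings.HSN                                  # provider-specific dictionary
bucket_name = hsn["BUCKET_NAME"]                    # resolves to "hsn-metadata"
destination_url = hsn["DESTINATION_DATAVERSE_URL"]  # resolved via @format from ODISSEI_URL
```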
A local Dataverse instance makes it easy to deposit via the API.
https://github.com/IQSS/dataverse-docker
Only a Super User can deposit via the API. Set the `superuser` boolean to true in the `authenticateduser` table. You are now a Super User.
More information on how to do this can be found in the documentation of the ODISSEI dataverse stack here.
If you use a containerized Dataverse instance, it should live in the same network as the dev services.
The Ingestion Workflow Orchestrator uses Dynaconf to manage its settings. This chapter gives a very short introduction to Dynaconf. For more information, read the docs.
Use the `.env` file to set the environment to development, staging or production. Be careful: setting the environment to production means that all flows that use the DOI Minter will mint persistent DOIs.

ENV_FOR_DYNACONF=development
The settings are split into multiple toml files. This makes it easier to manage a large number of settings. You can specify which files are loaded in `config.py`. The files are loaded in order and overwrite each other if they share settings with the same name.
- settings.toml, contains the base settings
- .secrets.toml, contains all secrets
- <datastation>_settings.toml, datastation-specific settings
Each file is split into multiple sections: default, development, production. Default settings are always loaded and usually contain one or more dynamic parts using `@format`. Development and production contain the values that depend on the current environment.
The example below shows how dynamic settings work. The metadata location changes based on the current environment: a local directory in development and an S3 bucket path in production.
[default]
"BUCKET_NAME" = "@format {this.BUCKET_NAME}"
[development]
"BUCKET_NAME" = "path/to/local/dir"
[production]
"BUCKET_NAME" = "path/to/s3/bucket"
The CBS Metadata Ingestion Workflow is responsible for ingesting metadata from the CBS (Statistics Netherlands) data provider into Dataverse. It processes the XML metadata, transforms it into JSON, maps it to the format required by Dataverse, refines and enriches the metadata, mints a DOI, and finally imports the dataset into Dataverse. The workflow is implemented using Prefect, a workflow management library in Python.
- Email Sanitizer: The XML metadata is passed through the Email Sanitizer service to remove any sensitive email information.
- XML to JSON Transformation: The sanitized XML metadata is transformed into JSON format using the Dans Transformer Service.
- Metadata Mapping: The JSON metadata is mapped to the required format for Dataverse using the Dataverse Mapper service.
- Metadata Refinement: The mapped metadata is refined using the Metadata Refiner service. In CBS's case this means the Alternative Titles and Keywords are improved.
- Workflow Versioning: The workflow versioning URL is added to the metadata using the Version Tracker service. This step ensures that the metadata includes information about the services that processed it.
- DOI Minting: The metadata is passed to the DOI Minter service, which mints a DOI (Digital Object Identifier) for the dataset.
- Metadata Enrichment: The metadata is enriched using two different endpoints of the Metadata Enhancer service. Each endpoint adds specific enrichment to the metadata.
- Dataverse Import: The enriched metadata, along with the DOI, is imported into Dataverse using the Dataverse Importer service.
- Publication Date Update: The publication date is extracted from the metadata using a JMESPath query. If a valid publication date is found, it is passed to the Publication Date Updater service, which updates the publication date of the dataset in Dataverse.
- Semantic Enrichment: The workflow performs semantic enrichment using the Semantic Enrichment service. The enrichment process adds additional information to the SOLR index using ELSST translations of the keywords.
- Workflow Completion: If all the previous steps complete successfully, the workflow is considered completed, indicating that the dataset, including its DOI, has been ingested successfully.
Please note that each service mentioned in the workflow corresponds to a service listed in the table provided earlier.
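A rough outline of how these steps chain together in a Prefect flow is sketched below; all task names and signatures are hypothetical, and the actual implementation lives in the dataset_workflows directory.

```python
# Hypothetical outline of the CBS dataset workflow; each stub stands in for a
# task that calls the corresponding service from the services table.
from prefect import flow, task


@task
def sanitize_emails(xml: str) -> str: ...                            # Email Sanitizer
@task
def xml_to_json(xml: str) -> dict: ...                               # Dans Transformer Service
@task
def map_to_dataverse_json(metadata: dict) -> dict: ...               # Dataverse Mapper
@task
def refine_metadata(metadata: dict) -> dict: ...                     # Metadata Refiner
@task
def add_workflow_versioning(metadata: dict, version: dict) -> dict: ...  # Version Tracker
@task
def mint_doi(metadata: dict) -> str: ...                             # DOI Minter (caution in production!)
@task
def enrich_metadata(metadata: dict) -> dict: ...                     # Metadata Enhancer (two endpoints)
@task
def import_into_dataverse(metadata: dict, doi: str) -> None: ...     # Dataverse Importer
@task
def update_publication_date(metadata: dict, doi: str) -> None: ...   # Publication Date Updater
@task
def semantic_enrichment(metadata: dict) -> None: ...                 # Semantic Enrichment


@flow
def cbs_dataset_workflow(xml_metadata: str, version: dict) -> None:
    sanitized = sanitize_emails(xml_metadata)
    metadata = map_to_dataverse_json(xml_to_json(sanitized))
    metadata = add_workflow_versioning(refine_metadata(metadata), version)
    doi = mint_doi(metadata)
    metadata = enrich_metadata(metadata)
    import_into_dataverse(metadata, doi)
    update_publication_date(metadata, doi)
    semantic_enrichment(metadata)
```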