
# dataworks-etl

Rails application for performing ETL for Dataworks.

## Data model

### Dataset source

The metadata for a dataset as retrieved from a provider. A Dataset Source may be associated with many Dataset Source Sets.

### Dataset source set

Set of Dataset Source records that were extracted from a provider by a single job.

A Dataset Source Set is marked as complete if the job was successful (the metadata for all datasets was retrieved).
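The relationship above can be sketched in plain Ruby (the class names and fields here are hypothetical; the actual app presumably uses database-backed ActiveRecord models):

```ruby
# Hypothetical plain-Ruby sketch of the data model described above.
# A DatasetSourceSet groups the DatasetSource records extracted by one job.
DatasetSource = Struct.new(:provider, :dataset_id, :metadata)

class DatasetSourceSet
  attr_reader :provider, :dataset_sources
  attr_accessor :complete

  def initialize(provider)
    @provider = provider
    @dataset_sources = []
    @complete = false # set true only when every dataset's metadata was retrieved
  end

  def add(dataset_source)
    dataset_sources << dataset_source
  end
end

set = DatasetSourceSet.new('redivis')
set.add(DatasetSource.new('redivis', 'example-id', { 'title' => 'Example dataset' }))
set.complete = true
```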

## Configuration

### Extra (hardcoded) datasets for providers

Extra datasets for a provider can be added to `config/datasets/<provider>.yml`.
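A hypothetical file might look like the following (the shape shown is illustrative, not the actual format; check an existing provider file in `config/datasets/`):

```yaml
# config/datasets/redivis.yml (illustrative shape only)
# Dataset IDs to extract in addition to those discovered from the provider.
- example-dataset-id-1
- example-dataset-id-2
```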

### Local dataset metadata

Local metadata can be added to `config/local_datasets/<id>.yml`. The metadata must conform to the Dataworks schema.
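An illustrative file, assuming a Dataworks-schema field such as `title` (consult the schema for the required fields and their exact names):

```yaml
# config/local_datasets/example-local-dataset.yml (illustrative; the filename
# <id>.yml and all fields must conform to the Dataworks schema)
title: Example locally described dataset
```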

## Schedule

The job schedule is set in `config/recurring.yml`.
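Solid Queue recurring tasks are keyed by environment and task name; a sketch with hypothetical job names (the real entries live in `config/recurring.yml`):

```yaml
# config/recurring.yml (hypothetical tasks, for illustration only)
production:
  extract_redivis:
    class: ExtractRedivisJob   # hypothetical job class
    schedule: every day at 3am # Fugit natural-language schedule
```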

### Honeybadger checkins

Jobs in deployed environments use Honeybadger checkins to verify that they are running. Checkin IDs are environment-specific and should therefore be set in shared_configs.
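For example, per-environment checkin IDs could live in a shared_configs settings file along these lines (key names and the ID shown are entirely hypothetical):

```yaml
# shared_configs settings (hypothetical keys; real checkin IDs come from Honeybadger)
honeybadger_checkins:
  extract_redivis: hypothetical-checkin-id
```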

## Development

### Requirements

#### Credentials

Create credentials and add them to `config/settings/development.local.yml`:

```yaml
redivis:
  api_token: ~

zenodo:
  api_token: ~
```

### Running locally

Spin up the containers and the app, and then set up the application and `solid-*` databases:

```sh
docker compose up -d
bin/rails db:prepare
bin/dev
```

### Mission Control (jobs monitoring)

Solid Queue jobs can be monitored with Mission Control at `/jobs`.

### Solr

In development, the dataworks core is available at http://localhost:8983/solr/#/dataworks/core-overview.

### Testing transforms

```sh
bin/rake "development:transform_dryrun[<provider, e.g., redivis>]"
```
