ETL application for Dataworks
A Dataset Source is the metadata for a dataset as retrieved from a provider. A Dataset Source may be associated with many Dataset Source Sets.
A Dataset Source Set is the set of Dataset Source records that were extracted from a provider by a single job.
A Dataset Source Set is marked as complete if the job was successful (the metadata for all datasets was retrieved).
Extra datasets for a provider can be added to config/datasets/<provider>.yml.
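The exact shape of these files isn't shown here; a minimal sketch, assuming each file is simply a list of extra dataset identifiers for that provider (the filename and ids below are hypothetical):

```yaml
# config/datasets/redivis.yml (hypothetical example)
# Assumption: a flat list of extra dataset identifiers to extract for this provider.
- example_org.example_dataset_1
- example_org.example_dataset_2
```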
Local metadata can be added to config/local_datasets/<id>.yml. The metadata must conform to the Dataworks schema.
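The Dataworks schema itself isn't reproduced here, so the following is only a hedged sketch with hypothetical field names; check the schema for the actual required fields:

```yaml
# config/local_datasets/my-local-dataset.yml (hypothetical example)
# Field names are illustrative only; the file must conform to the Dataworks schema.
title: Example locally described dataset
url: https://example.org/datasets/my-local-dataset
creators:
  - name: Doe, Jane
```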
The job schedule is set in config/recurring.yml.
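This file follows the standard Solid Queue recurring-task format (task keys with class, args, and schedule, optionally nested under an environment); the task key and job class below are hypothetical:

```yaml
# config/recurring.yml (hypothetical task; format per Solid Queue)
production:
  extract_redivis:
    class: ExtractRedivisJob # hypothetical job class
    schedule: every day at 3am
```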
Jobs in deployed environments use Honeybadger checkins to verify that they are running. These are environment specific and therefore should be set in shared_configs.
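For example, a per-environment settings file in shared_configs might carry the check-in identifiers issued by Honeybadger; the key names here are hypothetical:

```yaml
# settings/production.yml in shared_configs (hypothetical key names)
honeybadger_checkins:
  extract_redivis: <check-in id from Honeybadger>
```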
- docker & docker compose
- tmux (installation instructions)
- overmind (installed automatically via bundler)
Create credentials and add to config/settings/development.local.yml:

```yaml
redivis:
  api_token: ~
zenodo:
  api_token: ~
```
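Assuming the standard config gem conventions implied by the config/settings layout, these values are then available in Ruby as Settings.redivis.api_token and Settings.zenodo.api_token.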
Spin up the containers, set up the application and solid-* databases, and then start the app:

```sh
docker compose up -d
bin/rails db:prepare
bin/dev
```
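bin/dev presumably runs the development processes via overmind, which depends on tmux; hence both appear in the requirements above.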
Solid Queue jobs can be monitored with Mission Control at /jobs.
In development, the dataworks Solr core is available at http://localhost:8983/solr/#/dataworks/core-overview.
To do a dry run of the metadata transform for a provider:

```sh
bin/rake "development:transform_dryrun[<provider, e.g., redivis>]"
```