* Apply default ruff linting
* Remove outdated design docs
* Add PyPI workflow; update license year
* Change version format to comply with PyPI's requirement for PEP 440 compliance
* README edits in prep for release
* Move development instructions for setup and GNVerifier OpenAPI specs to wiki
* Add badge info and project links
* Add keywords
* Add link-outs to data sources
* Add citation
---------
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Co-authored-by: Hilmar Lapp <hlapp@drycafe.net>
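
Note on the PEP 440 item above: the commit message does not show the old or new version strings, so the values below are hypothetical. As a minimal sketch, the canonical-form check published in the appendix of PEP 440 can be used to sanity-check a version string before tagging a release. It is stricter than what PyPI enforces, since PyPI also accepts valid versions that are not in canonical form.

```python
import re

# Canonical-form pattern from PEP 440 (Appendix B). It only recognizes
# versions that are already normalized; valid but non-normalized spellings
# such as "1.0.0-beta" are rejected by this check.
_CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version: str) -> bool:
    """Return True if `version` is a PEP 440 version in canonical form."""
    return _CANONICAL.match(version) is not None


if __name__ == "__main__":
    # Hypothetical version strings, not taken from this repository.
    for candidate in ["0.1.0b0", "1.0.0", "1.0.0-beta", "v1.0"]:
        print(candidate, is_canonical(candidate))
```

For full validation and normalization, the third-party `packaging` library is the usual choice; the regular expression here only avoids an extra dependency.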
README.md
+13-45 Lines changed: 13 additions & 45 deletions
@@ -1,19 +1,18 @@
1
1
# TaxonoPy
2
2
3
-
`TaxonoPy` (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier).
3
+
`TaxonoPy` (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). See below for the structure of inputs and outputs.
4
4
5
5
## Purpose
6
-
The motivation for this package is to create an internally consistent and standardized classification set for organisms in the TreeOfLife-200M (TOL) dataset.
6
+
The motivation for this package is to create an internally consistent and standardized classification set for organisms in a large biodiversity dataset composed from different data providers that may use very similar and overlapping but not identical taxonomic hierarchies.
7
7
8
-
This dataset contains over 200 million samples of organisms from four core data providers:
8
+
Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers:
9
9
10
-
- The GLobal Biodiversity Information Facility (GBIF)
11
-
- BIOSCAN-5M
12
-
- FathomNet
13
-
- The Encyclopedia of Life (EOL)
10
+
- [The GLobal Biodiversity Information Facility (GBIF)](https://www.gbif.org/)
- [The Encyclopedia of Life (EOL)](https://eol.org/)
14
14
15
-
16
-
This package is a tool for creating an internally consistent classification set for a list of organisms whose entries have inconsistent naming.
15
+
The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa.
17
16
18
17
### Input
19
18
@@ -42,7 +41,7 @@ Taxonomic authorities exist to standardize classification, but ...
42
41
- A given organism may be missing from some.
43
42
44
43
### Solution
45
-
`TaxonoPy` uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the GBIF backbone taxonomy, since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the Catalogue of Life and Open Tree of Life (OTOL) taxonomy are used.
44
+
`TaxonoPy` uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11), since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used.
46
45
47
46
## Installation
48
47
@@ -55,20 +54,6 @@ To install the latest version of `TaxonoPy`, run:
55
54
pip install taxonopy
56
55
```
57
56
58
-
### Development Installation with `pip`
59
-
60
-
Clone the repository and install the package in development mode with an activated virtual environment:
61
-
```console
62
-
git clone git@github.com:Imageomics/TaxonoPy.git
63
-
cd TaxonoPy
64
-
```
65
-
Set up and activate a virtual environment.
66
-
67
-
Install the package in development mode:
68
-
```console
69
-
pip install -e .[dev]
70
-
```
71
-
72
57
### Usage
73
58
You may view the help for the command line interface by running:
74
59
```console
@@ -96,7 +81,7 @@ options:
96
81
--show-config Show current configuration and exit (default: False)
97
82
--version Show version number and exit
98
83
```
99
-
#### Commands: `resolve`
84
+
#### Command: `resolve`
100
85
The `resolve` command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions.
It is recommended to keep GNVerifier settings at their defaults.
130
115
131
-
#### Commands: `trace`
116
+
#### Command: `trace`
132
117
The `trace` command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy.
133
118
```console
134
119
usage: taxonopy trace [-h] {entry} ...
@@ -151,7 +136,7 @@ options:
151
136
--verbose Show full details including all UUIDs in group
152
137
```
153
138
154
-
#### Commands: `common-names`
139
+
#### Command: `common-names`
155
140
The `common-names` command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names.
TaxonoPy creates a cache of the objects associated with input entries for use with the `trace` command. By default, this cache is stored in the `~/.cache/taxonopy` directory.
183
168
184
169
## Development
185
-
186
-
This section assumes that you have installed the package in development mode.
187
-
188
-
### OpenAPI Specification Managment and Type Generation
189
-
190
-
`TaxonoPy` uses GNVerifier to generate and integrates with its API from its OpenAPI specification.
191
-
192
-
The script that handles this is `scripts/generate_gnverifier_types.py`, which saves `api_specs/gnverifier_openapi.json` and from this produces `src/taxonopy/types/gnverifier.py`.
193
-
194
-
To check for changes in the OpenAPI specification, run:
195
-
```console
196
-
python scripts/generate_gnverifier_types.py
197
-
```
198
-
199
-
If the OpenAPI specification has changed, you will need to decide whether to update the generated types.
200
-
201
-
The script will save `api_specs/gnverifier_openapi.json.new` and `src/taxonopy/types/gnverifier.py.new` for you to compare with the existing files and decide whether to overwrite them and make any necessary changes to the rest of the codebase.
202
-
170
+
See the [Wiki Development Page](https://github.com/Imageomics/TaxonoPy/wiki/Development) for development instructions.
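
The README text in this diff states that TaxonoPy keeps its cache in `~/.cache/taxonopy` by default. As a rough sketch for checking how much disk space that cache occupies, the snippet below walks a directory and reports the file count and total size. The helper name `summarize_dir` is hypothetical and not part of TaxonoPy, and nothing is assumed about the layout of the cache beyond its default location.

```python
from pathlib import Path

# Default cache location as stated in the README; adjust if configured otherwise.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "taxonopy"


def summarize_dir(path: Path) -> tuple[int, int]:
    """Return (file_count, total_bytes) for all regular files under `path`.

    A missing directory is reported as empty rather than raising an error.
    """
    if not path.is_dir():
        return (0, 0)
    sizes = [p.stat().st_size for p in path.rglob("*") if p.is_file()]
    return (len(sizes), sum(sizes))


if __name__ == "__main__":
    count, total = summarize_dir(DEFAULT_CACHE_DIR)
    print(f"{DEFAULT_CACHE_DIR}: {count} files, {total} bytes")
```

The function only reads file metadata, so it is safe to run while the cache is in use.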