Commit b17e8a7

thompsonmj (Matt Thompson), vimar-gu, and egrace479 committed
Initial profile-based strategy implementation (#5)
* Auto configuration of GNVerifier response types
* Set data structures for input entries, entry grouping, query term grouping, and resolution attempts
* Update to use pydantic v2
* Set up core data model
* Initialize CLI and logging
* Initialize handling of resolution attempts
* Log description of design decisions
* Implement query planning
* Make OpenAPI specs optional to avoid erroneous requirement problems
* Implement caching and data flow through query execution
* Centralize configuration
* Add cache option details
* Resolution groundwork
* Move GNVerifier client into query submodule
* Encapsulating query system; adding tracing
* Fix critical bug passing new data source IDs down to query client during retries
* New profiles
* Build out entry tracing through ResolutionAttempt objects; chasing bug with retry query term selection for a set of no_match_nonempty_query case that should be handled by retries with next available query terms up the hierarchy
* Default quiet mode for entry trace; opt-in to verbose for all entry group UUIDs
* Increasing case coverage with new profiles
* Relax match constraints for greater sing synonym result coverage
* Add post-processing method to force resolution to use input hierarchy; some cleanup
* Activate last resort forcing
* Add common name retrieval
* Roll common names approach into CLI as subcommand for post-resolution processing
* Enable custom cache directory specification
* Make outputs recreate input directory hierarchy
* Remove common names script whose functionality has been added to a CLI command

  Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
* Update README.md

---------

Co-authored-by: Matt Thompson <thompson.4590@osu.edu>
Co-authored-by: Jianyang Gu <gu.1220@osu.edu>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
1 parent 03d323b commit b17e8a7

71 files changed: +10819 -470 lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -162,3 +162,7 @@ cython_debug/
162162
# More
163163
.DS_Store
164164
.ruff_cache/
165+
.jsonl
166+
data/
167+
.vscode/
168+
.ipynb

README.md

Lines changed: 177 additions & 29 deletions
@@ -1,54 +1,202 @@
11
# TaxonoPy
22

3-
This tool is under development and is unstable.
3+
`TaxonoPy` (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier).
44

5-
`TaxonoPy` (/tækˈsɒnəpaɪ/) is a command-line tool for resolving taxonomic hierarchies using the [Global Names Resolver (GNR) API](http://resolver.globalnames.org/). It provides an interface to input species names and retrieve taxonomic classifications conforming to a strict 7-rank Linnaean hierarchy, helping researchers gather controlled taxonomic data about species.
5+
## Purpose
6+
The motivation for this package is to create an internally consistent and standardized classification set for organisms in the TreeOfLife-200M (TOL) dataset.
67

7-
Specifically, the ranks required include `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, and `species`. Only results exactly matching these ranks are returned.
8+
This dataset contains over 200 million samples of organisms from four core data providers:
89

9-
## Installation
10+
- The Global Biodiversity Information Facility (GBIF)
11+
- BIOSCAN-5M
12+
- FathomNet
13+
- The Encyclopedia of Life (EOL)
1014

11-
`TaxonoPy` can be installed using `pip` after setting up a virtual environment.
1215

13-
### Virtual Environment Setup
16+
This package is a tool for creating an internally consistent classification set for a list of organisms whose entries have inconsistent naming.
1417

15-
For example, with `conda`, run:
16-
```bash
17-
conda create -n myenv -y
18-
conda activate myenv
19-
```
18+
### Input
19+
20+
A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include:
21+
- `uuid`: a unique identifier for each sample (required).
22+
- `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`: the taxonomic ranks of the organism (required, may have sparsity).
23+
- `scientific_name`: the scientific name of the organism, to the most specific rank available (optional).
24+
- `common_name`: the common (i.e. vernacular) name of the organism (optional).
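
The column requirements above can be checked before running the tool. The following is a minimal sketch, not part of TaxonoPy: `check_columns` is a hypothetical helper, and it only compares a list of column names (however you obtain them from your Parquet files) against the labels listed in this README. Whether TaxonoPy tolerates additional columns is not stated here, so the helper merely reports them.

```python
# Column names taken from the README's input description.
REQUIRED_COLUMNS = (
    "uuid", "kingdom", "phylum", "class", "order", "family", "genus", "species",
)
OPTIONAL_COLUMNS = ("scientific_name", "common_name")


def check_columns(columns):
    """Return (missing_required, unexpected) for an iterable of column names.

    Hypothetical helper for illustration; TaxonoPy's own validation may differ.
    """
    present = set(columns)
    # Preserve the documented order when reporting missing required columns.
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    known = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    unexpected = sorted(present - known)
    return missing, unexpected


if __name__ == "__main__":
    missing, unexpected = check_columns(["uuid", "kingdom", "genus", "source"])
    print("missing required:", missing)
    print("not described in the README:", unexpected)
```

Note that a required column may still contain empty values; the check covers only the presence of the columns themselves.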
25+
26+
See the example data in
27+
- `examples/input/sample.parquet`
28+
- `examples/resolved/sample.resolved.parquet` (generated with [`taxonopy resolve`](#commands-resolve))
29+
- `examples/resolved_with_common_names/sample.resolved.parquet` (generated with [`taxonopy common-names`](#commands-common-names))
30+
31+
### Challenges
32+
This taxonomy information is provided by each data provider and the original sources, but the classification can be...
33+
34+
- **Inconsistent**: both between and within sources (e.g. kingdom Metazoa vs. Animalia).
35+
- **Incomplete**: many samples are missing one or more ranks. Some have 'holes' where higher and lower ranks are present, but intermediate ranks are missing.
36+
- **Incorrect**: some samples have incorrect classifications. This can come in the form of spelling errors, nonstandard idiosyncratic terms, or outdated classifications.
37+
- **Ambiguous**: homonyms, synonyms, and other terms that can be interpreted in multiple ways unless handled systematically.
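
To make the "holes" case from the list above concrete, here is a small sketch that flags ranks left empty between two populated ranks. It assumes each entry is a plain dictionary keyed by rank name and treats `None` and the empty string as missing; `find_holes` is a hypothetical name and does not reflect how TaxonoPy represents entries internally.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def find_holes(entry):
    """Return ranks that are empty but sit between two populated ranks."""
    filled = [index for index, rank in enumerate(RANKS) if entry.get(rank)]
    if not filled:
        return []
    first, last = filled[0], filled[-1]
    # Anything empty strictly inside the populated span counts as a hole;
    # ranks missing above the first or below the last populated rank do not.
    return [RANKS[i] for i in range(first, last + 1) if not entry.get(RANKS[i])]


if __name__ == "__main__":
    entry = {
        "kingdom": "Animalia",
        "phylum": "Chordata",
        "class": None,
        "order": "",
        "family": "Hominidae",
        "genus": "Homo",
        "species": None,
    }
    print(find_holes(entry))
```

In this example `species` is missing but is not reported, because nothing below it is populated; that case is incompleteness rather than a hole.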
38+
39+
Taxonomic authorities exist to standardize classification, but ...
40+
- There are many authorities.
41+
- They may disagree.
42+
- A given organism may be missing from some.
2043

21-
### Installation with `pip`
44+
### Solution
45+
`TaxonoPy` uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the GBIF backbone taxonomy, since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the Catalogue of Life and Open Tree of Life (OTOL) taxonomy are used.
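
The prioritization described above amounts to trying sources in a fixed order and keeping the first match. The sketch below illustrates only that idea; it is not TaxonoPy's profile-based strategy. The `lookup` callable, the source labels, the relative order of the two backup sources, and the toy results table are all assumptions made for the example, and no real GNVerifier call is made.

```python
def resolve_with_fallback(name, sources, lookup):
    """Try each source in priority order and return the first hit.

    `lookup(name, source)` is a hypothetical callable that returns a
    classification dict, or None when the source has no match.
    """
    for source in sources:
        result = lookup(name, source)
        if result is not None:
            return source, result
    return None, None


# Toy stand-in for real queries: (name, source) -> partial classification.
# "Examplea fictiva" is a made-up placeholder name, not a real taxon.
TOY_RESULTS = {
    ("Homo sapiens", "GBIF"): {"kingdom": "Animalia", "genus": "Homo"},
    ("Examplea fictiva", "OTOL"): {"genus": "Examplea"},
}


def toy_lookup(name, source):
    return TOY_RESULTS.get((name, source))


# Assumed priority: GBIF first, then the backup sources named in the README.
PRIORITY = ("GBIF", "Catalogue of Life", "OTOL")

if __name__ == "__main__":
    for query in ("Homo sapiens", "Examplea fictiva", "Unknownus nullus"):
        print(query, "->", resolve_with_fallback(query, PRIORITY, toy_lookup))
```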
2246

23-
To install the latest version of `TaxonoPy` directly from GitHub, run:
24-
```bash
25-
pip install git+ssh://git@github.com/Imageomics/TaxonoPy.git
47+
## Installation
48+
49+
`TaxonoPy` can be installed with `pip` after setting up a virtual environment.
50+
51+
### User Installation with `pip`
52+
53+
To install the latest version of `TaxonoPy`, run:
54+
```console
55+
pip install taxonopy
2656
```
2757

2858
### Development Installation with `pip`
2959

30-
Clone the repository and install the package in development mode:
31-
```bash
32-
git clone git@github.com:Imageomics/Taxonopy.git
60+
Clone the repository and install the package in development mode with an activated virtual environment:
61+
```console
62+
git clone git@github.com:Imageomics/TaxonoPy.git
3363
cd TaxonoPy
64+
```
65+
Set up and activate a virtual environment.
66+
67+
Install the package in development mode:
68+
```console
3469
pip install -e .[dev]
3570
```
3671

37-
## Usage
72+
### Usage
73+
You may view the help for the command line interface by running:
74+
```console
75+
taxonopy --help
76+
```
77+
This will show you the available commands and options:
78+
```console
79+
usage: taxonopy [-h] [--cache-dir CACHE_DIR] [--show-cache-path] [--cache-stats] [--clear-cache] [--show-config] [--version] {resolve,trace,common-names} ...
80+
81+
TaxonoPy: Resolve taxonomic names using GNVerifier and trace data provenance.
82+
83+
positional arguments:
84+
{resolve,trace,common-names}
85+
resolve Run the taxonomic resolution workflow
86+
trace Trace data provenance of TaxonoPy objects
87+
common-names Merge vernacular names (post-process) into resolved outputs
88+
89+
options:
90+
-h, --help show this help message and exit
91+
--cache-dir CACHE_DIR
92+
Directory for TaxonoPy cache (can also be set with TAXONOPY_CACHE_DIR environment variable) (default: None)
93+
--show-cache-path Display the current cache directory path and exit (default: False)
94+
--cache-stats Display statistics about the cache and exit (default: False)
95+
--clear-cache Clear the TaxonoPy object cache. May be used in isolation. (default: False)
96+
--show-config Show current configuration and exit (default: False)
97+
--version Show version number and exit
98+
```
99+
#### Commands: `resolve`
100+
The `resolve` command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions.
101+
```
102+
usage: taxonopy resolve [-h] -i INPUT -o OUTPUT_DIR [--output-format {csv,parquet}] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log-file LOG_FILE] [--force-input] [--batch-size BATCH_SIZE] [--all-matches]
103+
[--capitalize] [--fuzzy-uninomial] [--fuzzy-relaxed] [--species-group] [--refresh-cache]
104+
105+
options:
106+
-h, --help show this help message and exit
107+
-i, --input INPUT Path to input Parquet or CSV file/directory
108+
-o, --output-dir OUTPUT_DIR
109+
Directory to save resolved and unsolved output files
110+
--output-format {csv,parquet}
111+
Output file format
112+
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
113+
Set logging level
114+
--log-file LOG_FILE Optional file to write logs to
115+
--force-input Force use of input metadata without resolution
116+
117+
GNVerifier Settings:
118+
--batch-size BATCH_SIZE
119+
Max number of name queries per GNVerifier API/subprocess call
120+
--all-matches Return all matches instead of just the best one
121+
--capitalize Capitalize the first letter of each name
122+
--fuzzy-uninomial Enable fuzzy matching for uninomial names
123+
--fuzzy-relaxed Relax fuzzy matching criteria
124+
--species-group Enable group species matching
125+
126+
Cache Management:
127+
--refresh-cache Force refresh of cached objects (input parsing, grouping) before running.
128+
```
129+
It is recommended to keep GNVerifier settings at their defaults.
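
For readers unfamiliar with what a batch-size limit implies, the following sketch shows the general idea of splitting a list of name queries into chunks no larger than a given size. It is an illustration only: `batched` is a hypothetical helper, and how TaxonoPy actually groups queries per GNVerifier call is not described in this README.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    names = ["Homo sapiens", "Homo", "Hominidae", "Primates", "Mammalia"]
    for batch in batched(names, 2):
        print(batch)
```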
130+
131+
#### Commands: `trace`
132+
The `trace` command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy.
133+
```console
134+
usage: taxonopy trace [-h] {entry} ...
135+
136+
positional arguments:
137+
{entry}
138+
entry Trace an individual taxonomic entry by UUID
139+
140+
options:
141+
-h, --help show this help message and exit
142+
143+
usage: taxonopy trace entry [-h] --uuid UUID --from-input FROM_INPUT [--format {json,text}] [--verbose]
144+
145+
options:
146+
-h, --help show this help message and exit
147+
--uuid UUID UUID of the taxonomic entry
148+
--from-input FROM_INPUT
149+
Path to the original input dataset
150+
--format {json,text} Output format
151+
--verbose Show full details including all UUIDs in group
152+
```
38153

39-
`TaxonoPy` can be run from the command line with the following syntax:
40-
```bash
41-
taxonopy <species_name>
42-
taxonopy -f <file_path>
154+
#### Commands: `common-names`
155+
The `common-names` command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names.
156+
```console
157+
usage: taxonopy common-names [-h] --resolved-dir ANNOTATION_DIR --output-dir OUTPUT_DIR
158+
159+
options:
160+
-h, --help show this help message and exit
161+
--resolved-dir ANNOTATION_DIR
162+
Directory containing your *.resolved.parquet files
163+
--output-dir OUTPUT_DIR
164+
Directory to write annotated .parquet files
43165
```
166+
Note that the `common-names` command is a post-processing step and should be run after the `resolve` command.
167+
168+
### Example Usage
44169

45-
For example, to resolve the taxonomic hierarchy of the species name `Homo sapiens`, run:
46-
```bash
47-
taxonopy "Homo sapiens"
170+
To perform taxonomic resolution on a dataset with subsequent common name annotation, run:
171+
```console
172+
taxonopy resolve \
173+
--input /path/to/formatted/input \
174+
--output-dir /path/to/resolved/output
48175
```
176+
```console
177+
taxonopy common-names \
178+
--resolved-dir /path/to/resolved/output \
179+
--output-dir /path/to/resolved_with_common-names/output
180+
```
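
The same two steps can be driven from a script. This is a minimal sketch that assumes `taxonopy` is installed and on `PATH`; it uses only the flags shown in the help output above, and the helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_commands(input_dir, resolved_dir, annotated_dir):
    """Build the argument lists for the resolve and common-names steps."""
    resolve_cmd = [
        "taxonopy", "resolve",
        "--input", str(input_dir),
        "--output-dir", str(resolved_dir),
    ]
    common_names_cmd = [
        "taxonopy", "common-names",
        "--resolved-dir", str(resolved_dir),
        "--output-dir", str(annotated_dir),
    ]
    return [resolve_cmd, common_names_cmd]


def run_pipeline(input_dir, resolved_dir, annotated_dir):
    """Run both steps in order, stopping at the first non-zero exit code."""
    for command in build_commands(input_dir, resolved_dir, annotated_dir):
        subprocess.run(command, check=True)
```

Because `check=True` raises `subprocess.CalledProcessError` on failure, the `common-names` step is not attempted if `resolve` exits with an error.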
181+
182+
TaxonoPy creates a cache of the objects associated with input entries for use with the `trace` command. By default, this cache is stored in the `~/.cache/taxonopy` directory.
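
The cache location can come from the `--cache-dir` option, the `TAXONOPY_CACHE_DIR` environment variable, or the default path. The sketch below shows one plausible way to combine them; the precedence used here (option, then environment variable, then default) is an assumption, and `resolve_cache_dir` is a hypothetical function rather than TaxonoPy's own code. Run `taxonopy --show-cache-path` to see the path the tool actually uses.

```python
import os
from pathlib import Path


def resolve_cache_dir(cli_value=None, environ=None):
    """Pick a cache directory from the CLI value, the environment, or the default.

    Assumed precedence: explicit CLI value, then TAXONOPY_CACHE_DIR,
    then ~/.cache/taxonopy.
    """
    environ = os.environ if environ is None else environ
    chosen = cli_value or environ.get("TAXONOPY_CACHE_DIR")
    if chosen:
        return Path(chosen).expanduser()
    return Path.home() / ".cache" / "taxonopy"


if __name__ == "__main__":
    print(resolve_cache_dir())
```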
183+
184+
## Development
49185

50-
Or for a list of species names stored in a file, run:
51-
```bash
52-
taxonopy -f ids.txt
186+
This section assumes that you have installed the package in development mode.
187+
188+
### OpenAPI Specification Management and Type Generation
189+
190+
`TaxonoPy` integrates with the GNVerifier API and generates its response types from GNVerifier's OpenAPI specification.
191+
192+
The script that handles this is `scripts/generate_gnverifier_types.py`, which saves `api_specs/gnverifier_openapi.json` and from this produces `src/taxonopy/types/gnverifier.py`.
193+
194+
To check for changes in the OpenAPI specification, run:
195+
```console
196+
python scripts/generate_gnverifier_types.py
53197
```
54198

199+
If the OpenAPI specification has changed, you will need to decide whether to update the generated types.
200+
201+
The script will save `api_specs/gnverifier_openapi.json.new` and `src/taxonopy/types/gnverifier.py.new` for you to compare with the existing files and decide whether to overwrite them and make any necessary changes to the rest of the codebase.
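
One way to compare a `.new` file with its existing counterpart is a unified diff. The sketch below uses Python's standard `difflib`; the helper names `diff_text` and `diff_files` are hypothetical and are not part of `scripts/generate_gnverifier_types.py`.

```python
import difflib
from pathlib import Path


def diff_text(old, new, old_name, new_name):
    """Return a unified diff between two strings, or '' if they are identical."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(lines)


def diff_files(current_path, new_path):
    """Read both files as UTF-8 text and diff them."""
    current = Path(current_path).read_text(encoding="utf-8")
    new = Path(new_path).read_text(encoding="utf-8")
    return diff_text(current, new, str(current_path), str(new_path))


if __name__ == "__main__":
    # Paths as named in this README; run from the repository root.
    print(diff_files(
        "api_specs/gnverifier_openapi.json",
        "api_specs/gnverifier_openapi.json.new",
    ))
```

An empty result means the two files have identical content, in which case the `.new` file can simply be discarded.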
202+
