GitHub - BIDS-Xu-Lab/Dataset_Entity_Extraction

Here is all the code for the Gui. Created on Jan 3, 2025.

The folder “Biomedical_datasets” is the “Dataset and Repository Ecosystem” project folder. Its useful scripts, data, and results are listed below.

1. The folder “Biomedical_datasets/NER” contains scripts, data, and results for fine-tuning BERT models on NER tasks.

1.1. The folder “Biomedical_datasets/NER/nerbert” contains the scripts and results for fine-tuning BERT models on NER tasks.

1.1.1. “Biomedical_datasets/NER/nerbert/nerbert.py” and “Biomedical_datasets/NER/nerbert/nerbert_nospacy.py” are the pipelines to fine-tune BERT models to conduct NER tasks.

  1.1.1.1. “Biomedical_datasets/data_update/precessed_data/train_data.json” and “Biomedical_datasets/data_update/precessed_data/test_data.json” are the training and testing data for “nerbert_nospacy.py”.
  
  1.1.1.2. “Biomedical_datasets/NER/data/train_data.json” and “Biomedical_datasets/NER/data/test_data.json” are the training and testing data for “nerbert.py”.

  1.1.1.3. “Biomedical_datasets/NER/nerbert/{_name}_output/time_{time}” is the output path. “_name” is the name of the model, and “time” is the number of the fine-tuning run for that model (for example, “time_1” is the first run). Please see the pipeline scripts for details.
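
The paths in 1.1.1.1 to 1.1.1.3 can be collected in one place, as in the sketch below. It is only an illustration: the names `DATA_PATHS` and `output_dir` are hypothetical and do not come from the pipeline scripts, and the paths are assumed to be relative to the repository root.

```python
from pathlib import Path

# Training/testing data used by each pipeline, as listed in 1.1.1.1 and 1.1.1.2.
# Assumption: paths are relative to the repository root.
DATA_PATHS = {
    "nerbert_nospacy.py": {
        "train": Path("Biomedical_datasets/data_update/precessed_data/train_data.json"),
        "test": Path("Biomedical_datasets/data_update/precessed_data/test_data.json"),
    },
    "nerbert.py": {
        "train": Path("Biomedical_datasets/NER/data/train_data.json"),
        "test": Path("Biomedical_datasets/NER/data/test_data.json"),
    },
}


def output_dir(model_name: str, run: int) -> Path:
    """Build {_name}_output/time_{time} for one fine-tuning run (hypothetical helper)."""
    return Path("Biomedical_datasets/NER/nerbert") / f"{model_name}_output" / f"time_{run}"


# Example: the folder of the first run of biobert-large-cased-v1.1-squad.
print(output_dir("biobert-large-cased-v1.1-squad", 1).as_posix())
```

The pipelines themselves may build these paths differently; check “nerbert.py” and “nerbert_nospacy.py” before relying on this layout.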

1.1.2. “Biomedical_datasets/NER/nerbert/biobert-large-cased-v1.1-squad_output/time_1/hyperparameter.json” holds the hyperparameters for the first fine-tuning run of the “biobert-large-cased-v1.1-squad” model. The F1 score for this run is in “Biomedical_datasets/NER/nerbert/biobert-large-cased-v1.1-squad_output/time_1/performance.csv”.
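
To inspect one run, both files in its output folder can be read with the standard library. The sketch below assumes that “hyperparameter.json” is a single JSON document and that “performance.csv” has a header row; the actual layouts are defined by the pipeline, and `load_run` is a hypothetical helper name.

```python
import csv
import json
from pathlib import Path


def load_run(run_dir):
    """Return (hyperparameters, performance rows) for one run folder.

    Assumes hyperparameter.json is a JSON document and performance.csv
    starts with a header row; neither layout is specified in this README.
    """
    run_dir = Path(run_dir)
    with open(run_dir / "hyperparameter.json", encoding="utf-8") as f:
        hyperparameters = json.load(f)
    with open(run_dir / "performance.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # one dict per CSV row, keyed by header
    return hyperparameters, rows
```

For example, `load_run("Biomedical_datasets/NER/nerbert/biobert-large-cased-v1.1-squad_output/time_1")` would return the contents of the two files described in 1.1.2.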

2. The folder “Biomedical_datasets/RE” contains scripts, data, and results for fine-tuning BERT models on RE tasks.

3. The folder “Biomedical_datasets/total_pubmed” contains the pipeline that produces the currently annotated batch files.

3.1. “Biomedical_datasets/total_pubmed/Sampled_from_total_PubMed_specific_name_12_16” is the folder to use; the others are not needed.

3.2. “Biomedical_datasets/total_pubmed/Sampled_from_total_PubMed_specific_name_12_16/example_batch.ipynb” is the pipeline that uses GPT-4o to filter the abstracts and produce the files to be annotated.

3.3. “/home/gy237/project/Biomedical_datasets/total_pubmed/Sampled_from_total_PubMed_specific_name_12_16/abstracts_nonull_1-1575.joblib” is the full PubMed abstracts file. It contains the following fields: “['pmid', 'title', 'abstract', 'journal', 'pubdate', 'authors', 'mesh_terms', 'pub_year', 'pub_month']”.
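
A quick way to confirm that a loaded abstracts file has the fields listed in 3.3 is sketched below. The README does not say what kind of object the `.joblib` file stores, so the check only takes an iterable of field names; `EXPECTED_FIELDS`, `missing_fields`, and `load_abstracts` are hypothetical names, and `joblib.load` is the only third-party call used.

```python
# Fields listed in 3.3 for abstracts_nonull_1-1575.joblib.
EXPECTED_FIELDS = ['pmid', 'title', 'abstract', 'journal', 'pubdate',
                   'authors', 'mesh_terms', 'pub_year', 'pub_month']


def missing_fields(available):
    """Return the expected fields that are absent from `available`."""
    present = set(available)
    return [name for name in EXPECTED_FIELDS if name not in present]


def load_abstracts(path):
    """Load the abstracts file with joblib.

    joblib is imported here so that missing_fields works without it installed.
    """
    import joblib
    return joblib.load(path)
```

If the loaded object turns out to be a pandas DataFrame, `missing_fields(data.columns)` would apply; if it is a list of dictionaries, `missing_fields(data[0])` would. Both cases are assumptions about the stored object, not statements about the file.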

