A transformer-based model for genomic selection.

preprocessing/split.py
- Collect the metadata for each record: its split, environment (env), and hybrid
- input
  - data/raw/Training_Data/1_Training_Trait_Data_2014_2021.csv
  - data/raw/Testing_Data/1_Submission_Template_2022.csv
- output
  - data/splits.csv

preprocessing/vcf2num.sh
- Convert the VCF file to a numeric matrix
- input
  - data/raw/Training_Data/5_Genotype_Data_All_Years.vcf
- output
  - data/plink/geno.raw
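
The directory listing in this README describes `geno.raw` as six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by one column per SNP, which matches PLINK's additive `.raw` layout. A minimal sketch of loading such a file with the standard library is shown here. It assumes whitespace-delimited fields, a header row, and `NA` for missing calls; `read_plink_raw` is a hypothetical helper and not part of this repository.

```python
import os
import tempfile

META_COLS = ["FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE"]


def read_plink_raw(path):
    """Return (sample_ids, snp_names, dosages) from a PLINK .raw file.

    Dosages are ints (0, 1, or 2); missing calls ("NA") become None.
    """
    with open(path) as fh:
        header = fh.readline().split()
        if header[:6] != META_COLS:
            raise ValueError("unexpected .raw header")
        snp_names = header[6:]
        sample_ids, dosages = [], []
        for line in fh:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            sample_ids.append(fields[1])  # IID column
            dosages.append([None if v == "NA" else int(v) for v in fields[6:]])
    return sample_ids, snp_names, dosages


# Tiny made-up file, for demonstration only.
demo = (
    "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G\n"
    "0 lineA 0 0 0 -9 0 2\n"
    "0 lineB 0 0 0 -9 1 NA\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "geno.raw")
    with open(demo_path, "w") as fh:
        fh.write(demo)
    ids, snps, geno = read_plink_raw(demo_path)
```

With 26,213 SNP columns per line, a production loader would more likely rely on a dataframe library; the plain-Python version only illustrates the file layout.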

preprocessing/hybrid2p.py
- Infer parental genotypes from the hybrid information provided by G2F
- input
  - data/plink/geno.raw
- output
  - data/g_parents.csv

preprocessing/synthesize_f1.py
- Synthesize the F1 genotypes back from the parental genotypes
- input
  - data/g_parents.csv
- output
  - data/g_f1.csv
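
One common way to synthesize an F1 genotype, sketched here under stated assumptions rather than taken from `synthesize_f1.py`: if markers are coded as additive dosages (0, 1, 2) and both parents are inbred lines, the expected F1 dosage at each marker is the mean of the two parental dosages. For homozygous parents this is exact (0 and 2 give 1); for a heterozygous parent it is only an expectation. The function name and the use of `None` for missing calls are illustrative choices.

```python
def synthesize_f1(p1, p2):
    """Expected F1 dosage per marker from two parental dosage vectors.

    Each parent passes on one allele per marker, so the expected F1
    dosage is (p1 + p2) / 2. Missing parental calls propagate as None.
    """
    if len(p1) != len(p2):
        raise ValueError("parents must have the same number of markers")
    f1 = []
    for a, b in zip(p1, p2):
        if a is None or b is None:
            f1.append(None)
        else:
            f1.append((a + b) / 2)
    return f1


# Made-up parental dosages at five markers.
parent_a = [0, 2, 2, 0, None]
parent_b = [0, 0, 2, 2, 2]
f1 = synthesize_f1(parent_a, parent_b)
```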

make_images.py
- Combine genotype and EC data into feature images
- input
  - data/g_parents.csv
  - data/splits.csv
  - data/raw/Training_Data/6_Training_EC_Data_2014_2021.csv
  - data/raw/Testing_Data/6_Testing_EC_Data_2022.csv
- output
  - data/images/<split>/%id.png
    - feature images: 384 x 1152 x 3
  - data/images/<split>/annotation.txt
    - labels (yield)
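
This README does not describe how `make_images.py` lays features out inside an image. As a purely illustrative sketch of the general idea, the function below pads a flat feature vector and reshapes it row by row into a height x width x channels grid. The function name, the zero padding, and the row-major order are all assumptions, and the demo uses a tiny 2 x 2 x 3 grid instead of 384 x 1152 x 3.

```python
def vector_to_image(values, height, width, channels=3, fill=0):
    """Pack a flat feature vector into a nested [height][width][channels] list.

    Values are written in row-major order; unused cells are set to `fill`.
    """
    capacity = height * width * channels
    if len(values) > capacity:
        raise ValueError("feature vector does not fit in the image")
    padded = list(values) + [fill] * (capacity - len(values))
    image = []
    idx = 0
    for _ in range(height):
        row = []
        for _ in range(width):
            row.append(padded[idx:idx + channels])  # one pixel
            idx += channels
        image.append(row)
    return image


# Made-up features, e.g. a few SNP dosages followed by two EC values.
features = [2, 0, 1, 2, 1, 0, 5, 7]
img = vector_to_image(features, height=2, width=2)
```

Writing the grid to a PNG would additionally require scaling values to the 0-255 range and an image library, which is omitted here.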

gsformer.py
- GSformer PyTorch module

train.py
- Train the model
- input
  - data/images/train/
  - data/images/val/
- output
  - out/gsformer.pt
    - trained model weights

inference.py
- Make predictions on the test set
- input
  - data/images/test/
  - out/gsformer.pt
- output
  - out/pred.csv
    - raw predicted values

submission.py
- Format the submission file
- input
  - out/pred.csv
  - data/splits.csv
  - data/raw/Testing_Data/1_Submission_Template_2022.csv
- output
  - out/submission.csv
    - formatted submission file
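
The formatting step can be pictured as writing predictions into the empty target column of the template. The sketch below is an assumption-heavy illustration, not the logic of `submission.py`: it assumes the template has columns named `Env`, `Hybrid`, and `Yield_Mg_ha`, that predictions arrive in the same row order as the template, and it ignores `data/splits.csv`. The environment and hybrid names in the demo are made up.

```python
import csv
import io


def fill_template(template_rows, predictions, target_col="Yield_Mg_ha"):
    """Return copies of the template rows with the target column filled in."""
    if len(template_rows) != len(predictions):
        raise ValueError("need exactly one prediction per template row")
    filled = []
    for row, pred in zip(template_rows, predictions):
        new_row = dict(row)
        new_row[target_col] = f"{pred:.4f}"  # fixed precision for the CSV
        filled.append(new_row)
    return filled


# In-memory stand-ins for the template and the raw predictions.
template_csv = "Env,Hybrid,Yield_Mg_ha\nENV_A_2022,P1/P2,\nENV_B_2022,P3/P4,\n"
rows = list(csv.DictReader(io.StringIO(template_csv)))
filled = fill_template(rows, [9.8765432, 11.5])

out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["Env", "Hybrid", "Yield_Mg_ha"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(filled)
submission_text = out.getvalue()
```

If predictions are keyed by environment and hybrid instead of row order, a lookup on those two columns would be the safer join.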

Input features
- 26,213 SNP markers
- 567 (9 * 63) sequential EC variables (1-9 soil layers)
- 144 non-sequential EC variables
data/images/<split>/annotation.txt
- 1 label (yield)
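
As a quick arithmetic check on the feature counts listed above, the snippet below totals the inputs and compares them with the number of cells in a 384 x 1152 x 3 image. Reading "9 * 63" as 63 variables for each of 9 soil layers is an interpretation, and the variable names are illustrative.

```python
# Feature counts as listed in this README.
n_snp = 26_213
n_layers, n_vars_per_layer = 9, 63           # assumed reading of "9 * 63"
n_seq_ec = n_layers * n_vars_per_layer       # sequential EC variables
n_static_ec = 144                            # non-sequential EC variables
n_features = n_snp + n_seq_ec + n_static_ec  # total inputs per sample

# Image dimensions as listed in this README.
height, width, channels = 384, 1152, 3
n_pixels = height * width
capacity = n_pixels * channels               # scalar cells in one image

print(f"sequential EC variables: {n_seq_ec}")
print(f"total features:          {n_features}")
print(f"pixels per image:        {n_pixels}")
print(f"cells per image:         {capacity}")
```

The total is 26,924 features against 442,368 pixels (1,327,104 cells), so the image has far more cells than there are raw features; how the extra capacity is used is not stated in this README.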

data/
- data files
- images/
  - feature images
  - train/
    - training set
  - test/
    - testing set
  - val/
    - validation set
  - <split>/%id.png
    - feature images: 384 x 1152 x 3
  - <split>/annotation.txt
    - labels (yield)
- plink/
  - genotype data (PLINK)
  - geno.raw
    - genotype data (FID, IID, PAT, MAT, SEX, PHENOTYPE, 26,213 SNPs)
- raw/
  - original dataset provided by G2F
  - Training_Data/
    - training datasets
  - Testing_Data/
    - testing datasets
- g_f1.csv
  - F1 genotypes (synthesized)
- g_parents.csv
  - parental genotypes (inferred)
- splits.csv
  - envs and lines of each data split

out/
- project outputs
- gsformer.pt
  - trained model weights
- pred.csv
  - raw predicted values
- submission.csv
  - formatted submission file

preprocessing/
- scripts for data preprocessing