Commit c9da50d

Irlirion, BarzaH, and QazyBi authored
Add model configurations and datasets for the new materials framework (#190)
* Added dash user interface
* added essential rows to config (#187)
* + tox21, catboost classifier and remove deepchem
* fix smiles may be none
* + qsar mcl1 pic50 model
* + pipes
* + qm7 dataset
* + bopp
* + bitumen
* + handling multitarget
* + r2 to catboost metrics
* fix multitarget in bitumen metrics
* fix bitumen data
* + qm7
* + polymers
* fix classification metrics
* remove infer from mcl1_pic50
* fix task name in tox21
* remove unnecessary metric config
* fix bugs
* fix clearml
* replace pickle by joblib for clearml
* add support for output_uri to clearml config
* fix tox21 project naming
* fix tests
* + chembl
* + upload chembl to public s3

---------

Co-authored-by: ibragimmergaliev <i.mergaliev@innopolis.ru>
Co-authored-by: Kazybek Askarbek <k.askarbek@innopolis.university>
1 parent ded6065 commit c9da50d

50 files changed, +4503 -2897 lines changed

config/callbacks/classification.yaml

+15 -7
@@ -1,11 +1,12 @@
-requirements:
-  task:
-    - table-classification
-  framework:
-    - xgboost
-    - sklearn
+task:
+  - table-classification
+  - qsar-classification
+framework:
+  - xgboost
+  - sklearn
+  - catboost
 
-objects:
+implementations:
   xgboost:
     accuracy:
       _target_: innofw.core.metrics.custom_metrics.metrics.Accuracy
@@ -14,6 +15,13 @@ objects:
       average: macro
 
   sklearn:
+    accuracy:
+      _target_: sklearn.metrics.accuracy_score
+    f_one:
+      _target_: sklearn.metrics.f1_score
+      average: macro
+
+  catboost:
     accuracy:
       _target_: sklearn.metrics.accuracy_score
     f_one:
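
Each metric entry in this file names a callable through a `_target_` dotted path, which is the key Hydra's `instantiate` looks for. The following is only a stdlib sketch of how such a path could be resolved; the helper name `resolve_target` is hypothetical, and `statistics.mean` and `builtins.int` stand in for the sklearn metrics so the snippet has no third-party dependencies. The framework itself presumably relies on Hydra rather than on code like this.

```python
import importlib
from functools import partial


def resolve_target(entry: dict):
    """Resolve a `_target_` dotted path and bind the remaining keys as kwargs.

    Hypothetical helper for illustration only, not part of innofw or Hydra.
    """
    module_path, _, attr = entry["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr)
    # Every key other than `_target_` is treated as a keyword argument,
    # the way `average: macro` accompanies the f1 metric in the config.
    kwargs = {k: v for k, v in entry.items() if k != "_target_"}
    return partial(target, **kwargs) if kwargs else target


# Stand-in entries shaped like the metric entries in the config.
mean_metric = resolve_target({"_target_": "statistics.mean"})
parse_binary = resolve_target({"_target_": "builtins.int", "base": 2})

print(mean_metric([1, 0, 1, 1]))
print(parse_binary("101"))
```

Binding the extra keys with `partial` keeps the resolved object callable with just the data arguments, which is one plausible way to carry options such as `average: macro`.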

config/callbacks/regression.yaml

+2
@@ -24,3 +24,5 @@ implementations:
       _target_: sklearn.metrics.mean_squared_error
     mae:
       _target_: sklearn.metrics.mean_absolute_error
+    r2:
+      _target_: innofw.core.metrics.custom_metrics.metrics.R2
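
The `R2` metric added here is not shown in the visible part of the commit. As a reminder of what a coefficient-of-determination metric computes, here is a plain-Python sketch; the function name `r2_score` is chosen for illustration, and the actual `innofw` class may differ, for example in how it handles multiple targets.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


print(r2_score([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2_score([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

A perfect model scores 1.0 and a model that always predicts the mean scores 0.0, which makes the value easier to compare across targets than raw MSE or MAE.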

config/clear_ml/disabled.yaml

+1
@@ -1,2 +1,3 @@
 enable: False
 queue:
+output_uri:

config/clear_ml/enabled.yaml

+1
@@ -1,2 +1,3 @@
 enable: True
 queue:
+output_uri:

config/clear_ml/test_queue.yaml

+1
@@ -1,2 +1,3 @@
 enable: True
 queue: test
+output_uri:
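
All three ClearML configs gain an empty `output_uri` key. The code that consumes it is not visible in this part of the diff, so the following is only a guess at the shape of that logic: a hypothetical helper, `task_init_kwargs`, that turns the config into keyword arguments of the kind `clearml.Task.init` accepts (`project_name`, `task_name`, `output_uri`). The bucket URI is an invented placeholder.

```python
def task_init_kwargs(clear_ml_cfg: dict, project: str, task_name: str):
    """Build keyword arguments for a ClearML task, or None when disabled.

    Hypothetical helper: the real innofw integration may be structured
    differently. An empty YAML value loads as None, so `output_uri` is
    only forwarded when it was actually set.
    """
    if not clear_ml_cfg.get("enable"):
        return None
    kwargs = {"project_name": project, "task_name": task_name}
    if clear_ml_cfg.get("output_uri"):
        kwargs["output_uri"] = clear_ml_cfg["output_uri"]
    return kwargs


disabled = task_init_kwargs({"enable": False, "queue": None, "output_uri": None}, "tox21", "run")
default_storage = task_init_kwargs({"enable": True, "queue": "test", "output_uri": None}, "tox21", "run")
custom_storage = task_init_kwargs(
    {"enable": True, "queue": "test", "output_uri": "s3://example-bucket/models"},  # placeholder URI
    "tox21",
    "run",
)
print(disabled, default_storage, custom_storage, sep="\n")
```

Leaving the key empty by default keeps existing runs unchanged while letting a deployment point model artifacts at its own storage.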

config/datasets/bitumen.yaml

+25
@@ -0,0 +1,25 @@
+task:
+  - table-regression
+
+name: bitumen
+description: bitumen properties
+
+markup_info: Информация о разметке
+date_time: 21.07.2022
+
+_target_: innofw.core.datamodules.pandas_datamodules.PandasDataModule
+
+
+train:
+  source: ./data/bitumen/train/train.csv
+test:
+  source: ./data/bitumen/test/test.csv
+
+val_size: 0.2
+target_col:
+  - "Время окисления, ч"
+  - "Расход воздуха, мл/сек"
+  - "Минимальная температура окисления, °С"
+  - "Максимальная температура окисления, °С"
+  - "Количество гудрона, л"
+  - "Температура воздуха, °С"

config/datasets/bopp.yaml

+19
@@ -0,0 +1,19 @@
+task:
+  - table-regression
+
+name: bopp
+description: Bopp films
+
+markup_info: Информация о разметке
+date_time: 21.07.2022
+
+_target_: innofw.core.datamodules.pandas_datamodules.PandasDataModule
+
+
+train:
+  source: ./data/bopp/train/train.csv
+test:
+  source: ./data/bopp/test/test.csv
+
+val_size: 0.2
+target_col: turbidity
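
Several of these dataset configs set `val_size: 0.2`, which reads as "hold out 20% of the training rows for validation". How `PandasDataModule` performs the split is not shown in this diff; the sketch below is one straightforward interpretation, with a hypothetical `split_train_val` function and a fixed seed mirroring the `random_seed: 42` used in the experiment configs.

```python
import random


def split_train_val(rows, val_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a validation share.

    Hypothetical helper: the datamodule's real splitting logic may differ
    (for instance, it could delegate to scikit-learn).
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(rows) * val_size))
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    val = [row for i, row in enumerate(rows) if i in val_idx]
    return train, val


train_rows, val_rows = split_train_val(list(range(10)))
print(len(train_rows), len(val_rows))
```

Seeding a local `random.Random` instead of the global generator keeps the split reproducible without affecting other code that draws random numbers.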

config/datasets/chembl_33_smiles.yaml

+20
@@ -0,0 +1,20 @@
+task:
+  - qsar-regression
+  - text-vae-forward
+  - text-vae
+
+name: chembl_33_smiles
+description: "Link: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_33/"
+
+markup_info: Информация о разметке
+date_time: 09.06.2023
+
+_target_: innofw.core.datamodules.lightning_datamodules.QsarSelfiesDataModule
+train:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/chembl_33_smiles/train.zip
+  target: ./data/chembl_33/train/
+test:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/chembl_33_smiles/test.zip
+  target: ./data/chembl_33/test/
+smiles_col: SMILES
+target_col:
+35
@@ -0,0 +1,35 @@
+task:
+  - qsar-classification
+
+name: tox21
+description: "Link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data"
+
+markup_info: Информация о разметке
+date_time: 18.08.2014
+
+_target_: innofw.core.datamodules.pandas_datamodules.QsarDataModule
+###### Case: remote data #####
+train:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/tox21/train.zip
+  target: ./data/tox21/train
+test:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/tox21/test.zip
+  target: ./data/tox21/test
+
+infer:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/tox21/test.zip
+  target: ./data/tox21/test
+
+##############################
+###### Case: local data ######
+#train:
+# source: /local/path/train.csv
+#test:
+# source: /local/path/test.csv
+##############################
+
+
+# Available targets
+#target_col: [NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53]
+smiles_col: smiles
+val_size: 0.2
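
The tox21 config distinguishes a "remote data" case (a `source` URL plus a local `target` directory) from a "local data" case (a `source` path only). A minimal sketch of how a loader might tell the two apart is given below; `is_remote` is a hypothetical name, and the framework's own check is not visible in this diff.

```python
from urllib.parse import urlparse


def is_remote(source: str) -> bool:
    """Return True when `source` is an HTTP(S) URL rather than a local path.

    Hypothetical helper for illustration only.
    """
    return urlparse(source).scheme in {"http", "https"}


sources = [
    "https://api.blackhole.ai.innopolis.university/public-datasets/tox21/train.zip",
    "./data/bitumen/train/train.csv",
    "/local/path/train.csv",
]
for source in sources:
    print(is_remote(source), source)
```

Under this reading, a remote `source` would be downloaded and unpacked into `target`, while a local `source` would be read in place.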

config/datasets/mcl1_pic50.yaml

+19
@@ -0,0 +1,19 @@
+task:
+  - qsar-regression
+
+name: mcl1_pic50
+description: Preprocessed MCL1 dataset
+
+markup_info: Markap info
+date_time: 15.08.23
+
+_target_: innofw.core.datamodules.pandas_datamodules.QsarDataModule
+
+train:
+  source: ./data/mcl1_pic50/train
+test:
+  source: ./data/mcl1_pic50/test
+
+smiles_col: Clean Smiles
+target_col: pIC50
+val_size: 0.2

config/datasets/pipes.yaml

+18
@@ -0,0 +1,18 @@
+task:
+  - table-regression
+
+name: pipes
+description: Sibur pipes
+
+markup_info: Информация о разметке
+date_time: 31.08.2020
+
+_target_: innofw.core.datamodules.pandas_datamodules.PandasDataModule
+
+train:
+  source: ./data/pipes/train/train.csv
+test:
+  source: ./data/pipes/test/test.csv
+
+
+target_col: result

config/datasets/polymers.yaml

+33
@@ -0,0 +1,33 @@
+task:
+  - table-regression
+
+name: polymers
+description: polymers properties
+
+markup_info: Информация о разметке
+date_time: 21.07.2022
+
+_target_: innofw.core.datamodules.pandas_datamodules.PandasDataModule
+
+
+train:
+  source: ./data/polymers/train/train.csv
+test:
+  source: ./data/polymers/test/test.csv
+
+val_size: 0.2
+target_col:
+  - "Модуль упругости при изгибе_МПа"
+  - "Xs_ISO 16152_%"
+  - "Модуль упругости при растяжении_МПа"
+  - "Относительное удлинение при пределе текучести_%"
+  - "Относительное удлинение при разрыве_%"
+  - "Предел текучести при растяжении_МПа"
+  - "Прочность при разрыве_МПа"
+  - "Твердость по Шору_D/1"
+  - "Твердость по Шору_D/15"
+  - "Температура изгиба под нагрузкой 0,45МПа_C"
+  - "Температура размягчения по Вика, С_10Н"
+  - "Температура размягчения по Вика, С_50Н"
+  - "Ударная вязкость по Изоду с/н, 23 C_ISO 180_кДж/м2"
+  - "Ударная вязкость по Изоду с/н, 23 C_Дж/м"

config/datasets/qm7.yaml

+19
@@ -0,0 +1,19 @@
+task:
+  - qsar-regression
+
+name: qm7
+description: "Link: http://quantum-machine.org/datasets/"
+
+markup_info: Информация о разметке
+date_time: 01.01.2012
+
+_target_: innofw.core.datamodules.pandas_datamodules.PandasDataModule
+
+train:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/qm7/train.zip
+  target: ./data/qm7/train
+test:
+  source: https://api.blackhole.ai.innopolis.university/public-datasets/qm7/test.zip
+  target: ./data/qm7/test
+
+target_col: target

@@ -0,0 +1,15 @@
+# @package _global_
+defaults:
+  - override /models: classification/catboost_classification
+  - override /datasets: classification/tox21
+  - override /callbacks: classification
+
+
+project: "tox21"
+task: "qsar-classification"
+random_seed: 42
+
+datasets:
+  # Available targets
+  #[NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53]
+  target_col: NR-AR
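
This experiment file selects a dataset config through the `defaults` list and then sets `datasets.target_col` on top of it. In Hydra the composition is handled by OmegaConf; the stdlib sketch below only imitates the effect with a recursive dictionary merge so the resulting values are easy to see. The `deep_merge` helper is hypothetical, and the dataset dictionary is abbreviated from the tox21 config above.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` on `base` without mutating either.

    Illustrative stand-in for config composition, not Hydra's own code.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Abbreviated dataset config selected by `override /datasets`.
composed = {"datasets": {"name": "tox21", "smiles_col": "smiles", "val_size": 0.2}}
# Keys the experiment file sets directly.
experiment = {"project": "tox21", "datasets": {"target_col": "NR-AR"}}

config = deep_merge(composed, experiment)
print(config["datasets"])
```

The experiment-level `target_col` is added alongside the dataset's own keys rather than replacing the whole `datasets` section, which is why the dataset file can leave its `target_col` commented out.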

@@ -0,0 +1,10 @@
+# @package _global_
+defaults:
+  - override /models: regression/catboost_regression
+  - override /datasets: pipes
+  - override /callbacks: regression
+
+
+project: "pipes"
+task: "table-regression"
+random_seed: 42

@@ -0,0 +1,10 @@
+# @package _global_
+defaults:
+  - override /models: regression/catboost_regression
+  - override /datasets: bopp
+  - override /callbacks: regression
+
+
+project: "bopp"
+task: "table-regression"
+random_seed: 42

@@ -0,0 +1,10 @@
+# @package _global_
+defaults:
+  - override /models: regression/catboost_regression
+  - override /datasets: mcl1_pic50
+  - override /callbacks: regression
+
+
+project: "mcl1_pic50"
+task: "qsar-regression"
+random_seed: 42

@@ -0,0 +1,14 @@
+# @package _global_
+defaults:
+  - override /models: regression/catboost_regression
+  - override /datasets: bitumen
+  - override /callbacks: regression
+
+
+project: "bitumen"
+task: "table-regression"
+random_seed: 42
+
+models:
+  loss_function: MultiRMSE
+  allow_const_label: true
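
The bitumen experiment switches CatBoost to `loss_function: MultiRMSE` because the dataset has six target columns. As I understand CatBoost's documented definition, MultiRMSE sums the squared errors over all target dimensions of each sample, averages over samples (with optional weights), and takes the square root. The plain-Python function below is a sketch of that formula for intuition, not CatBoost's implementation; the name `multi_rmse` is my own.

```python
import math


def multi_rmse(targets, predictions, weights=None):
    """sqrt( sum_i w_i * sum_d (a_id - t_id)^2 / sum_i w_i ).

    `targets` and `predictions` are lists of equally sized rows,
    one row per sample and one column per target dimension.
    """
    if weights is None:
        weights = [1.0] * len(targets)
    weighted_squared_error = sum(
        w * sum((p - t) ** 2 for t, p in zip(t_row, p_row))
        for w, t_row, p_row in zip(weights, targets, predictions)
    )
    return math.sqrt(weighted_squared_error / sum(weights))


# One sample, two targets, errors of 3 and 4.
print(multi_rmse([[0.0, 0.0]], [[3.0, 4.0]]))
```

Because errors are summed across dimensions before averaging, targets on larger scales dominate the loss unless the columns are normalized beforehand.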

@@ -0,0 +1,10 @@
+# @package _global_
+defaults:
+  - override /models: regression/catboost_regression
+  - override /datasets: qm7
+  - override /callbacks: regression
+
+
+project: "qm7"
+task: "table-regression"
+random_seed: 42

@@ -0,0 +1,13 @@
+# @package _global_
+defaults:
+  - override /models: regression/catboost_regression
+  - override /datasets: polymers
+  - override /callbacks: regression
+
+
+project: "polymers"
+task: "table-regression"
+random_seed: 42
+
+models:
+  loss_function: MultiRMSEWithMissingValues

@@ -0,0 +1,27 @@
+# @package _global_
+defaults:
+  - override /models: text-vae/hier_vae.py
+  - override /datasets: chembl_33_smiles
+  - override /losses: simple_vae
+
+
+project: chem-vae
+task: text-vae
+random_seed: 42
+accelerator: gpu
+devices: 1
+batch_size: 128
+epochs: 1
+num_workers: 0
+
+trainer:
+  limit_train_batches: 10
+
+datasets:
+  work_mode: vae
+
+models:
+  encoder:
+    in_dim: 439383 # len(alphabet) * max(len_mols)
+  decoder:
+    out_dimension: 343 # len(alphabet)
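
The two inline comments tie the encoder and decoder sizes together: `in_dim` is `len(alphabet) * max(len_mols)` and `out_dimension` is `len(alphabet)`. A quick arithmetic check, assuming both comments are accurate, recovers the maximum molecule length implied by the config. The variable names are my own.

```python
alphabet_size = 343       # out_dimension in the config, len(alphabet)
encoder_in_dim = 439383   # in_dim in the config, len(alphabet) * max(len_mols)

# If the comments hold, in_dim divides evenly by the alphabet size and the
# quotient is the longest tokenized molecule in the dataset.
max_len, remainder = divmod(encoder_in_dim, alphabet_size)
print(max_len, remainder)  # 1281 0
```

The division is exact, so the values are consistent with an input formed by flattening a 1281-token sequence of 343-way one-hot vectors. Both numbers would need to be recomputed if the dataset or alphabet changes.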

config/losses/simple_vae.yaml

+16
@@ -0,0 +1,16 @@
+name: ELBO
+description: Evidence lower bound
+task:
+  - text-vae
+  - text-vae-forward
+
+implementations:
+  torch:
+    mse:
+      weight: 1.0
+      object:
+        _target_: torch.nn.MSELoss
+    kld:
+      weight: 0.1
+      object:
+        _target_: innofw.core.losses.kld.KLD
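
This loss config pairs a reconstruction term (`torch.nn.MSELoss`, weight 1.0) with a KL-divergence term (weight 0.1). The sketch below shows how such a weighted sum could be computed in plain Python. It assumes the standard closed-form KL divergence between a diagonal Gaussian and a standard normal; the `innofw.core.losses.kld.KLD` class is not shown in this diff and may use a different reduction. All function names here are illustrative.

```python
import math


def mse(x, x_hat):
    """Mean squared error, matching the default `mean` reduction of MSELoss."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kld(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))


def weighted_loss(x, x_hat, mu, logvar, mse_weight=1.0, kld_weight=0.1):
    """Weighted sum of the two terms, using the weights from the config."""
    return mse_weight * mse(x, x_hat) + kld_weight * kld(mu, logvar)


# Reconstruction error of 1.0 and a latent mean shifted by 1.0.
print(weighted_loss([0.0], [1.0], mu=[1.0], logvar=[0.0]))
```

A KLD weight below 1.0 puts more emphasis on reconstruction quality, at the cost of a latent space that is regularized less strongly toward the prior.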

@@ -0,0 +1,4 @@
+name: catboost classifier
+description: CatBoost classification model
+_target_: catboost.CatBoostClassifier
+verbose: 100
