feat: templates and sampler testing/fixing #1

AdrianM0 · 2025-02-20T16:21:26Z

Summary by Sourcery

Update datasets with new templates and enhance continuous variable handling. Improve the transform scripts by simplifying the process and removing redundant steps.

Enhancements:

Updated the choline_transporter_butkiewicz, flashpoint, thermosol, nlmchem, ord_masked, ord_procedure_steps, mp_descriptions, block_polymers_morphology, rhea_db_masked, MUV_466, MUV_548, MUV_600, MUV_644, MUV_652, MUV_689, MUV_692, MUV_712, MUV_713, MUV_733, MUV_737, MUV_810, MUV_832, MUV_846, MUV_852, MUV_858, MUV_859, solubility_aqsoldb, and uniprot_organisms datasets with additional templates and improved prompts.
Improved the handling of continuous variables in the _get_target_from_row and _get_choices_with_indicator functions within the sampler.py module to ensure accurate formatting and representation of numerical values.
Simplified the transform.py scripts for several datasets to directly save the processed data to a CSV file, removing the redundant creation of a meta.yaml file within the script.

sourcery-ai · 2025-02-20T16:21:30Z

Reviewer's Guide by Sourcery

This pull request introduces new templates for various datasets, fixes a bug in the sampler, and updates the data processing pipeline for some datasets. The changes primarily focus on improving the usability and functionality of the datasets for various tasks, such as property prediction, structure generation, and reaction component prediction. The transform scripts for choline_transporter_butkiewicz and nlmchem datasets were also updated to remove the meta.yaml creation.

Updated class diagram for DataSampler

classDiagram
    class DataSampler {
        -meta: dict
        -df: pd.DataFrame
        -config: dict
        -_get_target_from_row(sample: pd.Series, var: str) str
        -_get_choices_with_indicator(symbols: List[str], multiple_choice_var: str, multiple_choice_indicator: str) Tuple[List[str], int]
        -_format_choices(symbols: List[str], choices: List[str]) str
        -_fill_template(template: str, sample_dict: Dict[str, Union[str, List[str]]]) str
    }
    note for DataSampler "The _get_target_from_row method was updated to handle continuous variables with significant digits."
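
As an illustration of what formatting a continuous value to a fixed number of significant digits can look like, here is a minimal sketch. It is not the code in `sampler.py`: the function name `format_significant` and the default of three digits are assumptions made for this example.

```python
def format_significant(value: float, digits: int = 3) -> str:
    """Format a number with `digits` significant digits in plain decimal notation.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if value == 0:
        return "0"
    # Scientific notation rounds to the requested number of significant digits
    # and exposes the decimal exponent without floating-point log tricks.
    sci = f"{value:.{digits - 1}e}"
    exponent = int(sci.split("e")[1])
    rounded = float(sci)
    # Digits needed after the decimal point; never negative.
    decimals = max(digits - 1 - exponent, 0)
    return f"{rounded:.{decimals}f}"


print(format_significant(123.456))    # 123
print(format_significant(0.0012345))  # 0.00123
print(format_significant(9.999))      # 10.0
```

Rounding through the scientific-notation string handles carry cases such as 9.999 correctly, which a plain `round(value, n)` with a precomputed exponent does not.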

File-Level Changes

Change	Details	Files
Updated metadata for the `choline_transporter_butkiewicz` dataset to include a more detailed description, target information, benchmark details, identifiers, license, relevant links, number of data points, and corresponding publications in BibTeX format.	Updated the dataset description to provide more context on its origin and curation process. Added target information, including the ID, description, units, type, names, and PubChem AIDs. Included benchmark details, such as the name and link to the TDC benchmark, and the split column. Specified the SMILES identifier with its type and description. Defined the license under which the dataset is published. Added links to the original dataset and corresponding publications. Updated the number of data points in the dataset. Provided BibTeX entries for relevant publications. Added templates for generating prompts and tasks related to the dataset.	`data/tabular/choline_transporter_butkiewicz/meta.yaml`
Added templates to the `flashpoint` dataset's metadata for generating prompts and tasks related to flashpoint prediction.	Added templates for predicting the flashpoint of a compound given its SMILES representation. Included templates for answering questions about the flashpoint of a compound. Added templates for generating a compound with a specific flashpoint. Added templates for conversational interactions about flashpoints. Included templates for multiple-choice questions related to flashpoint prediction.	`data/tabular/flashpoint/meta.yaml`
Added templates to the `thermosol` dataset's metadata for generating prompts and tasks related to solubility prediction.	Added templates for identifying the solubility of a compound given its SMILES representation. Included templates for multiple-choice questions related to solubility prediction.	`data/tabular/thermosol/meta.yaml`
Updated the `block_polymers_morphology` dataset processing script to download the dataset from Hugging Face Hub instead of a direct URL.	Replaced the direct CSV URL with a Hugging Face Hub download link. Removed data cleaning and transformation steps previously performed in the script.	`data/tabular/block_polymers_morphology/transform.py`
Added templates to the `mofdscribe` dataset's metadata for generating prompts and tasks related to MOF structure generation.	Added templates for generating CIF files from MOF descriptions. Included templates for translating descriptions into CIF representations. Added templates for converting MOF structure descriptions into CIF files.	`data/tabular/mofdscribe/meta.yaml`
Fixed a bug in the sampler that caused continuous variables to not be formatted correctly.	Ensured that continuous variables are formatted with the correct number of significant digits when generating multiple-choice options. Added a check to ensure that the value is a string before attempting to join it.	`src/chemnlp/data/sampler.py`
Added templates to the `uniprot_organisms` dataset's metadata for generating prompts and tasks related to organism prediction from protein sequences.	Added templates for predicting the organism from a given amino acid sequence. Included templates for multiple-choice questions related to organism prediction.	`data/tabular/uniprot_organisms/meta.yaml`
Added templates to the MUV datasets' metadata for generating prompts and tasks related to activity prediction.	Added templates for predicting activity of a molecule given its SMILES representation. Included templates for generating a molecule with a specific activity. Added templates for conversational interactions about activity.	`data/tabular/MUV_466/meta.yaml` `data/tabular/MUV_548/meta.yaml` `data/tabular/MUV_600/meta.yaml` `data/tabular/MUV_644/meta.yaml` `data/tabular/MUV_652/meta.yaml` `data/tabular/MUV_689/meta.yaml` `data/tabular/MUV_692/meta.yaml` `data/tabular/MUV_712/meta.yaml` `data/tabular/MUV_713/meta.yaml` `data/tabular/MUV_733/meta.yaml` `data/tabular/MUV_737/meta.yaml` `data/tabular/MUV_810/meta.yaml` `data/tabular/MUV_832/meta.yaml` `data/tabular/MUV_852/meta.yaml` `data/tabular/MUV_858/meta.yaml` `data/tabular/MUV_859/meta.yaml` `data/tabular/MUV_846/meta.yaml`
Added templates to the `solubility_aqsoldb` dataset's metadata for generating prompts and tasks related to solubility prediction.	Added templates for predicting the solubility of a compound given its SMILES representation. Included templates for conversational interactions about solubility.	`data/tabular/solubility_aqsoldb/meta.yaml`
Added templates to the `nlmchem` dataset's metadata for generating prompts and tasks related to abbreviation expansion.	Added templates for expanding abbreviations. Included templates for conversational interactions about abbreviation expansion.	`data/tabular/nlmchem/meta.yaml`
Added templates to the `ord_masked` dataset's metadata for generating prompts and tasks related to reaction component prediction.	Added templates for predicting the masked component in a reaction. Included templates for analyzing reactions and identifying masked chemical entities.	`data/tabular/ord_masked/meta.yaml`
Added templates to the `ord_procedure_steps` dataset's metadata for generating prompts and tasks related to procedure step extraction.	Added templates for converting procedures into step strings. Included templates for identifying steps involved in a procedure.	`data/tabular/ord_procedure_steps/meta.yaml`
Added templates to the `mp_descriptions` dataset's metadata for generating prompts and tasks related to crystal structure generation from descriptions.	Added templates for generating CIF strings from descriptions. Included templates for conversational interactions about crystal structure generation.	`data/tabular/mp_descriptions/meta.yaml`
Added templates to the `block_polymers_morphology` dataset's metadata for generating prompts and tasks related to polymer property prediction.	Added templates for predicting polymer properties given BigSMILES representation. Included templates for conversational interactions about polymer design.	`data/tabular/block_polymers_morphology/meta.yaml`
Added templates to the `rhea_db_masked` dataset's metadata for generating prompts and tasks related to reaction component prediction.	Added templates for predicting the masked component in a reaction. Included templates for identifying undisclosed chemicals in reactions.	`data/tabular/rhea_db_masked/meta.yaml`
Removed the meta.yaml creation from the transform scripts for `choline_transporter_butkiewicz` and `nlmchem` datasets.	Removed the code that creates the meta.yaml file. Removed the code that dumps the meta dictionary to a yaml file.	`data/tabular/choline_transporter_butkiewicz/transform.py` `data/tabular/nlmchem/transform.py`
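
The sampler row above mentions checking whether a value is a string before joining it. The sketch below shows why such a check matters in Python; the helper `to_text` and its separator are hypothetical and are not the repository's implementation.

```python
from typing import Any


def to_text(value: Any, sep: str = ", ") -> str:
    """Turn a template value into text without splitting strings into characters.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, str):
        # A string is already text; joining it would iterate over its characters.
        return value
    if isinstance(value, (list, tuple)):
        return sep.join(str(item) for item in value)
    return str(value)


print(to_text("CCO"))           # CCO
print(to_text(["CCO", "CCC"]))  # CCO, CCC
print(", ".join("CCO"))         # C, C, O  (the pitfall the check avoids)
```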

sourcery-ai

Hey @AdrianM0 - I've reviewed your changes - here's some feedback:

Overall Comments:

Consider adding a script to automatically update the number of data points in the meta.yaml file.
It would be helpful to include a brief description of the changes made in the transform.py files in the PR description.
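
A script along the lines of the first comment could look like the sketch below. It assumes that each dataset directory holds a `data_clean.csv` next to its `meta.yaml` and that the count lives in a top-level `num_points:` key; the key name and all function names are assumptions for this example. It edits the line with a regular expression instead of re-dumping the YAML, so the existing comments and key order in the file stay untouched.

```python
import csv
import re
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count data rows, skipping the header and any blank lines."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)


def update_num_points(meta_path: Path, n_rows: int) -> None:
    """Rewrite the assumed top-level `num_points:` line, or append it if absent."""
    text = meta_path.read_text(encoding="utf-8")
    new_text, n_subs = re.subn(
        r"(?m)^num_points:.*$", f"num_points: {n_rows}", text, count=1
    )
    if n_subs == 0:
        new_text = text.rstrip("\n") + f"\nnum_points: {n_rows}\n"
    meta_path.write_text(new_text, encoding="utf-8")


def sync_dataset(dataset_dir: Path) -> int:
    """Update meta.yaml in `dataset_dir` from its data_clean.csv; return the count."""
    n_rows = count_rows(dataset_dir / "data_clean.csv")
    update_num_points(dataset_dir / "meta.yaml", n_rows)
    return n_rows
```

Calling `sync_dataset(Path("data/tabular/flashpoint"))` would then refresh that one dataset; looping over the subdirectories of `data/tabular` would cover all of them.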

Here's what I looked at during the review

🟡 General issues: 2 issues found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟢 Documentation: all looks good

sourcery-ai · 2025-02-20T16:22:27Z

data/tabular/choline_transporter_butkiewicz/transform.py

@@ -37,207 +37,8 @@ def get_and_transform_data():

    # save to csv
    fn_data_csv = "data_clean.csv"
+    # shuffle 


suggestion: Clarify the shuffling intent in the data transform script.

If the comment indicates that the data should be shuffled, please implement the corresponding shuffle logic; if not, consider removing the comment to avoid confusion.
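
If shuffling is in fact intended, a reproducible version needs a fixed seed. The sketch below shows the idea with the standard library only; the function name `shuffled_rows` and the seed value are assumptions for this example, and nothing here is taken from the actual `transform.py`.

```python
import random
from typing import List, Sequence


def shuffled_rows(rows: Sequence[Sequence[str]], seed: int = 42) -> List[Sequence[str]]:
    """Return a shuffled copy of `rows`; the same seed always gives the same order."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, no global state change
    return shuffled
```

With a pandas DataFrame, as used in the transform scripts, the equivalent one-liner would be `df.sample(frac=1, random_state=42).reset_index(drop=True)` placed before the `to_csv` call.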

sourcery-ai · 2025-02-20T16:22:27Z

data/tabular/block_polymers_morphology/transform.py

-    print(len(df))
-    df[columns_to_keep].to_csv("data_clean.csv", index=False)
-
+    df = hf_hub_download(repo_id="AdrianM0/block_polymers_morphology", filename="diblock.csv", repo_type="dataset")


suggestion: Switching to hf_hub_download for file fetching improves reproducibility.

Consider adding error handling or logging for the download process to gracefully handle failures or missing files.
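
One way to act on this suggestion is a small retry wrapper with logging around the download call, sketched below. The wrapper name `download_with_retry`, the attempt count, and the delay are assumptions for this example. Note also that `hf_hub_download` returns the local path of the cached file, not a DataFrame, so the result still has to be passed to something like `pd.read_csv`.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def download_with_retry(
    download: Callable[[], str], attempts: int = 3, delay: float = 1.0
) -> str:
    """Call `download` until it succeeds or `attempts` is exhausted.

    `download` is any zero-argument callable that returns a local file path.
    Hypothetical helper for illustration; it catches all exceptions for brevity.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except Exception as exc:
            last_error = exc
            logger.warning("Download attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(f"Download failed after {attempts} attempts") from last_error


# Intended use in the transform script (not executed here, needs network access):
# from huggingface_hub import hf_hub_download
# import pandas as pd
# path = download_with_retry(
#     lambda: hf_hub_download(
#         repo_id="AdrianM0/block_polymers_morphology",
#         filename="diblock.csv",
#         repo_type="dataset",
#     )
# )
# df = pd.read_csv(path)
```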

review-notebook-app · 2025-02-21T10:19:14Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

AdrianM0 added 15 commits February 20, 2025 09:01

feat: mofdscribe templates

de063b9

feat: update nlchem templates

161d702

feat: aqsoldb templates

e37da57

feat: enhance MUV_600

a9d30aa

feat: MUV templates updated

eb4f2e6

feat: enhanced rhea_db_masked templates

79ca3dd

feat: flashpoint templatees

8b4ff1b

feat: complede ord_masked templates

8753b41

feat: mp_description templates

e2544a4

fix: dataset path

1456282

fix: bug with the sampler

8307fdc

chore: fix yaml

9c08b6f

feat: MCQ templates for uniprot

24eb5b6

feat: add 2 templates to ord procedure_steps

3a417b0

feat: improve sampling

71ad84c

sourcery-ai bot reviewed Feb 20, 2025

AdrianM0 added 13 commits February 20, 2025 16:29

feat: more templates

3d115de

feat: add more templates

78fe354

feat: enrich chem-caption templates

2972caa

feat: astrazeneca clearance templates

f14ae35

feat: more templates

68e165f

feat: more templates

a6709f9

chore: linting

291b883

feat: add templates chemistry stackexchange

5b784c6

feat: templates for ocp complete

7d1e836

feat: add more templates ld50

0a00630

feat: physics stack exchange templates

0a49978

feat: chemdner complete templates

f5fe85d

feat: ord_steps templates

23f9943

AdrianM0 added 13 commits February 21, 2025 09:07

feat: mattermodelling stackexchange templates

954c452

feat: more herg templates

bd7c177

feat: add more templates

8d107ba

feat: more templates

a0184ae

feat: caco2wang templates

180c99e

feat: bc5 datasets

a49992a

feat: add meta and fix templates

61ffff6

feat: remove EOI

09acb26

feat: more corrections

7351448

feat: remove meta from script

02b3116

feat: major lint

6b7e9fd

feat: update templates

f25c5a0

feat: remove meta.yaml saver

8b91211

AdrianM0 added 2 commits February 21, 2025 10:27

fix: chebi20 dataset

55cca1c

feat: final changes

820e789

AdrianM0 merged commit 73172bc into main Feb 21, 2025
