feat: templates and sampler testing/fixing #1

Merged
merged 43 commits into from
Feb 21, 2025

Conversation

@AdrianM0 (Collaborator) commented Feb 20, 2025

Summary by Sourcery

Update datasets with new templates and enhance continuous variable handling. Improve the transform scripts by simplifying the process and removing redundant steps.

Enhancements:

  • Updated the choline_transporter_butkiewicz, flashpoint, thermosol, nlmchem, ord_masked, ord_procedure_steps, mp_descriptions, block_polymers_morphology, rhea_db_masked, MUV_466, MUV_548, MUV_600, MUV_644, MUV_652, MUV_689, MUV_692, MUV_712, MUV_713, MUV_733, MUV_737, MUV_810, MUV_832, MUV_846, MUV_852, MUV_858, MUV_859, solubility_aqsoldb, and uniprot_organisms datasets with additional templates and improved prompts.
  • Improved the handling of continuous variables in the _get_target_from_row and _get_choices_with_indicator functions within the sampler.py module to ensure accurate formatting and representation of numerical values.
  • Simplified the transform.py scripts for several datasets to directly save the processed data to a CSV file, removing the redundant creation of a meta.yaml file within the script.
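
As a rough illustration of the continuous-variable handling described above, the sketch below rounds a value to a fixed number of significant digits before it is rendered into a prompt or a multiple-choice option. It is not the code from sampler.py: the helper names `round_to_significant_digits` and `format_continuous_choices` and the default of three digits are assumptions made for this example.

```python
from math import floor, isfinite, log10
from typing import List


def round_to_significant_digits(value: float, digits: int = 3) -> float:
    """Round `value` to `digits` significant digits (hypothetical helper)."""
    # Zero, NaN and infinities have no order of magnitude; return them as-is.
    if value == 0 or not isfinite(value):
        return value
    # Position of the leading digit, e.g. 3 for 1234.5 and -3 for 0.00123.
    magnitude = floor(log10(abs(value)))
    return round(value, digits - 1 - magnitude)


def format_continuous_choices(values: List[float], digits: int = 3) -> List[str]:
    """Format candidate answers so every option carries the same precision."""
    return [str(round_to_significant_digits(v, digits)) for v in values]


if __name__ == "__main__":
    print(format_continuous_choices([1234.5678, 0.00123456, 42.0]))
```

Rounding the target and the distractors with the same helper keeps the correct option from standing out by its precision alone.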


sourcery-ai bot commented Feb 20, 2025

Reviewer's Guide by Sourcery

This pull request introduces new templates for various datasets, fixes a bug in the sampler, and updates the data processing pipeline for some datasets. The changes primarily focus on improving the usability and functionality of the datasets for various tasks, such as property prediction, structure generation, and reaction component prediction. The transform scripts for choline_transporter_butkiewicz and nlmchem datasets were also updated to remove the meta.yaml creation.

Updated class diagram for DataSampler

classDiagram
    class DataSampler {
        -meta: dict
        -df: pd.DataFrame
        -config: dict
        -_get_target_from_row(sample: pd.Series, var: str) str
        -_get_choices_with_indicator(symbols: List[str], multiple_choice_var: str, multiple_choice_indicator: str) Tuple[List[str], int]
        -_format_choices(symbols: List[str], choices: List[str]) str
        -_fill_template(template: str, sample_dict: Dict[str, Union[str, List[str]]]) str
    }
    note for DataSampler "The _get_target_from_row method was updated to handle continuous variables with significant digits."

File-Level Changes

Change Details Files
Updated metadata for the choline_transporter_butkiewicz dataset to include a more detailed description, target information, benchmark details, identifiers, license, relevant links, number of data points, and corresponding publications in BibTeX format.
  • Updated the dataset description to provide more context on its origin and curation process.
  • Added target information, including the ID, description, units, type, names, and PubChem AIDs.
  • Included benchmark details, such as the name and link to the TDC benchmark, and the split column.
  • Specified the SMILES identifier with its type and description.
  • Defined the license under which the dataset is published.
  • Added links to the original dataset and corresponding publications.
  • Updated the number of data points in the dataset.
  • Provided BibTeX entries for relevant publications.
  • Added templates for generating prompts and tasks related to the dataset.
data/tabular/choline_transporter_butkiewicz/meta.yaml
Added templates to the flashpoint dataset's metadata for generating prompts and tasks related to flashpoint prediction.
  • Added templates for predicting the flashpoint of a compound given its SMILES representation.
  • Included templates for answering questions about the flashpoint of a compound.
  • Added templates for generating a compound with a specific flashpoint.
  • Added templates for conversational interactions about flashpoints.
  • Included templates for multiple-choice questions related to flashpoint prediction.
data/tabular/flashpoint/meta.yaml
Added templates to the thermosol dataset's metadata for generating prompts and tasks related to solubility prediction.
  • Added templates for identifying the solubility of a compound given its SMILES representation.
  • Included templates for multiple-choice questions related to solubility prediction.
data/tabular/thermosol/meta.yaml
Updated the block_polymers_morphology dataset processing script to download the dataset from Hugging Face Hub instead of a direct URL.
  • Replaced the direct CSV URL with a Hugging Face Hub download link.
  • Removed data cleaning and transformation steps previously performed in the script.
data/tabular/block_polymers_morphology/transform.py
Added templates to the mofdscribe dataset's metadata for generating prompts and tasks related to MOF structure generation.
  • Added templates for generating CIF files from MOF descriptions.
  • Included templates for translating descriptions into CIF representations.
  • Added templates for converting MOF structure descriptions into CIF files.
data/tabular/mofdscribe/meta.yaml
Fixed a bug in the sampler that caused continuous variables to be formatted incorrectly.
  • Ensured that continuous variables are formatted with the correct number of significant digits when generating multiple-choice options.
  • Added a check to ensure that the value is a string before attempting to join it.
src/chemnlp/data/sampler.py
Added templates to the uniprot_organisms dataset's metadata for generating prompts and tasks related to organism prediction from protein sequences.
  • Added templates for predicting the organism from a given amino acid sequence.
  • Included templates for multiple-choice questions related to organism prediction.
data/tabular/uniprot_organisms/meta.yaml
Added templates to the MUV datasets' metadata for generating prompts and tasks related to activity prediction.
  • Added templates for predicting activity of a molecule given its SMILES representation.
  • Included templates for generating a molecule with a specific activity.
  • Added templates for conversational interactions about activity.
data/tabular/MUV_466/meta.yaml
data/tabular/MUV_548/meta.yaml
data/tabular/MUV_600/meta.yaml
data/tabular/MUV_644/meta.yaml
data/tabular/MUV_652/meta.yaml
data/tabular/MUV_689/meta.yaml
data/tabular/MUV_692/meta.yaml
data/tabular/MUV_712/meta.yaml
data/tabular/MUV_713/meta.yaml
data/tabular/MUV_733/meta.yaml
data/tabular/MUV_737/meta.yaml
data/tabular/MUV_810/meta.yaml
data/tabular/MUV_832/meta.yaml
data/tabular/MUV_846/meta.yaml
data/tabular/MUV_852/meta.yaml
data/tabular/MUV_858/meta.yaml
data/tabular/MUV_859/meta.yaml
Added templates to the solubility_aqsoldb dataset's metadata for generating prompts and tasks related to solubility prediction.
  • Added templates for predicting the solubility of a compound given its SMILES representation.
  • Included templates for conversational interactions about solubility.
data/tabular/solubility_aqsoldb/meta.yaml
Added templates to the nlmchem dataset's metadata for generating prompts and tasks related to abbreviation expansion.
  • Added templates for expanding abbreviations.
  • Included templates for conversational interactions about abbreviation expansion.
data/tabular/nlmchem/meta.yaml
Added templates to the ord_masked dataset's metadata for generating prompts and tasks related to reaction component prediction.
  • Added templates for predicting the masked component in a reaction.
  • Included templates for analyzing reactions and identifying masked chemical entities.
data/tabular/ord_masked/meta.yaml
Added templates to the ord_procedure_steps dataset's metadata for generating prompts and tasks related to procedure step extraction.
  • Added templates for converting procedures into step strings.
  • Included templates for identifying steps involved in a procedure.
data/tabular/ord_procedure_steps/meta.yaml
Added templates to the mp_descriptions dataset's metadata for generating prompts and tasks related to crystal structure generation from descriptions.
  • Added templates for generating CIF strings from descriptions.
  • Included templates for conversational interactions about crystal structure generation.
data/tabular/mp_descriptions/meta.yaml
Added templates to the block_polymers_morphology dataset's metadata for generating prompts and tasks related to polymer property prediction.
  • Added templates for predicting polymer properties given BigSMILES representation.
  • Included templates for conversational interactions about polymer design.
data/tabular/block_polymers_morphology/meta.yaml
Added templates to the rhea_db_masked dataset's metadata for generating prompts and tasks related to reaction component prediction.
  • Added templates for predicting the masked component in a reaction.
  • Included templates for identifying undisclosed chemicals in reactions.
data/tabular/rhea_db_masked/meta.yaml
Removed the meta.yaml creation from the transform scripts for choline_transporter_butkiewicz and nlmchem datasets.
  • Removed the code that creates the meta.yaml file.
  • Removed the code that dumps the meta dictionary to a yaml file.
data/tabular/choline_transporter_butkiewicz/transform.py
data/tabular/nlmchem/transform.py


@sourcery-ai sourcery-ai bot left a comment


Hey @AdrianM0 - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding a script to automatically update the number of data points in the meta.yaml file.
  • It would be helpful to include a brief description of the changes made in the transform.py files in the PR description.
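
One way the first suggestion could look is sketched below. It assumes the count lives in a top-level `num_points:` key of meta.yaml and that the cleaned data is a CSV with a header row; both the key name and the function names are assumptions for this example, not something confirmed by the diff. The file is edited as text so that comments and key order in meta.yaml are left untouched.

```python
import csv
import re
from pathlib import Path

# Assumed layout: a top-level line such as "num_points: 1234" in meta.yaml.
NUM_POINTS_PATTERN = re.compile(r"^num_points:[ \t]*\d+[ \t]*$", flags=re.MULTILINE)


def count_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file, excluding the header and blank lines."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)


def update_num_points(meta_path: Path, csv_path: Path) -> int:
    """Rewrite the num_points entry in meta.yaml to match the CSV row count."""
    n_rows = count_rows(csv_path)
    text = meta_path.read_text(encoding="utf-8")
    new_text, n_subs = NUM_POINTS_PATTERN.subn(f"num_points: {n_rows}", text)
    if n_subs != 1:
        raise ValueError(f"expected exactly one num_points entry in {meta_path}")
    meta_path.write_text(new_text, encoding="utf-8")
    return n_rows
```

A call such as `update_num_points(Path("meta.yaml"), Path("data_clean.csv"))` could run at the end of each transform script or in a pre-commit hook.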
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@@ -37,207 +37,8 @@ def get_and_transform_data():

# save to csv
fn_data_csv = "data_clean.csv"
# shuffle

suggestion: Clarify the shuffling intent in the data transform script.

If the comment indicates that the data should be shuffled, please implement the corresponding shuffle logic; if not, consider removing the comment to avoid confusion.
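
If shuffling is indeed intended, a minimal sketch of the missing step could look like the following. The function name `shuffle_rows` and the seed value are assumptions for this example; a fixed seed is used so that the saved CSV is reproducible across runs.

```python
import pandas as pd


def shuffle_rows(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Return a copy of `df` with rows in a reproducible random order."""
    # frac=1 samples every row exactly once; reset_index drops the old order.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


if __name__ == "__main__":
    df = pd.DataFrame({"SMILES": ["C", "CC", "CCC", "CCCC"], "target": [0, 1, 0, 1]})
    shuffle_rows(df).to_csv("data_clean.csv", index=False)
```

In the transform script this would be applied right before the existing `to_csv` call, for example `shuffle_rows(df[columns_to_keep]).to_csv(fn_data_csv, index=False)`.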

print(len(df))
df[columns_to_keep].to_csv("data_clean.csv", index=False)

df = hf_hub_download(repo_id="AdrianM0/block_polymers_morphology", filename="diblock.csv", repo_type="dataset")

suggestion: Switching to hf_hub_download for file fetching improves reproducibility.

Consider adding error handling or logging for the download process to gracefully handle failures or missing files.
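
A possible shape for that error handling is sketched below. The wrapper name `download_dataset_file` and its `downloader` parameter are assumptions introduced for this example (the parameter exists only so the function can be exercised without network access); the default path calls `hf_hub_download` from `huggingface_hub` with the same arguments as the line under review.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def download_dataset_file(
    repo_id: str,
    filename: str,
    downloader: Optional[Callable[..., str]] = None,
) -> Path:
    """Fetch one file from a Hub dataset repo, logging and re-raising failures."""
    if downloader is None:
        # Imported lazily so the module loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download

        downloader = hf_hub_download
    try:
        local_path = downloader(repo_id=repo_id, filename=filename, repo_type="dataset")
    except Exception as exc:  # network errors, missing repo, missing file, ...
        logger.error("Could not download %s from %s: %s", filename, repo_id, exc)
        raise RuntimeError(f"download of {filename} from {repo_id} failed") from exc
    path = Path(local_path)
    if not path.is_file():
        raise FileNotFoundError(f"downloaded path does not exist: {path}")
    return path
```

Note that `hf_hub_download` returns the local path of the cached file, not a DataFrame, so the result still has to be read, for example with `pd.read_csv(download_dataset_file("AdrianM0/block_polymers_morphology", "diblock.csv"))`.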


@AdrianM0 AdrianM0 merged commit 73172bc into main Feb 21, 2025