feat: templates and sampler testing/fixing #1
Reviewer's Guide by Sourcery

This pull request introduces new templates for various datasets, fixes a bug in the sampler, and updates the data processing pipeline for some datasets. The changes primarily focus on improving the usability and functionality of the datasets for various tasks, such as property prediction, structure generation, and reaction component prediction. The transform scripts for

Updated class diagram for DataSampler

classDiagram
class DataSampler {
-meta: dict
-df: pd.DataFrame
-config: dict
-_get_target_from_row(sample: pd.Series, var: str) str
-_get_choices_with_indicator(symbols: List[str], multiple_choice_var: str, multiple_choice_indicator: str) Tuple[List[str], int]
-_format_choices(symbols: List[str], choices: List[str]) str
-_fill_template(template: str, sample_dict: Dict[str, Union[str, List[str]]]) str
}
note for DataSampler "The _get_target_from_row method was updated to handle continuous variables with significant digits."
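
The note above mentions significant-digit handling for continuous variables. As a rough illustration of what such formatting can look like, here is a minimal sketch. The helper name `round_to_significant_digits` and the default of three digits are assumptions for this example and are not taken from sampler.py.

```python
def round_to_significant_digits(value: float, digits: int = 3) -> float:
    """Round a number to a fixed count of significant digits.

    Hypothetical helper for illustration only; the real
    _get_target_from_row implementation may differ.
    """
    # The "g" presentation type keeps `digits` significant digits
    # and switches to exponent notation when needed.
    return float(f"{value:.{digits}g}")
```

With three significant digits, 123.456 becomes 123.0 and 0.00123456 becomes 0.00123, so very small and very large targets are truncated consistently.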
Hey @AdrianM0 - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding a script to automatically update the number of data points in the meta.yaml file.
- It would be helpful to include a brief description of the changes made in the transform.py files in the PR description.
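
One possible shape for the script suggested in the first comment is sketched below. It assumes that meta.yaml stores the row count under a top-level key called `num_points` and that the cleaned data lives in a CSV file with a header row; both are assumptions for this example. It edits the file as text with the standard library so that comments and key order in meta.yaml are left alone.

```python
import csv
import re
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file, excluding the header line."""
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if present
        return sum(1 for _ in reader)


def update_num_points(meta_path: Path, csv_path: Path) -> int:
    """Write the CSV row count into meta.yaml.

    Assumes a top-level `num_points:` key (hypothetical name);
    the key is appended if it is missing.
    """
    n_rows = count_rows(csv_path)
    text = meta_path.read_text()
    pattern = re.compile(r"^num_points:.*$", flags=re.MULTILINE)
    if pattern.search(text):
        new_text = pattern.sub(f"num_points: {n_rows}", text, count=1)
    else:
        new_text = text.rstrip("\n") + f"\nnum_points: {n_rows}\n"
    meta_path.write_text(new_text)
    return n_rows
```

Calling `update_num_points(Path("meta.yaml"), Path("data_clean.csv"))` from each dataset directory after running the transform script would keep the two files in sync.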
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
@@ -37,207 +37,8 @@ def get_and_transform_data():

# save to csv
fn_data_csv = "data_clean.csv"
# shuffle
suggestion: Clarify the shuffling intent in the data transform script.
If the comment indicates that the data should be shuffled, please implement the corresponding shuffle logic; if not, consider removing the comment to avoid confusion.
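
If shuffling is in fact intended, the logic could look like the sketch below. The function name `shuffle_rows` and the seed value are assumptions for this example; the transform script itself may want a different seed or no shuffling at all.

```python
import pandas as pd


def shuffle_rows(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Return a copy of df with its rows in a random but reproducible order."""
    # frac=1 samples every row once; a fixed random_state makes
    # the order repeatable across runs.
    shuffled = df.sample(frac=1, random_state=seed)
    # Drop the old index so the saved CSV does not carry the original order.
    return shuffled.reset_index(drop=True)
```

Fixing the seed keeps data_clean.csv identical between runs, which matters if the file is checked for changes or re-uploaded.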
print(len(df))
df[columns_to_keep].to_csv("data_clean.csv", index=False)

df = hf_hub_download(repo_id="AdrianM0/block_polymers_morphology", filename="diblock.csv", repo_type="dataset")
suggestion: Switching to hf_hub_download for file fetching improves reproducibility.
Consider adding error handling or logging for the download process to gracefully handle failures or missing files.
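
A small wrapper along the following lines is one way to add the suggested logging and failure handling. The name `safe_download` is hypothetical, and the download function is passed in as an argument so the wrapper does not depend on any particular library. Note that `hf_hub_download` returns a local file path, so the result still has to be read (for example with `pd.read_csv`) before it can be used as a DataFrame.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_download(download_fn: Callable[..., str], **kwargs) -> Optional[str]:
    """Call download_fn(**kwargs) and log the outcome.

    Returns the local path on success, or None if the download raised.
    Hypothetical helper for illustration only.
    """
    try:
        path = download_fn(**kwargs)
    except Exception as exc:  # broad on purpose: network, auth, missing file
        logger.error("Download failed for %s: %s", kwargs, exc)
        return None
    logger.info("Downloaded %s to %s", kwargs.get("filename"), path)
    return path


# Possible usage in the transform script (not executed here):
# from huggingface_hub import hf_hub_download
# path = safe_download(
#     hf_hub_download,
#     repo_id="AdrianM0/block_polymers_morphology",
#     filename="diblock.csv",
#     repo_type="dataset",
# )
# if path is not None:
#     df = pd.read_csv(path)
```

Returning None instead of re-raising is a design choice for this sketch; a script that cannot proceed without the file may prefer to log and then raise.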
Summary by Sourcery
Update datasets with new templates and enhance continuous variable handling. Improve the transform scripts by simplifying the process and removing redundant steps.
Enhancements:
- Update the choline_transporter_butkiewicz, flashpoint, thermosol, nlmchem, ord_masked, ord_procedure_steps, mp_descriptions, block_polymers_morphology, rhea_db_masked, MUV_466, MUV_548, MUV_600, MUV_644, MUV_652, MUV_689, MUV_692, MUV_712, MUV_713, MUV_733, MUV_737, MUV_810, MUV_832, MUV_846, MUV_852, MUV_858, MUV_859, solubility_aqsoldb, and uniprot_organisms datasets with additional templates and improved prompts.
- Enhance the _get_target_from_row and _get_choices_with_indicator functions within the sampler.py module to ensure accurate formatting and representation of numerical values.
- Simplify the transform.py scripts for several datasets to directly save the processed data to a CSV file, removing the redundant creation of a meta.yaml file within the script.