Skip to content

Modify process_local_sources to support mount paths #1635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

Efrat1
Copy link
Contributor

@Efrat1 Efrat1 commented May 19, 2025

Summary

Modify process_local_sources to support mount paths

Type of Change (Mandatory)

  • Feature enhancement

Description (Mandatory)

Use common base_dir for all local data sources, and source_paths relative to that base.
The reason is to simplify path management in containerized environments, such as Docker. By using a common base_dir, we can ensure that paths remain consistent when mounting volumes, as only the base_dir needs to be adjusted to point to the mount path inside the container. This way, we only need to adjust the base_dir to point to the mount path.

  • Cosmetic changes in histology_s3 dataloader.
  • Change types names: "local" to "fs" and "azure_blob" to "ab"

Testing

Run full experiment in OpenFL and in OpenFL-Security.

Copy link
Collaborator

@teoparvanov teoparvanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

EDIT: additional changes and testing requested below, so please do not merge in the meantime.

@Efrat1 Efrat1 force-pushed the edar/datasets_parse_local_ds branch 2 times, most recently from 1580f40 to 391b2da Compare May 20, 2025 12:47
"""Process and validate data sources."""
cwd = os.getcwd()
os.getcwd()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only returns cwd. If cwd is unused, line also can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry right. removed of course

return LocalDataSource(source_name, rel_path, base_path=Path("."))
else:
raise ValueError(f"Invalid path for local data source '{source_name}': {path}.")
def process_local_sources(local_datasources, datasources, check_dir_traversal=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is also supposed to be staticmethod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, right.

# to the mount path inside the container.
# This way, we only need to adjust the base_dir to point to the mount path.
absolute_paths = {
source_name: os.path.realpath(params.get("path", None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

realpath would raise an exception if path is None. If we know params will always have path, better to set it as params['path'] IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed.

Comment on lines +234 to +238
"This file should contain a JSON object with the data sources to be registered. For local "
"data source, 'type' is 'fs', and 'params' must include: 'path'. For 's3' type, 'params' "
"must include: 'uri', 'access_key_env_name', 'secret_key_env_name', 'secret_name', and "
"optionally 'endpoint'. For azure_blob, 'type' is 'ab', and 'params' must include: "
"'connection_string', 'container_name', and optionally 'folder_prefix'."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be to point to OpenFL documentation URL that describes data sources in detail with examples, in addition to the format of JSON files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. @teoparvanov what doc file should we insert that info to?

@@ -258,7 +260,7 @@ def calchash(data_path):
sys.exit(1)
with open(datasources_json_path, "r", encoding="utf-8") as file:
data = file.read()
vds = DataSourcesJsonParser.parse(data)
vds = DataSourcesJsonParser.parse(data, check_dir_traversal=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, can we check_dir_traversal before calling this function? I don't think the class needs to know about whether it is in a valid subdirectory...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That means we go over the json one more time to look for the local data sources. Makes sense? currently happens only in calchash

@teoparvanov
Copy link
Collaborator

teoparvanov commented May 21, 2025

Hi @Efrat1, after addressing @MasterSkepticista's comments, I'd request that you additionally test those changes by running a containerized OpenFL federation locally.

@Efrat1 Efrat1 force-pushed the edar/datasets_parse_local_ds branch 2 times, most recently from d8751d7 to de73c55 Compare May 22, 2025 13:26
Signed-off-by: Dar, Efrat <efrat.dar@intel.com>
@Efrat1 Efrat1 force-pushed the edar/datasets_parse_local_ds branch from de73c55 to 465447c Compare May 25, 2025 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants