Commit b8e917a

hakan458 and matt-bernstein authored
feat: DIA-2202: Support imported tasks that point to different buckets (#7458)
Co-authored-by: hakan458 <hakan@heartex.com>
Co-authored-by: matt-bernstein <matt-bernstein@users.noreply.github.com>
1 parent a0ca4a2 commit b8e917a

5 files changed, +50 -16 lines changed

label_studio/io_storages/README.md

Lines changed: 26 additions & 1 deletion
@@ -184,4 +184,29 @@ The Storage Proxy API behavior can be configured using the following environment
 | `RESOLVER_PROXY_MAX_RANGE_SIZE` | Maximum size in bytes for a single range request | 7*1024*1024 |
 | `RESOLVER_PROXY_CACHE_TIMEOUT` | Cache TTL in seconds for proxy responses | 3600 |
 
-These optimizations ensure that the Proxy API remains responsive and resource-efficient, even when handling large files or many concurrent requests.
+These optimizations ensure that the Proxy API remains responsive and resource-efficient, even when handling large files or many concurrent requests.
+
+## Multiple Storages and URL Resolving
+
+There are use cases where multiple storages can/must be used in a single project. This can cause some confusion as to which storage gets used when. Here are some common cases and how to set up multiple storages properly.
+
+### Case 1 - Tasks Referencing Other Buckets
+* bucket-A containing JSON tasks
+* bucket-B containing images/text/other data
+* Tasks synced from bucket-A have references to data in bucket-B
+
+##### How To Set Up
+* Add storage 1 for bucket-A
+* Add storage 2 for bucket-B (might be same or different credentials than bucket-A)
+* Sync storage 1
+* All references to data in bucket-B will be resolved using storage 2 automatically
+
+### Case 2 - Buckets with Different Credentials
+* bucket-A accessible by credentials 1
+* bucket-B accessible by credentials 2
+
+##### How To Set Up
+* Add storage 1 for bucket-A with credentials 1
+* Add storage 2 for bucket-B with credentials 2
+* Sync both storages
+* The appropriate storage will be used to resolve urls/generate presigned URLs
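
To make Case 1 concrete, here is a minimal sketch of a task file that lives in bucket-A while its data points at bucket-B. The task content, the file location, and the `referenced_buckets` helper are illustrative assumptions, not part of this commit; the sketch only shows why a second storage for bucket-B is needed.

```python
import json
from urllib.parse import urlparse

# Hypothetical task file, imagined to be stored at s3://bucket-A/tasks/1.json
task_json = '{"data": {"image": "s3://bucket-B/images/cat.jpg", "caption": "a cat"}}'
task = json.loads(task_json)


def referenced_buckets(data: dict) -> set:
    """Collect (scheme, bucket) pairs for every string field that looks like a storage URI."""
    found = set()
    for value in data.values():
        if isinstance(value, str):
            parsed = urlparse(value)
            # Plain text such as "a cat" has no scheme or bucket and is skipped.
            if parsed.scheme and parsed.netloc:
                found.add((parsed.scheme, parsed.netloc))
    return found


buckets = referenced_buckets(task["data"])
print(buckets)
```

The task is synced through storage 1 (bucket-A), but the only bucket its data references is bucket-B, so the URL has to be resolved by storage 2.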

label_studio/io_storages/base_models.py

Lines changed: 14 additions & 3 deletions
@@ -27,7 +27,7 @@
 from django.utils import timezone
 from django.utils.translation import gettext_lazy as _
 from django_rq import job
-from io_storages.utils import get_uri_via_regex
+from io_storages.utils import get_uri_via_regex, parse_bucket_uri
 from rq.job import Job
 from tasks.models import Annotation, Task
 from tasks.serializers import AnnotationSerializer, PredictionSerializer
@@ -255,8 +255,19 @@ def can_resolve_scheme(self, url: Union[str, None]) -> bool:
             return False
         # TODO: Search for occurrences inside string, e.g. for cases like "gs://bucket/file.pdf" or "<embed src='gs://bucket/file.pdf'/>"
         _, prefix = get_uri_via_regex(url, prefixes=(self.url_scheme,))
-        if prefix == self.url_scheme:
-            return True
+        bucket_uri = parse_bucket_uri(url, self)
+
+        # If there is a prefix and the bucket matches the storage's bucket/container/path
+        if prefix == self.url_scheme and bucket_uri:
+            # bucket is used for s3 and gcs
+            if hasattr(self, 'bucket') and bucket_uri.bucket == self.bucket:
+                return True
+            # container is used for azure blob
+            if hasattr(self, 'container') and bucket_uri.bucket == self.container:
+                return True
+            # path is used for redis
+            if hasattr(self, 'path') and bucket_uri.bucket == self.path:
+                return True
         # if not found any occurrences - this Storage can't resolve url
         return False
 
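
The scheme-plus-bucket check in this hunk can be approximated outside Django as follows. This is a sketch under stated assumptions, not the project's code: `parse_bucket_uri` is not shown in this diff, so `urlparse` stands in for it, and `matches_storage` and the `SimpleNamespace` storages are hypothetical.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def matches_storage(url: str, storage) -> bool:
    """Return True only if both the URL scheme and the bucket-like part match the storage."""
    parsed = urlparse(url)
    if parsed.scheme != storage.url_scheme or not parsed.netloc:
        return False
    # Different storage types name the location differently:
    # bucket (S3, GCS), container (Azure Blob), path (Redis).
    return any(
        hasattr(storage, attr) and getattr(storage, attr) == parsed.netloc
        for attr in ("bucket", "container", "path")
    )


# Hypothetical storages carrying only the attributes the check reads.
s3_storage = SimpleNamespace(url_scheme="s3", bucket="bucket-A")
azure_storage = SimpleNamespace(url_scheme="azure-blob", container="media")

print(matches_storage("s3://bucket-A/tasks/1.json", s3_storage))
print(matches_storage("s3://bucket-B/images/cat.jpg", s3_storage))
print(matches_storage("azure-blob://media/file.png", azure_storage))
```

Before this change a matching scheme alone was enough, so any S3 storage would have claimed the `bucket-B` URL.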

label_studio/io_storages/functions.py

Lines changed: 1 addition & 2 deletions
@@ -54,6 +54,5 @@ def get_storage_by_url(url: Union[str, List, Dict], storage_objects: Iterable[Im
     for storage_object in storage_objects:
         if storage_object.can_resolve_url(url):
             # note: only first found storage_object will be used for link resolving
-            # probably we need to use more advanced can_resolve_url mechanics
-            # that takes into account not only prefixes, but bucket path too
+            # can_resolve_url now checks both the scheme and the bucket to ensure the correct storage is used
             return storage_object

label_studio/io_storages/s3/serializers.py

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,8 @@ def validate(self, data):
         except TypeError as e:
             logger.info(f'It seems access keys are incorrect: {e}', exc_info=True)
             raise ValidationError('It seems access keys are incorrect')
+        except KeyError:
+            raise ValidationError(f'{storage.url_scheme}://{storage.bucket}/{storage.prefix} not found.')
         return data
 
 

label_studio/tasks/models.py

Lines changed: 7 additions & 10 deletions
@@ -420,12 +420,10 @@ def prepare_filename(filename):
     def resolve_storage_uri(self, url) -> Optional[Mapping[str, Any]]:
         from io_storages.functions import get_storage_by_url
 
-        storage = self.storage
-        project = self.project
-
-        if not storage:
-            storage_objects = project.get_all_import_storage_objects
-            storage = get_storage_by_url(url, storage_objects)
+        # Instead of using self.storage, we check all storage objects for the project to
+        # support imported tasks that point to another bucket
+        storage_objects = self.project.get_all_import_storage_objects
+        storage = get_storage_by_url(url, storage_objects)
 
         if storage:
             return {
@@ -468,10 +466,9 @@ def resolve_uri(self, task_data, project):
 
                 # project storage
                 # TODO: to resolve nested lists and dicts we should improve get_storage_by_url(),
-                # TODO: problem with current approach: it can be used only the first storage that get_storage_by_url
-                # TODO: returns. However, maybe the second storage will resolve uris properly.
-                # TODO: resolve_uri() already supports them
-                storage = self.storage or get_storage_by_url(task_data[field], storage_objects)
+                # Now always using get_storage_by_url to ensure the storage with the correct bucket is used
+                # As a last fallback we can use self.storage which is the storage the Task was imported from
+                storage = get_storage_by_url(task_data[field], storage_objects) or self.storage
                 if storage:
                     try:
                         resolved_uri = storage.resolve_uri(task_data[field], self)
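
The effect of swapping the operands of `or` in `resolve_uri` can be seen in a small standalone sketch. Everything here is hypothetical scaffolding (`make_storage`, `first_match`, `old_choice`, `new_choice`); only the lookup order mirrors the diff.

```python
from types import SimpleNamespace


def make_storage(name, bucket):
    # Hypothetical storage that claims URLs in exactly one S3 bucket.
    return SimpleNamespace(
        name=name,
        can_resolve_url=lambda url: url.startswith(f"s3://{bucket}/"),
    )


def first_match(url, storages):
    # Stand-in for get_storage_by_url: the first storage that claims the URL wins.
    return next((s for s in storages if s.can_resolve_url(url)), None)


def old_choice(url, import_storage, storages):
    # Before the commit: the storage the task was imported from always won.
    return import_storage or first_match(url, storages)


def new_choice(url, import_storage, storages):
    # After the commit: match by URL first, fall back to the import storage.
    return first_match(url, storages) or import_storage


storage_a = make_storage("storage 1", "bucket-A")
storage_b = make_storage("storage 2", "bucket-B")
url = "s3://bucket-B/images/cat.jpg"

print(old_choice(url, storage_a, [storage_a, storage_b]).name)
print(new_choice(url, storage_a, [storage_a, storage_b]).name)
```

With the old order, a task imported from bucket-A would try to resolve a bucket-B URL through storage 1; with the new order, storage 2 is picked, and the import storage is used only when no project storage claims the URL.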
