[rebuild] process fails when issue has empty pages #62

Open
mromanello opened this issue Sep 1, 2020 · 0 comments
mromanello commented Sep 1, 2020

Example:
oecaen-1914-12-02-a from BNF data.

Extent:

~18 issues of oecaen (as of 1 September 2020).

Complete log

Uploading 8 rebuilt bz2files to canonical-rebuilt-testing
Processing batch 9/11 [{'oecaen': [1912, 1943]}]% Completed | 22.3s
Processing year 1912
Retrieving issues...
Fleshing out articles by issue...
Number of partitions: 97
Skipped articles: []
done.
Processing year 1913
Retrieving issues...
Fleshing out articles by issue...
Number of partitions: 117
Skipped articles: []
done.
Processing year 1914
Retrieving issues...
Fleshing out articles by issue...
Number of partitions: 117
  File "impresso_commons/text/rebuilder.py", line 703, in main
    filter_language=languages
  File "impresso_commons/text/rebuilder.py", line 541, in rebuild_issues
    .pluck('id')\
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 2510, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 1812, in gather
    asynchronous=asynchronous,
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 753, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/utils.py", line 337, in sync
    six.reraise(*error[0])
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/utils.py", line 322, in f
    result[0] = yield future
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 1668, in _gather
    six.reraise(type(exception), exception, traceback)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/impresso_commons/text/helpers.py", line 70, in read_issue_pages
    for page in alternative_read_text(filename, IMPRESSO_STORAGEOPT)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/impresso_commons/utils/s3.py", line 443, in alternative_read_text
    with s_open(s3_key, 'r', transport_params=transport_params) as infile:
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 348, in open
    binary, filename = _open_binary_stream(uri, binary_mode, transport_params)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 556, in _open_binary_stream
    return _s3_open_uri(parsed_uri, mode, transport_params), filename
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 628, in _s3_open_uri
    return smart_open_s3.open(parsed_uri.bucket_id, parsed_uri.key_id, mode, **kwargs)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/s3.py", line 117, in open
    resource_kwargs=resource_kwargs,
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/s3.py", line 345, in __init__
    'or is forbidden for access' % (key, bucket)
'oecaen/pages/oecaen-1914/oecaen-1914-12-02-a-pages.jsonl.bz2' does not exist in the bucket 'original-canonical-staging', or is forbidden for access
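
The traceback above ends in the S3 read of the pages file, so one possible mitigation is to skip issues whose pages file cannot be opened rather than abort the whole batch. The following is a minimal sketch under stated assumptions, not the fix applied in the referenced commits: `read_pages_or_skip`, `fake_read_pages`, and the second issue id are hypothetical, and the exception types caught are a guess, since the log does not show which exception class is raised.

```python
from typing import Callable, Dict, Iterable, List, Tuple, Type


def read_pages_or_skip(
    issue_ids: Iterable[str],
    read_pages: Callable[[str], list],
    # Assumption: the log does not show the exception class, so the
    # types to treat as "pages file missing" are left configurable.
    missing_errors: Tuple[Type[BaseException], ...] = (OSError, ValueError),
) -> Tuple[Dict[str, list], List[str]]:
    """Read the pages of each issue, collecting issues whose pages are missing."""
    pages_by_issue: Dict[str, list] = {}
    skipped: List[str] = []
    for issue_id in issue_ids:
        try:
            pages_by_issue[issue_id] = read_pages(issue_id)
        except missing_errors:
            # No readable pages file for this issue: record the id and
            # continue with the next issue instead of failing the batch.
            skipped.append(issue_id)
    return pages_by_issue, skipped


def fake_read_pages(issue_id: str) -> list:
    """Hypothetical stand-in for the S3 read; fails for the reported issue."""
    if issue_id == "oecaen-1914-12-02-a":
        raise OSError("pages file does not exist in the bucket")
    return [{"id": issue_id + "-p0001"}]


pages, skipped = read_pages_or_skip(
    ["hypothetical-issue-b", "oecaen-1914-12-02-a"], fake_read_pages
)
print("Skipped issues:", skipped)
```

Returning the skipped ids alongside the pages keeps the failure visible, so the affected issues can be listed and investigated after the run.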
mromanello added the bug label Sep 1, 2020
mromanello self-assigned this Sep 1, 2020
mromanello pushed a commit that referenced this issue Sep 1, 2020
mromanello pushed a commit that referenced this issue Sep 1, 2020