-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added the needed changes to support EOS #2997
Conversation
0d9c299
to
7d653b6
Compare
@@ -147,12 +149,26 @@ def _try_to_log(logfn, *args, **kwargs): | |||
return _decorator | |||
|
|||
|
|||
def get_uri(uri, outdir=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mh, from the name of the function I would expect that this function returns an URI, while instead it returns a file from a URI... what about get_file_from_uri
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not returning a file, but just a path. Better retrieve_uri
__name__, | ||
os.path.join('fixtures', '1704.00452.pdf'), | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like this wasn't correct before... why would get_pdf_in_workflow
return a filename? Perhaps it should have been pkg_resources.resource_file
here, and then we no longer need the get_uri
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_pdf_in_workflow
retrieves the filename of the pdf that is in the workflow. get_uri
is required there because the refextract
task expects that file to be a temporary one, and removes it when done.
0fdaf91
to
cb5742f
Compare
2d5940d
to
36afe22
Compare
inspirehep/config.py
Outdated
@@ -91,7 +91,7 @@ | |||
BASE_FILES_LOCATION = os.path.join(sys.prefix, 'var/data') | |||
|
|||
# This is needed in order to be able to use EOS files locations | |||
MAX_CONTENT_LENGTH = 100 * 1024 * 1024 # 100 MiB | |||
MAX_CONTENT_LENGTH = 100 * 1024 * 1024 # 100 MiB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be squashed with first commit
@@ -263,12 +264,15 @@ def refextract(obj, eng): | |||
pdf_references, text_references = [], [] | |||
source = get_value(obj.data, 'acquisition_source.source') | |||
|
|||
pdf = get_pdf_in_workflow(obj) | |||
if pdf: | |||
tmp_pdf = get_pdf_in_workflow(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make explicit that the pdf that you expect to get is a temporary one (and you'll have to clean it up).
except TimeoutError: | ||
obj.log.error('Timeout when extracting references from PDF.') | ||
finally: | ||
if os.path.exists(tmp_pdf): | ||
os.unlink(tmp_pdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe? Aren't we deleting files from under invenio-files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thetmp_pdf
is the locally downloaded copy of the pdf, not really the original one. It's safe to remove (actually, in this case, it's mandatory to cleanup).
inspirehep/utils/url.py
Outdated
|
||
def retrieve_uri(uri, outdir=None): | ||
"""Retrieves the given uri and stores it in a temporary file.""" | ||
fd, local_file = tempfile.mkstemp(prefix='inspire', dir=outdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile with deleted=False
which is slightly easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the next will be solved with:
7 def retrieve_uri(uri, outdir=None):
6 """Retrieves the given uri and stores it in a temporary file."""
5 fd, local_file = tempfile.mkstemp(prefix='inspire', dir=outdir)
4 os.close(fd)
3
2 fs.copy.copy_file(uri, local_file)
1
81 return local_file
as the copy_file requires a string or fs.FS instance, not an open file (will open it with whatever opener is needed for it).
is that ok? I've checked the code underneath, and it uses chunks of:
In [1]: import io
In [2]: io.DEFAULT_BUFFER_SIZE
Out[2]: 8192
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dammit... this requires fs>=2, but it's capped by invenio-files-rest, see inveniosoftware/invenio-files-rest#130
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this with PyFSFileStorage which has all the support for chunking, size limit checking etc:
>>> import os
>>> import tempfile
>>> import requests
>>> from invenio_files_rest.storage import PyFSFileStorage
>>> fd, local_file = tempfile.mkstemp(prefix='inspire')
>>> os.close(fd)
>>> dst = PyFSFileStorage(local_file)
>>> dst.save(requests.get(uri, stream=True).raw, size_limit=1*1024*1024, chunk_size=2*1024*1024)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For local disk 8kb chunk size is ok....for EOS you definitely want to go to MB size for best performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip, though I think it's a bit premature to strive for performance (and probably it's at the pyfs level where we want to handle each different fs type tweaking, here we know one of the files is local, but the other we are not sure).
inspirehep/utils/url.py
Outdated
os.close(fd) | ||
|
||
with fsopen(uri, 'rb') as remote_fd: | ||
data = remote_fd.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not safe. It must be bufferized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not expecting to have big files (GB order), but sure, better safe than sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be hundredths of MB indeed. And also we might have some malicious user pointing to something huge...
OK, very improbable, but given the small size of our VMs... :-)
691d446
to
2992952
Compare
if len(kb_data) == batch_size: | ||
with fsopen(refextract_journal_kb_path, mode='wb') as fd: | ||
fd.write(''.join(kb_data)) | ||
kb_data = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest putting this code in a function since the pattern used is the same in three points.
Something like
. . .
for row in titles_query:
_add_normalized_title(kb_data, row['short_title'])
_add_normalized_title(kb_data, row['journal_title'])
_flush_kb_data(batch_size, kb_data, refextract_journal_kb_path)
. . .
from ..utils import download_file_to_workflow, with_debug_logging | ||
from ..utils import download_file_to_workflow, with_debug_logging, convert | ||
from ....utils.record import get_arxiv_categories, get_arxiv_id | ||
from ....utils.url import is_pdf_link, retrieve_uri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit but...it's super hard to me reading relative imports :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Google's Python style guide actually disallows relative imports)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here this is escaping the current module and reaching the utils
folder, which is just confusing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
The fact that we have those "modules" (that actually are python subpackages, don't confuse with python modules) is by itself confusing. And imo a symptom of a wider structural issue in the general organization of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I would like to have here is a fairly common architecture of large Flask applications based on blueprints, each implementing a distinct feature of INSPIRE: editor
, holdingpen
, forms
, search
, theme
...
This is what Explore Flask calls a "divisional structure": https://exploreflask.com/en/latest/blueprints.html#divisional. I detailed in several issues some of the steps needed to get there: #2251, #2260, #2641, #2642.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that I see there is that it has, tops, a main package (top level) and supackages (one depth level) of python modules. That's something that's completely missed on all those issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If what you want is to remove the modules
folder I agree with that, but that's a much simpler (although quite wide) change than splitting code in their own blueprint-driven modules.
Mind you, we didn't even start with blueprints: in the past some modules were actually single-file Flask applications that were injecting themselves in the main application through internal Flask APIs...
Apart from the above comments, I don't notice any other major change. If somebody has some, please provide them, since this is solving an high priority issue. |
89f208e
to
2ffa7d9
Compare
I've added the adaptation of the latest editor file upload, and refactored according to the comments. Ready again for review. |
c9d648f
to
65abea3
Compare
Once this passes the tests, it's ready for review/merge. |
Passed the tests, ready :) |
65abea3
to
1ffc4e5
Compare
There were a few places where we were expecting to be able to open
invenio files uris as local files, just needed to add the proper
abstraction by using PyFilesystem.
RFC
and look for it).