Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the needed changes to support EOS #2997

Closed

Conversation

david-caro
Copy link
Contributor

@david-caro david-caro commented Nov 27, 2017

There were a few places where we were expecting to be able to open
invenio files uris as local files, just needed to add the proper
abstraction by using PyFilesystem.

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@ghost ghost assigned david-caro Nov 27, 2017
@ghost ghost added the Status: WIP label Nov 27, 2017
@david-caro david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from 0d9c299 to 7d653b6 Compare November 28, 2017 12:02
@@ -147,12 +149,26 @@ def _try_to_log(logfn, *args, **kwargs):
return _decorator


def get_uri(uri, outdir=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh, from the name of the function I would expect that this function returns an URI, while instead it returns a file from a URI... what about get_file_from_uri ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not returning a file, but just a path. Better retrieve_uri

__name__,
os.path.join('fixtures', '1704.00452.pdf'),
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this wasn't correct before... why would get_pdf_in_workflow return a filename? Perhaps it should have been pkg_resources.resource_file here, and then we no longer need the get_uri ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_pdf_in_workflow retrieves the filename of the pdf that is in the workflow. get_uri is required there because the refextract task expects that file to be a temporary one, and removes it when done.

@david-caro david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from 0fdaf91 to cb5742f Compare November 29, 2017 10:33
@david-caro david-caro requested a review from kaplun November 29, 2017 16:43
@david-caro david-caro force-pushed the support_eos_file_storage branch 2 times, most recently from 2d5940d to 36afe22 Compare November 29, 2017 18:16
@@ -91,7 +91,7 @@
BASE_FILES_LOCATION = os.path.join(sys.prefix, 'var/data')

# This is needed in order to be able to use EOS files locations
MAX_CONTENT_LENGTH = 100 * 1024 * 1024 # 100 MiB
MAX_CONTENT_LENGTH = 100 * 1024 * 1024 # 100 MiB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be squashed with first commit

@@ -263,12 +264,15 @@ def refextract(obj, eng):
pdf_references, text_references = [], []
source = get_value(obj.data, 'acquisition_source.source')

pdf = get_pdf_in_workflow(obj)
if pdf:
tmp_pdf = get_pdf_in_workflow(obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make explicit that the pdf that you expect to get is a temporary one (and you'll have to clean it up).

except TimeoutError:
obj.log.error('Timeout when extracting references from PDF.')
finally:
if os.path.exists(tmp_pdf):
os.unlink(tmp_pdf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe? Aren't we deleting files from under invenio-files?

Copy link
Contributor Author

@david-caro david-caro Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thetmp_pdf is the locally downloaded copy of the pdf, not really the original one. It's safe to remove (actually, in this case, it's mandatory to cleanup).


def retrieve_uri(uri, outdir=None):
"""Retrieves the given uri and stores it in a temporary file."""
fd, local_file = tempfile.mkstemp(prefix='inspire', dir=outdir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile with deleted=False which is slightly easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the next will be solved with:

  7 def retrieve_uri(uri, outdir=None):                                              
  6     """Retrieves the given uri and stores it in a temporary file."""             
  5     fd, local_file = tempfile.mkstemp(prefix='inspire', dir=outdir)              
  4     os.close(fd)                                                                 
  3                                                                                  
  2     fs.copy.copy_file(uri, local_file)                                                   
  1                                                                                  
81      return local_file 

as the copy_file requires a string or fs.FS instance, not an open file (will open it with whatever opener is needed for it).
is that ok? I've checked the code underneath, and it uses chunks of:

In [1]: import io

In [2]: io.DEFAULT_BUFFER_SIZE
Out[2]: 8192

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dammit... this requires fs>=2, but it's capped by invenio-files-rest, see inveniosoftware/invenio-files-rest#130

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this with PyFSFileStorage which has all the support for chunking, size limit checking etc:

>>> import os
>>> import tempfile
>>> import requests
>>> from invenio_files_rest.storage import PyFSFileStorage
>>> fd, local_file = tempfile.mkstemp(prefix='inspire')
>>> os.close(fd)
>>> dst = PyFSFileStorage(local_file)
>>> dst.save(requests.get(uri, stream=True).raw, size_limit=1*1024*1024, chunk_size=2*1024*1024)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For local disk 8kb chunk size is ok....for EOS you definitely want to go to MB size for best performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip, though I think it's a bit premature to strive for performance (and probably it's at the pyfs level where we want to handle each different fs type tweaking, here we know one of the files is local, but the other we are not sure).

kaplun
kaplun previously requested changes Nov 29, 2017
os.close(fd)

with fsopen(uri, 'rb') as remote_fd:
data = remote_fd.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe. It must be bufferized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not expecting to have big files (GB order), but sure, better safe than sorry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be hundredths of MB indeed. And also we might have some malicious user pointing to something huge...

OK, very improbable, but given the small size of our VMs... :-)

@david-caro david-caro force-pushed the support_eos_file_storage branch 2 times, most recently from 691d446 to 2992952 Compare December 1, 2017 18:43
@david-caro david-caro dismissed kaplun’s stale review December 1, 2017 18:47

Already addressed

if len(kb_data) == batch_size:
with fsopen(refextract_journal_kb_path, mode='wb') as fd:
fd.write(''.join(kb_data))
kb_data = []
Copy link
Contributor

@ammirate ammirate Dec 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest putting this code in a function since the pattern used is the same in three points.
Something like

. . . 
    for row in titles_query:
        _add_normalized_title(kb_data, row['short_title'])
        _add_normalized_title(kb_data, row['journal_title'])
        _flush_kb_data(batch_size, kb_data, refextract_journal_kb_path)
. . . 

from ..utils import download_file_to_workflow, with_debug_logging
from ..utils import download_file_to_workflow, with_debug_logging, convert
from ....utils.record import get_arxiv_categories, get_arxiv_id
from ....utils.url import is_pdf_link, retrieve_uri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but...it's super hard to me reading relative imports :P

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Google's Python style guide actually disallows relative imports)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here this is escaping the current module and reaching the utils folder, which is just confusing...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

The fact that we have those "modules" (that actually are python subpackages, don't confuse with python modules) is by itself confusing. And imo a symptom of a wider structural issue in the general organization of the code.

Copy link
Contributor

@jacquerie jacquerie Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I would like to have here is a fairly common architecture of large Flask applications based on blueprints, each implementing a distinct feature of INSPIRE: editor, holdingpen, forms, search, theme...

This is what Explore Flask calls a "divisional structure": https://exploreflask.com/en/latest/blueprints.html#divisional. I detailed in several issues some of the steps needed to get there: #2251, #2260, #2641, #2642.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that I see there is that it has, tops, a main package (top level) and supackages (one depth level) of python modules. That's something that's completely missed on all those issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If what you want is to remove the modules folder I agree with that, but that's a much simpler (although quite wide) change than splitting code in their own blueprint-driven modules.

Mind you, we didn't even start with blueprints: in the past some modules were actually single-file Flask applications that were injecting themselves in the main application through internal Flask APIs...

@ammirate
Copy link
Contributor

ammirate commented Dec 4, 2017

Apart from the above comments, I don't notice any other major change. If somebody has some, please provide them, since this is solving an high priority issue.

@david-caro david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from 89f208e to 2ffa7d9 Compare December 4, 2017 19:57
@david-caro
Copy link
Contributor Author

I've added the adaptation of the latest editor file upload, and refactored according to the comments. Ready again for review.

@david-caro david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from c9d648f to 65abea3 Compare December 5, 2017 13:13
@david-caro
Copy link
Contributor Author

Once this passes the tests, it's ready for review/merge.

@david-caro
Copy link
Contributor Author

Passed the tests, ready :)

@david-caro david-caro closed this Dec 6, 2017
@david-caro david-caro force-pushed the support_eos_file_storage branch from 65abea3 to 1ffc4e5 Compare December 6, 2017 09:50
@ghost ghost removed the Status: Ready for review label Dec 6, 2017
@david-caro david-caro deleted the support_eos_file_storage branch December 6, 2017 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants