Added the needed changes to support EOS #2997

david-caro · 2017-11-27T19:48:25Z

There were a few places where we were expecting to be able to open
invenio files uris as local files, just needed to add the proper
abstraction by using PyFilesystem.

I have all the information that I need (if not, move to RFC and look for it).
I linked the related issue(s) in the corresponding commit logs.
I wrote good commit log messages.
My code follows the code style of this project.
I've added any new docs if API/utils methods were added.
I have updated the existing documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

jacquerie · 2017-11-28T14:53:56Z

inspirehep/modules/workflows/utils/__init__.py

@@ -147,12 +149,26 @@ def _try_to_log(logfn, *args, **kwargs):
    return _decorator


+def get_uri(uri, outdir=None):


Mh, from the name of the function I would expect that this function returns an URI, while instead it returns a file from a URI... what about get_file_from_uri ?

It's not returning a file, but just a path. Better retrieve_uri

jacquerie · 2017-11-28T14:57:24Z

tests/unit/workflows/test_workflows_tasks_actions.py

+            __name__,
+            os.path.join('fixtures', '1704.00452.pdf'),
+        )
+    )


It sounds like this wasn't correct before... why would get_pdf_in_workflow return a filename? Perhaps it should have been pkg_resources.resource_file here, and then we no longer need the get_uri ?

get_pdf_in_workflow retrieves the filename of the pdf that is in the workflow. get_uri is required there because the refextract task expects that file to be a temporary one, and removes it when done.

kaplun · 2017-11-29T21:44:07Z

inspirehep/config.py

@@ -91,7 +91,7 @@
 BASE_FILES_LOCATION = os.path.join(sys.prefix, 'var/data')

 # This is needed in order to be able to use EOS files locations
-MAX_CONTENT_LENGTH = 100 * 1024 * 1024 # 100 MiB
+MAX_CONTENT_LENGTH = 100 * 1024 * 1024  # 100 MiB


This can be squashed with first commit

kaplun · 2017-11-29T21:45:52Z

inspirehep/modules/workflows/tasks/actions.py

@@ -263,12 +264,15 @@ def refextract(obj, eng):
    pdf_references, text_references = [], []
    source = get_value(obj.data, 'acquisition_source.source')

-    pdf = get_pdf_in_workflow(obj)
-    if pdf:
+    tmp_pdf = get_pdf_in_workflow(obj)


to make explicit that the pdf that you expect to get is a temporary one (and you'll have to clean it up).

kaplun · 2017-11-29T21:46:45Z

inspirehep/modules/workflows/tasks/actions.py

        except TimeoutError:
            obj.log.error('Timeout when extracting references from PDF.')
+        finally:
+            if os.path.exists(tmp_pdf):
+                os.unlink(tmp_pdf)


Is this safe? Aren't we deleting files from under invenio-files?

thetmp_pdf is the locally downloaded copy of the pdf, not really the original one. It's safe to remove (actually, in this case, it's mandatory to cleanup).

kaplun · 2017-11-29T21:49:29Z

inspirehep/utils/url.py

+
+def retrieve_uri(uri, outdir=None):
+    """Retrieves the given uri and stores it in a temporary file."""
+    fd, local_file = tempfile.mkstemp(prefix='inspire', dir=outdir)


You can use https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile with deleted=False which is slightly easier.

This and the next will be solved with:

7 def retrieve_uri(uri, outdir=None): 6 """Retrieves the given uri and stores it in a temporary file.""" 5 fd, local_file = tempfile.mkstemp(prefix='inspire', dir=outdir) 4 os.close(fd) 3 2 fs.copy.copy_file(uri, local_file) 1 81 return local_file

as the copy_file requires a string or fs.FS instance, not an open file (will open it with whatever opener is needed for it).
is that ok? I've checked the code underneath, and it uses chunks of:

In [1]: import io In [2]: io.DEFAULT_BUFFER_SIZE Out[2]: 8192

dammit... this requires fs>=2, but it's capped by invenio-files-rest, see inveniosoftware/invenio-files-rest#130

You can do this with PyFSFileStorage which has all the support for chunking, size limit checking etc:

>>> import os >>> import tempfile >>> import requests >>> from invenio_files_rest.storage import PyFSFileStorage >>> fd, local_file = tempfile.mkstemp(prefix='inspire') >>> os.close(fd) >>> dst = PyFSFileStorage(local_file) >>> dst.save(requests.get(uri, stream=True).raw, size_limit=1*1024*1024, chunk_size=2*1024*1024)

For local disk 8kb chunk size is ok....for EOS you definitely want to go to MB size for best performance.

Thanks for the tip, though I think it's a bit premature to strive for performance (and probably it's at the pyfs level where we want to handle each different fs type tweaking, here we know one of the files is local, but the other we are not sure).

kaplun · 2017-11-29T21:50:09Z

inspirehep/utils/url.py

+    os.close(fd)
+
+    with fsopen(uri, 'rb') as remote_fd:
+        data = remote_fd.read()


This is not safe. It must be bufferized.

I was not expecting to have big files (GB order), but sure, better safe than sorry.

There might be hundredths of MB indeed. And also we might have some malicious user pointing to something huge...

OK, very improbable, but given the small size of our VMs... :-)

Already addressed

ammirate · 2017-12-04T15:18:24Z

inspirehep/modules/refextract/tasks.py

+        if len(kb_data) == batch_size:
+            with fsopen(refextract_journal_kb_path, mode='wb') as fd:
+                fd.write(''.join(kb_data))
+            kb_data = []


I would suggest putting this code in a function since the pattern used is the same in three points.
Something like

. . . for row in titles_query: _add_normalized_title(kb_data, row['short_title']) _add_normalized_title(kb_data, row['journal_title']) _flush_kb_data(batch_size, kb_data, refextract_journal_kb_path) . . .

ammirate · 2017-12-04T15:32:00Z

inspirehep/modules/workflows/tasks/arxiv.py

-from ..utils import download_file_to_workflow, with_debug_logging
+from ..utils import download_file_to_workflow, with_debug_logging, convert
+from ....utils.record import get_arxiv_categories, get_arxiv_id
+from ....utils.url import is_pdf_link, retrieve_uri


nit but...it's super hard to me reading relative imports :P

(Google's Python style guide actually disallows relative imports)

Also here this is escaping the current module and reaching the utils folder, which is just confusing...

Changed.

The fact that we have those "modules" (that actually are python subpackages, don't confuse with python modules) is by itself confusing. And imo a symptom of a wider structural issue in the general organization of the code.

What I would like to have here is a fairly common architecture of large Flask applications based on blueprints, each implementing a distinct feature of INSPIRE: editor, holdingpen, forms, search, theme...

This is what Explore Flask calls a "divisional structure": https://exploreflask.com/en/latest/blueprints.html#divisional. I detailed in several issues some of the steps needed to get there: #2251, #2260, #2641, #2642.

One thing that I see there is that it has, tops, a main package (top level) and supackages (one depth level) of python modules. That's something that's completely missed on all those issues.

If what you want is to remove the modules folder I agree with that, but that's a much simpler (although quite wide) change than splitting code in their own blueprint-driven modules.

Mind you, we didn't even start with blueprints: in the past some modules were actually single-file Flask applications that were injecting themselves in the main application through internal Flask APIs...

ammirate · 2017-12-04T15:50:08Z

Apart from the above comments, I don't notice any other major change. If somebody has some, please provide them, since this is solving an high priority issue.

david-caro · 2017-12-04T20:42:06Z

I've added the adaptation of the latest editor file upload, and refactored according to the comments. Ready again for review.

david-caro · 2017-12-05T13:13:50Z

Once this passes the tests, it's ready for review/merge.

david-caro · 2017-12-05T17:48:23Z

Passed the tests, ready :)

ghost assigned david-caro Nov 27, 2017

ghost added the Status: WIP label Nov 27, 2017

david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from 0d9c299 to 7d653b6 Compare November 28, 2017 12:02

jacquerie reviewed Nov 28, 2017

View reviewed changes

david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from 0fdaf91 to cb5742f Compare November 29, 2017 10:33

david-caro added Status: Ready for review and removed Status: WIP labels Nov 29, 2017

david-caro requested a review from kaplun November 29, 2017 16:43

david-caro force-pushed the support_eos_file_storage branch 2 times, most recently from 2d5940d to 36afe22 Compare November 29, 2017 18:16

kaplun reviewed Nov 29, 2017

View reviewed changes

kaplun previously requested changes Nov 29, 2017

View reviewed changes

david-caro force-pushed the support_eos_file_storage branch 2 times, most recently from 691d446 to 2992952 Compare December 1, 2017 18:43

ammirate reviewed Dec 4, 2017

View reviewed changes

david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from 89f208e to 2ffa7d9 Compare December 4, 2017 19:57

david-caro force-pushed the support_eos_file_storage branch 3 times, most recently from c9d648f to 65abea3 Compare December 5, 2017 13:13

ammirate approved these changes Dec 6, 2017

View reviewed changes

david-caro closed this Dec 6, 2017

david-caro force-pushed the support_eos_file_storage branch from 65abea3 to 1ffc4e5 Compare December 6, 2017 09:50

ghost removed the Status: Ready for review label Dec 6, 2017

david-caro deleted the support_eos_file_storage branch December 6, 2017 09:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added the needed changes to support EOS #2997

Added the needed changes to support EOS #2997

david-caro commented Nov 27, 2017 •

edited

Loading

jacquerie Nov 28, 2017

david-caro Nov 28, 2017

jacquerie Nov 28, 2017

david-caro Nov 28, 2017

kaplun Nov 29, 2017

kaplun Nov 29, 2017

david-caro Nov 30, 2017

kaplun Nov 29, 2017

david-caro Nov 30, 2017 •

edited

Loading

kaplun Nov 29, 2017

david-caro Nov 30, 2017

david-caro Nov 30, 2017

lnielsen Nov 30, 2017

lnielsen Nov 30, 2017

david-caro Nov 30, 2017

kaplun Nov 29, 2017

david-caro Nov 30, 2017

kaplun Nov 30, 2017

ammirate Dec 4, 2017 •

edited

Loading

ammirate Dec 4, 2017

jacquerie Dec 4, 2017

jacquerie Dec 4, 2017

david-caro Dec 4, 2017

jacquerie Dec 5, 2017 •

edited

Loading

david-caro Dec 5, 2017

jacquerie Dec 5, 2017

ammirate commented Dec 4, 2017

david-caro commented Dec 4, 2017

david-caro commented Dec 5, 2017

david-caro commented Dec 5, 2017

		@@ -147,12 +149,26 @@ def _try_to_log(logfn, args, *kwargs):
		return _decorator


		def get_uri(uri, outdir=None):

Added the needed changes to support EOS #2997

Added the needed changes to support EOS #2997

Conversation

david-caro commented Nov 27, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

david-caro Nov 30, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ammirate Dec 4, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacquerie Dec 5, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ammirate commented Dec 4, 2017

david-caro commented Dec 4, 2017

david-caro commented Dec 5, 2017

david-caro commented Dec 5, 2017

david-caro commented Nov 27, 2017 •

edited

Loading

david-caro Nov 30, 2017 •

edited

Loading

ammirate Dec 4, 2017 •

edited

Loading

jacquerie Dec 5, 2017 •

edited

Loading