DataLad command to convert ODS/XLSX files into "archival" single-sheet collection of TSVs #14
Comments
https://github.com/pyexcel/pyexcel is an alternative (wrapper) that may be useful for supporting more than xlsx (e.g., ods files).
Would this just be a utility that runs on a (possibly multi-sheet) input spreadsheet (xlsx or ods format) and then outputs a collection of TSV files, i.e. before any form of validation? Or would it make sense to also incorporate validation, or at least some structured interpretation, into this process? I'm thinking that this command might need to have some understanding of the intended tabby structure for the conversion process, rather than just doing a dumb transformation.
ATM I am thinking about it as a dumb converter. However, in my mind the optimal point for performing validation is not yet clear.
pyexcel is not good. A change in a dependency broke basic functionality in Feb 2023 and no fix has been released yet, although an applicable fix appears to have been known since a few days after the initial report (pyexcel/pyexcel-xlsx#52). We had better stick to openpyxl (see #8).
I implemented the XLSX -> TSV part. We would need to think more about how (and if) we would support the representation of custom contexts and frames when going from tabby (back) to XLSX.
With #50 settled, we know all the pieces. A record in XLSX format would still carry all the other files (context, overrides, etc.). Conversion to TSV brings it into an archival format, with no changes necessary to the non-TSV parts. The only TODO left here is exposing this functionality via the CLI.
This is the format in which we want/need to keep things long-term, and the format a metadata extractor would consume.
At first glance openpyxl looks like a sufficient and lightweight solution. Pandas can also do it, but it is much heavier.
A more detailed analysis was done by @mslw in #8 already.
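For illustration, here is a minimal sketch of what a "dumb" XLSX-to-TSV conversion with openpyxl could look like. It is not the actual implementation. The helper names (`sheet_filename`, `write_tsv`, `xlsx_to_tsvs`) and the `<workbook>_<sheet>.tsv` naming scheme are assumptions made for this example, and no validation or tabby-specific interpretation is attempted.

```python
import csv
from pathlib import Path


def sheet_filename(workbook_stem, sheet_title):
    """Build a TSV file name for one sheet.

    The '<workbook>_<sheet>.tsv' scheme is a hypothetical choice for this
    sketch; only path separators in the sheet title are replaced.
    """
    safe_title = sheet_title.replace("/", "_").replace("\\", "_")
    return f"{workbook_stem}_{safe_title}.tsv"


def write_tsv(rows, path):
    """Write an iterable of row tuples to a TSV file.

    Empty cells (None) are written as empty strings.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow(["" if cell is None else cell for cell in row])


def xlsx_to_tsvs(xlsx_path, out_dir):
    """Dump every sheet of an XLSX workbook into its own TSV file."""
    # openpyxl is a third-party dependency; imported lazily so the
    # helpers above remain usable without it.
    from openpyxl import load_workbook

    xlsx_path = Path(xlsx_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # data_only=True yields cached formula results instead of formulas
    wb = load_workbook(xlsx_path, read_only=True, data_only=True)
    written = []
    for ws in wb.worksheets:
        target = out_dir / sheet_filename(xlsx_path.stem, ws.title)
        write_tsv(ws.iter_rows(values_only=True), target)
        written.append(target)
    wb.close()
    return written
```

Calling `xlsx_to_tsvs("record.xlsx", "out")` (with a hypothetical `record.xlsx`) would write one TSV per sheet into `out/` and return the list of paths. ODS input is not covered by this sketch, since openpyxl only reads the XLSX family of formats.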