DataLad command to convert ODS/XLSX files into "archival" single-sheet collection of TSVs #14

Open
mih opened this issue Jun 29, 2023 · 6 comments

Comments

@mih
Contributor

mih commented Jun 29, 2023

This is the format with which we want/need to keep things long-term, and the format a metadata extractor would eat.

At first glance, openpyxl looks like a sufficient and lightweight solution. Pandas can also do it, but it is much heavier.

A more detailed analysis was done by @mslw in #8 already.

@mih
Contributor Author

mih commented Jun 30, 2023

https://github.com/pyexcel/pyexcel is an alternative (wrapper) that may be useful for supporting more than xlsx (e.g., ods files).

@jsheunis
Contributor

jsheunis commented Jul 3, 2023

Would this just be a utility that runs on a (possibly multi-sheet) input spreadsheet (XLSX or ODS format) and then outputs a collection of TSV files, i.e., before any form of validation? Or would it make sense to also incorporate validation, or at least some structured interpretation, into this process? I'm thinking that this command might need to have some understanding of the intended tabby structure for the conversion process, rather than just doing a dumb transformation.

@mih
Contributor Author

mih commented Jul 3, 2023

ATM I am thinking about it as a dumb converter. However, in my brain the optimal point for performing validation is not yet clear.

@mih mih self-assigned this Jul 3, 2023
@mih
Contributor Author

mih commented Jul 3, 2023

pyexcel is not good. A change in a dependency broke basic functionality in Feb 2023 and no fix has been released yet, although an applicable fix appears to have been known since a few days after the initial report: pyexcel/pyexcel-xlsx#52

We had better stick with openpyxl (see #8).

@mih
Contributor Author

mih commented Jul 3, 2023

I implemented the XLSX -> TSV part.

We would need to think more about how (and if) we would support the representation of custom contexts and frames when going from tabby (back) to XLSX.
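
For readers who want a picture of the XLSX -> TSV direction, here is a minimal sketch of a "dumb" converter built on openpyxl. It is not the implementation referred to above. The function names (`write_tsv`, `xlsx_to_tsvs`) and the `<workbook>_<sheet>.tsv` naming scheme are assumptions made for illustration only.

```python
import csv
from pathlib import Path


def write_tsv(rows, path):
    """Write an iterable of row tuples to a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for row in rows:
            # empty spreadsheet cells arrive as None; store them as empty fields
            writer.writerow(["" if v is None else v for v in row])


def xlsx_to_tsvs(xlsx_path, out_dir):
    """Dump every sheet of a workbook into its own TSV file."""
    # imported lazily so write_tsv() stays usable without openpyxl installed
    from openpyxl import load_workbook

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # data_only=True yields cached cell values instead of formula strings
    wb = load_workbook(xlsx_path, read_only=True, data_only=True)
    written = []
    for ws in wb.worksheets:
        # hypothetical naming scheme: <workbook stem>_<sheet title>.tsv
        dest = out_dir / f"{Path(xlsx_path).stem}_{ws.title}.tsv"
        write_tsv(ws.iter_rows(values_only=True), dest)
        written.append(dest)
    wb.close()
    return written
```

No validation or interpretation of the tabby structure happens here; each sheet is written out cell by cell, which matches the "dumb converter" idea discussed earlier in the thread.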

@mih mih removed their assignment Jul 4, 2023
@mih
Contributor Author

mih commented Jul 18, 2023

With #50 settled, we know all the pieces. A record in XLSX format would still carry all the other files (context, overrides, etc). Conversion to TSV brings it into an archival format, with no changes necessary to the non-TSV parts.

The only remaining TODO here is to expose this functionality via the CLI.
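
As a rough illustration of what a CLI entry point could look like, here is a plain `argparse` sketch. It deliberately does not use DataLad's own command interface, and the program name `xlsx2tsv`, the `--outdir` option, and the `convert` callable are all hypothetical placeholders rather than anything that exists in the project.

```python
import argparse


def build_parser():
    # hypothetical program and option names, not an existing DataLad command
    parser = argparse.ArgumentParser(
        prog="xlsx2tsv",
        description="Convert a multi-sheet spreadsheet into one TSV per sheet",
    )
    parser.add_argument("spreadsheet", help="input XLSX file")
    parser.add_argument(
        "-o", "--outdir", default=".",
        help="directory for the resulting TSV files (default: current directory)",
    )
    return parser


def main(argv=None, convert=None):
    """Parse arguments and hand them to a conversion callable."""
    args = build_parser().parse_args(argv)
    if convert is None:
        # placeholder: the actual conversion routine would be plugged in here
        raise SystemExit("no converter configured")
    convert(args.spreadsheet, args.outdir)
    return 0
```

Passing the converter in as an argument keeps the argument handling testable on its own, independent of any spreadsheet library.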

2 participants