An enumerative synthesizer for recovering Excel formulas from CSVs.
Input: CSV File (samples/example.csv) | Output: Formula Mask |
---|---|
|
|
- Get
docker
for your OS. - Pull the docker image#:
docker pull padhi/excelsynth
. - Run a container over the image:
docker run -it padhi/excelsynth
.
This would give you abash
shell within ExcelSynth directory. - To run ExcelSynth on
samples/unit_test.csv
, execute:dune exec bin/App.exe -- samples/unit_test.csv
- To run the unit tests, execute:
dune runtest
# Alternatively, you could also build the Docker image locally:
docker build -t padhi/excelsynth github.com/SaswatPadhi/ExcelSynth
$ dune exec bin/App.exe -- -h
Synthesize Excel formulas for a CSV file.
App.exe [flag] ... FILENAME
=== flags ===
[-check-last-col-aggregations BOOLEAN] synthesize aggregation formulas for
cells in the last column
[-check-last-row-aggregations BOOLEAN] synthesize aggregation formulas for
cells in the last row
[-check-pointwise-col-operations BOOLEAN] synthesize pointwise
transformations for columns
[-check-pointwise-row-operations BOOLEAN] synthesize pointwise
transformations for rows
[-constant FLOAT] ... additional Boolean/numeric/string
constants
[-disable-constant-solutions BOOLEAN] disable constant formulas (e.g.
=0.0) for cells
[-enable-2d-aggregation BOOLEAN] use 2D ranges in aggregation
operations
[-enable-booleans BOOLEAN] enable Boolean and conditional
expressions
[-log-path FILENAME] enable logging and output to the
specified path
[-mask-path FILENAME] a known formula mask for the CSV
file
[-max-expr-size INTEGER] maximum cost (AST size) of
expressions to explore
[-max-threads INTEGER] maximum number of threads to create
[-range STRING] a range (in RC:R'C' format) that
bounds the synthesis space
[-relative-error FLOAT] the fractional relative error
allowed in float comparisons
[-restrict-to-top-left-data BOOLEAN] only use data to the top left of a
cell in formulas
[-type-error-threshold FLOAT] maximum fraction of cells that may
be ignored due to type errors
[-value-error-threshold FLOAT] maximum fraction of cells that may
be ignored due to value errors
Bulk Processing (scripts/evaluate.sh
)
The following input directory structure is required:
<data>
|
+-- table_ranges.csv <--- Contains table ranges for CSV files
|
+-- evaluated_csvs <--- Contains fully evaluated CSV files
| |
| +-- <a>.csv
| |
| `-- <b>.csv
|
+-- formula_csvs <--- Contains CSV files with formulas
|
+-- <a>.csv
|
`-- <b>.csv
The following output data is generated within this directory:
<data>
|
: · · ·
|
: · · ·
|
+-- extracted_masks <--- Contents generated by scripts/extract_mask.py
| |
| +-- <a>.csv <--- Ground truth mask from `../formula_csvs/<a>.csv`
| |
| `-- <b>.csv
|
+-- recovered_masks <--- Contents generated by scripts/recover_mask.py
| |
| +-- Baseline <--- Unrestricted synthesis (over whole sheet)
| | |
| | +-- <a>.csv <--- Synthesized mask from `../evaluated_csvs/<a>.csv`
| | |
| | `-- <b>.csv
| |
: : · · ·
| |
| `-- <table_detector_n> <--- Synthesis restricted to tables from <table_detector_n>
| |
| +-- <a>.csv <--- Synthesized mask from `../evaluated_csvs/<a>.csv`
| |
| `-- <b>.csv
|
`-- comparison_masks <--- Contents generated by scripts/compare_masks.py
|
+-- full <--- All cells within a recovered mask are checked
| |
| +-- Baseline <--- Evaluation of unrestricted-synthesis masks
| | |
| | +-- <a>.csv <--- Evaluation of `../recovered_masks/Baseline/<a>.csv`
| | |
| | `-- <b>.csv
| |
: : · · ·
| |
| `-- <table_detector_n> <--- Evaluation of restricted-synthesis masks
| |
| +-- <a>.csv
| |
| `-- <b>.csv
|
`-- in-table <--- Only in-table cells are checked
|
+-- Baseline
| |
| +-- <a>.csv
| |
| `-- <b>.csv
|
: · · ·
|
`-- <table_detector_n>
|
+-- <a>.csv
|
`-- <b>.csv
$ python3 scripts/extract_mask.py -h
usage: extract_mask.py [-h] -i INPUT_DIR -o OUTPUT_DIR
optional arguments:
-h, --help show this help message and exit
-i INPUT_DIR, --input-dir INPUT_DIR
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
$ python3 scripts/extract_mask.py -i data/formula_csvs -o data/extracted_masks
$ python3 scripts/recover_mask.py -h
usage: recover_mask.py [-h] -e EVAL_CSV_DIR -o OUTPUT_DIR -c TABLE_RANGE_COLUMN
tables_data_csv
positional arguments:
tables_data_csv
optional arguments:
-h, --help show this help message and exit
-e EVAL_CSV_DIR, --eval-csv-dir EVAL_CSV_DIR
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
-c TABLE_RANGE_COLUMN, --table-range-column TABLE_RANGE_COLUMN
$ python3 scripts/recover_mask.py -e data/evaluated_csvs -o data/recovered_masks \
-c 1 data/table_ranges.csv
tables_data_csv
has extracted table ranges in TABLE_RANGE_COLUMN
(0-indexed).
$ python3 scripts/compare_masks.py -h
usage: compare_masks.py [-h] -g GROUND_TRUTH_DIR -p PREDICTION_DIR -o OUTPUT_DIR
optional arguments:
-h, --help show this help message and exit
-g GROUND_TRUTH_DIR, --ground-truth-dir GROUND_TRUTH_DIR
-p PREDICTION_DIR, --prediction-dir PREDICTION_DIR
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
$ python3 scripts/compare_mask.py -g data/extracted_masks \
-p data/recovered_masks/table_detector_1 \
-o data/comparison_masks/table_detector_1