Improve documentation for format `OPTIONS` clause #15708
Thank you so much @marvelshan -- this is a nice improvement.
I left a few comments -- let me know what you think. It would be great to put the NULL option into the CSV options and fix the examples so they work in this PR.
# Format Options

DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.
Suggested change:
DataFusion supports customizing how data is read from or written to disk as a result of `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statements. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.
Format-related options can be specified in the following ways:

- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples
I think it would be helpful to explicitly specify the order of precedence here. Something like:

Format-related options can be specified in three ways, in decreasing order of precedence:

- `CREATE EXTERNAL TABLE` syntax
- `COPY` option tuples
- Session-level config defaults
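
To make the suggested precedence wording concrete, here is a minimal sketch of a session-level default being overridden by a `COPY` option tuple. It has not been run against this branch. `source_table` and the output path are hypothetical, and the sketch assumes the session key `datafusion.execution.parquet.compression` and the `compression` option behave as in the existing write options docs.

```sql
-- Session-level default: used when a statement does not override it.
SET datafusion.execution.parquet.compression = 'snappy';

-- COPY option tuple: expected to take precedence over the session default
-- for this statement only. `source_table` and the path are hypothetical.
COPY source_table
TO '/tmp/table_with_options/'
STORED AS PARQUET
OPTIONS ('compression' 'zstd(5)');
```

If the precedence order holds as described, the files written by this `COPY` would use zstd rather than the session-level snappy default.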
| Option     | Description                                                   | Default Value    |
| ---------- | ------------------------------------------------------------- | ---------------- |
| NULL_VALUE | Sets the string which should be used to indicate null values. | arrow-rs default |
I think this is a CSV specific option (not a generic option)
For example
> create external table my_table(a int) stored as JSON location '/tmp/foo' options('NULL_VALUE' 'NULL');
Invalid or Unsupported Configuration: Config value "null_value" not found on JsonOptions
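
For contrast, a sketch of the same option on a CSV-backed table, which is where the error above suggests it belongs. The table name and location are hypothetical, and this assumes `NULL_VALUE` is accepted by the CSV options; it has not been run against this branch.

```sql
-- Hypothetical table and path, for illustration only.
-- NULL_VALUE is passed to a table stored as CSV instead of JSON.
CREATE EXTERNAL TABLE my_csv_table(a int)
STORED AS CSV
LOCATION '/tmp/foo_csv/'
OPTIONS('NULL_VALUE' 'NULL');
```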
CREATE EXTERNAL TABLE t
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION', 'gzip');
I think this needs to have a column definition
> CREATE EXTERNAL TABLE t
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION', 'gzip');
🤔 Invalid statement: sql parser error: Expected: string or numeric value, found: , at Line: 4, Column: 22
Also, to write data you need to specify a directory; otherwise you get an error:
> CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION' 'gzip');
0 row(s) fetched.
Elapsed 0.003 seconds.
> insert into t values(1);
Error during planning: Inserting into a ListingTable backed by a single file is not supported, URL is possibly missing a trailing `/`. To append to an existing file use StreamTable, e.g. by using CREATE UNBOUNDED EXTERNAL TABLE
Also, there is an extra `,` between the option key and its value.
So maybe something like:

CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo'
OPTIONS('COMPRESSION' 'gzip');
-- Inserting a row creates a new file in /tmp/foo
INSERT INTO t VALUES(1);
Thank you for the suggestions!
I also wanted to check with you: since there are quite a few available options in `config.rs`, should we just list a few key ones in the documentation, or aim to include all of them? I’m wondering if listing everything might make the docs overly complex.
I think we should include all of the options in the documentation as a reference. While I agree there are a lot of options, I think it would be best for the documentation to reflect that (complex) reality.
| DICTIONARY_PAGE_SIZE_LIMIT   | No | Sets best effort maximum dictionary page size in bytes           |
| CREATED_BY                   | No | Sets the "created by" property in the parquet file               |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
| DATA_PAGE_ROW_COUNT_LIMIT    | No | Sets best effort maximum number of rows in a data page.          |
We seem to have lost some of these options in the new doc
Which issue does this PR close?

- `CREATE EXTERNAL TABLE ... OPTIONS` #10451

Rationale for this change

This PR adds documentation for the `OPTIONS` clause, including generic options and format-specific options, to ensure users have clear guidance on available settings.

What changes are included in this PR?

- Renamed `write_options.md` to `format_options.md` to reflect its scope for both reading and writing.
- `OPTIONS` clause.

Are these changes tested?

Do not require automated tests

Are there any user-facing changes?

Yes, the documentation now includes detailed examples and descriptions of the `OPTIONS` clause for `CREATE EXTERNAL TABLE` and `COPY` queries.