Improve documentation for format OPTIONS clause #15708

Open
wants to merge 5 commits into
base: main

marvelshan
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds documentation for the OPTIONS clause, including generic options and format-specific options, to ensure users have clear guidance on available settings.

What changes are included in this PR?

  • Renamed write_options.md to format_options.md to reflect its scope for both reading and writing.
  • Added examples for each format (JSON, CSV, Parquet) in the OPTIONS clause.
  • Documented generic options and format-specific options in a structured format.
  • Updated heading levels to ensure all sections appear in the table of contents.

Are these changes tested?

No. These changes do not require automated tests.

Are there any user-facing changes?

Yes, the documentation now includes detailed examples and descriptions of the OPTIONS clause for CREATE EXTERNAL TABLE and COPY queries.

@github-actions github-actions bot added the `documentation` label (Improvements or additions to documentation) on Apr 14, 2025
@github-actions github-actions bot added the `core` label (Core DataFusion crate) on Apr 14, 2025
@alamb
Contributor

alamb commented Apr 14, 2025

Run extended tests

@alamb
Contributor

alamb commented Apr 14, 2025

Run extended tests

Contributor

@alamb alamb left a comment


Thank you so much @marvelshan -- this is a nice improvement

I left a few comments -- let me know what you think. It would be great to move the `NULL_VALUE` option into the CSV options and fix the examples so they work in this PR.


# Format Options

DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.

Suggested change
DataFusion supports customizing how data is read from or written to disk as a result of `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statements. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.

Comment on lines 26 to 31
Format-related options can be specified in the following ways:

- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples


I think it would be helpful to explicitly specify the order of precedence here. Something like:

Suggested change
Format-related options can be specified in three ways, in decreasing order of precedence:
- `CREATE EXTERNAL TABLE` syntax
- `COPY` option tuples
- Session-level config defaults
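
As a hedged illustration of that ordering, the sketch below sets a session-level default and then overrides it for a single `COPY` statement. The config key `datafusion.execution.parquet.compression`, the output path, and the unprefixed `'compression'` option key are assumptions for illustration and may differ between DataFusion versions.

```sql
-- Session-level default: lowest precedence (assumed config key).
SET datafusion.execution.parquet.compression = 'snappy';

-- Statement-level option tuple: expected to take precedence over the
-- session default for this write only.
-- The path /tmp/precedence_example/ is hypothetical.
COPY (SELECT 1 AS a)
TO '/tmp/precedence_example/'
STORED AS PARQUET
OPTIONS ('compression' 'zstd(5)');
```

If the ordering above holds, files written by this `COPY` would use `zstd(5)`, while writes without an explicit option would fall back to the session default.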


| Option | Description | Default Value |
| ---------- | ------------------------------------------------------------- | ---------------- |
| NULL_VALUE | Sets the string which should be used to indicate null values. | arrow-rs default |

I think this is a CSV-specific option (not a generic option).

For example:

> create external table my_table(a int) stored as JSON location '/tmp/foo' options('NULL_VALUE' 'NULL');
Invalid or Unsupported Configuration: Config value "null_value" not found on JsonOptions
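
For contrast, here is a sketch of the same option on a CSV-backed table, which is where it would be expected to apply. The table name and directory are hypothetical; treat the statement as untested.

```sql
-- NULL_VALUE applied to a CSV-backed table (hypothetical name and location).
-- The trailing slash is intended to mark the location as a directory
-- rather than a single file.
CREATE EXTERNAL TABLE my_csv_table(a int)
STORED AS CSV
LOCATION '/tmp/foo_csv/'
OPTIONS ('NULL_VALUE' 'NULL');
```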

Comment on lines 94 to 97
CREATE EXTERNAL TABLE t
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION', 'gzip');

I think this needs to have a column definition

> CREATE EXTERNAL TABLE t
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION', 'gzip');
🤔 Invalid statement: sql parser error: Expected: string or numeric value, found: , at Line: 4, Column: 22

Also, to write data you need to specify a directory; otherwise you get an error:

> CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION' 'gzip');
0 row(s) fetched.
Elapsed 0.003 seconds.

> insert into t values(1);
Error during planning: Inserting into a ListingTable backed by a single file is not supported, URL is possibly missing a trailing `/`. To append to an existing file use StreamTable, e.g. by using CREATE UNBOUNDED EXTERNAL TABLE

There is also an extra `,` between the option name and its value.

So maybe something like:

Suggested change
CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo'
OPTIONS('COMPRESSION' 'gzip');
-- Inserting a row creates a new file in /tmp/foo
INSERT INTO t VALUES(1);

@marvelshan
Contributor Author

Thank you for the suggestions!
I realized there were several places I hadn’t tested thoroughly. After some testing, I’ve made updates and pushed a new version.
I also wanted to check with you: since config.rs defines quite a few options, should we list only a few key ones in the documentation, or aim to include all of them? I’m wondering whether listing everything might make the docs overly complex.

@alamb alamb changed the title from "doc/document options clause" to "Improve documentation for format OPTIONS clause" on Apr 16, 2025
Contributor

@alamb alamb left a comment


> I also wanted to check with you: since config.rs defines quite a few options, should we list only a few key ones in the documentation, or aim to include all of them? I’m wondering whether listing everything might make the docs overly complex.

I think we should include all of the options in the documentation as a reference. While I agree there are a lot of options, I think it would be best for the documentation to reflect that (complex) reality.

| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes |
| CREATED_BY | No | Sets the "created by" property in the parquet file |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. |

We seem to have lost some of these options in the new doc

Labels
`core` (Core DataFusion crate), `documentation` (Improvements or additions to documentation)
2 participants