diff --git a/src/pages/latest/delta-column-mapping.mdx b/src/pages/latest/delta-column-mapping.mdx deleted file mode 100644 index 1eb5b8c..0000000 --- a/src/pages/latest/delta-column-mapping.mdx +++ /dev/null @@ -1,58 +0,0 @@ ---- -title: Delta Column Mapping -description: Learn about column mapping in Delta. -menu: docs ---- - -Delta Lake supports column mapping, which allows Delta table columns and the -corresponding Parquet columns to use different names. Column mapping enables -Delta schema evolution operations such as `RENAME COLUMN` on a Delta table -without the need to rewrite the underlying Parquet files. It also allows users -to name Delta table columns by using characters that are not allowed by Parquet, -such as spaces, so that users can directly ingest CSV or JSON data into Delta -without the need to rename columns due to previous character constraints. - -## Requirements - -- DBR 10.2 or above. -- Column mapping requires the Delta [table version](versioning.md) to be reader - version 2 and writer version 5. For a Delta table with the required table - version, you can enable column mapping by setting `delta.columnMappingMode` to - `name`. You can upgrade the table version and enable column mapping by using a - single `ALTER TABLE` command: - - - -```sql -ALTER TABLE SET TBLPROPERTIES ( -'delta.minReaderVersion' = '2', -'delta.minWriterVersion' = '5', -'delta.columnMapping.mode' = 'name' -) -``` - - - - - After you set these properties in the table, you can only read from and write - to this Delta table by using DBR 10.2 and above. - - -## Supported characters in column names - -When column mapping is enabled for a Delta table, you can include spaces as well -as any of these characters in the table's column names: `,;{}()\n\t=` . - -## Rename a column - -When column mapping is enabled for a Delta table, you can rename a column: - - - -```sql -ALTER TABLE RENAME COLUMN old_col_name TO new_col_name -``` - - - -For more examples, see [\_](/delta/delta-batch.md#rename-columns). diff --git a/src/pages/latest/delta-intro.mdx b/src/pages/latest/delta-intro.mdx index 6f706f3..b2d058b 100644 --- a/src/pages/latest/delta-intro.mdx +++ b/src/pages/latest/delta-intro.mdx @@ -14,26 +14,23 @@ metadata handling, and unifies [streaming](delta-streaming.md) and [batch](delta-batch.md) data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS. -For a quick overview and benefits of Delta Lake, watch this YouTube video (3 -minutes). - Specifically, Delta Lake offers: -- [ACID transactions](concurrency-control.md) on Spark: Serializable isolation +- [ACID transactions](/latest/concurrency-control) on Spark: Serializable isolation levels ensure that readers never see inconsistent data. - Scalable metadata handling: Leverages Spark distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease. -- [Streaming](delta-streaming.md) and [batch](delta-batch.md) unification: A +- [Streaming](/latest/delta-streaming) and [batch](/latest/delta-batch) unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box. - Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion. 
-- [Time travel](delta-batch.md#deltatimetravel): Data versioning enables +- [Time travel](/latest/delta-batch#deltatimetravel): Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments. -- [Upserts](delta-update.md#delta-merge) and - [deletes](delta-update.md#delta-delete): Supports merge, update and delete +- [Upserts](/latest/delta-update#upsert-into-a-table-using-merge) and + [deletes](/latest/delta-update#delete-from-a-table): Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on. diff --git a/src/pages/latest/delta-streaming.mdx b/src/pages/latest/delta-streaming.mdx index db1cd6f..f94d69a 100644 --- a/src/pages/latest/delta-streaming.mdx +++ b/src/pages/latest/delta-streaming.mdx @@ -273,7 +273,7 @@ The preceding example continuously updates a table that contains the aggregate n For applications with more lenient latency requirements, you can save computing resources with one-time triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that has arrived since the last update. -## Idempotent table writes in `foreachBatch` +## Idempotent table writes in foreachBatch Available in Delta Lake 2.0.0 and above. diff --git a/src/pages/latest/getting-started.mdx b/src/pages/latest/getting-started.mdx deleted file mode 100644 index 55c0a86..0000000 --- a/src/pages/latest/getting-started.mdx +++ /dev/null @@ -1,43 +0,0 @@ ---- -title: Getting Started with Delta Lake Spark -description: Learn how to start using Delta Lake Spark ---- - -Lorem ipsum dolor sit amet. - - - -```jsx -export const MyComponent = (props) => { - const { title, children } = props; - - return ( -
-    <div>
-      <div>
-        {title}
-      </div>
-      <div>
-        {children}
-      </div>
-    </div>
-  );
-};
-```
-
-```tsx
-import type { ReactElement, ReactNode } from "react";
-
-interface MyComponentProps {
-  title: string;
-  children?: ReactNode;
-}
-
-export const MyComponent = (props: MyComponentProps): ReactElement => {
-  const { title, children } = props;
-
-  return (
-    <div>
-      <div>
-        {title}
-      </div>
-      <div>
-        {children}
-      </div>
-    </div>
-  );
-};
-```
-
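The delta-streaming.mdx hunk above renames the "Idempotent table writes in foreachBatch" heading (a feature available in Delta Lake 2.0.0 and above). As a quick illustration of what that section covers, here is a minimal PySpark sketch; it assumes an active SparkSession named `spark`, and the paths and application ID are hypothetical placeholders, not values taken from this change.

```python
# Sketch only: idempotent Delta writes inside foreachBatch using the
# txnAppId/txnVersion writer options (Delta Lake 2.0.0+).
app_id = "nightly-ingest"  # hypothetical: any string that uniquely identifies this query

def write_batch(batch_df, batch_id):
    # txnVersion must increase monotonically; the streaming batch id works well.
    # If a micro-batch is retried, Delta skips the write because the same
    # (txnAppId, txnVersion) pair has already been committed.
    (batch_df.write
        .format("delta")
        .option("txnAppId", app_id)
        .option("txnVersion", batch_id)
        .mode("append")
        .save("/tmp/delta/events"))  # hypothetical output path

(spark.readStream
    .format("delta")
    .load("/tmp/delta/raw_events")  # hypothetical source path
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
    .start())
```

Because `batch_id` comes from Structured Streaming, a retried micro-batch presents the same version and the duplicate write is ignored.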
diff --git a/src/pages/latest/optimizations-oss.mdx b/src/pages/latest/optimizations-oss.mdx index 49c17b7..9c38c12 100644 --- a/src/pages/latest/optimizations-oss.mdx +++ b/src/pages/latest/optimizations-oss.mdx @@ -9,11 +9,10 @@ Delta Lake provides optimizations that accelerate data lake operations. To improve query speed, Delta Lake supports the ability to optimize the layout of data in storage. There are various ways to optimize the layout. + ### Compaction (bin-packing) -Note - This feature is available in Delta Lake 1.2.0 and above. @@ -58,11 +57,9 @@ deltaTable.optimize().where("date='2021-11-18'").executeCompaction() -For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc.html). +For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc). -Note - * Bin-packing optimization is _idempotent_, meaning that if it is run twice on the same dataset, the second run has no effect. * Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated. @@ -77,20 +74,17 @@ Readers of Delta tables use snapshot isolation, which means that they are not in ## Data skipping -Note - This feature is available in Delta Lake 1.2.0 and above. -Data skipping information is collected automatically when you write data into a Delta Lake table. Delta Lake takes advantage of this information (minimum and maximum values for each column) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply [Z-Ordering](/latest/optimizations-oss.html#-z-ordering-multi-dimensional-clustering). +Data skipping information is collected automatically when you write data into a Delta Lake table. Delta Lake takes advantage of this information (minimum and maximum values for each column) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply [Z-Ordering](#zordering-multidimensional-clustering). -Collecting statistics on a column containing long values such as string or binary is an expensive operation. To avoid collecting statistics on such columns you can configure the [table property](/latest/delta-batch.html#-table-properties) `delta.dataSkippingNumIndexedCols`. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the `delta.dataSkippingNumIndexedCols` property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the `delta.dataSkippingNumIndexedCols` property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the `delta.dataSkippingNumIndexedCols` property by using `[ALTER TABLE ALTER COLUMN](/latest/sql-ref-syntax-ddl-alter-table.html#alter-or-change-column)`. +Collecting statistics on a column containing long values such as string or binary is an expensive operation. 
To avoid collecting statistics on such columns, you can configure the [table property](/latest/table-properties) `delta.dataSkippingNumIndexedCols`. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the `delta.dataSkippingNumIndexedCols` property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the `delta.dataSkippingNumIndexedCols` property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the `delta.dataSkippingNumIndexedCols` property by using [`ALTER TABLE ALTER COLUMN`](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html#alter-or-change-column). + ## Z-Ordering (multi-dimensional clustering) -Note - This feature is available in Delta Lake 2.0.0 and above. @@ -131,15 +125,13 @@ deltaTable.optimize().where("date='2021-11-18'").executeZOrderBy(eventType) -For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc.html). +For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc). If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use `ZORDER BY`. -You can specify multiple columns for `ZORDER BY` as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on. See [Data skipping](https://docs.delta.io/latest/optimizations-oss.html#-data-skipping). +You can specify multiple columns for `ZORDER BY` as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on. See [Data skipping](#data-skipping). -Note - * Z-Ordering is _not idempotent_. Every time the Z-Ordering is executed, it will try to create a new clustering of data in all files (new and existing files that were part of previous Z-Ordering) in a partition. * Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times. * For example, if you `ZORDER BY` _date_ and your most recent records are all much wider (for example longer arrays or string values) than the ones in the past, it is expected that the `OPTIMIZE` job’s task durations will be skewed, as well as the resulting file sizes.
This is, however, only a problem for the `OPTIMIZE` command itself; it should not have any negative impact on subsequent queries. + ## Multi-part checkpointing -Note - This feature is available in Delta Lake 2.0.0 and above. This feature is in experimental support mode. @@ -160,7 +151,5 @@ Delta Lake table periodically and automatically compacts all the incremental upd Delta Lake protocol allows [splitting the checkpoint](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints) into multiple Parquet files. This parallelizes and speeds up writing the checkpoint. In Delta Lake, by default each checkpoint is written as a single Parquet file. To use this feature, set the SQL configuration `spark.databricks.delta.checkpoint.partSize=<n>`, where `n` is the limit on the number of actions (such as `AddFile`) at which Delta Lake on Apache Spark will start parallelizing the checkpoint and attempt to write a maximum of this many actions per checkpoint file. -Note - This feature requires no reader-side configuration changes. The existing reader already supports reading a checkpoint with multiple files. diff --git a/src/pages/latest/table-properties.mdx b/src/pages/latest/table-properties.mdx index 336c8d4..b705d9f 100644 --- a/src/pages/latest/table-properties.mdx +++ b/src/pages/latest/table-properties.mdx @@ -7,14 +7,14 @@ Delta Lake reserves Delta table properties starting with `delta`. These properti | Property | Description | Data Type | Default | |-|-|-|-| -| `delta.appendOnly` | `true` for this Delta table to be append-only. If append-only, existing records cannot be deleted, and existing values cannot be updated. See [Table properties](/latest/delta-batch.html#-table-properties). | `Boolean` | `false`| -|`delta.checkpoint.writeStatsAsJson` | `true` for Delta Lake to write file statistics in checkpoints in JSON format for the stats column. | `Boolean` | `true`| -|`delta.checkpoint.writeStatsAsStruct` | `true` for Delta Lake to write file statistics to checkpoints in struct format for the stats_parsed column and to write partition values as a struct for partitionValues_parsed. | `Boolean`| (none)| -|`delta.compatibility.symlinkFormatManifest.enabled`|`true` for Delta Lake to configure the Delta table so that all write operations on the table automatically update the manifests. See Step 3: [Update manifests](https://docs.delta.io/latest/presto-integration.html#-step-3-update-manifests).|`Boolean`|`false`| -|`delta.dataSkippingNumIndexedCols`|The number of columns for Delta Lake to collect statistics about for data skipping. A value of -1 means to collect statistics for all columns. Updating this property does not automatically collect statistics again; instead, it redefines the statistics schema of the Delta table. For example, it changes the behavior of future statistics collection (such as during appends and optimizations) as well as data skipping (such as ignoring column statistics beyond this number, even when such statistics exist).|`Int`|32| -|`delta.deletedFileRetentionDuration`|The shortest duration for Delta Lake to keep logically deleted data files before deleting them physically. This is to prevent failures in stale readers after compactions or partition overwrites. This value should be large enough to ensure that: 1. It is larger than the longest possible duration of a job if you run VACUUM when there are concurrent readers or writers accessing the Delta table. 2. If you run a streaming query that reads from the table, that the query does not stop for longer than this value.
Otherwise, the query may not be able to restart, as it must still read old files. See [Data retention](/latest/delta-batch.html#-data-retention).|`CalendarInterval`|`interval 1 week`| -|`delta.enableChangeDataFeed`|`true` to enable change data feed. See [Enable change data feed](/latest/delta-change-data-feed.html#-enable-change-data-feed).|`Boolean`|`false`| -|`delta.logRetentionDuration`|How long the history for a Delta table is kept. Each time a checkpoint is written, Delta Lake automatically cleans up log entries older than the retention interval. If you set this property to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel but will become more expensive as the log size increases. See [Data retention](/latest/delta-batch.html#-data-retention).|`CalendarInterval`|`interval 30 days`| -|`delta.minReaderVersion`|The minimum required protocol reader version for a reader that allows to read from this Delta table. See [Table protocol versioning](/latest/versioning.html).|`Int`|`1`| -|`delta.minWriterVersion`|The minimum required protocol writer version for a writer that allows to write to this Delta table. See [Table protocol versioning](/latest/versioning.html).|`Int`|`2`| -|`delta.setTransactionRetentionDuration`|The shortest duration within which new snapshots will retain transaction identifiers (for example, SetTransactions). When a new snapshot sees a transaction identifier older than or equal to the duration specified by this property, the snapshot considers it expired and ignores it. The SetTransaction identifier is used when making the writes idempotent. See [Idempotent table writes in foreachBatch](https://docs.delta.io/latest/delta-streaming.html#-idempotent-table-writes-in-foreachbatch) for details.|`CalendarInterval`|(none)| \ No newline at end of file +| `delta.appendOnly` | `true` for this Delta table to be append-only. If append-only, existing records cannot be deleted, and existing values cannot be updated. See [Table properties](/latest/delta-batch#table-properties). | `Boolean` | `false`| +|`delta.checkpoint.writeStatsAsJson` | `true` for Delta Lake to write file statistics in checkpoints in JSON format for the `stats` column. | `Boolean` | `true`| +|`delta.checkpoint.writeStatsAsStruct` | `true` for Delta Lake to write file statistics to checkpoints in struct format for the `stats_parsed` column and to write partition values as a struct for `partitionValues_parsed`. | `Boolean`| (none)| +|`delta.compatibility.symlinkFormatManifest.enabled`|`true` for Delta Lake to configure the Delta table so that all write operations on the table automatically update the manifests. See [Step 3: Update manifests](/latest/presto-integration#step-3-update-manifests).|`Boolean`|`false`| +|`delta.dataSkippingNumIndexedCols`|The number of columns for Delta Lake to collect statistics about for data skipping. A value of -1 means to collect statistics for all columns. Updating this property does not automatically collect statistics again; instead, it redefines the statistics schema of the Delta table. For example, it changes the behavior of future statistics collection (such as during appends and optimizations) as well as data skipping (such as ignoring column statistics beyond this number, even when such statistics exist). 
See [Data skipping](/latest/optimizations-oss#data-skipping).|`Int`|32| +|`delta.deletedFileRetentionDuration`|The shortest duration for Delta Lake to keep logically deleted data files before deleting them physically. This is to prevent failures in stale readers after compactions or partition overwrites. This value should be large enough to ensure that: 1. It is larger than the longest possible duration of a job if you run `VACUUM` when there are concurrent readers or writers accessing the Delta table. 2. If you run a streaming query that reads from the table, the query does not stop for longer than this value. Otherwise, the query may not be able to restart, as it must still read old files. See [Data retention](/latest/delta-batch#data-retention).|`CalendarInterval`|`interval 1 week`| +|`delta.enableChangeDataFeed`|`true` to enable change data feed. See [Enable change data feed](/latest/delta-change-data-feed#enable-change-data-feed).|`Boolean`|`false`| +|`delta.logRetentionDuration`|How long the history for a Delta table is kept. Each time a checkpoint is written, Delta Lake automatically cleans up log entries older than the retention interval. If you set this property to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel but will become more expensive as the log size increases. See [Data retention](/latest/delta-batch#data-retention).|`CalendarInterval`|`interval 30 days`| +|`delta.minReaderVersion`|The minimum protocol reader version required to read from this Delta table. See [Table protocol versioning](/latest/versioning).|`Int`|`1`| +|`delta.minWriterVersion`|The minimum protocol writer version required to write to this Delta table. See [Table protocol versioning](/latest/versioning).|`Int`|`2`| +|`delta.setTransactionRetentionDuration`|The shortest duration within which new snapshots will retain transaction identifiers (for example, `SetTransaction`s). When a new snapshot sees a transaction identifier older than or equal to the duration specified by this property, the snapshot considers it expired and ignores it. The `SetTransaction` identifier is used when making the writes idempotent. See [Idempotent table writes in foreachBatch](/latest/delta-streaming#idempotent-table-writes-in-foreachbatch) for details.|`CalendarInterval`|(none)| \ No newline at end of file
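To make the reserved properties in the table-properties.mdx table above more concrete, here is a minimal PySpark sketch of setting and inspecting them. It assumes an active SparkSession named `spark` and an existing Delta table; the table name `events` and the chosen values are hypothetical placeholders, not part of this change.

```python
# Sketch only: applying reserved delta.* table properties with SQL from PySpark.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.appendOnly' = 'true',
      'delta.logRetentionDuration' = 'interval 30 days',
      'delta.dataSkippingNumIndexedCols' = '8'
    )
""")

# Inspect which properties are now set on the table.
spark.sql("SHOW TBLPROPERTIES events").show(truncate=False)
```

Properties set this way are recorded in the Delta transaction log, so they travel with the table rather than with any single cluster or session.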