
Commit c134020

Add delta lake to guides (logicalclocks#355)
Co-authored-by: Jim Dowling <jim@hopsworks.ai>
1 parent 1b7af20 commit c134020

1 file changed: +16 -9 lines

docs/user_guides/fs/feature_group/create.md

@@ -32,6 +32,7 @@ The first step to create a feature group is to create the API metadata object re
     primary_key=['location_id'],
     partition_key=['day'],
     event_time='event_time',
+    time_travel_format='DELTA',
 )
 ```

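For context, the hunk above is the tail of a `create_feature_group` call. A minimal sketch of what the full call might look like with Delta as the table format (the feature group name, version, description, and the `fs` feature store handle are illustrative assumptions, not part of this diff):

```python
# Sketch only: create a feature group stored as a Delta table.
fg = fs.create_feature_group(
    name="weather_measurements",    # hypothetical name
    version=1,
    description="Hourly weather measurements per location",  # hypothetical
    primary_key=['location_id'],
    partition_key=['day'],
    event_time='event_time',
    time_travel_format='DELTA',     # the new option introduced by this commit
)
```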
@@ -47,7 +48,7 @@ The last parameter used in the examples above is `stream`. The `stream` paramete
 
 ##### Primary key
 
-A primary key is required when using the default Hudi file format to store offline feature data. When inserting data in a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group. If a row with the same primary key is found in the feature group, the row will be updated. If the primary key is not found, the row is appended to the feature group.
+A primary key is required when using the default table format (Hudi) to store offline feature data. When inserting data in a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group. If a row with the same primary key is found in the feature group, the row will be updated. If the primary key is not found, the row is appended to the feature group.
 When writing data on the online feature store, existing rows with the same primary key will be overwritten by new rows with the same primary key.
 
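To make the upsert semantics concrete, a small sketch (assuming a feature group handle `fg` with primary key `location_id` and a hypothetical `temp` feature):

```python
import pandas as pd

# First insert: no matching primary keys exist yet, so both rows are appended.
fg.insert(pd.DataFrame({"location_id": [1, 2], "temp": [20.5, 18.0]}))

# Second insert: location_id=1 matches an existing row, so that row is
# updated; location_id=3 is not found, so its row is appended.
fg.insert(pd.DataFrame({"location_id": [1, 3], "temp": [21.0, 15.5]}))
```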
 ##### Event time
@@ -80,6 +81,11 @@ MaxDirectoryItemsExceededException - The directory item limit is exceeded: limit
 
 By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.
 
+##### Table format
+
+When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter. The currently supported values are "HUDI", "DELTA", and "NONE" (which defaults to Parquet).
+
+
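A compact sketch of the three accepted values (the calls are illustrative; only the `time_travel_format` parameter itself comes from the paragraph above):

```python
# Hudi, the default table format:
fs.create_feature_group(name="fg_hudi", version=1, primary_key=['id'],
                        time_travel_format='HUDI')

# Delta Lake:
fs.create_feature_group(name="fg_delta", version=1, primary_key=['id'],
                        time_travel_format='DELTA')

# No time travel, plain Parquet files:
fs.create_feature_group(name="fg_parquet", version=1, primary_key=['id'],
                        time_travel_format='NONE')
```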
 #### Streaming Write API
 
 As explained above, the stream parameter controls whether to enable the streaming write APIs to the online and offline feature store.
@@ -95,6 +101,7 @@ For Python environments, only the stream API is supported (stream=True).
     primary_key=['location_id'],
     partition_key=['day'],
     event_time='event_time',
+    time_travel_format='HUDI',
 )
 ```
 
@@ -108,6 +115,7 @@ For Python environments, only the stream API is supported (stream=True).
     primary_key=['location_id'],
     partition_key=['day'],
     event_time='event_time',
+    time_travel_format='HUDI',
     stream=True
 )
 ```
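Whichever combination of `time_travel_format` and `stream` is chosen at creation time, writing afterwards goes through the same call; a minimal usage sketch (assuming a DataFrame `df` that matches the feature group schema):

```python
# The same insert call is used for both the batch API (stream=False)
# and the stream API (stream=True):
fg.insert(df)
```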
@@ -132,8 +140,8 @@ By default, feature groups in hopsworks will share a project-wide topic.
 #### Best Practices for Writing
 
 When designing a feature group, it is worth taking a look at how this feature group will be queried in the future, in order to optimize it for those query patterns.
-At the same time, Spark and Hudi tend to overpartition writes, creatingtoo many small parquet files, which is inefficient and slowing down the write.
-But they also slow down queries, because file listings are taking more time, but also reading many small files is usually slower.
+At the same time, Spark and Hudi tend to overpartition writes, creating too many small parquet files, which is inefficient and slows down writes.
+But they also slow down queries, because file listings take more time and reading many small files is slower than fewer larger files.
 The best practices described in this section hold both for the Streaming API and the Batch API.
 
 Four main considerations influence the write and the query performance:
@@ -145,8 +153,7 @@ Four main considerations influence the write and the query performance:
 
 ##### Partitioning on a feature group level
 
-**Partitioning on the feature group level** allows Hopsworks and Hudi to push down filters to the filesystem during training dataset or batch data generation.
-In practice that means, less directories need to be listed and less files need to be read, speeding up queries.
+**Partitioning on the feature group level** allows Hopsworks and the table format (Hudi or Delta) to push down filters to the filesystem when reading from feature groups. In practice, that means fewer directories need to be listed and fewer files need to be read, speeding up queries.
 
 For example, most commonly, filtering is done on the event time column of a feature group when generating training data or batches of data:
 ```python
@@ -199,13 +206,13 @@ fg = feature_store.create_feature_group(...
 ##### Parquet file size within a feature group partition
 
 Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways to
-influence how Hudi will **split the data between parquet files within the feature group partitions**.
+influence how the table format (Hudi or Delta) will **split the data between parquet files within the feature group partitions**.
 The two things that influence the number of parquet files per partition are
 
 1. The number of feature group partitions written in a single insert
-2. The shuffle parallelism used by Hudi
+2. The shuffle parallelism used by the table format
 
-In general, the inserted dataframe (unique combination of partition key values) will be parallised according to the following Hudi settings:
+For example, the inserted DataFrame (its unique combinations of partition key values) will be parallelized according to the following Hudi settings:
 !!! example "Default Hudi partitioning"
     ```python
     write_options = {
@@ -261,7 +268,7 @@ In that case you can increase the Hudi shuffle parallelism accordingly.
 
 When creating a feature group that uses streaming write APIs for data ingestion it is possible to define the Kafka topics that should be utilized.
 The default approach of using a project-wide topic functions great for use cases involving little to no overlap when producing data. However,
-concurrently inserting into multiple feature groups could cause read amplification for the Hudi delta streamer job. Therefore, it is
+concurrently inserting into multiple feature groups could cause read amplification for the offline materialization job (e.g., Hudi Delta Streamer). Therefore, it is
 advised to utilize separate topics when ingestions overlap or there is a large frequently running insertion into a specific feature group.
 
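A sketch of giving a busy feature group its own topic (assuming the `topic_name` parameter of `create_feature_group` available in recent hsfs versions; all names are illustrative):

```python
# Dedicated Kafka topic instead of the project-wide default, to avoid
# read amplification in the offline materialization job.
fg = fs.create_feature_group(
    name="clickstream_features",              # hypothetical name
    version=1,
    primary_key=['session_id'],
    stream=True,
    topic_name="clickstream_features_topic",  # assumed parameter
)
```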
 ### Register the metadata and save the feature data
