docs/user_guides/fs/feature_group/create.md (+16 −9)
@@ -32,6 +32,7 @@ The first step to create a feature group is to create the API metadata object re
     primary_key=['location_id'],
     partition_key=['day'],
     event_time='event_time',
+    time_travel_format='DELTA',
 )
 ```
 
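For orientation, here is a hedged, self-contained version of the call this hunk modifies. Only the keyword arguments shown in the hunk come from the original; the feature group name and the connection boilerplate are illustrative assumptions.

```python
import hopsworks

# Connect to the project's feature store (standard Hopsworks client calls).
project = hopsworks.login()
fs = project.get_feature_store()

# Metadata object for the feature group; 'weather' is an assumed name.
fg = fs.create_feature_group(
    name="weather",
    version=1,
    primary_key=["location_id"],
    partition_key=["day"],
    event_time="event_time",
    time_travel_format="DELTA",  # "HUDI" (default), "DELTA", or "NONE"
)
```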
@@ -47,7 +48,7 @@ The last parameter used in the examples above is `stream`. The `stream` paramete
 
 ##### Primary key
 
-A primary key is required when using the default Hudi file format to store offline feature data. When inserting data in a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group. If a row with the same primary key is found in the feature group, the row will be updated. If the primary key is not found, the row is appended to the feature group.
+A primary key is required when using the default table format (Hudi) to store offline feature data. When inserting data into a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group. If a row with the same primary key is found in the feature group, the row is updated. If the primary key is not found, the row is appended to the feature group.
 When writing data on the online feature store, existing rows with the same primary key will be overwritten by new rows with the same primary key.
 
 ##### Event time
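To make the upsert semantics concrete, a minimal sketch that reuses the `fg` handle from the sketch above; the column values are made up:

```python
import pandas as pd

# First insert: two new rows keyed by location_id.
df_v1 = pd.DataFrame({
    "location_id": [1, 2],
    "day": ["2024-01-01", "2024-01-01"],
    "event_time": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:00"]),
})
fg.insert(df_v1)

# Second insert: key 1 already exists, key 3 does not.
df_v2 = pd.DataFrame({
    "location_id": [1, 3],
    "day": ["2024-01-01", "2024-01-01"],
    "event_time": pd.to_datetime(["2024-01-01 11:00", "2024-01-01 11:00"]),
})
# Offline store: the row with primary key 1 is updated, key 3 is appended.
# Online store: the row for key 1 is overwritten with the newest values.
fg.insert(df_v2)
```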
@@ -80,6 +81,11 @@ MaxDirectoryItemsExceededException - The directory item limit is exceeded: limit
 
 By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.
 
+##### Table format
+
+When you create a feature group, you can specify the table format used to store the data in your feature group by setting the `time_travel_format` parameter. The currently supported values are "HUDI", "DELTA", and "NONE" (which falls back to plain Parquet).
+
+
 #### Streaming Write API
 
 As explained above, the stream parameter controls whether to enable the streaming write APIs to the online and offline feature store.
@@ -95,6 +101,7 @@ For Python environments, only the stream API is supported (stream=True).
     primary_key=['location_id'],
     partition_key=['day'],
-    event_time='event_time'
+    event_time='event_time',
+    time_travel_format='HUDI',
 )
 ```
 
@@ -108,6 +115,7 @@ For Python environments, only the stream API is supported (stream=True).
     primary_key=['location_id'],
     partition_key=['day'],
     event_time='event_time',
+    time_travel_format='HUDI',
     stream=True
 )
 ```
@@ -132,8 +140,8 @@ By default, feature groups in hopsworks will share a project-wide topic.
 #### Best Practices for Writing
 
 When designing a feature group, it is worth taking a look at how this feature group will be queried in the future, in order to optimize it for those query patterns.
-At the same time, Spark and Hudi tend to overpartition writes, creatingtoo many small parquet files, which is inefficient and slowing down the write.
-But they also slow down queries, because file listings are taking more time, but also reading many small files is usually slower.
+At the same time, Spark and Hudi tend to overpartition writes, creating too many small parquet files, which is inefficient and slows down writes.
+Small files also slow down queries, because file listings take more time and reading many small files is slower than reading fewer, larger files.
 The best practices described in this section hold both for the Streaming API and the Batch API.
 
 Four main considerations influence the write and the query performance:
@@ -145,8 +153,7 @@ Four main considerations influence the write and the query performance:
 
 ##### Partitioning on a feature group level
 
-**Partitioning on the feature group level** allows Hopsworks and Hudi to push down filters to the filesystem during training dataset or batch data generation.
-In practice that means, less directories need to be listed and less files need to be read, speeding up queries.
+**Partitioning on the feature group level** allows Hopsworks and the table format (Hudi or Delta) to push down filters to the filesystem when reading from feature groups. In practice, that means fewer directories need to be listed and fewer files need to be read, speeding up queries.
 
 For example, most commonly, filtering is done on the event time column of a feature group when generating training data or batches of data:
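The hunk cuts off just before the document's own example; as a rough sketch of such a filter (assuming the HSFS query API and the `fg` handle from earlier):

```python
# Filtering on the event-time column lets Hopsworks and the table format
# prune partitions instead of listing and reading every file.
batch_df = (
    fg.select_all()
      .filter(fg.event_time >= "2024-01-01")
      .read()
)
```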
@@ -199,13 +206,13 @@
 ##### Parquet file size within a feature group partition
 
 Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways in order to
-influence how Hudi will **split the data between parquet files within the feature group partitions**.
+influence how the table format (Hudi or Delta) will **split the data between parquet files within the feature group partitions**.
 The two things that influence the number of parquet files per partition are
 
 1. The number of feature group partitions written in a single insert
-2. The shuffle parallelism used by Hudi
+2. The shuffle parallelism used by the table format
 
-In general, the inserted dataframe (unique combination of partition key values) will be parallised according to the following Hudi settings:
+For example, the inserted dataframe (each unique combination of partition key values) will be parallelized according to the following Hudi settings:
 !!! example "Default Hudi partitioning"
     ```python
     write_options = {
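The hunk truncates the options dictionary here. As a hedged illustration, settings of this kind usually look like the following; the keys are standard Hudi configuration options, but the exact keys and values in the original file are not visible in this diff:

```python
# Standard Hudi shuffle-parallelism options; the values are illustrative.
write_options = {
    "hoodie.bulkinsert.shuffle.parallelism": "5",
    "hoodie.insert.shuffle.parallelism": "5",
    "hoodie.upsert.shuffle.parallelism": "5",
}

# Passed through to Hudi when writing a DataFrame, e.g. the one from above.
fg.insert(df_v2, write_options=write_options)
```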
@@ -261,7 +268,7 @@ In that case you can increase the Hudi shuffle parallelism accordingly.
 
 When creating a feature group that uses streaming write APIs for data ingestion it is possible to define the Kafka topics that should be utilized.
 The default approach of using a project-wide topic functions great for use cases involving little to no overlap when producing data. However,
-concurrently inserting into multiple feature groups could cause read amplification for the Hudi delta streamer job. Therefore, it is
+concurrently inserting into multiple feature groups could cause read amplification for the offline materialization job (e.g., Hudi Delta Streamer). Therefore, it is
 advised to utilize separate topics when ingestions overlap or there is a large frequently running insertion into a specific feature group.
 
 ### Register the metadata and save the feature data
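Returning to the Kafka-topic advice in the hunk above: a minimal sketch of assigning a dedicated topic to a feature group, assuming the `topic_name` parameter of `create_feature_group` (the names are made up):

```python
# A dedicated topic avoids read amplification when several feature groups
# are ingested concurrently through the streaming write path.
fg = fs.create_feature_group(
    name="transactions",              # assumed name
    version=1,
    primary_key=["tx_id"],            # assumed key
    stream=True,
    topic_name="transactions_topic",  # dedicated Kafka topic for this group
)
```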
0 commit comments