Commit 756b2f2

committed
updates
1 parent 0b3efbe commit 756b2f2

1 file changed

+69
-13
lines changed

docs/user_guides/fs/feature_group/data_types.md

+69-13
@@ -120,7 +120,7 @@ When a feature is being used as a primary key, certain types are not allowed.
120120
Examples of such types are *FLOAT*, *DOUBLE*, *TEXT* and *BLOB*.
121121
Additionally, the size of the sum of the primary key online data types storage requirements **should not exceed 4KB**.
122122

123-
#### Online restrictions for row size
123+
#### Online restrictions for row size
124124

125125
The online feature store supports **up to 500 columns** and all column types combined **should not exceed 30000 Bytes**.
126126
The byte size of each column is determined by its data type and calculated as follows:
@@ -145,18 +145,74 @@ The byte size of each column is determined by its data type and calculated as fo
145145

146146

147147
#### Pre-insert schema validation for online feature groups
148-
149-
The input dataframe can be validated for schema as per the valid online schema data types before online ingestion. The most important checks are mentioned below along with possible corrective actions. It is enabled by setting the keyword argument `validation_options={'run_validation':True}` in the `insert()` API of feature groups.
150-
151-
152-
153-
| Error type | Requirement | Suggested corrections |
154-
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
155-
| Primary key contains null values | Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls. | Remove the null rows from dataframe. OR impute the null values as applicable. |
156-
| Primary key column is missing | The dataframe to be inserted must contain all the features defined in the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
157-
| Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema while feature group creation. | Add the event time column in the dataframe. |
158-
| String length exceeded | The character length of a string row exceeds the maximum length specified in feature online schema. However, if the feature group is not created and if no explicit schema was provided during feature group creation, then the length will be auto-increased to the maximum length found in a string column. This is handled during the first data ingestion and no user action is needed in this case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within maximum set during feature group creation. OR remove the invalid rows. If the lengths are very long consider changing the feature schema to **TEXT** or **BLOB.** |
159-
148+
For online-enabled feature groups, the dataframe to be ingested must adhere to the online schema definitions. The input dataframe can be validated against this schema before ingestion.
149+
Validation is enabled by passing the following option when calling `insert()`:
150+
=== "Python"
151+
```python
152+
feature_group.insert(df, validation_options={'run_validation':True})
153+
```
154+
The most important validation checks are listed below, along with possible corrective actions.
155+
156+
1. Primary key contains null values
157+
158+
- **Rule** Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls.
159+
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign each of them a unique value according to your preferred data imputation strategy.
160+
161+
=== "Pandas"
162+
```python
163+
# Assuming 'id' is the primary key column
164+
df = df.dropna(subset=['id'])
165+
# For composite keys
166+
df = df.dropna(subset=['id1', 'id2'])
167+
```
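Before dropping or imputing rows, it can help to see which primary key columns actually contain nulls. The following is a minimal sketch and not part of the Hopsworks API: `null_primary_key_counts` is a hypothetical helper, and the column names `id1`, `id2` and `value` are illustrative assumptions.

```python
import pandas as pd


def null_primary_key_counts(df, primary_key):
    """Hypothetical helper: map each primary key column that has nulls to its null count."""
    counts = df[primary_key].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Illustrative dataframe with a composite primary key (id1, id2)
df = pd.DataFrame(
    {
        "id1": [1, None, 3],
        "id2": ["a", "b", None],
        "value": [0.1, 0.2, 0.3],
    }
)

# Inspect the problem first, then drop the offending rows
report = null_primary_key_counts(df, ["id1", "id2"])
clean_df = df.dropna(subset=["id1", "id2"])
```

An empty `report` would indicate that the primary key columns are free of nulls and no rows need to be dropped.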
168+
169+
2. Primary key column missing
170+
171+
- **Rule** The dataframe to be inserted must contain all the columns defined as primary key(s) in the feature group.
172+
- **Example correction** Add all the primary key columns in the dataframe.
173+
174+
=== "Pandas"
175+
```python
176+
# Add missing primary key column
177+
df['id'] = some_value
178+
# If the primary key is an auto-incrementing integer
179+
df['id'] = range(1, len(df) + 1)
180+
```
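A quick local check can list which primary key columns are absent before any are added. This is a sketch under assumptions: `missing_primary_key_columns` is a hypothetical helper rather than a Hopsworks function, and the column names are illustrative.

```python
import pandas as pd


def missing_primary_key_columns(df, primary_key):
    """Hypothetical helper: return the primary key columns not present in the dataframe."""
    return [col for col in primary_key if col not in df.columns]


# Illustrative dataframe that lacks the 'id2' part of a composite key
df = pd.DataFrame({"id1": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

missing = missing_primary_key_columns(df, ["id1", "id2"])
```

Here `missing` contains only the columns that still have to be added to the dataframe before calling `insert()`.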
181+
182+
3. String length exceeded
183+
184+
- **Rule** The character length of a string must be within the maximum length allowed by the online schema type of the feature. If the feature group has not been created yet and no explicit feature schema was provided, the limit is automatically increased to the maximum length found in that string column of the dataframe. This is handled during the first data ingestion and requires no user action.
185+
- **Example correction**
186+
Trim the string values to fit within the maximum length set during feature group creation.
187+
188+
=== "Pandas"
189+
```python
190+
max_length = 100
191+
df['text_column'] = df['text_column'].str.slice(0, max_length)
192+
```
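Truncation silently changes the data, so it may be worth inspecting the offending rows before trimming or removing them. The sketch below assumes a limit of 100 characters and a column named `text_column`, as in the example above. `rows_exceeding_length` is a hypothetical helper, not part of the Hopsworks API.

```python
import pandas as pd


def rows_exceeding_length(df, column, max_length):
    """Hypothetical helper: return the rows whose string value is longer than max_length."""
    # Null values have no length; treat them as length 0 so they are never flagged
    lengths = df[column].str.len().fillna(0)
    return df[lengths > max_length]


# Illustrative dataframe: one value is longer than the assumed limit
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "text_column": ["short", "x" * 150, None],
    }
)

too_long = rows_exceeding_length(df, "text_column", 100)
```

The rows in `too_long` can then be trimmed, dropped, or used as a signal that the feature should be defined with a larger online type.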
193+
194+
!!!note
195+
The total row size should not exceed 30000 bytes, as per the [row size restrictions](#online-restrictions-for-row-size). If the string values are too long to fit within this limit, it is possible to define the feature as **TEXT** or **BLOB**.
196+
Below is an example of explicitly defining the online type of a string column as TEXT.
197+
198+
=== "Pandas"
199+
```python
200+
import pandas as pd
201+
# Example dummy dataframe with the string column
202+
df = pd.DataFrame(columns=['id', 'string_col'])
203+
from hsfs.feature import Feature
204+
features = [
205+
Feature(name="id",type="bigint",online_type="bigint"),
206+
Feature(name="string_col",type="string",online_type="text")
207+
]
208+
209+
fg = fs.get_or_create_feature_group(name="fg_manual_text_schema",
210+
version=1,
211+
features=features,
212+
online_enabled=True,
213+
primary_key=['id'])
214+
fg.insert(df)
215+
```
160216

161217
### Timestamps and Timezones
162218
