Commit 72eb5b3

Merge branch 'ibis-project:main' into fix-all-unique-value-when-scale
2 parents d6446c7 + 6dce35e commit 72eb5b3

21 files changed (+622, -489 lines changed)

docs/_quarto.yml

Lines changed: 1 addition & 1 deletion
@@ -214,9 +214,9 @@ quartodoc:
             name: Temporal feature extraction
             desc: Feature extraction for temporal columns
           contents:
-            - ExpandDateTime
             - ExpandDate
             - ExpandTime
+            - ExpandTimestamp
 
         - kind: page
           path: steps-other

docs/index.qmd

Lines changed: 3 additions & 2 deletions
@@ -11,8 +11,9 @@ hide-description: true
 
 - Preprocess your data at scale on any [Ibis](https://ibis-project.org/)-supported
   backend.
-- Compose [`Recipe`](/reference/core.html#ibis_ml.Recipe)s with other scikit-learn
-  estimators using
+- Compose
+  [`Recipe`](https://ibis-project.github.io/ibis-ml/reference/core.html#ibis_ml.Recipe)s
+  with other scikit-learn estimators using
   [`Pipeline`](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators)s.
 - Seamlessly integrate with [scikit-learn](https://scikit-learn.org/stable/),
   [XGBoost](https://xgboost.readthedocs.io/en/stable/python/sklearn_estimator.html), and

docs/reference/support-matrix/step_config.yml

Lines changed: 24 additions & 24 deletions
@@ -90,7 +90,30 @@ ExpandDate:
         components:
           - doy
 
-ExpandDateTime:
+ExpandTime:
+  configurations:
+    - name: h
+      config:
+        inputs: time
+        components:
+          - hour
+    - name: m
+      config:
+        inputs: time
+        components:
+          - minute
+    - name: s
+      config:
+        inputs: time
+        components:
+          - second
+    - name: ms
+      config:
+        inputs: time
+        components:
+          - millisecond
+
+ExpandTimestamp:
   configurations:
     - name: ms
       config:
@@ -137,26 +160,3 @@ ExpandDateTime:
         inputs: timestamp
         components:
           - doy
-
-ExpandTime:
-  configurations:
-    - name: h
-      config:
-        inputs: time
-        components:
-          - hour
-    - name: m
-      config:
-        inputs: time
-        components:
-          - minute
-    - name: s
-      config:
-        inputs: time
-        components:
-          - second
-    - name: ms
-      config:
-        inputs: time
-        components:
-          - millisecond
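
For orientation, each of the four `ExpandTime` configurations in this file requests a single component of a time value. The sketch below shows what those component names denote, using only Python's standard library. It is not ibis-ml code: the helper `time_components` is hypothetical, and how ibis-ml computes these values on a given backend may differ.

```python
from datetime import time


def time_components(t: time) -> dict[str, int]:
    """Hypothetical helper: split a time into the components named in the config."""
    return {
        "hour": t.hour,
        "minute": t.minute,
        "second": t.second,
        # datetime.time stores microseconds; integer-divide to get milliseconds
        "millisecond": t.microsecond // 1000,
    }


# 13:45:30.250 expressed with 250,000 microseconds
parts = time_components(time(13, 45, 30, 250_000))
print(parts)
```

Each configuration in the support matrix would correspond to selecting one key from a dictionary like this.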

docs/tutorial/pytorch.qmd

Lines changed: 12 additions & 32 deletions
@@ -102,7 +102,7 @@ flight_data = (
         "time_hour",
     )
     # Exclude missing data
-    .dropna()
+    .drop_null()
 )
 flight_data
 ```
@@ -122,44 +122,24 @@ To get started, let's split this single dataset into two: a _training_ set and a
 Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. [It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.](https://www.euclaim.com/blog/flight-numbers-explained#:~:text=Can%20flight%20numbers%20be%20reused,of%20travel%20is%20always%20unique.)
 
 ```{python}
-flight_data_with_unique_key = flight_data.mutate(
-    unique_key=ibis.literal(",").join(
-        [flight_data.carrier, flight_data.flight.cast(str), flight_data.date.cast(str)]
-    )
-)
-flight_data_with_unique_key
-```
-
-```{python}
-# FIXME(deepyaman): Proposed key isn't unique for actual departure date.
-flight_data_with_unique_key.group_by("unique_key").mutate(
-    cnt=flight_data_with_unique_key.count()
-)[ibis._.cnt > 1]
-```
-
-```{python}
-import random
-
-# Fix the random numbers by setting the seed
-# This enables the analysis to be reproducible when random numbers are used
-random.seed(222)
-
-# Put 3/4 of the data into the training set
-random_key = str(random.getrandbits(256))
-data_split = flight_data_with_unique_key.mutate(
-    train=(flight_data_with_unique_key.unique_key + random_key).hash().abs() % 4 < 3
-)
+import ibis_ml as ml
 
 # Create data frames for the two sets:
-train_data = data_split[data_split.train].drop("unique_key", "train")
-test_data = data_split[~data_split.train].drop("unique_key", "train")
+train_data, test_data = ml.train_test_split(
+    flight_data,
+    unique_key=["carrier", "flight", "date"],
+    # Put 3/4 of the data into the training set
+    test_size=0.25,
+    num_buckets=4,
+    # Fix the random numbers by setting the seed
+    # This enables the analysis to be reproducible when random numbers are used
+    random_seed=222,
+)
 ```
 
 ## Create features
 
 ```{python}
-import ibis_ml as ml
-
 flights_rec = ml.Recipe(
     ml.ExpandDate("date", components=["dow", "month"]),
     ml.Drop("date"),
docs/tutorial/scikit-learn.qmd

Lines changed: 12 additions & 32 deletions
@@ -101,7 +101,7 @@ flight_data = (
         "time_hour",
     )
     # Exclude missing data
-    .dropna()
+    .drop_null()
 )
 flight_data
 ```
@@ -121,44 +121,24 @@ To get started, let's split this single dataset into two: a _training_ set and a
 Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. [It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.](https://www.euclaim.com/blog/flight-numbers-explained#:~:text=Can%20flight%20numbers%20be%20reused,of%20travel%20is%20always%20unique.)
 
 ```{python}
-flight_data_with_unique_key = flight_data.mutate(
-    unique_key=ibis.literal(",").join(
-        [flight_data.carrier, flight_data.flight.cast(str), flight_data.date.cast(str)]
-    )
-)
-flight_data_with_unique_key
-```
-
-```{python}
-# FIXME(deepyaman): Proposed key isn't unique for actual departure date.
-flight_data_with_unique_key.group_by("unique_key").mutate(
-    cnt=flight_data_with_unique_key.count()
-)[ibis._.cnt > 1]
-```
-
-```{python}
-import random
-
-# Fix the random numbers by setting the seed
-# This enables the analysis to be reproducible when random numbers are used
-random.seed(222)
-
-# Put 3/4 of the data into the training set
-random_key = str(random.getrandbits(256))
-data_split = flight_data_with_unique_key.mutate(
-    train=(flight_data_with_unique_key.unique_key + random_key).hash().abs() % 4 < 3
-)
+import ibis_ml as ml
 
 # Create data frames for the two sets:
-train_data = data_split[data_split.train].drop("unique_key", "train")
-test_data = data_split[~data_split.train].drop("unique_key", "train")
+train_data, test_data = ml.train_test_split(
+    flight_data,
+    unique_key=["carrier", "flight", "date"],
+    # Put 3/4 of the data into the training set
+    test_size=0.25,
+    num_buckets=4,
+    # Fix the random numbers by setting the seed
+    # This enables the analysis to be reproducible when random numbers are used
+    random_seed=222,
+)
 ```
 
 ## Create features
 
 ```{python}
-import ibis_ml as ml
-
 flights_rec = ml.Recipe(
     ml.ExpandDate("date", components=["dow", "month"]),
     ml.Drop("date"),

docs/tutorial/xgboost.qmd

Lines changed: 12 additions & 32 deletions
@@ -101,7 +101,7 @@ flight_data = (
         "time_hour",
     )
     # Exclude missing data
-    .dropna()
+    .drop_null()
 )
 flight_data
 ```
@@ -121,44 +121,24 @@ To get started, let's split this single dataset into two: a _training_ set and a
 Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. [It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.](https://www.euclaim.com/blog/flight-numbers-explained#:~:text=Can%20flight%20numbers%20be%20reused,of%20travel%20is%20always%20unique.)
 
 ```{python}
-flight_data_with_unique_key = flight_data.mutate(
-    unique_key=ibis.literal(",").join(
-        [flight_data.carrier, flight_data.flight.cast(str), flight_data.date.cast(str)]
-    )
-)
-flight_data_with_unique_key
-```
-
-```{python}
-# FIXME(deepyaman): Proposed key isn't unique for actual departure date.
-flight_data_with_unique_key.group_by("unique_key").mutate(
-    cnt=flight_data_with_unique_key.count()
-)[ibis._.cnt > 1]
-```
-
-```{python}
-import random
-
-# Fix the random numbers by setting the seed
-# This enables the analysis to be reproducible when random numbers are used
-random.seed(222)
-
-# Put 3/4 of the data into the training set
-random_key = str(random.getrandbits(256))
-data_split = flight_data_with_unique_key.mutate(
-    train=(flight_data_with_unique_key.unique_key + random_key).hash().abs() % 4 < 3
-)
+import ibis_ml as ml
 
 # Create data frames for the two sets:
-train_data = data_split[data_split.train].drop("unique_key", "train")
-test_data = data_split[~data_split.train].drop("unique_key", "train")
+train_data, test_data = ml.train_test_split(
+    flight_data,
+    unique_key=["carrier", "flight", "date"],
+    # Put 3/4 of the data into the training set
+    test_size=0.25,
+    num_buckets=4,
+    # Fix the random numbers by setting the seed
+    # This enables the analysis to be reproducible when random numbers are used
+    random_seed=222,
+)
 ```
 
 ## Create features
 
 ```{python}
-import ibis_ml as ml
-
 flights_rec = ml.Recipe(
     ml.ExpandDate("date", components=["dow", "month"]),
     ml.Drop("date"),
