Commit 6dce35e

docs: use train_test_split util over custom code
1 parent 8479b8d commit 6dce35e
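
The custom code removed by this commit split rows by hashing a unique key together with a random string and taking the result modulo 4. The following is a minimal pure-Python sketch of that idea, not the `ibis_ml` implementation: the helper names `assign_bucket` and `train_test_split_rows`, the use of SHA-256, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def assign_bucket(key_parts, seed, num_buckets):
    """Map a unique key to a stable bucket in [0, num_buckets)."""
    # Join the key columns and mix in the seed, so that a different seed
    # produces a different (but still deterministic) assignment.
    text = ",".join(str(part) for part in key_parts) + f"|{seed}"
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def train_test_split_rows(rows, key_columns, test_size=0.25, num_buckets=4, seed=222):
    """Split a list of dict rows into (train, test) by hashing each row's key."""
    # Buckets below this cutoff go to the training set (3 of 4 by default).
    train_buckets = round(num_buckets * (1 - test_size))
    train, test = [], []
    for row in rows:
        bucket = assign_bucket([row[c] for c in key_columns], seed, num_buckets)
        (train if bucket < train_buckets else test).append(row)
    return train, test


# Made-up rows for illustration only.
rows = [{"carrier": "UA", "flight": n, "date": "2013-01-01"} for n in range(1000)]
train, test = train_test_split_rows(rows, ["carrier", "flight", "date"])
```

Because each row's assignment depends only on its key and the seed, the split does not change when the rows arrive in a different order, which is the property the tutorial needs given that row order in an Ibis table is undefined.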

4 files changed, 92 additions and 268 deletions

docs/tutorial/pytorch.qmd

Lines changed: 11 additions & 30 deletions
````diff
@@ -122,43 +122,24 @@ To get started, let's split this single dataset into two: a _training_ set and a
 Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. [It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.](https://www.euclaim.com/blog/flight-numbers-explained#:~:text=Can%20flight%20numbers%20be%20reused,of%20travel%20is%20always%20unique.)
 
 ```{python}
-flight_data_with_unique_key = flight_data.mutate(
-    unique_key=ibis.literal(",").join(
-        [flight_data.carrier, flight_data.flight.cast(str), flight_data.date.cast(str)]
-    )
-)
-flight_data_with_unique_key
-```
-
-```{python}
-flight_data_with_unique_key.group_by("unique_key").mutate(
-    count=flight_data_with_unique_key.count()
-).filter(ibis._["count"] > 1)
-```
-
-```{python}
-import random
-
-# Fix the random numbers by setting the seed
-# This enables the analysis to be reproducible when random numbers are used
-random.seed(222)
-
-# Put 3/4 of the data into the training set
-random_key = str(random.getrandbits(256))
-data_split = flight_data_with_unique_key.mutate(
-    train=(flight_data_with_unique_key.unique_key + random_key).hash().abs() % 4 < 3
-)
+import ibis_ml as ml
 
 # Create data frames for the two sets:
-train_data = data_split[data_split.train].drop("unique_key", "train")
-test_data = data_split[~data_split.train].drop("unique_key", "train")
+train_data, test_data = ml.train_test_split(
+    flight_data,
+    unique_key=["carrier", "flight", "date"],
+    # Put 3/4 of the data into the training set
+    test_size=0.25,
+    num_buckets=4,
+    # Fix the random numbers by setting the seed
+    # This enables the analysis to be reproducible when random numbers are used
+    random_seed=222,
+)
 ```
 
 ## Create features
 
 ```{python}
-import ibis_ml as ml
-
 flights_rec = ml.Recipe(
     ml.ExpandDate("date", components=["dow", "month"]),
     ml.Drop("date"),
````

docs/tutorial/scikit-learn.qmd

Lines changed: 11 additions & 30 deletions
````diff
@@ -121,43 +121,24 @@ To get started, let's split this single dataset into two: a _training_ set and a
 Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. [It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.](https://www.euclaim.com/blog/flight-numbers-explained#:~:text=Can%20flight%20numbers%20be%20reused,of%20travel%20is%20always%20unique.)
 
 ```{python}
-flight_data_with_unique_key = flight_data.mutate(
-    unique_key=ibis.literal(",").join(
-        [flight_data.carrier, flight_data.flight.cast(str), flight_data.date.cast(str)]
-    )
-)
-flight_data_with_unique_key
-```
-
-```{python}
-flight_data_with_unique_key.group_by("unique_key").mutate(
-    count=flight_data_with_unique_key.count()
-).filter(ibis._["count"] > 1)
-```
-
-```{python}
-import random
-
-# Fix the random numbers by setting the seed
-# This enables the analysis to be reproducible when random numbers are used
-random.seed(222)
-
-# Put 3/4 of the data into the training set
-random_key = str(random.getrandbits(256))
-data_split = flight_data_with_unique_key.mutate(
-    train=(flight_data_with_unique_key.unique_key + random_key).hash().abs() % 4 < 3
-)
+import ibis_ml as ml
 
 # Create data frames for the two sets:
-train_data = data_split[data_split.train].drop("unique_key", "train")
-test_data = data_split[~data_split.train].drop("unique_key", "train")
+train_data, test_data = ml.train_test_split(
+    flight_data,
+    unique_key=["carrier", "flight", "date"],
+    # Put 3/4 of the data into the training set
+    test_size=0.25,
+    num_buckets=4,
+    # Fix the random numbers by setting the seed
+    # This enables the analysis to be reproducible when random numbers are used
+    random_seed=222,
+)
 ```
 
 ## Create features
 
 ```{python}
-import ibis_ml as ml
-
 flights_rec = ml.Recipe(
     ml.ExpandDate("date", components=["dow", "month"]),
     ml.Drop("date"),
````

docs/tutorial/xgboost.qmd

Lines changed: 11 additions & 30 deletions
````diff
@@ -121,43 +121,24 @@ To get started, let's split this single dataset into two: a _training_ set and a
 Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. [It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.](https://www.euclaim.com/blog/flight-numbers-explained#:~:text=Can%20flight%20numbers%20be%20reused,of%20travel%20is%20always%20unique.)
 
 ```{python}
-flight_data_with_unique_key = flight_data.mutate(
-    unique_key=ibis.literal(",").join(
-        [flight_data.carrier, flight_data.flight.cast(str), flight_data.date.cast(str)]
-    )
-)
-flight_data_with_unique_key
-```
-
-```{python}
-flight_data_with_unique_key.group_by("unique_key").mutate(
-    count=flight_data_with_unique_key.count()
-).filter(ibis._["count"] > 1)
-```
-
-```{python}
-import random
-
-# Fix the random numbers by setting the seed
-# This enables the analysis to be reproducible when random numbers are used
-random.seed(222)
-
-# Put 3/4 of the data into the training set
-random_key = str(random.getrandbits(256))
-data_split = flight_data_with_unique_key.mutate(
-    train=(flight_data_with_unique_key.unique_key + random_key).hash().abs() % 4 < 3
-)
+import ibis_ml as ml
 
 # Create data frames for the two sets:
-train_data = data_split[data_split.train].drop("unique_key", "train")
-test_data = data_split[~data_split.train].drop("unique_key", "train")
+train_data, test_data = ml.train_test_split(
+    flight_data,
+    unique_key=["carrier", "flight", "date"],
+    # Put 3/4 of the data into the training set
+    test_size=0.25,
+    num_buckets=4,
+    # Fix the random numbers by setting the seed
+    # This enables the analysis to be reproducible when random numbers are used
+    random_seed=222,
+)
 ```
 
 ## Create features
 
 ```{python}
-import ibis_ml as ml
-
 flights_rec = ml.Recipe(
     ml.ExpandDate("date", components=["dow", "month"]),
     ml.Drop("date"),
````
