bug: cannot convert y to numpy on kaggle notebook in sklearn pipeline #149

Open
jitingxu1 opened this issue Sep 4, 2024 · 1 comment
Collaborator

jitingxu1 commented Sep 4, 2024

In this competition, the `y` column cannot be converted to a NumPy array when the pipeline is fitted.

The same code runs on my local machine but fails in a Kaggle notebook.

~~**I could reproduce this on my local.**~~

local env

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 10:07:17) [Clang 14.0.6 ]
scikit-learn version: 1.5.1
skorch version: 1.0.0
torch version: 2.4.0
ibis-framework version: 9.3.0

kaggle env

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
scikit-learn version: 1.2.2
skorch version: 1.0.0
torch version: 2.4.0+cpu
ibis-framework version: 9.3.0
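
Comparing the two listings, the differences are Python (3.12.4 vs 3.10.14), scikit-learn (1.5.1 vs 1.2.2) and the torch build (2.4.0 vs 2.4.0+cpu); skorch and ibis-framework match. A minimal sketch for producing the same listing in any environment, so the two can be diffed directly. `report_versions` is a hypothetical helper name, not something from the issue; it only uses the standard library.

```python
from importlib import metadata
import platform


def report_versions(packages):
    """Return {name: version or None} for each installed distribution name."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Distribution names as published on PyPI
    wanted = ["scikit-learn", "skorch", "torch", "ibis-framework"]
    for name, version in report_versions(wanted).items():
        print(f"{name} version: {version}")
```

Running this in both the local and the Kaggle kernel gives two lists that can be compared line by line.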

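# Imports used by this snippet; MyModel, recipe, X_train and y_train are not shown here
import torch.nn as nn
import torch.optim as optim
from sklearn.pipeline import Pipeline
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping, LRScheduler

# Wrap the PyTorch model with skorch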
net = NeuralNetClassifier(
    MyModel,
    module__input_dim=635,  # Specify the input dimension
    max_epochs=1,
    lr=0.001,
    batch_size=32,
    optimizer=optim.Adam,
    criterion=nn.BCELoss,
    iterator_train__shuffle=True,
    callbacks=[
        EarlyStopping(monitor='valid_loss', patience=25, load_best=True),  # Early stopping
        LRScheduler(policy='ReduceLROnPlateau', monitor='valid_loss', factor=0.1, patience=25, min_lr=1e-6)
    ],
    verbose=1
)

# Define the sklearn pipeline with preprocessing and PyTorch model
pipeline = Pipeline([
    ('ibisml-prep', recipe),  # Preprocessing step in IbisML
    ('model', net)  # The PyTorch model wrapped as NeuralNetClassifier via skorch
])

pipeline.fit(X_train, y_train)

log

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[19], line 1
----> 1 pipeline.fit(X_train, y_train)

File /opt/conda/lib/python3.10/site-packages/sklearn/pipeline.py:405, in Pipeline.fit(self, X, y, **fit_params)
    403     if self._final_estimator != "passthrough":
    404         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 405         self._final_estimator.fit(Xt, y, **fit_params_last_step)
    407 return self

File /opt/conda/lib/python3.10/site-packages/skorch/classifier.py:165, in NeuralNetClassifier.fit(self, X, y, **fit_params)
    154 """See ``NeuralNet.fit``.
    155 
    156 In contrast to ``NeuralNet.fit``, ``y`` is non-optional to
   (...)
    160 
    161 """
    162 # pylint: disable=useless-super-delegation
    163 # this is actually a pylint bug:
    164 # https://github.com/PyCQA/pylint/issues/1085
--> 165 return super(NeuralNetClassifier, self).fit(X, y, **fit_params)

File /opt/conda/lib/python3.10/site-packages/skorch/net.py:1319, in NeuralNet.fit(self, X, y, **fit_params)
   1316 if not self.warm_start or not self.initialized_:
   1317     self.initialize()
-> 1319 self.partial_fit(X, y, **fit_params)
   1320 return self

File /opt/conda/lib/python3.10/site-packages/skorch/net.py:1278, in NeuralNet.partial_fit(self, X, y, classes, **fit_params)
   1276 self.notify('on_train_begin', X=X, y=y)
   1277 try:
-> 1278     self.fit_loop(X, y, **fit_params)
   1279 except KeyboardInterrupt:
   1280     pass

File /opt/conda/lib/python3.10/site-packages/skorch/net.py:1172, in NeuralNet.fit_loop(self, X, y, epochs, **fit_params)
   1136 def fit_loop(self, X, y=None, epochs=None, **fit_params):
   1137     """The proper fit loop.
   1138 
   1139     Contains the logic of what actually happens during the fit
   (...)
   1170 
   1171     """
-> 1172     self.check_data(X, y)
   1173     self.check_training_readiness()
   1174     epochs = epochs if epochs is not None else self.max_epochs

File /opt/conda/lib/python3.10/site-packages/skorch/classifier.py:141, in NeuralNetClassifier.check_data(self, X, y)
    137         pass
    139 if y is not None:
    140     # pylint: disable=attribute-defined-outside-init
--> 141     self.classes_inferred_ = np.unique(to_numpy(y))

File /opt/conda/lib/python3.10/site-packages/skorch/utils.py:152, in to_numpy(X)
    149     return np.asarray(X)
    151 if not is_torch_data_type(X):
--> 152     raise TypeError("Cannot convert this data type to a numpy array.")
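
The traceback ends in `skorch.utils.to_numpy`, which raises because `y` is not a type it recognizes. One possible workaround, offered as a sketch rather than a confirmed fix, is to convert `y` to a NumPy array before it reaches skorch. `as_numpy_target` below is a hypothetical helper, not part of skorch or IbisML, and it assumes `y` is either array-like or exposes `to_numpy()` or `to_pandas()` (as an ibis column or a pandas Series does).

```python
import numpy as np


def as_numpy_target(y):
    """Best-effort conversion of a target column to a NumPy array.

    Hypothetical helper for illustration; not part of skorch or IbisML.
    """
    if isinstance(y, np.ndarray):
        return y
    # pandas Series (and similar objects) convert directly
    if hasattr(y, "to_numpy"):
        return y.to_numpy()
    # Objects that can be materialized as pandas first, e.g. an ibis column
    if hasattr(y, "to_pandas"):
        return y.to_pandas().to_numpy()
    # Lists, tuples and other array-likes
    return np.asarray(y)
```

With this, the call would become `pipeline.fit(X_train, as_numpy_target(y_train))`. Whether the IbisML recipe step accepts a NumPy `y` alongside an ibis `X_train` is an assumption that has not been checked here.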

jitingxu1 changed the title bug: bug: cannot convert y to numpy on kaggle notebook in sklearn pipeline Sep 4, 2024


zy662 commented Apr 11, 2025

Use the following code to fit the model:

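import ibis
import ibis_ml as ml
import ibis.expr.datatypes as dt
from sklearn.pipeline import Pipeline

# flight_data (an ibis table) and net (the skorch model) are assumed to be defined already
# Create data frames for the two sets: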
train_data, test_data = ml.train_test_split(
    flight_data,
    unique_key=["carrier", "flight", "date"],
    # Put 3/4 of the data into the training set
    test_size=0.25,
    num_buckets=4,
    # Fix the random numbers by setting the seed
    # This enables the analysis to be reproducible when random numbers are used
    random_seed=222,
)
X_train = train_data.drop("arr_delay")
y_train = train_data.arr_delay.cast(dt.int64)

X_test = test_data.drop("arr_delay")
y_test = test_data.arr_delay.cast(dt.int64)

last_mile_preprocessing = ml.Recipe(
    ml.ExpandDate("date", components=["dow", "month"]),
    ml.Drop("date"),
    ml.TargetEncode(ml.nominal()),
    ml.DropZeroVariance(ml.everything()),
    ml.MutateAt("dep_time", ibis._.hour() * 60 + ibis._.minute()),
    ml.MutateAt(ml.timestamp(), ibis._.epoch_seconds()),
    # By default, PyTorch requires that the type of `X` is `np.float32`.
    # https://discuss.pytorch.org/t/mat1-and-mat2-must-have-the-same-dtype-but-got-double-and-float/197555/2
    ml.Cast(ml.numeric(), "float32"),
)
# train preprocessing recipe using training dataset
last_mile_preprocessing.fit(X_train, y_train)

# transform train and test dataset using IbisML recipe
X_train_transformed = last_mile_preprocessing.transform(X_train)
X_test_transformed = last_mile_preprocessing.transform(X_test)

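# The pipeline runs the recipe itself, so pass the untransformed tables;
# passing X_train_transformed would apply the recipe a second time.
pipe = Pipeline([("flights_rec", last_mile_preprocessing), ("net", net)])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)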

reference: https://ibis-project.org/posts/ibisml/

Labels: none yet
Projects: Status: backlog
Development: no branches or pull requests
2 participants