MultiDeviceKernel is not distributing the memory usage #2319

NicPy4 · 2023-04-10T16:50:26Z

NicPy4
Apr 10, 2023

I try to use the MultiDeviceKernel for a time series forecast. My data has ~100.000 data samples and one input feature. To start with, I just used the example from GPyTorch repository for ExactGP with MultiDeviceKernel (https://docs.gpytorch.ai/en/latest/examples/02_Scalable_Exact_GPs/Simple_MultiGPU_GP_Regression.html) but instead of the protein data, I used my own data. I have 8 GPUs (NVIDIA A100 40Gb or NVIDIA Tesla V100 32Gb), and I work on a supercomputer at my university.
According to the paper linked to the example, running this with this set-up and the provided code should be no problem. However, I always run into 'CUDA of out memory' and to be more particular it is because of imbalanced memory usage. So when I display my memory usage with 'nividia-smi -l', I can see that my GPU0 is used ~99% while the rest is used around 30-40%. This then leads to a stop of the code at some point. This is my code:

    output_device = torch.device('cuda:0')
    x_train = torch.tensor(x_train, device=output_device)
    y_train = torch.tensor(y_train, device=output_device)
    x_test = torch.tensor(x_test, device=output_device)
    y_test = torch.tensor(y_test, device=output_device)
    train_x, train_y = torch.flatten(x_train.contiguous()), torch.flatten(y_train.contiguous())
    test_x, test_y = torch.flatten(x_test.contiguous()), torch.flatten(y_test.contiguous())
    n_devices = torch.cuda.device_count()
    print('Planning to run on {} GPUs.'.format(n_devices))

    class ExactGPModel(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood, n_devices):
            super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ConstantMean()
            base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

            self.covar_module = gpytorch.kernels.MultiDeviceKernel(
                base_covar_module, device_ids=list(range(n_devices)),
                output_device=output_device)

        def forward(self, x):
            mean_x = self.mean_module(x)
            covar_x = self.covar_module(x)
            return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


    def train(train_x, train_y, n_devices, output_device, checkpoint_size, preconditioner_size, n_training_iter):

        likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
        model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
        print('model built.')
        model.train()
        likelihood.train()
        print('model in training mode')

        optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)
        print('optimizer is calculated.')
        # "Loss" for GPs - the marginal log likelihood
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

        print('mll is calculated.')

        with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
                gpytorch.settings.max_preconditioner_size(preconditioner_size):

            def closure():
                optimizer.zero_grad()
                print('set optimizer to zero grad.')
                output = model(train_x)
                print('made predictions')
                loss = -mll(output, train_y)
                print('loss was calculated')
                return loss

            print('start with loss calculation')
            loss = closure()
            loss.backward(torch.ones_like(loss))
            print('loss backward was calculated')

            for i in range(n_training_iter):
                options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
                loss, _, _, _, _, _, _, fail = optimizer.step(options)

                print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
                    i + 1, n_training_iter, loss.mean().item(),
                    model.covar_module.module.base_kernel.lengthscale.item(),
                    model.likelihood.noise.item()
                ))

                if fail:
                    print('Convergence reached!')
                    break

        print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
        return model, likelihood

    def find_best_gpu_setting(train_x,
                              train_y,
                              n_devices,
                              output_device,
                              preconditioner_size
                              ):
        N = train_x.size(0)

        # Find the optimum partition/checkpoint size by decreasing in powers of 2
        # Start with no partitioning (size = 0)
        settings = [0] + [int(n) for n in np.ceil(N / 2 ** np.arange(1, np.floor(np.log2(N))))]

        for checkpoint_size in settings:
            print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))
            try:
                # Try a full forward and backward pass with this setting to check memory usage
                _, _ = train(train_x, train_y,
                             n_devices=n_devices, output_device=output_device,
                             checkpoint_size=checkpoint_size,
                             preconditioner_size=preconditioner_size, n_training_iter=1)

                # when successful, break out of for-loop and jump to finally block
                break
            except RuntimeError as e:
                print('RuntimeError: {}'.format(e))
                print(torch.cuda.memory_summary(device=None, abbreviated=False))
                gc.collect()
                torch.cuda.empty_cache()
            except AttributeError as e:
                print('AttributeError: {}'.format(e))
            finally:
                # handle CUDA OOM error
                gc.collect()
                torch.cuda.empty_cache()
                print('emptied cache')
        return checkpoint_size

    # Set a large enough preconditioner size to reduce the number of CG iterations run
    preconditioner_size = 100
    checkpoint_size = find_best_gpu_setting(train_x, train_y,
                                            n_devices=n_devices,
                                            output_device=output_device,
                                            preconditioner_size=preconditioner_size)

    model, likelihood = train(train_x, train_y,
                              n_devices=n_devices, output_device=output_device,
                              checkpoint_size=checkpoint_size,
                              preconditioner_size=preconditioner_size,
                              n_training_iter=5)

To test the program I run it on only 4 GPUs with less data (average over half an hour instead of a data point every 5min). The memory usage while running the program with ~18000 training samples and a kernel partition of 29 looks like this:

and I'll get this error message:

 Traceback (most recent call last):                                                                                                                                                                          
  File "main_regression.py", line 192, in <module>                                                                                                                                                          
    train_model()                                                                                                                                                                                           
  File "main_regression.py", line 181, in train_model                                                                                                                                                       
    torch_model = reg_model.Exact_gp(x_train, y_train, x_test, y_test)                                                                                                                                      
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 353, in Exact_gp                                                                                                             
    checkpoint_size = find_best_gpu_setting(train_x, train_y,                                                                                                                                               
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 330, in find_best_gpu_setting                                                                                                
    _, _ = train(train_x, train_y,
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 293, in train
    loss = closure()
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 288, in closure
    loss = -mll(output, train_y)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in __call__                                                                                                
    outputs = self.forward(*inputs, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 64, in forward                                                                     
    res = output.log_prob(target)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/distributions/multivariate_normal.py", line 193, in log_prob                                                                    
    inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1642, in inv_quad_logdet                                                            
    preconditioner, precond_lt, logdet_p = self._preconditioner()
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/added_diag_linear_operator.py", line 114, in _preconditioner                                                   
    self._piv_chol_self = self._linear_op.pivoted_cholesky(rank=max_iter)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1850, in pivoted_cholesky                                                           
    res, pivots = func(self.representation_tree(), rank, error_tol, *self.representation())
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_pivoted_cholesky.py", line 72, in forward                                                                     
    row = apply_permutation(matrix, pi_m.unsqueeze(-1), right_permutation=None).squeeze(-2)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/permutation.py", line 79, in apply_permutation                                                                     
    return to_dense(matrix.__getitem__((*batch_idx, left_permutation.unsqueeze(-1), right_permutation.unsqueeze(-2))))                                                                                     
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped                                                                      
    output = method(self, *args, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 426, in __getitem__                                                                 
    return super().__getitem__(index)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2692, in __getitem__                                                                
    res = self._get_indices(new_row_index, new_col_index, *new_batch_indices)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 422, in _get_indices
    InterpolatedLinearOperator(
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/memoize.py", line 59, in g
    return _add_to_cache(self, cache_name, method(self, *args, **kwargs), *args, kwargs_pkl=kwargs_pkl)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2461, in to_dense
    res = self.matmul(eye)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/interpolated_linear_operator.py", line 435, in matmul
    base_res = self.base_linear_op.matmul(right_interp_res)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1722, in matmul
    return Matmul.apply(self.representation_tree(), other, *self.representation())
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_matmul.py", line 21, in forward
    res = linear_op._matmul(rhs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 264, in _matmul
    self.kernel(
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/kernels/kernel.py", line 524, in __call__
    super(Kernel, self).__call__(x1_, x2_, last_dim_is_batch=last_dim_is_batch, **params)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/kernels/multi_device_kernel.py", line 64, in forward
    inputs = tuple((x1_[0], x2_) for x1_, x2_ in zip(self._x1_scattered, self._x2_subs))
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/kernels/multi_device_kernel.py", line 64, in <genexpr>
    inputs = tuple((x1_[0], x2_) for x1_, x2_ in zip(self._x1_scattered, self._x2_subs))
IndexError: tuple index out of range

When I try to run it on the full set with ~100.000 points, I get this error message:

Traceback (most recent call last):
File "main_regression.py", line 192, in
train_model()
File "main_regression.py", line 181, in train_model
torch_model = reg_model.Exact_gp(x_train, y_train, x_test, y_test)
File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 358, in Exact_gp
model, likelihood = train(train_x, train_y,
File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 293, in train
loss = closure()
File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 288, in closure
loss = -mll(output, train_y)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in call
outputs = self.forward(*inputs, **kwargs)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 64, in forward
res = output.log_prob(target)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/distributions/multivariate_normal.py", line 193, in log_prob
inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1642, in inv_quad_logdet
preconditioner, precond_lt, logdet_p = self._preconditioner()
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/added_diag_linear_operator.py", line 114, in _preconditioner
self._piv_chol_self = self._linear_op.pivoted_cholesky(rank=max_iter)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1850, in pivoted_cholesky
res, pivots = func(self.representation_tree(), rank, error_tol, *self.representation())
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_pivoted_cholesky.py", line 72, in forward
row = apply_permutation(matrix, pi_m.unsqueeze(-1), right_permutation=None).squeeze(-2)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/permutation.py", line 79, in apply_permutation
return to_dense(matrix.getitem((*batch_idx, left_permutation.unsqueeze(-1), right_permutation.unsqueeze(-2))))
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped
output = method(self, *args, **kwargs)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 426, in getitem
return super().getitem(index)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2692, in getitem
res = self._get_indices(new_row_index, new_col_index, *new_batch_indices)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 407, in _get_indices
base_linear_op = self._getitem(_noop_index, _noop_index, *batch_indices)._expand_batch(final_shape)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 380, in _expand_batch
return self.repeat(*batch_repeat, 1, 1)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped
output = method(self, *args, **kwargs)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 381, in repeat
x2 = self.x2.repeat(*batch_repeat, col_repeat, 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.09 GiB (GPU 0; 39.44 GiB total capacity; 28.12 GiB already allocated; 10.77 GiB free; 28.12 GiB reserved in total by PyTorch) If reser
ved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

And it doesn't matter how big the partition is. I hope someone can help me with this allocation problem of the memory usage.
Thanks!

gpleiss · 2023-05-26T14:48:20Z

gpleiss
May 26, 2023
Maintainer

@NicPy4 does it distribute the data when you don't use checkpointing? We will be deprecating the checkpointing feature in favor of KeOps.

3 replies

NicPy4 May 26, 2023
Author

@gpleiss First of all thank you for your reply. I commented out the 'find_best_gpu_setting' function and the 'gpytorch.beta_features.checkpoint_kernel(checkpoint_size)' to get rid of the checkpointing feature. I hope that I got it right and that was what you meant. If I try this then my process just gets killed. Even if I cut my data size to 10000 points or 1000 points. However, if I use SVGP with 1000 or more inducing points that is no problem. I know that the computational complexity is different between exact GP and SVGP but still, I wonder why it is not possible for me.

gpleiss May 26, 2023
Maintainer

By "killed" do you mean run out of memory? On a single GPU, you should be able to run up to 10000 data points.

NicPy4 May 31, 2023
Author

The terminal just displayed 'killed' so I suggest it is because it is out of memory, however I decided now to use the KeOps you suggested and try to implement the SpectralMixture Kernel next as we discussed here #1448 . Thanks for your feedback and help on this one!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MultiDeviceKernel is not distributing the memory usage #2319

{{title}}

Replies: 1 comment 3 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

MultiDeviceKernel is not distributing the memory usage #2319

NicPy4 Apr 10, 2023

Replies: 1 comment · 3 replies

gpleiss May 26, 2023 Maintainer

NicPy4 May 26, 2023 Author

gpleiss May 26, 2023 Maintainer

NicPy4 May 31, 2023 Author

NicPy4
Apr 10, 2023

Replies: 1 comment 3 replies

gpleiss
May 26, 2023
Maintainer

NicPy4 May 26, 2023
Author

gpleiss May 26, 2023
Maintainer

NicPy4 May 31, 2023
Author