TPU v4 install guide #108

Open · wants to merge 3 commits into base: main
3 changes: 3 additions & 0 deletions README.md
@@ -29,6 +29,9 @@ We currently support a few LLM models targeting text generation scenarios:

## Installation

For installation on a TPU v4, use the `install-on-TPU-v4.sh` script. Make sure that you do NOT install Pallas or Jetstream, as both target TPU v5e!

Via package:
`optimum-tpu` comes with a handy package released on PyPI, compatible with your usual Python dependency management tools.

`pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html`
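Putting the two paths together, the choice of install command depends on the TPU generation. A minimal sketch (the helper name is made up for illustration; the `jetstream-pt` extra comes from the pyproject.toml discussed in this PR and is an assumption here):

```shell
# Pick an install command by TPU generation (illustrative helper, not part of the PR).
tpu_install_cmd() {
  case "$1" in
    v4)  echo "bash install-on-TPU-v4.sh" ;;  # no Pallas/Jetstream on v4
    v5e) echo "pip install optimum-tpu[jetstream-pt] -f https://storage.googleapis.com/libtpu-releases/index.html" ;;
    *)   echo "pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html" ;;
  esac
}

tpu_install_cmd v4   # → bash install-on-TPU-v4.sh
```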
24 changes: 24 additions & 0 deletions install-on-TPU-v4.sh
@@ -0,0 +1,24 @@
sudo apt remove unattended-upgrades
Collaborator: why do you remove unattended-upgrades?

Author: They kicked off twice, each time right after a `sudo apt update`, and kept the TPU VM stuck for more than 90 minutes before I decided to just kill them. I consider the lifetime of a TPU VM to be short, and the VM is not exposed to the outside world. Hence, getting a stuck (costly) VM because of potentially non-critical updates seems worse than not having this service and instead running updates on your own schedule.

Collaborator: I understand the issue, but I think that depends on the distribution you are using (I haven't experienced it so far); it is not really related to optimum-tpu, which should provide tools for machine learning on TPUs. Please remove this command from the script and consider running it when you are setting up your machine, before using optimum-tpu.
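If removing the service feels too invasive, a middle ground is to wait for any running apt/dpkg process at setup time instead of uninstalling unattended-upgrades. A sketch, assuming a Debian/Ubuntu-based TPU VM image that ships `fuser` (from psmisc):

```shell
# Block until no process holds the dpkg lock, then report (sketch only).
wait_for_apt_lock() {
  lock="${1:-/var/lib/dpkg/lock-frontend}"
  while fuser "$lock" >/dev/null 2>&1; do
    echo "waiting for another apt/dpkg process..." >&2
    sleep 5
  done
  echo "apt lock free: $lock"
}

# Usage: wait_for_apt_lock && sudo apt update
```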

sudo apt update
export PJRT_DEVICE=TPU
export PATH="$HOME/.local/bin:$PATH"
pip install build
pip install --upgrade setuptools
sudo apt install python3.10-venv

git clone https://github.com/huggingface/optimum-tpu.git

cd optimum-tpu
make
make build_dist_install_tools
make build_dist

python -m venv optimum_tpu_env
source optimum_tpu_env/bin/activate
Comment on lines +16 to +17

Collaborator: why do you need a virtual environment?

Author: The regular install of optimum-tpu always tried to do a system-wide installation, which would then fail. I had to choose between `--install-option="--prefix=/SOME/DIR/"` and a venv, and considered the venv my preferred way of handling this (and future) conflicts.

I wanted a `pip install -e` as I was actively developing against some of the files. YMMV for a package install.

Collaborator: I understand, but this is a user choice too. Some people might prefer venv, others virtualenv, conda, or even a Docker image. I think it would be better to take it out of the script, leaving other users the freedom to choose their environment.
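The reviewer's point could be accommodated by having the script print the suggested environment setup rather than run it, leaving the actual choice to the caller. A sketch (the helper name and the conda flags are illustrative, not part of the PR):

```shell
# Print, rather than run, the environment setup command the user picked,
# so the install script itself stays environment-agnostic (sketch only).
env_setup_cmd() {
  case "$1" in
    venv)  echo "python -m venv $2 && source $2/bin/activate" ;;
    conda) echo "conda create -y -n $2 python=3.10 && conda activate $2" ;;
    *)     echo "no environment requested; installing into the current one" ;;
  esac
}

env_setup_cmd venv optimum_tpu_env
```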


pip install torch==2.4.0 torch_xla[tpu]==2.4.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
pip uninstall torchvision # it might insist on 2.4.1
pip install -e .

huggingface-cli login
gsutil cp -r gs://entropix/huggingface_hub ~/.cache/huggingface/hub
Collaborator: what is this for?

Author (@artus-LYTiQ, Oct 22, 2024): Should be rejected. Local install for custom changes and experiments. The bucket is one of our project buckets anyway.
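Since the script relies on `export PJRT_DEVICE=TPU` set earlier on, a small post-install sanity check (hypothetical, not part of the PR; the function name is made up) could catch a lost environment variable before any workload starts:

```shell
# Verify the XLA runtime is pointed at TPU (illustrative sketch).
check_pjrt() {
  if [ "${PJRT_DEVICE:-}" = "TPU" ]; then
    echo "PJRT_DEVICE is set to TPU"
  else
    echo "warning: PJRT_DEVICE is '${PJRT_DEVICE:-unset}', expected TPU" >&2
    return 1
  fi
}

PJRT_DEVICE=TPU check_pjrt   # → PJRT_DEVICE is set to TPU
```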

9 changes: 5 additions & 4 deletions pyproject.toml
@@ -61,10 +61,11 @@ tests = ["pytest", "safetensors"]
quality = ["black", "ruff", "isort"]
# Jetstream/Pytorch support is experimental for now, it needs to be installed manually.
# Pallas is pulled because it will install a compatible version of jax[tpu].
jetstream-pt = [
Collaborator: you do not need to comment this out: it will only be installed if you do `pip install optimum-tpu[pallas]`; otherwise it should not pull the dependency.

Author: Ok

"jetstream-pt",
"torch-xla[pallas] == 2.4.0"
]
# pallas and jetstream are not supported before v5e. Therefore, comment out on v4 and earlier
#jetstream-pt = [
# "jetstream-pt",
# "torch-xla[pallas] == 2.4.0"
#]
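As the collaborator notes, an extra declared under `[project.optional-dependencies]` is only pulled when explicitly requested, so the declaration can stay uncommented even on v4. The fragment below mirrors the original lines above; nothing is installed by default without asking for the extra:

```toml
# Safe to keep on v4: only pulled via `pip install optimum-tpu[jetstream-pt]`.
[project.optional-dependencies]
jetstream-pt = [
    "jetstream-pt",
    "torch-xla[pallas] == 2.4.0",
]
```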

[project.urls]
Homepage = "https://hf.co/hardware"