
Commit abae4c9

Update Lightning AI multi-node guide (Trainer) (#19530)
* update
* update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  (for more information, see https://pre-commit.ci)
* configure_model

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent a6c0a31 commit abae4c9

File tree: 6 files changed, +227 -22 lines changed

docs/source-pytorch/clouds/cluster.rst (+14 -8)

@@ -1,45 +1,51 @@
-#########################
-Run on an on-prem cluster
-#########################
+###########################
+Run on a multi-node cluster
+###########################


.. raw:: html

    <div class="display-card-container">
    <div class="row">

-.. Add callout items below this line
+.. displayitem::
+    :header: Run single or multi-node on Lightning Studios
+    :description: The easiest way to scale models in the cloud. No infrastructure setup required.
+    :col_css: col-md-6
+    :button_link: lightning_ai.html
+    :height: 160
+    :tag: basic

.. displayitem::
    :header: Run on an on-prem cluster
    :description: Learn to train models on a general compute cluster.
    :col_css: col-md-6
    :button_link: cluster_intermediate_1.html
-    :height: 150
+    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Run with Torch Distributed
    :description: Run models on a cluster with torch distributed.
    :col_css: col-md-6
    :button_link: cluster_intermediate_2.html
-    :height: 150
+    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Run on a SLURM cluster
    :description: Run models on a SLURM-managed cluster
    :col_css: col-md-6
    :button_link: cluster_advanced.html
-    :height: 150
+    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Integrate your own cluster
    :description: Learn how to integrate your own cluster
    :col_css: col-md-6
    :button_link: cluster_expert.html
-    :height: 150
+    :height: 160
    :tag: expert

.. raw:: html
docs/source-pytorch/clouds/lightning_ai.rst (new file, +192)

@@ -0,0 +1,192 @@
+:orphan:
+
+#############################################
+Run single or multi-node on Lightning Studios
+#############################################
+
+**Audience**: Users who don't want to waste time on cluster configuration and maintenance.
+
+`Lightning Studios <https://lightning.ai>`_ is a cloud platform where you can build, train, finetune and deploy models without worrying about infrastructure, cost management, scaling, and other technical headaches.
+This guide shows you how easy it is to run a PyTorch Lightning training script across multiple machines on Lightning Studios.
+
+
+----
+
+
+*************
+Initial Setup
+*************
+
+First, create a free `Lightning AI account <https://lightning.ai/>`_.
+You get free credits every month you can spend on GPU compute.
+To use machines with multiple GPUs or run jobs across machines, you need to be on the `Pro or Teams plan <https://lightning.ai/pricing>`_.
+
+
+----
+
+
+***************************************
+Launch multi-node training in the cloud
+***************************************
+
+**Step 1:** Start a new Studio.
+
+.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/videos/start-studio-for-mmt.mp4
+    :width: 800
+    :loop:
+    :muted:
+
+|
+
+**Step 2:** Bring your code into the Studio. You can clone a GitHub repo, drag and drop local files, or use the following demo example:
+
+.. collapse:: Code Example
+
+    .. code-block:: python
+
+        import lightning as L
+        import torch
+        import torch.nn.functional as F
+        from lightning.pytorch.demos import Transformer, WikiText2
+        from torch.utils.data import DataLoader, random_split
+
+
+        class LanguageDataModule(L.LightningDataModule):
+            def __init__(self, batch_size):
+                super().__init__()
+                self.batch_size = batch_size
+                self.vocab_size = 33278
+
+            def prepare_data(self):
+                WikiText2(download=True)
+
+            def setup(self, stage):
+                dataset = WikiText2()
+
+                # Split data into train, val, test
+                n = len(dataset)
+                self.train_dataset, self.val_dataset, self.test_dataset = random_split(dataset, [n - 4000, 2000, 2000])
+
+            def train_dataloader(self):
+                return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
+
+            def val_dataloader(self):
+                return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)
+
+            def test_dataloader(self):
+                return DataLoader(self.test_dataset, batch_size=self.batch_size, shuffle=False)
+
+
+        class LanguageModel(L.LightningModule):
+            def __init__(self, vocab_size):
+                super().__init__()
+                self.vocab_size = vocab_size
+                self.model = None
+
+            def configure_model(self):
+                if self.model is None:
+                    self.model = Transformer(vocab_size=self.vocab_size)
+
+            def training_step(self, batch, batch_idx):
+                input, target = batch
+                output = self.model(input, target)
+                loss = F.nll_loss(output, target.view(-1))
+                self.log("train_loss", loss)
+                return loss
+
+            def validation_step(self, batch, batch_idx):
+                input, target = batch
+                output = self.model(input, target)
+                loss = F.nll_loss(output, target.view(-1))
+                self.log("val_loss", loss)
+                return loss
+
+            def test_step(self, batch, batch_idx):
+                input, target = batch
+                output = self.model(input, target)
+                loss = F.nll_loss(output, target.view(-1))
+                self.log("test_loss", loss)
+                return loss
+
+            def configure_optimizers(self):
+                return torch.optim.SGD(self.parameters(), lr=0.1)
+
+
+        def main():
+            L.seed_everything(42)
+
+            datamodule = LanguageDataModule(batch_size=20)
+            model = LanguageModel(datamodule.vocab_size)
+
+            # Trainer
+            trainer = L.Trainer(gradient_clip_val=0.25, max_epochs=2, strategy="ddp")
+            trainer.fit(model, datamodule=datamodule)
+            trainer.test(model, datamodule=datamodule)
+
+
+        if __name__ == "__main__":
+            main()
+
+|
+
+**Step 3:** Remove hardcoded accelerator settings, if any, and let Lightning set them automatically for you. No other changes are required in your script.
+
+.. code-block:: python
+
+    # These are the defaults
+    trainer = L.Trainer(accelerator="auto", devices="auto")
+
+    # DON'T hardcode these, leave them default/auto
+    # trainer = L.Trainer(accelerator="cpu", devices=3)
+
+|
+
+**Step 4:** Install dependencies and download all necessary data. Test that your script runs in the Studio first. If it runs in the Studio, it will run multi-node!
+
+|
+
+**Step 5:** Open the Multi-Machine Training (MMT) app. Type the command that runs your script, select the machine type and how many machines to launch it on, and click "Run" to start the job.
+
+.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/lightning-ai-mmt-demo-pl.mp4
+    :width: 800
+    :loop:
+    :muted:
+
+After submitting the job, you will be redirected to a page where you can monitor the machine metrics and logs in real time.
+
+
+----
+
+
+****************************
+Bring your own cloud account
+****************************
+
+As a `Teams or Enterprise <https://lightning.ai/pricing>`_ customer, you have the option to connect your existing cloud account to Lightning AI.
+This gives your organization the ability to keep all compute and data in your own cloud account and your Virtual Private Cloud (VPC).
+
+
+----
+
+**********
+Learn more
+**********
+
+.. raw:: html
+
+    <div class="display-card-container">
+    <div class="row">
+
+.. displayitem::
+    :header: Lightning Studios
+    :description: Code together. Prototype. Train. Deploy. Host AI web apps. From your browser - with zero setup.
+    :col_css: col-md-4
+    :button_link: https://lightning.ai
+    :height: 150
+
+.. raw:: html

+    </div>
+    </div>
+
+|
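
The new guide's example builds the Transformer inside configure_model() rather than in __init__ (the hook called out in the commit message). The Trainer invokes this hook once its strategy and devices are set up, which lets sharded strategies such as FSDP create the weights directly in that context instead of materializing the full model on every process first. Below is a minimal sketch of the pattern on its own; the strategy="fsdp" pairing (which requires GPUs) is an assumption for illustration, the guide itself uses "ddp":

    import lightning as L
    from lightning.pytorch.demos import Transformer


    class ShardedLanguageModel(L.LightningModule):
        def __init__(self, vocab_size=33278):
            super().__init__()
            self.vocab_size = vocab_size
            self.model = None  # weights are deliberately not created here

        def configure_model(self):
            # Called by the Trainer after the strategy/devices are set up.
            # Keep it idempotent: fit() followed by test() calls it again.
            if self.model is None:
                self.model = Transformer(vocab_size=self.vocab_size)


    # Assumed pairing with a sharded strategy (needs GPUs); the guide uses strategy="ddp".
    # trainer = L.Trainer(accelerator="gpu", devices="auto", strategy="fsdp")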
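
Step 3 leaves accelerator and devices on "auto" so Lightning can pick up the hardware on each machine. To confirm what was detected before launching a full job, a quick check can be added to the script; this is a small sketch assuming a recent Lightning 2.x release, where num_nodes, num_devices, and world_size are available as Trainer properties:

    import lightning as L

    # Construct the Trainer with the defaults recommended in Step 3 and
    # print the topology Lightning detected on this machine/cluster.
    trainer = L.Trainer(accelerator="auto", devices="auto")

    print(f"nodes: {trainer.num_nodes}")
    print(f"devices per node: {trainer.num_devices}")
    print(f"world size: {trainer.world_size}")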
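
For Step 5, the command is simply whatever runs the script on a single Studio machine; assuming the demo example above were saved as train.py (a hypothetical filename), the command box would contain:

    python train.py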

docs/source-pytorch/common/index.rst (+2 -2)

@@ -112,8 +112,8 @@ How-to Guides
    :height: 180

.. displayitem::
-    :header: Run on an on-prem cluster
-    :description: Learn to run on your own cluster
+    :header: Run on a multi-node cluster
+    :description: Learn to run on multi-node in the cloud or on your cluster
    :button_link: ../clouds/cluster.html
    :col_css: col-md-4
    :height: 180

docs/source-pytorch/common_usecases.rst (+2 -2)

@@ -85,8 +85,8 @@ Customize and extend Lightning for things like custom hardware or distributed st
    :height: 100

.. displayitem::
-    :header: Run on an on-prem cluster
-    :description: Learn to run on your own cluster
+    :header: Run on a multi-node cluster
+    :description: Learn to run multi-node in the cloud or on your cluster
    :col_css: col-md-12
    :button_link: clouds/cluster.html
    :height: 100

docs/source-pytorch/levels/intermediate.rst (+2 -2)

@@ -64,8 +64,8 @@ Learn to scale up your models and enable collaborative model development at acad
    :tag: intermediate

.. displayitem::
-    :header: Level 13: Run on on-prem clusters
-    :description: Run on a custom on-prem cluster or SLURM cluster.
+    :header: Level 13: Run on a multi-node cluster
+    :description: Learn to run on multi-node in the cloud or on your cluster
    :col_css: col-md-6
    :button_link: intermediate_level_14.html
    :height: 150

docs/source-pytorch/levels/intermediate_level_14.rst (+15 -8)

@@ -1,10 +1,10 @@
:orphan:

-#################################
-Level 13: Run on on-prem clusters
-#################################
+#####################################
+Level 13: Run on a multi-node cluster
+#####################################

-In this level you'll learn to run on on-prem clusters.
+In this level you'll learn to run on cloud or on-prem clusters.

----

@@ -13,30 +13,37 @@ In this level you'll learn to run on on-prem clusters.
    <div class="display-card-container">
    <div class="row">

-.. Add callout items below this line
+
+.. displayitem::
+    :header: Run single or multi-node on Lightning Studios
+    :description: The easiest way to scale models in the cloud. No infrastructure setup required.
+    :col_css: col-md-4
+    :button_link: ../clouds/lightning_ai.html
+    :height: 160
+    :tag: basic

.. displayitem::
    :header: Run on an on-prem cluster
    :description: Learn to train models on a general compute cluster.
    :col_css: col-md-4
    :button_link: ../clouds/cluster_intermediate_1.html
-    :height: 150
+    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Run on a SLURM cluster
    :description: Run models on a SLURM-managed cluster
    :col_css: col-md-4
    :button_link: ../clouds/cluster_advanced.html
-    :height: 150
+    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Run with Torch Distributed
    :description: Run models on a cluster with torch distributed.
    :col_css: col-md-4
    :button_link: ../clouds/cluster_intermediate_2.html
-    :height: 150
+    :height: 160
    :tag: intermediate

.. raw:: html
