Commit 1faddcb

awaelchli and lantiga authored
Update Lightning AI multi-node guide (#19324)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
1 parent 3044e83 commit 1faddcb

2 files changed: +89 -54 lines changed

docs/source-fabric/fundamentals/launch.rst

+1-1
@@ -172,7 +172,7 @@ Choose from the following options based on your expertise level and available in
172172
<div class="row">
173173

174174
.. displayitem::
175-
:header: Lightning Cloud
175+
:header: Run single or multi-node on Lightning Studios
176176
:description: The easiest way to scale models in the cloud. No infrastructure setup required.
177177
:col_css: col-md-4
178178
:button_link: ../guide/multi_node/cloud.html

docs/source-fabric/guide/multi_node/cloud.rst

+88-53
@@ -1,14 +1,13 @@
11
:orphan:
22

3-
##########################
4-
Run in the Lightning Cloud
5-
##########################
3+
#############################################
4+
Run single or multi-node on Lightning Studios
5+
#############################################
66

77
**Audience**: Users who don't want to waste time on cluster configuration and maintenance.
88

9-
10-
The Lightning AI cloud is a platform where you can build, train, finetune and deploy models without worrying about infrastructure, cost management, scaling, and other technical headaches.
11-
In this guide, and within just 10 minutes, you will learn how to run a Fabric training script across multiple nodes in the cloud.
9+
`Lightning Studios <https://lightning.ai>`_ is a cloud platform where you can build, train, finetune and deploy models without worrying about infrastructure, cost management, scaling, and other technical headaches.
10+
This guide shows you how easy it is to run a Fabric training script across multiple machines on Lightning Studios.
1211

1312

1413
----
@@ -19,13 +18,8 @@ Initial Setup
1918
*************
2019

2120
First, create a free `Lightning AI account <https://lightning.ai/>`_.
22-
Then, log in from the CLI:
23-
24-
.. code-block:: bash
25-
26-
lightning login
27-
28-
A page opens in your browser where you can follow the instructions to complete the setup.
21+
You get free credits every month you can spend on GPU compute.
22+
To use machines with multiple GPUs or run jobs across machines, you need to be on the `Pro or Teams plan <https://lightning.ai/pricing>`_.
2923

3024

3125
----
@@ -35,66 +29,107 @@ A page opens in your browser where you can follow the instructions to complete t
3529
Launch multi-node training in the cloud
3630
***************************************
3731

38-
**Step 1:** Put your code inside a ``lightning.app.core.work.LightningWork``:
32+
**Step 1:** Start a new Studio.
3933

40-
.. code-block:: python
41-
:emphasize-lines: 5
42-
:caption: app.py
34+
.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/videos/start-studio-for-mmt.mp4
35+
:width: 800
36+
:loop:
37+
:muted:
38+
39+
|
40+
41+
**Step 2:** Bring your code into the Studio. You can clone a GitHub repo, drag and drop local files, or use the following demo example:
42+
43+
.. collapse:: Code Example
44+
45+
.. code-block:: python
46+
47+
import lightning as L
48+
import torch
49+
import torch.nn.functional as F
50+
from lightning.pytorch.demos import Transformer, WikiText2
51+
from torch.utils.data import DataLoader
52+
53+
54+
def main():
55+
L.seed_everything(42)
56+
57+
fabric = L.Fabric()
58+
fabric.launch()
4359
44-
import lightning as L
45-
from lightning.app.components import FabricMultiNode
60+
# Data
61+
with fabric.rank_zero_first():
62+
dataset = WikiText2()
4663
64+
train_dataloader = DataLoader(dataset, batch_size=20, shuffle=True)
4765
48-
# 1. Put your code inside a LightningWork
49-
class MyTrainingComponent(L.LightningWork):
50-
def run(self):
51-
# Set up Fabric
52-
# The `devices` and `num_nodes` gets set by Lightning automatically
53-
fabric = L.Fabric(strategy="ddp", precision="16-mixed")
66+
# Model
67+
model = Transformer(vocab_size=dataset.vocab_size)
68+
69+
# Optimizer
70+
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
5471
55-
# Your training code
56-
model = ...
57-
optimizer = ...
5872
model, optimizer = fabric.setup(model, optimizer)
59-
...
73+
train_dataloader = fabric.setup_dataloaders(train_dataloader)
74+
75+
for batch_idx, batch in enumerate(train_dataloader):
76+
input, target = batch
77+
output = model(input, target)
78+
loss = F.nll_loss(output, target.view(-1))
79+
fabric.backward(loss)
80+
optimizer.step()
81+
optimizer.zero_grad()
82+
83+
if batch_idx % 10 == 0:
84+
fabric.print(f"iteration: {batch_idx} - loss {loss.item():.4f}")
85+
86+
87+
if __name__ == "__main__":
88+
main()
89+
90+
|
6091
61-
**Step 2:** Init a ``lightning.app.core.app.LightningApp`` with the ``FabricMultiNode`` component.
62-
Configure the number of nodes, the number of GPUs per node, and the type of GPU:
92+
**Step 3:** Remove hardcoded accelerator settings if any and let Lightning automatically set them for you. No other changes are required in your script.
6393

6494
.. code-block:: python
65-
:emphasize-lines: 5,7
66-
:caption: app.py
6795
68-
# 2. Create the app with the FabricMultiNode component inside
69-
app = L.LightningApp(
70-
FabricMultiNode(
71-
MyTrainingComponent,
72-
# Run with 2 nodes
73-
num_nodes=2,
74-
# Each with 4 x V100 GPUs, total 8 GPUs
75-
cloud_compute=L.CloudCompute("gpu-fast-multi"),
76-
)
77-
)
96+
# These are the defaults
97+
fabric = L.Fabric(accelerator="auto", devices="auto")
7898
99+
# DON'T hardcode these, leave them default/auto
100+
# fabric = L.Fabric(accelerator="cpu", devices=3)
79101
80-
**Step 3:** Run your code from the CLI:
102+
|
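One way to follow Step 3 while still allowing local overrides is to read the settings from the command line and default them to ``"auto"``. The sketch below is only an illustration: the helper name ``fabric_kwargs_from_cli`` and the ``--accelerator`` / ``--devices`` flags are hypothetical and not part of Lightning. It uses only the standard library, and the resulting dictionary is meant to be passed on as ``L.Fabric(**kwargs)``.

```python
import argparse


def fabric_kwargs_from_cli(argv):
    """Hypothetical helper: build Fabric keyword arguments from CLI flags.

    Both settings default to "auto", so nothing is hardcoded in the script
    and the platform can pick the accelerator and device count.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--accelerator", default="auto")
    parser.add_argument("--devices", default="auto")
    args = parser.parse_args(argv)

    # Keep "auto" as a string; convert an explicit device count to int.
    devices = args.devices if args.devices == "auto" else int(args.devices)
    return {"accelerator": args.accelerator, "devices": devices}


# No flags given: everything stays on "auto".
print(fabric_kwargs_from_cli([]))

# Explicit override, e.g. for a quick local CPU test.
print(fabric_kwargs_from_cli(["--accelerator", "cpu", "--devices", "2"]))

# In a training script (assumption: `import lightning as L` is available):
# fabric = L.Fabric(**fabric_kwargs_from_cli(sys.argv[1:]))
```

With this pattern, the command typed for the multi-node job can omit both flags and rely on the ``"auto"`` defaults. An explicit override remains available for quick local experiments.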
81103
82-
.. code-block:: bash
104+
**Step 4:** Install dependencies and download all necessary data. Test that your script runs in the Studio first. If it runs in the Studio, it will run in multi-node!
83105

84-
lightning run app app.py --cloud
106+
|
85107
86-
This command will upload your Python file and then opens the app admin view, where you can see the logs of what's happening.
108+
**Step 5:** Open the Multi-Machine Training (MMT) app. Type the command to run your script, select the machine type and how many machines you want to launch it on. Click "Run" to start the job.
87109

88-
.. figure:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/fabric-multi-node-admin.png
89-
:alt: The Lightning AI admin page of an app running a multi-node fabric training script
90-
:width: 100%
110+
.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/videos/lightning-ai-mmt-demo-fabric.mp4
111+
:width: 800
112+
:loop:
113+
:muted:
114+
115+
After submitting the job, you will be redirected to a page where you can monitor the machine metrics and logs in real-time.
91116

92117

93118
----
94119

95120

121+
****************************
122+
Bring your own cloud account
123+
****************************
124+
125+
As a `Teams or Enterprise <https://lightning.ai/pricing>`_ customer, you have the option to connect your existing cloud account to Lightning AI.
126+
This gives your organization the ability to keep all compute and data on your own cloud account and your Virtual Private Cloud (VPC).
127+
128+
129+
----
130+
96131
**********
97-
Next steps
132+
Learn more
98133
**********
99134

100135
.. raw:: html
@@ -103,8 +138,8 @@ Next steps
103138
<div class="row">
104139

105140
.. displayitem::
106-
:header: Lightning Platform
107-
:description: Develop, Train and Deploy models on the cloud
141+
:header: Lightning Studios
142+
:description: Code together. Prototype. Train. Deploy. Host AI web apps. From your browser - with zero setup.
108143
:col_css: col-md-4
109144
:button_link: https://lightning.ai
110145
:height: 150