Commit a6273d1

Add Lightning Data + Update README (#19512)
1 parent eb0bbde commit a6273d1

2 files changed (+74 −27 lines)

src/lightning/data/README.md (+50 −22)
@@ -5,23 +5,62 @@
<br/>
<br/>

- ## Blazing fast, distributed streaming of training data from cloud storage
+ ## Blazingly fast, distributed streaming of training data from cloud storage

</div>

# ⚡ Welcome to Lightning Data

We developed `StreamingDataset` to optimize training of large datasets stored on the cloud while prioritizing speed, affordability, and scalability.

- Specifically crafted for multi-node, distributed training with large models, it enhances accuracy, performance, and user-friendliness. Now, training efficiently is possible regardless of the data's location. Simply stream in the required data when needed.
+ Specifically crafted for multi-gpu & multi-node (with [DDP](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html), [FSDP](https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html), etc...), distributed training with large models, it enhances accuracy, performance, and user-friendliness. Now, training efficiently is possible regardless of the data's location. Simply stream in the required data when needed.

- The `StreamingDataset` is compatible with any data type, including **images, text, video, and multimodal data** and it is a drop-in replacement for your PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
+ The `StreamingDataset` is compatible with any data type, including **images, text, video, audio, geo-spatial, and multimodal data** and it is a drop-in replacement for your PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.

- Finally, the `StreamingDataset` is fast! Check out our [benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).
+ # 🚀 Benchmarks

- Here is an illustration showing how the `StreamingDataset` works.
+ [Imagenet-1.2M](https://www.image-net.org/) is a commonly used dataset to compare computer vision models. Its training dataset contains `1,281,167 images`.

- ![An illustration showing how the Streaming Dataset works.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
+ In this benchmark, we measured the streaming speed (`images per second`) loaded from [AWS S3](https://aws.amazon.com/s3/) for several frameworks.
+
+ Find the reproducible [Studio Benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).
+
+ ### Imagenet-1.2M Streaming from AWS S3
+
+ | Framework   | Images / sec 1st Epoch (float32)      | Images / sec 2nd Epoch (float32)      | Images / sec 1st Epoch (torch16)      | Images / sec 2nd Epoch (torch16)      |
+ | ----------- | ------------------------------------- | ------------------------------------- | ------------------------------------- | ------------------------------------- |
+ | PL Data     | ${\\textbf{\\color{Fuchsia}5800.34}}$ | ${\\textbf{\\color{Fuchsia}6589.98}}$ | ${\\textbf{\\color{Fuchsia}6282.17}}$ | ${\\textbf{\\color{Fuchsia}7221.88}}$ |
+ | Web Dataset | 3134.42                               | 3924.95                               | 3343.40                               | 4424.62                               |
+ | Mosaic ML   | 2898.61                               | 5099.93                               | 2809.69                               | 5158.98                               |
+
+ Higher is better.
+
+ ### Imagenet-1.2M Conversion
+
+ | Framework   | Train Conversion Time                   | Val Conversion Time                     | Dataset Size                           | # Files |
+ | ----------- | --------------------------------------- | --------------------------------------- | -------------------------------------- | ------- |
+ | PL Data     | ${\\textbf{\\color{Fuchsia}10:05 min}}$ | ${\\textbf{\\color{Fuchsia}00:30 min}}$ | ${\\textbf{\\color{Fuchsia}143.1 GB}}$ | 2.339   |
+ | Web Dataset | 32:36 min                               | 01:22 min                               | 147.8 GB                               | 1.144   |
+ | Mosaic ML   | 49:49 min                               | 01:04 min                               | ${\\textbf{\\color{Fuchsia}143.1 GB}}$ | 2.298   |
+
+ The dataset needs to be converted into an optimized format for cloud streaming. We measured how fast the 1.2 million images are converted.
+
+ Faster is better.
+
+ # 📚 Real World Examples
+
+ We have built end-to-end free [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:
+
+ | Dataset | Data type | Studio |
+ | ------- | :-------: | -----: |
+ | [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | Image & description | [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) |
+ | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) | Image & Mask | [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) |
+ | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) | Image & Label | [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) |
+ | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StartCoder](https://huggingface.co/datasets/bigcode/starcoderdata) | Text | [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) |
+ | [English Wikepedia](https://huggingface.co/datasets/wikipedia) | Text | [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) |
+ | Generated | Parquet Files | [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) |
+
+ [Lightning Studios](https://lightning.ai) are fully reproducible cloud IDE with data, code, dependencies, etc...

# 🎬 Getting Started
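The hunk above describes `StreamingDataset` as a drop-in replacement for a PyTorch `IterableDataset`. For orientation, here is a minimal sketch of that usage; it is not part of the commit, and the `s3://my-bucket/my-optimized-data` URI as well as the `input_dir`, `shuffle`, and `batch_size` arguments are illustrative assumptions.

```python
# Hedged sketch (not from this commit): iterate over a cloud-hosted, optimized
# dataset exactly as you would over any other iterable-style PyTorch dataset.
from torch.utils.data import DataLoader

from lightning.data import StreamingDataLoader, StreamingDataset

# Assumed: `input_dir` points at data previously written in the optimized format.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data", shuffle=True)

# Works with the plain PyTorch DataLoader ...
dataloader = DataLoader(dataset, batch_size=64)

# ... or with the distributed-aware loader exported by the same package.
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    pass  # training step goes here
```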

@@ -32,7 +71,7 @@ Lightning Data can be installed with `pip`:
<!--pytest.mark.skip-->

```bash
- pip install --no-cache-dir git+https://github.com/Lightning-AI/pytorch-lightning.git@master
+ pip install --no-cache-dir git+https://github.com/Lightning-AI/lit-data.git@master
```

## 🏁 Quick Start
@@ -102,6 +141,10 @@ cls = sample['class']
dataloader = DataLoader(dataset)
```

+ Here is an illustration showing how the `StreamingDataset` works under the hood.
+
+ ![An illustration showing how the Streaming Dataset works.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
+
## Transform data

Similar to `optimize`, the `map` operator can be used to transform data by applying a function over a list of item and persist all the files written inside the output directory.
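The README's own `map` example continues outside this hunk. As a point of reference only, here is a minimal sketch of the pattern the context line above describes; the resize function, the local input paths, and the `fn` / `inputs` / `output_dir` / `num_workers` keyword names are illustrative assumptions rather than text from this commit.

```python
# Hedged sketch (not from this commit): apply a function over a list of items;
# files the function writes inside `output_dir` are persisted there by `map`.
import os

from PIL import Image

from lightning.data import map


def resize_image(image_path, output_dir):
    # Hypothetical transform: downscale one image and write it to output_dir.
    output_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_path)


if __name__ == "__main__":
    map(
        fn=resize_image,
        inputs=["data/img_0.jpg", "data/img_1.jpg"],  # hypothetical local paths
        output_dir="resized_images",                  # could also be a cloud URI
        num_workers=4,
    )
```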
@@ -154,21 +197,6 @@ if __name__ == "__main__":
)
```

- # 📚 End-to-end Lightning Studio Templates
-
- We have end-to-end free [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:
-
- | Dataset | Data type | Studio |
- | ------- | :-------: | -----: |
- | [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | Image & description | [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) |
- | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) | Image & Mask | [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) |
- | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) | Image & Label | [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) |
- | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StartCoder](https://huggingface.co/datasets/bigcode/starcoderdata) | Text | [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) |
- | [English Wikepedia](https://huggingface.co/datasets/wikipedia) | Text | [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) |
- | Generated | Parquet Files | [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) |
-
- [Lightning Studios](https://lightning.ai) are fully reproducible cloud IDE with data, code, dependencies, etc... Finally reproducible science.
-
# 📈 Easily scale data processing

To scale data processing, create a free account on [lightning.ai](https://lightning.ai/) platform. With the platform, the `optimize` and `map` can start multiple machines to make data processing drastically faster as follows:
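The snippet the context line above refers to ("as follows") sits outside this diff. As a hedged illustration only — the `num_nodes` and `machine` keywords and the `Machine.DATA_PREP` value are assumptions inferred from the `Machine` import added in `__init__.py` below, not text from this commit — a multi-machine job is typically requested by passing extra arguments to the same `map`/`optimize` call:

```python
# Hedged sketch (not from this commit): ask the lightning.ai platform to run a
# data-processing job on several machines; keyword names below are assumptions.
import os

from lightning.data import Machine, map


def write_item(index, output_dir):
    # Hypothetical stand-in for real per-item processing work.
    with open(os.path.join(output_dir, f"{index}.txt"), "w") as f:
        f.write(str(index))


if __name__ == "__main__":
    map(
        fn=write_item,
        inputs=list(range(8)),      # hypothetical inputs
        output_dir="processed",     # hypothetical output location
        num_nodes=4,                # assumed: number of machines to start
        machine=Machine.DATA_PREP,  # assumed: machine type provided by lightning_sdk
    )
```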

src/lightning/data/__init__.py (+24 −5)
@@ -1,9 +1,27 @@
+ import sys
+
from lightning_utilities.core.imports import RequirementCache

- from lightning.data.processing.functions import map, optimize, walk
- from lightning.data.streaming.combined import CombinedStreamingDataset
- from lightning.data.streaming.dataloader import StreamingDataLoader
- from lightning.data.streaming.dataset import StreamingDataset
+ _LIGHTNING_DATA_AVAILABLE = RequirementCache("lightning_data")
+ _LIGHTNING_SDK_AVAILABLE = RequirementCache("lightning_sdk")
+
+ if _LIGHTNING_DATA_AVAILABLE:
+     import lightning_data
+
+     # Enable resolution at least for lower data namespace
+     sys.modules["lightning.data"] = lightning_data
+
+     from lightning_data.processing.functions import map, optimize, walk
+     from lightning_data.streaming.combined import CombinedStreamingDataset
+     from lightning_data.streaming.dataloader import StreamingDataLoader
+     from lightning_data.streaming.dataset import StreamingDataset
+
+ else:
+     # TODO: Delete all the code when everything is moved to lightning_data
+     from lightning.data.processing.functions import map, optimize, walk
+     from lightning.data.streaming.combined import CombinedStreamingDataset
+     from lightning.data.streaming.dataloader import StreamingDataLoader
+     from lightning.data.streaming.dataset import StreamingDataset

__all__ = [
    "LightningDataset",
@@ -16,7 +34,8 @@
    "walk",
]

- if RequirementCache("lightning_sdk"):
+ # TODO: Move this to lightning_data
+ if _LIGHTNING_SDK_AVAILABLE:
    from lightning_sdk import Machine  # noqa: F401

    __all__.append("Machine")
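The `sys.modules["lightning.data"] = lightning_data` line in the first hunk above relies on the standard module-aliasing trick: once a module object is registered under a second dotted name, later imports of that name resolve to the same object. A small self-contained illustration with hypothetical module names (not the real `lightning` / `lightning_data` packages):

```python
# Illustration of the sys.modules aliasing pattern, using made-up module names.
import importlib
import sys
import types

# Stand-in for the standalone package that actually holds the implementation.
standalone = types.ModuleType("standalone_backend")
standalone.answer = 42
sys.modules["standalone_backend"] = standalone

# Register the same module object under the namespaced dotted path.
sys.modules["framework.data"] = standalone

# Lookups of the aliased name now resolve to the standalone module object.
assert importlib.import_module("framework.data") is standalone
assert importlib.import_module("framework.data").answer == 42
```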
