
Commit 5257600

plaguss, gabrielmbmb, and pre-commit-ci[bot] authored
Image Language Models and ImageGeneration task (#1060)
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent e866345 commit 5257600

164 files changed (+2398, -632 lines)

Original file line number | Diff line number | Diff line change
@@ -0,0 +1,10 @@
1+
# ImageGenerationModel Gallery
2+
3+
This section contains the existing [`ImageGenerationModel`][distilabel.models.image_generation] subclasses implemented in `distilabel`.
4+
5+
::: distilabel.models.image_generation
6+
options:
7+
filters:
8+
- "!^ImageGenerationModel$"
9+
- "!^AsyncImageGenerationModel$"
10+
- "!typing"
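
The `filters` entries in this hunk are regular expressions, and a leading `!` marks a pattern as an exclusion. The sketch below is a simplified stand-in for that behaviour, not the documentation tool's actual implementation; `is_listed` and the candidate list are hypothetical. It illustrates that an anchored pattern such as `^ImageGenerationModel$` hides only the exact name, so the asynchronous base class needs its own correctly spelled entry, while an unanchored pattern such as `typing` hides anything containing that text.

```python
import re
from typing import Iterable, List


def is_listed(name: str, filters: Iterable[str]) -> bool:
    """Return False when any exclusion pattern (prefixed with "!") matches `name`."""
    for pattern in filters:
        if pattern.startswith("!") and re.search(pattern[1:], name):
            return False
    return True


# Same shape as the `filters` option in the gallery page.
filters: List[str] = [
    "!^ImageGenerationModel$",
    "!^AsyncImageGenerationModel$",
    "!typing",
]

# Hypothetical names found in the documented module.
candidates = [
    "ImageGenerationModel",
    "AsyncImageGenerationModel",
    "InferenceEndpointsImageGeneration",
    "typing",
]

listed = [name for name in candidates if is_listed(name, filters)]
print(listed)
```

A misspelled pattern matches nothing and raises no error, so the class it was meant to hide would simply stay in the gallery.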
@@ -0,0 +1,7 @@
1+
# ImageGenerationModel
2+
3+
This section contains the API reference for the `distilabel` image generation models, both for the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] synchronous implementation, and for the [`AsyncImageGenerationModel`][distilabel.models.image_generation.AsyncImageGenerationModel] asynchronous one.
4+
5+
For more information and examples on how to use the existing image generation models or create custom ones, please refer to [Tutorial - ImageGenerationModel](../../../sections/how_to_guides/basic/task/image_task.md).
6+
7+
::: distilabel.models.image_generation.base

docs/api/pipeline/typing.md

-3
This file was deleted.

docs/api/step/typing.md

-3
This file was deleted.

docs/api/task/image_task.md

+7
@@ -0,0 +1,7 @@
1+
# ImageTask
2+
3+
This section contains the API reference for the `distilabel` image generation tasks.
4+
5+
For more information on how the [`ImageTask`][distilabel.steps.tasks.ImageTask] works, and to see some examples, check the [Tutorial - Task - ImageTask](../../sections/how_to_guides/basic/task/generator_task.md) page.
6+
7+
::: distilabel.steps.tasks.base.ImageTask

docs/api/task/task_gallery.md

+1
@@ -8,5 +8,6 @@ This section contains the existing [`Task`][distilabel.steps.tasks.Task] subclas
88
- "!Task"
99
- "!_Task"
1010
- "!GeneratorTask"
11+
- "!ImageTask"
1112
- "!ChatType"
1213
- "!typing"

docs/api/task/typing.md

-3
This file was deleted.

docs/api/typing.md

+8
@@ -0,0 +1,8 @@
1+
# Types
2+
3+
This section contains the different types used across the distilabel codebase.
4+
5+
::: distilabel.typing.base
6+
::: distilabel.typing.steps
7+
::: distilabel.typing.models
8+
::: distilabel.typing.pipeline

docs/sections/how_to_guides/advanced/distiset.md

+27
@@ -119,6 +119,33 @@ class MagpieGenerator(GeneratorTask, MagpieBase):
119119

120120
The `Citations` section can include any number of bibtex references. To define them, you can add as many elements as needed, just like in the example: each citation is a block of the form ` ```@misc{...}``` `. This information is automatically used in the README of your `Distiset` if you decide to call `distiset.push_to_hub`. Alternatively, if the `Citations` section is not found but the `References` section contains any URLs pointing to `https://arxiv.org/`, we will try to obtain the `Bibtex` equivalent automatically. This way, Hugging Face can automatically track the paper for you, and it is easier to find other datasets citing the same paper or to visit the paper page directly.
121121

122+
#### Image Datasets
123+
124+
!!! info "Keep reading if you are interested in Image datasets"
125+
126+
The `Distiset` object has a new method, `transform_columns_to_image`, specifically to transform the images to `PIL.Image.Image` before pushing the dataset to the Hugging Face Hub.
127+
128+
Since version `1.5.0`, the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task can generate images from text. By default, the whole process works internally with a string representation of the images, which keeps processing simple. If the generated dataset is going to be stored on the Hugging Face Hub, however, a proper image object may be preferable, for example so that the images can be seen in the dataset viewer. The following pipeline, extracted from `examples/image_generation.py` at the root of the repository, shows how to do it:
129+
130+
```diff
131+
# Assume all the imports are already done, we are only interested
132+
with Pipeline(name="image_generation_pipeline") as pipeline:
133+
img_generation = ImageGeneration(
134+
name="flux_schnell",
135+
# `igm` is InferenceEndpointsImageGeneration(model_id="black-forest-labs/FLUX.1-schnell")
136+
image_generation_model=igm,
137+
)
138+
...
139+
140+
if __name__ == "__main__":
141+
distiset = pipeline.run(use_cache=False, dataset=ds)
142+
# Save the images as `PIL.Image.Image`
143+
+ distiset = distiset.transform_columns_to_image("image")
144+
distiset.push_to_hub(...)
145+
```
146+
147+
Calling [`transform_columns_to_image`][distilabel.distiset.Distiset.transform_columns_to_image] on the image columns we may have generated converts them to `PIL.Image.Image` objects (in this case we only want to transform the `image` column, but a list of columns can be passed). The transformation applies to every leaf node of the pipeline, meaning that if we have different subsets, the `image` column will be looked for in all of them.
148+
122149
### Save and load from disk
123150

124151
Take into account that these methods work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
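
The `transform_columns_to_image` step described earlier in this file turns a string representation of each image into an image object. As a rough illustration of that idea only, the sketch below assumes the string is base64-encoded image bytes (an assumption, the text only says "string representation") and uses a hypothetical helper, `decode_image_column`, on plain dictionaries. It is not distilabel's implementation and stops at raw bytes so that it runs with the standard library alone.

```python
import base64
from typing import Any, Dict, List


def decode_image_column(
    rows: List[Dict[str, Any]], column: str = "image"
) -> List[Dict[str, Any]]:
    """Return copies of `rows` with the base64 string in `column` decoded to bytes."""
    decoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[column] = base64.b64decode(row[column])
        decoded.append(new_row)
    return decoded


# Hypothetical payload: only the PNG signature, not a complete image.
fake_image_bytes = b"\x89PNG\r\n\x1a\n"
rows = [
    {"prompt": "a cat", "image": base64.b64encode(fake_image_bytes).decode("ascii")}
]

decoded_rows = decode_image_column(rows)

# With Pillow installed and real image bytes, the last step would be
# PIL.Image.open(io.BytesIO(decoded_rows[0]["image"])), which yields a PIL image.
```

Keeping the string form during processing and converting only at the end matches the motivation given in the text: strings are simple to move between steps, while image objects are what the dataset viewer can render.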

docs/sections/how_to_guides/advanced/pipeline_requirements.md

+1-1
@@ -9,7 +9,7 @@ from typing import List
99

1010
from distilabel.steps import Step
1111
from distilabel.steps.base import StepInput
12-
from distilabel.steps.typing import StepOutput
12+
from distilabel.typing import StepOutput
1313
from distilabel.steps import LoadDataFromDicts
1414
from distilabel.utils.requirements import requirements
1515
from distilabel.pipeline import Pipeline
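
This hunk moves the `StepOutput` import from `distilabel.steps.typing` to `distilabel.typing`. Code that has to run against versions on both sides of this commit could try the new location first and fall back to the old one. The helper below, `import_first`, is a hypothetical sketch of that pattern and is demonstrated with standard-library modules so it runs without distilabel installed.

```python
import importlib
from typing import Any


def import_first(attribute: str, *module_paths: str) -> Any:
    """Return `attribute` from the first importable module that defines it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(
        f"{attribute!r} not found in any of: {', '.join(module_paths)}"
    )


# Demonstration with standard-library modules so the snippet runs anywhere.
sqrt = import_first("sqrt", "module_that_does_not_exist", "math")

# In a project that uses distilabel, the call would look like:
# StepOutput = import_first("StepOutput", "distilabel.typing", "distilabel.steps.typing")
```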

docs/sections/how_to_guides/advanced/structured_generation.md

+2-2
@@ -21,7 +21,7 @@ The [`LLM`][distilabel.models.llms.LLM] has an argument named `structured_output
2121
We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
2222

2323
!!! NOTE
24-
Take a look at [`StructuredOutputType`][distilabel.steps.tasks.typing.StructuredOutputType] to see the expected format
24+
Take a look at [`StructuredOutputType`][distilabel.typing.models.StructuredOutputType] to see the expected format
2525
of the `structured_output` dict variable.
2626

2727
```python
@@ -139,7 +139,7 @@ For other LLM providers behind APIs, there's no direct way of accessing the inte
139139
```
140140

141141
!!! Note
142-
Take a look at [`InstructorStructuredOutputType`][distilabel.steps.tasks.typing.InstructorStructuredOutputType] to see the expected format
142+
Take a look at [`InstructorStructuredOutputType`][distilabel.typing.models.InstructorStructuredOutputType] to see the expected format
143143
of the `structured_output` dict variable.
144144

145145
The following is the same example you can see with `outlines`'s `JSON` section for comparison purposes.

docs/sections/how_to_guides/basic/step/generator_step.md

+4-4
@@ -9,7 +9,7 @@ from typing_extensions import override
99
from distilabel.steps import GeneratorStep
1010

1111
if TYPE_CHECKING:
12-
from distilabel.steps.typing import StepColumns, GeneratorStepOutput
12+
from distilabel.typing import StepColumns, GeneratorStepOutput
1313

1414
class MyGeneratorStep(GeneratorStep):
1515
instructions: List[str]
@@ -67,7 +67,7 @@ We can define a custom generator step by creating a new subclass of the [`Genera
6767
The default signature for the `process` method is `process(self, offset: int = 0) -> GeneratorStepOutput`. The argument `offset` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
6868

6969
!!! WARNING
70-
For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.steps.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
70+
For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
7171

7272
=== "Inherit from `GeneratorStep`"
7373

@@ -81,7 +81,7 @@ We can define a custom generator step by creating a new subclass of the [`Genera
8181
from distilabel.steps import GeneratorStep
8282

8383
if TYPE_CHECKING:
84-
from distilabel.steps.typing import StepColumns, GeneratorStepOutput
84+
from distilabel.typing import StepColumns, GeneratorStepOutput
8585

8686
class MyGeneratorStep(GeneratorStep):
8787
instructions: List[str]
@@ -104,7 +104,7 @@ We can define a custom generator step by creating a new subclass of the [`Genera
104104
from distilabel.steps import step
105105

106106
if TYPE_CHECKING:
107-
from distilabel.steps.typing import GeneratorStepOutput
107+
from distilabel.typing import GeneratorStepOutput
108108

109109
@step(outputs=[...], step_type="generator")
110110
def CustomGeneratorStep(offset: int = 0) -> "GeneratorStepOutput":
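
The `process(self, offset: int = 0) -> GeneratorStepOutput` signature discussed in this file can be pictured with a plain generator. The sketch below does not import distilabel: `GeneratorOutputSketch` is a local alias that assumes the output is an iterator of `(batch, is_last_batch)` tuples, and the free function `process` with explicit `instructions` and `batch_size` parameters is a hypothetical stand-in for a method on a `GeneratorStep` subclass.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Local alias, assumed shape: batches of rows plus a flag marking the last batch.
GeneratorOutputSketch = Iterator[Tuple[List[Dict[str, Any]], bool]]


def process(
    instructions: List[str], batch_size: int, offset: int = 0
) -> GeneratorOutputSketch:
    """Yield batches of rows, skipping the first `offset` instructions."""
    remaining = instructions[offset:]
    while remaining:
        batch = [{"instruction": text} for text in remaining[:batch_size]]
        remaining = remaining[batch_size:]
        # The flag is True only when nothing is left to generate.
        yield batch, len(remaining) == 0


batches = list(process(["a", "b", "c", "d", "e"], batch_size=2, offset=1))
```

Honouring `offset` is what would let a resumed run continue from where generation stopped instead of regenerating rows that were already produced.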

docs/sections/how_to_guides/basic/step/global_step.md

+3-3
@@ -16,7 +16,7 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
1616
The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
1717

1818
!!! WARNING
19-
For the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.steps.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
19+
For the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
2020

2121
=== "Inherit from `GlobalStep`"
2222

@@ -27,7 +27,7 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
2727
from distilabel.steps import GlobalStep, StepInput
2828

2929
if TYPE_CHECKING:
30-
from distilabel.steps.typing import StepColumns, StepOutput
30+
from distilabel.typing import StepColumns, StepOutput
3131

3232
class CustomStep(GlobalStep):
3333
@property
@@ -61,7 +61,7 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
6161
from distilabel.steps import StepInput, step
6262

6363
if TYPE_CHECKING:
64-
from distilabel.steps.typing import StepOutput
64+
from distilabel.typing import StepOutput
6565

6666
@step(inputs=[...], outputs=[...], step_type="global")
6767
def CustomStep(inputs: StepInput) -> "StepOutput":
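
The warning in this file says that validation can fail when type hints are quoted or imported only under `typing.TYPE_CHECKING`. The sketch below shows one general way such a failure arises in Python, using only the standard library. It is an illustration of runtime hint resolution, not a description of how distilabel performs its validation, and `Decimal` stands in for any type imported only for type checkers.

```python
from typing import TYPE_CHECKING, List, get_type_hints

if TYPE_CHECKING:
    # Seen by static type checkers only; never imported at runtime.
    from decimal import Decimal


def quoted_hint(value: "Decimal") -> "Decimal":
    return value


def real_hint(values: List[int]) -> List[int]:
    return values


def hints_resolve(func) -> bool:
    """Report whether the annotations of `func` can be resolved at runtime."""
    try:
        get_type_hints(func)
    except NameError:
        return False
    return True


quoted_ok = hints_resolve(quoted_hint)  # "Decimal" is not defined at runtime
real_ok = hints_resolve(real_hint)  # List[int] is an actual object
```

Any code that inspects annotations at runtime hits the same `NameError`, which is why hints that must be inspected need to be importable objects rather than strings.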

docs/sections/how_to_guides/basic/step/index.md

+4-4
@@ -11,7 +11,7 @@ from typing import TYPE_CHECKING
1111
from distilabel.steps import Step, StepInput
1212

1313
if TYPE_CHECKING:
14-
from distilabel.steps.typing import StepColumns, StepOutput
14+
from distilabel.typing import StepColumns, StepOutput
1515

1616
class MyStep(Step):
1717
@property
@@ -87,7 +87,7 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
8787
The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
8888

8989
!!! WARNING
90-
For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.steps.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
90+
For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
9191

9292
=== "Inherit from `Step`"
9393

@@ -98,7 +98,7 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
9898
from distilabel.steps import Step, StepInput
9999

100100
if TYPE_CHECKING:
101-
from distilabel.steps.typing import StepColumns, StepOutput
101+
from distilabel.typing import StepColumns, StepOutput
102102

103103
class CustomStep(Step):
104104
@property
@@ -132,7 +132,7 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
132132
from distilabel.steps import StepInput, step
133133

134134
if TYPE_CHECKING:
135-
from distilabel.steps.typing import StepOutput
135+
from distilabel.typing import StepOutput
136136

137137
@step(inputs=[...], outputs=[...])
138138
def CustomStep(inputs: StepInput) -> "StepOutput":
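
The `process(self, *inputs: StepInput) -> StepOutput` signature described in this file accepts one batch per connected upstream step. The sketch below mimics that shape in plain Python. `Batch` is a local alias that assumes each input is a list of row dictionaries, and the free function `process` and the `length` column are hypothetical, standing in for a method on a `Step` subclass.

```python
from typing import Any, Dict, Iterator, List

# Local alias, assumed shape: one batch is a list of row dictionaries.
Batch = List[Dict[str, Any]]


def process(*inputs: Batch) -> Iterator[Batch]:
    """Add a `length` column to every row of every upstream batch."""
    for batch in inputs:
        for row in batch:
            row["length"] = len(row["instruction"])
        yield batch


# Two upstream steps connected to the same step, each providing one batch.
upstream_a = [{"instruction": "hello"}]
upstream_b = [{"instruction": "hi"}, {"instruction": "bye"}]

outputs = list(process(upstream_a, upstream_b))
```

Because `*inputs` is variadic, the same function body works whether one step or several are connected, which is the reason the text gives for keeping the signature unchanged.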

docs/sections/how_to_guides/basic/task/generator_task.md

+2-3
@@ -12,8 +12,7 @@ from typing import Any, Dict, List, Union
1212
from typing_extensions import override
1313

1414
from distilabel.steps.tasks.base import GeneratorTask
15-
from distilabel.steps.tasks.typing import ChatType
16-
from distilabel.steps.typing import GeneratorOutput
15+
from distilabel.typing import ChatType, GeneratorOutput
1716

1817

1918
class MyCustomTask(GeneratorTask):
@@ -78,7 +77,7 @@ We can define a custom generator task by creating a new subclass of the [`Genera
7877
from typing import Any, Dict, List, Union
7978

8079
from distilabel.steps.tasks.base import GeneratorTask
81-
from distilabel.steps.tasks.typing import ChatType
80+
from distilabel.typing import ChatType
8281

8382

8483
class MyCustomTask(GeneratorTask):
