Commit 8cad376

README improvements (#207)

* Add link to accept conditions of segmentation 3.0
* Add table with available models. Add some latencies
* Add more info on selecting different models
* Add missing info on available models
* Improve top menu
* Improve python badge
* Move things around. Simplify code and wording
* Add dark themed logo
* Remove whitespace at the top
* Update README.md
* Rename from_pyannote to from_pretrained in segmentation and embedding blocks
* Separate huggingface links from model name
* Fix reproducibility link
* Add animated diarization pipeline diagram
* Improve pipeline gif
* Update README.md
* Update snippet gif. Fix torch multiprocessing crash with pyannote 3.1. Other README improvements
* Update README.md
* Fix bad link
1 parent 6041c77 commit 8cad376

File tree: 7 files changed, +96 -52 lines

README.md (+86 -47)
@@ -1,7 +1,5 @@
-<br/>
-
 <p align="center">
-<img width="50%" src="https://raw.githubusercontent.com/juanmc2005/diart/main/logo.jpg" title="Logo" />
+<img width="100%" src="https://github.com/juanmc2005/diart/blob/main/logo.jpg?raw=true" title="diart logo" />
 </p>

 <p align="center">
@@ -11,7 +9,7 @@
 <p align="center">
 <img alt="PyPI Version" src="https://img.shields.io/pypi/v/diart?color=g">
 <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/diart?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads">
-<img alt="Top language" src="https://img.shields.io/github/languages/top/juanmc2005/StreamingSpeakerDiarization?color=g">
+<img alt="Python Versions" src="https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-dark_green">
 <img alt="Code size in bytes" src="https://img.shields.io/github/languages/code-size/juanmc2005/StreamingSpeakerDiarization?color=g">
 <img alt="License" src="https://img.shields.io/github/license/juanmc2005/StreamingSpeakerDiarization?color=g">
 <a href="https://joss.theoj.org/papers/cc9807c6de75ea4c29025c7bd0d31996"><img src="https://joss.theoj.org/papers/cc9807c6de75ea4c29025c7bd0d31996/status.svg"></a>
@@ -27,16 +25,16 @@
 🎙️ Stream audio
 </a>
 <span> | </span>
-<a href="#-custom-models">
-🤖 Add your model
+<a href="#-models">
+🧠 Models
 </a>
-<span> | </span>
+<br />
 <a href="#-tune-hyper-parameters">
-📈 Tune hyper-parameters
+📈 Tuning
 </a>
-<br />
+<span> | </span>
 <a href="#-build-pipelines">
-🧠🔗 Build pipelines
+🧠🔗 Pipelines
 </a>
 <span> | </span>
 <a href="#-websockets">
@@ -46,14 +44,6 @@
 <a href="#-powered-by-research">
 🔬 Research
 </a>
-<span> | </span>
-<a href="#-citation">
-📗 Citation
-</a>
-<span> | </span>
-<a href="#-reproducibility">
-👨‍💻 Reproducibility
-</a>
 </h4>
 </div>

@@ -65,38 +55,58 @@

 ## ⚡ Quick introduction

-Diart is a python framework to build AI-powered real-time audio applications. With diart you can
-create your own AI pipeline, benchmark it, tune its hyper-parameters, and even serve it on the web using websockets.
+Diart is a python framework to build AI-powered real-time audio applications.
+Its key feature is the ability to recognize different speakers in real time with state-of-the-art performance,
+a task commonly known as "speaker diarization".

-**We provide pre-trained AI pipelines for:**
+The pipeline `diart.SpeakerDiarization` combines a speaker segmentation and a speaker embedding model
+to power an incremental clustering algorithm that gets more accurate as the conversation progresses:
+
+<p align="center">
+<img width="100%" src="https://github.com/juanmc2005/diart/blob/main/pipeline.gif?raw=true" title="Real-time speaker diarization pipeline" />
+</p>
+
+With diart you can also create your own custom AI pipeline, benchmark it,
+tune its hyper-parameters, and even serve it on the web using websockets.
+
+**We provide pre-trained pipelines for:**

 - Speaker Diarization
 - Voice Activity Detection
-- Transcription (coming soon)
-- [Speaker-Aware Transcription](https://betterprogramming.pub/color-your-captions-streamlining-live-transcriptions-with-diart-and-openais-whisper-6203350234ef) (coming soon)
+- Transcription ([coming soon](https://github.com/juanmc2005/diart/pull/144))
+- [Speaker-Aware Transcription](https://betterprogramming.pub/color-your-captions-streamlining-live-transcriptions-with-diart-and-openais-whisper-6203350234ef) ([coming soon](https://github.com/juanmc2005/diart/pull/147))

 ## 💾 Installation

-1) Create environment:
+**1) Make sure your system has the following dependencies:**
+
+```
+ffmpeg < 4.4
+portaudio == 19.6.X
+libsndfile >= 1.2.2
+```
+
+Alternatively, we provide an `environment.yml` file for a pre-configured conda environment:

 ```shell
 conda env create -f diart/environment.yml
 conda activate diart
 ```

-2) Install the package:
+**2) Install the package:**
 ```shell
 pip install diart
 ```

 ### Get access to 🎹 pyannote models

-By default, diart is based on [pyannote.audio](https://github.com/pyannote/pyannote-audio) models stored in the [huggingface](https://huggingface.co/) hub.
-To allow diart to use them, you need to follow these steps:
+By default, diart is based on [pyannote.audio](https://github.com/pyannote/pyannote-audio) models from the [huggingface](https://huggingface.co/) hub.
+In order to use them, please follow these steps:

 1) [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
-2) [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
-3) Install [huggingface-cli](https://huggingface.co/docs/huggingface_hub/quick-start#install-the-hub-library) and [log in](https://huggingface.co/docs/huggingface_hub/quick-start#login) with your user access token (or provide it manually in diart CLI or API).
+2) [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the newest `pyannote/segmentation-3.0` model
+3) [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
+4) Install [huggingface-cli](https://huggingface.co/docs/huggingface_hub/quick-start#install-the-hub-library) and [log in](https://huggingface.co/docs/huggingface_hub/quick-start#login) with your user access token (or provide it manually in diart CLI or API).
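
For the last step above, a minimal sketch of the login flow, assuming the standard `huggingface_hub` CLI; the access token itself comes from your Hugging Face account settings:

```shell
# Install the Hugging Face Hub CLI and log in interactively;
# paste your user access token when prompted.
pip install huggingface_hub
huggingface-cli login
```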

 ## 🎙️ Stream audio

@@ -116,7 +126,8 @@ A live conversation:
 diart.stream microphone
 ```

-See `diart.stream -h` for more options.
+By default, diart runs a speaker diarization pipeline, equivalent to setting `--pipeline SpeakerDiarization`,
+but you can also set it to `--pipeline VoiceActivityDetection`. See `diart.stream -h` for more options.
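
For example, to run the voice activity detection pipeline on the microphone instead, a minimal sketch (the `--pipeline` value is the one named just above):

```shell
# Use VAD instead of the default speaker diarization pipeline
diart.stream microphone --pipeline VoiceActivityDetection
```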

 ### From python

@@ -135,18 +146,50 @@ inference.attach_observers(RTTMWriter(mic.uri, "/output/file.rttm"))
 prediction = inference()
 ```

-For inference and evaluation on a dataset we recommend to use `Benchmark` (see notes on [reproducibility](#-reproducibility)).
+For inference and evaluation on a dataset we recommend to use `Benchmark` (see notes on [reproducibility](#reproducibility)).

-## 🤖 Add your model
+## 🧠 Models
+
+You can use other models with the `--segmentation` and `--embedding` arguments.
+Or in python:
+
+```python
+import diart.models as m
+
+segmentation = m.SegmentationModel.from_pretrained("model_name")
+embedding = m.EmbeddingModel.from_pretrained("model_name")
+```
+
+### Pre-trained models
+
+Below is a list of all the models currently supported by diart:
+
+| Model Name | Model Type | CPU Time* | GPU Time* |
+|---------------------------------------------------------------------------------------------------------------------------|--------------|-----------|-----------|
+| [🤗](https://huggingface.co/pyannote/segmentation) `pyannote/segmentation` (default) | segmentation | 12ms | 8ms |
+| [🤗](https://huggingface.co/pyannote/segmentation-3.0) `pyannote/segmentation-3.0` | segmentation | 11ms | 8ms |
+| [🤗](https://huggingface.co/pyannote/embedding) `pyannote/embedding` (default) | embedding | 26ms | 12ms |
+| [🤗](https://huggingface.co/hbredin/wespeaker-voxceleb-resnet34-LM) `hbredin/wespeaker-voxceleb-resnet34-LM` (ONNX) | embedding | 48ms | 15ms |
+| [🤗](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM) `pyannote/wespeaker-voxceleb-resnet34-LM` (PyTorch) | embedding | 150ms | 29ms |
+| [🤗](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb) `speechbrain/spkrec-xvect-voxceleb` | embedding | 41ms | 15ms |
+| [🤗](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) `speechbrain/spkrec-ecapa-voxceleb` | embedding | 41ms | 14ms |
+| [🤗](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb-mel-spec) `speechbrain/spkrec-ecapa-voxceleb-mel-spec` | embedding | 42ms | 14ms |
+| [🤗](https://huggingface.co/speechbrain/spkrec-resnet-voxceleb) `speechbrain/spkrec-resnet-voxceleb` | embedding | 41ms | 16ms |
+| [🤗](https://huggingface.co/nvidia/speakerverification_en_titanet_large) `nvidia/speakerverification_en_titanet_large` | embedding | 91ms | 16ms |
+
+The latency of segmentation models is measured in a VAD pipeline (5s chunks).
+
+The latency of embedding models is measured in a diarization pipeline using `pyannote/segmentation` (also 5s chunks).
+
+\* CPU: AMD Ryzen 9 - GPU: RTX 4060 Max-Q
+
+### Custom models

 Third-party models can be integrated by providing a loader function:

 ```python
 from diart import SpeakerDiarization, SpeakerDiarizationConfig
 from diart.models import EmbeddingModel, SegmentationModel
-from diart.sources import MicrophoneAudioSource
-from diart.inference import StreamingInference
-

 def segmentation_loader():
     # It should take a waveform and return a segmentation tensor
@@ -156,17 +199,13 @@ def embedding_loader():
     # It should take (waveform, weights) and return per-speaker embeddings
     return load_pretrained_model("my_other_model.ckpt")

-
 segmentation = SegmentationModel(segmentation_loader)
 embedding = EmbeddingModel(embedding_loader)
 config = SpeakerDiarizationConfig(
     segmentation=segmentation,
     embedding=embedding,
 )
 pipeline = SpeakerDiarization(config)
-mic = MicrophoneAudioSource()
-inference = StreamingInference(pipeline, mic)
-prediction = inference()
 ```
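
Whether the models come from the table above via `from_pretrained` or from custom loaders as in this snippet, the resulting pipeline is streamed the same way. A minimal end-to-end sketch combining pieces shown elsewhere in this README (the model names are illustrative picks from the table):

```python
import diart.models as m
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.inference import StreamingInference
from diart.sources import MicrophoneAudioSource

# Any segmentation/embedding pair from the table above
segmentation = m.SegmentationModel.from_pretrained("pyannote/segmentation-3.0")
embedding = m.EmbeddingModel.from_pretrained("speechbrain/spkrec-ecapa-voxceleb")

config = SpeakerDiarizationConfig(segmentation=segmentation, embedding=embedding)
pipeline = SpeakerDiarization(config)

# Stream from the microphone and run the pipeline in real time
mic = MicrophoneAudioSource()
inference = StreamingInference(pipeline, mic)
prediction = inference()
```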

 If you have an ONNX model, you can use `from_onnx()`:
@@ -204,7 +243,7 @@ optimizer(num_iter=100)

 This will write results to an sqlite database in `/output/dir`.

-### Distributed optimization
+### Distributed tuning

 For bigger datasets, it is sometimes more convenient to run multiple optimization processes in parallel.
 To do this, create a study on a [recommended DBMS](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/004_distributed.html#sphx-glr-tutorial-10-key-features-004-distributed-py) (e.g. MySQL or PostgreSQL) making sure that the study and database names match:
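
The hunk cuts off before the README's own example, but as a rough sketch, creating such a study with Optuna's Python API could look like this (study name, credentials, and the optimization direction are placeholders/assumptions; they must match whatever you pass to diart's tuner):

```python
import optuna

# Placeholder connection string and study name: adjust to your own DBMS setup.
# The study name must match the one used when launching the diart optimizers.
optuna.create_study(
    study_name="diart_tuning",
    storage="mysql://user:password@localhost/diart_tuning",
    direction="minimize",  # assumption: the objective is an error rate
)
```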
@@ -248,8 +287,8 @@ import diart.operators as dops
 from diart.sources import MicrophoneAudioSource
 from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

-segmentation = SpeakerSegmentation.from_pyannote("pyannote/segmentation")
-embedding = OverlapAwareSpeakerEmbedding.from_pyannote("pyannote/embedding")
+segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
+embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
 mic = MicrophoneAudioSource()

 stream = mic.stream.pipe(
@@ -278,7 +317,7 @@ Diart is also compatible with the WebSocket protocol to serve pipelines on the w

 ### From the command line

-```commandline
+```shell
 diart.serve --host 0.0.0.0 --port 7007
 diart.client microphone --host <server-address> --port 7007
 ```
@@ -319,7 +358,7 @@ and [Sophie Rosset](https://perso.limsi.fr/rosset/).
 <img height="400" src="https://github.com/juanmc2005/diart/blob/main/figure1.png?raw=true" title="Visual explanation of the system" width="325" />
 </p>

-## 📗 Citation
+### Citation

 If you found diart useful, please make sure to cite our paper:

@@ -334,7 +373,7 @@ If you found diart useful, please make sure to cite our paper:
 }
 ```

-## 👨‍💻 Reproducibility
+### Reproducibility

 ![Results table](https://github.com/juanmc2005/diart/blob/main/table1.png?raw=true)

@@ -389,7 +428,7 @@ This pre-calculates model outputs in batches, so it runs a lot faster.
 See `diart.benchmark -h` for more options.

 For convenience and to facilitate future comparisons, we also provide the
-[expected outputs](https://github.com/juanmc2005/diart/tree/main/expected_outputs)
+<a href="https://github.com/juanmc2005/diart/tree/main/expected_outputs">expected outputs</a>
 of the paper implementation in RTTM format for every entry of Table 1 and Figure 5.
 This includes the VBx offline topline as well as our proposed online approach with
 latencies 500ms, 1s, 2s, 3s, 4s, and 5s.
@@ -423,4 +462,4 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
 ```

-<p>Logo generated by <a href="https://www.designevo.com/" title="Free Online Logo Maker">DesignEvo free logo designer</a></p>
+<p style="color:grey;font-size:14px;">Logo generated by <a href="https://www.designevo.com/" title="Free Online Logo Maker">DesignEvo free logo designer</a></p>

demo.gif (-1.28 MB)

logo.jpg (29.9 KB)

pipeline.gif (127 KB)

src/diart/blocks/embedding.py (+2 -2)

@@ -158,7 +158,7 @@ def __init__(
         self.normalize = EmbeddingNormalization(norm)

     @staticmethod
-    def from_pyannote(
+    def from_pretrained(
         model,
         gamma: float = 3,
         beta: float = 10,
@@ -167,7 +167,7 @@ def from_pyannote(
         normalize_weights: bool = False,
         device: Optional[torch.device] = None,
     ):
-        model = EmbeddingModel.from_pyannote(model, use_hf_token)
+        model = EmbeddingModel.from_pretrained(model, use_hf_token)
         return OverlapAwareSpeakerEmbedding(
             model, gamma, beta, norm, normalize_weights, device
         )

src/diart/blocks/segmentation.py (+2 -2)

@@ -18,12 +18,12 @@ def __init__(self, model: SegmentationModel, device: Optional[torch.device] = No
         self.formatter = TemporalFeatureFormatter()

     @staticmethod
-    def from_pyannote(
+    def from_pretrained(
         model,
         use_hf_token: Union[Text, bool, None] = True,
         device: Optional[torch.device] = None,
     ) -> "SpeakerSegmentation":
-        seg_model = SegmentationModel.from_pyannote(model, use_hf_token)
+        seg_model = SegmentationModel.from_pretrained(model, use_hf_token)
         return SpeakerSegmentation(seg_model, device)

     def __call__(self, waveform: TemporalFeatures) -> TemporalFeatures:
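
In both blocks the signature is unchanged; only the constructor name moves from `from_pyannote` to `from_pretrained`. A minimal caller-side sketch (model names are the defaults from the README table):

```python
from diart.blocks import OverlapAwareSpeakerEmbedding, SpeakerSegmentation

# Same arguments as the old from_pyannote(); only the method name changed
segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
```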

src/diart/inference.py (+6 -1)

@@ -524,7 +524,12 @@ def __call__(
         num_audio_files = len(audio_file_paths)

         # Workaround for multiprocessing with GPU
-        torch.multiprocessing.set_start_method("spawn")
+        try:
+            torch.multiprocessing.set_start_method("spawn")
+        except RuntimeError:
+            # This may fail if the start method was set before
+            pass
+
         # For Windows support
         freeze_support()
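
For context, the standard-library `set_start_method` that `torch.multiprocessing` re-exports also accepts `force=True`, which is more compact but has different semantics: it overrides a previously chosen start method rather than preserving it, which is presumably why the commit keeps the try/except. A sketch of that alternative:

```python
import torch.multiprocessing

# Overrides any previously set start method; the try/except above instead
# keeps an existing choice and only sets "spawn" when nothing was set yet.
torch.multiprocessing.set_start_method("spawn", force=True)
```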
