Minimizing missed detection #224
Hi @arielrado,
What do you mean by "annotated with webrtc vad"? How does that annotation process work?
Thanks for responding! I used the code below: essentially, I used webrtc to determine where there is speech in a single-speaker recording (in 30 ms chunks), then created a wav file containing a few speakers and an RTTM file to go with it.
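A minimal sketch of this kind of webrtcvad-based annotation, assuming the py-webrtcvad package, a 16 kHz mono 16-bit PCM recording, and placeholder file names and speaker label, could look like this:

```python
import contextlib
import wave

import webrtcvad

# Assumed input format: 16 kHz, mono, 16-bit PCM; webrtcvad only accepts 10/20/30 ms frames.
SAMPLE_RATE = 16000
FRAME_MS = 30
BYTES_PER_FRAME = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 16-bit samples

vad = webrtcvad.Vad(3)  # aggressiveness 0-3

with contextlib.closing(wave.open("speaker1.wav", "rb")) as wf:
    pcm = wf.readframes(wf.getnframes())

# Label each 30 ms frame as speech / non-speech
flags = []
for start in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
    frame = pcm[start:start + BYTES_PER_FRAME]
    flags.append(vad.is_speech(frame, SAMPLE_RATE))

# Merge consecutive speech frames into segments and write one RTTM line per segment
with open("speaker1.rttm", "w") as rttm:
    seg_start = None
    for i, is_speech in enumerate(flags + [False]):  # sentinel flushes the last segment
        t = i * FRAME_MS / 1000.0
        if is_speech and seg_start is None:
            seg_start = t
        elif not is_speech and seg_start is not None:
            duration = t - seg_start
            rttm.write(
                f"SPEAKER speaker1 1 {seg_start:.3f} {duration:.3f} <NA> <NA> spk0 <NA> <NA>\n"
            )
            seg_start = None
```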
@arielrado do you know what the detection error rate of your VAD is on the same data, i.e. false alarm and missed detection?

I think this may also be related to a deeper problem in diart that I've been meaning to address for some time, which is that … Consider this: if you increase … Conversely, if you decrease …

If my intuition is correct, the tuning should find a point in hyper-parameter space where these two scenarios either balance out or one wins over the other, but in any case it will be with the best DER possible. Your tuning doesn't really care if your miss is too high, it just cares about the sum of the three components (you could also try changing the tuning metric to weigh missed detection differently).

This is also a very interesting use case for me to better understand this problem and find a nice solution.

Another option would be to implement a different clustering strategy. For example, you could use the overlapping part of two consecutive chunk outputs to decide the speaker permutation of the new prediction, and then attempt to detect new speakers with speaker embeddings (big challenge here). You may want to take a look at PR #201, which does something like this without the new speaker detection (notice that all hyper-parameters are gone there).
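On the idea of weighing missed detection differently in the tuning metric: pyannote.metrics exposes the individual DER components, so a weighted objective could be sketched as below. The weights and the function name are illustrative assumptions, not part of diart.

```python
from pyannote.metrics.diarization import DiarizationErrorRate

def weighted_der(reference, hypothesis, w_fa=1.0, w_miss=2.0, w_conf=1.0):
    """DER-like objective that penalizes missed detection more heavily than false alarm."""
    components = DiarizationErrorRate()(reference, hypothesis, detailed=True)
    weighted = (w_fa * components["false alarm"]
                + w_miss * components["missed detection"]
                + w_conf * components["confusion"])
    return weighted / components["total"]
```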
Hi juanmc2005,
I don't know what the precise error rate is on my data since I don't have hand annotations. According to this, webrtc VAD has an F1 score of 0.819 on raw audio, which isn't very good. My data was recorded in a quiet environment, so I will use a simple energy-level VAD instead (a rough sketch is below).
I don't have a lot of experience working with rxpy, so the codebase is pretty confusing to me. How would you suggest I go about implementing this? What sections should I be looking at? Should I implement the VAD as an extra block in the pipeline? I will use a more powerful VAD like MarbleNet. I also had an idea to use a full offline diarization pipeline instead of the segmentation block; I am using a pretty powerful GPU, and while I understand that making the pipeline lightweight was a priority, I might be able to make use of the hardware overhead. Thanks for the help so far, I'll keep you posted!
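A simple energy-level VAD of the kind mentioned above can be as small as the following sketch; the frame length and dB threshold are assumptions that would need calibrating on the actual recordings:

```python
import numpy as np

def energy_vad(samples: np.ndarray, sample_rate: int, frame_ms: int = 30,
               threshold_db: float = -40.0) -> np.ndarray:
    """Flag each frame as speech/non-speech based on RMS energy in dBFS."""
    frame_len = int(sample_rate * frame_ms / 1000)
    num_frames = len(samples) // frame_len
    frames = samples[: num_frames * frame_len].reshape(num_frames, frame_len)
    # Assumes samples are floats normalized to [-1, 1]
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    energy_db = 20 * np.log10(rms + 1e-12)
    return energy_db > threshold_db
```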
You don't need to know rxpy to modify a pipeline; if you look at the implementation of …
I don't think using a full offline diarization for each chunk is a good idea, as you'll have the same problems that you have with a segmentation block, for example deciding which speakers of the new chunk correspond to the speakers in previous ones. To verify whether the issue is coming from segmentation or from clustering, you could use the segmentation block as a VAD (you can simply take the max score across all speakers) and measure the false alarm and miss on your data. If the miss is bad you have a segmentation problem, but if it's better then the problem comes from clustering, as I hypothesized in my previous message. You could also consider using …
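To illustrate the max-score-across-speakers idea: given raw segmentation scores of shape (frames, speakers), a frame-level VAD and a detection error measurement could look roughly like this. The threshold, frame duration, and the shape of `scores` are assumptions, not values from this thread.

```python
import numpy as np
from pyannote.core import Annotation, Segment
from pyannote.metrics.detection import DetectionErrorRate

def segmentation_to_vad(scores: np.ndarray, frame_duration: float,
                        threshold: float = 0.5) -> Annotation:
    """Collapse per-speaker segmentation scores (frames x speakers) into speech regions."""
    speech_prob = scores.max(axis=1)                        # max score across all speakers
    is_speech = np.append(speech_prob > threshold, False)   # sentinel flushes last segment
    vad, start = Annotation(), None
    for i, flag in enumerate(is_speech):
        t = i * frame_duration
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            vad[Segment(start, t)] = "speech"
            start = None
    return vad

# False alarm and miss against a reference annotation (labels are ignored for detection):
# metric = DetectionErrorRate()
# detection_error = metric(reference, segmentation_to_vad(scores, frame_duration=0.017))
```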
Hi
I can reduce that by adding VAD filtering at the end of the pipeline (using a more reliable VAD): my plan is to only identify a speaker if the external VAD is positive and segmentation has tagged them as active (a rough sketch of this is included below). I'd also like to reduce the confusion; I tried reducing the max_speakers parameter, but that caused confusion to increase, contrary to my intuition.
I tried out this method but it yields worse results overall.
I have been using this method since I found it to perform better when I first started testing diart. Thanks for all the help so far!
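One way to sketch the VAD-filtering plan with pyannote.core is to intersect the diarization output with the external VAD's speech regions; the variable names here are placeholders:

```python
from pyannote.core import Annotation, Segment, Timeline

def filter_with_vad(diarization: Annotation, vad_regions: Timeline) -> Annotation:
    """Keep only the parts of each speaker turn that overlap the external VAD."""
    return diarization.crop(vad_regions, mode="intersection")

# Example with dummy VAD regions; `diarization` would be the Annotation
# produced by the pipeline for the same audio.
vad_regions = Timeline([Segment(0.0, 2.5), Segment(3.0, 6.0)])
# filtered = filter_with_vad(diarization, vad_regions)
```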
@arielrado this will be more of a trial and error effort to bring the confusion down. Reducing speaker confusion in real-time applications is still an open problem in the speaker diarization community.
Well, at some point you'll have to decide what you do when the VAD and diarization disagree; for example, you can attempt a relabeling using the embedding model if the VAD detects speech that diart missed. It's rather difficult to say what can be done in your setting without in-depth knowledge of your data and concrete application. Optimizing the library's performance for your particular application is not trivial.
You can probably get better performance by constraining the system to a maximum of 3 speakers.
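As a rough illustration, constraining the pipeline to 3 speakers would look something like this with diart's configuration object; the argument names may differ between diart versions, and the hyper-parameter values are simply the ones reported in this thread:

```python
from diart import SpeakerDiarization, SpeakerDiarizationConfig

# Hypothetical setup: values come from the tuning reported in this issue,
# with max_speakers=3 reflecting the suggested constraint.
config = SpeakerDiarizationConfig(
    tau_active=0.493,
    rho_update=0.055,
    delta_new=0.633,
    max_speakers=3,
)
pipeline = SpeakerDiarization(config)
```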
I've been working on tuning the pipeline for my application, which is a real-time conversational system. The best results so far are:
36.02% DER, 2.41% false alarm, 20.04% missed detection, and 13.56% confusion
I used my own data, which was recorded on the target platform's microphone array and annotated with webrtc vad. Our system is reasonably tolerant to false alarms, since we require ASR to label a speaker, but not to missed detections.
Is there a way to make the pipeline less conservative?
I am using segmentation-3.0, the default embedding, and hyper-parameters achieved through tuning: tau=0.493, rho=0.055, delta=0.633.
I tried lowering tau and increasing rho, but that just increased the confusion rate.
I also tried changing gamma and beta, as I understood that lowering gamma and beta can yield the desired results, but it seems they have a negligible effect.