A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.
To add items to this page, simply send a pull request (see the contributing guide).

| Link | Language | Description |
| ---- | -------- | ----------- |
| pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
| SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
| NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
| dscore | Python & Perl | Diarization scoring tools. |
| Sequence Match Accuracy | Python | Match the accuracy of two sequences with Hungarian algorithm. |
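
Diarization output uses arbitrary speaker labels, so scoring tools such as those listed above first need a one-to-one mapping between hypothesis and reference labels. The following is a minimal sketch of that idea for two equal-length label sequences. It is not the code of any listed library: the function name `sequence_match_accuracy` is hypothetical, and it searches label permutations by brute force rather than using the Hungarian algorithm, so it is only practical for a handful of speakers.

```python
from itertools import permutations

_PAD = object()  # placeholder label used to pad the smaller label set


def sequence_match_accuracy(ref, hyp):
    """Best fraction of matching positions over all one-to-one label mappings.

    Hypothetical helper for illustration; brute force, not the Hungarian algorithm.
    """
    if len(ref) != len(hyp):
        raise ValueError("sequences must have the same length")
    if not ref:
        return 1.0

    # Unique labels, in order of first appearance.
    ref_labels = list(dict.fromkeys(ref))
    hyp_labels = list(dict.fromkeys(hyp))

    # Count how often each (reference, hypothesis) label pair co-occurs.
    counts = {}
    for r, h in zip(ref, hyp):
        counts[(r, h)] = counts.get((r, h), 0) + 1

    # Pad both label sets to the same size so each permutation is a one-to-one map.
    size = max(len(ref_labels), len(hyp_labels))
    ref_labels += [_PAD] * (size - len(ref_labels))
    hyp_labels += [_PAD] * (size - len(hyp_labels))

    # Try every assignment of reference labels to hypothesis labels.
    best = 0
    for perm in permutations(ref_labels):
        matched = sum(counts.get((r, h), 0) for r, h in zip(perm, hyp_labels))
        best = max(best, matched)
    return best / len(ref)


if __name__ == "__main__":
    # Same segmentation, swapped label names: a perfect match.
    print(sequence_match_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))
    # Two hypothesis speakers against three reference speakers.
    print(sequence_match_accuracy([0, 0, 1, 1, 2], ["a", "a", "b", "b", "b"]))
```

For realistic numbers of speakers, an assignment solver (the Hungarian algorithm) replaces the permutation loop, since the brute-force search grows factorially with the number of labels.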

| Link | Language | Description |
| ---- | -------- | ----------- |
| uis-rnn | Python & PyTorch | Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
| uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
| DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
| SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations. |
| sklearn.cluster | Python | scikit-learn clustering algorithms. |
| PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
| PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
| Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |

| Link | Method | Language | Description |
| ---- | ------ | -------- | ----------- |
| resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
| Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
| PyTorchSpeakerVerification | d-vector | Python & PyTorch | PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration. |
| Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time. |
| deep-speaker | d-vector | Python & Keras | Third-party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
| x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
| kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
| voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
| pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |

| Link | Language | Description |
| ---- | -------- | ----------- |
| change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |

| Link | Language | Description |
| ---- | -------- | ----------- |
| LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
| python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |

| Link | Language | Description |
| ---- | -------- | ----------- |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator using Python. |

| Link | Language | Description |
| ---- | -------- | ----------- |
| VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |

| Audio | Diarization ground truth | Language | Pricing | Additional information |
| ----- | ------------------------ | -------- | ------- | ---------------------- |
| 2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
| 2003 NIST Rich Transcription Evaluation Data | Together with audios | en, ar, zh | $2000.00 | telephone speech, broadcast news |
| CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
| The ICSI Meeting Corpus | Together with audios | en | Free | License |
| The AMI Meeting Corpus | Together with audios (need to be processed) | Multiple | Free | License |
| Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
| Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
| VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos. |

| Name | Utterances | Speakers | Language | Pricing | Additional information |
| ---- | ---------- | -------- | -------- | ------- | ---------------------- |
| TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK | 43K+ | 109 | en | Free | Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
| LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
| Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
| LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
| VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
| CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
| BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos (videos where people share their opinions on books) from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
| DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
| NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, ...) as well as regional information and linguistic information. |

| Name | Utterances | Pricing | Additional information |
| ---- | ---------- | ------- | ---------------------- |
| AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
| MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |

| Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
| ------------------- | --------- | ---------- | ------------ | ------------ |
| ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
| InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
| Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
| SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
| ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
| WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |