About the developer

wq2012
670 Stars 145 Forks Apache License 2.0 88 Commits 1 Opened issues

Description

A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

Awesome Speaker Diarization

Table of contents

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, simply send a pull request. (contributing guide)

Publications

Special topics

Review & survey papers

Supervised diarization

Joint diarization and ASR

Challenges

Other

2020

2019

2018

2017

2016

2015

2014

2013

2011

2009

2008

2006

Software

Framework

| Link | Language | Description |
| ---- | -------- | ----------- |
| SIDEKIT for diarization (s4d) | Python | An open source package extension of SIDEKIT for speaker diarization. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
| AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
| LIUM SpkDiarization | Java | LIUMSpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013). |
| kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
| Alize LIA_SpkSeg | C++ | ALIZÉ is an open source platform for speaker recognition. LIA_SpkSeg is its set of tools for speaker diarization. |
| pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
| pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
| Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support openset speakers. |
| EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
| VBDiarization | Python | Speaker diarization based on Kaldi x-vectors, using a pretrained model trained in Kaldi (kaldi-asr/kaldi), converted to ONNX format (onnx/onnx), and run in ONNXRuntime (Microsoft/onnxruntime). |
| RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system. It allows the user to send or record audio of a conversation and receive timestamps of who spoke when. |

Evaluation

| Link | Language | Description |
| ---- | -------- | ----------- |
| pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
| SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
| NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
| dscore | Python & Perl | Diarization scoring tools. |
| Sequence Match Accuracy | Python | Match the accuracy of two sequences with Hungarian algorithm. |
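
Matching two label sequences with the Hungarian algorithm, as the last entry in the table describes, can be sketched as follows. This is a minimal illustration and not the code of any tool listed above: the function name `sequence_match_accuracy` is hypothetical, and it assumes SciPy's `linear_sum_assignment` is available. Diarization labels are arbitrary (the hypothesis may call speaker "A" speaker 1), so the sketch finds the one-to-one label mapping that maximises the number of agreeing positions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def sequence_match_accuracy(reference, hypothesis):
    """Accuracy of hypothesis against reference under the best label mapping."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if len(reference) == 0:
        raise ValueError("sequences must not be empty")

    # Index the distinct labels of each sequence in order of first appearance.
    ref_index = {label: i for i, label in enumerate(dict.fromkeys(reference))}
    hyp_index = {label: j for j, label in enumerate(dict.fromkeys(hypothesis))}

    # counts[i, j] = positions where reference label i co-occurs with
    # hypothesis label j.
    counts = np.zeros((len(ref_index), len(hyp_index)), dtype=int)
    for ref_label, hyp_label in zip(reference, hypothesis):
        counts[ref_index[ref_label], hyp_index[hyp_label]] += 1

    # Hungarian algorithm: one-to-one mapping with the most agreeing positions.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return float(counts[rows, cols].sum()) / len(reference)


accuracy = sequence_match_accuracy(["A", "A", "B", "B", "C"], [1, 1, 0, 0, 0])
print(accuracy)
```

In this toy call the hypothesis merges speakers "B" and "C" into one label, so at most four of the five positions can agree under a one-to-one mapping.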

Clustering

| Link | Language | Description |
| ---- | -------- | ----------- |
| uis-rnn | Python & PyTorch | Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
| uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
| DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
| SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations. |
| sklearn.cluster | Python | scikit-learn clustering algorithms. |
| PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
| PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
| Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |
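
To give a feel for what the spectral clustering entries do, here is a deliberately simplified sketch using only NumPy. It is not the API of SpectralCluster or any other library listed above. The function name `two_speaker_spectral_split` is hypothetical, and the sketch assumes exactly two speakers, uses a plain cosine affinity with no refinement operations, and runs on synthetic embeddings.

```python
import numpy as np


def two_speaker_spectral_split(embeddings):
    """Split segment embeddings into two groups using the Fiedler vector."""
    embeddings = np.asarray(embeddings, dtype=float)

    # Cosine affinity between every pair of segments. Negative similarities
    # are clipped so that all edge weights are non-negative.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(affinity, 0.0)

    # Unnormalised graph Laplacian L = D - A.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity

    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]

    # The sign of each entry assigns the segment to one of two groups.
    return (fiedler > 0).astype(int)


# Toy data: 10 segments around each of two made-up speaker centroids.
rng = np.random.default_rng(0)
centroids = np.array([[1.0, 0.0, 0.0, 0.0], [0.3, 1.0, 0.0, 0.0]])
segments = np.vstack(
    [c + 0.05 * rng.standard_normal((10, 4)) for c in centroids]
)
labels = two_speaker_spectral_split(segments)
print(labels)
```

Which group receives label 0 is arbitrary, because an eigenvector is only defined up to sign. Real systems also have to estimate the number of speakers, which this sketch does not attempt.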

Speaker embedding

| Link | Method | Language | Description |
| ---- | ------ | -------- | ----------- |
| resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
| Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
| PyTorchSpeakerVerification | d-vector | Python & PyTorch | PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration. |
| Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time. |
| deep-speaker | d-vector | Python & Keras | Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
| x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
| kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
| voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
| pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |

Speaker change detection

| Link | Language | Description |
| ---- | -------- | ----------- |
| change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |

Audio feature extraction

| Link | Language | Description |
| ---- | -------- | ----------- |
| LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
| python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |

Audio data augmentation

| Link | Language | Description |
| ---- | -------- | ----------- |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator using Python. |
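
The libraries above simulate room impulse responses (RIRs). Once an RIR is available, a common way to use it for augmentation is to convolve clean audio with it. The sketch below illustrates that step only and does not call any of the listed libraries: `synthetic_rir` is a hypothetical helper that fakes an RIR as exponentially decaying noise, a crude stand-in for a simulated or measured response, and `reverberate` is likewise an illustrative name.

```python
import numpy as np


def synthetic_rir(sample_rate, rt60, rng):
    """Crude stand-in for an RIR: white noise with an exponential decay."""
    length = int(sample_rate * rt60)
    t = np.arange(length) / sample_rate
    # Amplitude drops by 60 dB (a factor of 10**-3) after rt60 seconds.
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(length) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))


def reverberate(signal, rir):
    """Convolve a dry signal with an RIR, keep its length, peak-normalise."""
    wet = np.convolve(signal, rir)[: len(signal)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


rng = np.random.default_rng(0)
sample_rate = 16000
dry = rng.standard_normal(sample_rate)  # one second of noise standing in for speech
rir = synthetic_rir(sample_rate, rt60=0.3, rng=rng)
wet = reverberate(dry, rir)
print(wet.shape)
```

In practice the synthetic response would be replaced by one produced by a simulator or taken from a measured RIR collection.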

Other software

| Link | Language | Description |
| ---- | -------- | ----------- |
| VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |

Datasets

Diarization datasets

| Audio | Diarization ground truth | Language | Pricing | Additional information |
| ----- | ------------------------ | -------- | ------- | ---------------------- |
| 2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
| 2003 NIST Rich Transcription Evaluation Data | Together with audios | en, ar, zh | $2000.00 | Telephone speech, broadcast news |
| CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
| The ICSI Meeting Corpus | Together with audios | en | Free | License |
| The AMI Meeting Corpus | Together with audios (need to be processed) | Multiple | Free | License |
| Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
| Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
| VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos. |

Speaker embedding training sets

| Name | Utterances | Speakers | Language | Pricing | Additional information |
| ---- | ---------- | -------- | -------- | ------- | ---------------------- |
| TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK | 43K+ | 109 | en | Free | Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
| LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
| Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
| LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
| VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
| CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
| BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos (videos where people share their opinions on books) from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
| DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
| NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, ...) as well as regional information and linguistic information. |

Augmentation noise sources

| Name | Utterances | Pricing | Additional information |
| ---- | ---------- | ------- | ---------------------- |
| AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
| MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
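
Noise corpora like these are typically used by mixing a noise clip into clean speech at a chosen signal-to-noise ratio (SNR). A minimal NumPy sketch of that mixing step follows. The function name `mix_at_snr` is hypothetical, the signals are synthetic placeholders rather than samples from AudioSet or MUSAN, and looping the noise to cover the speech is an assumption made for simplicity.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)

    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        raise ValueError("noise clip is silent")

    # Choose scale so that speech_power / (scale**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


rng = np.random.default_rng(0)
# A 220 Hz tone standing in for one second of speech at 16 kHz.
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)

achieved_snr = 10 * np.log10(
    np.mean(speech ** 2) / np.mean((noisy - speech) ** 2)
)
print(round(achieved_snr, 3))
```

Drawing `snr_db` at random from a range for each training example is a common variation on this step.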

Conferences

| Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
| ------------------- | --------- | ---------- | ------------ | ------------ |
| ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
| InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
| Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
| SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
| ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
| WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |

Other learning materials

Books

Tech blogs

Video tutorials

Products

| Company | Product |
| ------- | ------- |
| Google | Google Cloud Speech-to-Text API |
| Amazon | Amazon Transcribe |
| IBM | Watson Speech To Text API |
| DeepAffects | Speaker Diarization API |
