593 Stars 201 Forks MIT License 333 Commits 6 Opened issues


Deep Speaker: an End-to-End Neural Speaker Embedding System.


Unofficial Keras implementation of Deep Speaker | Paper | Pretrained Models.

Sample Results

Models were trained on clean speech data, so expect lower performance on noisy data. It is advised to remove silence and background noise before computing the embeddings (for example, with SoX).
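As a rough illustration of the silence-removal advice, the sketch below trims leading and trailing low-energy frames with NumPy. It is not part of this repository: `trim_silence`, the frame length, and the threshold are hypothetical choices, and a dedicated tool will usually do a better job on real recordings.

```python
import numpy as np


def trim_silence(signal, frame_len=400, threshold_db=-40.0):
    """Drop leading/trailing frames whose RMS energy is below a threshold.

    Hypothetical helper: frame_len and threshold_db are illustrative values,
    not settings used by Deep Speaker.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak <= 0.0:
        # The whole signal is digital silence.
        return signal[:0]
    # Frame energy in dB relative to the loudest frame; the floor avoids log(0).
    rms_db = 20.0 * np.log10(np.maximum(rms / peak, 1e-12))
    voiced = np.flatnonzero(rms_db > threshold_db)
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return signal[start:end]


# Synthetic check: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence at 16 kHz.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
audio = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed = trim_silence(audio)
print(len(audio), len(trimmed))
```

This only removes silence at the edges; pauses inside the utterance and background noise are left untouched.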

| Model name | Testing dataset | Num speakers | F | TPR | ACC | EER | Training Logs | Download model |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ResCNN Softmax trained | LibriSpeech all (*) | 2484 | 0.789 | 0.733 | 0.996 | 0.043 | Click | Click |
| ResCNN Softmax+Triplet trained | LibriSpeech all (*) | 2484 | 0.843 | 0.825 | 0.997 | 0.025 | Click | Click |

(*) all includes: dev-clean, dev-other, test-clean, test-other, train-clean-100, train-clean-360, train-other-500.
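The columns above are the F-measure, true positive rate, accuracy, and equal error rate (EER). To make the last one concrete, here is a minimal sketch of how an EER can be estimated from pairwise similarity scores. `equal_error_rate` is a hypothetical helper written for this illustration, not the evaluation code of this repository.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a threshold over the observed scores.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for same-speaker pairs, 0 for different-speaker pairs.
    Both classes must be present. Hypothetical helper for illustration.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # false acceptance rate
        frr = np.mean(~accept[labels])   # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


# Toy scores: one same-speaker pair (0.4) scores below one impostor pair (0.6).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))
```

The EER is the operating point where false acceptances and false rejections are equally frequent, so lower is better.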

The Softmax+Triplet checkpoint is also available on the Chinese cloud - WeiYun.


Deep Speaker is a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering.
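To illustrate how such embeddings can be used for identification, the sketch below compares a query embedding with the mean embedding of each enrolled speaker and picks the closest one by cosine similarity. The helper names and the toy 3-dimensional vectors are assumptions made for this example; real embeddings would come from the model.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Project vectors onto the unit hypersphere."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, 1e-12)


def identify(embedding, enrolled):
    """Return (speaker, similarity) for the closest enrolled speaker.

    enrolled maps a speaker name to an array of shape (n_utterances, dim).
    Hypothetical helper, not an API of this repository.
    """
    query = l2_normalize(embedding)
    best_name, best_sim = None, -np.inf
    for name, utterances in enrolled.items():
        # Average the unit-length embeddings, then renormalize the centroid.
        centroid = l2_normalize(l2_normalize(utterances).mean(axis=0))
        sim = float(np.dot(query, centroid))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


# Toy 3-dimensional vectors stand in for real speaker embeddings.
enrolled = {
    'alice': np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]),
    'bob': np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]),
}
name, sim = identify(np.array([0.8, 0.3, 0.0]), enrolled)
print(name, round(sim, 3))
```

Because all vectors are normalized to unit length, the dot product is the cosine similarity.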

Getting started

Install dependencies


  • tensorflow>=2.0
  • keras>=2.3.1
  • python>=3.6
    pip install -r requirements.txt

If you see the error `libsndfile not found`, run:

    sudo apt-get install libsndfile-dev


The code for training is available in this repository. It takes a bit less than a week with a GTX1070 to train the models.

System requirements for a complete training run:

  • At least 300GB of free disk space on a fast SSD (250GB just for all the uncompressed + processed data).
  • 32GB of memory and at least 32GB of swap (swap can be created from SSD space).
  • An NVIDIA GPU such as the 1080Ti.

pip uninstall -y tensorflow && pip install tensorflow-gpu
./deep-speaker download_librispeech    # if the download is too slow, consider replacing [wget] by [axel -n 10 -a] in
./deep-speaker build_mfcc              # will build MFCC for softmax pre-training and triplet training.
./deep-speaker build_model_inputs      # will build inputs for softmax pre-training.
./deep-speaker train_softmax           # takes ~3 days.
./deep-speaker train_triplet           # takes ~3 days.
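The second training stage optimizes a triplet objective on cosine similarities. The sketch below shows the general idea in NumPy: the anchor should be more similar to an utterance from the same speaker than to one from a different speaker, by at least a margin. It is a simplified illustration, not the loss code of this repository, and the `margin` value is an assumption.

```python
import numpy as np


def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Mean hinge loss over a batch of (anchor, positive, negative) triplets.

    Inputs have shape (batch, dim) and are assumed to be L2-normalized, so the
    row-wise dot product equals the cosine similarity. margin is illustrative.
    """
    sim_ap = np.sum(anchor * positive, axis=1)  # anchor vs same speaker
    sim_an = np.sum(anchor * negative, axis=1)  # anchor vs other speaker
    return float(np.mean(np.maximum(sim_an - sim_ap + margin, 0.0)))


a = np.array([[1.0, 0.0]])
p = np.array([[1.0, 0.0]])  # same direction as the anchor
n = np.array([[0.0, 1.0]])  # orthogonal to the anchor
easy = cosine_triplet_loss(a, p, n)  # already well separated
hard = cosine_triplet_loss(a, n, p)  # positive and negative swapped
print(easy, hard)
```

Triplets that are already separated by more than the margin contribute zero loss, so training effort concentrates on the confusable ones.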

NOTE: If you want to use your own dataset, make sure you follow the directory structure of LibriSpeech. Audio files have to be in FLAC format. If you have WAV files, you can convert them to FLAC with an audio conversion tool. Both formats are lossless (FLAC is compressed WAV).
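LibriSpeech stores each utterance as `<speaker>/<chapter>/<speaker>-<chapter>-<utterance>.flac` inside a subset folder. The sketch below lists audio files in a custom dataset that do not match this pattern. `find_layout_violations` is a hypothetical standard-library helper, and the numeric IDs mirror LibriSpeech; how strictly the preprocessing in this repository depends on the layout is an assumption worth verifying against the code.

```python
import re
import tempfile
from pathlib import Path

# <speaker>/<chapter>/<speaker>-<chapter>-<utterance>.flac
_UTTERANCE = re.compile(r'^(\d+)-(\d+)-(\d+)\.flac$')


def find_layout_violations(subset_dir):
    """Return relative paths of audio files that break the LibriSpeech layout."""
    subset_dir = Path(subset_dir)
    bad = []
    for path in sorted(subset_dir.rglob('*')):
        if not path.is_file() or path.suffix.lower() not in {'.flac', '.wav'}:
            continue
        rel = path.relative_to(subset_dir)
        match = _UTTERANCE.match(path.name)
        ok = (
            match is not None
            and len(rel.parts) == 3
            and rel.parts[0] == match.group(1)
            and rel.parts[1] == match.group(2)
        )
        if not ok:
            bad.append(rel.as_posix())
    return bad


# Demo on a throwaway directory with one valid and one invalid file.
with tempfile.TemporaryDirectory() as tmp:
    chapter = Path(tmp) / '1255' / '90413'
    chapter.mkdir(parents=True)
    (chapter / '1255-90413-0001.flac').touch()
    (chapter / 'recording.wav').touch()
    violations = find_layout_violations(tmp)
print(violations)
```

Files flagged here would need to be renamed, moved, or converted before running the preprocessing commands.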

Test instruction using pretrained model

  • Download the trained models

| Model name | Used datasets for training | Num speakers | Model Link |
| :--- | :--- | :--- | :--- |
| ResCNN Softmax trained | LibriSpeech train-clean-360 | 921 | Click |
| ResCNN Softmax+Triplet trained | LibriSpeech all | 2484 | Click |

  • Run with pretrained model
    import random

    import numpy as np

    from audio import read_mfcc
    from batcher import sample_from_mfcc
    from constants import SAMPLE_RATE, NUM_FRAMES
    from conv_models import DeepSpeakerModel
    from test import batch_cosine_similarity

    # Reproducible results.
    np.random.seed(123)
    random.seed(123)

    # Define the model here.
    model = DeepSpeakerModel()

    # Load the checkpoint.
    # Also available on WeiYun (Chinese users).
    model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

    # Sample some inputs for WAV/FLAC files for the same speaker.
    # To have reproducible results every time you call this function,
    # set the seed every time before calling it.
    mfcc_001 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_001.wav', SAMPLE_RATE), NUM_FRAMES)
    mfcc_002 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_002.wav', SAMPLE_RATE), NUM_FRAMES)

    # Call the model to get the embeddings of shape (1, 512) for each file.
    predict_001 = model.m.predict(np.expand_dims(mfcc_001, axis=0))
    predict_002 = model.m.predict(np.expand_dims(mfcc_002, axis=0))

    # Do it again with a different speaker.
    mfcc_003 = sample_from_mfcc(read_mfcc('samples/1255-90413-0001.flac', SAMPLE_RATE), NUM_FRAMES)
    predict_003 = model.m.predict(np.expand_dims(mfcc_003, axis=0))

    # Compute the cosine similarity and check that it is higher for the same speaker.
    print('SAME SPEAKER', batch_cosine_similarity(predict_001, predict_002))  # SAME SPEAKER [0.81564593]
    print('DIFF SPEAKER', batch_cosine_similarity(predict_001, predict_003))  # DIFF SPEAKER [0.1419204]
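To turn similarity scores like these into an accept/reject decision, one simple option is a fixed threshold. The sketch below is a standalone illustration: `rowwise_cosine_similarity` and `same_speaker` are hypothetical helpers, not functions of this repository, and the threshold of 0.5 is an arbitrary value that would need tuning on held-out data.

```python
import numpy as np


def rowwise_cosine_similarity(x1, x2):
    """Cosine similarity between matching rows of two (batch, dim) arrays."""
    x1 = np.asarray(x1, dtype=np.float64)
    x2 = np.asarray(x2, dtype=np.float64)
    dot = np.sum(x1 * x2, axis=1)
    norms = np.linalg.norm(x1, axis=1) * np.linalg.norm(x2, axis=1)
    return dot / np.maximum(norms, 1e-12)


def same_speaker(similarity, threshold=0.5):
    """Accept a pair when its similarity reaches the (hypothetical) threshold."""
    return np.asarray(similarity) >= threshold


# The two scores reported in the example: same speaker, then different speaker.
decisions = same_speaker([0.81564593, 0.1419204])
print(decisions)
```

Raising the threshold reduces false acceptances at the cost of more false rejections.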

  • Commands to reproduce the test results after the training
$ export CUDA_VISIBLE_DEVICES=0; python test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-softmax/ResCNN_checkpoint_102.h5
f-measure = 0.789, true positive rate = 0.733, accuracy = 0.996, equal error rate = 0.043
$ export CUDA_VISIBLE_DEVICES=0; python test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-triplets/ResCNN_checkpoint_265.h5
f-measure = 0.849, true positive rate = 0.798, accuracy = 0.997, equal error rate = 0.025

Further work

  • LSTM model:
  • Fusion score:
