philipperemy

Description

Deep Speaker: an End-to-End Neural Speaker Embedding System.

501 Stars 176 Forks MIT License 310 Commits 5 Opened issues

Deep Speaker: An End-to-End Neural Speaker Embedding System.

Unofficial Keras implementation of Deep Speaker | Paper | Pretrained Models

Sample Results

Models were trained on clean speech data.

| Model name | Testing dataset | Num speakers | F | TPR | ACC | EER | Training Logs | Download model |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ResCNN Softmax trained | LibriSpeech all(*) | 2484 | 0.789 | 0.733 | 0.996 | 0.043 | Click | Click |
| ResCNN Softmax+Triplet trained | LibriSpeech all(*) | 2484 | 0.843 | 0.825 | 0.997 | 0.025 | Click | Click |

(*) all includes: dev-clean, dev-other, test-clean, test-other, train-clean-100, train-clean-360, train-other-500.

Overview

Deep Speaker is a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering.
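As a rough illustration of the similarity measure described above, the sketch below computes the cosine similarity of two vectors in plain Python. The three-dimensional toy vectors and the function name `cosine_similarity` are assumptions made for this example; real Deep Speaker embeddings are much longer, and the repository ships its own `batch_cosine_similarity` helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only).
anchor = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(anchor, same_direction))  # 1.0
print(cosine_similarity(anchor, orthogonal))      # 0.0
```

For vectors that already have unit length, that is, points on the hypersphere, the two norms equal 1 and the cosine similarity reduces to a plain dot product.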

Getting started

Install dependencies

Requirements

  • tensorflow>=2.0
  • keras>=2.3.1

```bash
pip install -r requirements.txt
```

If you see the error `libsndfile not found`, run `sudo apt-get install libsndfile-dev`.

Training

The code for training is available in this repository. It takes a bit less than a week with a GTX1070 to train the models.

System requirements for a complete training run:

  • At least 300GB of free disk space on a fast SSD (250GB just for all the uncompressed + processed data).
  • 32GB of memory and at least 32GB of swap (swap can be created from SSD space).
  • An NVIDIA GPU such as the 1080Ti.

```bash
pip uninstall -y tensorflow && pip install tensorflow-gpu
./deep-speaker download_librispeech    # if the download is too slow, consider replacing [wget] by [axel -n 10 -a] in download_librispeech.sh.
./deep-speaker build_mfcc              # will build MFCC for softmax pre-training and triplet training.
./deep-speaker build_model_inputs      # will build inputs for softmax pre-training.
./deep-speaker train_softmax           # takes ~3 days.
./deep-speaker train_triplet           # takes ~3 days.
```

NOTE: If you want to use your own dataset, make sure you follow the directory structure of LibriSpeech. Audio files have to be in `.flac` format. If you have `.wav` files, you can use `ffmpeg` to convert them. Both formats are lossless (FLAC is compressed WAV).
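One way to script the WAV-to-FLAC conversion mentioned above is sketched below. It assumes `ffmpeg` is installed and on the `PATH`; the helper names `build_ffmpeg_command` and `convert_directory` are made up for this example and are not part of the repository.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(wav_path):
    """Return the ffmpeg argument list that writes a .flac file next to wav_path."""
    wav_path = Path(wav_path)
    flac_path = wav_path.with_suffix('.flac')
    # ffmpeg picks the FLAC encoder from the output file extension.
    return ['ffmpeg', '-i', str(wav_path), str(flac_path)]


def convert_directory(root):
    """Convert every .wav file under root, skipping files that already have a .flac twin."""
    for wav_path in sorted(Path(root).rglob('*.wav')):
        if wav_path.with_suffix('.flac').exists():
            continue
        # check=True raises CalledProcessError if ffmpeg exits with an error.
        subprocess.run(build_ffmpeg_command(wav_path), check=True)


# Building the command does not require ffmpeg to be installed.
print(build_ffmpeg_command('speaker/chapter/utterance.wav'))
```

Calling `convert_directory` on the root of your own dataset keeps the directory layout intact, since each `.flac` file is written next to its source `.wav` file.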

Test instruction using pretrained model

  • Download the trained models

| Model name | Used datasets for training | Num speakers | Model Link |
| :--- | :--- | :--- | :--- |
| ResCNN Softmax trained | LibriSpeech train-clean-360 | 921 | Click |
| ResCNN Softmax+Triplet trained | LibriSpeech all | 2484 | Click |

  • Run with pretrained model
```python
import random

import numpy as np

from audio import read_mfcc
from batcher import sample_from_mfcc
from constants import SAMPLE_RATE, NUM_FRAMES
from conv_models import DeepSpeakerModel
from test import batch_cosine_similarity

# Reproducible results.
np.random.seed(123)
random.seed(123)

# Define the model here.
model = DeepSpeakerModel()

# Load the checkpoint.
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

# Sample some inputs for WAV/FLAC files for the same speaker.
# To have reproducible results every time you call this function, set the seed every time before calling it.
# np.random.seed(123)
# random.seed(123)
mfcc_001 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_001.wav', SAMPLE_RATE), NUM_FRAMES)
mfcc_002 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_002.wav', SAMPLE_RATE), NUM_FRAMES)

# Call the model to get the embeddings of shape (1, 512) for each file.
predict_001 = model.m.predict(np.expand_dims(mfcc_001, axis=0))
predict_002 = model.m.predict(np.expand_dims(mfcc_002, axis=0))

# Do it again with a different speaker.
mfcc_003 = sample_from_mfcc(read_mfcc('samples/1255-90413-0001.flac', SAMPLE_RATE), NUM_FRAMES)
predict_003 = model.m.predict(np.expand_dims(mfcc_003, axis=0))

# Compute the cosine similarity and check that it is higher for the same speaker.
print('SAME SPEAKER', batch_cosine_similarity(predict_001, predict_002))  # SAME SPEAKER [0.81564593]
print('DIFF SPEAKER', batch_cosine_similarity(predict_001, predict_003))  # DIFF SPEAKER [0.1419204]
```
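Once embeddings are available, speaker identification can be approached as a nearest-neighbour lookup over enrolled speakers. The sketch below is one possible way to do this and is not code from the repository: the `identify` function, the enrolled names, the two-dimensional toy vectors and the `0.5` rejection threshold are all assumptions chosen for illustration. With real model output of shape (1, 512), you would pass a single row such as `predict_001[0]`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query, enrolled, threshold=0.5):
    """Return (name, score) for the closest enrolled speaker.

    The name is None when even the best score falls below the threshold,
    which is an arbitrary value here and would need tuning on real data.
    """
    best_name, best_score = None, -1.0
    for name, reference in enrolled.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Hypothetical enrolled speakers with toy 2-dimensional embeddings.
enrolled = {'alice': [1.0, 0.0], 'bob': [0.0, 1.0]}

print(identify([0.9, 0.1], enrolled))    # closest to 'alice'
print(identify([-1.0, -1.0], enrolled))  # rejected: name is None
```

Speaker verification is the special case with a single enrolled reference: accept the claimed identity when the score reaches the threshold.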

  • Commands to reproduce the test results after the training
```bash
$ export CUDA_VISIBLE_DEVICES=0; python cli.py test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-softmax/ResCNN_checkpoint_102.h5
f-measure = 0.789, true positive rate = 0.733, accuracy = 0.996, equal error rate = 0.043
$ export CUDA_VISIBLE_DEVICES=0; python cli.py test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-triplets/ResCNN_checkpoint_265.h5
f-measure = 0.849, true positive rate = 0.798, accuracy = 0.997, equal error rate = 0.025
```
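For readers unfamiliar with the equal error rate reported above, the sketch below shows one common way to approximate it from similarity scores: sweep a decision threshold and take the point where the false acceptance rate and the false rejection rate are closest. This is an illustrative assumption, not the repository's evaluation code, which may compute the metric differently; the function name and the scores are made up.

```python
def equal_error_rate(same_scores, diff_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    best_gap, eer = float('inf'), None
    for threshold in sorted(set(same_scores) | set(diff_scores)):
        # False rejection: a same-speaker pair scored below the threshold.
        frr = sum(s < threshold for s in same_scores) / len(same_scores)
        # False acceptance: a different-speaker pair scored at or above it.
        far = sum(s >= threshold for s in diff_scores) / len(diff_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up cosine similarities for same-speaker and different-speaker pairs.
same_scores = [0.9, 0.8, 0.7, 0.3]
diff_scores = [0.1, 0.2, 0.4, 0.6]

print(equal_error_rate(same_scores, diff_scores))  # 0.25
```

A lower value means the two score distributions overlap less; perfectly separated scores give an EER of 0.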
