Speaker_Verification

Tensorflow implementation of "Generalized End-to-End Loss for Speaker Verification" (Kaggle, paperswithcode). This paper is based on the previous work "End-to-End Text-Dependent Speaker Verification".

Speaker Verification

  • Speaker verification performs a 1-1 check between an enrolled voice and a new voice. This task needs to achieve higher accuracy than speaker identification, which performs a 1-N check of a new voice against N enrolled voices.
  • There are two types of speaker verification: 1) text-dependent speaker verification (TD-SV) and 2) text-independent speaker verification (TI-SV). The former uses text-specific utterances for enrollment and verification, whereas the latter uses text-independent utterances.
  • At each forward step of the method, the utterance similarity matrix is calculated and the integrated loss is used as the objective function (see Section 2.1 of the paper); a minimal sketch of this computation is given below.

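For concreteness, here is a minimal sketch of the similarity matrix and the softmax variant of the GE2E loss, written for TensorFlow 2. The function name ge2e_softmax_loss and all parameter values are illustrative only (they are not taken from this repository), and the scale w and bias b of the similarity are learnable variables in the paper but fixed constants here.

    import tensorflow as tf

    def ge2e_softmax_loss(embeddings, n_speakers, n_utterances, w=10.0, b=-5.0):
        """Softmax variant of the GE2E loss (Section 2.1 of the paper), as a sketch.

        embeddings: [n_speakers * n_utterances, dim] L2-normalized utterance
                    embeddings, grouped so each speaker's utterances are consecutive.
        """
        e = tf.reshape(embeddings, [n_speakers, n_utterances, -1])              # [N, M, d]

        # Speaker centroids, plus "leave-one-out" centroids that exclude the
        # utterance currently being scored against its own speaker.
        centroids = tf.math.l2_normalize(tf.reduce_mean(e, axis=1), axis=-1)    # [N, d]
        sums = tf.reduce_sum(e, axis=1, keepdims=True)                          # [N, 1, d]
        loo = tf.math.l2_normalize((sums - e) / (n_utterances - 1), axis=-1)    # [N, M, d]

        # Cosine similarity of every utterance (j, i) to every centroid k.
        sim = tf.einsum('jid,kd->jik', e, centroids)                            # [N, M, N]
        own = tf.reduce_sum(e * loo, axis=-1)                                   # [N, M]
        mask = tf.eye(n_speakers)[:, None, :]                                   # [N, 1, N]
        sim = w * (sim * (1.0 - mask) + own[:, :, None] * mask) + b

        # Each utterance should be most similar to its own speaker's centroid.
        labels = tf.repeat(tf.range(n_speakers), n_utterances)                  # [N*M]
        logits = tf.reshape(sim, [n_speakers * n_utterances, n_speakers])
        return tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))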

  • : Argument parsing
  • : Extracts noise and performs STFT on raw audio. For each raw audio file, voice activity detection is performed via the librosa library (a preprocessing sketch is given after this list).
  • : Contains various utility functions for training and testing.
  • : Contains train and test functions.
  • : After the dataset is prepared, run
    python --train True --model_path where_you_want_to_save                 # training
    python --train False --model_path model_path used at training phase     # test

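As an illustration of the preprocessing step, the sketch below loads an utterance with librosa, trims silence as a simple form of voice activity detection, and converts the STFT to log-mel features. The sample rate, FFT size, hop length, and number of mel bins are placeholder values, not necessarily the ones used by this repository.

    import numpy as np
    import librosa

    SR, N_FFT, HOP, WIN, N_MELS = 8000, 512, 80, 400, 40    # placeholder settings

    def preprocess_utterance(path):
        """Load raw audio, trim silence, and return log-mel features of shape [frames, N_MELS]."""
        y, _ = librosa.load(path, sr=SR)
        y, _ = librosa.effects.trim(y, top_db=30)            # energy-based silence trimming (simple VAD)
        spec = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP, win_length=WIN)) ** 2
        mel_basis = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS)
        log_mel = np.log10(mel_basis @ spec + 1e-6)           # [N_MELS, frames]
        return log_mel.T                                      # [frames, N_MELS]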

  • Note: the authors of the paper used their own private dataset, which I could not obtain.
  • In this implementation, I used the public VCTK dataset (CSTR VCTK Corpus) and a noise-added VCTK dataset (from "Noisy speech database for training speech enhancement algorithms and TTS models").
  • The VCTK dataset includes speech data uttered by 109 native English speakers with various accents.
  • For TD-SV, I used the first audio file of each speaker, in which the speaker says "Call Stella". For each training and test utterance, I added random noise extracted from the noise-added VCTK dataset (a sketch of this augmentation is given after this list).
  • For TI-SV, I used randomly selected utterances from each speaker. The silent parts of the raw audio files are trimmed, and then slicing is performed.

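One plausible way to implement the noise extraction and augmentation mentioned above is sketched below; the helper names are hypothetical and the repository's preprocessing may differ (for example, by mixing at a target SNR).

    import numpy as np

    def extract_noise(noisy, clean):
        """Recover the additive noise track from an aligned noisy/clean pair of waveforms."""
        n = min(len(noisy), len(clean))
        return noisy[:n] - clean[:n]

    def add_random_noise(utterance, noise_bank, rng=None):
        """Add a randomly chosen, randomly offset noise segment to an utterance."""
        rng = rng or np.random.default_rng()
        noise = noise_bank[rng.integers(len(noise_bank))]
        if len(noise) < len(utterance):                       # tile short noise clips
            noise = np.tile(noise, int(np.ceil(len(utterance) / len(noise))))
        start = rng.integers(len(noise) - len(utterance) + 1)
        return utterance + noise[start:start + len(utterance)]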

I trained the model on my notebook's CPU. The model hyperparameters follow the paper:
  • 3 LSTM layers with 128 hidden nodes and 64 projection nodes (210,434 trainable variables in total)
  • SGD with learning rate 0.01 and 0.5 decay
  • L2-norm gradient clipping at 3
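
For reference, here is a TF1-style sketch of this architecture and optimizer setup. It is not the repository's actual model code: the input shape is a placeholder, learning-rate decay is omitted, and the loss is a stand-in for the GE2E loss sketched earlier.

    import numpy as np
    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()                    # TF1-style graph mode

    FRAMES, N_MELS = 180, 40                                  # placeholder input shape
    x = tf.compat.v1.placeholder(tf.float32, [None, FRAMES, N_MELS])

    # 3 LSTM layers with 128 hidden nodes and 64 projection nodes (LSTMP).
    cells = [tf.compat.v1.nn.rnn_cell.LSTMCell(num_units=128, num_proj=64) for _ in range(3)]
    outputs, _ = tf.compat.v1.nn.dynamic_rnn(
        tf.compat.v1.nn.rnn_cell.MultiRNNCell(cells), x, dtype=tf.float32)

    # Utterance embedding: L2-normalized output at the last frame.
    embedding = tf.nn.l2_normalize(outputs[:, -1, :], axis=1)

    # SGD with learning rate 0.01 and L2-norm gradient clipping at 3.
    loss = tf.reduce_mean(tf.square(embedding))               # stand-in for the GE2E loss
    opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.01)
    grads, variables = zip(*opt.compute_gradients(loss))
    grads, _ = tf.clip_by_global_norm(grads, 3.0)
    train_op = opt.apply_gradients(zip(grads, variables))

    # With 40-dimensional input features this stack has 210,432 parameters; the two
    # scalars (w, b) of the GE2E similarity bring the total to the 210,434 quoted above.
    n_params = sum(int(np.prod(v.shape.as_list())) for v in tf.compat.v1.trainable_variables())
    print(n_params)                                           # 210432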

To finish training and testing in time, I used a smaller batch (4 speakers x 5 utterances) than the paper; a sketch of how such a batch can be assembled is shown below. I used the first 85% of the dataset as the training set and the remaining part as the test set. The results below use the softmax loss (the contrastive loss is also implemented in this code). On my environment, computing the embeddings of 40 utterances takes less than 1 s.
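
A sketch of such a 4 x 5 batch is shown below (hypothetical helper; it assumes features have already been extracted per speaker, and keeps each speaker's utterances consecutive so the loss sketch above can reshape them).

    import numpy as np

    def sample_ge2e_batch(features_by_speaker, n_speakers=4, n_utterances=5, rng=None):
        """Return a [n_speakers * n_utterances, frames, n_mels] batch grouped by speaker.

        features_by_speaker: list of arrays, one per speaker, each [n_utts_i, frames, n_mels].
        """
        rng = rng or np.random.default_rng()
        chosen = rng.choice(len(features_by_speaker), size=n_speakers, replace=False)
        batch = []
        for s in chosen:
            utts = features_by_speaker[s]
            idx = rng.choice(len(utts), size=n_utterances, replace=False)
            batch.append(utts[idx])
        return np.concatenate(batch, axis=0)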

1) TD-SV
For each utterance, random noise is added at each forward step. I tested the model after 60,000 iterations. The resulting Equal Error Rate (EER) is 0, so the model performs well on this small population.

The figure below shows the similarity matrices together with their EER, FAR, and FRR. Each matrix corresponds to one speaker: if we call the first matrix A (5x4), then A[i, j] is the cosine similarity between the first speaker's i-th verification utterance and the j-th speaker's enrollment. A sketch of how these error rates can be computed from such scores follows.
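
Below is a minimal sketch of how FAR, FRR, and the EER can be computed from cosine-similarity scores. It is not the repository's exact evaluation code, and the threshold grid is arbitrary.

    import numpy as np

    def far_frr_eer(scores, is_genuine, thresholds=np.linspace(-1.0, 1.0, 201)):
        """scores: cosine similarities of verification trials; is_genuine: 1 for same-speaker trials."""
        scores, is_genuine = np.asarray(scores), np.asarray(is_genuine).astype(bool)
        fars, frrs = [], []
        for t in thresholds:
            accept = scores >= t
            fars.append(np.mean(accept[~is_genuine]))    # impostor trials accepted
            frrs.append(np.mean(~accept[is_genuine]))    # genuine trials rejected
        fars, frrs = np.array(fars), np.array(frrs)
        i = np.argmin(np.abs(fars - frrs))                # operating point where FAR ~ FRR
        return fars, frrs, (fars[i] + frrs[i]) / 2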

2) TI-SV
Randomly selected utterances are used. I tested the model after 60,000 iterations; the resulting Equal Error Rate (EER) is 0.09.


MIT License
