TensorFlow implementation of "Generalized End-to-End Loss for Speaker Verification"
TensorFlow implementation of Generalized End-to-End Loss for Speaker Verification (Kaggle, paperswithcode). This paper builds on the earlier work End-to-End Text-Dependent Speaker Verification.
python main.py --train True --model_path where_you_want_to_save                 # training
python main.py --train False --model_path model_path used at training phase     # test
I trained the model on my notebook's CPU. The model hyperparameters follow the paper:
- 3 LSTM layers with 128 hidden nodes and 64 projection nodes (210434 variables in total)
- SGD with a learning rate of 0.01 and a decay factor of 0.5
- L2-norm clipping at 3
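As a sanity check on the variable count, the arithmetic can be sketched as below. This is only an estimate under two assumptions that are not stated above: the input is a 40-dimensional log-mel feature vector, and the two extra variables are the scalar scale `w` and bias `b` of the similarity. The helper name `lstm_projection_params` is made up for this illustration.

```python
def lstm_projection_params(input_dim, hidden, proj):
    """Trainable weights in one LSTM layer with a linear projection (no peepholes)."""
    kernel = (input_dim + proj) * 4 * hidden  # four gates over [input, projected state]
    bias = 4 * hidden                         # one bias per gate unit
    projection = hidden * proj                # hidden state -> projected output
    return kernel + bias + projection


INPUT_DIM = 40  # assumption: 40-dim log-mel features (not stated in this README)
HIDDEN, PROJ, LAYERS = 128, 64, 3

total = 0
dim = INPUT_DIM
for _ in range(LAYERS):
    total += lstm_projection_params(dim, HIDDEN, PROJ)
    dim = PROJ  # the next layer consumes the projected output

total += 2  # assumption: scalar similarity scale w and bias b
print(total)  # prints 210434
```

Under these assumptions the first layer contributes 61952 variables and each of the other two contributes 74240, which matches the total listed above.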
To finish training and testing in a reasonable time, I used a smaller batch (4 speakers x 5 utterances) than the paper. I used the first 85% of the dataset as the training set and the remainder as the test set. The results below use the softmax loss (the contrastive loss is also implemented in this code). In my environment, computing the embeddings of 40 utterances takes less than 1 s.
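For readers unfamiliar with the softmax variant of the loss, here is a minimal pure-Python sketch of how it can be computed for one batch. It is not the code in this repository: the function names are hypothetical, and `w=10.0`, `b=-5.0` are assumed fixed values (in the paper they are learned parameters). Each speaker needs at least two utterances, because the utterance itself is left out of its own speaker's centroid.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """embeddings[j][i] is the embedding of utterance i of speaker j."""
    centroids = [mean(utts) for utts in embeddings]
    total = 0.0
    for j, utts in enumerate(embeddings):
        for i, e in enumerate(utts):
            scores = []
            for k, c in enumerate(centroids):
                if k == j:
                    # leave the utterance out of its own speaker's centroid
                    c = mean(utts[:i] + utts[i + 1:])
                scores.append(w * cosine(e, c) + b)
            log_sum_exp = math.log(sum(math.exp(s) for s in scores))
            total += log_sum_exp - scores[j]  # -S_own + log(sum_k exp(S_k))
    return total


# Toy 2-D embeddings, invented for illustration only (2 speakers x 2 utterances).
separated = [[[1.0, 0.0], [1.0, 0.1]], [[0.0, 1.0], [0.1, 1.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]]]
loss_separated = ge2e_softmax_loss(separated)
loss_mixed = ge2e_softmax_loss(mixed)
```

The loss is small when each utterance is closest to its own speaker's centroid (`separated`) and larger when speakers overlap (`mixed`).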
For each utterance, random noise is added at each forward step. I tested the model after 60000 iterations. The resulting Equal Error Rate (EER) is 0, which shows that the model performs well with a small population.
The figure below contains the similarity matrices and the corresponding EER, FAR, and FRR. Each matrix corresponds to one speaker. If we call the first matrix A (5x4), then A[i,j] is the cosine similarity between the first speaker's i-th verification utterance and the j-th speaker's enrollment.
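To make the relation between these matrices and the error rates concrete, here is a minimal sketch of one way to estimate FAR, FRR, and EER by sweeping a threshold. It is an illustration, not necessarily how this repository computes them: the function names, the threshold grid over [-1, 1], and the toy similarity values are all assumptions.

```python
def far_frr(sim, threshold):
    """sim[s][i][j]: similarity of speaker s's i-th verification utterance
    to speaker j's enrollment."""
    false_accepts = false_rejects = targets = impostors = 0
    for s, matrix in enumerate(sim):
        for row in matrix:
            for j, score in enumerate(row):
                if j == s:  # same speaker: should be accepted
                    targets += 1
                    false_rejects += score < threshold
                else:       # different speaker: should be rejected
                    impostors += 1
                    false_accepts += score >= threshold
    return false_accepts / impostors, false_rejects / targets


def equal_error_rate(sim, steps=200):
    """Return (approximate EER, threshold) where FAR and FRR are closest."""
    best = None
    for t in range(steps + 1):
        threshold = -1.0 + 2.0 * t / steps  # cosine similarity lies in [-1, 1]
        far, frr = far_frr(sim, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy similarities, invented for illustration (2 speakers x 2 utterances).
toy_sim = [
    [[0.9, 0.1], [0.8, 0.2]],  # speaker 0's utterances vs. enrollments 0 and 1
    [[0.2, 0.9], [0.1, 0.7]],  # speaker 1's utterances vs. enrollments 0 and 1
]
eer, eer_threshold = equal_error_rate(toy_sim)
```

With perfectly separated scores such as `toy_sim`, some threshold gives FAR = FRR = 0, so the estimated EER is 0.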
Randomly selected utterances are used. I tested the model after 60000 iterations. Here, the Equal Error Rate (EER) is 0.09.