TensorFlow implementation of "Generalized End-to-End Loss for Speaker Verification"
TensorFlow implementation of Generalized End-to-End Loss for Speaker Verification (Kaggle, paperswithcode). The paper builds on the earlier work End-to-End Text-Dependent Speaker Verification.
python main.py --train True --model_path where_you_want_to_save  # training
python main.py --train False --model_path model_path used at training phase  # test
I trained the model on my notebook CPU. The model hyperparameters follow the paper:
- 3 LSTM layers with 128 hidden nodes and 64 projection nodes (210434 variables in total)
- SGD with a learning rate of 0.01 and a decay factor of 0.5
- gradient clipping by L2 norm at 3
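The variable count above can be reproduced with simple arithmetic. The sketch below is only one plausible accounting, not taken from this repository's code: it assumes 40-dimensional input features (not stated in this README), LSTM layers with a projection and no peephole connections, and two extra scalar variables (the similarity scale `w` and offset `b` from the GE2E paper). The function names are hypothetical.

```python
def lstm_projection_params(input_dim, hidden, proj):
    """Count the variables of one LSTM layer with a projection.

    Assumes no peepholes: one kernel for the 4 gates acting on the
    concatenated [input, projected state], one bias per gate unit,
    and one projection matrix.
    """
    kernel = (input_dim + proj) * 4 * hidden
    bias = 4 * hidden
    projection = hidden * proj
    return kernel + bias + projection


def total_params(feature_dim=40, hidden=128, proj=64, layers=3):
    # feature_dim=40 is an assumption, not a value given in this README.
    total = 0
    in_dim = feature_dim
    for _ in range(layers):
        total += lstm_projection_params(in_dim, hidden, proj)
        in_dim = proj  # later layers consume the previous projection
    return total + 2  # assumed: scalar w and b of the similarity scaling


print(total_params())  # 210434
```

Under these assumptions the first layer contributes 61952 variables and each of the other two contributes 74240, which together with the two scalars matches the total quoted above.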
To finish training and testing in a reasonable time, I used a smaller batch (4 speakers x 5 utterances) than the paper. I used the first 85% of the dataset as the training set and the remainder as the test set. The results below use the softmax loss (the contrastive loss is also implemented in this code). In my environment, computing the embeddings of 40 utterances takes less than 1 s.
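For readers unfamiliar with the softmax variant of the GE2E loss, here is a minimal NumPy sketch for one batch of 4 speakers x 5 utterances. It is an illustration written from the paper's description, not this repository's TensorFlow code. The function name is hypothetical, the embedding size of 64 is chosen to match the projection size, and `w=10`, `b=-5` are assumed fixed here although the paper treats them as learnable.

```python
import numpy as np


def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """GE2E softmax loss for embeddings of shape (speakers, utterances, dim)."""
    n, m, _ = emb.shape

    def unit(x):
        # Scale vectors to unit length so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    emb = unit(emb)
    centroids = unit(emb.mean(axis=1))  # (n, dim), one centroid per speaker
    # Leave-one-out centroids: each utterance is excluded from the centroid
    # of its own speaker, as described in the paper.
    loo = unit((emb.sum(axis=1, keepdims=True) - emb) / (m - 1))  # (n, m, dim)

    cos = np.einsum("jid,kd->jik", emb, centroids)  # (n, m, n)
    idx = np.arange(n)
    cos[idx, :, idx] = np.sum(emb * loo, axis=-1)  # own-speaker entries
    sim = w * cos + b

    # Per-utterance loss: -S[j, i, j] + log(sum_k exp(S[j, i, k])),
    # computed with a numerically stable log-sum-exp.
    mx = sim.max(axis=-1, keepdims=True)
    lse = mx[..., 0] + np.log(np.exp(sim - mx).sum(axis=-1))
    target = sim[idx, :, idx]  # (n, m)
    return float(np.sum(lse - target))


rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 5, 64))  # 4 speakers x 5 utterances
print(ge2e_softmax_loss(batch))
```

The loss is close to zero when each utterance is much closer to its own speaker's centroid than to any other centroid, and it grows as speakers overlap.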
1) TD-SV
For each utterance, random noise is added at each forward step. I tested the model after 60000 iterations. The resulting Equal Error Rate (EER) is 0, which shows that the model performs well with a small population.
The figure below shows the similarity matrices together with the EER, FAR, and FRR. Each matrix corresponds to one speaker. If we call the first matrix A (5x4), then A[i,j] is the cosine similarity between the first speaker's i-th verification utterance and the j-th speaker's enrollment.
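The similarity matrix and the error rates described above can be sketched in NumPy as follows. This is a hedged illustration, not the evaluation code of this repository: the function names, the threshold grid, and the definition of EER as the point where FAR and FRR are closest are all assumptions.

```python
import numpy as np


def cosine_similarity_matrix(verif, enroll):
    """verif: (N, M, D) verification embeddings, enroll: (N, D) enrollments.

    Returns S of shape (N, M, N), where S[s] is the M x N matrix of
    speaker s and S[s, i, j] is the cosine similarity between that
    speaker's i-th verification utterance and speaker j's enrollment.
    """
    v = verif / np.linalg.norm(verif, axis=-1, keepdims=True)
    e = enroll / np.linalg.norm(enroll, axis=-1, keepdims=True)
    return np.einsum("sid,jd->sij", v, e)


def far_frr(sim, threshold):
    """False acceptance and false rejection rates at one threshold."""
    n = sim.shape[0]
    # same[s, i, j] is True when the trial compares speaker s with itself.
    same = np.broadcast_to(np.eye(n, dtype=bool)[:, None, :], sim.shape)
    accept = sim > threshold
    far = np.mean(accept[~same])  # impostor trials that were accepted
    frr = np.mean(~accept[same])  # genuine trials that were rejected
    return far, frr


def equal_error_rate(sim, thresholds=np.linspace(-1.0, 1.0, 201)):
    """Approximate EER: the threshold where FAR and FRR are closest."""
    rates = [far_frr(sim, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2


# Toy data: 4 speakers x 5 utterances with perfectly separated embeddings.
enroll = np.eye(4, 8)
verif = np.repeat(enroll[:, None, :], 5, axis=1)
S = cosine_similarity_matrix(verif, enroll)
print(S.shape, equal_error_rate(S))
```

With 4 speakers and 5 verification utterances each, `S[0]` plays the role of the 5x4 matrix A in the description above.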
2) TI-SV
Randomly selected utterances are used. I tested the model after 60000 iterations. Here, the Equal Error Rate (EER) is 0.09.
MIT License