by ExplorerFreda


License: GNU General Public License v3.0



An open-source implementation of the paper ``A Structured Self-Attentive Sentence Embedding'' published by IBM and MILA.




The word vectors are expected as a GloVe model in PyTorch model format (.pt). Typically, the model should be a tuple (dict, torch.FloatTensor, int): the first element (dict) maps each word to its index, the second element (torch.FloatTensor, of size word_count * dim) holds the word embeddings, and the third element (int) is the dimension of the word embeddings.
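As an illustration of that tuple layout, here is a minimal sketch that builds one from GloVe vectors in the usual plain-text form (one word per line followed by its space-separated values). The function names `parse_glove_lines` and `save_as_pt` are hypothetical and are not part of this repository; the text input format is an assumption about how the vectors were downloaded.

```python
def parse_glove_lines(lines):
    """Parse GloVe text lines ("word v1 v2 ... vd") into (word2idx, vectors, dim)."""
    word2idx = {}
    vectors = []
    dim = None
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)  # the first vector fixes the embedding dimension
        elif len(values) != dim:
            raise ValueError(f"inconsistent dimension for {word!r}")
        word2idx[word] = len(vectors)  # index of the row appended next
        vectors.append(values)
    return word2idx, vectors, dim


def save_as_pt(lines, out_path):
    """Save the tuple (dict, torch.FloatTensor, int) in the layout described above."""
    import torch  # imported lazily so the parser itself runs without PyTorch

    word2idx, vectors, dim = parse_glove_lines(lines)
    torch.save((word2idx, torch.FloatTensor(vectors), dim), out_path)
```

With a text file at a placeholder path such as `glove.txt`, calling `save_as_pt(open("glove.txt", encoding="utf-8"), "glove.pt")` would write a file that `torch.load` reads back as the three-element tuple.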



python --input [Yelp dataset] --output [output path, will be a json file] --dict [output dictionary path, will be a json file]

Training Model

python \
--emsize [word embedding size default 300] \
--nhid [hidden layer size, default 300] \
--nlayers [number of hidden layers in the Bi-LSTM, default 2] \
--attention-unit [attention unit number, d_a in the paper, default 350] \
--attention-hops [hop number, r in the paper, default 1] \
--dropout [dropout ratio, default 0.5] \
--nfc [hidden layer size for MLP in the classifier, default 512] \
--lr [learning rate, default 0.001] \
--epochs [epoch number for training, default 40] \
--seed [initial seed for reproduction, default 1111] \
--log-interval [the interval for reporting training loss, default 200] \
--batch-size [size of a batch in training procedure, default 32] \
--optimizer [type of the optimizer, default Adam] \
--penalization-coeff [coefficient of the Frobenius Norm penalization term, default 1.0] \
--class-number [number of classes for the final classification step] \
--save [path to save model] \
--dictionary [location of the dictionary generated by the tokenizer] \
--word-vector [location of the initial word vector, e.g. GloVe, should be a torch .pt model] \
--train-data [location of the training data, in the same format as the tokenizer output] \
--val-data [location of the development set] \
--test-data [location of the test set] \
--cuda [use the GPU for training; omit this flag to train on the CPU]
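The `--penalization-coeff` option scales the Frobenius norm penalization term. In Lin et al. (2017) that term is the squared Frobenius norm of AA^T - I, where A is the attention matrix with one row per hop (r rows) and one column per token. The sketch below computes it in plain Python for a single small matrix, only to show what the term measures. It is not this repository's code, and the function name `frobenius_penalty` is hypothetical.

```python
def frobenius_penalty(A):
    """Return ||A A^T - I||_F^2 for an attention matrix A given as r rows of weights."""
    r = len(A)
    total = 0.0
    for i in range(r):
        for j in range(r):
            dot = sum(a * b for a, b in zip(A[i], A[j]))  # entry (i, j) of A A^T
            target = 1.0 if i == j else 0.0               # entry (i, j) of the identity
            total += (dot - target) ** 2
    return total


# Two hops attending to different tokens are not penalized.
print(frobenius_penalty([[1.0, 0.0], [0.0, 1.0]]))  # 0.0
# Two hops attending to the same token are penalized.
print(frobenius_penalty([[1.0, 0.0], [1.0, 0.0]]))  # 2.0
```

The penalty grows when different hops put their weight on the same tokens, which is why it only matters when `--attention-hops` is greater than 1.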

Differences between the paper and our implementation

  1. For faster Python-based tokenization, we used spaCy instead of the Stanford Tokenizer.

  2. For faster training, we manually crop the Yelp reviews to a maximum length of 500.
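The cropping in item 2 can be pictured with the short sketch below. It assumes that the length is counted in tokens and that the beginning of each review is kept; neither detail is stated above, and the helper name `crop` is hypothetical.

```python
MAX_LEN = 500  # maximum length mentioned in item 2


def crop(tokens, max_len=MAX_LEN):
    """Keep at most the first max_len tokens of a review (assumed behaviour)."""
    return tokens[:max_len]


long_review = ["word"] * 800
print(len(crop(long_review)))  # 500
```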

Example Experimental Command and Result

We followed Lin et al. (2017) to generate the dataset, and obtained the following results:

python --train-data "data/train.json" --val-data "data/dev.json" --test-data "data/test.json" --cuda --emsize 300 --nhid 300 --nfc 300 --dropout 0.5 --attention-unit 350 --epochs 10 --lr 0.001 --clip 0.5 --dictionary "data/Yelp/data/dict.json" --word-vector "data/GloVe/" --save "models/" --batch-size 50 --class-number 5 --optimizer Adam --attention-hops 4 --penalization-coeff 1.0 --log-interval 100
# test loss (cross entropy loss, without the Frobenius norm penalization) 0.7544
# test accuracy: 0.6690
