License: GNU General Public License v3.0


An open-source implementation of the paper "A Structured Self-Attentive Sentence Embedding" (Lin et al., ICLR 2017).



An open-source implementation of the paper "A Structured Self-Attentive Sentence Embedding" published by IBM and MILA.




The GloVe model must be provided in PyTorch model format (.pt). Typically, the model is a tuple (dict, torch.FloatTensor, int): the first element (dict) maps each word to its index, the second element (torch.FloatTensor), of size word_count * dim, holds the word embeddings, and the third element (int) is the dimension of the word embeddings.
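As an illustration of the tuple layout described above, here is a minimal sketch that converts a GloVe text file (one `word v1 v2 ... vd` entry per line) into that tuple. The function names `parse_glove_text` and `save_glove_pt` and any file paths passed to them are hypothetical and not part of this repository; the repository's own conversion procedure may differ.

```python
def parse_glove_text(lines):
    """Parse GloVe text lines ("word v1 v2 ... vd") into (word2idx, rows, dim)."""
    word2idx, rows, dim = {}, [], None
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)  # the first vector fixes the embedding size
        elif len(values) != dim:
            raise ValueError(f"inconsistent vector length for {word!r}")
        if word in word2idx:
            continue  # keep the first occurrence of a repeated word
        word2idx[word] = len(rows)  # index equals the row in the embedding matrix
        rows.append(values)
    return word2idx, rows, dim


def save_glove_pt(text_path, pt_path):
    """Write the (dict, FloatTensor, int) tuple to pt_path (hypothetical helper)."""
    import torch  # imported lazily so the parser above works without PyTorch

    with open(text_path, encoding="utf-8") as f:
        word2idx, rows, dim = parse_glove_text(f)
    embeddings = torch.tensor(rows, dtype=torch.float)  # shape: word_count x dim
    torch.save((word2idx, embeddings, dim), pt_path)


# Small in-memory example of the parsing step
word2idx, rows, dim = parse_glove_text(["the 0.1 0.2", "cat 0.3 0.4"])
```

The saved file could then be passed to the training script through `--word-vector`, assuming the script loads it with `torch.load` and unpacks the three elements in the order given above.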



python --input [Yelp dataset] --output [output path, will be a json file] --dict [output dictionary path, will be a json file]

Training Model

python \
--emsize [word embedding size default 300] \
--nhid [hidden layer size, default 300] \
--nlayers [number of hidden layers in the Bi-LSTM, default 2] \
--attention-unit [attention unit number, d_a in the paper, default 350] \
--attention-hops [hop number, r in the paper, default 1] \
--dropout [dropout ratio, default 0.5] \
--nfc [hidden layer size for MLP in the classifier, default 512] \
--lr [learning rate, default 0.001] \
--epochs [epoch number for training, default 40] \
--seed [initial seed for reproduction, default 1111] \
--log-interval [the interval for reporting training loss, default 200] \
--batch-size [size of a batch in training procedure, default 32] \
--optimizer [type of the optimizer, default Adam] \
--penalization-coeff [coefficient of the Frobenius Norm penalization term, default 1.0] \
--class-number [number of classes for the final classification step] \
--save [path to save model] \
--dictionary [location of the dictionary generated by the tokenizer] \
--word-vector [location of the initial word vector, e.g. GloVe, should be a torch .pt model] \
--train-data [location of the training data, in the same format as the tokenizer output] \
--val-data [development set] \
--test-data [location of testing dataset] \
--cuda [use the GPU for training; omit this flag when training on CPU]
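To make the `--penalization-coeff` option more concrete: in Lin et al. (2017), the penalization term for an r x n attention matrix A (r hops over n tokens) is P = ||A A^T - I||_F^2, which discourages different hops from attending to the same tokens. The sketch below computes it in plain Python for a tiny matrix. The names `frobenius_penalty` and `total_loss` are hypothetical, and adding the weighted penalty to the cross entropy loss is an assumption about how the training script combines the two terms; the actual code presumably works on batched torch tensors.

```python
def frobenius_penalty(A):
    """Return ||A A^T - I||_F^2 for an attention matrix A given as a list of rows."""
    r = len(A)
    total = 0.0
    for i in range(r):
        for j in range(r):
            # (A A^T)[i][j] is the dot product of hop i and hop j
            dot = sum(a * b for a, b in zip(A[i], A[j]))
            target = 1.0 if i == j else 0.0  # entry of the identity matrix
            total += (dot - target) ** 2
    return total


def total_loss(cross_entropy, A, penalization_coeff=1.0):
    """Assumed combination: cross entropy plus the weighted penalty."""
    return cross_entropy + penalization_coeff * frobenius_penalty(A)


# Two hops focused on different tokens: A A^T equals I, so the penalty is zero
focused = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Two hops focused on the same token: the redundancy is penalized
redundant = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(frobenius_penalty(focused))    # 0.0
print(frobenius_penalty(redundant))  # 2.0
```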

Differences between the paper and our implementation

  1. For faster Python-based tokenization, we used spaCy instead of the Stanford Tokenizer.

  2. For faster performance, we manually crop the comments in Yelp to a maximum length of 500.
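The two differences above can be sketched together as a tokenize-then-crop step. This is only an illustration: the function name `tokenize_and_crop` is hypothetical, the default whitespace split is a stand-in for the spaCy tokenizer, and counting the 500 limit in tokens rather than characters is an assumption.

```python
MAX_LEN = 500  # crop length stated above; counting in tokens is an assumption


def tokenize_and_crop(text, tokenize=str.split, max_len=MAX_LEN):
    """Tokenize `text` and keep at most `max_len` tokens.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    The default whitespace split is a stand-in, not the repository's tokenizer.
    """
    tokens = list(tokenize(text))
    return tokens[:max_len]


long_review = "word " * 600
print(len(tokenize_and_crop(long_review)))  # 500
```

With spaCy installed, a tokenizer could be plugged in by creating `nlp = spacy.blank("en")` once and passing `tokenize=lambda s: [t.text for t in nlp(s)]`.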

Example Experimental Command and Result

We followed Lin et al. (2017) to generate the dataset and obtained the following result:

python --train-data "data/train.json" --val-data "data/dev.json" --test-data "data/test.json" --cuda --emsize 300 --nhid 300 --nfc 300 --dropout 0.5 --attention-unit 350 --epochs 10 --lr 0.001 --clip 0.5 --dictionary "data/Yelp/data/dict.json" --word-vector "data/GloVe/" --save "models/" --batch-size 50 --class-number 5 --optimizer Adam --attention-hops 4 --penalization-coeff 1.0 --log-interval 100
# test loss (cross entropy loss, without the Frobenius norm penalization) 0.7544
# test accuracy: 0.6690
