C++ implementation of LSTM (Long Short-Term Memory) in Kaldi's nnet1 framework. It is used for automatic speech recognition and possibly language modeling, and training can be switched between CPU and GPU (CUDA). This repo has been merged into the official Kaldi codebase (Karel's setup) and is no longer maintained; please check out the Kaldi project instead.
The implementation currently includes two versions:
* standard
* google

See each sub-directory for more details.
    40 40 5
    512 40 800 [ ...
    16624 512 1 1 0 [ ...
    16624 16624
In Google's paper, two layers of medium-sized LSTM are the best setup for beating a DNN on WER. You can build this by text-level editing:
* use some of your training data to train a one-layer LSTM nnet
* convert it into text format using nnet-copy with "--binary=false"
* insert the text of a pre-initialized LSTM component between the softmax and your pretrained LSTM; you can then feed all your training data to the stacked LSTM, e.g.:
    40 40
    512 40 800 4 [ ...
    512 512 800 4 [ ...
    16624 512 1 1 0 [ ...
    16624 16624
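The text-level edit described above amounts to splicing a block of lines into the text-format nnet just before the first component that follows the pretrained LSTM. Here is a minimal sketch of that splice, assuming the nnet has already been read into one string per line. The function name `InsertBeforeMarker` and the `marker` argument are hypothetical and not part of Kaldi; the marker is whatever text begins the line of the first component after your pretrained LSTM in your own file.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Kaldi API): inserts `block` before the first
// line of `lines` that starts with `marker`.
// Returns false and leaves `lines` untouched if no line matches.
bool InsertBeforeMarker(std::vector<std::string>& lines,
                        const std::string& marker,
                        const std::vector<std::string>& block) {
  for (std::size_t i = 0; i < lines.size(); ++i) {
    // compare() against the leading marker.size() characters of the line.
    if (lines[i].compare(0, marker.size(), marker) == 0) {
      lines.insert(lines.begin() + i, block.begin(), block.end());
      return true;
    }
  }
  return false;
}
```

The input dimension of the inserted LSTM has to match the output dimension of the pretrained one (512 in the example above), otherwise the stacked nnet will not chain correctly.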
The key is how you apply "target-delay".
* standard version: the nnet should be trained with "TimeShift", because the default nnet1 training tools (nnet-train-frame-shuf and nnet-train-perutt) don't provide target delay.
* google version: because of the complexity of multi-stream training, the training tool "nnet-train-lstm-streams" provides a "--target-delay" option. In multi-stream training, a dummy "Transmit" component is used for a trivial reason related to how nnet1 calls Backpropagate(). At test time, however, the google version is first converted to the standard version, so the "Transmit" component should also be switched to "TimeShift" during the conversion.
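To make the idea of target delay concrete, here is a minimal sketch of the pairing it implies: the network output at frame t + delay is trained against the label of frame t, so the LSTM sees a few future frames before it has to commit to a prediction. The function `PairWithTargetDelay` is hypothetical and is not the code used by the training tools. How utterance edges are handled is not covered here, so as a simplifying assumption this sketch just drops input frames that have no matching label.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration (not a Kaldi API): returns (input frame index,
// label) pairs where the input at frame t + delay is matched with the label
// of frame t. Input frames 0 .. delay-1 have no matching label and are
// skipped, which is a simplification made for this sketch.
std::vector<std::pair<std::size_t, int>> PairWithTargetDelay(
    const std::vector<int>& labels, std::size_t delay) {
  std::vector<std::pair<std::size_t, int>> pairs;
  for (std::size_t t = 0; t + delay < labels.size(); ++t) {
    pairs.emplace_back(t + delay, labels[t]);
  }
  return pairs;
}
```

With a delay of 0 every frame is paired with its own label, which is the behaviour of plain frame-level training.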
I implemented "forward-connection dropout" according to another paper from Google, but I never implemented dropout retention, so the effects of dropout are not tested at all, and I have left the code commented out.