A CRF-based ASR Toolkit
CAT provides a complete workflow for CRF-based data-efficient end-to-end speech recognition.
Deep neural networks (DNNs) of various architectures have become dominantly used in automatic speech recognition (ASR), which roughly can be classified into two approaches - the DNN-HMM hybrid and the end-to-end (E2E) approaches. DNN-HMM hybrid systems like Kaldi and RASR achieve the state-of-the-art performance in terms of recognition accuracy, usually measured by word error rate (WER) or character error rate (CER). End-to-end systems^e2e put simplicity of the training pipeline at a higher priority and usually are data-hungry. When comparing the hybrid and E2E approaches (modularity versus a single neural network, separate optimization versus joint optimization), it is worthwhile to note the pros and cons of each approach, as described in .
CAT aims at combining the advantages of the two kinds of ASR systems. CAT advocates discriminative training in the framework of conditional random field (CRF), particularly with but not limited to connectionist temporal classification (CTC) inspired state topology.
The recently developed CTC-CRF (namely CRF with CTC topology) has achieved superior benchmarking performance with training data ranging from ~100 to ~1000 hours, while being end-to-end with simplified pipeline and being data-efficient in the sense that cheaply available language models (LMs) can be leveraged effectively with or without a pronunciation lexicon.
[^e2e]: End-to-end is in the sense that flat-start training of a single DNN in one stage, without using any previously trained models, forced alignments, or building state-tying decision trees, with or without a pronunciation lexicon.
Please cite CAT using:
 Hongyu Xiang, Zhijian Ou. CRF-based Single-stage Acoustic Modeling with CTC Topology. ICASSP, 2019. pdf
 Keyu An, Hongyu Xiang. Zhijian Ou. CRF-based ASR Toolkit. arXiv, 2019. pdf (More descriptions about the toolkit implementation)
 Keyu An, Hongyu Xiang. Zhijian Ou. CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency. INTERSPEECH, 2020. pdf
CAT contains a full-fledged implementation of CTC-CRF.
CAT adopts PyTorch to build DNNs and do automatic gradient computation, and so inherits the power of PyTorch in handling DNNs.
CAT provides a complete workflow for CRF-based end-to-end speech recognition.
Evaluation results on major benchmarks such as Switchboard and Aishell show that CAT obtains the state-of-the-art results among existing end-to-end models with less parameters, and is competitive compared with the hybrid DNN-HMM models.
We add the support of streaming ASR. To this end, we propose a new method called contextualized soft forgetting (CSF), which combines soft forgetting and context-sensitive-chunk in bidirectional LSTM (BLSTM). With contextualized soft forgetting, the chunk BLSTM based CTC-CRF with a latency of 300ms outperforms the whole-utterance BLSTM based CTC-CRF. See pdf for details.
2021.07: add support of Deformable TDNN by Keyu An.
2021.07: add support of Wordpieces by Wenjie Peng.
2021.05: add support of Conformer and SpecAug by Huahuan Zheng.
CAT v2 (master branch)
For easy maintenance of experiments and enforcing the reproducibility, we re-organize the code and strongly recommend to do experiments according to the guideline.
Comparison between v2 and v1:
| Branch | Flexible configuration | Easily reproduce | Distributed training | Chunk streaming model | Recipes | | ---------- | ---------------------- | ---------------- | -------------------- | --------------------- | -------------------------------------------------- | | v1 | | | ✅ | ✅ | aishell, formosa, hkust, libri, swbd, thchs30, wsj | | v2 (master) | ✅ | ✅ | ✅ | | swbd, wsj, libri, commonvoice German |