1500 Hours 160k Words English Zamia-Speech Models Released
With the addition of the TED-LIUM 3 corpus and positive results from the auto-review process, the r20190609 release of the English Zamia-Speech models for Kaldi has been trained on the largest amount of audio material yet (over 1100 hours):
zamia_en 0:05:38
voxforge_en 102:07:05
cv_corpus_v1 252:31:11
librispeech 450:49:09
ljspeech 23:13:54
m_ailabs_en 106:28:20
tedlium3 210:13:30
Additionally, 400 hours of noise-augmented audio derived from the above corpora were used (background noise and phone codecs):
voxforge_en_noisy 22:01:40
librispeech_noisy 119:03:26
cv_corpus_v1_noisy 78:57:16
cv_corpus_v1_phone 61:38:33
zamia_en_noisy 0:02:08
voxforge_en_phone 18:02:35
librispeech_phone 106:35:33
zamia_en_phone 0:01:11
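As a sanity check on the totals quoted below, the H:MM:SS durations listed above can be summed with a few lines of Python. This is a hypothetical helper, not part of the Zamia-Speech tooling:

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS duration string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# Clean corpora (zamia_en .. tedlium3) and noise-augmented corpora
# as listed in the release notes above.
clean = ["0:05:38", "102:07:05", "252:31:11", "450:49:09",
         "23:13:54", "106:28:20", "210:13:30"]
noisy = ["22:01:40", "119:03:26", "78:57:16", "61:38:33",
         "0:02:08", "18:02:35", "106:35:33", "0:01:11"]

total_hours = sum(to_seconds(d) for d in clean + noisy) / 3600
print(f"{total_hours:.1f} hours")  # → 1551.9 hours
```

The clean corpora alone come to roughly 1145 hours, and the augmented material adds about 406 more, which is where the "over 1500 hours" figure comes from.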
In total, this release has been trained on over 1500 hours of audio material. Training took over six weeks on a GeForce GTX 1080 Ti GPU.
Stats:
%WER 10.64 exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER 8.84 exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
%WER 5.80 exp/nnet3_chain/tdnn_fl/decode_test/wer_9_0.0
The tdnn_250 model is the smallest one, meant for use in embedded applications (RPi-3 class hardware); tdnn_f is our regular model; tdnn_fl is the tdnn_f model adapted to a larger language model. The results illustrate the importance of language-model domain adaptation.
Downloads: https://github.com/gooofy/zamia-speech#asr-models