This tutorial will guide you through some basic functionalities and operations of Kaldi ASR toolkit.
(Project Kaldi is released under the Apache 2.0 license, so is this tutorial.)
At the end of the tutorial, you'll be given the first programming assignment. This assignment will test your
- familiarity with version control using Git
- understanding of the Unix shell environment (particularly bash) and scripting
- ability to read and write Python code
Kaldi will run on POSIX systems with the following software/libraries pre-installed. (If you don't know how to use a package manager on your computer to install these libraries, this tutorial might not be for you.)
- GNU build tools
- wget
- git
- (optional) sox
Also, later in this tutorial we'll write a short Python program for text processing, so please have Python installed as well.
The entire compilation can take a couple of hours and up to 8 GB of storage, depending on your system specification and configuration. Make sure you have enough resources before you start compiling.
Once you have all the required build tools, compiling Kaldi is pretty straightforward. First you need to download it from the repository.
git clone https://github.com/kaldi-asr/kaldi.git /path/you/want --depth 1
cd /path/you/want
(--depth 1: You might want to give this option to shrink the entire history of the project into a single commit, to save storage and bandwidth.)
Assuming you are in the directory where you cloned (downloaded) Kaldi, you now need to perform make in two subdirectories: tools and src.
cd tools/
make
cd ../src
./configure
make depend
make
If you need more detailed install instructions or run into trouble/errors while compiling, please check out the official documentation: tools/INSTALL, src/INSTALL.
Now all the Kaldi tools should be ready to use.
This section will cover how to prepare your data to train and test a Kaldi recognizer.
Our toy dataset for this tutorial has 60 .wav files, sampled at 8 kHz.
All audio files were recorded by an anonymous male contributor of the Kaldi project and are included in the project for testing purposes.
We put them in the waves_yesno directory, but the dataset can also be found at its original location.
In each file, the speaker says 8 words; each word is either "ken" or "lo" ("yes" and "no" in Hebrew), so each file is a random sequence of 8 yes's and no's.
The file names represent the word sequence, with 1 for ken/yes and 0 for lo/no; that is, the names will serve as the transcript for each sequence.
waves_yesno/1_0_1_1_1_0_1_0.wav
waves_yesno/0_1_1_0_0_1_1_0.wav
...
This is all we have as our raw data: audio and transcripts. Now we will transform these .wav files into a data format that Kaldi can read in.
Let's start with formatting the data. We will split the 60 wave files roughly in half: 31 for training, the rest for testing. Create a directory data and its two subdirectories train_yesno and test_yesno.
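For example, from the project root, a single command will create all three directories:
mkdir -p data/train_yesno data/test_yesno  # -p also creates the parent data/ directory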
Now we will write a Python script to generate the necessary input files. Open data_prep.py and have a look. It
- reads the list of files in waves_yesno.
- generates two lists: one with the names of files that start with 0, the other with names starting with 1, ignoring everything else.
Now, for each dataset (train, test), we need to generate the following Kaldi input files representing our data.
- text: Essentially, the transcripts of the audio files.
  - Write one utterance per line, formatted as <utt_id> <transcript>
    - e.g. 0_0_1_1_1_1_0_0 NO NO YES YES YES YES NO NO
  - We will use the filenames without extensions as utt_ids for now.
  - Note that an id needs to be a single token (no whitespace allowed inside).
  - Although the recordings are in Hebrew, we will use the English words YES and NO, just for the sake of readability.
- wav.scp: Indexes audio files by unique file ids.
  - Format: <file_id> <path to the wav file OR command that produces the wav>
    - e.g. 0_1_0_0_1_0_1_1 waves_yesno/0_1_0_0_1_0_1_1.wav
  - Again, we can use the file names as file_ids.
  - Paths can be absolute or relative. Relative paths make the code portable, while absolute paths are more robust. Remember that, when submitting code, portability is very important.
  - Note that here we have a single utterance in each wave file, so there is a one-to-one and onto mapping between utt_ids and file_ids.
- utt2spk: For each utterance, marks which speaker spoke it.
  - Format: <utt_id> <speaker_id>
    - e.g. 0_0_1_0_1_0_1_1 global
  - Since we have only one speaker in this example, let's use global as the speaker_id.
- spk2utt: Simply utt2spk inverse-indexed (<speaker_id> <all utterances by that speaker>).
  - Instead of writing Python code to re-index utterances and speakers, you can also use a Kaldi utility to do it.
    - e.g. utils/utt2spk_to_spk2utt.pl data/train_yesno/utt2spk > data/train_yesno/spk2utt
  - However, since we are writing a Python program, you might want to call the Kaldi utility from within your Python code. See subprocess or os.system().
  - Or, of course, you can write Python code to index utterances by speakers yourself.
- (optional) segments: *not used for this data*
  - Contains mappings between utterance segmentation/alignment information and recording files.
  - Only required when a file contains multiple utterances, which is not the case here.
- (optional) reco2file_and_channel: *not used for this data*
  - Only required when the audio was recorded in dual channels (e.g. a telephony conversational setup with one speaker on each side).
- (optional) spk2gender: *not used for this data*
  - Maps speakers to their gender information.
  - Used in the vocal tract length normalization step, if needed.
As mentioned, files starting with 0 compose the train set, and those starting with 1 compose the test set.
The data_prep.py skeleton includes the file-listing part and a function to generate the text file.
Now finish the code to generate each set of 4 files, using the lists of file names, in the corresponding directories (data/train_yesno, data/test_yesno).
Note that all files should be carefully sorted in a C/C++-compatible way, as Kaldi requires.
If you're calling Unix sort, don't forget to set the locale to C before sorting (LC_ALL=C sort ...) for C/C++ compatibility.
In Python, you might want to look at this document from the Python wiki.
Or you can use the Kaldi built-in fix script at your convenience after all data files are prepared. For example,
utils/fix_data_dir.sh data/train_yesno/
utils/fix_data_dir.sh data/test_yesno/
If you're done with the code, your data directory should look like this at this point.
data
├───train_yesno
│ ├───text
│ ├───utt2spk
│ ├───spk2utt
│ └───wav.scp
└───test_yesno
├───text
├───utt2spk
├───spk2utt
└───wav.scp
You can't proceed with the tutorial unless you have properly generated these files. Please finish data_prep.py to generate them.
This section will cover how to build language knowledge - lexicon and phone dictionaries - for a Kaldi recognizer.
From here, we will use several Kaldi utilities (included in steps and utils directories) to process further. To do that, Kaldi binaries should be in your $PATH.
However, Kaldi is a fairly large toolkit, and there are a number of binaries distributed over many different directories, depending on their purpose.
So, we will use the provided path.sh to add all the Kaldi directories containing binaries to $PATH of the subshell whenever a script runs (we will see this later).
All you need to do right now is to open the path.sh file and edit the $KALDI_ROOT variable to point to your Kaldi installation location, and then source that file to expand $PATH in the current shell instance.
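For example, after editing $KALDI_ROOT in path.sh, a quick sanity check might look like this (the grep pattern assumes your Kaldi checkout path contains the string "kaldi"):
source path.sh                          # expand $PATH in the current shell
echo $PATH | tr ':' '\n' | grep kaldi   # Kaldi binary directories should now show up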
Next we will build dictionaries. Let's start with creating an intermediate dict directory at the project root.
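From the project root, this is just:
mkdir -p dict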
In this toy language, we have only two words: YES and NO. For the sake of simplicity, we will just assume they are one-phone words, each pronounced in only one way, represented by the symbols Y and N.
printf "Y\nN\n" > dict/phones.txt # list of phonetic symbols
printf "YES Y\nNO N\n" > dict/lexicon.txt # word-to-pronunciation dictionaryHowever, in real speech, there are not only human sounds that contributes to a linguistic expression, but also pauses/silence and environmental noises from things.
Kaldi calls all those non-linguistic sounds "silence".
For example, even in this small, controlled recordings, we have pauses between each word.
Thus we need an additional phone "SIL" representing such silence. And it can optionally occur at the end of every word; Kaldi calls this kind of silence "optional".
echo "SIL" > dict/silence_phones.txt # list of silence symbols
echo "SIL" > dict/optional_silence.txt # list of optional silence symbols
mv dict/{phones,nonsilence_phones}.txt # list of non-silence symbols
# note that we no longer use the simple `phones.txt` list
Now amend the lexicon to include the silence as well.
cp dict/lexicon.txt dict/lexicon_words.txt # word-to-sound dictionary
echo "<SIL> SIL" >> dict/lexicon.txt # union with nonword-to-silence dictionary
# again, note that we use the `lexicon.txt` list as the union set, unlike above
Note that the token "<SIL>" will also be used as our out-of-vocabulary (unknown) token later.
Your dict directory should end up with these 5 files:
- lexicon.txt: full list of lexeme-phone pairs, including silences
- lexicon_words.txt: list of word-phone pairs (no silence)
- silence_phones.txt: list of silence phones
- nonsilence_phones.txt: list of non-silence phones
- optional_silence.txt: list of optional silence phones (here, this looks the same as silence_phones.txt)
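As a quick sanity check, the lexicon built by the commands above should contain exactly these three entries:
cat dict/lexicon.txt
# YES Y
# NO N
# <SIL> SIL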
Finally, we need to convert our dictionaries into a data structure that Kaldi accepts: a weighted finite-state transducer (WFST). Among the many scripts Kaldi provides, we will use utils/prepare_lang.sh to generate FST-ready data formats to represent our toy language.
utils/prepare_lang.sh --position-dependent-phones false $RAW_DICT_PATH $OOV $TEMP_DIR $OUTPUT_DIR
We set the --position-dependent-phones flag to false in our tiny, tiny toy language; there's not enough context anyway. For the required parameters we will use:
- $RAW_DICT_PATH: dict
- $OOV: "<SIL>", the out-of-vocabulary token
- $TEMP_DIR: could be anywhere; I'll just put a new directory tmp inside dict.
- $OUTPUT_DIR: this output will be used in further training. Set it to data/lang.
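Putting these choices together, the concrete call is:
utils/prepare_lang.sh --position-dependent-phones false dict "<SIL>" dict/tmp data/lang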
We provide a sample unigram language model for the yesno data.
You'll find an ARPA-formatted language model inside the lm directory (we'll learn more about language model formats later this semester).
However, again, the language model also needs to be converted into a WFST.
For that, Kaldi (specifically OpenFST library) also comes with a number of programs.
In this example, we will use the arpa2fst program for the conversion. We need to run
arpa2fst --disambig-symbol=#0 --read-symbol-table=$WORDS_TXT $ARPA_LM $OUTPUT_FILE
with the arguments:
- $WORDS_TXT: path to the words.txt generated by prepare_lang.sh; data/lang/words.txt
- $ARPA_LM: the language model (ARPA) file; lm/yesno-unigram.arpabo
- $OUTPUT_FILE: data/lang/G.fst (G stands for grammar)
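With those arguments filled in, the command becomes:
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt lm/yesno-unigram.arpabo data/lang/G.fst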
This section will cover how to perform MFCC feature extraction and GMM modeling.
Once we have all data ready, it's time to extract features for GMM training.
First, extract the mel-frequency cepstral coefficients.
steps/make_mfcc.sh --nj $N $INPUT_DIR $OUTPUT_DIR
- --nj $N: number of parallel jobs; defaults to 4
- $INPUT_DIR: where we put our Kaldi-formatted 'data' of the training set; data/train_yesno
- $OUTPUT_DIR: where the logs go; let's put it at exp/log/make_mfcc/train_yesno, following the Kaldi recipe convention.
Then normalize the cepstral features:
steps/compute_cmvn_stats.sh $INPUT_DIR $OUTPUT_DIR
Use the same $INPUT_DIR and $OUTPUT_DIR as above.
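Putting the two feature steps together for the training set, a minimal sketch under the directory choices above (using 1 job, since parallelism buys us little with a single speaker) would be:
steps/make_mfcc.sh --nj 1 data/train_yesno exp/log/make_mfcc/train_yesno
steps/compute_cmvn_stats.sh data/train_yesno exp/log/make_mfcc/train_yesno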
Note that these shell scripts (.sh) are all utilizing Kaldi binaries with trivial text processing on the fly. To see which commands were actually executed, see log files in <OUTPUT_DIR>. Or even better, see inside the scripts. For details on specific Kaldi commands, refer to the official documentation.
We will train a monophone model, since we assume that, in our toy language, phones are not context-dependent (which is, of course, an absurd assumption).
steps/train_mono.sh --nj $N --cmd $MAIN_CMD $DATA_DIR $LANG_DIR $OUTPUT_DIR
- --nj $N: utterances from the same speaker cannot be processed in parallel. Since we have only one speaker, we must use 1 job only.
- --cmd $MAIN_CMD: to use local machine resources, use the "utils/run.pl" pipeline.
- $DATA_DIR: path to our training data; data/train_yesno
- $LANG_DIR: path to the language definition (output from the prepare_lang script); data/lang
- $OUTPUT_DIR: like the above, let's use exp/mono.
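With the parameter choices above, the call looks like:
steps/train_mono.sh --nj 1 --cmd "utils/run.pl" data/train_yesno data/lang exp/mono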
This will generate an FST-based lattice for the acoustic model. Kaldi provides a tool to see inside the model (which may not make much sense now).
/path/to/kaldi/src/fstbin/fstcopy 'ark:gunzip -c exp/mono/fsts.1.gz|' ark,t:- | head -n 20
This will print out the first 20 lines of the lattice in a human-readable(!!) format. (Each column indicates: Q-from, Q-to, S-in, S-out, Weight)
This section will cover decoding of the model we trained.
Now we're done with acoustic model training.
For decoding, we need new input to run through our lattices of the AM & LM.
In step 1, we prepared a separate test set in data/test_yesno for this purpose.
Now it's time to project it into the feature space as well.
Use steps/make_mfcc.sh and steps/compute_cmvn_stats.sh, just as you did for the training set.
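Mirroring the training set, a sketch might be (the test log directory name below just follows the same convention and is otherwise arbitrary):
steps/make_mfcc.sh --nj 1 data/test_yesno exp/log/make_mfcc/test_yesno
steps/compute_cmvn_stats.sh data/test_yesno exp/log/make_mfcc/test_yesno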
Then, we need to build a fully connected FST network.
utils/mkgraph.sh --mono data/lang exp/mono exp/mono/graph
This will build a connected HCLG graph in the exp/mono/graph directory.
Finally, we need to find the best paths for the utterances in the test set, using the decode script. Look inside the decode script, figure out what to give as its parameters, and run it. Write the decoding results to exp/mono/decode_test_yesno.
steps/decode.sh SOME ARGUMENTS YOU NEED
This will produce lat.N.gz files in the output directory, where N goes from 1 up to the number of jobs you used (which must be 1 for this task). These files contain lattices from the utterances that were processed by the N'th thread of your decoding operation.
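If you get stuck, one plausible invocation, assuming the usual steps/decode.sh argument order of <graph-dir> <data-dir> <decode-dir> (check the script's usage message to confirm), is:
steps/decode.sh --nj 1 --cmd "utils/run.pl" exp/mono/graph data/test_yesno exp/mono/decode_test_yesno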
If you look inside the decoding script, it ends by calling the scoring script (local/score.sh), which generates hypotheses and computes the word error rate (WER) on the test set.
See the exp/mono/decode_test_yesno/wer_X files for the WERs, and the exp/mono/decode_test_yesno/scoring/X.tra files for the transcripts.
X here indicates the language model weight (LMWT) that the scoring script used at each iteration to interpret the best paths for the utterances in the lat.N.gz files as word sequences.
(Remember, N is the thread/job number from the decoding operation.)
Transcripts (.tra files) are written with word symbols, not actual words. See the data/lang/words.txt file for the word-symbol mappings.
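If you'd like to read a .tra file as actual words, Kaldi's utils/int2sym.pl helper can map the integer symbols back through words.txt; the 10 below is just an example LMWT, so use whichever X values you actually have:
utils/int2sym.pl -f 2- data/lang/words.txt exp/mono/decode_test_yesno/scoring/10.tra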
You can explicitly specify the weights using the --min_lmwt and --max_lmwt options when score.sh is called, if you want.
(Again, we'll cover what the LMWT is and what it does later in the semester.)
Or if you are interested in getting word-level alignment information for each recording file, take a look at steps/get_ctm.sh script.
- Due: 9/14/2018 23:55
- Submit via GitHub Classroom
- No late submissions accepted
- Finish data_prep.py
- Write an uber script run_yesno.sh that runs the entire pipeline, from running data_prep.py to running decode.sh, and run it.
  - If you'd like, it's okay to write smaller scripts for sub-tasks and then call them from run_yesno.sh (use any language of your choice).
  - Make sure the pipeline script runs flawlessly and generates proper transcripts. You might need to write something like reset.sh to clean up the working directory while debugging your script.
- Commit your
  - data_prep.py
  - run_yesno.sh
  - path.sh
  - any other scripts you wrote as part of run_yesno.sh, if any
  - all files in exp/mono/decode_test_yesno after running run_yesno.sh
- When ready, tag the commit as part1 and push to master.
- Modify any relevant part of your pipeline to use actual phonetic notations for these two Hebrew words, instead of the dummy Y/N phones. For the orthographic notation, use "ken" and "lo" (let's not worry about Unicode right now). This will also require editing the arpabo file.
  - Pronunciations can be found in various resources; for example, Wiktionary can be helpful.
- Figure out how to use get_ctm.sh to get alignments as well as hypotheses & WER scores, and add it to the pipeline.
- Commit
  - any changes to the pipeline and the arpa file
  - all files in exp/mono/decode_test_yesno after running the new pipeline.
- When ready, tag the commit as part2 and push to master.
- Don't forget to tag your commits. You can make as many commits as you like; however, only the two commits tagged part1 and part2 will be graded.
- Graders will use bash to run your scripts. Make sure your .sh scripts are portable and bash-compatible. A shebang line could be helpful.