Kaldi hw process

Feb 16, 2022

Download data.json (the file contains the list of sentences with links to the recorded audio):
https://roomie.pk:5000/docs/#/Sentences/SentenceController_getAllSampleSentencesByLanguageId

The following Kaldi setup was tested on Ubuntu 18 in a Docker container with Python 3.

Step 0: run make_files.py to get data.json

Step 1: download the audio dataset using audio-downloader.py

Step 2: rename the folders using folder_renamer.py

Step 3: rename the audio files in each folder using renamer.py

Step 4: normalize text (explained below)

Step 5: convert the audio with convertWebmToWav.py (explained below)

Clone the repo:

GitHub - hussainwali74/Burushaski-dataset

github.com

Properly name folders

cd into the folder, copy folder_renamer.py from scripts into audio, and run it. Make sure each folder name follows this structure: 02user_name-20160922-abj.

If a name contains anything other than this, it will cause an error. Be careful; the format is:

[i][speaker_name]-[year][month][day]-abj

i is the iterator; it can be used to set priority.
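To catch badly named folders before they cause an error, the names can be checked against the format first. The following is only a sketch, not part of folder_renamer.py: the function name is hypothetical, and the regular expression assumes that the iterator is made of digits, that the speaker name contains no hyphens, and that the date is always eight digits.

```python
import re
from datetime import datetime

# Assumed shape: [i][speaker_name]-[year][month][day]-abj
# i = one or more digits, speaker_name = no hyphens, date = 8 digits.
FOLDER_RE = re.compile(r"(?P<i>\d+)(?P<speaker>[^-\d][^-]*)-(?P<date>\d{8})-abj")


def is_valid_folder_name(name: str) -> bool:
    """Return True if name looks like 02user_name-20160922-abj."""
    match = FOLDER_RE.fullmatch(name)
    if match is None:
        return False
    try:
        # Reject impossible dates such as month 13.
        datetime.strptime(match.group("date"), "%Y%m%d")
    except ValueError:
        return False
    return True
```

Calling is_valid_folder_name on every directory name under audio gives a quick list of folders that still need renaming.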

normalize audio:

apt-get update --fix-missing 
apt-get install normalize-audio --fix-missing

The installation above must be completed before get-transcript.sh is called.

normalize text

Use cleaner.sh to remove the diamond shapes with white question marks (the Unicode replacement character, U+FFFD) from the audio/*/*.txt files.

cleaner.sh is placed at /kaldi/Burushaski/audios/cleaner.sh
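The same cleanup can be sketched in Python. This is not the contents of cleaner.sh; the function names are hypothetical, and the sketch assumes the transcripts are meant to be UTF-8 and that every U+FFFD character is unwanted.

```python
from pathlib import Path

REPLACEMENT_CHAR = "\ufffd"  # shown as a diamond with a white question mark


def strip_replacement_chars(text: str) -> str:
    """Remove every U+FFFD character from text."""
    return text.replace(REPLACEMENT_CHAR, "")


def clean_transcripts(audio_root: str) -> int:
    """Clean audio_root/*/*.txt in place and return the number of files changed."""
    changed = 0
    for txt_path in Path(audio_root).glob("*/*.txt"):
        # errors="replace" turns undecodable bytes into U+FFFD,
        # so they are stripped together with literal U+FFFD characters.
        original = txt_path.read_text(encoding="utf-8", errors="replace")
        cleaned = strip_replacement_chars(original)
        if cleaned != original:
            txt_path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Because the files are rewritten in place, it is safer to try this on a copy of the audio folder first.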

Most important part:

Dictionary words should be in upper case, e.g., in burushaskidict-plain.txt:

CCHURAM CCH U R A M
CHAP    CH A P
CHEREMBA        CCH E R E M B A
CHHANGI CHH A N G I
CXHA    CXH A
DISHUYA D I SH U Y A
DOROING D O R O I N G
DUUSUMBI        D UU S U M B I
ECHUMA  E CH U M A
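If the dictionary words are in lower case, they can be upper-cased without touching the phoneme columns. This is a minimal sketch with hypothetical function names, assuming each line is a word followed by whitespace-separated phonemes.

```python
def uppercase_lexicon_line(line: str) -> str:
    """Upper-case the word (first field) and keep the phonemes unchanged."""
    fields = line.split()
    if not fields:
        return ""
    word, phones = fields[0], fields[1:]
    return " ".join([word.upper(), *phones])


def uppercase_lexicon(lines):
    """Apply uppercase_lexicon_line to every non-empty line."""
    return [uppercase_lexicon_line(line) for line in lines if line.strip()]
```

For example, uppercase_lexicon_line("han H A N") returns "HAN H A N".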

In extracted/speaker_id/etc/PROMPTS, all the lines (only the sentence parts) should be in uppercase. Use this command in get-transcript.sh:

tr '[:lower:]' '[:upper:]'
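Note that tr upper-cases everything it receives, including any identifier at the start of the line. Where only the sentence part should change, something like the following sketch could be used instead. It assumes that the first whitespace-separated field of each line is an utterance ID and the rest is the sentence; the function name is hypothetical.

```python
def uppercase_prompt_line(line: str) -> str:
    """Upper-case the sentence but leave the leading utterance ID as it is."""
    fields = line.strip().split(maxsplit=1)
    if len(fields) < 2:
        # Only an ID (or an empty line): nothing to upper-case.
        return line.strip()
    utt_id, sentence = fields
    return f"{utt_id} {sentence.upper()}"
```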

Use convertEncoding.py to fix the "ASCII file with no line terminators" warning; it converts the file into an ISO-8859/ASCII text file.

convert audio files to the proper format:

Copy convertWebmToWav.py from scripts into the audio folder and run it. It uses ffmpeg to convert the WebM files to .wav files with proper RIFF headers.

Install ffmpeg (a package of about 550 MB) using:

apt-get update
apt-get install ffmpeg
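Once ffmpeg is installed, the conversion amounts to one ffmpeg call per file. The sketch below is not convertWebmToWav.py itself: the function names are hypothetical, and the 16 kHz mono output is an assumption, so adjust it to whatever the recipe expects.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, sample_rate: int = 16000) -> List[str]:
    """Build the ffmpeg call that writes src as a .wav file next to it."""
    dst = src.with_suffix(".wav")
    # -y overwrites existing output, -ar sets the sample rate, -ac 1 gives mono.
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(sample_rate), "-ac", "1", str(dst)]


def convert_all(audio_root: str) -> None:
    """Convert every .webm file under audio_root; stop on the first failure."""
    for src in sorted(Path(audio_root).rglob("*.webm")):
        subprocess.run(build_ffmpeg_command(src), check=True)
```

Building the command separately from running it makes it easy to print the command and check it before converting the whole dataset.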

Comment out the cmudict download code in local/voxforge_prepare_dict.sh

and put your own dictionary, burushaskidict-plain.txt, at data/local/dict/burushaskidict-plain.txt.

burushaskidict-plain.txt is our language dictionary, created from our dataset using g2p. It should look like this:

ccheremi CC H E R E M I
ggaan GG AA N
thappe TH A P E
han H A N
thuman TH U M A N
nembu N E M B U
gaganum G A G A N U M
bulle B U LL E
ka K A
tik T I K
germany J A R M A N I
ule U L E
.
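Before training, it can help to check that every dictionary line has a word followed by at least one phoneme. This is a minimal sketch with a hypothetical function name, assuming whitespace-separated fields as in the listing above.

```python
def find_malformed_entries(lines):
    """Return (line_number, text) for lines lacking a word plus phonemes."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # blank lines are ignored
        if len(fields) < 2:
            # A word with no phonemes, or a stray token such as ".".
            bad.append((number, line.rstrip("\n")))
    return bad
```

A line that holds only "." would be reported by this check, so stray lines like that can be removed from the dictionary file.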

After normalizing the audio, run clean.sh and then get-transcript.sh in egs/burushaski/s5/burushaski/.

Use the following commands to remove the question marks from the matching files in all folders and subdirectories (there was a question mark with each word in the PROMPTS files):

find ./ -type f -and -name "PROMPTS" -exec sed -i -e 's/?//g' {} \;
find ./ -type f -and -name "promps-original" -exec sed -i -e 's/?//g' {} \;

In egs/burushaski/s5, run clean.sh to remove the previously trained model and data, then run