Multi-channel-speech-extraction-using-DNN

A TensorFlow implementation of my paper "Combining beamforming and deep neural networks for multi-channel speech extraction", presented at InterNoise 2017 (not published yet). The manuscript can be found at https://github.com/zhr1201/Multi-channel-speech-extraction-using-DNN/blob/master/Manuscript-InterNoise2017-ZhouHaoran_0525.pdf

Samples

Requirements

Training procedure

CNN

1. Design your array and your forward and backward beamformers.
2. Generate raw signals using the image model (free implementations are available online) and generate samples for the nets.
   Name them index+f.wav (forward beamformer output), index+b.wav (backward beamformer output), and index+r.wav (reference signal).
3. Use data_set_gen.py to transform them into spectrograms and store them as PNG files (see the sketch after this list).
4. Use img2bin.py to transform the PNG files into binary files for the TF model.
5. Run SENN_train.py to train the model.
6. Use audio_eval.py to infer the clean spectrum from a forward beamformer wav and a backward beamformer wav using
   a trained model.
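
The repository's data_set_gen.py implements step 3; the sketch below only illustrates the idea. The STFT parameters (256-sample frames), the log-magnitude scaling, and the samples/ directory are assumptions, not the script's actual settings:

```python
import glob
import os

import imageio
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft


def wav_to_spec_png(wav_path, png_path, nperseg=256):
    """Compute a log-magnitude spectrogram and save it as an 8-bit PNG."""
    rate, samples = wavfile.read(wav_path)
    _, _, zxx = stft(samples.astype(np.float32), fs=rate, nperseg=nperseg)
    log_mag = np.log1p(np.abs(zxx))                  # compress dynamic range
    scaled = (255 * log_mag / (log_mag.max() + 1e-8)).astype(np.uint8)
    imageio.imwrite(png_path, scaled)


# Process every forward/backward/reference file, e.g. 0f.wav, 0b.wav, 0r.wav.
for wav in sorted(glob.glob('samples/*[fbr].wav')):
    wav_to_spec_png(wav, os.path.splitext(wav)[0] + '.png')
```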

LSTM

1. Design your array and your forward and backward beamformers.
2. Generate raw signals using the image model (free implementations are available online) and generate samples for the nets.
   Name them index+f.wav (forward beamformer output), index+b.wav (backward beamformer output), and index+r.wav (reference signal).
3. Dump the data into .pkl files (see the sketch after this list).
4. Run SENN_train.py to train the model.
5. Use audio_eval.py to infer the clean spectrum from a forward beamformer wav and a backward beamformer wav using
   a trained model.
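
A minimal sketch of step 3, assuming the .pkl holds a list of (forward, backward, reference) spectrogram triples; the layout SENN_train.py actually expects may differ, and the feature extraction and file names here are assumptions:

```python
import glob
import pickle

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft


def log_spec(wav_path, nperseg=256):
    """Return the log-magnitude spectrogram of a wav file as float32."""
    rate, samples = wavfile.read(wav_path)
    _, _, zxx = stft(samples.astype(np.float32), fs=rate, nperseg=nperseg)
    return np.log1p(np.abs(zxx)).astype(np.float32)


# Collect one (forward, backward, reference) spectrogram triple per index.
dataset = []
for fwd in sorted(glob.glob('samples/*f.wav')):
    stem = fwd[:-len('f.wav')]  # e.g. 'samples/0' from 'samples/0f.wav'
    dataset.append((log_spec(fwd),
                    log_spec(stem + 'b.wav'),
                    log_spec(stem + 'r.wav')))

with open('train_data.pkl', 'wb') as fh:
    pickle.dump(dataset, fh)
```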

Some other things

There are some small differences between the implementation and the paper; for example, the VAD inference loss is not considered.