CNNs:Speech-Music-Discrimination

    @article{papakostas2018speech,
      title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
      author={Papakostas, Michalis and Giannakopoulos, Theodoros},
      journal={Expert Systems with Applications},
      year={2018},
      publisher={Elsevier}
    }

Synopsis

This project describes a new approach to the very traditional problem of Speech-Music Discrimination. To the best of our knowledge, the proposed method provides state-of-the-art results on the task. We employ a deep Convolutional Neural Network (CNN) and offer a compact framework for segmentation and binary (Speech/Music) classification, exploiting the benefits of transferring knowledge from architectures pretrained on ImageNet. Our method does not rely on traditional audio features, which yield inferior results on this task. Instead, it exploits the highly invariant features produced by CNNs and operates on pseudocolored RGB or grayscale frequency images that represent audio segments.

Evaluation of different methods on 11 hours of continuous radio streams

*The dataset includes speech-only, music-only and overlapping speech-music audio samples; for further details see the paper

ROC curves of the two proposed methods, i.e. with (red) and without (blue) transfer learning, on the same dataset

Evaluation of our best method (pink) against the methods proposed by Pikrakis & Theodoridis on dataset A and dataset B

The repository documentation consists of the following sections:

Installation

Code Description

Data Preparation

  1. Convert your audio files into pseudocolored RGB or grayscale spectrogram images using generateSpectrograms.py. TO BE UPDATED: (a) how to run, (b) how to set the segmentation parameters, (c) what the output looks like. A minimal sketch of this step is shown after this list.

  2. Split the spectrogram images into train and test sets as shown in Fig1 (an illustrative directory layout follows this list):

    Fig1. - Data Structure
    • Train/Test and Classes represent directories
    • Samples represent files

    If you wish to use the architecture proposed in this work:
  3. Data should be pseudocolored RGB spectrogram images of size 227x227 as shown in Fig2

    Fig2. - Sample RGB Spectrogram

  4. or grayscale spectrogram images of size 200x200 as shown in Fig3

    Fig3. - Sample Grayscale Spectrogram

    * Image resizing can be done directly using the Caffe framework.
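
Since the usage of generateSpectrograms.py is still to be documented (step 1 above), the sketch below shows the general idea of rendering a WAV file as a pseudocolored spectrogram image. It assumes SciPy and matplotlib are installed; the function name, FFT defaults and the 'jet' colormap are illustrative assumptions, not the script's exact parameters:

    import numpy as np
    import scipy.io.wavfile as wavfile
    from scipy.signal import spectrogram
    import matplotlib
    matplotlib.use('Agg')  # render off-screen; no display needed
    import matplotlib.pyplot as plt

    def wav_to_spectrogram_image(wav_path, out_path, size_px=227, dpi=100):
        """Render a WAV file as a square pseudocolored spectrogram image."""
        rate, samples = wavfile.read(wav_path)
        if samples.ndim > 1:                # mix stereo down to mono
            samples = samples.mean(axis=1)
        _, _, sxx = spectrogram(samples, fs=rate)
        side = float(size_px) / dpi         # figure side length in inches
        fig = plt.figure(figsize=(side, side), dpi=dpi)
        ax = fig.add_axes([0, 0, 1, 1])     # fill the canvas: no ticks, labels or margins
        ax.axis('off')
        # log-power spectrogram, low frequencies at the bottom
        ax.imshow(10 * np.log10(sxx + 1e-10), origin='lower', aspect='auto', cmap='jet')
        fig.savefig(out_path, dpi=dpi)
        plt.close(fig)

    wav_to_spectrogram_image('sample.wav', 'sample_spectrogram.png')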
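
For step 2, a hypothetical train/test layout matching Fig1 could look as follows; the folder and file names are placeholders, not paths shipped with the repository:

    Data/
    ├── Train/
    │   ├── Speech/
    │   │   ├── sample_0001.png
    │   │   └── sample_0002.png
    │   └── Music/
    │       ├── sample_0003.png
    │       └── sample_0004.png
    └── Test/
        ├── Speech/
        │   └── sample_0005.png
        └── Music/
            └── sample_0006.png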

Training

  1. Train

    Training can be done either by training a new network from scratch or by fine-tuning a pretrained architecture.

    The pretrained model used in the paper for fine-tuning is caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000, initially proposed in Donahue, Jeffrey, et al. "Long-term recurrent convolutional networks for visual recognition and description." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. To exploit the weight initialization of the pretrained model, use the CNN architecture shown in SpeechMusic_RGB.prototxt.

    If you wish to deploy the smaller CNN architecture that operates on grayscale images you should use the CNN architecture shown in SpeechMusic_GRAY.prototxt. This model was trained from scratch without weight initialization.

    • Train from scratch:

      python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations>
    • Finetune a pretrained network (a filled-in example appears at the end of this section):

      python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.caffemodel --init_type fin
    • Resume Training:

      python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.solverstate --init_type res
    • For more details about modifying other learning parameters (e.g., learning rate, step size, etc.) type:

      python trainCNN.py -h 
    1. Outputs:
      1. _<snapshot_prefix>solver.prototxt — the solver file required by Caffe to train the CNN. The solver file describes all the parameters of the current experiment. Commented lines carry additional information about the experiment that is not required by the Caffe framework.
      2. _<snapshot_prefix>TrainSource.txt & _<snapshot_prefix>TestSource.txt — full paths to the training and test samples, together with each sample's class.
    • Train HMM

      python ClassifyWav.py trainHMM <path_to_test_data> <hmm_model_name> <core_classification_method> <trained_network> <classification_method> 

      *This step applies only after a CNN has been trained

      Change line 9 of trainCNN.py to caffe.set_mode_gpu() to enable GPU support
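
As a concrete illustration, a fine-tuning run with placeholder data paths, snapshot prefix and iteration count might look like this (the .caffemodel extension on the pretrained network name is also an assumption):

      python trainCNN.py SpeechMusic_RGB.prototxt ./Data/Train ./Data/Test speech_music 5000 --init caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000.caffemodel --init_type fin

Following the <trained_network>-5000.caffemodel pattern used in the Classification section below, such a run would presumably leave a snapshot like speech_music-5000.caffemodel that can then be passed to ClassifyWav.py.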

Classification

python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>.caffemodel  <classification_method> <classification_type_flag> "" 
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>-5000.caffemodel <core_classification_method> <classification_type_flag> <hmm_model_name> 

Change line 17 of ClassifyWav.py to caffe.set_mode_gpu() to enable GPU support

Code Example
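
ClassifyWav.py (above) is the repository's entry point for segmenting and classifying WAV files. For readers who want to score a single spectrogram image programmatically, a minimal pycaffe sketch could look like the following. The deploy prototxt name, the snapshot name, the output blob name 'prob' and the class order are all assumptions, not part of this repository:

    import numpy as np
    import caffe

    caffe.set_mode_cpu()  # or caffe.set_mode_gpu(), as in the notes above

    # Hypothetical deploy prototxt and snapshot names, for illustration only.
    net = caffe.Net('SpeechMusic_RGB_deploy.prototxt',
                    'speech_music-5000.caffemodel',
                    caffe.TEST)

    # Load one 227x227 pseudocolored RGB spectrogram and preprocess it the
    # standard Caffe way: HxWxC floats in [0,1] -> CxHxW BGR in [0,255].
    image = caffe.io.load_image('sample_spectrogram.png')
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))     # HWC -> CHW
    transformer.set_raw_scale('data', 255)           # [0,1] -> [0,255]
    transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR
    net.blobs['data'].data[...] = transformer.preprocess('data', image)

    probs = net.forward()['prob'][0]       # softmax output; blob name assumed
    labels = ['music', 'speech']           # assumed class order
    print(labels[int(np.argmax(probs))], probs)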

Conclusions

We provide a new method for the task of Speech/Music Discrimination using Convolutional Neural Networks. The main contributions of this work are the following:

  1. A compact framework for:

    • Segmenting and classifying long audio streams into speech and music segments.
    • Training new CNN models on binary audio tasks.
  2. A large dataset of long audio streams (more than 10 hours) for the task of speech-music discrimination. The dataset is provided in the form of spectrograms.

  3. Two different pretrained CNN architectures that can be used for weight initialization in other binary classification tasks.

  4. To our knowledge, our method provides state-of-the-art results on the task.

References & Citations

If you found our project useful, please cite the following publications:

CNNs:Speech-Music-Discrimination

    @article{papakostas2018speech,
      title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
      author={Papakostas, Michalis and Giannakopoulos, Theodoros},
      journal={Expert Systems with Applications},
      year={2018},
      publisher={Elsevier}
    }

pyAudioAnalysis

    @article{giannakopoulos2015pyaudioanalysis,
      title={pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis},
      author={Giannakopoulos, Theodoros},
      journal={PloS one},
      volume={10},
      number={12},
      year={2015},
      publisher={Public Library of Science}
    }

Caffe Framework

    @article{jia2014caffe,
      title={Caffe: Convolutional Architecture for Fast Feature Embedding},
      author={Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
      journal={arXiv preprint arXiv:1408.5093},
      year={2014}
    }

If you used the pretrained network caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 for your experiments, please also cite:

    @inproceedings{donahue2015long,
      title={Long-term recurrent convolutional networks for visual recognition and description},
      author={Donahue, Jeffrey and Anne Hendricks, Lisa and Guadarrama, Sergio and Rohrbach, Marcus and Venugopalan, Subhashini and Saenko, Kate and Darrell, Trevor},
      booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
      pages={2625--2634},
      year={2015}
    }