CNNs:Speech-Music-Discrimination

    @article{papakostas2018speech,
      title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
      author={Papakostas, Michalis and Giannakopoulos, Theodoros},
      journal={Expert Systems with Applications},
      year={2018},
      publisher={Elsevier}
    }

Synopsis

This project describes a new approach to the very traditional problem of Speech-Music Discrimination. To the best of our knowledge, the proposed method provides state-of-the-art results on the task. We employ a deep Convolutional Neural Network (CNN) and offer a compact framework for segmentation and binary (Speech/Music) classification, exploiting the benefits of transferring knowledge from architectures pretrained on ImageNet. Our method does not rely on traditional audio features, which yield inferior results on this task. Instead, it exploits the highly invariant features produced by CNNs and operates on pseudocolored RGB or grayscale frequency images that represent audio segments.

Evaluation of different methods on 11 hours of continuous radio streams

*The dataset includes speech-only, music-only and overlapping speech-music audio samples; for further details see the paper

ROC curves of the two proposed methods, i.e. with (red) and without (blue) transfer learning, on the same dataset

Evaluation of our best method (pink) against the methods proposed by Pikrakis & Theodoridis on dataset A and dataset B

The repository documentation consists of the following sections:

Installation

Code Description

Data Preparation

  1. Convert your audio files into pseudocolored RGB or grayscale spectrogram images using generateSpectrograms.py. TO BE UPDATED: (a) how to run, (b) how to set the segmentation parameters, (c) what the output looks like. A minimal sketch of this step is shown after this list.

  2. Split the spectrogram images into train and test sets as shown in Fig1 (an illustrative directory layout follows this list):

    Fig1. - Data Structure
    • Train/Test and Classes represent directories
    • Samples represent files

    If you wish to use the architecture proposed in this work:
  3. Data should be pseudocolored RGB spectrogram images of size 227x227 as shown in Fig2

    Fig2. - Sample RGB Spectrogram

  4. or grayscale spectrogram images of size 200x200 as shown in Fig3

    Fig3. - Sample Grayscale Spectrogram

    * Image resizing can be done directly using the Caffe framework.
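
Since the usage of generateSpectrograms.py is still to be documented (step 1 above), the sketch below shows the general idea of rendering a WAV file as a pseudocolored spectrogram image. It assumes SciPy and matplotlib are installed; the function name, FFT defaults and the 'jet' colormap are illustrative assumptions, not the script's exact parameters:

    import numpy as np
    import scipy.io.wavfile as wavfile
    from scipy.signal import spectrogram
    import matplotlib
    matplotlib.use('Agg')  # render off-screen; no display needed
    import matplotlib.pyplot as plt

    def wav_to_spectrogram_image(wav_path, out_path, size_px=227, dpi=100):
        """Render a WAV file as a square pseudocolored spectrogram image."""
        rate, samples = wavfile.read(wav_path)
        if samples.ndim > 1:                # mix stereo down to mono
            samples = samples.mean(axis=1)
        _, _, sxx = spectrogram(samples, fs=rate)
        side = float(size_px) / dpi         # figure side length in inches
        fig = plt.figure(figsize=(side, side), dpi=dpi)
        ax = fig.add_axes([0, 0, 1, 1])     # fill the canvas: no ticks, labels or margins
        ax.axis('off')
        # log-power spectrogram, low frequencies at the bottom
        ax.imshow(10 * np.log10(sxx + 1e-10), origin='lower', aspect='auto', cmap='jet')
        fig.savefig(out_path, dpi=dpi)
        plt.close(fig)

    wav_to_spectrogram_image('sample.wav', 'sample_spectrogram.png')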
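
For step 2, a hypothetical train/test layout matching Fig1 could look as follows; the folder and file names are placeholders, not paths shipped with the repository:

    Data/
    ├── Train/
    │   ├── Speech/
    │   │   ├── sample_0001.png
    │   │   └── sample_0002.png
    │   └── Music/
    │       ├── sample_0003.png
    │       └── sample_0004.png
    └── Test/
        ├── Speech/
        │   └── sample_0005.png
        └── Music/
            └── sample_0006.png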

Training

  1. Train

    Training can be done either by training a new network from scratch or by fine-tuning a pretrained architecture.

    The pretrained model used in the paper for fine-tuning is caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000, initially proposed in Donahue, Jeffrey, et al. "Long-term recurrent convolutional networks for visual recognition and description." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. To exploit the weight initialization of the pretrained model, use the CNN architecture shown in SpeechMusic_RGB.prototxt.

    If you wish to deploy the smaller CNN architecture that operates on grayscale images you should use the CNN architecture shown in SpeechMusic_GRAY.prototxt. This model was trained from scratch without weight initialization.

    • Train from scratch:

      python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations>
    • Finetune a pretrained network (a filled-in example appears at the end of this section):

      python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.caffemodel --init_type fin
    • Resume Training:

      python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.solverstate --init_type res
    • For more details about modifying other learning parameters (e.g., learning rate, step size, etc.) type:

      python trainCNN.py -h 
    1. Outputs:
      1. _<snapshot_prefix>solver.prototxt — the solver file required by Caffe to train the CNN. The solver file describes all the parameters of the current experiment. Commented lines carry additional information about the experiment that is not required by the Caffe framework.
      2. _<snapshot_prefix>TrainSource.txt & _<snapshot_prefix>TestSource.txt — full paths to the training and test samples, together with each sample's class.
    • Train HMM

      python ClassifyWav.py trainHMM <path_to_test_data> <hmm_model_name> <core_classification_method> <trained_network> <classification_method> 

      *This step applies only after a CNN has been trained

      Change line 9 of trainCNN.py to caffe.set_mode_gpu() to enable GPU support
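
As a concrete illustration, a fine-tuning run with placeholder data paths, snapshot prefix and iteration count might look like this (the .caffemodel extension on the pretrained network name is also an assumption):

      python trainCNN.py SpeechMusic_RGB.prototxt ./Data/Train ./Data/Test speech_music 5000 --init caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000.caffemodel --init_type fin

Following the <trained_network>-5000.caffemodel pattern used in the Classification section below, such a run would presumably leave a snapshot like speech_music-5000.caffemodel that can then be passed to ClassifyWav.py.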

Classification

python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>.caffemodel  <classification_method> <classification_type_flag> "" 
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>-5000.caffemodel <core_classification_method> <classification_type_flag> <hmm_model_name> 

Change line 17 of ClassifyWav.py to caffe.set_mode_gpu() to enable GPU support

Code Example
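
ClassifyWav.py (above) is the repository's entry point for segmenting and classifying WAV files. For readers who want to score a single spectrogram image programmatically, a minimal pycaffe sketch could look like the following. The deploy prototxt name, the snapshot name, the output blob name 'prob' and the class order are all assumptions, not part of this repository:

    import numpy as np
    import caffe

    caffe.set_mode_cpu()  # or caffe.set_mode_gpu(), as in the notes above

    # Hypothetical deploy prototxt and snapshot names, for illustration only.
    net = caffe.Net('SpeechMusic_RGB_deploy.prototxt',
                    'speech_music-5000.caffemodel',
                    caffe.TEST)

    # Load one 227x227 pseudocolored RGB spectrogram and preprocess it the
    # standard Caffe way: HxWxC floats in [0,1] -> CxHxW BGR in [0,255].
    image = caffe.io.load_image('sample_spectrogram.png')
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))     # HWC -> CHW
    transformer.set_raw_scale('data', 255)           # [0,1] -> [0,255]
    transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR
    net.blobs['data'].data[...] = transformer.preprocess('data', image)

    probs = net.forward()['prob'][0]       # softmax output; blob name assumed
    labels = ['music', 'speech']           # assumed class order
    print(labels[int(np.argmax(probs))], probs)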

Conclusions

We provide a new method for the task of Speech/Music Discrimination using Convolutional Neural Networks. The main contributions of this work are the following:

  1. A compact framework for:

    • Segmenting and classifying long audio streams into speech and music segments.
    • Training new CNN models on binary audio tasks.
  2. A large dataset of long audio streams (more than 10 hours) for the task of speech-music discrimination. The dataset is provided in the form of spectrograms.

  3. Two different pretrained CNN architectures that can be used for weight initialization in other binary classification tasks.

  4. To our knowledge, our method provides state-of-the-art results on the task.

References & Citations

If you found our project useful, please cite the following publications:

CNNs:Speech-Music-Discrimination

    @article{papakostas2018speech,
      title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
      author={Papakostas, Michalis and Giannakopoulos, Theodoros},
      journal={Expert Systems with Applications},
      year={2018},
      publisher={Elsevier}
    }

pyAudioAnalysis

    @article{giannakopoulos2015pyaudioanalysis,
      title={pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis},
      author={Giannakopoulos, Theodoros},
      journal={PloS one},
      volume={10},
      number={12},
      year={2015},
      publisher={Public Library of Science}
    }

Caffe Framework

    @article{jia2014caffe,
      title={Caffe: Convolutional Architecture for Fast Feature Embedding},
      author={Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
      journal={arXiv preprint arXiv:1408.5093},
      year={2014}
    }

If you used the pretrained network caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 for your experiments, please also cite:

    @inproceedings{donahue2015long,
      title={Long-term recurrent convolutional networks for visual recognition and description},
      author={Donahue, Jeffrey and Anne Hendricks, Lisa and Guadarrama, Sergio and Rohrbach, Marcus and Venugopalan, Subhashini and Saenko, Kate and Darrell, Trevor},
      booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
      pages={2625--2634},
      year={2015}
    }