keras-malicious-url-detector

Malicious URL detector using char-level recurrent neural networks with Keras

The purpose of this project is to study malicious url detector that does not rely on any prior knowledge about urls

The training data is from this link, can be found in "demo/data/URL.txt"

Deep Learning Models

The following deep learning models have been implemented and studied:

Usage

To run the training on Bidirectional LSTM:

cd demo
python bidirectional_lstm_train.py

Below is the code in bidirectional_lstm_train.py:

from keras_malicious_url_detector.library.bidirectional_lstm import BidirectionalLstmEmbedPredictor
from keras_malicious_url_detector.library.utility.url_data_loader import load_url_data
import numpy as np
from keras_malicious_url_detector.library.utility.text_model_extractor import extract_text_model
from keras_malicious_url_detector.library.utility.plot_utils import plot_and_save_history

def main():

    random_state = 42
    np.random.seed(random_state)

    data_dir_path = './data'
    model_dir_path = './models'
    report_dir_path = './reports'

    url_data = load_url_data(data_dir_path)

    text_model = extract_text_model(url_data['text'])

    batch_size = 64
    epochs = 30

    classifier = BidirectionalLstmEmbedPredictor()

    history = classifier.fit(text_model=text_model,
                             model_dir_path=model_dir_path,
                             url_data=url_data, batch_size=batch_size, epochs=epochs)

    plot_and_save_history(history, BidirectionalLstmEmbedPredictor.model_name,
                          report_dir_path + '/' + BidirectionalLstmEmbedPredictor.model_name + '-history.png')

if __name__ == '__main__':
    main()

After the training, the trained models are saved in the demo/models folder.

To test the trained model,run:

cd demo
python bidirectional_lstm_predict.py

Below is the code in bidrectional_lstm_predict.py:

from keras_malicious_url_detector.library.bidirectional_lstm import BidirectionalLstmEmbedPredictor
from keras_malicious_url_detector.library.utility.url_data_loader import load_url_data

def main():

    data_dir_path = './data'
    model_dir_path = './models'

    predictor = BidirectionalLstmEmbedPredictor()
    predictor.load_model(model_dir_path)

    url_data = load_url_data(data_dir_path)
    count = 0
    for url, label in zip(url_data['text'], url_data['label']):
        predicted_label = predictor.predict(url)
        print('predicted: ' + str(predicted_label) + ' actual: ' + str(label))
        count += 1
        if count > 20:
            break

if __name__ == '__main__':
    main()

Performance

Currently the bidirectional LSTM gives the best performance, with 75% - 80% accuracy after 30 to 40 epochs of training

Below is the training history (loss and accuracy) for the bidirectional LSTM:

bidirection-lstm-history

Issues