Clickbaits Revisited

This repository provides the code used for : https://www.linkedin.com/pulse/clickbaits-revisited-deep-learning-title-content-features-thakur

Data Collection

To run the code you must first collect the data:

Get facebook page parser from: https://github.com/minimaxir/facebook-page-post-scraper
Run the python script: get_fb_posts_fb_page.py for buzzfeed, upworthy, cnn, nytimes, wikinews, clickhole and StopClickBaitOfficial
Save all the CSVs obtained from above step in data/

Data Pre-Processing

After the data has been collected, you need to run the following files to obtain training and test data. The order is important!

- $ cd data_processing
- $ python create_data.py
- $ python html_scraper.py
- $ python feature_generation.py
- $ python merge_data.py
- $ python data_cleaning.py

After the steps above, you will end up with train.csv and test.csv in data/

Please note that the above steps will require a lot of memory. So, if you have anything less than 64GB, please modify the code according to your needs.

GloVe embeddings

Obtain GloVe embeddings from the following URL:

http://nlp.stanford.edu/data/glove.840B.300d.zip

Extract the zip and place the CSV in data/

Deepnets

After all the above steps, you are ready to go and play around with the deep neural networks to classify clickbaits

Change directory to deepnets/

cd deepnets/

The deepnets are as folllows:

LSTM_Title.py : LSTM on title text without GloVe embeddings
LSTM_Title_Content.py : LSTM on title text and content text without GloVe embeddings
LSTM_Title_Content_with_GloVe.py : LSTM on title and content text with GloVe emebeddings
TDD_Title_Content_with_Glove.py : Time distributed dense on title and content text with GloVe embeddings
LSTM_Title_Content_Numerical_with_GloVe.py : LSTM on title + content text with GloVe embeddings & dense net for numerical features.

Performance

The network with LSTM on title and content text with GloVe embeddings with numerical features achieves an accuracy of 0.996 during validation and 0.992 on the test set.

All models were trained on NVIDIA TitanX, Ubuntu 16.04 system with 64GB memory.