Trip Advisor crawler

This is a simple crawler script for Trip Advisor.

It is aimed at researchers and students who want to experiment with text mining problems on review data.

usage: trip-advisor-crawler.py [-h] [-f] [-r MAXRETRIES] [-t TIMEOUT] [-a {Hotel,Restaurant}] [-p PAUSE] [-m MAXREVIEWS] -o OUT ID [ID ...]

required arguments:

-o OUT, --out OUT Output base path

ID IDs for which to download reviews

ID format:

optional arguments:

-h, --help Show this help message and exit

-f, --force Force download even if the reviews were already downloaded successfully

-a {Hotel,Restaurant}, --activity {Hotel,Restaurant} Type of activity to crawl. Default: Hotel

-r MAXRETRIES, --maxretries MAXRETRIES Max retries to download a file. Default: 3

-t TIMEOUT, --timeout TIMEOUT Timeout in seconds for http connections. Default: 180

-p PAUSE, --pause PAUSE Seconds to wait between http requests. Default: 0.2

-m MAXREVIEWS, --maxreviews MAXREVIEWS Maximum number of reviews per item to download. Default: unlimited
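
For readers who want to see how the options above fit together, here is a minimal sketch of an argparse parser that would accept the same command line. It is reconstructed from the usage text only and is not necessarily how the script itself is written: the function name build_parser, the destination name ids, the argument types, and the use of None to mean "unlimited" are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="trip-advisor-crawler.py")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force download even if already downloaded")
    # Types below are assumptions; the usage text only gives the defaults.
    parser.add_argument("-r", "--maxretries", type=int, default=3,
                        help="Max retries to download a file")
    parser.add_argument("-t", "--timeout", type=float, default=180,
                        help="Timeout in seconds for http connections")
    parser.add_argument("-a", "--activity", choices=["Hotel", "Restaurant"],
                        default="Hotel", help="Type of activity to crawl")
    parser.add_argument("-p", "--pause", type=float, default=0.2,
                        help="Seconds to wait between http requests")
    # None stands in for "unlimited" here (an assumption of this sketch).
    parser.add_argument("-m", "--maxreviews", type=int, default=None,
                        help="Maximum number of reviews per item")
    parser.add_argument("-o", "--out", required=True,
                        help="Output base path")
    parser.add_argument("ids", metavar="ID", nargs="+",
                        help="IDs for which to download reviews")
    return parser


if __name__ == "__main__":
    # "ID1" and "ID2" are placeholders, not real Trip Advisor IDs.
    args = build_parser().parse_args(["-o", "reviews", "-p", "1.5", "ID1", "ID2"])
    print(args)
```

With this sketch, only -o and at least one ID are mandatory; every other option falls back to the default listed above.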