Crawlpy


Python web spider/crawler based on Scrapy. Supports POST/GET login, a configurable recursion/crawl depth, and optionally saving all crawled pages to disk.

Requirements

pip install Scrapy

Features

Roadmap

Find all planned features and their current status here: https://github.com/cytopia/crawlpy/issues/1

Usage

# stdout output
scrapy crawl crawlpy -a config=/path/to/crawlpy.config.json

# save as json (url, status, depth, referer) to 'urls.json'
scrapy crawl crawlpy --loglevel=INFO -a config=/path/to/crawlpy.config.json -o urls.json -t json

# save as csv (url, status, depth, referer) to 'urls.csv'
scrapy crawl crawlpy --loglevel=INFO -a config=/path/to/crawlpy.config.json -o urls.csv -t csv
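
With -t json, each crawled URL becomes one object in the output file. The result might look like the following (URLs and values are purely illustrative; the exact formatting is up to scrapy's feed exporter):

[
    {"url": "http://localhost/", "status": 200, "depth": 0, "referer": null},
    {"url": "http://localhost/page.php", "status": 200, "depth": 1, "referer": "http://localhost/"}
]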

Configuration

Make a copy of crawlpy.config.json-sample (e.g.: example.com-config.json) and adjust the values accordingly.

Note: The file must be valid JSON (without comments), otherwise crawlpy will throw errors while parsing it. (Use http://jsonlint.com/ to validate your config file.)
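
If you prefer validating offline, Python's built-in json.tool module also catches syntax problems and reports the line of the first error (assuming a local Python installation):

python -m json.tool example.com-config.json

The annotated template below lists all available options; remember that the inline comments are for documentation only and must not appear in your actual config file.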

{
    "proto": "http",        // 'http' or 'https'
    "domain": "localhost",  // Only the domain. e.g.: 'example.com' or 'www.example.com'
    "depth": 3,             // Nesting depth to crawl
    "ignores": [],          // Array of substrings to deny/ignore when found in URL
    "httpstatus_list": [],  // Array of http status codes to handle (default is 2xx)
    "login": {              // Login section
        "enabled": false,   // Do we actually need to do a login?
        "method": "post",   // 'post' or 'get'
        "action": "/login.php", // Where the post or get will be submitted to
        "failure": "Password is incorrect", // The string you will see on login failure
        "fields": {         // POST/GET Fields to submit to login page
            "username": "john",
            "password": "doe"
        },
        "csrf": {
            "enabled": false, // Login requires a CSRF token?
            "field": "csrf"   // Input field name that holds dynamic CSRF token
        }
    },
    "store": {              // Store section
        "enabled": false,   // save to disk?
        "path": "./data"    // path for saving (rel or abs)
    }
}
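
Stripped of the comments, a complete and valid config file (with purely illustrative values) looks like this:

{
    "proto": "http",
    "domain": "www.example.com",
    "depth": 3,
    "ignores": [],
    "httpstatus_list": [],
    "login": {
        "enabled": false,
        "method": "post",
        "action": "/login.php",
        "failure": "Password is incorrect",
        "fields": {
            "username": "john",
            "password": "doe"
        },
        "csrf": {
            "enabled": false,
            "field": "csrf"
        }
    },
    "store": {
        "enabled": false,
        "path": "./data"
    }
}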

Detailed description

proto
    Type: string
    Default: http
    Possible values: http or https
    Whether the site you want to crawl runs on http or https.

domain
    Type: string
    Default: localhost
    Possible values: domain or subdomain
    The domain or subdomain you want to spider. Nothing outside this
    domain/subdomain will be touched.

depth
    Type: integer
    Default: 3
    Possible values: 0, 1, 2, 3, ...
    0: crawl indefinitely until every subpage has been reached.
    1: only crawl links on the initial page.
    2: crawl links on the initial page and everything found on the links
       of that page.

    Note: When you do a login, the login page already counts as one level
    of depth in scrapy itself, but crawlpy internally subtracts that level
    again, so your output will not show the extra depth.

ignores
    Type: array
    Default: []
    Possible values: e.g. ['/logout.php', 'delete.php?id=1']
    Each array element is treated as a plain substring (no regex) and is
    checked against every full URL. If any of the specified substrings is
    found in a URL, that URL will not be crawled.

    Note: When you log in somewhere, it makes sense to ignore the logout
    page, as well as any pages that might delete or disable your current
    user, so you are not kicked out of your session during the crawl.

httpstatus_list
    Type: array
    Default: []
    Possible values: e.g. [403, 404, 500]
    By default scrapy ignores pages with a status code other than 2xx, so
    if you know that e.g. a 403 page contains actual content with links,
    add that status code here.

    Note: There is no need to specify 200, as scrapy crawls 2xx pages by
    default.

login (login section)

    enabled
        Type: boolean
        Default: false
        Possible values: true or false
        true: do a login prior to crawling. false: do not log in.

        Note: When enabled is false, you do not need to fill in the
        remaining variables inside the login section.

    method
        Type: string
        Default: post
        Possible values: post or get
        HTTP method used to submit the login.

    action
        Type: string
        Default: /login.php
        Possible values: login page
        Relative login page (from the base domain, including the leading
        slash) where the POST or GET will be submitted.

    failure
        Type: string
        Default: Password is incorrect
        Possible values: login-failure string
        A string that appears on the login page when the login fails.

    fields
        Type: key-value
        Default: {"username": "john", "password": "doe"}
        Possible values: POST or GET params
        POST/GET parameters required to log in, e.g. username, password
        or a hidden field name.

    csrf (login CSRF section)

        enabled
            Type: boolean
            Default: false
            Possible values: true or false
            true: the login page has a dynamic CSRF token that must be
            read out and submitted along with the normal form data.
            false: the login does not require a CSRF token.

            Note: If the login has a static (never-changing) CSRF field,
            just add it to the fields section.
            Note: Read the CSRF note below about built-in automatic CSRF
            detection and leave this off at first.

        field
            Type: string
            Default: csrf
            Possible values: field name
            The name of the input field that holds the CSRF token.

store (store section)

    enabled
        Type: boolean
        Default: false
        Possible values: true or false
        true: save crawled pages to disk. false: do not save them.

    path
        Type: string
        Default: ./data
        Possible values: path
        Absolute or relative path where crawled pages are stored.
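
Tying the login notes together: when crawling behind a login, enable the login section and ignore every URL that could end your session. An excerpt of such a config (paths and credentials are hypothetical; the remaining top-level keys are omitted) might look like this:

{
    "ignores": ["/logout.php", "/user/delete.php"],
    "login": {
        "enabled": true,
        "method": "post",
        "action": "/login.php",
        "failure": "Password is incorrect",
        "fields": {
            "username": "john",
            "password": "doe"
        },
        "csrf": {
            "enabled": false,
            "field": "csrf"
        }
    }
}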

Note about CSRF

Scrapy will most likely handle CSRF tokens automatically, so it is best to leave the custom csrf option in the config disabled. If there is a situation where the built-in CSRF recognition does not work, try the user-defined one. If neither works, drop me an issue.
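
The automatic handling mentioned above is what scrapy's FormRequest.from_response() provides: it reads the login form out of the page and keeps all of its fields, including hidden CSRF tokens, before merging in your own form data. A minimal sketch of that mechanism (this is not crawlpy's actual code; spider name, URL and credentials are hypothetical):

import scrapy


class LoginSketch(scrapy.Spider):
    name = 'login-sketch'
    start_urls = ['http://www.example.com/login.php']

    def parse(self, response):
        # from_response() pre-populates every field found in the page's
        # <form> -- including hidden CSRF token inputs -- and then merges
        # in the values passed via formdata.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'doe'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Detect a failed login via the configured failure string.
        if b'Password is incorrect' in response.body:
            self.logger.error('Login failed')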

License

MIT License

Copyright (c) 2016 cytopia