Reading long-form content on the internet is a shitty experience.
This is a web-proxy that tries to make it better.
This is a rewriting proxy: it proxies arbitrary web content, while allowing the remote content to be rewritten as driven by a set of rule-files. The goal is to effectively allow complete customization of any existing web-site via predefined rules.
Functionally, it's used for extracting just the actual content body of a site and reproducing it in a clean layout. It also rewrites all links on the page to point to internal addresses, so following a link leads to the proxied version of the page rather than the original.
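To make the link-rewriting step concrete, here is a minimal sketch of that transformation, assuming BeautifulSoup and a made-up internal route (`/render?url=`); it illustrates the idea, not this project's actual implementation:

```python
# Minimal sketch of proxy-style link rewriting. The "/render?url="
# route is a hypothetical internal address, not necessarily the one
# this project uses.
import urllib.parse

from bs4 import BeautifulSoup

PROXY_PREFIX = "/render?url="

def rewrite_links(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("a", href=True):
        # Point each link at the proxy, passing the original target
        # as a URL-encoded parameter, so navigation stays proxied.
        tag["href"] = PROXY_PREFIX + urllib.parse.quote(tag["href"], safe="")
    return str(soup)
```

For example, `rewrite_links('<a href="http://example.com/x">x</a>')` produces a link pointing at `/render?url=http%3A%2F%2Fexample.com%2Fx`.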
While the above was the original scope, the project has mutated heavily. At this point, it has a complete web spider and archives entire websites to local storage. Additionally, multiple versions of each page are kept, with an overall rolling refresh of the entire database at configurable intervals (set per-domain, or globally).
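As an illustration of what such a refresh policy can look like, here is a small sketch with a per-domain interval table and a global fallback; the names (`REFRESH_INTERVALS`, `needs_refresh`) are assumptions for the example, not the project's real settings:

```python
# Hypothetical sketch of a rolling-refresh policy with per-domain
# overrides. The real project keeps this kind of knob in settings.py;
# the structure below is illustrative only.
import datetime
import urllib.parse

REFRESH_INTERVALS = {
    "default":     datetime.timedelta(days=30),  # global interval
    "example.com": datetime.timedelta(days=7),   # per-domain override
}

def needs_refresh(url: str, last_fetched: datetime.datetime) -> bool:
    domain = urllib.parse.urlsplit(url).netloc
    interval = REFRESH_INTERVALS.get(domain, REFRESH_INTERVALS["default"])
    return datetime.datetime.now() - last_fetched > interval
```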
Quick installation overview:
- Install the `pg_trgm` and `citext` PostgreSQL extensions from the community extensions modules (see the sketch after this list).
- Copy `settings.example.py` to `settings.py`.
- Run `build-venv.sh` to build the virtualenv.
- `source flask/bin/activate`
- Run `create_db.sh` to initialize the database.
- Run `run_local.sh` to start the web interface.
- Install the RPC agent from https://github.com/fake-name/AutoTriever and start it (`python3 run.py`).
- Start the scraper: `python runScrape.py`
- Start the scheduler: `python runScrape.py scheduler`
- Run `run_agent.sh`
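For the extension step above, here is one way to enable both extensions from Python, assuming psycopg2 and a placeholder database name (use whatever `create_db.sh` and `settings.py` actually configure):

```python
# Enable the two required PostgreSQL extensions. The database name and
# user below are placeholders, not this project's real configuration.
import psycopg2

conn = psycopg2.connect(dbname="readablewebproxy", user="postgres")
conn.autocommit = True  # apply each statement immediately
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")  # trigram matching
    cur.execute("CREATE EXTENSION IF NOT EXISTS citext;")   # case-insensitive text
conn.close()
```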
The RPC agent allows multiple projects to use the RPC system simultaneously. Since the RPC system basically allows executing either predefined jobs or arbitrary code on the worker swarm, it's fairly useful in general, so I've implemented it as a service that multiple of my projects then use.
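Purely as an illustration of the idea (the queue name, message format, and the use of pika/RabbitMQ here are assumptions, not AutoTriever's actual interface), dispatching a predefined job to the worker swarm over a message queue might look like this:

```python
# Illustrative sketch of handing a job to a worker swarm via AMQP.
# Nothing here is AutoTriever's real API; it only shows the shape of
# the idea: serialize a job description and publish it to a queue.
import json

import pika

def dispatch_fetch_job(url: str) -> None:
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    chan = conn.channel()
    chan.queue_declare(queue="fetch_jobs", durable=True)
    # A job is just a serialized description of what a worker should do,
    # whether a predefined task or arbitrary code to execute.
    chan.basic_publish(
        exchange="",
        routing_key="fetch_jobs",
        body=json.dumps({"call": "fetch", "args": [url]}),
    )
    conn.close()
```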
Ubuntu dependencies: