ServerlessCrawler-VancouverRealState-master
- Misc
  - create_rew_ca.sql
- LICENSE
- ListingScraper
  - chardet
    - euctwfreq.py
    - gb2312freq.py
    - charsetgroupprober.py
    - sbcharsetprober.py
    - langhebrewmodel.py
    - codingstatemachine.py
    - big5prober.py
    - langhungarianmodel.py
    - mbcharsetprober.py
    - langgreekmodel.py
    - version.py
    - cli
      - chardetect.py
      - __init__.py
    - charsetprober.py
    - euctwprober.py
    - eucjpprober.py
    - mbcsgroupprober.py
    - escprober.py
    - langbulgarianmodel.py
    - utf8prober.py
    - euckrfreq.py
    - universaldetector.py
    - mbcssm.py
    - sbcsgroupprober.py
    - hebrewprober.py
    - sjisprober.py
    - langthaimodel.py
    - cp949prober.py
    - latin1prober.py
    - chardistribution.py
    - __init__.py
    - gb2312prober.py
    - escsm.py
    - big5freq.py
    - compat.py
    - euckrprober.py
    - enums.py
    - langcyrillicmodel.py
    - langturkishmodel.py
    - jisfreq.py
    - jpcntx.py
  - urllib3-1.22.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - chardet-3.0.4.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - entry_points.txt
    - RECORD
  - beautifulsoup4-4.6.0.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - urllib3
    - fields.py
    - request.py
    - exceptions.py
    - _collections.py
    - util
      - request.py
      - selectors.py
      - wait.py
      - response.py
      - ssl_.py
      - connection.py
      - timeout.py
      - __init__.py
      - retry.py
      - url.py
    - response.py
    - packages
      - backports
        - __init__.py
        - makefile.py
      - six.py
      - ordered_dict.py
      - __init__.py
      - ssl_match_hostname
        - _implementation.py
        - __init__.py
    - connection.py
    - __init__.py
    - poolmanager.py
    - contrib
      - appengine.py
      - pyopenssl.py
      - ntlmpool.py
      - securetransport.py
      - __init__.py
      - socks.py
      - _securetransport
        - low_level.py
        - __init__.py
        - bindings.py
    - connectionpool.py
    - filepost.py
  - scraper.py
  - bs4-0.0.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - requests-2.18.4.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - bs4
    - testing.py
    - builder
      - _html5lib.py
      - __init__.py
      - _lxml.py
      - _htmlparser.py
    - element.py
    - __init__.py
    - tests
      - test_builder_registry.py
      - test_html5lib.py
      - test_docs.py
      - test_soup.py
      - __init__.py
      - test_lxml.py
      - test_htmlparser.py
      - test_tree.py
    - diagnose.py
    - dammit.py
  - pymysql
    - util.py
    - connections.py
    - times.py
    - charset.py
    - constants
      - ER.py
      - CLIENT.py
      - CR.py
      - FLAG.py
      - COMMAND.py
      - __init__.py
      - FIELD_TYPE.py
      - SERVER_STATUS.py
    - _socketio.py
    - converters.py
    - _compat.py
    - __init__.py
    - err.py
    - tests
      - test_basic.py
      - test_converters.py
      - test_err.py
      - test_load_local.py
      - test_issues.py
      - test_nextset.py
      - test_DictCursor.py
      - test_connection.py
      - test_cursor.py
      - thirdparty
        - test_MySQLdb
          - dbapi20.py
          - test_MySQLdb_capabilities.py
          - test_MySQLdb_nonstandard.py
          - capabilities.py
          - test_MySQLdb_dbapi20.py
          - __init__.py
        - __init__.py
      - test_optionfile.py
      - __init__.py
      - test_SSCursor.py
      - base.py
    - optionfile.py
    - cursors.py
  - idna-2.6.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - BeautifulSoup-3.2.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - certifi
    - old_root.pem
    - __main__.py
    - __init__.py
    - core.py
  - utils.py
  - parser.py
  - certifi-2017.7.27.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - idna
    - intranges.py
    - package_data.py
    - uts46data.py
    - __init__.py
    - core.py
    - compat.py
    - idnadata.py
    - codec.py
  - PyMySQL-0.7.11.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - requests
    - exceptions.py
    - __version__.py
    - sessions.py
    - hooks.py
    - status_codes.py
    - packages.py
    - adapters.py
    - models.py
    - api.py
    - __init__.py
    - utils.py
    - _internal_utils.py
    - auth.py
    - help.py
    - cookies.py
    - structures.py
    - compat.py
    - certs.py
- Lambda_Setup - ZIP Packages for each function
- README.md
- Bootstrapper
  - chardet
    - euctwfreq.py
    - gb2312freq.py
    - charsetgroupprober.py
    - sbcharsetprober.py
    - langhebrewmodel.py
    - codingstatemachine.py
    - big5prober.py
    - langhungarianmodel.py
    - mbcharsetprober.py
    - langgreekmodel.py
    - version.py
    - cli
      - chardetect.py
      - __init__.py
    - charsetprober.py
    - euctwprober.py
    - eucjpprober.py
    - mbcsgroupprober.py
    - escprober.py
    - langbulgarianmodel.py
    - utf8prober.py
    - euckrfreq.py
    - universaldetector.py
    - mbcssm.py
    - sbcsgroupprober.py
    - hebrewprober.py
    - sjisprober.py
    - langthaimodel.py
    - cp949prober.py
    - latin1prober.py
    - chardistribution.py
    - __init__.py
    - gb2312prober.py
    - escsm.py
    - big5freq.py
    - compat.py
    - euckrprober.py
    - enums.py
    - langcyrillicmodel.py
    - langturkishmodel.py
    - jisfreq.py
    - jpcntx.py
  - urllib3-1.22.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - chardet-3.0.4.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - entry_points.txt
    - RECORD
  - idna-2.5.dist-info
    - METADATA
    - top_level.txt
    - pbr.json
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - beautifulsoup4-4.6.0.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - urllib3
    - fields.py
    - request.py
    - exceptions.py
    - _collections.py
    - util
      - request.py
      - selectors.py
      - wait.py
      - response.py
      - ssl_.py
      - connection.py
      - timeout.py
      - __init__.py
      - retry.py
      - url.py
    - response.py
    - packages
      - backports
        - __init__.py
        - makefile.py
      - six.py
      - ordered_dict.py
      - __init__.py
      - ssl_match_hostname
        - _implementation.py
        - __init__.py
    - connection.py
    - __init__.py
    - poolmanager.py
    - contrib
      - appengine.py
      - pyopenssl.py
      - ntlmpool.py
      - securetransport.py
      - __init__.py
      - socks.py
      - _securetransport
        - low_level.py
        - __init__.py
        - bindings.py
    - connectionpool.py
    - filepost.py
  - bs4-0.0.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - bs4
    - testing.py
    - builder
      - _html5lib.py
      - __init__.py
      - _lxml.py
      - _htmlparser.py
    - element.py
    - __init__.py
    - tests
      - test_builder_registry.py
      - test_html5lib.py
      - test_docs.py
      - test_soup.py
      - __init__.py
      - test_lxml.py
      - test_htmlparser.py
      - test_tree.py
    - diagnose.py
    - dammit.py
  - dynamo_client.py
  - requests-2.18.3.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - bootstraper.py
  - certifi
    - old_root.pem
    - __main__.py
    - __init__.py
    - core.py
  - utils.py
  - parser.py
  - certifi-2017.7.27.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - idna
    - intranges.py
    - uts46data.py
    - __init__.py
    - core.py
    - compat.py
    - idnadata.py
    - codec.py
  - requests
    - exceptions.py
    - __version__.py
    - sessions.py
    - hooks.py
    - status_codes.py
    - packages.py
    - adapters.py
    - models.py
    - api.py
    - __init__.py
    - utils.py
    - _internal_utils.py
    - auth.py
    - help.py
    - cookies.py
    - structures.py
    - compat.py
    - certs.py
- .gitignore
- SearchResultsPaginator
  - chardet
    - euctwfreq.py
    - gb2312freq.py
    - charsetgroupprober.py
    - sbcharsetprober.py
    - langhebrewmodel.py
    - codingstatemachine.py
    - big5prober.py
    - langhungarianmodel.py
    - mbcharsetprober.py
    - langgreekmodel.py
    - version.py
    - cli
      - chardetect.py
      - __init__.py
    - charsetprober.py
    - euctwprober.py
    - eucjpprober.py
    - mbcsgroupprober.py
    - escprober.py
    - langbulgarianmodel.py
    - utf8prober.py
    - euckrfreq.py
    - universaldetector.py
    - mbcssm.py
    - sbcsgroupprober.py
    - hebrewprober.py
    - sjisprober.py
    - langthaimodel.py
    - cp949prober.py
    - latin1prober.py
    - chardistribution.py
    - __init__.py
    - gb2312prober.py
    - escsm.py
    - big5freq.py
    - compat.py
    - euckrprober.py
    - enums.py
    - langcyrillicmodel.py
    - langturkishmodel.py
    - jisfreq.py
    - jpcntx.py
  - urllib3-1.22.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - chardet-3.0.4.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - entry_points.txt
    - RECORD
  - idna-2.5.dist-info
    - METADATA
    - top_level.txt
    - pbr.json
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - beautifulsoup4-4.6.0.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - urllib3
    - fields.py
    - request.py
    - exceptions.py
    - _collections.py
    - util
      - request.py
      - selectors.py
      - wait.py
      - response.py
      - ssl_.py
      - connection.py
      - timeout.py
      - __init__.py
      - retry.py
      - url.py
    - response.py
    - packages
      - backports
        - __init__.py
        - makefile.py
      - six.py
      - ordered_dict.py
      - __init__.py
      - ssl_match_hostname
        - _implementation.py
        - __init__.py
    - connection.py
    - __init__.py
    - poolmanager.py
    - contrib
      - appengine.py
      - pyopenssl.py
      - ntlmpool.py
      - securetransport.py
      - __init__.py
      - socks.py
      - _securetransport
        - low_level.py
        - __init__.py
        - bindings.py
    - connectionpool.py
    - filepost.py
  - bs4-0.0.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - bs4
    - testing.py
    - builder
      - _html5lib.py
      - __init__.py
      - _lxml.py
      - _htmlparser.py
    - element.py
    - __init__.py
    - tests
      - test_builder_registry.py
      - test_html5lib.py
      - test_docs.py
      - test_soup.py
      - __init__.py
      - test_lxml.py
      - test_htmlparser.py
      - test_tree.py
    - diagnose.py
    - dammit.py
  - dynamo_client.py
  - requests-2.18.3.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - certifi
    - old_root.pem
    - __main__.py
    - __init__.py
    - core.py
  - utils.py
  - parser.py
  - certifi-2017.7.27.1.dist-info
    - METADATA
    - top_level.txt
    - metadata.json
    - WHEEL
    - INSTALLER
    - DESCRIPTION.rst
    - RECORD
  - idna
    - intranges.py
    - uts46data.py
    - __init__.py
    - core.py
    - compat.py
    - idnadata.py
    - codec.py
  - search_result_scraper.py
  - requests
    - exceptions.py
    - __version__.py
    - sessions.py
    - hooks.py
    - status_codes.py
    - packages.py
    - adapters.py
    - models.py
    - api.py
    - __init__.py
    - utils.py
    - _internal_utils.py
    - auth.py
    - help.py
    - cookies.py
    - structures.py
    - compat.py
    - certs.py

ServerlessCrawler-Vancouver Real State

What is this project all about?

This project showcases a concept I've been playing with for a while: serverless crawlers. (If you don't know what a crawler is, feel free to visit my Crawler101 repository.) The goals and the pros and cons of this architecture are covered in my Medium post.

The goal here was to write an automatic data mining process (a crawler) that captures real estate data from Greater Vancouver Area listings. The catch? There's no actual server to maintain. Once this is set up, all you need is a trigger to start the capture, and it runs by itself, 100% on AWS, for nearly zero dollars a month.

We can leverage the Free Tier for 2 of the 4 AWS services used in the project. Only DynamoDB and RDS MySQL will cost anything, and even then you can keep a DynamoDB table running for 2 bucks a month and an RDS MySQL database for cents (by keeping it stopped while you're not using it). For more details, refer to the costs page on this project's wiki.

What do I need before I start?

An Amazon Web Services account and some Python knowledge.

What is the Tech Stack behind this?

AWS Lambda for processing the HTML pages and scraping the data
DynamoDB for caching the URLs to be captured and for triggering the Lambda functions
RDS MySQL as the end database where the processed, structured data is stored
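
To make the DynamoDB-to-Lambda hand-off more concrete, here is a minimal sketch of a Lambda handler that reads a DynamoDB Streams event and collects the URLs of newly inserted items. It is an illustration under assumptions, not the code shipped in this repository: the attribute name `url`, the function names `extract_new_urls` and `handler`, and the sample URL are all hypothetical, and the stream is assumed to be configured to include new images.

```python
"""Sketch: a Lambda handler fed by a DynamoDB stream of URLs to capture."""


def extract_new_urls(event, url_attribute="url"):
    """Return the URLs of items newly inserted into the table.

    `url_attribute` is a hypothetical attribute name; the real table
    may store the URL under a different key.
    """
    urls = []
    for record in event.get("Records", []):
        # Only react to new items, not to updates or deletions.
        if record.get("eventName") != "INSERT":
            continue
        # "NewImage" is only present when the stream view type includes
        # new images (assumed here).
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        # Stream images use DynamoDB's typed JSON, e.g. {"S": "text"}.
        value = new_image.get(url_attribute, {}).get("S")
        if value:
            urls.append(value)
    return urls


def handler(event, context):
    """Lambda entry point: collect the URLs this invocation should scrape."""
    urls = extract_new_urls(event)
    # Fetching the pages, parsing them and writing rows to MySQL would
    # happen here; those steps are left out of this sketch.
    return {"urls": urls, "count": len(urls)}


if __name__ == "__main__":
    # Hand-written sample event in the DynamoDB Streams record shape.
    sample_event = {
        "Records": [
            {
                "eventName": "INSERT",
                "dynamodb": {
                    "NewImage": {"url": {"S": "https://example.com/listing/1"}}
                },
            },
            {"eventName": "REMOVE", "dynamodb": {}},
        ]
    }
    print(handler(sample_event, None))
```

Keeping the event parsing in a plain function that takes a dictionary means it can be exercised locally with a hand-written event, without deploying anything to AWS.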

Architecture

About Me

Marcello Lins is passionate about technology and crunching data for fun. Feel free to connect with me through LinkedIn and find out more about what I'm working on via my AboutMe profile. Visit https://techflow.me/ for more awesomeness!