Distributed crawling framework for documents and structured data.
==================================================================
    The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

    -- *Funes the Memorious*, Jorge Luis Borges
.. image:: https://github.com/alephdata/memorious/workflows/memorious/badge.svg
``memorious`` is a distributed web scraping toolkit: a light-weight tool that schedules, monitors and supports scrapers that collect structured or un-structured data.
.. image:: docs/memorious-ui.png
Design
------

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.
``memorious`` handles this by managing a set of ``crawlers``, each of which can be composed of multiple ``stages``. Each ``stage`` is implemented using a Python function, which can be re-used across different ``crawlers``.
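The pattern above can be sketched in plain Python. This is a schematic illustration of composing a crawl out of re-usable stages, not the actual memorious API; the stage names, signatures, and wiring are illustrative only:

```python
# A pipeline of re-usable stages: each stage is a plain function
# that receives one item and emits results to the next stage.

def paginate(emit, pages):
    """Stage 1: walk the pages of an index, emitting one record per result."""
    for page in pages:
        for item in page:
            emit(item)

def fetch(emit, item):
    """Stage 2: pretend to download the HTML page for one result."""
    emit({**item, "html": f"<h1>{item['title']}</h1>"})

def parse(emit, item):
    """Stage 3: extract a field from the HTML and pass the record on."""
    emit({"title": item["html"][4:-5]})

def run(pages):
    """Wire the stages together; a framework would normally do this."""
    records = []
    paginate(lambda i: fetch(lambda j: parse(records.append, j), i), pages)
    return records

pages = [[{"title": "a"}, {"title": "b"}], [{"title": "c"}]]
print(run(pages))
```

A framework like memorious takes over the wiring step, so each stage can be scheduled, monitored, and shared between crawlers instead of being hard-coded into one script.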
Documentation
-------------

The basic steps of writing a Memorious crawler are covered in the documentation, available at `memorious.readthedocs.io <https://memorious.readthedocs.io/>`__. Feel free to edit the source files in the ``docs`` folder and send pull requests for improvements.
To build the documentation, run ``make html`` inside the ``docs`` folder. You'll find the resulting HTML files in ``docs/_build/html``.