memorious
=========

Distributed crawling framework for documents and structured data.
   The solitary and lucid spectator of a multiform, instantaneous and
   almost intolerably precise world.

   -- *Funes the Memorious*, Jorge Luis Borges
.. image:: https://github.com/alephdata/memorious/workflows/memorious/badge.svg
``memorious`` is a distributed web scraping toolkit. It is a light-weight tool that schedules, monitors and supports scrapers that collect structured or un-structured data. This includes the following use cases:

* Maintain an overview of a fleet of crawlers
* Schedule crawler execution in regular intervals
* Store execution information and error messages
* Distribute scraping tasks across multiple machines
* Make crawlers modular and simple tasks re-usable
.. image:: docs/memorious-ui.png
Design
------

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.
``memorious`` handles this by managing a set of ``crawlers``, each of which can be composed of multiple ``stages``. Each ``stage`` is implemented using a Python function, which can be re-used across different ``crawlers``.
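To illustrate the pattern, here is a minimal sketch of what a stage function looks like. The ``(context, data)`` signature follows the design described above, but ``ToyContext`` is a hypothetical stand-in used only so the example runs on its own; it is not the real memorious API.

```python
def parse(context, data):
    """A stage: receive a record, transform it, emit it to the next stage."""
    record = {
        "url": data["url"],
        # Normalize the title before passing it on:
        "title": data.get("title", "").strip(),
    }
    # Hand the record to whichever stage the crawler config wires up next.
    context.emit(data=record)


class ToyContext:
    """Hypothetical stand-in for the crawler context; collects emitted records."""

    def __init__(self):
        self.emitted = []

    def emit(self, data=None):
        self.emitted.append(data)


context = ToyContext()
parse(context, {"url": "https://example.com/page/1", "title": "  Hello  "})
print(context.emitted)  # -> [{'url': 'https://example.com/page/1', 'title': 'Hello'}]
```

Because each stage is just a function of ``(context, data)``, the same function can be plugged into any crawler whose configuration routes records to it.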
The basic steps of writing a Memorious crawler:

1. Make a YAML crawler configuration file
2. Add different stages
3. Write code for the stage operations (optional)
4. Test, rinse, repeat
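As a sketch of step one, a crawler configuration wires stages together in YAML. The exact keys below (stage names, ``method`` values, ``handle`` routing) are illustrative assumptions about the schema; consult the Memorious documentation for the authoritative format.

```yaml
# Hypothetical crawler configuration: each pipeline entry is a stage,
# and "handle" routes emitted records to the next stage.
name: example_crawler
description: Fetch and parse pages from an example site
schedule: weekly
pipeline:
  init:
    method: seed
    params:
      urls:
        - https://example.com/
    handle:
      pass: fetch
  fetch:
    method: fetch
    handle:
      pass: parse
  parse:
    method: parse
    handle:
      store: store
  store:
    method: directory
    params:
      path: data
```

Each stage's ``method`` can point at a built-in operation or at your own Python function, which is how the optional stage code in step three gets wired in.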
Documentation
-------------

The documentation for Memorious is available at
`memorious.readthedocs.io <https://memorious.readthedocs.io>`_. Feel free to
edit the source files in the ``docs`` folder and send pull requests for
improvements.
To build the documentation, run ``make html`` inside the ``docs`` folder.
You'll find the resulting HTML files in ``docs/_build/html``.