pyspider

by binux

binux/pyspider

A Powerful Spider(Web Crawler) System in Python.

14.4K Stars · 3.5K Forks · Last release: about 2 years ago (v0.3.10) · Apache License 2.0 · 1.2K Commits · 13 Releases



Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        crawl_config = {
        }

        # Re-run on_start once a day (24 * 60 minutes).
        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('http://scrapy.org/', callback=self.index_page)

        # Treat a fetched page as fresh for 10 days (age is given in seconds).
        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            # Follow every link whose href starts with "http".
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            # The returned dict is the result stored for this page.
            return {
                "url": response.url,
                "title": response.doc('title').text(),
            }
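To make the selector in `index_page` concrete, here is a minimal sketch of what `a[href^="http"]` keeps and what it skips. It uses only the standard library's `html.parser` as a stand-in; pyspider itself exposes a PyQuery object through `response.doc`, so this is an illustration of the filtering rule, not of pyspider's internals. The names `AbsoluteLinkCollector`, `absolute_links`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class AbsoluteLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href starts with "http".

    Hypothetical helper that mimics the CSS selector a[href^="http"].
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose href begins with "http" (covers http and https).
        if href and href.startswith("http"):
            self.links.append(href)


def absolute_links(html):
    """Return the hrefs that the sample's index_page would pass to self.crawl."""
    collector = AbsoluteLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample markup: two absolute links, one relative link, one anchor without href.
sample = (
    '<a href="http://example.com/a">A</a>'
    '<a href="/relative">B</a>'
    '<a href="https://example.com/c">C</a>'
    '<a name="x">D</a>'
)
print(absolute_links(sample))
```

Relative links such as `/relative` are not matched by the selector, so the sample handler only follows links written with a full `http` or `https` URL.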

Demo

Installation

WARNING: WebUI is open to the public by default, and it can be used to execute any command, which may harm your system. Please use it on an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).
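As a rough sketch of the second option in the warning, the snippet below writes a JSON config file that turns on `need-auth` for the WebUI. The `webui` section with `username`, `password`, and `need-auth` keys follows the layout described on the linked Command-Line page; the helper names, the file name `config.json`, and the placeholder credentials are assumptions made for illustration, so check the linked documentation before relying on them.

```python
import json


def build_webui_auth_config(username, password):
    """Return a config dict that enables WebUI authentication.

    Hypothetical helper; the key names are taken from the pyspider
    Command-Line documentation and should be verified against it.
    """
    return {
        "webui": {
            "username": username,
            "password": password,
            "need-auth": True,
        }
    }


def write_config(path, config):
    """Serialize the config dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)


if __name__ == "__main__":
    # Placeholder credentials; replace them with your own.
    write_config("config.json", build_webui_auth_config("some_name", "some_passwd"))
```

The resulting file would then be passed to pyspider with its `--config` option, for example `pyspider --config config.json`.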

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • [ ] a visual scraping interface like Portia

License

Licensed under the Apache License, Version 2.0
