PyCrawler

by theanti9

A python web crawler

Setup

  • Open settings.py and adjust database settings
  • DATABASE_ENGINE can either be "mysql" or "sqlite"
  • For sqlite, only DATABASE_HOST is used, and it should begin with a '/'
  • All other DATABASE_* settings are required for mysql
  • DEBUG mode makes the crawler output stats generated as it runs, along with other debug messages
  • LOGGING is a dictConfig dictionary that logs output to the console and to a rotating file. It works out of the box but can be modified
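
As a rough illustration of the settings described above, a sqlite configuration might look like the sketch below. Only DATABASE_ENGINE, DATABASE_HOST and DEBUG are named in this README. The file path is a placeholder, and the other DATABASE_* names shown in the comments are assumptions, so check settings.py for the real ones.

```python
# Hypothetical excerpt of settings.py for a sqlite setup.
# The path below is a placeholder, not a project default.

DATABASE_ENGINE = "sqlite"            # either "mysql" or "sqlite"
DATABASE_HOST = "/tmp/pycrawler.db"   # for sqlite: begins with '/'

# For mysql, the remaining DATABASE_* settings are also required.
# The names below are assumed for illustration only:
# DATABASE_NAME = "pycrawler"
# DATABASE_USER = "crawler"
# DATABASE_PASS = "change-me"

DEBUG = True  # print running stats and other debug messages
```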

Current State

  • The mysql engine is untested
  • In some situations the database is locked and queries cannot execute. This is presumably an issue only with sqlite's file-based approach
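
One common way to soften "database is locked" errors with sqlite is to open the connection with a longer busy timeout and retry the statement a few times. The sketch below is not taken from PyCrawler. The helper name execute_with_retry, the table, and the retry counts are all hypothetical, and it only illustrates the general idea using Python's standard sqlite3 module.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    """Run one statement, retrying with backoff while sqlite reports a lock."""
    for attempt in range(retries):
        try:
            with conn:  # commits on success, rolls back on error
                return conn.execute(sql, params).fetchall()
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock, or the final failed attempt.
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))


# timeout=30 makes sqlite wait up to 30 seconds for a lock before failing.
conn = sqlite3.connect(":memory:", timeout=30)
execute_with_retry(conn, "CREATE TABLE pages (url TEXT)")
execute_with_retry(conn, "INSERT INTO pages (url) VALUES (?)", ("http://example.com/",))
rows = execute_with_retry(conn, "SELECT url FROM pages")
```

Retrying hides short-lived contention but does not remove it. Many writers sharing one sqlite file will still serialize on the file lock.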

Logging

  • DEBUG+ level messages are logged to the console, and INFO+ level messages are logged to a file.
  • By default, the file for logging uses a TimedRotatingFileHandler that rolls over at midnight
  • Setting DEBUG in the settings toggles whether or not DEBUG level messages are output at all
  • Setting USE_COLORS in the settings toggles whether or not messages output to the console use colors depending on the level.
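
The behaviour listed above can be approximated with a dictConfig dictionary like the one sketched here. This is an illustrative reconstruction, not the project's actual LOGGING value. The build_logging helper, the formatter, the logger name and the log file location are assumptions, and USE_COLORS is left out because its implementation is not described here.

```python
import logging
import logging.config
import os
import tempfile


def build_logging(debug, log_path):
    """Return a dictConfig dict: console at DEBUG+ (if debug), file at INFO+."""
    level = "DEBUG" if debug else "INFO"
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
        },
        "handlers": {
            "console": {
                "class": "logging.StreamHandler",
                "level": level,
                "formatter": "plain",
            },
            "file": {
                # Rolls the log file over at midnight.
                "class": "logging.handlers.TimedRotatingFileHandler",
                "level": "INFO",
                "formatter": "plain",
                "filename": log_path,
                "when": "midnight",
            },
        },
        "root": {"level": level, "handlers": ["console", "file"]},
    }


# Placeholder log location in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
LOGGING = build_logging(debug=True, log_path=log_path)
logging.config.dictConfig(LOGGING)

log = logging.getLogger("crawler")
log.debug("shown on the console only")
log.info("shown on the console and written to the file")
```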

Misc

  • Designed to run on multiple machines that work together to collect info in a central DB
  • Queues links into the database to be crawled, so any machine running the crawler against the central DB can pull from the same queue. This reduces redundant crawling.
  • Thread pool approach to analyzing keywords in text.
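
To make the shared-queue idea concrete, here is a minimal sketch of a database-backed link queue using Python's sqlite3 module. It is not PyCrawler's schema or code. The table layout, the function names (open_queue, enqueue, claim_next) and the worker labels are hypothetical. The point is that a link is stored once and claimed inside a transaction, so two crawlers do not take the same URL.

```python
import sqlite3


def open_queue(path):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are started explicitly with BEGIN below.
    conn = sqlite3.connect(path, timeout=30, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL, claimed_by TEXT)"
    )
    return conn


def enqueue(conn, url):
    # The UNIQUE constraint plus OR IGNORE drops links already queued.
    conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))


def claim_next(conn, worker):
    """Atomically mark the oldest unclaimed link as taken and return it."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, url FROM queue WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE queue SET claimed_by = ? WHERE id = ?", (worker, row[0]))
        conn.execute("COMMIT")
        return row[1] if row is not None else None
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = open_queue(":memory:")
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    enqueue(conn, url)

first = claim_next(conn, "machine-1")
second = claim_next(conn, "machine-2")
third = claim_next(conn, "machine-3")  # queue is empty by now
```

With a server database such as mysql the same pattern applies, but the locking statement would differ.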