PyCrawler

Description

A Python web crawler


Setup

  • Open settings.py and adjust the database settings (a sketch of the file follows this list)
  • DATABASE_ENGINE can be either "mysql" or "sqlite"
  • For sqlite, only DATABASE_HOST is used, and it should begin with a '/'
  • All of the other DATABASE_* settings are required for mysql
  • DEBUG mode makes the crawler output statistics generated as it runs, along with other debug messages
  • LOGGING is a dictConfig dictionary that sends log output to the console and to a rotating file; it works out of the box but can be modified
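
Putting those options together, a minimal settings.py might look like the sketch below. DATABASE_ENGINE, DATABASE_HOST, DEBUG, USE_COLORS, and LOGGING are the names described in this README; the mysql-only settings (DATABASE_NAME, DATABASE_USER, DATABASE_PASSWORD) and all of the values are assumptions for illustration, so check the shipped settings.py for the real defaults.

    # Illustrative settings.py; the mysql-only names below are assumed.
    DATABASE_ENGINE = "sqlite"                     # "mysql" or "sqlite"
    DATABASE_HOST = "/var/lib/pycrawler/crawl.db"  # sqlite: absolute file path

    # Used only when DATABASE_ENGINE == "mysql" (names assumed):
    DATABASE_NAME = "pycrawler"
    DATABASE_USER = "crawler"
    DATABASE_PASSWORD = "secret"

    DEBUG = True        # print running stats and debug-level messages
    USE_COLORS = True   # colorize console log output by level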

Current State

  • The mysql engine is untested
  • In some situations the database becomes locked and queries cannot execute; this is presumably an issue only with sqlite's file-based approach (a standard mitigation is sketched below)
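
One mitigation worth noting, not something this project documents but a standard feature of Python's sqlite3 module, is to open connections with a longer busy timeout so a query waits for the file lock to clear instead of failing immediately (the path here is the illustrative one from the settings sketch above):

    import sqlite3

    # timeout makes the connection wait up to 30 seconds for a competing
    # writer's lock to clear instead of raising "database is locked".
    conn = sqlite3.connect("/var/lib/pycrawler/crawl.db", timeout=30.0)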

Logging

  • DEBUG+ level messages are logged to the console, and INFO+ level messages are logged to a file (see the sketch after this list)
  • By default, the file handler is a TimedRotatingFileHandler that rolls over at midnight
  • The DEBUG setting toggles whether DEBUG-level messages are output at all
  • The USE_COLORS setting toggles whether messages output to the console are colored according to their level
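
The project's LOGGING dictionary works out of the box, so its exact contents aren't reproduced here, but a dictConfig matching the described behavior, DEBUG+ to the console and INFO+ to a file that rolls over at midnight, could look roughly like this (formatter and handler names are illustrative, and the color handling is omitted):

    # A dictConfig sketch consistent with the behavior described above.
    LOGGING = {
        "version": 1,
        "formatters": {
            "plain": {"format": "%(asctime)s %(levelname)s %(message)s"},
        },
        "handlers": {
            "console": {
                "class": "logging.StreamHandler",
                "level": "DEBUG",
                "formatter": "plain",
            },
            "file": {
                "class": "logging.handlers.TimedRotatingFileHandler",
                "level": "INFO",
                "formatter": "plain",
                "filename": "pycrawler.log",
                "when": "midnight",
            },
        },
        "root": {"level": "DEBUG", "handlers": ["console", "file"]},
    }

    # Applied with the standard library:
    #   import logging.config
    #   logging.config.dictConfig(LOGGING)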

Misc

  • Designed to run on multiple machines that work together, collecting information in a central database
  • Links to be crawled are queued in the database, so any machine running the crawler against the central db grabs from the same queue; this reduces crawling redundancy (see the sketch after this list)
  • Thread pool approach to analyzing keywords in text
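
This README doesn't show the queue schema, but the shared-queue idea can be pictured as each crawler instance claiming the next unvisited link in a single transaction; the table and column names below are invented for illustration:

    import sqlite3

    def claim_next_url(conn):
        """Claim and return the next queued URL, or None if the queue is empty."""
        with conn:  # the select and the mark-as-claimed commit together
            row = conn.execute(
                "SELECT id, url FROM queue WHERE claimed = 0 LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            conn.execute("UPDATE queue SET claimed = 1 WHERE id = ?", (row[0],))
            return row[1]

The thread-pool item can be sketched the same way with the standard library; the real keyword analysis is certainly more involved than this toy word count:

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def count_keywords(text):
        """Toy analysis: frequency of whitespace-separated words."""
        return Counter(text.lower().split())

    pages = ["a python web crawler", "a crawler queues links"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for counts in pool.map(count_keywords, pages):
            print(counts.most_common(2))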
