Scrapy extension to write scraped items using Django models
.. image:: https://img.shields.io/pypi/v/scrapy-djangoitem.svg
   :target: https://pypi.python.org/pypi/scrapy-djangoitem
   :alt: PyPI Version

.. image:: https://img.shields.io/travis/scrapy-plugins/scrapy-djangoitem/master.svg
   :target: http://travis-ci.org/scrapy-plugins/scrapy-djangoitem
   :alt: Build Status

.. image:: https://img.shields.io/github/license/scrapy-plugins/scrapy-djangoitem.svg
   :target: https://github.com/scrapy-plugins/scrapy-djangoitem/blob/master/LICENSE
   :alt: License

``scrapy-djangoitem`` is an extension that allows you to define Scrapy items using existing Django models.

This utility provides a new class, named ``DjangoItem``, that you can use as a regular Scrapy item and link to a Django model through its ``django_model`` attribute. Start using it right away by importing it from this package::

    from scrapy_djangoitem import DjangoItem

Python 3.4/3.5 are supported. For Python 3 you need Scrapy v1.1 or above.
Latest tested Django version is
Install the package with pip::

    pip install scrapy-djangoitem

``DjangoItem`` is a class of item that gets its field definitions from a Django model: you simply create a ``DjangoItem`` and specify which Django model it relates to.

Besides defining the model fields on your item, ``DjangoItem`` provides a method to create and populate a Django model instance with the item data.

``DjangoItem`` works much like ModelForms in Django: you create a subclass and set its ``django_model`` attribute to a valid Django model. This gives you an item with one field for each Django model field.

In addition, you can define fields that aren't present in the model, and even override fields that are present in the model by defining them in the item.

Let's see some examples:
Creating a Django model for the examples::

    from django.db import models

    class Person(models.Model):
        name = models.CharField(max_length=255)
        age = models.IntegerField()

Defining a basic ``DjangoItem``::

    from scrapy_djangoitem import DjangoItem

    class PersonItem(DjangoItem):
        django_model = Person

``DjangoItem`` works just like Scrapy items::

    >>> p = PersonItem()
    >>> p['name'] = 'John'
    >>> p['age'] = '22'

To obtain the Django model from the item, we call the extra method ``DjangoItem.save()``::

    >>> person = p.save()
    >>> person.name
    'John'
    >>> person.age
    '22'
    >>> person.id
    1

The model is already saved by the time ``DjangoItem.save()`` returns. We can prevent this by calling it with ``commit=False``, which gives us an unsaved model::

    >>> person = p.save(commit=False)
    >>> person.name
    'John'
    >>> person.age
    '22'
    >>> person.id
    None

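In a Scrapy project, ``save()`` would typically be called from an item pipeline. The following is a minimal sketch, assuming a hypothetical pipeline class named ``DjangoSavePipeline``. Only the ``process_item(item, spider)`` hook (Scrapy's pipeline interface) and the item's ``save()`` method are taken as given; the rest is illustrative.

```python
class DjangoSavePipeline:
    """Hypothetical item pipeline that persists each DjangoItem it receives."""

    def process_item(self, item, spider):
        # save() creates the Django model instance and writes it to the database.
        item.save()
        # Return the item so that any later pipeline still receives it.
        return item
```

To take effect, such a pipeline would have to be listed in the ``ITEM_PIPELINES`` setting of the Scrapy project, under whatever module path it lives in.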
As said before, we can add other fields to the item::

    import scrapy
    from scrapy_djangoitem import DjangoItem

    class PersonItem(DjangoItem):
        django_model = Person
        sex = scrapy.Field()

    p = PersonItem()
    p['name'] = 'John'
    p['age'] = '22'
    p['sex'] = 'M'

And we can override the fields of the model with our own::

    class PersonItem(DjangoItem):
        django_model = Person
        name = scrapy.Field(default='No Name')

This is useful for attaching properties to a field, such as a default or any other property that your project uses. Fields that exist only on the item, and not on the model, are not taken into account when calling ``DjangoItem.save()``.

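One way to picture how item-only fields are left out on save: when the model instance is built, only the item keys that match a model field are passed on. The sketch below is purely illustrative and is not the package's implementation. ``MODEL_FIELDS`` and ``model_kwargs`` are hypothetical names, and a plain dict stands in for the item.

```python
# Hypothetical stand-in for the field names of the Person model.
MODEL_FIELDS = {"id", "name", "age"}


def model_kwargs(item_data, model_fields=MODEL_FIELDS):
    """Keep only the item values that correspond to a model field."""
    return {key: value for key, value in item_data.items() if key in model_fields}


item_data = {"name": "John", "age": "22", "sex": "M"}
# 'sex' is dropped here because the stand-in model has no such field.
person_kwargs = model_kwargs(item_data)
```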
``DjangoItem`` is a rather convenient way to integrate Scrapy projects with Django models, but bear in mind that the Django ORM may not scale well if you scrape a lot of items (i.e. millions) with Scrapy. This is because a relational backend is often not a good choice for a write-intensive application (such as a web crawler), especially if the database is highly normalized and has many indices.

To use the Django models outside the Django application, you need to set up the ``DJANGO_SETTINGS_MODULE`` environment variable and, in most cases, modify the ``PYTHONPATH`` environment variable so that the settings module can be imported.

There are many ways to do this, depending on your use case and preferences. One of the simplest is described below.

Suppose your Django project is named ``mysite``, is located in the path ``/home/projects/mysite``, and you have created an app ``myapp`` with the model ``Person``. That means your directory structure is something like this::

    /home/projects/mysite
    ├── manage.py
    ├── myapp
    │   ├── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── mysite
        ├── __init__.py
        ├── settings.py
        ├── urls.py
        └── wsgi.py

Then you need to add ``/home/projects/mysite`` to the ``PYTHONPATH`` environment variable and set the environment variable ``DJANGO_SETTINGS_MODULE`` to ``mysite.settings``. That can be done in your Scrapy project's settings file by adding the lines below::

    import sys
    sys.path.append('/home/projects/mysite')

    import os
    os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

Notice that we modify the ``sys.path`` variable instead of the ``PYTHONPATH`` environment variable, as we are already within the Python runtime. If everything is right, you should be able to start the ``scrapy shell`` command and import the model ``Person`` (i.e. ``from myapp.models import Person``).

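The two settings-file steps shown earlier can also be wrapped in a small helper so that they are safe to run more than once. This is a sketch under assumptions: ``configure_django_env`` is a hypothetical name, and the path and settings module are those of the example layout.

```python
import os
import sys


def configure_django_env(project_dir, settings_module):
    """Make a Django project importable and point Django at its settings."""
    # Avoid adding the same directory twice if the settings file is re-imported.
    if project_dir not in sys.path:
        sys.path.append(project_dir)
    # setdefault keeps a value that was already exported in the shell.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)


configure_django_env("/home/projects/mysite", "mysite.settings")
```

Using ``setdefault`` rather than plain assignment is a design choice: it lets an externally set ``DJANGO_SETTINGS_MODULE`` take precedence over the value in the Scrapy settings file.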
Starting with Django 1.8, you also have to explicitly set up Django if you use it outside a ``manage.py`` context::

    import django
    django.setup()

The test suite from the ``tests`` directory can be run using ``tox``, with the configuration in ``tox.ini``. The Python interpreters used have to be installed locally on the system.