📰 Brazilian government gazettes, accessible to everyone.
Diário Oficial is the Brazilian government gazette, one of the best sources for following the latest actions of the public administration, with distinct publications at the federal, state, and municipal levels.
Even with recurrent efforts to enforce Freedom of Information legislation across the country, official communication remains, in most territories, locked in PDFs.
The goal of this project is to upgrade Diário Oficial to the digital age, centralizing information currently only available through separate sources.
When this project was initially released, it had two distinct goals: creating crawlers for government gazettes and parsing bidding exemptions from them. Going forward, it is limited to the first objective.
If you are on a Windows computer, before you run the steps below you will need the Microsoft Visual Build Tools (download here). When you start the installation, select 'C++ build tools' on the Workloads tab, and also 'Windows 10 SDK' and 'MSVC v142 - VS 2019 C++ x64/x86 build tools' on the Individual Components tab.
If you are in a Linux-like environment, the following commands will create a new virtual environment (keeping everything isolated from your system), activate it, and install all the libraries needed to start running and developing new spiders:
```console
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r data_collection/requirements.txt
$ pre-commit install
```
On a Windows computer, you can use the commands above; you just need to substitute the activation line with `.venv/Scripts/activate.bat`. The rest is the same as on Linux.
After configuring your environment, you will be able to execute and develop new spiders. The Scrapy project is in the `data_collection` directory, so you must enter it to execute the spiders and the commands below:

```console
$ cd data_collection
```
Below we list some helpful commands.
Get a list of all available spiders:

```console
$ scrapy list
```
Execute the spider named `spider_name`:

```console
$ scrapy crawl spider_name
```
You can limit which gazettes will be downloaded by passing `start_date` as an argument in `YYYY-MM-DD` format. The following command will download only gazettes published from 01/Sep/2020 onwards:

```console
$ scrapy crawl sc_florianopolis -a start_date=2020-09-01
```
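To illustrate what this flag does, here is a minimal sketch of date filtering, assuming the spider compares each gazette's date against the parsed `start_date`; the function name and signature are hypothetical, not the project's actual implementation:

```python
from datetime import date, datetime
from typing import Optional


def should_download(gazette_date: date, start_date: Optional[str]) -> bool:
    """Return True when the gazette is on or after start_date.

    Hypothetical helper: start_date arrives as a YYYY-MM-DD string
    (the same format passed on the command line via -a start_date=...).
    """
    if start_date is None:
        return True  # no filter: download everything
    threshold = datetime.strptime(start_date, "%Y-%m-%d").date()
    return gazette_date >= threshold


# A gazette from 31/Aug/2020 is skipped; one from 01/Sep/2020 is kept.
print(should_download(date(2020, 8, 31), "2020-09-01"))  # False
print(should_download(date(2020, 9, 1), "2020-09-01"))   # True
```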
You may end up in a situation where different cities use the same spider base, such as `FecamGazetteSpider`. To avoid creating the spider files manually, you can use a script for cases where the spiders are simple and share the same spider base.
The spider template lives in the `scripts/` folder. Here is an example of a generated spider:
```python
from datetime import date

from gazette.spiders.base import ImprensaOficialSpider


class BaGentioDoOuroSpider(ImprensaOficialSpider):
    name = "ba_gentio_do_ouro"
    allowed_domains = ["pmGENTIODOOUROBA.imprensaoficial.org"]
    start_date = date(2017, 2, 1)
    url_base = "http://pmGENTIODOOUROBA.imprensaoficial.org"
    TERRITORY_ID = "2911303"
```
To run the script, you only need a CSV file following the structure below:

```csv
url,city,state,territory_id,start_day,start_month,start_year,base_class
http://pmXIQUEXIQUEBA.imprensaoficial.org,Xique-Xique,BA,2933604,1,1,2017,ImprensaOficialSpider
http://pmWENCESLAUGUIMARAESBA.imprensaoficial.org,Wenceslau Guimarães,BA,2933505,1,1,2017,ImprensaOficialSpider
http://pmVERACRUZBA.imprensaoficial.org,Vera Cruz,BA,2933208,1,4,2017,ImprensaOficialSpider
```
Once you have the CSV file, run the command:

```console
$ python generate_spiders.py new-spiders.csv
```
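As an illustration of the idea behind such a generator, here is a minimal sketch that reads CSV rows like the ones above and renders a spider module from a string template. The template text, naming scheme, and helper names below are assumptions for the sake of the example, not the project's actual `generate_spiders.py` implementation:

```python
import csv
import io

# Hypothetical template mirroring the generated-spider example shown
# earlier; the real script keeps its template in the scripts/ folder.
TEMPLATE = """\
from datetime import date

from gazette.spiders.base import {base_class}


class {class_name}({base_class}):
    name = "{name}"
    allowed_domains = ["{domain}"]
    start_date = date({start_year}, {start_month}, {start_day})
    url_base = "{url}"
    TERRITORY_ID = "{territory_id}"
"""


def render_spider(row: dict) -> str:
    """Render the source code of one spider from a CSV row (illustrative)."""
    city_slug = row["city"].lower().replace(" ", "_").replace("-", "_")
    name = f"{row['state'].lower()}_{city_slug}"
    class_name = "".join(part.title() for part in name.split("_")) + "Spider"
    return TEMPLATE.format(
        base_class=row["base_class"],
        class_name=class_name,
        name=name,
        domain=row["url"].split("//")[-1],
        url=row["url"],
        territory_id=row["territory_id"],
        start_day=row["start_day"],
        start_month=row["start_month"],
        start_year=row["start_year"],
    )


# Usage example with one row from the CSV format described above.
csv_text = """url,city,state,territory_id,start_day,start_month,start_year,base_class
http://pmVERACRUZBA.imprensaoficial.org,Vera Cruz,BA,2933208,1,4,2017,ImprensaOficialSpider
"""
for row in csv.DictReader(io.StringIO(csv_text)):
    print(render_spider(row))
```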
That's it. The new spiders will be generated in the spiders directory of the Scrapy project.

When running the `pip install` command, you may get an error like the one below:
```
module.c:1:10: fatal error: Python.h: No such file or directory
 #include <Python.h>
          ^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
```
Please try to install `python3-dev`, e.g. via `apt install python3-dev` if you are using a Debian-like distro, or use your distro's package manager. Make sure you install the package matching your Python version (e.g. `python3.7-dev`); you can check your version with `python3 --version`.
If you are interested in fixing issues and contributing directly to the code base, please see the document CONTRIBUTING.md.
This project is maintained by Open Knowledge Foundation Brasil, thanks to the support of Digital Ocean and hundreds of other supporters.