Take the hassle out of web scraping
Ruby 2.3.1, Docker, MySQL, SQLite 3, Redis, mitmproxy. (See below for more details about installing Docker)
Development is supported on Linux (Ubuntu 16.04 works best; Ubuntu 18.04 is possible with some setup) and Mac OS X.
Docker images:
* openaustralia/buildstep - Base image for running scrapers in containers
Just follow the instructions on the Docker site.
Your user account should be able to manipulate Docker (just add your user to the
Install Docker for Mac.
Morph needs Elasticsearch to run. To make development easier, we run Elasticsearch in Docker.
bundle install
cp config/database.yml.example config/database.yml
cp env-example .env
Edit config/database.yml with your database settings.
Create an application on GitHub so that morph.io can talk to GitHub. Fill in the following values
Note the use of 127.0.0.1 rather than localhost. Use this or it won't work.
In the .env file, fill in the Client ID and Client Secret details provided by GitHub for the application you've just created.
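Before starting the server, it can help to confirm that both values actually made it into the .env file. The following is a minimal sketch rather than part of morph.io itself: `check_env_keys` is a hypothetical helper, and the key names in the usage line are assumptions, so substitute the names that appear in your copy of env-example.

```shell
# Hypothetical helper: succeed only if every named key has a non-empty
# value in the given dotenv-style file (lines of the form KEY=value).
check_env_keys() {
  env_file="$1"
  shift
  status=0
  for key in "$@"; do
    # Match "KEY=" followed by at least one character.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "missing or empty: ${key}" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (key names are assumptions; check env-example for the real ones):
#   check_env_keys .env GITHUB_APP_CLIENT_ID GITHUB_APP_CLIENT_SECRET
```

The helper only checks that a value is present; it cannot tell whether GitHub will accept it.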
Now set up the databases:
bundle exec dotenv rake db:setup
Now you can start the server
bundle exec dotenv foreman start
and point your browser at http://127.0.0.1:3000
To get started, log in with GitHub. There is a simple admin interface accessible at http://127.0.0.1:3000/admin. To access this, run the following to give your account admin rights:
bundle exec rake app:promote_to_admin
If you're running Guard (see below) the tests will also automatically run when you change a file.
By default, RSpec will skip tests that have been tagged as being slow. To change this behaviour, add the following to your
By default, RSpec will run certain tests against a running Docker server. These tests are quite slow, but have not been tagged as slow. To stop RSpec from running these tests, add the following to your
We use Guard and Livereload so that whenever you edit a view in development the web page gets automatically reloaded. It's a massive time saver when you're doing design or lots of work in the view. To make it work, run:
bundle exec guard
Guard will also run tests when needed. Some of these are integration tests against a running Docker server, and they are very slow. If you want to disable them, run:
DONT_RUN_DOCKER_TESTS=1 bundle exec guard
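If Docker is only sometimes running on your machine, the variable can be set conditionally. This is a minimal sketch, assuming a POSIX shell; `docker_test_flag` is a hypothetical helper, and probing with `docker info` is our choice rather than something morph.io prescribes.

```shell
# Hypothetical helper: print DONT_RUN_DOCKER_TESTS=1 when the probe command
# fails (Docker not reachable), and print nothing when it succeeds.
docker_test_flag() {
  probe="${1:-docker info}"   # command used to check that Docker is reachable
  if ! $probe >/dev/null 2>&1; then
    printf 'DONT_RUN_DOCKER_TESTS=1'
  fi
}

# Usage:
#   env $(docker_test_flag) bundle exec guard
```

When Docker is reachable the helper prints nothing, so the usage line reduces to a plain `bundle exec guard`.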
By default, mails sent in development go to Mailcatcher. To install it:
gem install mailcatcher
This section will not be relevant to most people. It will however be relevant if you're deploying to a production server.
We're using Ansible Vault to encrypt certain files, like the private key for the SSL certificate.
To make this work you will need to put the password in a file at
~/.infrastructure_ansible_vault_pass.txt. This is the same password as used in the openaustralia/infrastructure GitHub repository.
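Since this file holds a secret, it is worth creating it so that only your user can read it. This is a minimal sketch, assuming a POSIX shell; `write_vault_pass` is a hypothetical helper, not something shipped with the repository.

```shell
# Hypothetical helper: read a password from stdin and store it in the given
# file, readable and writable by the owner only.
write_vault_pass() {
  pass_file="$1"
  (
    umask 077              # new files are created with mode 0600
    cat > "$pass_file"
  )
  chmod 600 "$pass_file"   # also tighten the mode if the file already existed
}

# Usage:
#   write_vault_pass ~/.infrastructure_ansible_vault_pass.txt
#   (type or paste the password, then press Ctrl-D)
```

Reading from stdin keeps the password out of your shell history and out of the process list.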
Discourse runs in a container and should usually be restarted automatically by Docker.
However, if the container goes away for some reason, it can be restarted as root from /var/discourse:
./launcher rebuild app
This will pull down the latest Docker image, rebuild, and restart the container.
This method defaults to creating a 4 GB VirtualBox VM, which can strain an 8 GB Mac. We suggest tweaking the Vagrantfile to restrict RAM usage to 2 GB at first, or using a machine with at least 12 GB of RAM.
Install a couple of Vagrant plugins:
vagrant plugin install vagrant-hostsupdater vagrant-disksize
If on Ubuntu 18.04, downgrade libssl-dev:
sudo apt install libssl1.0-dev
If on Ubuntu, install libreadline-dev and libsqlite3-dev:
sudo apt install libreadline-dev libsqlite3-dev
Install the required ruby version:
gem install capistrano
Run make roles to install some required Ansible roles.
Run vagrant up local. This will build and provision a box that looks and acts like production at
Once the box is created and provisioned, deploy the application to your Vagrant box:
cap local deploy
Now visit https://dev.morph.io/
To deploy morph.io to production, normally you'll just want to deploy using Capistrano:
cap production deploy
When you've changed the Ansible playbooks to modify the infrastructure you'll want to run:
We're using Let's Encrypt for SSL certificates. It's not 100% automated. On a completely fresh install (with a new domain) as root:
certbot --nginx certonly -m <email address> --agree-tos
It should show something like this:
```
1: morph.io
2: api.morph.io
3: faye.morph.io
4: help.morph.io
```
Leave your answer blank, which will install the certificate for all of them.
sudo certbot certonly --manual -d dev.morph.io --preferred-challenges dns -d api.dev.morph.io -d faye.dev.morph.io -d help.dev.morph.io
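The repeated -d flags follow a pattern: the base domain, then one flag per subdomain. The sketch below builds that list, assuming a POSIX shell; `certbot_domain_args` is a hypothetical helper, and the subdomain labels are the ones from the command above.

```shell
# Hypothetical helper: print "-d <base> -d <sub>.<base> ..." for a base
# domain followed by any number of subdomain labels.
certbot_domain_args() {
  base="$1"
  shift
  args="-d ${base}"
  for sub in "$@"; do
    args="${args} -d ${sub}.${base}"
  done
  printf '%s\n' "$args"
}

# Usage (word splitting of the output is intentional):
#   sudo certbot certonly --manual --preferred-challenges dns \
#     $(certbot_domain_args dev.morph.io api faye help)
```

Generating the flags from one base domain makes it harder to mistype a single hostname when requesting a certificate for a different environment.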
Scrapers talk out to the internet by being routed through the mitmdump2 proxy container. The default container you'll get on a devops install has no SSL certificates. This makes it easy for traffic to get out, but means we can't replicate some problems that occur when SSL validation fails.
To work around this, you'll have to rebuild the mitmdump container. Look in /var/www/current/docker_images/morph-mitmdump; there's a Makefile that will aid in building the new image.
Once that's done, you'll need to build a new version of the
git clone https://github.com/openaustralia/buildstep.git
cp /var/www/current/docker_images/morph-mitmdump/mitmproxy/mitmproxy-ca-cert.pem .
docker image build -t openaustralia/buildstep:latest .
You should now be able to see in docker image list --all that your new image is ready. The next time you run a scraper it will be rebuilt using the new buildstep image.
If you find what looks like a bug:
If you want to contribute an enhancement or a fix:
We maintain a list of issues that are easy fixes. Fixing one of these is a great way to get started while you get familiar with the codebase.
Copyright OpenAustralia Foundation Limited. Licensed under the Affero GPL. See LICENSE file for more details.