Data collection for Airbnb listings.
I am no longer maintaining this script.
I have been unable to maintain this script in a reasonable state for some time now, but I know there are people out there using it. If anybody wants to take over maintenance / ownership, I'd be very happy to help make the transition as easy as possible.
The script scrapes the Airbnb web site to collect data about the shape of the company's business. No guarantees are made about the quality of data obtained using this script, statistically or about an individual page. So please check your results.
Sometimes the Airbnb site refuses repeated requests. I run the script using a number of proxy IP addresses to avoid being turned away, and that costs money. I am afraid that I cannot help in finding or working with proxy IP services.
There are continued problems getting the script to work with new AirBnB page layouts. The change that affects this script most is reflected in airbnbsurvey.py, around line 620. The itemsoffset values get incremented with each new page of a search, and it used to be that section_offset would change too. If you are having problems and know how to track down request URL's in a browser, do try different settings for these.
For several months this script has not been working properly, which is an indication of its likely future state. One of the changes to Airbnb's site design led to a failure to paginate properly through the listings of each query, so that additional listings would be added only very slowly. This problem is now solved.
As of April 2018, searches of the Airbnb web site only return listings with available booking dates in the near future (I do not know the precise criterion). In some cases, this leads to a 20% or 20% reduction in the number of listings obtained in a search area compared with earlier results.
In other words, there are listings on the Airbnb web site that do not get returned in searches. These are listings for which all the dates in their calendar are marked as unavailable.
After further changes to the Airbnb web site here is a new version, posted on April 29 2018.
For this version to work, you need to find a key value from the Airbnb web site and fill in the apikey and urlapisearchroot values in the configuration file. See example.config for more information. I should emphasize that I do not know what the api_key signifies or communicates to Airbnb.
The script still seems to miss a percentage of the listings in high-density areas (see comment in previous April 2018 entry).
A second change to the Airbnb web site in February 2018 broke the script again. After several weeks, I have uploaded a fixed version on April 8, 2018. The specific change has been addressed (listings on a search site were broken into two distinct sets, and I had been picking up only one), and running on several cities suggests that the script is working again, although there is an open question whether it is missing about 10% of listings.
I have also made a change to loop only over rectangles of increasingly small size, removing separate loops over number of guests, room_type, and price. This seems to increase efficiency considerably, with no loss of accuracy (if listings are missing -- see the previous paragraph -- I believe it is a separate issue).
It continues to be the case that only "python airbnb.py -sb " works as a search method. See below for instructions on how to set up such a survey.
For some time in January 2018 this script was not working at all, as Airbnb had changed the site layout. As of February 1 2018, tests on four cities are consistent with results from throughout 2017 for the "-sb" bounding-box survey, and I believe it can be used reliably in that way.
The "-sb" search that is all I do now is more efficient now. Set searchmaxguests to 1 and searchdoloopoverprices to 1, and the search does not doo separate loops over guests and price ranges. Instead, set a larger searchmaxzoom (eg 12) as by covering all guests and price ranges at once, the search may need to zoom down to smaller rectangles.
You must be comfortable messing about with databases and python to use this. For running the script with docker please check: Run Airbnb data collection with Docker
To run the airbnb.py scraper you will need to use python 3.4 or later and install the modules listed at the top of the file. The difficult one is lxml: you'll have to go to their web site to get it. It doesn't seem to be in the normal python repositories so if you're on Linux you may get it through an application package manager (apt-get or yum, for example). The Anaconda distribution includes lxml and many other packages, and that's now the one I use.
Various parameters are stored in a configuration file, which is read in as
$USER.config. Make a copy of
example.configand edit it to match your database and the other parameters. The script uses proxies, so if you don't want those you may have to edit out some part of the code.
If you want to run multiple concurrent surveys with different configuration parameters, you can do so by making a copy of your
user.configfile, editing it and running the airbnb.py scripts (see below) with an additional command line parameter. The database connection test would become
python airbnb.py -dbp -c other.config
This was implemented initially to run bounding-box surveys for countries (maximum zoom of 8) and cities (maximum zoom of 6) at the same time.
The airbnb.py script works with a PostgreSQL database. You need to have the PostGIS extension installed. The schema is in the file
postgresql/schema_current.sql. You need to run that file to create the database tables to start with (assuming both your user and database are named
For example, if you use psql:
psql --user airbnb airbnb < postgresql/schema_current.sql
To check that you can connect to the database, run
python airbnb.py -dbp
where python is python3.
Add a search area (city) to the database:
python airbnb.py -asa "City Name"
This adds a city to the
search_areatable. It used to add a set of neighbourhoods to the neighborhoods table, but as only "-sb" searches are now supported that no longer happens.
Add a survey description for that city:
python airbnb.py -asv "City Name"
This makes an entry in the
surveytable, and should give you a
There are three ways to run surveys: - by neighbourhood - by bounding box - by zipcode
Of these, the bounding box is the one I use most and so is most thoroughly tested. The neighbourhood one is the easiest to set up, so you may want to try that first, but be warned that if Airbnb has not assigned neighbourhoods to the city you are searching, the results can be very incomplete.
For users of earlier releases: Thanks to contributions from Sam Kaufman the searches now save information on the search step, and there is no need to run an
-fstep after running a
-szsearch: the information about each room is collected from the search pages.
For some cities, Airbnb provides a list of "neighbourhoods", and one search loops over each neighbourhood in turn. If the city does not have neighbourhoods defined by Airbnb, this search will probably underestimate the number of listings by a large amount.
Run a neighbourhood-by-neighbourhood search:
python airbnb.py -s survey_id
This can take a long time (hours). Like many sites, Airbnb turns away requests (HTTP error 503) if you make too many in a short time, so the script tries waiting regularly. If you have to stop in the middle, that's OK -- running it again picks up where it left off (after a bit of a pause).
To run a search by zipcode (see below for setup):
python airbnb.py -sz zipcode
Search by zip code requires a set of zip codes for a city, stored in a separate table (which is not currently included). The table definition is as follows:
CREATE TABLE zipcode ( zipcode character varying(10) NOT NULL, search_area_id integer, CONSTRAINT z PRIMARY KEY (zipcode), CONSTRAINT zipcode_search_area_id_fkey FOREIGN KEY (search_area_id) REFERENCES search_area (search_area_id) )
To run a search by bounding box:
python airbnb.py -sb survey_id
Search by bounding box does a recursive geographical search, breaking a bounding box that surrounds a city into smaller pieces, and continuing to search while new listings are identified. This currently relies on adding the bounding box to the search_area table manually. A bounding box for a city can be found by entering the city at the following page:
Then you can update the
search_areatable with a statement like this:
UPDATE search_area SET bb_n_lat = NN.NNN, bb_s_lat = NN.NNN, bb_e_lng = NN.NNN, bb_w_lng = NN.NNN WHERE search_area_id = NNN
Ideally I'd like to automate this process. I am still experimenting with a combination of searchmaxpages and searchmaxrectanglezoom (in the user.config file) that picks up all the listings in a reasonably efficient manner. It seems that for a city, searchmaxpages=20 and searchmaxrectanglezoom=6 works well.
The basic data is in the table
room. A complete search of a given city's listings is a "survey" and the surveys are tracked in table
survey. If you want to see all the listings for a given survey, you can query the stored procedure surveyroom (surveyid) from a tool such as PostgreSQL psql.
SELECT * FROM room WHERE deleted = 0 AND survey_id = NNN
I also create separate tables that have GIS shapefiles for cities in them, and create views that provide a more accurate picture of the listings in a city, but that work is outside the scope of this project.