
4.6K Stars 1.3K Forks MIT License 210 Commits 87 Opened issues


Web interface for browsing, search and filtering recent arxiv submissions


arxiv sanity preserver

This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, add papers to a personal library, and get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in


Code layout

There are two large parts of the code:

Indexing code. Uses the Arxiv API to find the most recent papers in any categories you like, then downloads all papers, extracts all text, and creates tfidf vectors based on the content of each paper. This code is therefore concerned with the backend scraping and computation: building up a database of arxiv papers, calculating content vectors, creating thumbnails, computing SVMs for people, etc.

User interface. Then there is a web server (based on Flask/Tornado/sqlite) that allows searching through the database and filtering papers by similarity, etc.
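To make the "tfidf vectors" and "sort by similarity" ideas concrete, here is a minimal pure-Python sketch. It is not the project's code (which relies on scikit-learn): the function names, the smoothed idf weighting, and the three toy "papers" are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [a + " " + b for a, b in zip(words, words[1:])]
    return words + bigrams


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    counts = [Counter(tokens(d)) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    vectors = []
    for c in counts:
        # Smoothed idf (an assumption; other weightings work too).
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def rank_by_similarity(vectors, query_index):
    """Indices of all documents, most similar to the query document first."""
    q = vectors[query_index]

    def cosine(v):
        # Vectors are unit length, so the dot product is the cosine.
        return sum(w * v.get(t, 0.0) for t, w in q.items())

    return sorted(range(len(vectors)), key=lambda i: cosine(vectors[i]), reverse=True)


# Toy stand-ins for extracted paper text.
papers = [
    "convolutional networks for image classification",
    "deep convolutional networks for image segmentation",
    "variational inference for topic models",
]
vectors = tfidf_vectors(papers)
ranking = rank_by_similarity(vectors, 0)
print(ranking)
```

Sparse dicts keep the sketch dependency-free; with tens of thousands of papers you would want a sparse matrix library instead.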


Dependencies

Several: You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer and training of SVMs), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil and scipy, and sqlite3 for the database (accounts, library support, etc.). Most of these are easy to install, e.g.:
$ virtualenv env                # optional: use virtualenv
$ source env/bin/activate       # optional: use virtualenv
$ pip install -r requirements.txt

You will also need ImageMagick and pdftotext, which you can install on Ubuntu with:

sudo apt-get install imagemagick poppler-utils

Bleh, that's a lot of dependencies, isn't it?

Processing pipeline

The processing pipeline requires you to run a series of scripts, and at this stage I really encourage you to manually inspect each script, as they may contain various inline settings you might want to change. In order, the processing pipeline is:

  1. Run
    to query arxiv API and create a file
    that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg
    to restart where you left off when you were last interrupted by arxiv.
  2. Run
    , which iterates over all papers in parsed pickle and downloads the papers into folder
  3. Run
    to export all text from pdfs to files in
  4. Run
    to export thumbnails of all pdfs to
  5. Run
    to compute tfidf vectors for all documents based on bigrams. Saves a
    pickle files.
  6. Run
    to train SVMs for all users (if any), exports a pickle
  7. Run
    for various preprocessing so that server starts faster (and make sure to run
    sqlite3 as.db < schema.sql
    if this is the very first time ever you're starting arxiv-sanity, which initializes an empty database).
  8. Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here -
    • Start the mongodb server with -
      sudo service mongod start
    • Verify that the server is running in the background: the last line of the /var/log/mongodb/mongod.log file should be -
      [initandlisten] waiting for connections on port 
  9. Run the flask server with
    . Visit localhost:5000 and enjoy sane viewing of papers!
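As an illustration of step 1, the sketch below builds paged arxiv API query URLs and shows how a start offset lets an interrupted run resume. The helper names `build_query_url` and `page_starts` are hypothetical and are not taken from the project's scripts; only the public API endpoint and its `search_query`, `sortBy`, `sortOrder`, `start` and `max_results` parameters are assumed.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Build the URL for one page of results, most recently updated first."""
    search = " OR ".join("cat:" + c for c in categories)
    params = {
        "search_query": search,
        "sortBy": "lastUpdatedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def page_starts(start_index, total, page_size):
    """Offsets to request, beginning at start_index so a run can resume."""
    return list(range(start_index, total, page_size))


# Resume at offset 200 after being rate limited (hypothetical numbers).
url = build_query_url(["cs.CV", "cs.LG"], start=200)
print(url)
print(page_starts(200, 500, 100))
```

Each URL returns an Atom feed that feedparser can parse; sleeping for a few seconds between pages helps to stay under the rate limit.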

Optionally you can also run the
in a screen session, which uses your Twitter API credentials (stored in
) to query Twitter periodically looking for mentions of papers in the database, and writes the results to the pickle file

I have a simple shell script that runs these commands one by one, and every day I run this script to fetch new papers, incorporate them into the database, and recompute all tfidf vectors/classifiers. More details on this process below.
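A rough Python equivalent of such a driver is sketched below; it is not the author's shell script. The two steps are placeholders that only print a message, and `run_pipeline` is a hypothetical helper that stops at the first failing step so a broken stage does not feed bad data into the next one.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run each command in order and stop at the first failure.

    Returns the number of commands that completed successfully.
    """
    done = 0
    for cmd in commands:
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("step failed with exit code", result.returncode)
            break
        done += 1
    return done


# Placeholder steps; substitute the real pipeline scripts, in order.
steps = [
    [sys.executable, "-c", "print('step 1: fetch')"],
    [sys.executable, "-c", "print('step 2: download')"],
]
completed = run_pipeline(steps)
print(completed, "of", len(steps), "steps completed")
```

Run from cron or a similar scheduler, a driver like this gives the daily refresh described above.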

protip: numpy/BLAS: The script
does quite a lot of heavy lifting with numpy. I recommend that you carefully set up your numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With ~25,000 papers and ~5000 users the script runs in several hours on my current machine with a BLAS-linked numpy.

Running online

If you'd like to run the flask server online (e.g. AWS) run it as

python --prod

You also want to create a

file and fill it with random text (see top of
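One way to produce the random text is sketched here. The file name in the usage comment is a placeholder (use whatever name the server expects), and `write_secret_key` is a hypothetical helper, not part of the project.

```python
import os
import secrets


def write_secret_key(path, nbytes=32):
    """Create a key file with random hex text unless one already exists.

    Returns True if a new file was written, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    # Create the file readable and writable by the owner only, where supported.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(secrets.token_hex(nbytes))
    return True


# Usage (placeholder file name):
# write_secret_key("my_secret_key.txt")
```

Refusing to overwrite an existing file matters here: if the key is used to sign session cookies, replacing it would presumably log every user out.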

Current workflow

Running the site live is not currently set up for a fully automatic plug and play operation. Instead it's a bit of a manual process and I thought I should document how I'm keeping this code alive right now. I have a script that performs the following update early morning after arxiv papers come out (~midnight PST):


I run the server in a screen session, so

screen -S serve
to create it (or
to reattach to it) and run:
python --prod --port 80

The server will load the new files and begin hosting the site. Note that on some systems you can't use port 80 without

. Your two options are to use
to reroute ports or you can use setcap to elevate the permissions of your
interpreter that runs
. In this case I'd recommend careful permissions and maybe virtualenv, etc.
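Before choosing between those options, it can help to check whether the current user is allowed to bind the port at all. The sketch below is a standalone check with a hypothetical helper name, `can_bind`; it is not part of the server code.

```python
import socket


def can_bind(port, host="127.0.0.1"):
    """Return True if this process can bind host:port right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Covers PermissionError (privileged port) and "address already in use".
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    for port in (80, 5000):
        print(port, "ok" if can_bind(port) else "cannot bind")
```

A False result for port 80 but True for port 5000 usually points at the privileged-port restriction rather than a conflict with another process.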
