MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III
This repo contains code for MIMIC-Extract. It has been divided into the following folders:
* Data: Locally contains the data to be extracted.
* Notebooks: Jupyter Notebooks demonstrating test cases and usage of output data in risk and intervention prediction tasks.
* Resources: Consists of Rohititemid.txt, which describes the correlation of MIMIC-III item ids with those of MIMIC II as used by Rohit; itemidtovariablemap.csv, which is the main file used in data extraction and consists of groupings of item ids as well as which item ids are ready to extract; and variableranges.csv, which describes the normal variable ranges for the levels, assisting in extraction of proper data. It also contains the expected schema of output tables.
* Utils: Scripts and detailed instructions for running the MIMIC-Extract data pipeline.
* `mimic_direct_extract.py`: Extraction script.
If you use this code in your research, please cite the following publication:
Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, and Marzyeh Ghassemi. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. arXiv:1907.08322.
If you simply wish to use the output of this pipeline in your own research, a preprocessed version with default parameters is available via GCP, here.
To access this, you will need to be credentialed for MIMIC-III GCP access through PhysioNet. Instructions for that are available on PhysioNet.
This output is released on an as-is basis, with no guarantees, but if you find any issues with it, please let us know via GitHub issues.
The first several steps are the same here as above. These instructions are tested with mimic-code at version `762943eab64deb30bdb2abcf7db43602ccb25908`.
Your local system should have the following executables on the PATH:
All instructions below should be executed from a terminal, with current directory set to utils/
Next, make a new conda environment from `mimic_extract_env_py36.yml` and activate that environment.
conda env create --force -f ../mimic_extract_env_py36.yml
This step will report failure on the pip installation stage. This is not the end of the world. Instead, simply activate the environment (which should work despite the former "failure"):
conda activate mimic_data_extraction
Then install any failed packages with pip (e.g., `pip install [package]`). This may include, in particular, the packages `datapackage`, `spacy`, and `scispacy`. You will also need to install the English language model for spaCy, via:
python -m spacy download en_core_web_sm
The desired environment will be created and activated.
This step will typically take less than 5 minutes and requires a good internet connection.
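Because the pip stage can fail partway through, it may help to confirm which packages actually made it into the environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the package list is simply the one named above plus the spaCy model `en_core_web_sm`.

```python
import importlib.util

# Packages named above as commonly failing during the pip stage,
# plus the spaCy English model that is installed afterwards.
PACKAGES = ["datapackage", "spacy", "scispacy", "en_core_web_sm"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(PACKAGES)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Run it inside the activated conda environment; anything it reports as missing can then be installed with `pip install [package]`.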
Materialized views in the MIMIC PostgreSQL database will be generated. These include all concept tables in the MIT-LCP repo, as well as tables for extracting non-mechanical ventilation and injections of crystalloid bolus and colloid bolus.
Note that your postgres user needs schema edit permission to make concepts in this way. First, you must clone the mimic-code GitHub repository to a directory, which here we assume is stored in the environment variable `$MIMIC_CODE_DIR`. After cloning, follow these instructions:
cd $MIMIC_CODE_DIR/concepts
psql -d mimic -f postgres-functions.sql
bash postgres_make_concepts.sh
Next, you'll need to build 3 additional materialized views necessary for this pipeline. To do this (again with schema edit permission), navigate to `utils/` and run `bash postgres_make_extended_concepts.sh` followed by `psql -d mimic -f niv-durations.sql`.
Next, navigate to the root directory of this repository, activate your conda environment, and run `python mimic_direct_extract.py ...` with your args as desired.
The default setting will create an HDF5 file inside MIMICEXTRACTOUTPUTDIR with four tables:
* patients: static demographics, static outcomes
    * One row per (subjid,hadmid,icustayid)
* vitals_labs: time-varying vitals and labs (hourly mean, count and standard deviation)
* vitals_labs_mean: time-varying vitals and labs (hourly mean only)
* interventions: hourly binary indicators for administered interventions
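Once the file exists, the four tables can be read back with pandas. This is a minimal sketch, not part of the pipeline. The output file's name is not stated in this section, so the path is left as a parameter (read here from a hypothetical `H5_PATH` environment variable), and the table keys are assumed to match the names listed above with underscores (e.g. `vitals_labs_mean`).

```python
import os

# Table names as listed above (assumed to be the HDF5 keys).
TABLE_KEYS = ("patients", "vitals_labs", "vitals_labs_mean", "interventions")


def load_tables(h5_path, keys=TABLE_KEYS):
    """Load each table from the HDF5 file into a dict of DataFrames."""
    # Imported lazily; reading HDF5 with pandas also requires PyTables.
    import pandas as pd

    return {key: pd.read_hdf(h5_path, key=key) for key in keys}


if __name__ == "__main__":
    # H5_PATH is a hypothetical variable; point it at your own output file.
    h5_path = os.environ.get("H5_PATH")
    if h5_path:
        for name, frame in load_tables(h5_path).items():
            print(name, frame.shape)
```

Printing each table's shape is a quick sanity check that all four tables were written and are non-empty.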
This step will probably take 5-10 hours and requires a machine with at least 50GB RAM.
By default, this step builds a dataset with all eligible patients. Sometimes, we wish to run with only a small subset of patients (debugging, etc.).
To do this, set the `POP_SIZE` environment variable. For example, to build a curated dataset with only the first 1000 patients, we could set `POP_SIZE` to 1000.
When I run `mimic_direct_extract.py`, I encounter an error of the form:
psycopg2.OperationalError: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
or
psycopg2.OperationalError: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket "/var/run/postgresql/..."?
For this issue, see this stackoverflow post and use our `--psql_host` argument, which you can pass either directly when calling `mimic_direct_extract.py` or via the Makefile instructions by setting the `HOST` environment variable.
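The error arises because, when no host is given, the PostgreSQL client library looks for a Unix domain socket; supplying a host name makes it connect over TCP instead. The sketch below illustrates that with psycopg2. It is not the pipeline's own connection code: the helper names are hypothetical, `mimic` is the database name used in the `psql -d mimic` commands above, and `localhost` is only an example host.

```python
def build_connect_kwargs(dbname, host=None, user=None, password=None):
    """Build keyword arguments for psycopg2.connect, omitting unset values.

    With no host, libpq falls back to a Unix domain socket; a host such as
    "localhost" makes it connect over TCP instead.
    """
    candidates = {"dbname": dbname, "host": host, "user": user, "password": password}
    return {key: value for key, value in candidates.items() if value is not None}


def connect(dbname, host=None, user=None, password=None):
    """Open a connection using only the parameters that were provided."""
    # Imported lazily so build_connect_kwargs works without the driver.
    import psycopg2

    return psycopg2.connect(**build_connect_kwargs(dbname, host, user, password))


if __name__ == "__main__":
    # Illustrative values only; no connection is opened here.
    print(build_connect_kwargs("mimic", host="localhost"))
```

If `connect("mimic", host="localhost")` succeeds where `connect("mimic")` fails with the socket error, passing the same host via `--psql_host` is likely to resolve the issue.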
relation "code_status" does not exist
In this error, the table `code_status` hasn't been built successfully, and you'll need to rebuild your MIMIC-III concepts. Instructions for this can be found in Step 3 of either instruction set. Also see below for our issues specific to building concepts.
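Before rebuilding, it can be useful to check whether PostgreSQL can resolve the relation at all from your connection. The following is a minimal sketch, assuming PostgreSQL 9.4 or later (for `to_regclass`) and a DB-API cursor that uses the `%s` parameter style, as psycopg2 cursors do. The function name `relation_exists` is hypothetical.

```python
def relation_exists(cursor, qualified_name):
    """Return True if PostgreSQL can resolve the given relation name.

    `cursor` is any DB-API cursor using the %s parameter style.
    to_regclass returns NULL when the relation cannot be found on the
    current search path (or in the schema given in the name).
    """
    cursor.execute("SELECT to_regclass(%s)", (qualified_name,))
    row = cursor.fetchone()
    return row is not None and row[0] is not None
```

For example, with an open psycopg2 connection `conn`, `relation_exists(conn.cursor(), "code_status")` returning `False` suggests the concept was either never built or lives in a schema that is not on your search path.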
ALTER TABLE code_status SET SCHEMA mimiciii;
GRANT SELECT ON mimiciii.code_status TO [USER];
Note that you'll need to run these on every concepts table accessed by the script.
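Since the two statements have to be repeated for every concepts table, generating them can save some typing. This is a minimal sketch: the function name is hypothetical, `code_status` is the only table name taken from the text above, and the user name is a placeholder. The output is plain SQL text that you would review and then run with `psql`.

```python
import re

# Only plain identifiers are accepted, because names are interpolated
# directly into the SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def schema_fix_statements(tables, user, schema="mimiciii"):
    """Return the ALTER TABLE / GRANT SELECT pair for each concepts table."""
    for name in [schema, user, *tables]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    statements = []
    for table in tables:
        statements.append(f"ALTER TABLE {table} SET SCHEMA {schema};")
        statements.append(f"GRANT SELECT ON {schema}.{table} TO {user};")
    return statements


if __name__ == "__main__":
    # "my_user" is a placeholder; extend the list with your own tables.
    for statement in schema_fix_statements(["code_status"], user="my_user"):
        print(statement)
```

The list of tables to pass in is whichever concepts tables the script accesses in your setup; it is not enumerated here.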