Official code repository of "BERTology Meets Biology: Interpreting Attention in Protein Language Models."
This repository is the official implementation of BERTology Meets Biology: Interpreting Attention in Protein Language Models.
General requirements: * Python >= 3.6 * PyTorch (See installation instructions here.)
Specific libraries:
pip install biopython==1.77 pip install tape-proteins==0.4
NGLViewer (based on instructions found here):
Available on
conda-forgechannel
conda install nglview -c conda-forge jupyter-nbextension enable nglview --py --sys-prefix
# if you already installed nglview, you can
upgradeconda upgrade nglview --force # might need: jupyter-nbextension enable nglview --py --sys-prefix ```
pip install nglview jupyter-nbextension enable nglview --py --sys-prefix
To use in Jupyter Lab you need to install appropriate extension:
jupyter labextension install nglview-js-widgets
cd /notebooks jupyter notebook provis.ipynb
You may edit the notebook to choose other proteins, attention heads, etc. The visualization tool is based on the excellent nglview library.
cd python setup.py develop
To download additional required datasets from TAPE:
cd /data wget http://s3.amazonaws.com/proteindata/data_pytorch/secondary_structure.tar.gz tar -xvf secondary_structure.tar.gz && rm secondary_structure.tar.gz wget http://s3.amazonaws.com/proteindata/data_pytorch/proteinnet.tar.gz tar -xvf proteinnet.tar.gz && rm proteinnet.tar.gz
The following steps will recreate the reports currently found in /reports/attention_analysis
Before performing steps, navigate to appropriate directory:
cd /protein_attention/attention_analysis
Run analysis (may wish to run in background):
sh scripts/compute_aa_features.shThe above steps create a pickle extract file in /data/cache
Run report from extract file:
python report_edge_features.py edge_features_aa python report_aa_correlations.py edge_features_aa
Run analysis:
sh scripts/compute_sec_features.sh
Run reports:
python report_edge_features.py edge_features_sec
Run analysis:
sh scripts/compute_contact_features.sh
Run report:
python report_edge_features.py edge_features_contact
Run analysis:
sh scripts/compute_site_features.sh
Run report:
python report_edge_features.py edge_features_sites
Create report of all features combined
python report_edge_features_combined.py edge_features_aa edge_features_sec edge_features_contact edge_features_sites
The following steps will recreate the reports currently found in /reports/probing
Navigate to directory:
cd /protein_attention/probing
Train diagnostic classifiers. Each script will write out an extract file with evaluation results. Note: each of these scripts may run for several hours.
sh scripts/probe_ss4_0_all sh scripts/probe_ss4_1_all sh scripts/probe_ss4_2_all sh scripts/probe_sites.sh sh scripts/probe_contacts.sh
python report.py
This project is licensed under BSD3 License - see the LICENSE file for details
This project incorporates code from the following repo: * https://github.com/songlab-cal/tape
When referencing this repository, please cite this paper.
@misc{vig2020bertology, title={BERTology Meets Biology: Interpreting Attention in Protein Language Models}, author={Jesse Vig and Ali Madani and Lav R. Varshney and Caiming Xiong and Richard Socher and Nazneen Fatema Rajani}, year={2020}, eprint={2006.15222}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2006.15222} }