
License: Apache License 2.0


Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at [email protected]


🙊 Detoxify

Toxic Comment Classification with ⚡ Pytorch Lightning and 🤗 Transformers


News & Updates

03-09-2021: New improved unbiased model

  • Updated the unbiased model weights used by Detoxify with a model trained on both datasets from the first 2 Jigsaw challenges. New best score on the test set: 0.93744 (0.93639 before).

15-02-2021: Detoxify featured in Scientific American!

14-01-2021: Lightweight models

  • Added smaller models trained with ALBERT for the original and unbiased models. They can be accessed with Detoxify in the same way, by passing the corresponding small-model names as inputs. The small original model achieved a mean AUC score of 0.98281 (0.98636 before) and the small unbiased model achieved a final score of 0.93362 (0.93639 before).


Trained models & code to predict toxic comments on 3 Jigsaw challenges: Toxic comment classification, Unintended Bias in Toxic comments, Multilingual toxic comment classification.

Built by Laura Hanu at Unitary, where we are working to stop harmful content online by interpreting visual content in context.

Dependencies:
  • For inference:
      • 🤗 Transformers
      • ⚡ PyTorch Lightning
  • For training, you will also need:
      • Kaggle API (to download data)

| Challenge | Year | Goal | Original Data Source | Detoxify Model Name | Top Kaggle Leaderboard Score | Detoxify Score |
|-|-|-|-|-|-|-|
| Toxic Comment Classification Challenge | 2018 | build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. | Wikipedia Comments | original | 0.98856 | 0.98636 |
| Jigsaw Unintended Bias in Toxicity Classification | 2019 | build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities. You'll be using a dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias. | Civil Comments | unbiased | 0.94734 | 0.93744 |
| Jigsaw Multilingual Toxic Comment Classification | 2020 | build effective multilingual models | Wikipedia Comments + Civil Comments | multilingual | 0.9536 | 0.91655* |

*Score not directly comparable since it is obtained on the validation set provided and not on the test set. To update when the test labels are made available.

Note that the top leaderboard scores were achieved using model ensembles. The purpose of this library was to build something user-friendly and straightforward to use.

Limitations and ethical considerations

If words associated with swearing, insults or profanity are present in a comment, it is likely to be classified as toxic regardless of the author's tone or intent (e.g. humorous or self-deprecating). This could introduce biases against already vulnerable minority groups.

The intended use of this library is for research purposes, for fine-tuning on carefully constructed datasets that reflect real-world demographics, and/or for helping content moderators flag harmful content more quickly.

Some useful resources about the risk of different biases in toxicity or hate speech detection are:
  • The Risk of Racial Bias in Hate Speech Detection
  • Automated Hate Speech Detection and the Problem of Offensive Language
  • Racial Bias in Hate Speech and Abusive Language Detection Datasets

Quick prediction


The multilingual model has been trained on 7 different languages, so it should only be tested on English, French, Spanish, Italian, Portuguese, Turkish or Russian.

Install detoxify:

pip install detoxify

Then, in Python:

from detoxify import Detoxify

Each model takes in either a string or a list of strings:

results = Detoxify('original').predict('example text')

results = Detoxify('unbiased').predict(['example text 1','example text 2'])

input_text = ['example text','exemple de texte','texto de ejemplo','testo di esempio','texto de exemplo','örnek metin','пример текста']
results = Detoxify('multilingual').predict(input_text)

To specify the device the model will be allocated on (defaults to CPU), pass any valid torch.device input:

model = Detoxify('original', device='cuda')

Optionally, display the results nicely (requires pip install pandas):

import pandas as pd

print(pd.DataFrame(results, index=input_text).round(5))
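Once predictions are available, a common next step is to flag the comments whose scores pass a threshold. The sketch below is not part of Detoxify: it assumes that, for a list input, predict returns a dict mapping each label to a list of per-comment scores, and the helper flag_comments and its 0.5 cut-off are hypothetical choices for illustration.

```python
from typing import Dict, List


def flag_comments(
    texts: List[str],
    results: Dict[str, List[float]],
    threshold: float = 0.5,  # hypothetical cut-off; tune it on your own data
) -> List[Dict[str, object]]:
    """For each comment, list the labels whose score is at or above `threshold`.

    Assumes `results` maps label -> list of scores aligned with `texts`.
    """
    for label, scores in results.items():
        if len(scores) != len(texts):
            raise ValueError(f"label {label!r} has {len(scores)} scores for {len(texts)} texts")
    flagged = []
    for i, text in enumerate(texts):
        labels = sorted(label for label, scores in results.items() if scores[i] >= threshold)
        flagged.append({"text": text, "labels": labels})
    return flagged


if __name__ == "__main__":
    # Made-up scores in the assumed output shape, not real model outputs.
    texts = ["example text 1", "example text 2"]
    results = {"toxicity": [0.02, 0.91], "insult": [0.01, 0.64], "threat": [0.00, 0.03]}
    for row in flag_comments(texts, results):
        print(row)
```

Given the limitations discussed earlier, a single global threshold is a blunt instrument; per-label thresholds chosen on your own data are likely a better starting point.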

For more details check the Prediction section.


All challenges have a toxicity label. The toxicity labels represent the aggregate ratings of up to 10 annotators according to the following schema:
  • Very Toxic (a very hateful, aggressive, or disrespectful comment that is very likely to make you leave a discussion or give up on sharing your perspective)
  • Toxic (a rude, disrespectful, or unreasonable comment that is somewhat likely to make you leave a discussion or give up on sharing your perspective)
  • Hard to Say
  • Not Toxic
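To make the aggregation concrete, the sketch below turns one comment's individual ratings into a single score. It assumes the aggregate is the fraction of annotators who chose Toxic or Very Toxic, with Hard to Say counted as a non-toxic vote. That is our reading of the Civil Comments convention rather than something stated in this README, so check the labelling schema documentation before relying on it; aggregate_toxicity is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable

# Assumption: these two ratings count as "toxic" votes; the others count as non-toxic.
TOXIC_RATINGS = {"Very Toxic", "Toxic"}
ALL_RATINGS = TOXIC_RATINGS | {"Hard to Say", "Not Toxic"}


def aggregate_toxicity(ratings: Iterable[str]) -> float:
    """Return the fraction of annotators who rated a comment Toxic or Very Toxic."""
    counts = Counter(ratings)
    unknown = set(counts) - ALL_RATINGS
    if unknown:
        raise ValueError(f"unexpected ratings: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one rating is required")
    return sum(counts[r] for r in TOXIC_RATINGS) / total


if __name__ == "__main__":
    # Four annotators, two of whom chose a toxic rating.
    print(aggregate_toxicity(["Toxic", "Not Toxic", "Very Toxic", "Hard to Say"]))  # 0.5
```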

More information about the labelling schema can be found here.

Toxic Comment Classification Challenge

This challenge includes the following labels:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate

Jigsaw Unintended Bias in Toxicity Classification

This challenge has 2 types of labels: the main toxicity labels and some additional identity labels that represent the identities mentioned in the comments.

Only identities with more than 500 examples in the test set (combined public and private) are included during training as additional labels and in the evaluation calculation.

  • toxicity
  • severe_toxicity
  • obscene
  • threat
  • insult
  • identity_attack
  • sexual_explicit

Identity labels used: -


A complete list of all the identity labels available can be found here.
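The 500-example rule for identity labels can be sketched as a simple count over the combined public and private test rows. The row format (one dict per comment), the identity column names and the 0.5 cut-off for deciding that an identity is mentioned are all assumptions made for illustration; select_identities is a hypothetical helper.

```python
import math
from typing import Dict, List, Optional, Sequence


def select_identities(
    rows: Sequence[Dict[str, Optional[float]]],
    identity_columns: List[str],
    min_examples: int = 500,
    mention_threshold: float = 0.5,  # assumption: score at which an identity counts as mentioned
) -> List[str]:
    """Keep the identity columns mentioned in more than `min_examples` rows.

    `rows` should hold the combined public and private test sets; missing or NaN
    identity annotations are treated as "not mentioned".
    """
    selected = []
    for column in identity_columns:
        mentions = 0
        for row in rows:
            value = row.get(column)
            if value is not None and not math.isnan(value) and value >= mention_threshold:
                mentions += 1
        if mentions > min_examples:
            selected.append(column)
    return selected


if __name__ == "__main__":
    # Toy data with hypothetical column names and a lowered min_examples.
    rows = [{"identity_a": 1.0, "identity_b": 0.0}] * 3 + [{"identity_a": float("nan"), "identity_b": 1.0}]
    print(select_identities(rows, ["identity_a", "identity_b"], min_examples=2))  # ['identity_a']
```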

Jigsaw Multilingual Toxic Comment Classification

Since this challenge combines the data from the previous 2 challenges, it includes all the labels above; however, the final evaluation is only on:

  • toxicity

How to run

First, install dependencies.

Clone the project:

git clone

Create a virtual environment:

python3 -m venv toxic-env
source toxic-env/bin/activate

Install the project:

pip install -e detoxify
cd detoxify

For training, also install:

pip install -r requirements.txt

## Prediction

Trained models summary:

| Model name | Transformer type | Data from |
|-|-|-|
| original | bert-base-uncased | Toxic Comment Classification Challenge |
| unbiased | roberta-base | Unintended Bias in Toxicity Classification |
| multilingual | xlm-roberta-base | Multilingual Toxic Comment Classification |

For a quick prediction, you can run the example script on a single comment directly or on a .txt file containing a list of comments.


Load the model via torch.hub:

python --input 'example' --model_name original

Load the model from a checkpoint path:

python --input 'example' --from_ckpt_path model_path

Save results to a .csv file:

python --input test_set.txt --model_name original --save_to results.csv

To see usage:

python --help
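The example script's implementation isn't shown in this README. As a rough sketch of what the .txt to .csv round trip involves, the helpers below assume one comment per line in the input file and that predict returns a dict mapping each label to a list of scores for a list of strings; read_comments and save_results are hypothetical names, not part of Detoxify.

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_comments(path: Path) -> List[str]:
    """Read one comment per line, skipping blank lines (assumed input format)."""
    with path.open(encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def save_results(path: Path, texts: List[str], results: Dict[str, List[float]]) -> None:
    """Write one CSV row per comment: the text, then one score column per label."""
    labels = list(results)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", *labels])
        for i, text in enumerate(texts):
            writer.writerow([text, *(results[label][i] for label in labels)])


# Hypothetical usage with Detoxify installed:
#   texts = read_comments(Path("test_set.txt"))
#   results = Detoxify("original").predict(texts)
#   save_results(Path("results.csv"), texts, results)
```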

Checkpoints can be downloaded from the latest release or via the PyTorch Hub API with the following names:

import torch

model = torch.hub.load('unitaryai/detoxify','toxic_bert')

Importing detoxify in Python:

from detoxify import Detoxify

results = Detoxify('original').predict('some text')

results = Detoxify('unbiased').predict(['example text 1','example text 2'])

input_text = ['example text','exemple de texte','texto de ejemplo','testo di esempio','texto de exemplo','örnek metin','пример текста']
results = Detoxify('multilingual').predict(input_text)

To display results nicely:

import pandas as pd

print(pd.DataFrame(results, index=input_text).round(5))



If you do not already have a Kaggle account:

  • you need to create one to be able to download the data

  • go to My Account and click on Create New API Token; this will download a kaggle.json file

  • make sure this file is located in ~/.kaggle
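The Kaggle CLI reads its token from this location, so a quick sanity check before downloading can save a failed run. The sketch below assumes the token is a JSON object with username and key fields (the format produced by Create New API Token) and that you are on a POSIX system for the permission check; check_kaggle_token is a hypothetical helper.

```python
import json
import stat
from pathlib import Path
from typing import List

DEFAULT_TOKEN = Path.home() / ".kaggle" / "kaggle.json"


def check_kaggle_token(path: Path = DEFAULT_TOKEN) -> List[str]:
    """Return a list of problems with the Kaggle API token; an empty list means it looks usable."""
    if not path.is_file():
        return [f"{path} not found: create a new API token on Kaggle"]
    try:
        token = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} should contain a JSON object"]
    problems = [f"{path} is missing '{key}'" for key in ("username", "key") if key not in token]
    # POSIX only: flag tokens that other users on the machine could read.
    if path.stat().st_mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append(f"{path} is readable by other users; consider chmod 600")
    return problems


if __name__ == "__main__":
    for message in check_kaggle_token() or ["kaggle.json looks fine"]:
        print(message)
```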

Create a data directory:

mkdir jigsaw_data
cd jigsaw_data

Download the data:

kaggle competitions download -c jigsaw-toxic-comment-classification-challenge

kaggle competitions download -c jigsaw-unintended-bias-in-toxicity-classification

kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification
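Each download command should leave a zip archive in jigsaw_data. A minimal sketch for unpacking them, assuming one archive per competition and that a sub-folder named after each archive is an acceptable layout; the folder structure the training configs expect is not specified here, so adjust the paths to match them. extract_archives is a hypothetical helper.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archives(data_dir: Path) -> List[Path]:
    """Extract every .zip in data_dir into a sub-folder named after the archive."""
    extracted = []
    for archive in sorted(data_dir.glob("*.zip")):
        target = data_dir / archive.stem
        target.mkdir(exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    for folder in extract_archives(Path("jigsaw_data")):
        print("extracted to", folder)
```

Some competition archives may themselves contain zipped CSV files; running the function again on each extracted folder would unpack those too.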

Start Training

### Toxic Comment Classification Challenge


python --config configs/Toxic_comment_classification_BERT.json

### Unintended Bias in Toxicity Challenge

python --config configs/Unintended_bias_toxic_comment_classification_RoBERTa.json

### Multilingual Toxic Comment Classification

This is trained in 2 stages. First, train on all available data, and second, train only on the translated versions of the first challenge.

The translated data can be downloaded from Kaggle in French, Spanish, Italian, Portuguese, Turkish and Russian (the languages available in the test set).

Stage 1:

python --config configs/Multilingual_toxic_comment_classification_XLMR.json

Stage 2:

python --config configs/Multilingual_toxic_comment_classification_XLMR_stage2.json --resume path_to_saved_checkpoint_stage1

Monitor progress with TensorBoard:

tensorboard --logdir=./saved

Model Evaluation

Toxic Comment Classification Challenge

This challenge is evaluated on the mean AUC score of all the labels.

python --checkpoint saved/lightning_logs/checkpoints/example_checkpoint.pth --test_csv test.csv
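For reference, the mean AUC can be reproduced without any dependencies. This sketch computes each label's ROC AUC through the Mann-Whitney rank statistic (tied scores share their average rank) and averages over labels; it illustrates the metric and is not the repository's evaluation code. roc_auc and mean_auc are hypothetical names.

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC via the Mann-Whitney U statistic, with average ranks for ties."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_auc(y_true: Dict[str, Sequence[int]], y_score: Dict[str, Sequence[float]]) -> float:
    """Average of the per-label ROC AUCs."""
    aucs = [roc_auc(y_true[label], y_score[label]) for label in y_true]
    return sum(aucs) / len(aucs)
```

Applied to the toxic label alone, the same roc_auc gives the kind of score used for the multilingual challenge.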

Unintended Bias in Toxicity Challenge

This challenge is evaluated on a novel bias metric that combines different AUC scores to balance overall performance. More information on this metric here.

python --checkpoint saved/lightning_logs/checkpoints/example_checkpoint.pth --test_csv test.csv

To get the final bias metric:

python model_eval/
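As we understand the competition's published definition, the bias metric is a weighted sum (weights of 0.25) of the overall AUC and generalized power means, with p = -5, of three per-identity AUCs: the subgroup AUC, the BPSN AUC (background positive, subgroup negative) and the BNSP AUC (background negative, subgroup positive). The sketch below implements that reading on binary labels and identity masks that are assumed to be already thresholded. It uses a quadratic pairwise AUC for brevity, so treat it as a cross-check rather than a replacement for the evaluation script; all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly, ties count 1/2."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def power_mean(values: Sequence[float], p: float = -5.0) -> float:
    """Generalized mean; a negative p pulls the result toward the worst values."""
    if min(values) == 0.0:
        return 0.0  # limit of the power mean for p < 0 when any value is 0
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def _auc_where(y_true: Sequence[int], y_score: Sequence[float], keep: List[bool]) -> float:
    """AUC restricted to the examples where `keep` is True."""
    idx = [i for i, k in enumerate(keep) if k]
    return roc_auc([y_true[i] for i in idx], [y_score[i] for i in idx])


def bias_metric(
    y_true: Sequence[int],
    y_score: Sequence[float],
    identities: Dict[str, Sequence[bool]],
    p: float = -5.0,
    weight: float = 0.25,
) -> float:
    """Weighted sum of the overall AUC and power means of per-identity AUCs."""
    is_pos = [bool(y) for y in y_true]
    subgroup, bpsn, bnsp = [], [], []
    for mask in identities.values():
        in_group = [bool(m) for m in mask]
        subgroup.append(_auc_where(y_true, y_score, in_group))
        # BPSN: toxic background examples + non-toxic subgroup examples.
        bpsn.append(_auc_where(y_true, y_score, [g != t for g, t in zip(in_group, is_pos)]))
        # BNSP: non-toxic background examples + toxic subgroup examples.
        bnsp.append(_auc_where(y_true, y_score, [g == t for g, t in zip(in_group, is_pos)]))
    overall = roc_auc(y_true, y_score)
    return weight * overall + weight * (
        power_mean(subgroup, p) + power_mean(bpsn, p) + power_mean(bnsp, p)
    )
```

Each identity subset must contain both toxic and non-toxic examples; otherwise roc_auc raises a ValueError, which is one reason rare identities are excluded.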

Multilingual Toxic Comment Classification

This challenge is evaluated on the AUC score of the main toxic label.

python --checkpoint saved/lightning_logs/checkpoints/example_checkpoint.pth --test_csv test.csv


  author={Hanu, Laura and {Unitary team}},
