Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier
This is a set of scripts that automatically collects tens of thousands of images for the following (loosely defined) categories, to be used later for training an image classifier:
- `porn` - pornography images
- `hentai` - hentai images, but also includes pornographic drawings
- `sexy` - sexually explicit images, but not pornography. Think nude photos, playboy, bikini, etc.
- `neutral` - safe for work neutral images of everyday things and people
- `drawings` - safe for work drawings (including anime)
Here is what each script (located under the `scripts` directory) does:
- `1_get_urls_.sh` - iterates through the text files under `scripts/source_urls`, downloading URLs of images for each of the 5 categories above. The Ripme application performs all the heavy lifting. The source URLs are mostly links to various subreddits, but could be any website that Ripme supports. Note: I already ran this script for you, and its outputs are located in the `raw_data` directory. There is no need to rerun it unless you edit the files under `scripts/source_urls`.
- `2_download_from_urls_.sh` - downloads the actual images for the URLs found in the text files in the `raw_data` directory.
- `3_optional_download_drawings_.sh` - (optional) downloads SFW anime images from the Danbooru2018 database.
- `4_optional_download_neutral_.sh` - (optional) downloads SFW neutral images from the Caltech256 dataset.
- `5_create_train_.sh` - creates the `data/train` directory and copies all `*.jpg` and `*.jpeg` files into it from `raw_data`. Also removes corrupted images.
- `6_create_test_.sh` - creates the `data/test` directory and moves `N=2000` random files for each class from `data/train` to `data/test` (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each run moves another `N` images for each class from `data/train` to `data/test`.
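To make the train/test split in the last step concrete, here is a minimal Python sketch of the same idea. It is not the actual `6_create_test_.sh` script (which is a shell script); the function name `move_random_files` and its `seed` parameter are hypothetical and exist only for illustration. It assumes one subdirectory per class under the training directory.

```python
import random
import shutil
from pathlib import Path


def move_random_files(train_dir, test_dir, n, seed=None):
    """Move up to n randomly chosen files per class from train_dir to test_dir.

    Hypothetical helper: assumes train_dir contains one subdirectory per class.
    Returns a dict mapping class name to the number of files moved.
    """
    rng = random.Random(seed)
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    moved = {}
    for class_dir in sorted(p for p in train_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        # Never ask for more files than the class actually has.
        chosen = rng.sample(files, min(n, len(files)))
        target = test_dir / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for f in chosen:
            shutil.move(str(f), str(target / f.name))
        moved[class_dir.name] = len(chosen)
    return moved


# Example call (paths assume the repository layout described above):
# move_random_files("data/train", "data/test", n=2000)
```

Because the files are moved rather than copied, calling the function a second time takes another `n` files per class out of the training set, which matches the "run it multiple times" behaviour described above.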
    $ docker build . -t docker_nsfw_data_scraper
    Sending build context to Docker daemon  426.3MB
    Step 1/3 : FROM ubuntu:18.04
     ---> 775349758637
    Step 2/3 : RUN apt update && apt upgrade -y && apt install wget rsync imagemagick default-jre -y
     ---> Using cache
     ---> b2129908e7e2
    Step 3/3 : ENTRYPOINT ["/bin/bash"]
     ---> Using cache
     ---> d32c5ae5235b
    Successfully built d32c5ae5235b
    Successfully tagged docker_nsfw_data_scraper:latest
    $ # Next command might run for several hours. It is recommended to leave it overnight
    $ docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
    Getting images for class: neutral
    ...
    ...
    $ ls data
    test train
    $ ls data/train/
    drawings hentai neutral porn sexy
    $ ls data/test/
    drawings hentai neutral porn sexy
To train a classifier on the collected data:
- Install fastai: `conda install -c pytorch -c fastai fastai`
- Run `train_model.ipynb` top to bottom
I was able to train a CNN classifier to 91% accuracy with the following confusion matrix:
As expected, `drawings` and `hentai` are confused with each other more frequently than with other classes. The same holds for the `porn` and `sexy` categories.
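For readers who want to check this kind of confusion on their own model, here is a minimal sketch of how a confusion matrix and accuracy can be computed from true and predicted labels. It uses only the standard library and is not taken from `train_model.ipynb`; the helper names are hypothetical, and the toy labels are invented for illustration, not results from this project.

```python
from collections import Counter

CLASSES = ("drawings", "hentai", "neutral", "porn", "sexy")


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Return a matrix whose rows are true classes and columns are predictions."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correct predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels, invented purely for illustration (not results from this project).
y_true = ["drawings", "hentai", "hentai", "porn", "sexy", "neutral"]
y_pred = ["drawings", "drawings", "hentai", "sexy", "sexy", "neutral"]

matrix = confusion_matrix(y_true, y_pred)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>8}: {row}")
print(f"accuracy: {accuracy(matrix):.2f}")
```

Off-diagonal entries show which pairs of classes get mixed up: a large count in the `hentai` row under the `drawings` column, for example, means hentai images were often predicted as drawings.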