by lars76

Object localization in images using simple CNNs and Keras

MIT License


This project shows how to localize objects in images by using simple convolutional neural networks.


Before getting started, download the dataset and generate a CSV file containing the annotations (bounding boxes).

  1. Download The Oxford-IIIT Pet Dataset
  2. Download The Oxford-IIIT Pet Dataset Annotations
  3. tar xf images.tar.gz
  4. tar xf annotations.tar.gz
  5. mv annotations/xmls/* images/
  6. python3 generate_dataset.py
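To give an idea of what the last step involves, here is a minimal sketch of turning one annotation file into CSV rows. It assumes the XML files follow the Pascal VOC layout (`filename`, `size`, `object/bndbox`); the sample XML, the column order, and the helper names `parse_boxes` and `rows_to_csv` are made up for illustration and are not necessarily what `generate_dataset.py` does.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC style (values are illustrative only).
SAMPLE_XML = """<annotation>
  <filename>example_1.jpg</filename>
  <size><width>600</width><height>400</height><depth>3</depth></size>
  <object>
    <name>cat</name>
    <bndbox><xmin>333</xmin><ymin>72</ymin><xmax>425</xmax><ymax>158</ymax></bndbox>
  </object>
</annotation>"""

# Assumed column layout; the real script may use a different one.
HEADER = ["filename", "width", "height", "xmin", "ymin", "xmax", "ymax", "class"]


def parse_boxes(xml_text):
    """Return one row per annotated object in a single XML document."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        rows.append([filename, width, height, *coords, obj.findtext("name")])
    return rows


def rows_to_csv(rows):
    """Serialize the rows (plus a header line) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


print(rows_to_csv(parse_boxes(SAMPLE_XML)))
```

Storing the image width and height next to each box makes it easy to rescale the coordinates later when the images are resized for the network.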

Single-object detection

Example 1: Finding dogs/cats


First, let's look at YOLOv2's approach:

  1. Pretrain Darknet-19 on ImageNet (feature extractor)
  2. Remove the last convolutional layer
  3. Add three 3 x 3 convolutional layers with 1024 filters
  4. Add a 1 x 1 convolutional layer with the number of outputs needed for detection

We proceed in the same way to build the object detector:

  1. Choose a model from Keras Applications to act as the feature extractor
  2. Remove the dense layer
  3. Freeze some/all/no layers
  4. Add one/multiple/no convolution block (or
    for MobileNetv2)
  5. Add a convolution layer for the coordinates
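The steps above can be sketched roughly as follows, using the MobileNetv2 settings this README mentions (alpha = 0.35, 96x96 input). This is an illustration, not the repository's training script: the function name `build_localizer`, the layer name `coords`, and the choice of a single convolution whose kernel covers the whole feature map are assumptions. `weights=None` is used so that nothing is downloaded; pass `weights="imagenet"` to start from pretrained features.

```python
import tensorflow as tf


def build_localizer(image_size=96, alpha=0.35, freeze=True, weights=None):
    """Feature extractor without its dense head, plus a conv layer for 4 coordinates."""
    # Steps 1 and 2: pick a Keras Applications model and drop the dense layer.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(image_size, image_size, 3),
        alpha=alpha,
        include_top=False,
        weights=weights,
    )

    # Step 3: freeze all layers of the feature extractor (or none if freeze=False).
    if freeze:
        for layer in base.layers:
            layer.trainable = False

    # Step 5: MobileNetV2 downsamples by 32, so a 96x96 input gives a 3x3 feature map.
    # A convolution with a kernel of that size reduces it to 1x1x4.
    x = tf.keras.layers.Conv2D(4, kernel_size=image_size // 32, name="coords")(base.output)
    x = tf.keras.layers.Reshape((4,))(x)
    return tf.keras.Model(inputs=base.input, outputs=x)


model = build_localizer()
print(model.output_shape)
```

Step 4 (extra convolution blocks) is left out here to keep the sketch short; additional layers would go between `base.output` and the `coords` layer.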

The code in this repository uses MobileNetv2 because it is faster than other models and lets you trade speed for accuracy. For example, if alpha = 0.35 with 96x96 inputs is not accurate enough, increase both values (see here for a comparison). If you use another architecture, change

  1. python3 example_1/train.py
  2. Adjust the WEIGHTS_FILE in `example_1/test.py` (given by the last script)
  3. python3 example_1/test.py


In the following images red is the predicted box, green is the ground truth:

Image 1

Image 2

Example 2: Finding dogs/cats and distinguishing classes

This time we have to run the scripts



In order to distinguish between classes, we have to modify the loss function. Here I use

w_1*log((y_hat - y)^2 + 1) + w_2*FL(p_hat, p)

where w_1 = w_2 = 1 are two weights and

FL(p_hat, p) = -(0.9*(1 - p_hat)^2*p*log(p_hat) + 0.1*p_hat^2*(1 - p)*log(1 - p_hat))

is the focal loss.
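To make the formula concrete, here is a plain-Python sketch for a single sample. Two details are assumptions, since the formula does not spell them out: the box term is summed over the four coordinates, and `p_hat` is clipped by a small `eps` to avoid `log(0)`. The function names are chosen for this example only.

```python
import math


def focal_loss(p_hat, p, eps=1e-7):
    """FL(p_hat, p) with the 0.9 / 0.1 weights and exponent 2 from the formula."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)  # assumed clipping for numerical safety
    positive = 0.9 * (1.0 - p_hat) ** 2 * p * math.log(p_hat)
    negative = 0.1 * p_hat ** 2 * (1.0 - p) * math.log(1.0 - p_hat)
    return -(positive + negative)


def combined_loss(box_hat, box, p_hat, p, w1=1.0, w2=1.0):
    """w1 * log((y_hat - y)^2 + 1), summed over coordinates, plus w2 * focal loss."""
    box_term = sum(math.log((yh - y) ** 2 + 1.0) for yh, y in zip(box_hat, box))
    return w1 * box_term + w2 * focal_loss(p_hat, p)


# A perfect box contributes nothing, so only the class term remains.
print(combined_loss([0.1, 0.2, 0.8, 0.9], [0.1, 0.2, 0.8, 0.9], p_hat=0.5, p=1.0))
```

The log around the squared error dampens the effect of very bad box predictions, while the focal term puts less weight on examples the classifier already gets right.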

Instead of using all 37 classes, the code only outputs class 0 (the original class 0) or class 1 (the original classes 1 to 36). However, it is easy to extend this to more classes: use categorical cross-entropy instead of focal loss and try out different weights.

Multi-object detection

Example 3: Segmentation-like detection


In this example, we use a skip-net architecture similar to U-Net. For an in-depth explanation see my blog post.




Example 4: YOLO-like detection


This example is based on the three YOLO papers. For an in-depth explanation see this blog post.


Multiple dogs


Improve accuracy (IoU)

  • enable augmentations: see
    the same code can be added to the other examples
  • better augmentations: try out different values (flips, rotation etc.)
  • for MobileNetv1/2: increase
    in train_model.py
  • other architectures: increase
  • add more layers
  • try out other loss functions (MAE, smooth L1 loss etc.)
  • other optimizer: SGD with momentum 0.9, adjust learning rate
  • use a feature pyramid
  • read https://github.com/keras-team/keras/pull/9965
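Since the goal of these tips is a higher IoU, it helps to have the metric at hand. The following is a minimal sketch of intersection over union for two boxes, assuming the `(x_min, y_min, x_max, y_max)` corner format; the function name `iou` is chosen for this example and the repository's own evaluation code may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Predicted box vs. ground truth, shifted by one unit in both directions.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Tracking the mean of this value over the validation set gives a direct way to compare the effect of each change in the list above.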

Increase training speed

  • increase
  • less layers,


  • If the new dataset is small and similar to ImageNet, freeze all layers.
  • If the new dataset is small and not similar to ImageNet, freeze some layers.
  • If the new dataset is large, freeze no layers.
  • read http://cs231n.github.io/transfer-learning/
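The three freezing strategies can be expressed with one small helper. This is a sketch assuming a Keras model; the helper name `freeze_layers`, the `fraction` parameter, and the toy model are made up for the example, and how many layers count as "some" is something to tune per dataset.

```python
import tensorflow as tf


def freeze_layers(model, fraction):
    """Freeze the first `fraction` of the layers (0.0 = none, 1.0 = all)."""
    n_frozen = int(round(len(model.layers) * fraction))
    for index, layer in enumerate(model.layers):
        layer.trainable = index >= n_frozen
    return n_frozen


# Toy stand-in for a pretrained feature extractor.
inputs = tf.keras.Input(shape=(8,))
x = inputs
for units in (16, 16, 16, 4):
    x = tf.keras.layers.Dense(units)(x)
toy_model = tf.keras.Model(inputs, x)

# Small dataset, not similar to ImageNet: freeze only the early layers.
frozen = freeze_layers(toy_model, fraction=0.5)
print(frozen, [layer.trainable for layer in toy_model.layers])
```

Use `fraction=1.0` for a small dataset similar to ImageNet and `fraction=0.0` for a large dataset. Change the `trainable` flags before calling `compile`, or compile again afterwards, so that the change takes effect.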
