Description

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization


You Only Watch Once (YOWO)

PyTorch implementation of the article "You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization". The repository contains code for real-time spatiotemporal action localization with PyTorch on the AVA, UCF101-24 and J-HMDB-21 datasets!

The updated paper can be accessed via YOWO_updated.pdf.

AVA dataset visualizations!

[Figures: ava_example_1, ava_example_2, ava_example_3]

UCF101-24 and J-HMDB-21 dataset visualizations!

[Figures: biking, fencing, golf-swing, catch, brush-hair, pull-up]

In this work, we present YOWO (You Only Watch Once), a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage framework: the input is a clip consisting of several successive frames of a video, and the output is a set of bounding box positions with corresponding class labels for the current frame. Afterwards, with a specific linking strategy, these detections can be linked together to generate action tubes over the whole video.
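
The clip-in, detections-out flow described above can be illustrated with a small sketch. It is not the repository's code: `clip_indices`, `iou` and `link_greedy` are hypothetical helpers, clamping at the start of the video is an assumption, and the greedy IoU linking is a simplified stand-in for whatever linking strategy the paper actually uses.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def clip_indices(current: int, clip_len: int) -> List[int]:
    """Frame indices of the clip that ends at the current frame.

    Assumption: indices before the start of the video are clamped to 0.
    """
    return [max(0, i) for i in range(current - clip_len + 1, current + 1)]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_greedy(frames: List[List[Box]], min_iou: float = 0.5):
    """Greedily link per-frame boxes into tubes of (frame_index, box) pairs.

    A tube is extended only by a box in the very next frame whose IoU with
    the tube's last box is at least `min_iou`; unmatched boxes start new tubes.
    """
    tubes: List[List[Tuple[int, Box]]] = []
    for t, boxes in enumerate(frames):
        unused = list(boxes)
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unused:
                continue
            best = max(unused, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= min_iou:
                tube.append((t, best))
                unused.remove(best)
        tubes.extend([[(t, b)] for b in unused])
    return tubes
```

For example, `clip_indices(20, 16)` covers frames 5 through 20, and two heavily overlapping boxes in consecutive frames end up in the same tube while a distant box starts a new one.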

Since we do not separate the human detection and action classification procedures, the whole network can be optimized with a joint loss in an end-to-end framework. We have carried out a series of comparative evaluations on two challenging, representative datasets, UCF101-24 and J-HMDB-21. Our approach outperforms the other state-of-the-art results while retaining real-time capability, providing 34 frames per second on 16-frame input clips and 62 frames per second on 8-frame input clips.
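
To put the reported speeds in perspective, the short calculation below converts frames per second into time per processed frame. The helper `ms_per_frame` is hypothetical; only the two speed figures come from the text above.

```python
def ms_per_frame(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


# Speeds reported above for 16-frame and 8-frame input clips.
for clip_len, fps in [(16, 34), (8, 62)]:
    print(f"{clip_len}-frame clips: {fps} fps = {ms_per_frame(fps):.1f} ms per frame")
```

This gives roughly 29.4 ms per frame for 16-frame clips and 16.1 ms per frame for 8-frame clips.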

Installation

git clone https://github.com/wei-tim/YOWO.git
cd YOWO

Datasets

  • AVA: Download from here
  • UCF101-24: Download from here
  • J-HMDB-21: Download from here

Use the instructions here for the preparation of the AVA dataset.

Modify the paths in ucf24.data and jhmdb21.data under the cfg directory accordingly. Download the dataset annotations from here.

Download backbone pretrained weights

  • Darknet-19 weights can be downloaded via:

    wget http://pjreddie.com/media/files/yolo.weights

  • ResNeXt and ResNet pretrained models can be downloaded from here.

NOTE: For J-HMDB-21 training, pretrained models fine-tuned on HMDB-51 should be used (e.g. "resnext-101-kinetics-hmdb51_split1.pth").

  • For resource efficient 3D CNN architectures (ShuffleNet, ShuffleNetv2, MobileNet, MobileNetv2), pretrained models can be downloaded from here.

Pretrained YOWO models

Pretrained models for UCF101-24 and J-HMDB-21 datasets can be downloaded from here.

Pretrained models for AVA dataset can be downloaded from here.

All materials (annotations and pretrained models) are also available in Baiduyun Disk: here with password 95mm

Running the code

  • All training configurations are given in ava.yaml, ucf24.yaml and jhmdb.yaml files.
  • AVA training:

    python main.py --cfg cfg/ava.yaml

  • UCF101-24 training:

    python main.py --cfg cfg/ucf24.yaml

  • J-HMDB-21 training:

    python main.py --cfg cfg/jhmdb.yaml
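
If training is launched from a script rather than by hand, the three commands above can be wrapped in a small helper. This is a sketch only: the dataset keys in `CONFIGS` and the function names are hypothetical, while the `main.py --cfg ...` invocation and the config paths are taken from the commands above. It assumes the working directory is the repository root.

```python
import subprocess
import sys
from typing import List

# Hypothetical short names mapped to the config files listed above.
CONFIGS = {
    "ava": "cfg/ava.yaml",
    "ucf24": "cfg/ucf24.yaml",
    "jhmdb": "cfg/jhmdb.yaml",
}


def training_command(dataset: str) -> List[str]:
    """Build the argument list for training on the given dataset."""
    if dataset not in CONFIGS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(CONFIGS)}")
    return [sys.executable, "main.py", "--cfg", CONFIGS[dataset]]


def train(dataset: str) -> None:
    """Run training as a subprocess and raise if it exits with an error."""
    subprocess.run(training_command(dataset), check=True)
```

Calling `train("ucf24")` would then be equivalent to running `python main.py --cfg cfg/ucf24.yaml` with the current interpreter.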
    

Validating the model

  • For the AVA dataset, validation is performed after each epoch and the frame-mAP score is reported.

  • For the UCF101-24 and J-HMDB-21 datasets, frame detections are recorded under 'jhmdbdetections' or 'ucfdetections' after each validation. The ground-truth archive for the dataset (e.g. 'groundtruthsjhmdb.zip') should be downloaded from here and extracted to "evaluation/Object-Detection-Metrics". Then, run the following command to calculate frame-mAP.

python evaluation/Object-Detection-Metrics/pascalvoc.py --gtfolder PATH-TO-GROUNDTRUTHS-FOLDER --detfolder PATH-TO-DETECTIONS-FOLDER

  • For video-mAP, set the pretrained model in the correct yaml file and run:

    ```bash
    python videomAP.py --cfg cfg/ucf24.yaml
    ```
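
For readers unfamiliar with frame-mAP, the sketch below shows how average precision for a single class can be computed from per-frame detections. It is a simplified illustration, not the implementation in `pascalvoc.py`: the function names, the data layout, the IoU threshold of 0.5 and the all-point interpolation are all assumptions. Frame-mAP would then be the mean of this value over all action classes.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    dets: List[Tuple[str, float, Box]],  # (frame_id, score, box) for one class
    gts: Dict[str, List[Box]],           # frame_id -> ground-truth boxes
    thr: float = 0.5,                    # assumed IoU threshold
) -> float:
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = {f: [False] * len(boxes) for f, boxes in gts.items()}
    hits = []
    # Visit detections from most to least confident.
    for frame_id, _, box in sorted(dets, key=lambda d: -d[1]):
        best_j, best = -1, 0.0
        for j, gt in enumerate(gts.get(frame_id, [])):
            overlap = iou(box, gt)
            if overlap > best:
                best_j, best = j, overlap
        # A ground-truth box can be claimed by only one detection.
        hit = best_j >= 0 and best >= thr and not matched[frame_id][best_j]
        if hit:
            matched[frame_id][best_j] = True
        hits.append(hit)
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += int(hit)
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # Make precision non-increasing in recall, then integrate the curve.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A detection counts as correct only if it overlaps an unclaimed ground-truth box in the same frame by at least the threshold, so duplicate detections of the same person lower the precision.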

Running on a test video

  • You can run the AVA pretrained model on any test video with the following code:

    python test_video_ava.py --cfg cfg/ava.yaml
    

UPDATEs:

  • YOWO is extended for AVA dataset.
  • Old repo is deprecated and moved to YOWO_deprecated branch.

Citation

If you use this code or pre-trained models, please cite the following:

@InProceedings{kopuklu2019yowo,
title={You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization},
author={K{\"o}p{\"u}kl{\"u}, Okan and Wei, Xiangyu and Rigoll, Gerhard},
journal={arXiv preprint arXiv:1911.06644},
year={2019}
}

Acknowledgements

We thank Hang Xiao for releasing the pytorch_yolo2 codebase, on top of which we built our work.
