Virtual walks in Google Street View using PoseNet and applying Deep Learning models to recognize actions.
For the Spanish version, click here.
During the quarantine we are currently experiencing due to the COVID-19 pandemic, our right to move freely on the street is restricted in favour of the common wellbeing. People can only go out in certain situations, such as grocery shopping. Many borders are closed and travelling is almost totally banned in most countries.
Virtual Walks is a project that uses pose estimation models along with LSTM neural networks to simulate walks in Google Street View. For pose estimation, the PoseNet model has been adapted, while for action detection an LSTM model has been developed using TensorFlow 2.0.
This project can simulate walking along streets all over the world with the help of Google Street View.
TensorFlow 2.0, Selenium and Python 3.7 are the main technologies used in this project.
PoseNet has been combined with an LSTM model to infer the action that the person is performing. Once the action is detected, it is passed to the controller, the part that interacts with Google Street View. Four actions can be predicted:
* Stand
* Walk
* Turn right
* Turn left
Currently, there is another model that can be used to run this program. Instead of an LSTM, joint velocities are calculated across the frames in each 5-frame group and passed, along with the joint positions, to a PCA and a feed-forward neural network to predict the action. The default model is the LSTM, as we consider it the methodologically correct one and it is the model with the highest precision.
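To make the velocity-based features more concrete, here is a minimal sketch of how joint velocities could be derived from one 5-frame group and concatenated with the joint positions. It is an illustration under assumptions, not the project's actual code: the function names `joint_velocities` and `build_features`, the flat `[x0, y0, x1, y1, ...]` frame layout and the `fps` parameter are all hypothetical, and the PCA and feed-forward steps are left out.

```python
from typing import List, Sequence

# Assumption: each frame is a flat list of joint coordinates [x0, y0, x1, y1, ...].
Frame = Sequence[float]


def joint_velocities(frames: Sequence[Frame], fps: float = 1.0) -> List[List[float]]:
    """Finite-difference velocities between consecutive frames of one group.

    With fps=1.0 the result is simply the per-frame displacement.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to compute a velocity")
    width = len(frames[0])
    if any(len(frame) != width for frame in frames):
        raise ValueError("all frames must have the same number of coordinates")
    return [
        [(curr - prev) * fps for prev, curr in zip(a, b)]
        for a, b in zip(frames, frames[1:])
    ]


def build_features(frames: Sequence[Frame], fps: float = 1.0) -> List[float]:
    """Concatenate all positions and all velocities into one flat vector."""
    positions = [coord for frame in frames for coord in frame]
    velocities = [v for step in joint_velocities(frames, fps) for v in step]
    return positions + velocities
```

For a group of 5 frames with 17 keypoints each (34 coordinates per frame), this yields 170 position values plus 136 velocity values. A vector of that kind is what a dimensionality reduction step such as PCA would then compress before classification.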
As the action prediction can be (depending on the host computer's specifications) much faster than the average walking speed, an action can only be executed once every 0.5 seconds. This parameter is customizable.
As can be seen in the image, the skeleton is inferred from the image and an action is predicted and executed.
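The rate limit described above can be sketched as a small throttle object. This is only an illustration of the idea, assuming a monotonic clock; the class name `ActionThrottle` and its `min_interval` and `clock` parameters are hypothetical and do not come from the project's code.

```python
import time
from typing import Callable, Optional


class ActionThrottle:
    """Lets an action through at most once every `min_interval` seconds."""

    def __init__(self, min_interval: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the behaviour can be tested without sleeping
        self._last: Optional[float] = None

    def allow(self) -> bool:
        """Return True if enough time has passed since the last allowed action."""
        now = self._clock()
        if self._last is None or now - self._last >= self.min_interval:
            self._last = now
            return True
        return False
```

In a prediction loop, the controller would only be called when `allow()` returns `True`. Predictions arriving sooner are simply dropped, so a fast machine does not move through Street View faster than a slow one.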
Remember that a Webcam is needed to use this program, as actions are predicted from the frames taken with it.
It is recommended to install it in a new Python 3.7 environment to avoid issues and version conflicts.
Install tensorflowjs, required to run ResNet:
pip install tensorflowjs
Clone and install the tensorflowjs graph model converter, following the steps in tfjs-to-tf
Clone the git repository
git clone https://github.com/Moving-AI/virtual-walk.git
Install dependencies by running
pip install -r requirements.txt
Download the required models by running the download_models file. This script will download the PoseNet models (MobileNet and ResNet with both output strides, 16 and 32), the LSTM, the PCA, the scaler and the neural network. The link to download the models separately can be found below.
cd virtual-walk
python3 download_models.py
Finally, you can run execute.py to try it.
Considerations during usage:
Our experience using the model tells us that a slightly bright environment is preferable to a very bright one.
The system is sensitive to the position of the webcam.
To sum up, a position close to the one shown in the GIF should be used.
Training is probably the weakest part of this project, due to our lack of training data and computing power. Our training data consisted of 40 minutes of recordings. In each video, one person performs one specific action for a certain period of time. As discussed in the next steps section, our models tend to overfit even though the system works. An example of the training data can be seen below.
The models we have trained, from which the examples have been generated, can be downloaded by running the download_models file. The images below show the training performance:
If someone wants to train another LSTM model, the DataProcessor class is provided. It can process the videos located in a folder, reading the valid frame numbers from a labels.txt file and generating a CSV file with the training examples. This file can be used in train.py to generate a new LSTM model. The path to this model can then be passed to the WebcamPredictor class so that the system uses the new model.
Generating more training data. In this project we have tried to build what could be considered an MVP; robustness has never been a main goal. As can be seen in the Training section, the model does not appear to overfit, even though LSTMs are very prone to overfitting. However, the training and testing data are very similar, as the videos show people performing "loop" actions. We therefore expect the model to have underlying overfitting that cannot be detected without more videos. Recording more videos in different light conditions would probably make the model more robust and consistent.
Turning right and turning left are not predicted with the same accuracy despite being symmetric actions. A mirror reflection of the coordinates could be used to make the turn predictions more consistent.
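The reflection idea could look roughly like the sketch below: flip the x coordinates, swap the left and right joints, and swap the turn labels, so every recorded turn yields a training example for the opposite turn. It assumes the standard 17-keypoint PoseNet ordering (nose, then left/right eyes, ears, shoulders, elbows, wrists, hips, knees, ankles). The helper names and the label strings `turn_left` and `turn_right` are hypothetical and may not match the ones used in the project.

```python
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float]  # (x, y) in pixels

# (left, right) index pairs in the 17-keypoint PoseNet ordering;
# index 0 is the nose, which has no counterpart.
LEFT_RIGHT_PAIRS = [(i, i + 1) for i in range(1, 17, 2)]

# Hypothetical label names; actions missing from this map keep their label.
MIRRORED_LABEL: Dict[str, str] = {
    "turn_left": "turn_right",
    "turn_right": "turn_left",
}


def mirror_pose(keypoints: List[Keypoint], image_width: float) -> List[Keypoint]:
    """Reflect a pose horizontally and swap left/right joints."""
    if len(keypoints) != 17:
        raise ValueError("expected 17 PoseNet keypoints")
    flipped = [(image_width - x, y) for x, y in keypoints]
    for left, right in LEFT_RIGHT_PAIRS:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


def mirror_example(keypoints: List[Keypoint], label: str,
                   image_width: float) -> Tuple[List[Keypoint], str]:
    """Mirror one training example, swapping the turn labels."""
    return mirror_pose(keypoints, image_width), MIRRORED_LABEL.get(label, label)
```

Swapping the joint indices matters as much as flipping x: without it, the mirrored person's left shoulder would still be stored in the left-shoulder slot while appearing on the other side of the body, which is not a pose the model would see in real footage.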
This project is under the MIT license. See LICENSE for more details.