About the developer

ayush1997

Description

YouTube Like Count Predictions using Machine Learning

YouTube Like Count Predictor

This is a tool for predicting the like count of YouTube videos. A Random Forest model was trained on a large dataset of ~350,000 videos. Feature engineering, data cleaning, data selection, and many other techniques were used for this task.

Report

Report.pdf contains a detailed explanation of the different steps and techniques that were used for this task.

Tools Used

How to run :

  1. Clone this repo

      $ git clone https://github.com/ayush1997/YouTube-Like-predictor.git
      $ cd YouTube-Like-predictor
    
  2. Create new virtual environment

      $ sudo pip install virtualenv
      $ virtualenv venv
      $ source venv/bin/activate
      $ pip install -r requirements.txt
    
  3. Predictions

    There are two ways for getting the prediction results.

    3.1. Train the model and run prediction

    $ cd model
    $ python train_model.py
    

    This will save a model-final file in the same folder. Training takes ~18 minutes. Then run

    $ python predict.py 
    

    For example:

    $ python predict.py dOyJqGtP-wU ASO_zypdnsQ wEduiMyl0ko

    3.2. From a pretrained model

    A pretrained model has been uploaded to Dropbox. Download the model (~500 MB) from the link.

    Unzip the model-final file in the model folder.

    $ cd model
    $ python predict.py 

    For example:

    $ python predict.py vid1 vid2 vid3

Note: The list can contain at most 40 video IDs per run.
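
The 40-ID limit can be checked before any prediction is attempted. The following is a minimal sketch, not the code in predict.py: the helper name `parse_video_ids` is hypothetical, and the 11-character ID pattern is an assumption based on the example IDs shown above.

```python
import re

MAX_IDS = 40  # limit stated in the note above
# Assumption: video IDs are 11 characters from [A-Za-z0-9_-], like the examples.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def parse_video_ids(args):
    """Validate command-line arguments as video IDs (hypothetical helper)."""
    if not args:
        raise ValueError("expected at least one video ID")
    if len(args) > MAX_IDS:
        raise ValueError(f"got {len(args)} video IDs, the maximum is {MAX_IDS}")
    bad = [a for a in args if not ID_PATTERN.fullmatch(a)]
    if bad:
        raise ValueError(f"not valid video IDs: {bad}")
    return list(args)


ids = parse_video_ids(["dOyJqGtP-wU", "ASO_zypdnsQ", "wEduiMyl0ko"])
```

In a script, `parse_video_ids(sys.argv[1:])` would apply the same check to the IDs passed on the command line.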

Code Details

Below is a brief description of the code files and folders in the repo.

data/

This folder contains the scripts that were used to fetch data using the YouTube API and populate the database.

$ cd data

get_IDS.py

The script uses the YouTube Search API to extract video IDs for the last 7 years (2010-2016). It gives approximately 22,000-24,000 video IDs for every category and stores them in separate pickle files for the different categories.

$ python get_IDS.py
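
The per-year, per-category collection described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of get_IDS.py: `year_windows`, `save_ids`, and the `ids_<category>.pkl` file naming are hypothetical, and the API call itself is omitted. The timestamps use the RFC 3339 format that the YouTube Data API search endpoint accepts for its `publishedAfter` and `publishedBefore` parameters.

```python
import os
import pickle
import tempfile


def year_windows(first_year=2010, last_year=2016):
    """Return (publishedAfter, publishedBefore) RFC 3339 pairs, one per year."""
    return [
        (f"{year}-01-01T00:00:00Z", f"{year + 1}-01-01T00:00:00Z")
        for year in range(first_year, last_year + 1)
    ]


def save_ids(ids_by_category, out_dir):
    """Write one pickle file per category (hypothetical naming scheme)."""
    paths = {}
    for category, ids in ids_by_category.items():
        path = os.path.join(out_dir, f"ids_{category}.pkl")
        with open(path, "wb") as fh:
            # De-duplicate while keeping first-seen order.
            pickle.dump(list(dict.fromkeys(ids)), fh)
        paths[category] = path
    return paths


windows = year_windows()
# Placeholder IDs stand in for the search results of each category.
out_dir = tempfile.mkdtemp()
paths = save_ids({"music": ["vid1", "vid2", "vid1"], "sports": ["vid3"]}, out_dir)
```

Querying one year at a time keeps each result set small, and writing one file per category means a failed run only has to be repeated for the affected category.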

scrape_video.py

The script uses the video IDs saved by get_IDS.py to extract further video-related attributes using the YouTube API, and saves the data dictionary in pickle format.

$ python scrape_video.py

scrape_channel.py

The script is used to collect further data for all channels present in the video dataset. It uses the stored video data to extract the channel IDs.

$ python scrape_channel.py
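
Extracting the channel IDs from the stored video data might look like the sketch below. The layout of the saved dictionary is not documented here, so the structure (a dictionary keyed by video ID whose values hold a `channelId` field) and the helper name `unique_channel_ids` are assumptions made for illustration.

```python
def unique_channel_ids(videos):
    """Collect distinct channel IDs from a {video_id: attributes} dictionary.

    Assumption: each attributes dict stores the channel under "channelId".
    """
    seen = {}
    for attrs in videos.values():
        channel_id = attrs.get("channelId")
        if channel_id:
            seen.setdefault(channel_id, None)  # dict keeps first-seen order
    return list(seen)


videos = {
    "vid1": {"channelId": "chanA", "likeCount": 10},
    "vid2": {"channelId": "chanB", "likeCount": 3},
    "vid3": {"channelId": "chanA", "likeCount": 7},
    "vid4": {"likeCount": 1},  # a video without a channel entry is skipped
}
channel_ids = unique_channel_ids(videos)
```

De-duplicating first means each channel is requested only once, however many of its videos appear in the dataset.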

scrape_social.py

The script is used to scrape social links.

$ python scrape_social.py

Note: Because of the large amount of data to be extracted for the different attributes, the extraction was done at different levels. It was therefore not viable to write a single script for data collection, which could make debugging a little messy.

notebook/

This folder contains IPython notebooks that implement the merging of the different extracted data and tasks such as data cleaning and processing.

$ jupyter notebook

FeatureEngineering.ipynb

The notebook has the implementation for making new derived features.

DataProcessing.ipynb

This notebook implements the data cleaning and encoding steps.

Note: The final data generated after all processing has been uploaded in dataset/data.csv. dataset/data_final.csv has the data which is used for training the model.

model/

This folder contains the scripts used for training and tuning the model and for getting the prediction results.

model_grid.py

This script generates the tuned parameters for estimator using Grid Search and Cross Validation.

$ python model_grid.py
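
The idea behind grid search with cross-validation can be shown without any library: every parameter combination is scored on held-out folds, and the combination with the lowest mean error wins. The sketch below is a pure-Python illustration, not the contents of model_grid.py. The toy `fit_predict` model and its `shrink` and `offset` parameters are hypothetical stand-ins for a random forest and its hyperparameters.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n, k):
    """Split range(n) into k (train, test) index lists with interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


def fit_predict(train_y, shrink, offset):
    """Toy stand-in model: predict a scaled training mean plus an offset."""
    return shrink * mean(train_y) + offset


def grid_search(y, grid, k=3):
    """Return (best_params, best_score) by mean absolute error over k folds."""
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        fold_errors = []
        for train_idx, test_idx in k_fold_indices(len(y), k):
            pred = fit_predict([y[i] for i in train_idx], **params)
            fold_errors.append(mean(abs(y[i] - pred) for i in test_idx))
        score = mean(fold_errors)
        if best is None or score < best[1]:
            best = (params, score)
    return best


likes = [10, 12, 11, 13, 9, 12, 10, 11, 12]
best_params, best_score = grid_search(likes, {"shrink": [0.5, 1.0], "offset": [0, 3]})
```

Scoring on folds that were not used for fitting is what keeps the search from simply picking the parameters that memorize the training data.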

train_model.py

This script is used for training the model on the training data (dataset/data_final.csv). Because of bootstrap sampling in the random forest, the results might vary after every training run.

$ python train_model.py
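
The run-to-run variation comes from bootstrap sampling: each tree is trained on rows drawn with replacement, so a different random draw gives a different training set. The sketch below illustrates this with the standard library only. `bootstrap_sample` is a hypothetical helper, and fixing the seed is shown as one common way to make the draws repeatable, not as something train_model.py is known to do.

```python
import random


def bootstrap_sample(rows, seed=None):
    """Draw len(rows) rows with replacement, as is done for each tree."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


rows = list(range(10))

# Same seed: the two samples are identical, so training would be repeatable.
first = bootstrap_sample(rows, seed=42)
second = bootstrap_sample(rows, seed=42)

# No seed: the sample generally differs between calls, with some rows
# drawn more than once and others left out.
unseeded = bootstrap_sample(rows)
```

If the model is a scikit-learn estimator (an assumption, since the tools list is not shown here), the `random_state` argument plays the role of the seed.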

predict.py

This script returns the like count prediction along with the difference and the error rate.

$ cd model
$ python predict.py 

For example:

$ python predict.py vid1 vid2 vid3
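
How the difference and the error rate are computed is not stated here. A minimal sketch under one common convention: the difference is the predicted count minus the actual count, and the error rate is the absolute difference as a percentage of the actual count. The helper name `prediction_report` and both definitions are assumptions, not the code in predict.py.

```python
def prediction_report(predicted, actual):
    """Return (difference, error_rate_percent) for one video (hypothetical helper)."""
    difference = predicted - actual
    if actual == 0:
        # Relative error is undefined when the video has no likes.
        return difference, None
    return difference, abs(difference) * 100 / actual


diff, error_rate = prediction_report(predicted=1150, actual=1000)
print(f"difference={diff}, error rate={error_rate:.1f}%")
```

Returning `None` for a video with zero likes avoids a division by zero and leaves it to the caller to decide how to report that case.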

Issues

A very common issue arises in the pickling process, which sometimes leads to loss of information and different results every time.
