🗂 Split folders with files (e.g. images) into training, validation and test (dataset) folders
Split folders with files (e.g. images) into train, validation and test (dataset) folders.
The input folder should have the following format:
input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...
In order to give you this:
output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
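To check that a split produced the layout above, the files per split and class can be counted. The following is a minimal sketch using only the standard library; `count_files` is a hypothetical helper and not part of splitfolders.

```python
from collections import Counter
from pathlib import Path


def count_files(output_dir):
    """Return {(split, class_name): number_of_files} for an output folder.

    Hypothetical helper: assumes the layout output/<split>/<class>/<file>.
    """
    counts = Counter()
    for path in Path(output_dir).glob("*/*/*"):
        if path.is_file():
            # The last three path components are split, class and file name.
            split, class_name = path.parts[-3], path.parts[-2]
            counts[(split, class_name)] += 1
    return dict(counts)
```

Calling `count_files("output")` after a split gives one entry per split and class, which makes an unexpectedly empty or lopsided folder easy to spot.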
This should get you started to do some serious deep learning on your data. Read here why it's a good idea to split your data into three different sets.
pip install split-folders
If you are working with a large number of files, you may want a progress bar. Install tqdm to get visual updates while files are copied.
pip install split-folders tqdm
You can use split-folders as a Python module or as a Command Line Interface (CLI).
If your dataset is balanced (each class has the same number of samples), choose ratio; otherwise choose fixed. NB: oversampling is turned off by default. Oversampling is only applied to the train folder since having duplicates in val or test would be considered cheating.
import splitfolders # or import split_folders
Split with a ratio.
To split into training and validation sets only, pass a two-element tuple to ratio, e.g. (.8, .2).
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .1, .1), group_prefix=None) # default values
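To illustrate what a seeded ratio split means for the files of a single class, here is a minimal sketch. It is not the splitfolders implementation: `split_by_ratio` is a hypothetical function, and the library's exact shuffling and rounding may differ.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle deterministically, then cut the list according to `ratio`.

    Hypothetical sketch; accepts a (train, val) or (train, val, test) tuple.
    """
    files = sorted(files)  # fixed starting order so the seed is reproducible
    random.Random(seed).shuffle(files)

    train_end = int(len(files) * ratio[0])
    if len(ratio) == 2:
        # Train/val only: everything after the training cut is validation.
        return files[:train_end], files[train_end:], []

    val_end = train_end + int(len(files) * ratio[1])
    return files[:train_end], files[train_end:val_end], files[val_end:]
```

Because the shuffle uses its own seeded `random.Random` instance, calling the function twice with the same file list and seed yields the same three lists.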
Split val/test with a fixed number of items e.g. 100 for each set.
To split into training and validation sets only, pass a single number to fixed, e.g. 100.
splitfolders.fixed("input_folder", output="output", seed=1337, fixed=(100, 100), oversample=False, group_prefix=None) # default values
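The idea behind a fixed split with oversampling can be sketched as follows. This is an assumption-laden illustration, not the library's code: `split_fixed` and its `oversample_to` parameter are hypothetical (splitfolders itself takes a boolean `oversample`), and the target size is passed in explicitly here.

```python
import random


def split_fixed(files, fixed=(100, 100), oversample_to=None, seed=1337):
    """Take a fixed number of val/test items; the remainder is training data.

    Hypothetical sketch. `oversample_to` is an assumed parameter: if given,
    the training list is padded with random duplicates up to that size.
    """
    rng = random.Random(seed)
    files = sorted(files)
    rng.shuffle(files)

    if isinstance(fixed, int):
        fixed = (fixed,)  # a single number means train/val only
    n_val = fixed[0]
    n_test = fixed[1] if len(fixed) > 1 else 0

    val = files[:n_val]
    test = files[n_val:n_val + n_test]
    train = files[n_val + n_test:]

    # Duplicates are added to the training list only, never to val or test.
    if oversample_to is not None and 0 < len(train) < oversample_to:
        train = train + rng.choices(train, k=oversample_to - len(train))
    return train, val, test
```

Padding only the training list keeps validation and test free of duplicated samples, which matches the reason given above for restricting oversampling to the train folder.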
Occasionally you may have things that comprise more than a single file (e.g. picture (.png) + annotation (.txt)).
splitfolders lets you split files into equally-sized groups based on their prefix. Set group_prefix to the length of the group (e.g. 2). Note that all files must then be part of a group.
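Grouping by prefix can be pictured with a small sketch. It assumes that the "prefix" is the file name without its extension, which is an assumption made for illustration; `group_by_stem` is a hypothetical helper and not the library's internal function.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(files, group_size=2):
    """Group file names that share a stem, e.g. img1.png + img1.txt.

    Hypothetical sketch: raises if any group does not have `group_size` members.
    """
    groups = defaultdict(list)
    for name in sorted(files):
        groups[Path(name).stem].append(name)

    for stem, members in groups.items():
        if len(members) != group_size:
            raise ValueError(
                f"group {stem!r} has {len(members)} files, expected {group_size}"
            )
    # Each tuple is then treated as one item when splitting.
    return [tuple(members) for members in groups.values()]
```

Treating each tuple as a single item keeps a picture and its annotation in the same split.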
Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] folder_with_images
Options:
    --output        path to the output folder. defaults to `output`. Get created if non-existent.
    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1 --` or for train/val `.8 .2 --`.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.
    --seed          set seed value for shuffling the items. defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix
Example:
    splitfolders --ratio .8 .1 .1 -- folder_with_images
Because of some Python quirks you have to add -- after the values passed to --ratio, before the input folder.
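A likely explanation for the `--` separator seen in the usage example is the way Python's argparse treats options that accept a variable number of values: the option keeps consuming arguments, including the folder name. The sketch below reproduces that behaviour with a hypothetical parser that only mimics the usage line; it is not the splitfolders CLI code.

```python
import argparse
import contextlib
import io


def build_parser():
    # Hypothetical parser that only mimics the usage line above.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--ratio", nargs="+", type=float)
    parser.add_argument("folder_with_images")
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    # argparse reports errors on stderr and exits; silence and catch both.
    with contextlib.redirect_stderr(io.StringIO()):
        try:
            return build_parser().parse_args(argv)
        except SystemExit:
            return None


# Without "--", --ratio also swallows the folder name and parsing fails.
without_separator = try_parse(["--ratio", ".8", ".1", ".1", "images"])

# With "--", the option stops collecting values and the folder is positional.
with_separator = try_parse(["--ratio", ".8", ".1", ".1", "--", "images"])
```

Here `without_separator` is `None`, while `with_separator` holds the three ratio values and the folder name.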
Instead of the command splitfolders you can also use
Install and use poetry.
If you have a question, found a bug or want to propose a new feature, have a look at the issues page.
Pull requests are especially welcome when they fix bugs or improve the code quality.