A TensorFlow implementation of DeepMind's WaveNet paper
This is a TensorFlow implementation of the WaveNet generative neural network architecture for audio generation.
The WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation (see the DeepMind blog post and paper for details).
The network models the conditional probability of the next sample in the audio waveform, given all previous samples and possibly additional parameters.
After an audio preprocessing step, the input waveform is quantized to a fixed integer range.
The integer amplitudes are then one-hot encoded to produce a tensor of shape `(num_samples, num_channels)`.
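As a minimal NumPy sketch of this preprocessing, assuming the mu-law companding with 256 quantization channels described in the WaveNet paper (function and variable names here are illustrative, not the repository's exact code):

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Quantize a float waveform in [-1, 1] to integers in [0, channels - 1]."""
    mu = quantization_channels - 1
    # Mu-law companding (from the WaveNet paper) compresses the amplitude range.
    signal = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map the companded signal from [-1, 1] to integer bins [0, mu].
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int64)

audio = np.sin(np.linspace(0, 100, 16000))          # stand-in for a loaded .wav file
quantized = mu_law_encode(audio)
one_hot = np.eye(256, dtype=np.float32)[quantized]  # shape: (num_samples, num_channels)
```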
A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.
The core of the network is constructed as a stack of causal dilated layers, each of which is a dilated convolution (convolution with holes) that only accesses the current and past audio samples.
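A toy NumPy sketch of a single causal dilated convolution may make this concrete; the left-padding is what guarantees that an output sample never depends on future inputs. (In the paper, the dilation doubles at each layer, e.g. 1, 2, 4, ..., 512, and the pattern repeats. This sketch is an illustration, not the repository's implementation.)

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """x: (time, in_channels); w: (filter_width, in_channels, out_channels)."""
    filter_width = w.shape[0]
    pad = (filter_width - 1) * dilation
    x_padded = np.pad(x, ((pad, 0), (0, 0)), mode='constant')  # pad the past only
    out = np.zeros((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        for k in range(filter_width):
            # Tap k looks back k * dilation steps; never ahead of time t.
            out[t] += x_padded[t + pad - k * dilation] @ w[filter_width - 1 - k]
    return out

x = np.random.randn(100, 32)  # 100 timesteps, 32 channels
y = causal_dilated_conv(x, np.random.randn(2, 32, 32), dilation=4)
```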
The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution.
The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
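In TensorFlow terms, this one-timestep shift amounts to something like the following sketch (`raw_output` and `encoded_input`, and their shapes, are assumptions for illustration):

```python
import tensorflow as tf

# Assumed shapes for illustration: per-timestep logits and quantized inputs.
raw_output = tf.placeholder(tf.float32, [16000, 256])  # network output (logits)
encoded_input = tf.placeholder(tf.int32, [16000])      # quantized input samples

# The prediction at timestep t is scored against the input at timestep t + 1.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=raw_output[:-1, :],
        labels=encoded_input[1:]))
```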
In this repository, the network implementation can be found in `model.py`.
TensorFlow needs to be installed before running the training script. The code is tested on TensorFlow version 1.0.1 for Python 2.7 and Python 3.5.
In addition, librosa must be installed for reading and writing audio.
To install the required Python packages, run

```bash
pip install -r requirements.txt
```
For GPU support, use

```bash
pip install -r requirements_gpu.txt
```
You can use any corpus containing `.wav` files. We've mainly used the VCTK corpus (around 10.4GB; an alternative host is available) so far.
In order to train the network, execute

```bash
python train.py --data_dir=corpus
```

where `corpus` is a directory containing `.wav` files. The script will recursively collect all `.wav` files in the directory.
You can see documentation on each of the training settings by running

```bash
python train.py --help
```
Global conditioning refers to modifying the model such that the id of a set of mutually-exclusive categories is specified during training and generation of a `.wav` file. In the case of the VCTK corpus, this id is the integer id of the speaker, of which there are over a hundred. This allows (indeed requires) a speaker id to be specified at generation time to select which of the speakers the model should mimic. For more details, see the paper or the source code.
The instructions above for training refer to training without global conditioning. To train with global conditioning, specify command-line arguments as follows:
```bash
python train.py --data_dir=corpus --gc_channels=32
```

The `--gc_channels` argument does two things:
* It tells the train.py script that it should build a model that includes global conditioning.
* It specifies the size of the embedding vector that is looked up based on the id of the speaker.
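A minimal sketch of what that embedding lookup amounts to (variable names here are assumptions for illustration, not the exact code in `model.py`):

```python
import tensorflow as tf

gc_cardinality = 377   # number of speaker id slots (VCTK example)
gc_channels = 32       # embedding size; must match --gc_channels
speaker_id = tf.placeholder(tf.int32, [])

# One learned row per speaker id; the looked-up vector is fed into the
# network as a global conditioning signal.
gc_embedding = tf.get_variable('gc_embedding', [gc_cardinality, gc_channels])
speaker_vector = tf.nn.embedding_lookup(gc_embedding, speaker_id)
```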
The global conditioning logic in `train.py` and `audio_reader.py` is "hard-wired" to the VCTK corpus at the moment, in that it expects to be able to determine the speaker id from the pattern of file naming used in VCTK, but it can easily be modified.
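For example, VCTK files follow the pattern `p<speaker_id>_<utterance>.wav` (e.g. `p280_001.wav`), so extracting the id comes down to a regular expression along these lines (a sketch; the helper name is hypothetical):

```python
import re

def speaker_id_from_path(path):
    """Extract the integer speaker id from a VCTK-style filename, e.g. 'p280_001.wav'."""
    match = re.search(r'p([0-9]+)_', path)
    return int(match.group(1)) if match else None

print(speaker_id_from_path('VCTK-Corpus/wav48/p280/p280_001.wav'))  # -> 280
```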
Example output generated by @jyegerlehner based on speaker 280 from the VCTK corpus.
You can use the `generate.py` script to generate audio using a previously trained model.
```bash
python generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000
```

where `logdir/train/2017-02-13T16-45-34/model.ckpt-80000` needs to be a path to a previously saved model (without extension). The `--samples` parameter specifies how many audio samples you would like to generate (16000 corresponds to 1 second by default).
The generated waveform can be played back using TensorBoard, or stored as a `.wav` file by using the `--wav_out_path` parameter:

```bash
python generate.py --wav_out_path=generated.wav --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000
```
`--save_every` in addition to `--wav_out_path` will save the in-progress wav file every n samples.

```bash
python generate.py --wav_out_path=generated.wav --save_every 2000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000
```
Fast generation is enabled by default. It uses the implementation from the Fast Wavenet repository. You can follow the link for an explanation of how it works. This reduces the time needed to generate samples to a few minutes.
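The idea, roughly, is that each dilated layer keeps a queue of its recent inputs, so producing one new sample requires only one new computation per layer instead of re-running the entire receptive field. A toy sketch of that caching scheme (all names illustrative; the repository applies the same idea to the dilated layers' intermediate activations):

```python
from collections import deque

dilations = [1, 2, 4, 8]
# Per-layer queues of past layer inputs, pre-filled with zeros ("silence").
queues = [deque([0.0] * d, maxlen=d) for d in dilations]
# Toy stand-ins for trained width-2 dilated convolutions.
layers = [lambda past, cur: 0.5 * past + 0.5 * cur for _ in dilations]

def generate_step(sample, layers, queues):
    x = sample
    for layer, queue in zip(layers, queues):
        past = queue[0]     # the input this layer saw `dilation` steps ago
        queue.append(x)     # pushing the newest input evicts the oldest
        x = layer(past, x)  # a width-2 dilated conv needs only these two values
    return x

out = generate_step(0.1, layers, queues)
```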
To disable fast generation:

```bash
python generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000 --fast_generation=false
```
Generate from a model incorporating global conditioning as follows:
```bash
python generate.py --samples 16000 --wav_out_path speaker311.wav --gc_channels=32 --gc_cardinality=377 --gc_id=311 logdir/train/2017-02-13T16-45-34/model.ckpt-80000
```

Where:

`--gc_channels=32` specifies that 32 is the size of the embedding vector, and must match what was specified when training.

`--gc_cardinality=377` is required as 376 is the largest id of a speaker in the VCTK corpus. If some other corpus is used, then this number should match what is automatically determined and printed out by the train.py script at training time.

`--gc_id=311` specifies the id of the speaker, speaker 311, for which a sample is to be generated.
Install the test requirements

```bash
pip install -r requirements_test.txt
```
Run the test suite
Currently there is no local conditioning on extra information, which would allow context stacks or control over what speech is generated.