A structured implementation of MuZero
========================================

.. |copy| unicode:: 0xA9
.. |---| unicode:: U+02014
This repository is a Python implementation of the MuZero algorithm. It is based upon the
pre-print paper__ and the pseudocode__ describing the MuZero framework. Neural computations are implemented with TensorFlow.

You can easily train your own MuZero, more specifically for one-player and non-image-based environments (such as
CartPole__). If you wish to train MuZero on other kinds of environments, this codebase can be used with slight modifications.

__ https://arxiv.org/abs/1911.08265
__ https://arxiv.org/src/1911.08265v1/anc/pseudocode.py
__ https://gym.openai.com/envs/CartPole-v1/
DISCLAIMER: this is early research code.

We run this code from within a conda environment; the code must be run from the main function in ``muzero.py`` (don't forget to configure your environment first).
To train a model, please follow these steps:

1) Create or modify an existing configuration of MuZero in ``config.py``.

2) Call the right configuration inside the main of ``muzero.py`` (see the sketch after this list).

3) Run the main function: ``python muzero.py``.
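
For illustration, the main of ``muzero.py`` could look like the minimal sketch below. The names ``make_cartpole_config`` and ``muzero`` are assumptions made for this example; use the configuration factory and training entry point actually defined in ``config.py`` and ``muzero.py``.

.. code-block:: python

    # Hypothetical sketch of the main entry point in muzero.py.
    # make_cartpole_config and muzero are assumed names: substitute the
    # configuration and training functions this repository actually defines.
    from config import make_cartpole_config

    if __name__ == '__main__':
        # Step 1: pick (or create) a configuration in config.py.
        config = make_cartpole_config()
        # Step 2: hand that configuration to the training loop,
        # assumed here to be a muzero(...) function defined earlier in this file.
        muzero(config)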
To train on a different environment than CartPole-v1, please follow these additional steps:

1) Create a class that extends ``AbstractGame``; this class should implement the behavior of your environment. For instance, the ``CartPole`` class extends ``AbstractGame`` and works as a wrapper around `gym CartPole-v1`__. You can use the ``CartPole`` class as a template for any gym environment (a minimal wrapper sketch is given after this list).

__ https://gym.openai.com/envs/CartPole-v1/
2) This step is optional (only if you want to use a different kind of network architecture or value/reward transform). Create a class that extends ``BaseNetwork``; this class should implement the different networks (representation, value, policy, reward and dynamics) and the value/reward transforms. For instance, the ``CartPoleNetwork`` class extends ``BaseNetwork`` and implements fully connected networks.
3) This step is optional (only if you use a different value/reward transform). You should implement the corresponding inverse value/reward transform by modifying the ``loss_value`` and ``loss_reward`` functions inside ``training.py`` (see the transform sketch below).
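
As a rough illustration of step 1, a wrapper for another gym environment might look like the sketch below. The method names (``legal_actions``, ``step``, ``is_done``, ``make_image``), the constructor signature and the import path are assumptions for this example; mirror the abstract methods actually declared by ``AbstractGame`` in this repository, using the ``CartPole`` class as the reference.

.. code-block:: python

    # Hypothetical AbstractGame wrapper around gym MountainCar-v0 (classic gym API).
    # The import path and the method names below are assumptions; copy the
    # structure of the repository's CartPole class rather than this sketch.
    import gym

    from game.game import AbstractGame  # assumed module path


    class MountainCar(AbstractGame):
        """Illustrative wrapper around the gym MountainCar-v0 environment."""

        def __init__(self, discount: float = 0.997):
            super().__init__(discount)
            self.env = gym.make('MountainCar-v0')
            self.state = self.env.reset()
            self.done = False

        def legal_actions(self):
            # MountainCar exposes 3 discrete actions: push left, no push, push right.
            return list(range(self.env.action_space.n))

        def step(self, action):
            # Apply the action in the underlying gym environment and return the reward.
            self.state, reward, self.done, _ = self.env.step(action)
            return reward

        def is_done(self):
            return self.done

        def make_image(self):
            # Observation fed to the representation network (a 2-dimensional state here).
            return self.state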
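
For step 3, the invertible transform described in the MuZero paper (following Pohlen et al., with epsilon 0.001) and its exact inverse are reproduced below as a minimal reference sketch; the function names are illustrative and do not correspond to the ones used in ``training.py``.

.. code-block:: python

    # Reference sketch of the invertible value/reward transform
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x and its exact inverse,
    # as described in the MuZero paper. Function names are illustrative only.
    import tensorflow as tf

    EPS = 0.001


    def scalar_transform(x: tf.Tensor) -> tf.Tensor:
        """h(x): compress the scale of values/rewards before they are learned."""
        return tf.math.sign(x) * (tf.math.sqrt(tf.math.abs(x) + 1.0) - 1.0) + EPS * x


    def inverse_scalar_transform(x: tf.Tensor) -> tf.Tensor:
        """h^-1(x): map a transformed prediction back to the original scale."""
        return tf.math.sign(x) * (
            tf.math.square(
                (tf.math.sqrt(1.0 + 4.0 * EPS * (tf.math.abs(x) + 1.0 + EPS)) - 1.0)
                / (2.0 * EPS)
            )
            - 1.0
        )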
This implementation differs from the original paper in the following ways: