PyTorch on Kubernetes
This repository contains the specification and implementation of the
PyTorchJob custom resource definition (CRD). Using this custom resource, users can create and manage PyTorch jobs like other built-in resources in Kubernetes. See the CRD definition.
Please refer to the installation instructions in the Kubeflow user guide. This installs the
pytorchjob CRD and the
pytorch-operator controller, which manages the lifecycle of PyTorch jobs.
You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file to suit your requirements.
cat examples/mnist/v1/pytorch_job_mnist_gloo.yaml
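For orientation, the shape of such a config can be sketched in Python. This is a minimal sketch, not the contents of the example file: the field names and values are taken from the sample `kubectl get -o yaml` output shown later in this README, and `build_pytorchjob` with its parameters is a hypothetical helper introduced only for illustration.

```python
import json


def build_pytorchjob(name, image, backend="gloo", workers=1, gpus_per_replica=None):
    """Return a PyTorchJob manifest as a plain dict (hypothetical helper)."""

    def replica_spec(replicas):
        container = {
            "name": "pytorch",
            "image": image,
            "args": ["--backend", backend],
        }
        # Only request GPUs when asked to; omit the block for CPU clusters.
        if gpus_per_replica is not None:
            container["resources"] = {
                "limits": {"nvidia.com/gpu": str(gpus_per_replica)}
            }
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [container]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
    }


manifest = build_pytorchjob(
    "pytorch-dist-mnist-gloo",
    "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and passed to `kubectl create -f`. Compare it against the real example manifest before relying on it.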
Deploy the PyTorchJob resource to start training:
kubectl create -f examples/mnist/v1/pytorch_job_mnist_gloo.yaml
You should now be able to see the created pods matching the specified number of replicas.
kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo
Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. You can inspect the logs to follow the training progress.
PODNAME=$(kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-type=master -o name)
kubectl logs -f ${PODNAME}
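When the same lookup is scripted, the label selector can be assembled from the job name and replica type rather than typed by hand. The sketch below only builds the command line; `label_selector` and `get_pods_command` are hypothetical helper names, and the two label keys are the ones used in the commands above.

```python
import shlex


def label_selector(job_name, replica_type=None):
    """Build a kubectl label selector for the pods of one PyTorch job."""
    labels = {"pytorch-job-name": job_name}
    if replica_type is not None:
        labels["pytorch-replica-type"] = replica_type
    return ",".join(f"{key}={value}" for key, value in labels.items())


def get_pods_command(job_name, replica_type=None):
    """Return the argv list for listing the job's pods by name."""
    selector = label_selector(job_name, replica_type)
    return ["kubectl", "get", "pods", "-l", selector, "-o", "name"]


cmd = get_pods_command("pytorch-dist-mnist-gloo", "master")
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where kubectl is configured for the cluster.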
kubectl get -o yaml pytorchjobs pytorch-dist-mnist-gloo
See the status section of the output to monitor the job status. Here is sample output from a job that completed successfully.
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: 2019-01-11T00:51:48Z
    generation: 1
    name: pytorch-dist-mnist-gloo
    namespace: default
    resourceVersion: "2146573"
    selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/pytorchjobs/pytorch-dist-mnist-gloo
    uid: 13ad0e7f-153b-11e9-b5c1-42010a80001e
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
      Worker:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
  status:
    completionTime: 2019-01-11T01:03:15Z
    conditions:
    - lastTransitionTime: 2019-01-11T00:51:48Z
      lastUpdateTime: 2019-01-11T00:51:48Z
      message: PyTorchJob pytorch-dist-mnist-gloo is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-01-11T00:57:22Z
      lastUpdateTime: 2019-01-11T00:57:22Z
      message: PyTorchJob pytorch-dist-mnist-gloo is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-01-11T01:03:15Z
      lastUpdateTime: 2019-01-11T01:03:15Z
      message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Master:
        succeeded: 1
      Worker:
        succeeded: 1
    startTime: 2019-01-11T00:57:22Z
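The conditions list records every state the job has passed through, so the current state is the most recent condition whose status is "True". The sketch below shows one way to read it from a job object parsed as a dict, for example from `kubectl get -o json`. `current_condition` is a hypothetical helper, and the sample data is copied from the conditions in the output above. Treating the latest "True" condition as the current state is an assumption based on that sample, not documented operator behavior.

```python
def current_condition(job):
    """Return the type of the latest condition whose status is "True", or None."""
    conditions = job.get("status", {}).get("conditions", [])
    active = [c for c in conditions if c.get("status") == "True"]
    if not active:
        return None
    # UTC timestamps in this fixed ISO 8601 format sort correctly as strings.
    latest = max(active, key=lambda c: c["lastTransitionTime"])
    return latest["type"]


# Conditions copied from the sample output above (other fields omitted).
sample_job = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True",
             "lastTransitionTime": "2019-01-11T00:51:48Z"},
            {"type": "Running", "status": "False",
             "lastTransitionTime": "2019-01-11T00:57:22Z"},
            {"type": "Succeeded", "status": "True",
             "lastTransitionTime": "2019-01-11T01:03:15Z"},
        ]
    }
}

print(current_condition(sample_job))
```

In the sample, the Running condition has already flipped to "False", so filtering on status keeps a finished job from being reported as still running.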
Please refer to the developer_guide.