CRIU-AMDGPU-PLUGIN(1)
NAME
criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore in userspace for AMD GPUs.
CURRENT SUPPORT
Single and multi GPU systems (Gfx9)
Checkpoint / Restore on a different system
Checkpoint / Restore inside a Docker container
PyTorch
TensorFlow
Using CRIU Image Streamer
DESCRIPTION
criu is a capable tool for checkpointing and restoring running applications, but it has limitations; for example, it cannot handle applications that have device files open. To support ROCm-based workloads, criu's core functionality must be augmented through a plugin-based extension mechanism. criu-amdgpu-plugin provides the support criu needs to checkpoint and restore ROCm applications.
Dependencies
amdkfd support
To snapshot VRAM and other GPU device state, an updated version of the amdkfd (amdgpu) driver is required. The kernel patches are currently under review.
criu 3.16
OPTIONS
Optional parameters can be passed as environment variables before executing the criu command.
KFD_FW_VER_CHECK
E.g: KFD_FW_VER_CHECK=0
KFD_SDMA_FW_VER_CHECK
E.g: KFD_SDMA_FW_VER_CHECK=0
KFD_CACHES_COUNT_CHECK
E.g: KFD_CACHES_COUNT_CHECK=0
KFD_NUM_GWS_CHECK
E.g: KFD_NUM_GWS_CHECK=0
KFD_VRAM_SIZE_CHECK
E.g: KFD_VRAM_SIZE_CHECK=0
KFD_NUMA_CHECK
E.g: KFD_NUMA_CHECK=1
KFD_CAPABILITY_CHECK
E.g: KFD_CAPABILITY_CHECK=1
KFD_MAX_BUFFER_SIZE
E.g: KFD_MAX_BUFFER_SIZE="2G"
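As a minimal sketch of how the variables above might be supplied, the snippet below sets two of them for a single invocation and exports a third for the rest of the shell session. The image directory and the CRIU_CMD stand-in are assumptions made for illustration (the stand-in only echoes the command so the sketch runs without a GPU), and the criu arguments shown are examples rather than a recommended invocation.

```shell
#!/bin/sh
# CRIU_CMD is a hypothetical stand-in that only prints the command line;
# set CRIU_CMD=criu on a real system to actually run the restore.
CRIU_CMD="${CRIU_CMD:-echo criu}"
IMG_DIR="${IMG_DIR:-/tmp/criu-images}"   # assumed image directory

# Variables placed in front of the command apply to that invocation only.
KFD_FW_VER_CHECK=0 KFD_MAX_BUFFER_SIZE="2G" \
    $CRIU_CMD restore --images-dir "$IMG_DIR" --shell-job

# Exported variables are seen by every later criu command in this shell.
export KFD_NUMA_CHECK=1
```

The per-command form keeps an override from leaking into later runs, which may be preferable when a check is disabled only for a one-off restore.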
AUTHOR
The AMDKFD team.
COPYRIGHT
Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
2023-11-28