Model Checkpointing

InternEvo uses internlm.utils.model_checkpoint.CheckpointManager to manage model checkpointing. In the implementation, we use CheckpointManager.try_save_checkpoint(train_state) to checkpoint training states at specific steps.

InternEvo supports automatic loading of latest ckpt at startup and automatic model checkpointing at signal quit.

CheckpointManager

CheckpointManager is the utility class within InternEvo responsible for model loading and saving. It initializes its own parameters from the dictionary in the ‘ckpt’ field of the config file. Currently, the relevant parameters are as follows:

enable_save_ckpt: Whether to enable checkpoint storage functionality (does not affect checkpoint loading). Parameter type: bool, it is a required parameter.
save_ckpt_folder: Checkpoint storage path. Parameter type: str. This is a required parameter when enabling checkpoint storage functionality.
checkpoint_every: Checkpoint storage frequency. Parameter type: int.
load_ckpt_folder: Initialization checkpoint/weight loading path. Parameter type: str. Default is None. See "Path format conventions" and "Model loading format conventions" below.
async_upload: Whether to enable asynchronous uploading. See "Asynchronous upload" below for details.
async_upload_tmp_folder: Temporary storage path for asynchronous uploading.
oss_snapshot_freq: Snapshot storage frequency. See "Snapshot Checkpoint" below for details.
auto_resume: Whether to enable automatic checkpoint resume. See "Checkpoint automatic recovery" below for details.
stop_file_path: Path to the checkpoint storage control file. See "Manual control of checkpoint storage" below for details.

Here is an example of parameter settings in the config file.

ckpt = dict(
    enable_save_ckpt=False,  # whether to enable ckpt save (disabled here).
    save_ckpt_folder=SAVE_CKPT_FOLDER,  # Path to save training ckpt.
    load_ckpt_folder=dict(path="local:/mnt/mfs/ckpt", content=["all",], ckpt_type="internlm"),
    auto_resume=False, # disable auto-resume, internlm will load model checkpoint from the path of 'load_ckpt_folder'.
    checkpoint_every=CHECKPOINT_EVERY,
    async_upload=True,  # async ckpt upload. (only work for boto3, volc and oss2 ckpt)
    async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/",  # path for temporary files during asynchronous upload.
    oss_snapshot_freq=int(CHECKPOINT_EVERY / 2),  # snapshot ckpt save frequency.
)
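The requirements stated in the parameter list (enable_save_ckpt is required, and save_ckpt_folder is required once saving is enabled) can be checked before launching a job. The following is a minimal sketch of such a check, not part of InternEvo; the function name check_ckpt_config is hypothetical.

```python
def check_ckpt_config(ckpt: dict) -> list:
    """Return a list of problems found in a ckpt config dict (empty if none).

    Hypothetical helper: it only encodes the rules stated in the parameter list.
    """
    problems = []
    if "enable_save_ckpt" not in ckpt:
        # enable_save_ckpt is documented as a required bool.
        problems.append("enable_save_ckpt is required")
    elif ckpt["enable_save_ckpt"] and not ckpt.get("save_ckpt_folder"):
        # save_ckpt_folder is only required when saving is enabled.
        problems.append("save_ckpt_folder is required when saving is enabled")
    freq = ckpt.get("checkpoint_every")
    if freq is not None and not isinstance(freq, int):
        # checkpoint_every is documented as an int.
        problems.append("checkpoint_every must be an int")
    return problems


# A config that enables saving but forgets the target folder.
print(check_ckpt_config(dict(enable_save_ckpt=True, checkpoint_every=100)))
```

An empty list means the dictionary satisfies the rules listed above; it says nothing about whether the paths themselves are reachable.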

Model loading and saving path format conventions.

(1) Path format conventions.

InternEvo uses a common path format for all storage paths specified in the config. The examples in this document prefix the path with the storage backend, as in local:/mnt/mfs/ckpt.

For paths of different backends, the following rules should be noted:

If you need to use paths with Boto3, make sure to export the S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY_ID environment variables before running.
The bucket’s endpoint is typically divided into Inside IP and Outside IP. Whenever possible, it’s advisable to use the Inside IP to achieve better storage speed.
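As an illustration of the rules above, the sketch below splits a path into its backend prefix and the remainder, and reports which of the Boto3 environment variables named in the text are missing. It is not InternEvo code. The "backend:path" form is inferred from examples such as local:/mnt/mfs/ckpt, the use of "boto3" as a prefix is inferred from the backend name used in this document, and the helper names and the remote path in the usage line are hypothetical.

```python
import os

# Environment variables named in the text for the Boto3 backend.
BOTO3_ENV_VARS = ("S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY_ID")


def split_backend(path: str):
    """Split 'backend:rest' at the first colon into (backend, rest)."""
    backend, sep, rest = path.partition(":")
    if not sep or not backend or not rest:
        raise ValueError(f"expected 'backend:path', got {path!r}")
    return backend, rest


def missing_credentials(path: str, environ=os.environ):
    """Return the Boto3 variables that are unset; empty for other backends."""
    backend, _ = split_backend(path)
    if backend != "boto3":
        return []
    return [name for name in BOTO3_ENV_VARS if not environ.get(name)]


print(split_backend("local:/mnt/mfs/ckpt"))
# Hypothetical remote path, checked against an explicit environment mapping.
print(missing_credentials("boto3:some/remote/path", environ={"S3_ACCESS_KEY_ID": "x"}))
```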

(2) Model loading format conventions (load_ckpt_folder).

load_ckpt_folder consists of three fields: path, content, and ckpt_type.

path: Specifies the loading path for the checkpoint/initial model weights (the format of the path is described in (1) above).
content: Indicates the content to be loaded, currently supported fields include:
- model: Load model weights.
- sampler: Load sampler state.
- scheduler: Load lr_scheduler state.
- optimizer: Load optimizer state.
- all: Indicates that all states should be loaded, typically used for resuming training.
ckpt_type: Represents the type of model weight to be loaded, currently supported fields include:
- internlm: Checkpoint storage format as per InternEvo conventions.

Here are two examples:

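# Load existing model weights from the relative file-storage path ckpt_model to initialize the model; suitable for initializing training such as SFT.
load_ckpt_folder = dict(path="local:ckpt_model", content=["model",], ckpt_type="internlm")

# Load all states from the relative file-storage path ckpt_model; suitable for resuming training from a checkpoint.
load_ckpt_folder = dict(path="local:ckpt_model", content=["all",], ckpt_type="internlm")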
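To make the meaning of the content field concrete, here is a small sketch that turns a content list into the set of states to load. It is an illustration only, not the InternEvo implementation. It assumes that "all" is equivalent to the four individual fields listed above; the function name resolve_content is hypothetical.

```python
# The individual content fields listed in this document.
SUPPORTED_CONTENT = ("model", "sampler", "scheduler", "optimizer")


def resolve_content(content):
    """Expand a content list into a set of state names.

    Assumption: "all" means every field in SUPPORTED_CONTENT.
    """
    unknown = [c for c in content if c != "all" and c not in SUPPORTED_CONTENT]
    if unknown:
        raise ValueError(f"unsupported content fields: {unknown}")
    if "all" in content:
        return set(SUPPORTED_CONTENT)
    return set(content)


print(sorted(resolve_content(["model"])))  # weights only, e.g. for SFT initialization
print(sorted(resolve_content(["all"])))    # every state, e.g. for resuming training
```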

Asynchronous upload.

Asynchronous upload first synchronously stores the model in the async_upload_tmp_folder and then asynchronously writes it to remote storage (OSS/NFS). This helps prevent blocking training for extended periods while storing checkpoints.

The parameters related to config.ckpt are:

async_upload: Whether to enable asynchronous upload. Parameter type: bool/None. Default is False.
async_upload_tmp_folder: Temporary storage path for asynchronous upload. Parameter type: str/None. Default value is /dev/shm/{JOB_NAME}_tmp_ckpt/.

Note that asynchronous upload only takes effect when the backend is a remote one such as “boto3” (the config example above also lists volc and oss2). When the backend is set to “local”, only synchronous storage is supported.

As a rule, set async_upload_tmp_folder to a directory local to the computing node to obtain the best asynchronous upload speed. A path under /dev/shm or /nvme is generally recommended. If you use synchronous upload, this path does not need to be given.
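The two-phase idea described above (a synchronous write to a fast local folder, followed by a background transfer) can be sketched with the standard library. This is a simplified illustration, not the InternEvo implementation: a local directory copy stands in for the remote upload, and the function name, file name, and folder names are hypothetical.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def save_then_upload(payload: bytes, name: str, tmp_folder: Path, remote_folder: Path) -> threading.Thread:
    """Write payload to tmp_folder synchronously, then copy it in the background."""
    tmp_folder.mkdir(parents=True, exist_ok=True)
    tmp_file = tmp_folder / name
    tmp_file.write_bytes(payload)  # phase 1: blocking, but fast on local storage

    def upload():
        # Phase 2: stand-in for the slow remote write; runs off the training thread.
        remote_folder.mkdir(parents=True, exist_ok=True)
        shutil.copy2(tmp_file, remote_folder / name)

    worker = threading.Thread(target=upload, daemon=True)
    worker.start()
    return worker


with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    worker = save_then_upload(b"weights", "step_10.pt", root / "tmp", root / "remote")
    # Training would continue here while the copy runs.
    worker.join()  # wait before reading the result or exiting
    uploaded = (root / "remote" / "step_10.pt").read_bytes()
```

Only the first phase blocks the caller, which is why a fast local temporary folder matters.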

Snapshot Checkpoint

A snapshot checkpoint is a special checkpoint used to reduce the training progress lost when a training task crashes (for example, because of an ECC error or an NCCL error). It uses an alternating overwrite strategy, so the storage it occupies is the space required for two checkpoints. Combined with asynchronous checkpoint writing, it greatly increases how often checkpoints are stored without affecting training speed or storage capacity.

The parameters related to config.ckpt are:

oss_snapshot_freq: Snapshot storage frequency. Parameter type: int/None. Default is 50.

oss_snapshot_freq can be set according to the time each training step takes. Generally, a snapshot interval of more than half an hour and less than 1 hour is advisable (if not configured, the default value is one-half of checkpoint_every).
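The alternating overwrite strategy and the choice of oss_snapshot_freq can be illustrated as follows. This is a sketch under stated assumptions rather than the InternEvo implementation: the two slots are simply numbered 0 and 1, and both function names are hypothetical.

```python
def snapshot_slot(step: int, oss_snapshot_freq: int):
    """Return the slot (0 or 1) to overwrite at this step, or None if no snapshot is due."""
    if step <= 0 or step % oss_snapshot_freq != 0:
        return None
    # Consecutive snapshots alternate between the two slots, so at most
    # two snapshot checkpoints ever occupy storage.
    return (step // oss_snapshot_freq) % 2


def steps_per_interval(seconds_per_step: float, target_minutes: float) -> int:
    """Convert a target wall-clock interval into a step frequency."""
    return max(1, round(target_minutes * 60 / seconds_per_step))


# With 36 s per step, a 45-minute interval corresponds to 75 steps.
freq = steps_per_interval(seconds_per_step=36.0, target_minutes=45)
print(freq, [snapshot_slot(s, freq) for s in (75, 150, 225, 100)])
```

Because each new snapshot overwrites the older of the two slots, one complete snapshot remains available even if a crash interrupts a write.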

Checkpoint automatic recovery

The purpose of checkpoint automatic recovery is to automatically load the latest checkpoint (including snapshot checkpoints) under the save_ckpt_folder path when training is resumed. Combined with an automatic restart mechanism, tasks can be restored without human intervention.

This function is enabled by default, so please note that if you need to load the model weights under the load_ckpt_folder path, you must set auto_resume to False, otherwise unexpected behavior may occur.

The parameters related to config.ckpt are:

auto_resume: Whether to enable automatic checkpoint recovery. Parameter type: bool. Default is True.

If auto_resume is True, InternEvo attempts to automatically load the latest ckpt under the save_ckpt_folder path. If none is found, training starts from step 0. If auto_resume is False, it tries to load model parameters from load_ckpt_folder.
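The decision described above can be written out as a small function. This is a sketch of the documented behavior only, not InternEvo code: saved_steps is a hypothetical list of the steps for which checkpoints were found under save_ckpt_folder, the function name is hypothetical, and starting fresh when auto_resume is False and no load_ckpt_folder is given is an assumption.

```python
def choose_load_source(auto_resume: bool, saved_steps: list, load_ckpt_folder):
    """Return (action, value) describing where training state comes from."""
    if auto_resume:
        if saved_steps:
            # Latest checkpoint (including snapshots) under save_ckpt_folder.
            return ("resume", max(saved_steps))
        # Nothing found: training starts from step 0.
        return ("fresh", 0)
    if load_ckpt_folder is not None:
        return ("load", load_ckpt_folder["path"])
    # Assumption: with nothing to load, training also starts from step 0.
    return ("fresh", 0)


init_weights = dict(path="local:ckpt_model", content=["model"], ckpt_type="internlm")
print(choose_load_source(True, [100, 150], init_weights))   # load_ckpt_folder is ignored
print(choose_load_source(False, [100, 150], init_weights))  # load_ckpt_folder is used
```

The first call shows why auto_resume must be False when the weights under load_ckpt_folder are the ones you want.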

Manual control of checkpoint storage

When the next scheduled checkpoint is still a long way off and you want to stop a task immediately without losing the current training progress, you can use manual control of checkpoint storage. Write the step at which you want the task to stop into the stop_file_path file located on NFS. The Global Rank 0 process polls the value of this file at each step. If it finds a stop step, it broadcasts the step to all training processes, and every process stores a checkpoint when training reaches that step and then decides whether to exit.

The parameters related to config.ckpt are:

stop_file_path: The path of the checkpoint storage control file. Parameter type: str/None. Default is None, which turns this function off.

An example of writing to stop_file_path is given below:

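# The step at which we want the task to stop
# If the stored step > 0, the task exits automatically after saving the ckpt
# If the stored step < 0, the task continues training after saving the ckpt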
echo "999" > ./llm_alter/1006_pr.log
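The polling side of this mechanism might look like the sketch below, which reads the control file and interprets the sign convention from the comments above. It is not the InternEvo implementation. The function name is hypothetical, and two details are assumptions: a negative entry is taken to mean "save at the absolute value of that step", and an empty, unparsable, or zero entry is treated as "no request".

```python
import tempfile
from pathlib import Path


def read_stop_step(stop_file_path):
    """Return (save_step, exit_after_save), or None if no request is present."""
    path = Path(stop_file_path)
    if not path.exists():
        return None
    try:
        value = int(path.read_text().strip())
    except ValueError:
        return None  # assumption: ignore empty or malformed content
    if value == 0:
        return None  # assumption: zero carries no request
    # Positive: save, then exit. Negative: save, then keep training.
    return abs(value), value > 0


with tempfile.TemporaryDirectory() as folder:
    stop_file = Path(folder) / "stop_file"  # hypothetical control file
    before = read_stop_step(stop_file)
    stop_file.write_text("999\n")  # same content as the echo example
    stop_and_exit = read_stop_step(stop_file)
    stop_file.write_text("-500\n")
    save_and_continue = read_stop_step(stop_file)
```

In a real job only Global Rank 0 would call such a function, and the result would then be broadcast to the other processes as described above.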