Training Setup

The training process of InternEvo can be summarized into two steps:

Initialization
- Initialize model, optimizer, dataloader, trainer, and create different types of process groups to prepare for iterative steps of hybrid parallel training.
- Initialize logger, checkpoint manager, monitor manager, and profiler to watch, alert, and record the iterative training steps.
Iterative training steps
- Load the training engine and scheduler for hybrid parallel training according to configuration settings such as tensor parallel size, pipeline parallel size, and data parallel size.
- In each training step, the Trainer API is called to zero the gradients, run the forward pass, loss computation, and backward pass, and then update the parameters.
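
The per-step sequence above (zero gradients, forward-loss-backward, parameter update) can be sketched with a toy example. This is a minimal pure-Python illustration, not InternEvo code: `ToyTrainer`, `forward_backward`, and `train` are hypothetical names, and the "model" is a single weight fitted to y = 2x with plain gradient descent.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer; fits y = w * x with plain SGD."""

    def __init__(self, lr=0.05):
        self.w = 0.0      # single trainable parameter
        self.grad = 0.0   # accumulated gradient of the loss w.r.t. w
        self.lr = lr

    def zero_grad(self):
        # Clear the gradient left over from the previous step.
        self.grad = 0.0

    def forward_backward(self, batch):
        # Forward pass and loss: mean squared error over the batch.
        n = len(batch)
        loss = sum((self.w * x - y) ** 2 for x, y in batch) / n
        # Backward pass: d(loss)/dw, accumulated into self.grad.
        self.grad += sum(2 * (self.w * x - y) * x for x, y in batch) / n
        return loss

    def step(self):
        # Parameter update.
        self.w -= self.lr * self.grad


def train(trainer, batches, total_steps):
    losses = []
    for step_idx in range(total_steps):
        batch = batches[step_idx % len(batches)]
        trainer.zero_grad()
        losses.append(trainer.forward_backward(batch))
        trainer.step()
    return losses


batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (0.5, 1.0)]]
trainer = ToyTrainer(lr=0.05)
losses = train(trainer, batches, total_steps=200)
print(f"w = {trainer.w:.4f}, first loss = {losses[0]:.2f}, last loss = {losses[-1]:.2e}")
```

In a real hybrid parallel run the forward-loss-backward call is delegated to the scheduler selected from the parallel configuration; the loop structure itself stays the same.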

Figure: InternEvo training process (_images/hybrid_parallel_training.png)

Argument Parsing

InternEvo uses the argparse library to supply command-line configuration to the InternEvo runtime.

Use internlm.initialize.get_default_parser() to get InternEvo's default parser, which comes with some built-in arguments. Users can add custom arguments to this parser.

import internlm.initialize

# Get the InternEvo default parser
parser = internlm.initialize.get_default_parser()
# Add a custom argument
parser.add_argument("--user_arg", type=int, default=-1, help="Argument added by the user.")
cmd_args = parser.parse_args()
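
To try the pattern without an InternEvo installation, the sketch below replaces the default parser with a stand-in built on the standard argparse module. `get_default_parser_stub` and its `--config` argument are assumptions made for illustration; they are not taken from the InternEvo source.

```python
import argparse


def get_default_parser_stub():
    """Hypothetical stand-in for internlm.initialize.get_default_parser()."""
    parser = argparse.ArgumentParser()
    # Assumed built-in argument, for illustration only.
    parser.add_argument("--config", type=str, default=None, help="Path to the config file.")
    return parser


parser = get_default_parser_stub()
# Custom argument added on top of the built-in ones.
parser.add_argument("--user_arg", type=int, default=-1, help="Argument added by the user.")

# Passing an explicit list instead of reading sys.argv keeps the example reproducible.
cmd_args = parser.parse_args(["--config", "my_config.py", "--user_arg", "7"])
defaults = parser.parse_args([])

print(cmd_args.config, cmd_args.user_arg, defaults.user_arg)
```

Custom arguments are read back as attributes of the parsed namespace (`cmd_args.user_arg`), and fall back to their declared default when the flag is omitted.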

Model Initialization

InternEvo uses the fields model_type and model in the config file to control the model initialization process. An example model initialization configuration is shown below.

model_type = "INTERNLM"  # default is "INTERNLM", used to register classes and modules for model initialization
NUM_ATTENTION_HEAD = 32
VOCAB_SIZE = 103168
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3
model = dict(
    checkpoint=False,  # Proportion of layers that use activation checkpointing; valid values are True, False, or a float in [0, 1]
    num_attention_heads=NUM_ATTENTION_HEAD,
    embed_split_hidden=True,
    vocab_size=VOCAB_SIZE,
    embed_grad_scale=1,
    parallel_output=True,
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYER,
    mlp_ratio=MLP_RATIO,
    apply_post_layer_norm=False,
    dtype="torch.bfloat16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
    norm_type="rmsnorm",
    layer_norm_epsilon=1e-5,
    use_flash_attn=True,
    num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
)

The field model_type specifies which registered model type is to be initialized.
The parameters in the field model specify the configuration settings used during model initialization.
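
A few quantities implied by the example configuration can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, rounding the MLP width up to a multiple of 256 is an assumption rather than something stated in this document, and the mapping from `checkpoint` to a layer count is one plausible reading of the True/False/[0-1] comment.

```python
import math

NUM_ATTENTION_HEAD = 32
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3

# Per-head dimension: the hidden size is split evenly across attention heads.
assert HIDDEN_SIZE % NUM_ATTENTION_HEAD == 0
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEAD


def mlp_hidden_size(hidden_size, mlp_ratio, multiple_of=256):
    # Assumption: the raw width is rounded up to a multiple of `multiple_of`.
    raw = hidden_size * mlp_ratio
    return math.ceil(raw / multiple_of) * multiple_of


def checkpointed_layers(checkpoint, num_layers):
    # Hypothetical interpretation: True means all layers, False means none,
    # and a float in [0, 1] means that fraction of the layers.
    if isinstance(checkpoint, bool):
        ratio = 1.0 if checkpoint else 0.0
    else:
        ratio = float(checkpoint)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("checkpoint must be True, False, or a float in [0, 1]")
    return int(num_layers * ratio)


mlp_hidden = mlp_hidden_size(HIDDEN_SIZE, MLP_RATIO)
print(head_dim, mlp_hidden)
print(checkpointed_layers(False, NUM_LAYER), checkpointed_layers(0.5, NUM_LAYER), checkpointed_layers(True, NUM_LAYER))
```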

Users can also define a new model type and register its initialization function with the decorator @MODEL_INITIALIZER.register_module, where MODEL_INITIALIZER is an instance of the class internlm.util.registry.Registry. An example is shown below.

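MODEL_TYPE = "NEW_MODEL"

# MODEL_INITIALIZER is the Registry instance described above.
@MODEL_INITIALIZER.register_module(module_name=MODEL_TYPE)
def build_new_model_with_cfg(*args, **kwargs):
    # Build and return the new model here.
    ...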
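
The registration pattern can be illustrated with a small self-contained registry. This is a sketch of the general decorator-registry technique, not InternEvo's actual Registry implementation: the `Registry` class below, its `get_module` method, and the dictionary returned by the builder are assumptions made for the example.

```python
class Registry:
    """Toy registry mapping a name to a registered callable."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, module_name):
        # Returns a decorator that stores the function under module_name.
        def decorator(func):
            if module_name in self._modules:
                raise KeyError(f"{module_name} is already registered in {self.name}")
            self._modules[module_name] = func
            return func

        return decorator

    def get_module(self, module_name):
        if module_name not in self._modules:
            raise KeyError(f"{module_name} is not registered in {self.name}")
        return self._modules[module_name]


MODEL_INITIALIZER = Registry("model_initializer")
MODEL_TYPE = "NEW_MODEL"


@MODEL_INITIALIZER.register_module(module_name=MODEL_TYPE)
def build_new_model_with_cfg(*args, **kwargs):
    # Placeholder: return the settings instead of constructing a real model.
    return {"model_type": MODEL_TYPE, "kwargs": kwargs}


# Look up the builder by the model_type string and call it with config values.
builder = MODEL_INITIALIZER.get_module(MODEL_TYPE)
model = builder(hidden_size=4096, num_layers=32)
print(model)
```

Keying the lookup on a string lets the config file select the model through model_type alone, without the training script importing each model builder by name.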
