Profiler
Torch Profiler
InternLM uses internlm.train.initialize_llm_profile() to collect performance data, including execution time and a breakdown of the time spent in each training step. The implementation is based on torch.profiler, and the output trace files can be visualized with TensorBoard.
To use the Torch Profiler, enable profiling by passing the --profiling flag when starting training. When profiling completes, the results are in the {JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{} folder.
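The folder naming pattern above can be sketched as a small helper. This is only an illustration: the function name trace_dir and its arguments are hypothetical and are not part of InternLM.

```python
from pathlib import PurePosixPath


def trace_dir(job_name: str, start_time: str, rank: int, dp: int, tp: int, pp: int) -> PurePosixPath:
    """Build a trace folder path following the documented pattern.

    Hypothetical helper for illustration only; not an InternLM API.
    """
    # {JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}
    return PurePosixPath(job_name) / start_time / "traces" / f"rank{rank}_dp{dp}_tp{tp}_pp{pp}"


print(trace_dir("7b_train", "Sep08_11-00-51", rank=0, dp=0, tp=0, pp=0))
# prints: 7b_train/Sep08_11-00-51/traces/rank0_dp0_tp0_pp0
```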
The directory structure of Torch Profiler generated files is as follows:
# tree ./7b_train/Sep08_11-00-51/traces -L 2
./7b_train/Sep08_11-00-51/traces/
└── rank0_dp0_tp0_pp0
    └── SH-IDC1-10-140-1-78_238619.1694142354680.pt.trace.json
The traces can be visualized with TensorBoard by running the following command:
# visualize traces with tensorboard and custom port
tensorboard --logdir rank0_dp0_tp0_pp0 --port 10088
In TensorBoard, open PyTorch Profiler -> Views -> Trace to see the timeline of profiled operators and GPU kernels. For more details, please refer to torch profiler with tensorboard.
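A trace file can also be inspected without TensorBoard. The sketch below assumes the .pt.trace.json file follows the Chrome trace event format, with a top-level "traceEvents" list whose complete events ("ph" equal to "X") carry a "dur" field in microseconds. The function name top_events is hypothetical.

```python
import json
from collections import defaultdict


def top_events(trace_path: str, k: int = 10) -> list[tuple[str, float]]:
    """Return the k event names with the largest summed duration (microseconds).

    Hypothetical helper; assumes a Chrome-trace-style JSON file.
    """
    with open(trace_path, encoding="utf-8") as f:
        trace = json.load(f)

    totals: dict[str, float] = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event that stores its own duration in "dur".
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]

    # Sort by total duration, largest first, and keep the top k entries.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]
```

For example, calling top_events on the trace file inside rank0_dp0_tp0_pp0 returns (name, total duration) pairs. Durations are summed per event name, so nested events are counted at every level they appear.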
Memory Profiler
InternEvo provides internlm.utils.simple_memory_profiler.SimpleMemoryProfiler as a practical way to monitor actual GPU memory usage. The implementation accounts for both model data (model parameters, model gradients, and optimizer states) and non-model data (activations).
To use the memory profiler, enable profiling by passing the --profiling flag when starting training. When memory profiling completes, the results for each rank device (logs of memory usage at different time points and sunburst charts showing overall memory usage) are in the memory_trace/rank{}_dp{}_tp{} folder.
The directory structure of memory_trace generated files is as follows:
# tree ./memory_trace -L 2
./memory_trace
├── rank0_dp0_tp0 # Profiling results for a specific rank device
│ ├── activation_memory_sunburst.html # Sunburst chart showing activation memory usage
│ ├── grads_memory_sunburst.html # Sunburst chart showing gradient memory usage
│ ├── memory.log # Log of GPU memory usage at different time points
│ ├── os_memory_sunburst.html # Sunburst chart showing optimizer state memory usage
│ ├── params_memory_sunburst.html # Sunburst chart showing parameter memory usage
│ └── summary_sunburst.html # Sunburst chart showing overall memory usage
├── rank1_dp1_tp0
│ ├── activation_memory_sunburst.html
│ ├── grads_memory_sunburst.html
│ ├── memory.log
│ ├── os_memory_sunburst.html
│ ├── params_memory_sunburst.html
│ └── summary_sunburst.html
├── rank2_dp2_tp0
│ ├── activation_memory_sunburst.html
│ ├── grads_memory_sunburst.html
│ ├── memory.log
│ ├── os_memory_sunburst.html
│ ├── params_memory_sunburst.html
│ └── summary_sunburst.html
├── rank3_dp3_tp0
│ ├── activation_memory_sunburst.html
│ ├── grads_memory_sunburst.html
│ ├── memory.log
│ ├── os_memory_sunburst.html
│ ├── params_memory_sunburst.html
│ └── summary_sunburst.html
├── rank4_dp4_tp0
│ ├── activation_memory_sunburst.html
│ ├── grads_memory_sunburst.html
│ ├── memory.log
│ ├── os_memory_sunburst.html
│ ├── params_memory_sunburst.html
│ └── summary_sunburst.html
├── rank5_dp5_tp0
│ ├── activation_memory_sunburst.html
│ ├── grads_memory_sunburst.html
│ ├── memory.log
│ ├── os_memory_sunburst.html
│ ├── params_memory_sunburst.html
│ └── summary_sunburst.html
├── rank6_dp6_tp0
│ ├── activation_memory_sunburst.html
│ ├── grads_memory_sunburst.html
│ ├── memory.log
│ ├── os_memory_sunburst.html
│ ├── params_memory_sunburst.html
│ └── summary_sunburst.html
└── rank7_dp7_tp0
  ├── activation_memory_sunburst.html
  ├── grads_memory_sunburst.html
  ├── memory.log
  ├── os_memory_sunburst.html
  ├── params_memory_sunburst.html
  └── summary_sunburst.html
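To collect the per-rank logs from a layout like the one above, a short script can walk the memory_trace folder. This is a minimal sketch: the function name find_memory_logs is hypothetical, and it assumes every rank folder is named rank{}_dp{}_tp{} and contains a memory.log file.

```python
import re
from pathlib import Path

# Matches folder names such as "rank3_dp3_tp0".
RANK_DIR = re.compile(r"^rank(\d+)_dp(\d+)_tp(\d+)$")


def find_memory_logs(root: str) -> dict[tuple[int, int, int], Path]:
    """Map (rank, dp, tp) to the memory.log path found under root.

    Hypothetical helper; folders that do not match the pattern are skipped.
    """
    logs: dict[tuple[int, int, int], Path] = {}
    for child in sorted(Path(root).iterdir()):
        match = RANK_DIR.match(child.name)
        log = child / "memory.log"
        if match and log.is_file():
            rank, dp, tp = (int(group) for group in match.groups())
            logs[(rank, dp, tp)] = log
    return logs
```

Calling find_memory_logs("./memory_trace") on the tree above would return eight entries, one per rank folder.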
An example of memory.log is as follows:
Memory State:
time: 37.56313228607178
---summary---
total_memory: 55953.56 MB
params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 17638.00 MB
Memory State:
time: 38.46969723701477
---summary---
total_memory: 38315.56 MB
params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 0.00 MB
---Layout---
params_layout:
layer: param_mem, layer_mem: 0.00 MB, total_mem: 13965.51 MB
layer: param_mem.embedding, layer_mem: 0.00 MB, total_mem: 806.00 MB
layer: param_mem.embedding.weight, layer_mem: 806.00 MB, total_mem: 806.00 MB
layer: param_mem.blocks, layer_mem: 0.00 MB, total_mem: 12353.50 MB
layer: param_mem.blocks.0, layer_mem: 0.00 MB, total_mem: 386.05 MB
layer: param_mem.blocks.0.mixer, layer_mem: 0.00 MB, total_mem: 128.03 MB
layer: param_mem.blocks.0.mixer.Wqkv, layer_mem: 0.00 MB, total_mem: 96.02 MB
layer: param_mem.blocks.0.mixer.Wqkv.weight, layer_mem: 96.00 MB, total_mem: 96.00 MB
layer: param_mem.blocks.0.mixer.Wqkv.bias, layer_mem: 0.02 MB, total_mem: 0.02 MB
layer: param_mem.blocks.0.mixer.out_proj, layer_mem: 0.00 MB, total_mem: 32.01 MB
layer: param_mem.blocks.0.mixer.out_proj.weight, layer_mem: 32.00 MB, total_mem: 32.00 MB
layer: param_mem.blocks.0.mixer.out_proj.bias, layer_mem: 0.01 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm1, layer_mem: 0.00 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm1.weight, layer_mem: 0.01 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm2, layer_mem: 0.00 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm2.weight, layer_mem: 0.01 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.mlp, layer_mem: 0.00 MB, total_mem: 258.00 MB
layer: param_mem.blocks.0.mlp.w1, layer_mem: 0.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w1.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w2, layer_mem: 0.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w2.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w3, layer_mem: 0.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w3.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
......
grads_layout:
......
os_params_layout:
......
os_state_layout:
......
activation_base_layout:
......
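The summary sections of memory.log can be checked with a few lines of code. The sketch below assumes the log text has the layout shown in the example above, with the total on the line after "---summary---" and the per-category values on the line after that. The names parse_summaries and components_match_total are hypothetical.

```python
import re

# Captures pairs such as ("params_memory", "13965.51") from "params_memory: 13965.51 MB".
FIELD = re.compile(r"(\w+): ([\d.]+) MB")

SAMPLE_LOG = """\
Memory State:
time: 37.56313228607178
---summary---
total_memory: 55953.56 MB
params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 17638.00 MB
Memory State:
time: 38.46969723701477
---summary---
total_memory: 38315.56 MB
params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 0.00 MB
"""


def parse_summaries(log_text: str) -> list[dict[str, float]]:
    """Return one dict of memory values (MB) per "---summary---" section."""
    lines = log_text.splitlines()
    summaries = []
    for i, line in enumerate(lines):
        if line.strip() == "---summary---":
            # The two lines after the marker hold the total and the per-category values.
            fields = FIELD.findall(" ".join(lines[i + 1 : i + 3]))
            summaries.append({name: float(value) for name, value in fields})
    return summaries


def components_match_total(summary: dict[str, float], tolerance: float = 0.05) -> bool:
    """Check that the categories add up to total_memory, allowing for rounding."""
    parts = sum(value for name, value in summary.items() if name != "total_memory")
    return abs(parts - summary["total_memory"]) <= tolerance


summaries = parse_summaries(SAMPLE_LOG)
print(all(components_match_total(s) for s in summaries))  # prints: True
```

In the example log, total_memory drops from 55953.56 MB to 38315.56 MB between the two snapshots. The difference, 17638.00 MB, equals the activation_memory reported in the first snapshot.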
Figure: an example sunburst chart of model parameter memory usage.
- class internlm.utils.simple_memory_profiler.SimpleMemoryProfiler(model: Module, optimizer: Optimizer, log_folder: str, total_steps: int = 5)
A memory profiler for an LLM.
- Parameters:
model (torch.nn.Module) – The model to profile.
optimizer (torch.optim.Optimizer) – The optimizer used for training the model.
log_folder (str) – The folder to write the memory state information to.
total_steps (int) – The number of steps to trace. Defaults to 5.