Monitor and Alert
Monitoring
InternEvo calls internlm.monitor.monitor.initialize_monitor_manager() to initialize the monitoring context. While that context is active, a singleton internlm.monitor.monitor.MonitorManager manages the monitoring thread and tracks training status through internlm.monitor.monitor.MonitorTracker.
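The singleton-plus-thread arrangement described here can be illustrated with a minimal, self-contained sketch. It does not reproduce InternEvo's implementation: MonitorManagerSketch, monitor_context, the checks counter, and the interval parameter are all hypothetical names chosen for illustration.

```python
import threading
from contextlib import contextmanager


class MonitorManagerSketch:
    """Hypothetical stand-in for a singleton that owns one monitoring thread."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so every caller shares a single instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._stop = threading.Event()
                    instance._thread = None
                    instance.checks = 0
                    cls._instance = instance
        return cls._instance

    def _loop(self, interval):
        # Run one check immediately, then repeat until a stop is requested.
        while True:
            self.checks += 1  # placeholder for real training-status checks
            if self._stop.wait(interval):
                break

    def start(self, interval=0.01):
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._loop, args=(interval,), daemon=True
        )
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


@contextmanager
def monitor_context(interval=0.01):
    """Start monitoring on entry and always stop the thread on exit."""
    manager = MonitorManagerSketch()
    manager.start(interval)
    try:
        yield manager
    finally:
        manager.stop()
```

Wrapping the training loop in `with monitor_context(): ...` keeps the thread's lifetime tied to the training run, so the thread is stopped even if training raises an exception.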
Alerting
The InternEvo monitoring thread periodically checks for loss spikes, potentially stuck training, runtime exceptions, and SIGTERM signals. When any of these conditions occurs, an alert is triggered and a message is sent to the Feishu webhook address by calling internlm.monitor.alert.send_feishu_msg_with_webhook().
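As a rough illustration of two of these checks, the sketch below flags a loss spike and a potentially stuck job, then builds a text message body. The detection rules, the ratio and timeout_s thresholds, and all function names are assumptions made for this example, not InternEvo's actual logic. The payload follows the plain-text shape accepted by Feishu custom-bot webhooks; the body that send_feishu_msg_with_webhook() actually sends may differ.

```python
import time


def is_loss_spike(previous_loss, current_loss, ratio=1.5):
    """Assumed rule: a spike is a step-to-step increase beyond `ratio`."""
    return previous_loss > 0 and current_loss > previous_loss * ratio


def is_stuck(last_step_time, now=None, timeout_s=1800.0):
    """Assumed rule: training is stuck if no step finished within `timeout_s`."""
    now = time.time() if now is None else now
    return now - last_step_time > timeout_s


def build_feishu_text_payload(title, message):
    """Build a plain-text JSON body for a Feishu custom-bot webhook."""
    return {"msg_type": "text", "content": {"text": f"{title}\n{message}"}}


if __name__ == "__main__":
    if is_loss_spike(previous_loss=2.0, current_loss=3.5):
        print(build_feishu_text_payload("Loss spike", "loss rose from 2.0 to 3.5"))
```

The payload builder only constructs the dictionary; posting it to a webhook is left out so the sketch has no network dependency.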
Light Monitoring
The InternEvo light monitoring tool uses a heartbeat mechanism to monitor training metrics in real time, such as loss, grad_norm, and the duration of each training phase. InternEvo can also present these metrics on a Grafana dashboard, which lets users analyze a training run in more depth and in a more intuitive way.
Light monitoring is configured through the monitor field in the configuration file, and users change the monitoring settings by editing that file. Here is an example of a monitoring configuration:
monitor = dict(
    alert=dict(
        enable_feishu_alert=False,
        feishu_alert_address=None,
        light_monitor_address=None,
    ),
)
enable_feishu_alert: Whether to enable Feishu alerts. Default: False.
feishu_alert_address: The webhook address for Feishu alerts. Default: None.
light_monitor_address: The address of the light monitoring server. Default: None.
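For comparison with the defaults, a configuration with both alerting and light monitoring switched on might look like the following sketch. Both addresses are placeholders, not real endpoints, and the exact format expected for light_monitor_address is not specified here, so the value shown is only an assumption.

```python
# Placeholder addresses: substitute the webhook and monitoring server
# of your own deployment. The address formats shown are assumptions.
monitor = dict(
    alert=dict(
        enable_feishu_alert=True,
        feishu_alert_address="https://example.com/feishu-webhook-placeholder",
        light_monitor_address="example.com:8080",
    ),
)
```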
InternEvo uses internlm.monitor.alert.initialize_light_monitor to initialize the light monitoring client, which then establishes a connection with the monitoring server. During training, InternEvo uses internlm.monitor.alert.send_heartbeat to send various types of heartbeat messages to the monitoring server. The monitoring server uses these heartbeat messages to detect whether training has encountered an abnormality and sends alert messages as needed.
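The heartbeat flow can be pictured with a small stand-alone sketch. It is not the InternEvo client: HeartbeatClientSketch, its emit method, the message fields, and the injected send_fn transport are all hypothetical, chosen so the example runs without a monitoring server.

```python
import json
import time


class HeartbeatClientSketch:
    """Hypothetical client that serializes heartbeats and hands them to a transport."""

    def __init__(self, send_fn, job_id):
        # send_fn is any callable that accepts one JSON string,
        # for example a function that posts it to a monitoring server.
        self._send = send_fn
        self._job_id = job_id

    def emit(self, msg_type, data):
        """Send one heartbeat of the given type and return the message dict."""
        message = {
            "job_id": self._job_id,
            "type": msg_type,
            "timestamp": time.time(),
            "data": data,
        }
        self._send(json.dumps(message))
        return message


if __name__ == "__main__":
    outbox = []  # in-memory transport standing in for the server connection
    client = HeartbeatClientSketch(outbox.append, job_id="demo-job")
    client.emit("train_metrics", {"step": 10, "loss": 2.31, "grad_norm": 0.87})
    print(outbox[0])
```

Injecting the transport keeps the message format separate from how it is delivered, which also makes the client easy to exercise in tests.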