Commit 8ef11c7

feat: add autoware_node_death_monitor package for monitoring node crashes
Signed-off-by: Kyoichi Sugahara <kyoichi.sugahara@tier4.jp>
1 parent c5f0a24 commit 8ef11c7

7 files changed: +526 -0 lines changed

@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.14)
project(autoware_node_death_monitor)

find_package(autoware_cmake REQUIRED)
autoware_package()

ament_auto_add_library(${PROJECT_NAME} SHARED
  src/autoware_node_death_monitor.cpp
)

rclcpp_components_register_node(${PROJECT_NAME}
  PLUGIN "autoware::node_death_monitor::NodeDeathMonitor"
  EXECUTABLE ${PROJECT_NAME}_node)

ament_auto_package(INSTALL_TO_SHARE
  config
  launch
)
@@ -0,0 +1,73 @@
# autoware_node_death_monitor

This package provides a monitoring node that detects ROS 2 node crashes by analyzing `launch.log` files, rather than subscribing to `/rosout` logs.

---

## Overview

- **Node name**: `autoware_node_death_monitor`
- **Monitored file**: `launch.log`
- **Detected event**: Looks for lines containing the substring `"process has died"` and extracts the node name and exit code.

When a crash or unexpected shutdown occurs, `ros2 launch` typically outputs a line in `launch.log` such as:

```text
[ERROR] [node_name-1]: process has died [pid 12345, exit code 139, cmd '...']
```

The `autoware_node_death_monitor` node continuously reads the latest `launch.log` file, detects these messages, and logs a warning or marks the node as "dead."
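
As a rough illustration of the detection step, the sketch below extracts the node name and exit code from a line shaped like the example above. It uses only the C++ standard library. The `DeathInfo` struct, the `parse_death_line` function, and the regular expression are hypothetical and are not taken from the package source, which may parse the line differently.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for this sketch (not part of the package).
struct DeathInfo
{
  std::string node_name;  // e.g. "node_name-1"
  int exit_code;          // e.g. 139
};

// Returns the node name and exit code if the line reports a dead process.
std::optional<DeathInfo> parse_death_line(const std::string & line)
{
  // Cheap substring check first, mirroring the README's description.
  if (line.find("process has died") == std::string::npos) {
    return std::nullopt;
  }
  // Assumed line shape: "[name-N]: process has died [pid P, exit code C, ..."
  static const std::regex pattern(
    R"(\[([^\]\s]+)\]: process has died \[pid \d+, exit code (-?\d+))");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  return DeathInfo{match[1].str(), std::stoi(match[2].str())};
}
```

Calling `parse_death_line` on the example line above would yield the name `node_name-1` and the exit code `139`; lines without the `"process has died"` substring yield an empty optional.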

---

## How it Works

1. **Find `launch.log`**:
   - First, checks the `ROS_LOG_DIR` environment variable.
   - If not set, falls back to `~/.ros/log`.
   - Identifies the latest log directory based on modification time.
2. **Monitor `launch.log`**:
   - Reads the file from the last known position to detect new log entries.
   - Looks for lines containing `"process has died"`.
   - Extracts the node name and exit code.
3. **Filtering**:
   - **Ignored node names**: Nodes matching patterns in `ignore_node_names` are skipped.
   - **Ignored exit codes**: Logs with ignored exit codes are not flagged as errors.
4. **Regular Updates**:
   - A timer periodically reads new entries from `launch.log`.
   - Dead nodes are currently reported in the logs; this will be changed to publish diagnostics.
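
Step 1 above could be sketched with `std::filesystem` roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `resolve_log_root` and `find_latest_log_dir` are hypothetical names, and the sketch assumes each launch writes its `launch.log` into its own subdirectory of the log root.

```cpp
#include <cstdlib>
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Resolve the log root: ROS_LOG_DIR if set and non-empty, otherwise ~/.ros/log.
fs::path resolve_log_root()
{
  if (const char * dir = std::getenv("ROS_LOG_DIR"); dir != nullptr && *dir != '\0') {
    return fs::path(dir);
  }
  const char * home = std::getenv("HOME");
  return fs::path(home != nullptr ? home : "") / ".ros" / "log";
}

// Return the most recently modified subdirectory of `root`, if there is one.
std::optional<fs::path> find_latest_log_dir(const fs::path & root)
{
  std::error_code ec;
  std::optional<fs::path> latest;
  fs::file_time_type latest_time{};
  // If `root` does not exist, the iterator is empty and nothing is returned.
  for (const auto & entry : fs::directory_iterator(root, ec)) {
    if (!entry.is_directory(ec)) {
      continue;
    }
    const auto mtime = entry.last_write_time(ec);
    if (ec) {
      continue;  // skip entries whose timestamp cannot be read
    }
    if (!latest || mtime > latest_time) {
      latest = entry.path();
      latest_time = mtime;
    }
  }
  return latest;
}
```

A caller would then append `"launch.log"` to the returned directory and check that the file exists before monitoring it.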

---

## Parameters

| Parameter Name      | Type       | Default           | Description                                                                    |
| ------------------- | ---------- | ----------------- | ------------------------------------------------------------------------------ |
| `ignore_node_names` | `string[]` | `[]` (empty list) | Node name patterns to ignore, e.g., `['rviz2']`.                               |
| `ignore_exit_codes` | `int[]`    | `[]` (empty list) | Exit codes to ignore, e.g., `0` (normal exit) or `130` (terminated by Ctrl+C). |
| `check_interval`    | `double`   | `1.0`             | Timer interval (seconds) for scanning the log file.                            |
| `enable_debug`      | `bool`     | `false`           | Enables debug logging for detailed output.                                     |

Example **`autoware_node_death_monitor.param.yaml`**:

```yaml
autoware_node_death_monitor:
  ros__parameters:
    ignore_node_names:
      - rviz2
      - teleop_twist_joy
    ignore_exit_codes:
      - 0
      - 130
    check_interval: 1.0
    enable_debug: false
```
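
To illustrate how the two ignore lists might be applied to a detected death event, here is a minimal standard-library sketch. The function name `should_ignore` is hypothetical, and treating each entry of `ignore_node_names` as a plain substring of the launch process name (for example `rviz2` matching `rviz2-3`) is an assumption; the README does not specify the matching rule.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Returns true if a death event should be skipped according to the ignore lists.
bool should_ignore(
  const std::string & node_name, std::int64_t exit_code,
  const std::vector<std::string> & ignore_node_names,
  const std::vector<std::int64_t> & ignore_exit_codes)
{
  // Assumption: a pattern matches when it is a substring of the process name.
  const bool name_ignored = std::any_of(
    ignore_node_names.begin(), ignore_node_names.end(),
    [&node_name](const std::string & pattern) {
      return !pattern.empty() && node_name.find(pattern) != std::string::npos;
    });
  // Exit codes are compared for exact equality.
  const bool code_ignored =
    std::find(ignore_exit_codes.begin(), ignore_exit_codes.end(), exit_code) !=
    ignore_exit_codes.end();
  return name_ignored || code_ignored;
}
```

With the example configuration above, a death of `rviz2-3` with any exit code, or of any node with exit code `130`, would be skipped, while `planner-1` exiting with `139` would still be reported (`planner-1` is a made-up name for illustration).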

---

## Limitations

- **To be written**: TBD.
- **Robust monitoring**: This node only analyzes `launch.log`; for more robust fault detection, use it alongside systemd, supervisord, or another process supervisor.

---
@@ -0,0 +1,18 @@
/**:
  ros__parameters:
    # Node names to exclude from monitoring (Note: be careful with the "[node_name-#]" format)
    # Example: Do not issue a warning if rviz2 crashes.
    ignore_node_names:
      - rviz2

    # Exit codes to exclude from monitoring (e.g., Ctrl+C)
    # Example: 0, 130 are considered normal exits and not treated as errors.
    ignore_exit_codes:
      - 0
      - 130

    # Check interval (seconds)
    check_interval: 1.0

    # Enable/disable debug output
    enable_debug: false
@@ -0,0 +1,72 @@
// Copyright 2025 Tier IV, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
#define AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_

#include "rclcpp/rclcpp.hpp"

#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>

namespace autoware::node_death_monitor
{

class NodeDeathMonitor : public rclcpp::Node
{
public:
  /**
   * @brief Constructor for NodeDeathMonitor
   * @param options Node options for configuration
   */
  explicit NodeDeathMonitor(const rclcpp::NodeOptions & options);

private:
  /**
   * @brief Read and process new content appended to launch.log
   */
  void read_launch_log_diff();

  /**
   * @brief Parse a single line from the log for process death information
   * @param line The log line to parse
   */
  void parse_log_line(const std::string & line);

  /**
   * @brief Timer callback to report and manage dead node list
   */
  void on_timer();

  // Map to track dead nodes: [node_name-#] -> true
  std::unordered_map<std::string, bool> dead_nodes_;

  rclcpp::TimerBase::SharedPtr timer_;

  // Launch log file path and read position
  std::filesystem::path launch_log_path_;
  size_t last_file_pos_{static_cast<size_t>(-1)};

  // Parameters
  std::vector<std::string> ignore_node_names_;  // Node names to exclude from monitoring
  std::vector<int64_t> ignore_exit_codes_;      // Exit codes to ignore (e.g., normal termination)
  double check_interval_{1.0};                  // Check interval in seconds
  bool enable_debug_{false};                    // Enable debug output
};

}  // namespace autoware::node_death_monitor

#endif  // AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
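
The `.cpp` that implements `read_launch_log_diff()` is not shown on this page, so the following is only a sketch of one way to read content appended to a file since a stored offset, in the spirit of `last_file_pos_`. The free function `read_new_lines` is hypothetical. In this sketch an offset larger than the file size (which includes a `static_cast<size_t>(-1)` sentinel) restarts reading from the beginning; whether the actual node treats its initial value that way is not confirmed by the header.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

// Read complete lines appended after `last_pos` and advance `last_pos` past them.
std::vector<std::string> read_new_lines(
  const std::filesystem::path & path, std::size_t & last_pos)
{
  std::vector<std::string> lines;
  std::error_code ec;
  const auto size = std::filesystem::file_size(path, ec);
  if (ec) {
    return lines;  // file missing or unreadable: nothing new
  }
  if (size < last_pos) {
    last_pos = 0;  // file was truncated or replaced (or offset is a sentinel)
  }
  std::ifstream file(path, std::ios::binary);
  if (!file) {
    return lines;
  }
  file.seekg(static_cast<std::streamoff>(last_pos));
  std::string line;
  while (std::getline(file, line)) {
    if (file.eof()) {
      break;  // trailing line has no newline yet; pick it up on a later call
    }
    lines.push_back(line);
    last_pos = static_cast<std::size_t>(file.tellg());
  }
  return lines;
}
```

Called from a periodic timer, each invocation returns only the lines completed since the previous call, which could then be passed one by one to a parser such as `parse_log_line`.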
@@ -0,0 +1,12 @@
<launch>
  <!-- Parameter -->
  <arg name="config_file" default="$(find-pkg-share autoware_node_death_monitor)/config/autoware_node_death_monitor.param.yaml"/>

  <!-- Set log level -->
  <arg name="log_level" default="info"/>

  <node pkg="autoware_node_death_monitor" exec="autoware_node_death_monitor_node" name="node_death_monitor" output="screen" args="--ros-args --log-level $(var log_level)">
    <!-- Parameter -->
    <param from="$(var config_file)"/>
  </node>
</launch>
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<package format="3">
  <name>autoware_node_death_monitor</name>
  <version>0.0.1</version>
  <description>The node_death_monitor package</description>

  <maintainer email="kyoichi.sugahara@tier4.jp">Kyoichi Sugahara</maintainer>
  <license>Apache License 2.0</license>

  <buildtool_depend>ament_cmake_auto</buildtool_depend>
  <buildtool_depend>autoware_cmake</buildtool_depend>

  <depend>rcl_interfaces</depend>
  <depend>rclcpp</depend>
  <depend>rclcpp_components</depend>

  <test_depend>ament_cmake_gtest</test_depend>
  <test_depend>ament_lint_auto</test_depend>

  <export>
    <build_type>ament_cmake</build_type>
  </export>
</package>
