
Interpreting "Monitor" Logs

2024-03-04

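A "Monitor" in Guance (观测云) is implemented in DataFlux Func as an "Auto-trigger Configuration" (自动触发配置). Viewing "Monitor" logs therefore means viewing the logs of the corresponding "Auto-trigger Configuration".

For how to view "Auto-trigger Configuration" logs, see Deployment and Maintenance / System Metrics and Task Records / Task Records.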

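1. Basic format

Each line of a "Monitor" log follows this format:

| Time | Time since the previous log line | Total time from task start to this line | Module | Content |
| --- | --- | --- | --- | --- |
| [03-25 11:04:05] | [+1ms] | [64ms] | 【函数】 | 调用函数:guance__api_impl.custom_check |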

A concrete example:

[03-25 11:04:05] [+1ms] [64ms] 【函数】 调用函数:guance__api_impl.custom_check

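In the log format without the total-time column, each line follows this format:

| Time | Time since the previous log line | Module | Content |
| --- | --- | --- | --- |
| [2024-03-06 20:58:05.088] | [+1ms] | 【函数】 | 调用函数:guance__api_impl.custom_check |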

A concrete example:

[2024-03-06 20:58:05.088] [+1ms] 【函数】 调用函数:guance__api_impl.custom_check
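
The two line formats above can be split into fields with one regular expression. The following is a minimal sketch, not part of DataFlux Func: the names `LINE_RE`, `LogLine` and `parse_line` are hypothetical, and the pattern assumes that time deltas are always printed in milliseconds, as in the samples shown here.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape (inferred from the samples above, not from a specification):
#   [time] [+delta ms] [total ms, optional] 【module, optional】 content
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+"
    r"\[\+(?P<delta>\d+)ms\]\s+"
    r"(?:\[(?P<total>\d+)ms\]\s+)?"
    r"(?:【(?P<module>[^】]+)】\s*)?"
    r"(?P<content>.*)$"
)


class LogLine(NamedTuple):
    time: str                 # timestamp text, kept as printed
    delta_ms: int             # time since the previous log line
    total_ms: Optional[int]   # None in the format without the total-time column
    module: Optional[str]     # None when the line has no 【...】 prefix
    content: str


def parse_line(line: str) -> Optional[LogLine]:
    """Split one log line into fields; return None for continuation lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    total = m.group("total")
    return LogLine(
        time=m.group("time"),
        delta_ms=int(m.group("delta")),
        total_ms=int(total) if total is not None else None,
        module=m.group("module"),
        content=m.group("content"),
    )
```

Lines that do not start with a bracketed timestamp, such as the body of a multi-line template, return `None` and can be appended to the previous entry.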

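2. Log reduction option

As monitor business logic has grown more complex, the logs have become longer and harder to read. Therefore, since the Guance 2024-03-27 iteration, monitors output reduced logs by default; these still contain what is needed for basic troubleshooting.

To output the full logs, create the following environment variable to turn on the "Detailed Guance Log" option:

| Environment variable | Value type | Values |
| --- | --- | --- |
| ENABLE_DETAILED_GUANCE_LOG | Boolean | Enable: true; disable: false |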

3. Fixed-format log blocks

Certain operations output log blocks in a fixed format.

3.1 DQL query logs

When a monitor needs to run DQL, it calls the API of the Kodo component. Every DQL query is logged, for example:

[03-25 11:04:05] [+0ms] [67ms] 【KODO】 执行 DQL 查询 -> 时间范围:2024-03-25 10:55:00 ~ 2024-03-25 11:01:00,最多 5 页
[03-25 11:04:05] [+0ms] [67ms] 【KODO】 --> 第 1 页(soffset = 0 ~ 500)
[03-25 11:04:05] [+0ms] [68ms] 【KODO】 调用 KODO API -> POST /v1/query

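The log records:

  • The time range of the DQL query
  • The method and path of the API called

In the log format without the total-time column, the same kind of query is logged as:
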
[2024-03-06 20:58:05.092] [+0ms] 【KODO】 执行DQL查询
[2024-03-06 20:58:05.092] [+0ms] 【KODO】 --> 最多翻页:20 页
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 --> 时间范围:2024-03-06 20:51:00 ~ 2024-03-06 20:55:00
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 --> 第1页(soffset = 0 ~ 500)
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 调用KODO API
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 >> 请求:POST /v1/query
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 >>>> Body:{"echo_explain":false,"queries":[ ... ],"workspace_uuid":"wksp_xxxxx"}
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 >> 首次请求
[2024-03-06 20:58:05.111] [+18ms] 【KODO】 >> 响应:`200 OK` => `{"content":[ ... ]}`

The log records:

  • The time range of the DQL query
  • The method, path, and request body of the API called
  • The raw response of the Kodo API
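
The time range printed in the log and the `time_range` field of the request body describe the same interval, once as local time and once as epoch milliseconds. The sketch below converts one into the other and derives the `soffset` range of a page. It is an illustration only: `format_time_range` and `page_offsets` are hypothetical helpers, the UTC+8 display zone is inferred from the samples in this document, and the offsets for pages after the first are an assumption based on `slimit` being 500.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log prints times in UTC+8, which matches the samples here.
DISPLAY_TZ = timezone(timedelta(hours=8))


def format_time_range(time_range_ms):
    """Render a [start_ms, end_ms] pair as the 'start ~ end' text seen in the log."""
    start, end = (
        datetime.fromtimestamp(ms / 1000, tz=DISPLAY_TZ).strftime("%Y-%m-%d %H:%M:%S")
        for ms in time_range_ms
    )
    return f"{start} ~ {end}"


def page_offsets(page, slimit=500):
    """Return the (start, end) soffset pair for a 1-based page number."""
    start = (page - 1) * slimit
    return start, start + slimit


# time_range taken from the first request body of the complete example in section 4
print(format_time_range([1709729460000, 1709729700000]))
# -> 2024-03-06 20:51:00 ~ 2024-03-06 20:55:00
print(page_offsets(1))
# -> (0, 500)
```

This makes it easy to check that a query actually covered the interval the monitor was expected to inspect.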

3.2 Guance Studio Inner API call logs

When a monitor needs Guance business data, it calls the Inner API of Guance Studio. Every Inner API call is logged, for example:

[03-25 11:04:05] [+0ms] [172ms] 【Studio】 调用 Studio Inner API -> GET /api/v1/inner/alert_opt/get

The log records:

  • The method and path of the API called

In the log format without the total-time column, the same kind of call is logged as:

[2024-03-06 20:58:05.169] [+0ms] 【Studio】 调用 Studio Inner API
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 >> 请求:GET /api/v1/inner/alert_opt/get
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 >>>> Query:{"checkerUUID":"rul_xxxxx","workspaceUUID":"wksp_xxxxx"}`
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 >> 首次请求
[2024-03-06 20:58:05.177] [+7ms] 【Studio】 >> 响应:`200 OK` => `{"code":200,"content":{"data":{ ... }},"errorCode":"","message":"","success":true,"traceId":"TRACE-XXXXX"}`

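The log records:

  • The method, path, and request data of the API called (for a GET request, the query parameters)
  • The raw response of the Guance Studio Inner API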
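
When the raw response is present in the log, it can be extracted and decoded for inspection. The following is a rough sketch under the assumption that the response line always has the shape shown above (status and body each wrapped in backticks); `RESPONSE_RE` and `parse_response` are hypothetical names, and the sample line is copied from the complete example in section 4.

```python
import json
import re

# Assumed shape of the content part: >> 响应:`<status>` => `<JSON body>`
RESPONSE_RE = re.compile(r">> 响应:`(?P<status>[^`]+)` => `(?P<body>.*)`\s*$")


def parse_response(content):
    """Return status code, reason and decoded JSON body, or None if not a response line."""
    m = RESPONSE_RE.search(content)
    if m is None:
        return None
    status_code, _, reason = m.group("status").partition(" ")
    return {
        "status_code": int(status_code),
        "reason": reason,
        # The body is matched greedily so that backticks inside it (for example DQL) are kept.
        "body": json.loads(m.group("body")),
    }


sample = (
    '[2024-03-06 20:58:05.123] [+9ms] 【Studio】 >> 响应:`200 OK` => '
    '`{"code":200,"content":{},"errorCode":"","message":"","success":true,'
    '"traceId":"TRACE-3645139A-BC9D-48E7-A17F-3CC93C51E650"}`'
)
response = parse_response(sample)
print(response["status_code"], response["body"]["success"])
```

The `traceId` field in the decoded body is a convenient value to search for when following a single call across logs.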

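4. Walkthrough of a complete example

In the logs below, lines starting with # are explanations; all other lines are the original log text.

The logs below reflect the version current when this article was published. As the product keeps iterating, the wording of the logs may differ slightly.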

# "Task Scheduling" usage count in Guance
[04-01 03:43:00] [+100ms] [100ms] 【使用额度】 数据查询范围为 15 分钟,尚未超过 15 分钟,不需要额外计量
[04-01 03:43:00] [+0ms] [100ms] 【使用额度】 存在`workspace_uuid`参数,值为 wksp_xxxxx ,需要计量 1 次

# Information about the current workspace
[04-01 03:43:00] [+1ms] [101ms] 【Studio】 工作空间信息(从缓存获取):{"declaration":{"b":["asfawfgajfasfafgafwba","asfgahjfaf"],"business":"aaa","organization":"64fe7b4062f74d0007b46676"},"isJobDisabled":false,"isSMSDisabled":false,"language":"zh","name":"【Doris】开发测试一起用_","token":"tkn_xxxxx"}

# ID of the function executed by this task, and its argument list
[04-01 03:43:00] [+1ms] [103ms] 【函数】 调用函数:guance__api_impl.custom_check
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`checker`=`"custom_metric"`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`kwargs`=`{"version":"v2"}`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`targets`=`[{"alias":"Result","dql":"M::`fake_data_for_test`:(avg(`field_int`)) { `tag` = 'fake-data-1' } BY `tag`","queryType":"dql","range":900}]`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`channels`=`["chan_xxxxx"]`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`extra_data`=`{"type":"simpleCheck"}`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`checker_opt`=`{"id":"rul_xxxxx","infoEvent":false,"label":["xxxxx_test"],"message":"内容:xxxxx-监控器(单个){{df_dimension_tags}}\n1: {{ (Result * 100) | to_int }}\n2: {{ Result | to_int * 100 }}","name":"标题:xxxxx-监控器(M, 单个){{df_dimension_tags}}","noDataAction":"noData","noDataInterval":120,"noDataMessage":"","noDataTitle":"","recoverInterval":120,"rules":[{"conditionLogic":"and","conditions":[{"alias":"Result","operands":["0"],"operator":">="}],"status":"critical"}],"title":"标题:xxxxx-监控器(M, 单个){{df_dimension_tags}}"}`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`monitor_opt`=`{"id":"monitor_xxxxx","name":"default"}`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`workspace_uuid`=`"wksp_xxxxx"`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`workspace_token`=`"tkn_xxxxx"`
[04-01 03:43:00] [+0ms] [103ms] 【函数】 --> 参数:`disable_check_end_time`=`false`
[04-01 03:43:00] [+0ms] [104ms] 【函数】 --> 参数:`at_accounts`=`null`
[04-01 03:43:00] [+0ms] [104ms] 【函数】 --> 参数:`at_accounts_nodata`=`null`

# Frequency configuration of the monitor
[04-01 03:43:00] [+2ms] [106ms] 【监控器】 根据实际 Crontab(*/1 * * * *)计算检测间隔
[04-01 03:43:00] [+0ms] [106ms] 【监控器】 --> 检测间隔:60 秒

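# Based on the user-configured no-data range, query data for two time ranges: the most recent no-data range and the period before it (2 DQL queries)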
[04-01 03:43:00] [+0ms] [106ms] 【监控器】 ----------------- 加载断档 / 新增对象信息 ------------------
[04-01 03:43:00] [+0ms] [106ms] 【监控器】 已配置 120 秒无数据范围
[04-01 03:43:00] [+0ms] [106ms] 【监控器】 查询上轮数据:T - (检测频率 60 秒) - (无数据范围 120 秒) - (3 倍无数据范围冗余 360 秒) ~ T - (无数据范围 120 秒)
[04-01 03:43:00] [+0ms] [106ms] 【KODO】 执行 DQL 查询 -> 时间范围:2024-04-01 03:33:00 ~ 2024-04-01 03:40:00,最多 20 页
[04-01 03:43:00] [+0ms] [106ms] 【KODO】 --> 第 1 页(soffset = 0 ~ 500)
[04-01 03:43:00] [+0ms] [106ms] 【KODO】 调用 KODO API -> POST /v1/query
[04-01 03:43:00] [+14ms] [121ms] 【Studio】 指标单位(从缓存获取):wksp_xxxxx/fake_data_for_test => {"_DFF_CACHE_EXPIRE_TIME":1711914240}
[04-01 03:43:00] [+0ms] [121ms] 【KODO】 --> DQL 结果数据拆包:{"metric_units":{"field_int":null},"query_time_range":[1711913580000,1711914000000],"series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[["2024-03-31T19:39:50Z",56.08730158730159]]}]}
[04-01 03:43:00] [+0ms] [121ms] 【监控器】 查询本轮数据:T - (无数据范围 120 秒) ~ T
[04-01 03:43:00] [+0ms] [122ms] 【KODO】 执行 DQL 查询 -> 时间范围:2024-04-01 03:40:00 ~ 2024-04-01 03:42:00,最多 20 页
[04-01 03:43:00] [+0ms] [122ms] 【KODO】 --> 第 1 页(soffset = 0 ~ 500)
[04-01 03:43:00] [+0ms] [122ms] 【KODO】 调用 KODO API -> POST /v1/query
[04-01 03:43:00] [+18ms] [140ms] 【Studio】 指标单位(从缓存获取):wksp_xxxxx/fake_data_for_test => {"_DFF_CACHE_EXPIRE_TIME":1711914240}
[04-01 03:43:00] [+0ms] [140ms] 【KODO】 --> DQL 结果数据拆包:{"metric_units":{"field_int":null},"query_time_range":[1711914000000,1711914120000],"series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[["2024-03-31T19:41:50Z",49.69444444444444]]}]}

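# From the data queried for the two time ranges, determine whether the data has a gap or has resumed reporting, and generate the corresponding [no-data event] or [no-data recovery event]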
[04-01 03:43:00] [+0ms] [140ms] 【监控器】 ----------------- 断档 / 新增对象加载结果 ------------------
[04-01 03:43:00] [+0ms] [140ms] 【监控器】 --> 上轮存在对象:{"tag":"fake-data-1"}
[04-01 03:43:00] [+0ms] [141ms] 【监控器】 --> 本轮存在对象:{"tag":"fake-data-1"}
[04-01 03:43:00] [+0ms] [141ms] 【监控器】 ----> 数据断档对象(上轮存在 -> 本轮不存在):无
[04-01 03:43:00] [+0ms] [141ms] 【监控器】 --------------------- 判断数据断档 ---------------------
[04-01 03:43:00] [+0ms] [141ms] 【监控器】 --> 没有数据断档对象
[04-01 03:43:00] [+0ms] [141ms] 【监控器】 ------------------- 判断数据从断档恢复 --------------------
[04-01 03:43:00] [+0ms] [141ms] 【监控器】 --> 对象:{"tag":"fake-data-1"}
[04-01 03:43:00] [+4ms] [146ms] 【监控器】 对象 {"tag":"fake-data-1"} 的故障周期信息(fault_info):null
[04-01 03:43:00] [+0ms] [146ms] 【监控器】 ----> 没有上次无数据事件
[04-01 03:43:00] [+0ms] [146ms] 【监控器】 ----> 没有活跃无数据事件,不需要产生无数据恢复事件

# Determine whether to generate an [alert event] according to the user-configured detection rules
[04-01 03:43:00] [+0ms] [146ms] 【监控器】 -------------------- 执行数据数值检测 --------------------
[04-01 03:43:00] [+0ms] [146ms] 【监控器】 查询待检测数据
[04-01 03:43:00] [+0ms] [146ms] 【KODO】 执行 DQL 查询 -> 时间范围:2024-04-01 03:27:00 ~ 2024-04-01 03:42:00,最多 20 页
[04-01 03:43:00] [+0ms] [146ms] 【KODO】 --> 第 1 页(soffset = 0 ~ 500)
[04-01 03:43:00] [+0ms] [146ms] 【KODO】 调用 KODO API -> POST /v1/query
[04-01 03:43:00] [+21ms] [168ms] 【Studio】 指标单位(从缓存获取):wksp_xxxxx/fake_data_for_test => {"_DFF_CACHE_EXPIRE_TIME":1711914240}
[04-01 03:43:00] [+0ms] [169ms] 【KODO】 --> DQL 结果数据拆包:{"metric_units":{"field_int":null},"query_time_range":[1711913220000,1711914120000],"series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[["2024-03-31T19:41:50Z",54.370370370370374]]}]}

# Iterate over all detection objects and check each one in turn
[04-01 03:43:00] [+0ms] [169ms] 【监控器】 [检测对象 1/1] {"tag":"fake-data-1"}
[04-01 03:43:00] [+2ms] [171ms] 【通用阈值检测】 待检测数据:{'Result': [54.370370370370374]}

# Iterate over all configured rules to find the detection rule that is hit
[04-01 03:43:00] [+0ms] [171ms] 【通用阈值检测】 [阈值规则 1/1] critical:Result >= ['0']
[04-01 03:43:00] [+0ms] [171ms] 【条件判断】 [条件 1/1] IF Result (ANY[54.370370370370374]) >= ["0"]
[04-01 03:43:00] [+0ms] [171ms] 【条件判断】 --> 中间结果为 True ,条件关系为 AND ,继续
[04-01 03:43:00] [+0ms] [171ms] 【通用阈值检测】 --> 匹配成功,结束判断
[04-01 03:43:00] [+0ms] [172ms] 【通用阈值检测】 阈值规则匹配结果:{"check_data":{"Result":54.370370370370374},"conditions":[{"alias":"Result","operands":["0"],"operator":">="}],"status":"critical"}
[04-01 03:43:00] [+0ms] [172ms] 【监控器】 --> 检测对象:{"tag":"fake-data-1"}:已达到故障条件

# Call Guance Studio to get the alert policies configured for this monitor
[04-01 03:43:00] [+2ms] [175ms] 【Studio】 告警信息缓存已禁用
[04-01 03:43:00] [+0ms] [175ms] 【Studio】 调用 Studio Inner API -> GET /api/v1/inner/alert_opt/get
[04-01 03:43:00] [+20ms] [195ms] 【Studio】 告警配置(从 API 获取):rul_xxxxx => `{"_DFF_CACHE_EXPIRE_TIME":1711914360,"alertPolicies":[{"aggClusterFields":[],"aggFields":[],"aggInterval":0,"aggLabels":[],"id":"altpl_xxxxx","minInterval":900,"name":"xxxxx-告警策略1","ruleTimezone":"Asia/Shanghai","rules":[{"crontab":"00 09 * * *","crontabDuration":39600,"name":"自定义通知配置1","targets":[{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则1","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"name":"xxxxx-微信","status":"warning","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则1","status":"warning","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}],"upgradeTargets":[{"duration":180,"name":"xxxxx-告警策略1-critical-3分钟升级","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"duration":600,"name":"xxxxx-告警策略1-critical-10分钟升级","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"duration":600,"name":"xxxxx-告警策略1-critical-10分钟升级-2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}]},{"targets":[{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"name":"xxxxx-微信","status":"warning","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则2","status":"warning","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}],"upgradeTargets":[{"duration":180,"name":"xxxxx-告警策略1-critical-3分钟升级","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"duration":600,"name":"xxxxx-告警策略1-critical-10分钟升级","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"duration":600,"name":"xxxxx-告警策略1-critical-10分钟升级-2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}]}],"workspaceUUID":"wksp_xxxxx"}],"silent":[]}`
[04-01 03:43:00] [+1ms] [196ms] 【Studio】 常量配置(从缓存获取):envName => {"_DFF_CACHE_EXPIRE_TIME":1711914334,"value":"测试环境"}
[04-01 03:43:00] [+2ms] [199ms] 【Studio】 常量配置(从缓存获取):UsePublicAlertLink => {"_DFF_CACHE_EXPIRE_TIME":1711914215,"value":false}
[04-01 03:43:00] [+1ms] [200ms] 【Studio】 常量配置(从缓存获取):consoleBaseURL => {"_DFF_CACHE_EXPIRE_TIME":1711914215,"value":"http://testing-ft2x.dataflux.cn"}

# Render the event title / content from the user-configured alert template and the event data
[04-01 03:43:00] [+1ms] [202ms] 【文本渲染器】 渲染模板:
内容:xxxxx-监控器(单个){{df_dimension_tags}}
1: {{ (Result * 100) | to_int }}
2: {{ Result | to_int * 100 }}
[04-01 03:43:00] [+1ms] [204ms] 【文本渲染器】 --> 渲染成功。输出:
内容:xxxxx-监控器(单个){"tag":"fake-data-1"}
1: 5437
2: 5400
[04-01 03:43:00] [+3ms] [207ms] 【文本渲染器】 渲染模板:
标题:xxxxx-监控器(M, 单个){{df_dimension_tags}}
[04-01 03:43:00] [+1ms] [208ms] 【文本渲染器】 --> 渲染成功。输出:
标题:xxxxx-监控器(M, 单个){"tag":"fake-data-1"}

# Iterate over events / mute rules to decide whether each event needs to be muted
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 [事件 1/1] <critical事件@monitor:{"tag":"fake-data-1"}:标题:xxxxx-监控器(M, 单个){"tag":"fake-data-1"}>
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 没有静默规则,不需要静默

# Iterate over the alert policies to determine which alert policy / alert rule the event hits
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 [告警策略 1/1] xxxxx-告警策略1(altpl_xxxxx)
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 --------------------- 发送事件告警 ---------------------
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 [告警规则 1/2] 按 Crontab `00 09 * * *` 循环,每轮循环持续 39600 秒
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 --> 已配置重复时间段,但不在重复时间段范围内
[04-01 03:43:00] [+0ms] [209ms] 【事件告警器】 --> 不满足重复时间段告警,跳过
[04-01 03:43:00] [+0ms] [210ms] 【事件告警器】 [告警规则 2/2] 剩余其他时间段
[04-01 03:43:00] [+0ms] [210ms] 【事件告警器】 成功匹配告警规则,需要告警

# Read how long the event status has lasted
[04-01 03:43:00] [+0ms] [210ms] 【事件告警器】 -------------------- 事件状态持续时间 --------------------
[04-01 03:43:00] [+1ms] [211ms] 【事件告警器】 --> 当前事件状态为 critical,清除非 critical 的状态时间
[04-01 03:43:00] [+1ms] [212ms] 【事件告警器】 --> 当前事件状态 critical 起始时间已经记录,起始时间为 2024-03-26 20:01:00

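# Generate regular alert notifications
# Iterate over the notification targets and check whether each one is within its silence period
# (silence periods are aligned across all notification targets under the same alert policy / rule)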
[04-01 03:43:00] [+1ms] [213ms] 【事件告警器】 --------------------- 一般告警通知 ---------------------
[04-01 03:43:00] [+0ms] [213ms] 【事件告警器】 [告警通知对象 1/4] dingTalkRobot/xxxxx-告警策略1-规则2 (critical)
[04-01 03:43:00] [+0ms] [214ms] 【事件告警器】 匹配事件 status:critical <=> critical
[04-01 03:43:00] [+1ms] [215ms] 【事件告警器】 --> 上次告警于 2024-04-01 03:36:00,沉默 900 秒。沉默期于 2024-04-01 03:51:00(540 秒以后)解除
[04-01 03:43:00] [+0ms] [215ms] 【事件告警器】 ----> 当前处于沉默期,跳过
[04-01 03:43:00] [+0ms] [215ms] 【事件告警器】 [告警通知对象 2/4] wechatRobot/xxxxx-微信 (critical)
[04-01 03:43:00] [+0ms] [215ms] 【事件告警器】 匹配事件 status:critical <=> critical
[04-01 03:43:00] [+1ms] [216ms] 【事件告警器】 --> 上次告警于 2024-04-01 03:36:00,沉默 900 秒。沉默期于 2024-04-01 03:51:00(540 秒以后)解除
[04-01 03:43:00] [+0ms] [216ms] 【事件告警器】 ----> 当前处于沉默期,跳过
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 [告警通知对象 3/4] dingTalkRobot/xxxxx-告警策略1-规则2 (warning)
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 匹配事件 status:critical <=> warning
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 --> 不满足,跳过
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 [告警通知对象 4/4] wechatRobot/xxxxx-微信 (warning)
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 匹配事件 status:critical <=> warning
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 --> 不满足,跳过

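# Generate escalation alert notifications
# Iterate over the escalation notification targets and check whether the escalation time limit has been reached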
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 --------------------- 升级告警通知 ---------------------
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 [告警升级通知对象 1/3] dingTalkRobot/xxxxx-告警策略1-critical-3分钟升级 (180/critical)
[04-01 03:43:00] [+0ms] [217ms] 【事件告警器】 匹配事件 status:critical <=> critical
[04-01 03:43:00] [+1ms] [218ms] 【事件告警器】 --> 升级的告警已于 2024-03-26 20:04:00 发送,不需要告警升级
[04-01 03:43:00] [+0ms] [218ms] 【事件告警器】 [告警升级通知对象 2/3] dingTalkRobot/xxxxx-告警策略1-critical-10分钟升级 (600/critical)
[04-01 03:43:00] [+0ms] [218ms] 【事件告警器】 匹配事件 status:critical <=> critical
[04-01 03:43:00] [+1ms] [220ms] 【事件告警器】 --> 升级的告警已于 2024-03-26 20:11:00 发送,不需要告警升级
[04-01 03:43:00] [+0ms] [220ms] 【事件告警器】 [告警升级通知对象 3/3] dingTalkRobot/xxxxx-告警策略1-critical-10分钟升级-2 (600/critical)
[04-01 03:43:00] [+0ms] [220ms] 【事件告警器】 匹配事件 status:critical <=> critical
[04-01 03:43:00] [+1ms] [221ms] 【事件告警器】 --> 升级的告警已于 2024-03-26 20:11:00 发送,不需要告警升级

# Cache the generated events for use by the next monitor task
[04-01 03:43:00] [+2ms] [223ms] 【内部DataWay】 缓存事件
[04-01 03:43:00] [+0ms] [223ms] 【内部DataWay】 缓存故障信息
[04-01 03:43:00] [+0ms] [224ms] 【内部DataWay】 --> 建立缓存:key=`rul_xxxxx-check`, field=`{"tag":"fake-data-1"}`

# Write the events to Guance
[04-01 03:43:00] [+2ms] [226ms] 【内部DataWay】 写入事件
[04-01 03:43:00] [+0ms] [226ms] 【内部DataWay】 --> [事件 1/1] 标题:xxxxx-监控器(M, 单个){"tag":"fake-data-1"}(event-xxxxx)
[04-01 03:43:00] [+1ms] [228ms] 【内部DataWay】 行协议方式写入数据 -> POST /v1/write/keyevent,工作空间 Token:tkn_xxxxx
[04-01 03:43:00] [+0ms] [228ms] 【内部DataWay】 --> 第 1/1 条数据示例:{"fields":{"df_alert_policy_ids":["altpl_xxxxx"],"df_alert_policy_names":["xxxxx-告警策略1"],"df_at_accounts":"[]","df_at_accounts_nodata":"[]","df_channels":"[\"chan_xxxxx\"]","df_check_range_end":1711914120,"df_check_range_start":1711913220,"df_date_range":900,"df_dimension_tags":"{\"tag\":\"fake-data-1\"}","df_event_reason":"满足监控器中故障的认定条件,产生故障事件","df_fault_duration":459840,"df_fault_start_time":1711454280,"df_issue_duration":459840,"df_issue_start_time":1711454280,"df_matched_alert_policy_rules":["xxxxx-告警策略1 / -"],"df_message":"内容:xxxxx-监控器(单个){\"tag\":\"fake-data-1\"}  \n1: 5437  \n2: 5400","df_meta":"略,具体内容见对应事件数据","df_monitor_checker_name":"标题:xxxxx-监控器(M, 单个){{df_dimension_tags}}","df_monitor_checker_value":"54.370370370370374","df_monitor_name":"xxxxx-告警策略1","df_title":"标题:xxxxx-监控器(M, 单个){\"tag\":\"fake-data-1\"}","df_workspace_declaration":"{\"b\":[\"asfawfgajfasfafgafwba\",\"asfgahjfaf\"],\"business\":\"aaa\",\"organization\":\"64fe7b4062f74d0007b46676\"}"},"measurement":"keyevent","tags":{"df_crontab_exec_mode":"crontab","df_event_id":"event-xxxxx","df_fault_id":"event-xxxxx","df_fault_status":"fault","df_label":"[\"xxxxx_test\"]","df_language":"zh","df_monitor_checker":"custom_metric","df_monitor_checker_event_ref":"xxxxx","df_monitor_checker_id":"rul_xxxxx","df_monitor_checker_ref":"xxxxx","df_monitor_checker_sub":"check","df_monitor_checker_type":"monitor","df_monitor_id":"altpl_xxxxx","df_monitor_type":"custom","df_site_name":"测试环境","df_source":"monitor","df_status":"critical","df_sub_status":"critical","df_workspace_name":"【Doris】开发测试一起用_","df_workspace_uuid":"wksp_xxxxx","tag":"fake-data-1"},"timestamp":1711914120}
[04-01 03:43:00] [+14ms] [243ms] 【内部DataWay】 --> 响应:【200 OK】 ""
[04-01 03:43:00] [+0ms] [243ms] 【Studio】 缓冲需要通知 Studio 的事件
[04-01 03:43:00] [+6ms] [249ms] 【Studio】 --> 事件:标题:xxxxx-监控器(M, 单个){"tag":"fake-data-1"}(wksp_xxxxx/event-xxxxx)

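# According to the user configuration, notify Guance Studio of the generated events for tracking
# (the monitor only sends the notification here; the actual issue-tracking logic is implemented by Guance Studio)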
[04-01 03:43:00] [+0ms] [249ms] 【Studio】 缓冲需要通知 Studio 的事件
[04-01 03:43:00] [+3ms] [252ms] 【Studio】 --> 事件:标题:xxxxx-监控器(M, 单个){"tag":"fake-data-1"}(wksp_xxxxx/event-xxxxx)

# Number of monitor events generated by this check
[04-01 03:43:00] [+1ms] [253ms] 本次检测共产生 1 个监控器事件
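
The time arithmetic in this log can be checked by hand. The sketch below reproduces two of the calculations: the "previous round" and "current round" query windows used for no-data detection, and the time remaining in the silence period. It is an illustration, not the monitor's implementation: `no_data_windows` and `silence_remaining_s` are hypothetical helpers, and the reference time T = 03:42:00 is inferred from the query ranges in the log (the log lines themselves are stamped 03:43:00).

```python
from datetime import datetime, timedelta


def no_data_windows(t, check_interval_s, no_data_s):
    """Return ((prev_start, prev_end), (cur_start, cur_end)) as described in the log.

    Previous round: T - interval - no-data range - 3x no-data range  ~  T - no-data range
    Current round:  T - no-data range  ~  T
    """
    no_data = timedelta(seconds=no_data_s)
    prev_start = t - timedelta(seconds=check_interval_s) - no_data - 3 * no_data
    return (prev_start, t - no_data), (t - no_data, t)


def silence_remaining_s(last_alert, silence_s, now):
    """Seconds until the silence period ends, or 0 if it is already over."""
    end = last_alert + timedelta(seconds=silence_s)
    return max(0, int((end - now).total_seconds()))


# Assumed reference time T, inferred from the DQL time ranges in the log above
t = datetime(2024, 4, 1, 3, 42)
prev, cur = no_data_windows(t, check_interval_s=60, no_data_s=120)
print(prev[0].time(), "~", prev[1].time())  # 03:33:00 ~ 03:40:00
print(cur[0].time(), "~", cur[1].time())    # 03:40:00 ~ 03:42:00

# Last alert at 03:36:00 with a 900-second silence period
print(silence_remaining_s(datetime(2024, 4, 1, 3, 36), 900, now=t))  # 540
```

The results match the log: the previous round covers 03:33:00 ~ 03:40:00, the current round covers 03:40:00 ~ 03:42:00, and the silence period ends 540 seconds later at 03:51:00.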

A complete example in the log format without the total-time column:

# "Task Scheduling" usage count in Guance
[2024-03-06 20:58:05.084] [+0ms] 【使用额度】 数据查询范围为 1 分钟,尚未超过 15 分钟,不需要额外计量
[2024-03-06 20:58:05.084] [+0ms] 【使用额度】 存在`workspace_uuid`参数,值为 wksp_xxxxx ,需要计量 1 次

# Information about the current workspace
[2024-03-06 20:58:05.086] [+1ms] 【Studio】 工作空间信息(从缓存获取):{"declaration":{"test":["value1","value2"],"test2":"value3","test3":"value4"},"isJobDisabled":false,"isSMSDisabled":false,"language":"zh","name":"【Doris】开发测试一起用_","token":"tkn_xxxxx"}

# ID of the function executed by this task, and its argument list
[2024-03-06 20:58:05.088] [+1ms] 【函数】 调用函数:guance__api_impl.custom_check
[2024-03-06 20:58:05.088] [+0ms] 【函数】 --> 参数:`checker`=`"custom_metric"`
[2024-03-06 20:58:05.088] [+0ms] 【函数】 --> 参数:`kwargs`=`{"version":"v2"}`
[2024-03-06 20:58:05.088] [+0ms] 【函数】 --> 参数:`targets`=`[{"alias":"Result","dql":"M::`fake_data_for_test`:(avg(`field_int`)) { `tag` = 'fake-data-1' } BY `tag`","queryType":"dql","range":60}]`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`channels`=`["chan_xxxxx"]`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`extra_data`=`{"type":"simpleCheck"}`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`checker_opt`=`{"id":"rul_xxxxx","infoEvent":false,"label":["xxxxx_test"],"message":"内容:xxxxx-监控器(单个){{df_dimension_tags}}\n第2行\n第3行","name":"标题:xxxxx-监控器(单个){{df_dimension_tags}}","noDataAction":"noData","noDataInterval":120,"noDataMessage":"","noDataTitle":"","recoverInterval":120,"rules":[{"conditionLogic":"and","conditions":[{"alias":"Result","operands":["0"],"operator":">="}],"status":"critical"}],"title":"标题:xxxxx-监控器(单个){{df_dimension_tags}}"}`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`monitor_opt`=`{"id":"monitor_xxxxx","name":"default"}`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`workspace_uuid`=`"wksp_xxxxx"`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`workspace_token`=`"tkn_xxxxx"`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`disable_check_end_time`=`false`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`at_accounts`=`null`
[2024-03-06 20:58:05.089] [+0ms] 【函数】 --> 参数:`at_accounts_nodata`=`null`

# Frequency configuration of the monitor
[2024-03-06 20:58:05.092] [+2ms] 【监控器】 根据实际 Crontab(*/1 * * * *)计算检测间隔
[2024-03-06 20:58:05.092] [+0ms] 【监控器】 --> 本次触发时间:2024-03-06 20:57:00
[2024-03-06 20:58:05.092] [+0ms] 【监控器】 --> 上次触发时间:2024-03-06 20:56:00
[2024-03-06 20:58:05.092] [+0ms] 【监控器】 --> 检测间隔:60 秒

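# Based on the user-configured no-data range, query data for two time ranges: the most recent no-data range and the period before it (2 DQL queries)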
[2024-03-06 20:58:05.092] [+0ms] 【监控器】 ----------------- 加载断档 / 新增对象信息 ------------------
[2024-03-06 20:58:05.092] [+0ms] 【监控器】 已配置 120 秒无数据范围
[2024-03-06 20:58:05.092] [+0ms] 【监控器】 查询上轮数据
[2024-03-06 20:58:05.092] [+0ms] 【KODO】 执行DQL查询
[2024-03-06 20:58:05.092] [+0ms] 【KODO】 --> 最多翻页:20 页
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 --> 时间范围:2024-03-06 20:51:00 ~ 2024-03-06 20:55:00
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 --> 第1页(soffset = 0 ~ 500)
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 调用KODO API
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 >> 请求:POST /v1/query
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 >>>> Body:{"echo_explain":false,"queries":[{"mask_visible":true,"qtype":"dql","query":"M::`fake_data_for_test`:(avg(`field_int`)) { `tag` = 'fake-data-1' } BY `tag`","slimit":500,"soffset":0,"time_range":[1709729460000,1709729700000]}],"workspace_uuid":"wksp_xxxxx"}
[2024-03-06 20:58:05.093] [+0ms] 【KODO】 >> 首次请求
[2024-03-06 20:58:05.111] [+18ms] 【KODO】 >> 响应:`200 OK` => `{"content":[{"async_id":"","complete":false,"cost":"4.172766ms","group_by":["tag"],"index_name":"","index_names":"","index_store_type":"","interval":0,"is_running":false,"next_cursor_time":-1,"points":null,"query_parse":{"fields":{"avg(field_int)":"field_int"},"funcs":{"avg(field_int)":["avg"]},"namespace":"metric","sources":{"fake_data_for_test":"exact"}},"query_type":"guancedb","sample":1,"scan_completed":false,"scan_index":"","series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[[1709729699000,56.4206008583691]]}],"window":0}]}`
[2024-03-06 20:58:05.113] [+1ms] 【Studio】 调用 Studio Inner API
[2024-03-06 20:58:05.113] [+0ms] 【Studio】 >> 请求:GET /api/v1/inner/metrics_units
[2024-03-06 20:58:05.113] [+0ms] 【Studio】 >>>> Query:{"metrics":"fake_data_for_test","workspaceUUID":"wksp_xxxxx"}`
[2024-03-06 20:58:05.113] [+0ms] 【Studio】 >> 首次请求
[2024-03-06 20:58:05.123] [+9ms] 【Studio】 >> 响应:`200 OK` => `{"code":200,"content":{},"errorCode":"","message":"","success":true,"traceId":"TRACE-3645139A-BC9D-48E7-A17F-3CC93C51E650"}`
[2024-03-06 20:58:05.124] [+1ms] 【Studio】 指标单位(从 API 获取):wksp_xxxxx/fake_data_for_test => `{"_DFF_CACHE_EXPIRE_TIME":1709730065}`
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 --> DQL 结果数据拆包:{"metric_units":{"field_int":null},"query_time_range":[1709729460000,1709729700000],"series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[["2024-03-06T12:54:59Z",56.4206008583691]]}]}
[2024-03-06 20:58:05.125] [+0ms] 【监控器】 查询本轮数据
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 执行DQL查询
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 --> 最多翻页:20 页
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 --> 时间范围:2024-03-06 20:55:00 ~ 2024-03-06 20:57:00
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 --> 第1页(soffset = 0 ~ 500)
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 调用KODO API
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 >> 请求:POST /v1/query
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 >>>> Body:{"echo_explain":false,"queries":[{"mask_visible":true,"qtype":"dql","query":"M::`fake_data_for_test`:(avg(`field_int`)) { `tag` = 'fake-data-1' } BY `tag`","slimit":500,"soffset":0,"time_range":[1709729700000,1709729820000]}],"workspace_uuid":"wksp_xxxxx"}
[2024-03-06 20:58:05.125] [+0ms] 【KODO】 >> 首次请求
[2024-03-06 20:58:05.144] [+18ms] 【KODO】 >> 响应:`200 OK` => `{"content":[{"async_id":"","complete":false,"cost":"4.405232ms","group_by":["tag"],"index_name":"","index_names":"","index_store_type":"","interval":0,"is_running":false,"next_cursor_time":-1,"points":null,"query_parse":{"fields":{"avg(field_int)":"field_int"},"funcs":{"avg(field_int)":["avg"]},"namespace":"metric","sources":{"fake_data_for_test":"exact"}},"query_type":"guancedb","sample":1,"scan_completed":false,"scan_index":"","series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[[1709729819000,53.91111111111111]]}],"window":0}]}`
[2024-03-06 20:58:05.146] [+1ms] 【Studio】 指标单位(从缓存获取):wksp_xxxxx/fake_data_for_test => {"_DFF_CACHE_EXPIRE_TIME":1709730065}
[2024-03-06 20:58:05.146] [+0ms] 【KODO】 --> DQL 结果数据拆包:{"metric_units":{"field_int":null},"query_time_range":[1709729700000,1709729820000],"series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[["2024-03-06T12:56:59Z",53.91111111111111]]}]}

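# From the data queried for the two time ranges, determine whether the data has a gap or has resumed reporting, and generate the corresponding [no-data event] or [no-data recovery event]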
[2024-03-06 20:58:05.146] [+0ms] 【监控器】 ----------------- 断档 / 新增对象加载结果 ------------------
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 --> 上轮存在对象:{"tag":"fake-data-1"}
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 --> 本轮存在对象:{"tag":"fake-data-1"}
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 ----> 数据断档对象(上轮存在 -> 本轮不存在):无
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 --------------------- 判断数据断档 ---------------------
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 --> 没有数据断档对象
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 ------------------- 判断数据从断档恢复 --------------------
[2024-03-06 20:58:05.147] [+0ms] 【监控器】 --> 对象:{"tag":"fake-data-1"}
[2024-03-06 20:58:05.151] [+4ms] 【监控器】 对象 {"tag":"fake-data-1"} 的故障周期信息(fault_info):{"date":1709729160,"faultDuration":720,"faultId":"event-xxxxx","faultStartTime":1709728440,"status":"ok"}
[2024-03-06 20:58:05.151] [+0ms] 【监控器】 ----> 上次无数据事件为无数据恢复,不存在活跃无数据事件
[2024-03-06 20:58:05.151] [+0ms] 【监控器】 ----> 没有活跃无数据事件,不需要产生无数据恢复事件

# Determine whether to generate an [alert event] according to the user-configured detection rules
[2024-03-06 20:58:05.151] [+0ms] 【监控器】 -------------------- 执行数据数值检测 --------------------
[2024-03-06 20:58:05.151] [+0ms] 【监控器】 查询待检测数据
[2024-03-06 20:58:05.151] [+0ms] 【KODO】 执行DQL查询
[2024-03-06 20:58:05.151] [+0ms] 【KODO】 --> 最多翻页:20 页
[2024-03-06 20:58:05.152] [+0ms] 【KODO】 --> 时间范围:2024-03-06 20:56:00 ~ 2024-03-06 20:57:00
[2024-03-06 20:58:05.152] [+0ms] 【KODO】 --> 第1页(soffset = 0 ~ 500)
[2024-03-06 20:58:05.152] [+0ms] 【KODO】 调用KODO API
[2024-03-06 20:58:05.152] [+0ms] 【KODO】 >> 请求:POST /v1/query
[2024-03-06 20:58:05.152] [+0ms] 【KODO】 >>>> Body:{"echo_explain":false,"queries":[{"mask_visible":true,"qtype":"dql","query":"M::`fake_data_for_test`:(avg(`field_int`)) { `tag` = 'fake-data-1' } BY `tag`","slimit":500,"soffset":0,"time_range":[1709729760000,1709729820000]}],"workspace_uuid":"wksp_xxxxx"}
[2024-03-06 20:58:05.152] [+0ms] 【KODO】 >> 首次请求
[2024-03-06 20:58:05.162] [+9ms] 【KODO】 >> 响应:`200 OK` => `{"content":[{"async_id":"","complete":false,"cost":"3.042875ms","group_by":["tag"],"index_name":"","index_names":"","index_store_type":"","interval":0,"is_running":false,"next_cursor_time":-1,"points":null,"query_parse":{"fields":{"avg(field_int)":"field_int"},"funcs":{"avg(field_int)":["avg"]},"namespace":"metric","sources":{"fake_data_for_test":"exact"}},"query_type":"guancedb","sample":1,"scan_completed":false,"scan_index":"","series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[[1709729819000,54.90555555555556]]}],"window":0}]}`
[2024-03-06 20:58:05.163] [+1ms] 【Studio】 指标单位(从缓存获取):wksp_xxxxx/fake_data_for_test => {"_DFF_CACHE_EXPIRE_TIME":1709730065}
[2024-03-06 20:58:05.163] [+0ms] 【KODO】 --> DQL 结果数据拆包:{"metric_units":{"field_int":null},"query_time_range":[1709729760000,1709729820000],"series":[{"columns":["time","avg(field_int)"],"name":"fake_data_for_test","tags":{"tag":"fake-data-1"},"values":[["2024-03-06T12:56:59Z",54.90555555555556]]}]}

# Iterate over all detection objects and check each one in turn
[2024-03-06 20:58:05.164] [+0ms] 【监控器】 检测对象:共 1 个
[2024-03-06 20:58:05.164] [+0ms] 【监控器】 [检测对象 1/1] {"tag":"fake-data-1"}
[2024-03-06 20:58:05.165] [+1ms] 【通用阈值检测】 待检测数据:{'Result': [54.90555555555556]}

# Iterate over all configured rules to find the detection rule that is hit
[2024-03-06 20:58:05.166] [+0ms] 【通用阈值检测】 阈值规则:共 1 条
[2024-03-06 20:58:05.166] [+0ms] 【通用阈值检测】 [阈值规则 1/1] critical:Result >= ['0']
[2024-03-06 20:58:05.166] [+0ms] 【条件判断】 [条件 1/1] IF Result (ANY[54.90555555555556]) >= ["0"]
[2024-03-06 20:58:05.166] [+0ms] 【条件判断】 --> 中间结果为 True ,条件关系为 AND ,继续
[2024-03-06 20:58:05.166] [+0ms] 【通用阈值检测】 --> 匹配成功,结束判断
[2024-03-06 20:58:05.166] [+0ms] 【通用阈值检测】 阈值规则匹配结果:{"check_data":{"Result":54.90555555555556},"conditions":[{"alias":"Result","operands":["0"],"operator":">="}],"status":"critical"}
[2024-03-06 20:58:05.166] [+0ms] 【监控器】 --> 检测对象:{"tag":"fake-data-1"}:已达到故障条件

# Call Guance Studio to get the alert policies configured for this monitor
[2024-03-06 20:58:05.169] [+2ms] 【Studio】 告警信息缓存已禁用
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 调用 Studio Inner API
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 >> 请求:GET /api/v1/inner/alert_opt/get
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 >>>> Query:{"checkerUUID":"rul_xxxxx","workspaceUUID":"wksp_xxxxx"}`
[2024-03-06 20:58:05.169] [+0ms] 【Studio】 >> 首次请求
[2024-03-06 20:58:05.177] [+7ms] 【Studio】 >> 响应:`200 OK` => `{"code":200,"content":{"data":{"alertPolicies":[{"aggClusterFields":[],"aggFields":[],"aggInterval":0,"aggLabels":[],"id":8222,"minInterval":900,"name":"xxxxx-告警策略1","ruleTimezone":"Asia/Shanghai","rules":[{"crontab":"00 09 * * *","crontabDuration":39600,"name":"自定义通知配置1","targets":[{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则1","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}]},{"targets":[{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}]}],"status":0,"uuid":"altpl_xxxxx","workspaceUUID":"wksp_xxxxx"},{"aggClusterFields":["df_title"],"aggFields":["CLUSTER"],"aggInterval":60,"aggLabels":[],"id":8223,"minInterval":900,"name":"xxxxx-告警策略2","ruleTimezone":"Asia/Shanghai","rules":[{"crontab":"00 09 * * *","crontabDuration":39600,"name":"自定义通知配置1","targets":[{"name":"xxxxx-告警策略2-规则1","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"}]},{"targets":[{"name":"xxxxx-告警策略2-规则2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"}]}],"status":0,"uuid":"altpl_xxxxx","workspaceUUID":"wksp_xxxxx"}],"silent":[]}},"errorCode":"","message":"","success":true,"traceId":"TRACE-XXXXX"}`
[2024-03-06 20:58:05.178] [+1ms] 【Studio】 告警配置(从 API 获取):rul_xxxxx => `{"_DFF_CACHE_EXPIRE_TIME":1709730065,"alertPolicies":[{"aggClusterFields":[],"aggFields":[],"aggInterval":0,"aggLabels":[],"id":8222,"minInterval":900,"name":"xxxxx-告警策略1","ruleTimezone":"Asia/Shanghai","rules":[{"crontab":"00 09 * * *","crontabDuration":39600,"name":"自定义通知配置1","targets":[{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则1","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}]},{"targets":[{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"},{"name":"xxxxx-告警策略1-规则2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"}]}],"status":0,"uuid":"altpl_xxxxx","workspaceUUID":"wksp_xxxxx"},{"aggClusterFields":["df_title"],"aggFields":["CLUSTER"],"aggInterval":60,"aggLabels":[],"id":8223,"minInterval":900,"name":"xxxxx-告警策略2","ruleTimezone":"Asia/Shanghai","rules":[{"crontab":"00 09 * * *","crontabDuration":39600,"name":"自定义通知配置1","targets":[{"name":"xxxxx-告警策略2-规则1","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"}]},{"targets":[{"name":"xxxxx-告警策略2-规则2","status":"critical","type":"dingTalkRobot","webhook":"https://oapi.dingtalk.com/robot/send?access_token=xxxxx"},{"name":"xxxxx-微信","status":"critical","type":"wechatRobot","webhook":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx"}]}],"status":0,"uuid":"altpl_xxxxx","workspaceUUID":"wksp_xxxxx"}],"silent":[]}`
[2024-03-06 20:58:05.180] [+1ms] 【Studio】 常量配置(从缓存获取):envName => {"_DFF_CACHE_EXPIRE_TIME":1709730060,"value":"测试环境"}
[2024-03-06 20:58:05.183] [+2ms] 【Studio】 常量配置(从缓存获取):UsePublicAlertLink => {"_DFF_CACHE_EXPIRE_TIME":1709730060,"value":false}
[2024-03-06 20:58:05.184] [+1ms] 【Studio】 常量配置(从缓存获取):consoleBaseURL => {"_DFF_CACHE_EXPIRE_TIME":1709730060,"value":"http://testing.domain.com"}

# Render the event title / content from the user-configured alert template and the event data
[2024-03-06 20:58:05.186] [+1ms] 【文本渲染器】 渲染模板:
内容:xxxxx-监控器(单个){{df_dimension_tags}}
第2行
第3行
[2024-03-06 20:58:05.187] [+1ms] 【文本渲染器】 --> 渲染成功。输出:
内容:xxxxx-监控器(单个){"tag":"fake-data-1"}
第2行
第3行
[2024-03-06 20:58:05.191] [+4ms] 【文本渲染器】 渲染模板:
标题:xxxxx-监控器(单个){{df_dimension_tags}}
[2024-03-06 20:58:05.192] [+0ms] 【文本渲染器】 --> 渲染成功。输出:
标题:xxxxx-监控器(单个){"tag":"fake-data-1"}

# Iterate over the events and mute rules in turn to decide whether each event should be muted
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 [事件 1/1] <critical事件@monitor:{"tag":"fake-data-1"}:标题:xxxxx-监控器(单个){"tag":"fake-data-1"}>
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 没有静默规则,不需要静默

# Iterate over the alert policies in turn to determine which alert policy / alert rule the event matches
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 告警策略:共 2 条
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 [告警策略 1/2] xxxxx-告警策略1(8222)
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 --------------------- 发送事件告警 ---------------------
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 告警规则:共 2 条
[2024-03-06 20:58:05.192] [+0ms] 【事件告警器】 [告警规则 1/2] 按 Crontab `00 09 * * *` 循环,每轮循环持续 39600 秒
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 --> 已配置重复时间段,但不在重复时间段范围内
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 --> 不满足重复时间段告警,跳过
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 [告警规则 2/2] 剩余其他时间段
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 成功匹配告警规则,需要告警

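# Iterate over the notification targets in turn and check whether each one is within its silence period
# (all notification targets under the same alert policy / rule have their silence periods aligned)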
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 告警通知对象:共 2 条
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 [告警通知对象 1/2] dingTalkRobot/xxxxx-告警策略1-规则2 (critical)
[2024-03-06 20:58:05.193] [+0ms] 【事件告警器】 检查事件 status 匹配动作:`critical` => critical
[2024-03-06 20:58:05.195] [+1ms] 【事件告警器】 --> 上次告警于 2024-03-06 20:46:00,沉默 900 秒。沉默期于 2024-03-06 21:01:00(240 秒以后)解除
[2024-03-06 20:58:05.195] [+0ms] 【事件告警器】 ----> 当前处于沉默期,跳过
[2024-03-06 20:58:05.195] [+0ms] 【事件告警器】 [告警通知对象 2/2] wechatRobot/xxxxx-微信 (critical)
[2024-03-06 20:58:05.195] [+0ms] 【事件告警器】 检查事件 status 匹配动作:`critical` => critical
[2024-03-06 20:58:05.197] [+1ms] 【事件告警器】 --> 上次告警于 2024-03-06 20:46:00,沉默 900 秒。沉默期于 2024-03-06 21:01:00(240 秒以后)解除
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 ----> 当前处于沉默期,跳过
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 [告警策略 2/2] xxxxx-告警策略2(8223)
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 --------------------- 发送事件告警 ---------------------
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 告警规则:共 2 条
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 [告警规则 1/2] 按 Crontab `00 09 * * *` 循环,每轮循环持续 39600 秒
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 --> 已配置重复时间段,但不在重复时间段范围内
[2024-03-06 20:58:05.197] [+0ms] 【事件告警器】 --> 不满足重复时间段告警,跳过
[2024-03-06 20:58:05.198] [+0ms] 【事件告警器】 [告警规则 2/2] 剩余其他时间段
[2024-03-06 20:58:05.198] [+0ms] 【事件告警器】 成功匹配告警规则,需要告警
[2024-03-06 20:58:05.198] [+0ms] 【事件告警器】 告警通知对象:共 2 条
[2024-03-06 20:58:05.198] [+0ms] 【事件告警器】 [告警通知对象 1/2] dingTalkRobot/xxxxx-告警策略2-规则2 (critical)
[2024-03-06 20:58:05.198] [+0ms] 【事件告警器】 检查事件 status 匹配动作:`critical` => critical
[2024-03-06 20:58:05.199] [+1ms] 【事件告警器】 --> 上次告警于 2024-03-06 20:46:00,沉默 900 秒。沉默期于 2024-03-06 21:01:00(240 秒以后)解除
[2024-03-06 20:58:05.199] [+0ms] 【事件告警器】 ----> 当前处于沉默期,跳过
[2024-03-06 20:58:05.200] [+0ms] 【事件告警器】 [告警通知对象 2/2] wechatRobot/xxxxx-微信 (critical)
[2024-03-06 20:58:05.200] [+0ms] 【事件告警器】 检查事件 status 匹配动作:`critical` => critical
[2024-03-06 20:58:05.201] [+1ms] 【事件告警器】 --> 上次告警于 2024-03-06 20:46:00,沉默 900 秒。沉默期于 2024-03-06 21:01:00(240 秒以后)解除
[2024-03-06 20:58:05.201] [+0ms] 【事件告警器】 ----> 当前处于沉默期,跳过

# Cache the generated event for use by the next monitor task run
[2024-03-06 20:58:05.204] [+2ms] 【内部DataWay】 缓存事件
[2024-03-06 20:58:05.204] [+0ms] 【内部DataWay】 缓存故障信息
[2024-03-06 20:58:05.204] [+0ms] 【内部DataWay】 --> 建立缓存:key=`rul_xxxxx-check`, field=`{"tag":"fake-data-1"}`

# Write the event to 观测云
[2024-03-06 20:58:05.206] [+2ms] 【内部DataWay】 写入事件
[2024-03-06 20:58:05.208] [+1ms] 【内部DataWay】 行协议方式写入数据
[2024-03-06 20:58:05.208] [+0ms] 【内部DataWay】 --> 工作空间TOKEN:`tkn_xxxxx`
[2024-03-06 20:58:05.208] [+0ms] 【内部DataWay】 --> 请求:POST /v1/write/keyevent
[2024-03-06 20:58:05.208] [+0ms] 【内部DataWay】 --> 前 1/1 条数据示例:[{"fields":{"df_alert_policy_ids":["altpl_xxxxx","altpl_xxxxx"],"df_alert_policy_names":["xxxxx-告警策略1","xxxxx-告警策略2"],"df_at_accounts":"[]","df_at_accounts_nodata":"[]","df_channels":"[\"chan_xxxxx\"]","df_check_range_end":1709729820,"df_check_range_start":1709729760,"df_date_range":60,"df_dimension_tags":"{\"tag\":\"fake-data-1\"}","df_event_reason":"满足监控器中故障的认定条件,产生故障事件","df_fault_duration":2880,"df_fault_start_time":1709726940,"df_issue_duration":2880,"df_issue_start_time":1709726940,"df_matched_alert_policy_rules":["xxxxx-告警策略1 / -","xxxxx-告警策略2 / -"],"df_message":"内容:xxxxx-监控器(单个){\"tag\":\"fake-data-1\"}  \n第2行  \n第3行","df_meta":"{\"alert_info\":{\"matchedAlertPolicyRules\":[{\"aggClusterFields\":[],\"aggFields\":[],\"aggInterval\":0,\"aggLabels\":[],\"id\":8222,\"minInterval\":900,\"name\":\"xxxxx-告警策略1\",\"rule\":{\"md5\":\"xxxxx\",\"seq\":2,\"targets\":[{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-告警策略1-规则2\",\"status\":\"critical\",\"type\":\"dingTalkRobot\",\"webhook\":\"https://oapi.dingtalk.com/robot/send?access_token=xxxxx\"},{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-微信\",\"status\":\"critical\",\"type\":\"wechatRobot\",\"webhook\":\"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx\"}]},\"ruleTimezone\":\"Asia/Shanghai\",\"status\":0,\"uuid\":\"altpl_xxxxx\",\"workspaceUUID\":\"wksp_xxxxx\"},{\"aggClusterFields\":[\"df_title\"],\"aggFields\":[\"CLUSTER\"],\"aggInterval\":60,\"aggLabels\":[],\"id\":8223,\"minInterval\":900,\"name\":\"xxxxx-告警策略2\",\"rule\":{\"md5\":\"xxxxx\",\"seq\":2,\"targets\":[{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-告警策略2-规则2\",\"status\":\"critical\",\"type\":\"dingTalkRobot\",\"webhook\":\"https://oapi.dingtalk.com/robot/send?access_token=xxxxx\"},{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-微信\",\"status\":\"critical\",\"type\":\"wechatRobot\",\"webhook\":\"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx\"}]},\"ruleTimezone\":\"Asia/Shanghai\",\"status\":0,\"uuid\":\"altpl_xxxxx\",\"workspaceUUID\":\"wksp_xxxxx\"}],\"matchedSilentRule\":null,\"targets\":[{\"hasSecret\":false,\"i... <Length: 6784>
[2024-03-06 20:58:05.209] [+0ms] 【内部DataWay】 --> 首条数据行协议示例:`keyevent,df_crontab_exec_mode=crontab,df_event_id=event-xxxxx,df_fault_id=event-xxxxx,df_fault_status=fault,df_label=["xxxxx_test"],df_language=zh,df_monitor_checker=custom_metric,df_monitor_checker_event_ref=xxxxx,df_monitor_checker_id=rul_xxxxx,df_monitor_checker_ref=xxxxx,df_monitor_checker_sub=check,df_monitor_checker_type=monitor,df_monitor_id=altpl_xxxxx;altpl_xxxxx,df_monitor_type=custom,df_site_name=测试环境,df_source=monitor,df_status=critical,df_sub_status=critical,df_workspace_name=【Doris】开发测试一起用_,df_workspace_uuid=wksp_xxxxx,tag=fake-data-1 df_alert_policy_ids=["altpl_xxxxx","altpl_xxxxx"],df_alert_policy_names=["xxxxx-告警策略1","xxxxx-告警策略2"],df_at_accounts="[]",df_at_accounts_nodata="[]",df_channels="[\"chan_xxxxx\"]",df_check_range_end=1709729820i,df_check_range_start=1709729760i,df_date_range=60i,df_dimension_tags="{\"tag\":\"fake-data-1\"}",df_event_reason="满足监控器中故障的认定条件,产生故障事件",df_fault_duration=2880i,df_fault_start_time=1709726940i,df_issue_duration=2880i,df_issue_start_time=1709726940i,df_matched_alert_policy_rules=["xxxxx-告警策略1 / -","xxxxx-告警策略2 / -"],df_message="内容:xxxxx-监控器(单个){\"tag\":\"fake-data-1\"}
第2行
第3行",df_meta="{\"alert_info\":{\"matchedAlertPolicyRules\":[{\"aggClusterFields\":[],\"aggFields\":[],\"aggInterval\":0,\"aggLabels\":[],\"id\":8222,\"minInterval\":900,\"name\":\"xxxxx-告警策略1\",\"rule\":{\"md5\":\"xxxxx\",\"seq\":2,\"targets\":[{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-告警策略1-规则2\",\"status\":\"critical\",\"type\":\"dingTalkRobot\",\"webhook\":\"https://oapi.dingtalk.com/robot/send?access_token=xxxxx\"},{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-微信\",\"status\":\"critical\",\"type\":\"wechatRobot\",\"webhook\":\"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx\"}]},\"ruleTimezone\":\"Asia/Shanghai\",\"status\":0,\"uuid\":\"altpl_xxxxx\",\"workspaceUUID\":\"wksp_xxxxx\"},{\"aggClusterFields\":[\"df_title\"],\"aggFields\":[\"CLUSTER\"],\"aggInterval\":60,\"aggLabels\":[],\"id\":8223,\"minInterval\":900,\"name\":\"xxxxx-告警策略2\",\"rule\":{\"md5\":\"xxxxx\",\"seq\":2,\"targets\":[{\"hasSecret\":false,\"ignoreReason\":\"当前处于沉默期。上次告警于 2024-03-06 20:46:00(沉默 900 秒),沉默期将于 2024-03-06 21:01:00(240 秒以后)结束\",\"isIgnored\":true,\"name\":\"xxxxx-告警策略2-规则2\",\"... <Length: 6616>
[2024-03-06 20:58:05.216] [+6ms] 【内部DataWay】 --> 响应结果:`200 OK`
[2024-03-06 20:58:05.216] [+0ms] 【内部DataWay】 --> 响应内容:""

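# Notify 观测云 Studio of the generated event according to the user configuration, for issue tracking
# (the monitor only sends the notification here; the issue-tracking logic itself is implemented by 观测云 Studio)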
[2024-03-06 20:58:05.216] [+0ms] 【Studio】 缓冲需要通知 Studio 的事件
[2024-03-06 20:58:05.220] [+4ms] 【Studio】 --> 事件:`{"df_at_accounts":[],"df_at_accounts_nodata":[],"df_channels":["chan_xxxxx"],"df_check_range_end":1709729820,"df_check_range_start":1709729760,"df_crontab_exec_mode":"crontab","df_date_range":60,"df_dimension_tags":"{\"tag\":\"fake-data-1\"}","df_event_id":"event-xxxxx","df_fault_duration":2880,"df_fault_id":"event-xxxxx","df_fault_start_time":1709726940,"df_fault_status":"fault","df_label":"[\"xxxxx_test\"]","df_message":"内容:xxxxx-监控器(单个){\"tag\":\"fake-data-1\"}  \n第2行  \n第3行","df_monitor_checker":"custom_metric","df_monitor_checker_event_ref":"xxxxx","df_monitor_checker_id":"rul_xxxxx","df_monitor_checker_name":"标题:xxxxx-监控器(单个){{df_dimension_tags}}","df_monitor_checker_ref":"xxxxx","df_monitor_checker_sub":"check","df_monitor_checker_type":"monitor","df_monitor_checker_value":"54.90555555555556","df_monitor_id":"altpl_xxxxx;altpl_xxxxx","df_monitor_name":"xxxxx-告警策略1;xxxxx-告警策略2","df_monitor_type":"custom","df_site_name":"测试环境","df_source":"monitor","df_status":"critical","df_sub_status":"critical","df_title":"标题:xxxxx-监控器(单个){\"tag\":\"fake-data-1\"}","df_workspace_uuid":"wksp_xxxxx","timestamp":1709729820}`

# Number of monitor events generated by this check
[2024-03-06 20:58:05.221] [+1ms] 本次检测共产生 1 个监控器事件
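
The silence-period and repeat-time-window lines in the sample above can be checked by hand. The following is a minimal Python sketch of that arithmetic, not the monitor's actual implementation. The function names `silence_remaining` and `in_daily_window` are hypothetical. It assumes the reference time is 20:57:00, which is the `df_check_range_end` value 1709729820 expressed in Asia/Shanghai; that choice is inferred from the 240-second figure in the log, which does not itself state which clock is used. It also handles only the simple daily Crontab form `MM HH * * *` seen in the sample.

```python
from datetime import datetime, timedelta


def silence_remaining(last_alert: datetime, min_interval_s: int, now: datetime) -> int:
    """Seconds until the silence period ends (zero or negative means not silenced)."""
    silence_end = last_alert + timedelta(seconds=min_interval_s)
    return int((silence_end - now).total_seconds())


def in_daily_window(now: datetime, start_hour: int, start_minute: int, duration_s: int) -> bool:
    """Whether `now` falls in a window that opens daily at HH:MM and lasts duration_s seconds."""
    # Check today's window and yesterday's, since a long window may cross midnight.
    for day_offset in (0, 1):
        start = now.replace(hour=start_hour, minute=start_minute,
                            second=0, microsecond=0) - timedelta(days=day_offset)
        if start <= now < start + timedelta(seconds=duration_s):
            return True
    return False


# Values taken from the sample log (naive datetimes, read as Asia/Shanghai local time).
last_alert = datetime(2024, 3, 6, 20, 46, 0)   # "上次告警于 2024-03-06 20:46:00"
now = datetime(2024, 3, 6, 20, 57, 0)          # assumed reference time (df_check_range_end)

remaining = silence_remaining(last_alert, 900, now)   # minInterval = 900
in_window = in_daily_window(now, 9, 0, 39600)         # Crontab `00 09 * * *`, crontabDuration = 39600

print(remaining, in_window)
```

With these inputs the silence period ends at 21:01:00, 240 seconds after the assumed reference time, and the 09:00 to 20:00 window does not contain 20:57:00. Both results agree with the sample: alert rule 1/2 is skipped, and every notification target is skipped because it is still within its silence period.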