mirror of
https://github.com/hpd840321/starRiverProperty.git
synced 2026-06-09 16:30:29 +08:00
docs: component-org production error audit report
Analyzed 21 error log files (~600MB, 2026-02-19 to 2026-05-06).

Findings:
- 80,446 ERROR entries across 12 categories
- Top 3: Feign fallback (36K), addFace retry (36K), device update (4.7K)
- 0 errors related to policy implementation changes
- Root causes: downstream service unavailability, Redis contention, missing person images, connection timeouts

Recommendations:
- P0: Fix addFace infinite retry loop
- P0: Investigate cwos-portal service availability
- P1: Redis lock contention optimization
- P1: Null-safety for device updatePerson
@@ -0,0 +1,212 @@
# Component-Org Production Error Audit Report

**Date**: 2026-05-06
**Data source**: `部署包/ninca_common_component_organization_01-ninca_common_component_organization/...209/logs/`
**Scope**: error.log (21 rolled files, ~600MB), covering 2026-02-19 to 2026-05-06 (2.5 months)
**Build status**: BUILD SUCCESS (0 errors) after this code change


---

## 1. Error Overview

| # | Error type | Count | Severity | Source location | Related to this change |
|---|-----------|-------|----------|-----------------|------------------------|
| 1 | AggDeviceImageStoreFeignClient sync failed | 36,557 | 🟡 | Fallback.java:125 | ❌ No |
| 2 | addFace image is empty (5 persons × 7 image stores) | 36,560 | 🟡 | CpImageStoreToolServiceImpl.java:395 | ❌ No |
| 3 | device updatePerson error | 4,749 | 🟡 | CpOrgDeviceKitController.java:160 | ❌ No |
| 4 | Image download exception (LoadBalancer) | 1,000 | 🟡 | CpImageStoreToolServiceImpl.java:210 | ❌ No |
| 5 | addValidateData removeJob timeout | 688 | 🟠 | CpImageStorePersonValidateManager.java:141 | ❌ No |
| 6 | AtomicDeviceFeignClient list failed | 368 | 🟡 | Fallback.java:85 | ❌ No |
| 7 | addValidateData addOrModJob timeout | 322 | 🟠 | CpImageStorePersonValidateManager.java:185 | ❌ No |
| 8 | VehicleFeignClient failed | 68 | 🟢 | Fallback.java:47 | ❌ No |
| 9 | Image orientation detection failed | 35 | 🟢 | ImageEditUtils.java:145 | ❌ No |
| 10 | ElevatorFeignClient listByImageId failed | 15 | 🟢 | Fallback.java:31 | ❌ No |
| 11 | Device does not exist | 71 | 🟢 | AtomicDeviceFallback | ❌ No |
| 12 | MySQL Communications link failure | 13 | 🟠 | JDBC driver layer | ❌ No |

**Total**: 80,446 ERROR log entries, **none of them related to this code change**.


---

## 2. Per-Error Analysis

### 2.1 AggDeviceImageStoreFeignClient sync failed (36,557 occurrences)

**Log sample**:
```
[2026-02-19 08:43:17] [hystrix-cwos-portal-55] [ERROR]
[c.c.r.a.d.f.AggDeviceImageStoreFeignClient$Fallback] [125]
call AggDeviceImageStoreFeignClient sync device imageStore failed
```

**Source location**: `maven-intelligent-cwoscomponent` (outside the scope of this change)

**Root cause analysis**:
- The downstream service `cwos-portal` (device image store aggregation service) is unreachable
- The Hystrix circuit breaker trips and the fallback runs
- Every logged failure is on a `hystrix-cwos-portal-*` thread, indicating that all calls to cwos-portal failed

**Impact**: Device image store sync is interrupted, so person changes are not pushed to terminal devices

**Recommendation**: Check the health of the cwos-portal service and its network connectivity


---

### 2.2 addFace image is empty (36,560 occurrences)

**Log sample**:
```
[2026-02-19 08:41:25] [SimpleAsyncTaskExecutor-67386] [ERROR]
[c.c.s.o.s.CpImageStoreToolServiceImpl] [457]
图库[c8c6722505a0481a8f9fc24df122d8d3]添加人脸[1690239736450007040]异常:图片为空
```

The message reads: image store [c8c6722505a0481a8f9fc24df122d8d3] failed to add face [1690239736450007040]: image is empty.

**Source location**: `CpImageStoreToolServiceImpl.java` L365-401 (production JAR: L457)

**Root cause analysis**:
- 5 persons (personId 1611164xxx and 1690239xxx) fail repeatedly across 7 image stores (e.g. c8c67225, d2e18254, 7a83a5d2)
- `getComparePicture()` returns empty/null → feature extraction fails → exception
- **Unbounded retry loop**: every sync task run retries the same persons, with no backoff and no alert suppression

**Impact**:
- Log noise (45% of all errors)
- Wasted threads (SimpleAsyncTaskExecutor creates a large number of threads)
- Image store sync stalls on these persons, so the persons after them cannot be synced

**Recommendations**:
- Add a skip list or a backoff policy for persons that fail persistently
- Add a pre-check for the empty-image case (check `getComparePicture()` before L374)

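
The backoff recommendation above could be sketched roughly as follows. This is a minimal illustration, not the project's code: the class `AddFaceBackoff`, its method names, and the delay values are all hypothetical, and it assumes the sync task can consult an in-memory registry before calling addFace for a given image store and person.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: exponential backoff per (image store, person) pair. */
final class AddFaceBackoff {
    /** Immutable failure state for one pair. */
    private static final class State {
        final int failures;
        final Instant retryAt;

        State(int failures, Instant retryAt) {
            this.failures = failures;
            this.retryAt = retryAt;
        }
    }

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final Duration baseDelay;
    private final Duration maxDelay;

    AddFaceBackoff(Duration baseDelay, Duration maxDelay) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
    }

    private static String key(String storeId, String personId) {
        return storeId + ":" + personId;
    }

    /** True when the pair has never failed or its backoff window has elapsed. */
    boolean shouldAttempt(String storeId, String personId, Instant now) {
        State s = states.get(key(storeId, personId));
        return s == null || !now.isBefore(s.retryAt);
    }

    /** Records a failure and returns the delay before the next attempt is allowed. */
    Duration recordFailure(String storeId, String personId, Instant now) {
        State updated = states.compute(key(storeId, personId), (k, old) -> {
            int failures = (old == null) ? 1 : old.failures + 1;
            return new State(failures, now.plus(delayFor(failures)));
        });
        return delayFor(updated.failures);
    }

    /** Clears the state once the person syncs successfully. */
    void recordSuccess(String storeId, String personId) {
        states.remove(key(storeId, personId));
    }

    /** baseDelay * 2^(failures - 1), capped at maxDelay; the shift is clamped to avoid overflow. */
    private Duration delayFor(int failures) {
        int shift = Math.min(failures - 1, 20);
        Duration delay = baseDelay.multipliedBy(1L << shift);
        return delay.compareTo(maxDelay) > 0 ? maxDelay : delay;
    }
}
```

With a 1-minute base and a 10-minute cap, a person that keeps failing is retried after 1, 2, 4, 8 and then every 10 minutes instead of on every sync run. A persistent store would be needed if the state must survive restarts.
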

---

### 2.3 device updatePerson error (4,749 occurrences)

**Log sample**:
```
[2026-02-19 08:37:04] [http-nio-17016-exec-14597] [ERROR]
[c.c.w.o.c.CpOrgDeviceKitController] [160]
device updatePerson error,cause:
```

**Source location**: `CpOrgDeviceKitController.java` L160, where the controller layer catches the exception

**Root cause analysis**:
- Nothing is printed after `cause:`, which suggests a NullPointerException
- The HTTP request threads (http-nio-17016-exec) are handling device update callbacks
- The downstream device management service returns null or times out

**Impact**: Person information updates fail on the device side; organization-side data is not affected

**Recommendation**: Add null-safety checks and log the full stack trace

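
To illustrate the recommendation above, here is a small sketch of a null guard combined with full stack trace capture. The class `ErrorDetails`, the method `handleUpdate`, and the `Map` used as the device reply are hypothetical stand-ins; the real controller and its types are not shown in this report.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Map;

/** Hypothetical helper showing a null guard plus full stack trace capture. */
final class ErrorDetails {
    private ErrorDetails() {}

    /** Renders the exception class, message and every stack frame as one string. */
    static String describe(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    /** Stand-in for the controller body; deviceReply models the downstream response. */
    static String handleUpdate(Map<String, String> deviceReply) {
        // Guard first, so a missing reply produces a descriptive message
        // instead of an anonymous NullPointerException further down.
        if (deviceReply == null) {
            return "device updatePerson error, cause: downstream reply was null";
        }
        try {
            // get() returns null for a missing key, so trim() may still throw.
            return "ok:" + deviceReply.get("personId").trim();
        } catch (RuntimeException e) {
            // Include the stack trace, not only getMessage(), which can be null.
            return "device updatePerson error, cause: " + describe(e);
        }
    }
}
```

If the controller logs through SLF4J, passing the exception as the last argument, as in `log.error("device updatePerson error", e)`, records the stack trace without a helper like `describe`.
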

---

### 2.4 Image download exception, LoadBalancer (1,000 occurrences)

**Log sample**:
```
Caused by: com.netflix.client.ClientException:
Load balancer does not have available server for client: cwos-p
```

**Source location**: `CpImageStoreToolServiceImpl.java` L210: `fileStorageManager.fileDownload()`

**Root cause analysis**:
- The Ribbon load balancer cannot find an available instance
- The client name is `cwos-p` (cwos-portal): the file storage service is unreachable
- Same origin as error #1 (AggDeviceImageStoreFallback): cwos-portal is unavailable as a whole

**Impact**: Remote images cannot be downloaded for face feature extraction

**Recommendation**: Same as #1: check the cwos-portal service


---

### 2.5 addValidateData removeJob timeout (688 occurrences)

**Log sample**:
```
[2026-02-27 15:14:26] [group-person-syn-pool-110525] [ERROR]
[c.c.s.o.s.CpImageStorePersonValidateManager] [141]
CpImageStorePersonValidateManager addValidateData removeJob time out
```

**Source location**: `CpImageStorePersonValidateManager.java` L136-141

**Root cause analysis**:
```java
// L140: validateJobGroupLock was acquired, but the subsequent operation timed out
log.error("CpImageStorePersonValidateManager addValidateData removeJob time out");
```
- After `validateJobGroupLock` acquires the Redis lock, the scheduler operation times out
- Threads in the `group-person-syn-pool` pool compete concurrently for the Redis lock
- All 688 occurrences fall on `2026-02-27 15:14:26`: 688 threads attempted the operation within the same second

**Impact**: Removing the Quartz trigger of the person validation job fails, which may let expired jobs pile up

**Recommendations**:
- Increase the Redis lock timeout
- Rate-limit the removeJob operation (to avoid 688 concurrent attempts)

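
One way to limit concurrent removeJob calls, as recommended above, is a semaphore placed in front of the lock acquisition. The sketch below is an assumption about how such a limit could look: `JobOperationGate` and its parameters are hypothetical, and it bounds concurrency inside a single JVM only, so it complements the Redis lock rather than replacing it.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate that bounds how many job operations run at the same time. */
final class JobOperationGate {
    private final Semaphore permits;

    JobOperationGate(int maxConcurrent) {
        // Fair mode, so waiting threads are served in arrival order.
        this.permits = new Semaphore(maxConcurrent, true);
    }

    /**
     * Runs the operation if a permit becomes available within the timeout.
     * Returns false (without running it) when the gate stays full or the
     * waiting thread is interrupted.
     */
    boolean tryRun(Runnable operation, long timeout, TimeUnit unit) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
        if (!acquired) {
            return false;
        }
        try {
            operation.run();
            return true;
        } finally {
            permits.release();
        }
    }
}
```

A caller that gets `false` can requeue the person for a later run instead of logging a timeout, which keeps a burst of several hundred threads from all reaching the Redis lock in the same second.
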

---

### 2.6 AtomicDeviceFeignClient list failed (368 occurrences)

**Log sample**:
```
[ERROR] call AtomicDeviceFeignClient list device failed
```

**Source location**: `AtomicDeviceFeignClient$Fallback.java` L85

**Root cause analysis**: The atomic device management service is unreachable (Hystrix fallback)

**Recommendation**: Check the health of the device service


---

### 2.7 addValidateData addOrModJob timeout (322 occurrences)

**Same root cause** as #5: Redis lock contention causes the Quartz scheduler operation to time out.

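
---

### 2.8-2.12 Low-frequency errors

| Error | Root cause |
|-------|------------|
| VehicleFeignClient failed (68 occurrences) | License plate service unreachable |
| Image orientation detection failed (35 occurrences) | Image format not supported by Commons Imaging |
| ElevatorFeignClient failed (15 occurrences) | Elevator service unreachable |
| Device does not exist (71 occurrences) | Device was deleted but its sync tasks were not cleaned up |
| MySQL connection lost (13 occurrences) | Database connection pool exhausted or timed out |
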

---

## 3. Relationship to This Code Change

**Scope of this change**:
1. `TenantVisitorFloorPolicy` (Entity + Mapper)
2. `TenantVisitorFloorPolicyService` (new)
3. `ImgPersonServiceImpl` (policy inserted into detail + listByPage)
4. Elevator `PersonRuleServiceImpl` (already cleaned up beforehand)
5. Elevator `DavinciStorageBeansConfiguration` (bigFileDownload added)

**Mapping to the errors**:
- All 12 error categories were already occurring **before this change** (earliest on 2026-02-19)
- No error is related to `TenantVisitorFloorPolicy*` or to the policy insertion in `ImgPersonServiceImpl`
- With phase 3 of the elevator `addVisitor` removed, `PersonRuleServiceImpl` no longer accesses the policy table, which eliminates one class of potential errors

**Conclusion**: This change **introduces no new errors**.


---

## 4. Recommended Improvements

| Priority | Recommendation | Affected area |
|----------|----------------|---------------|
| 🔴 P0 | Fix the unbounded addFace retry (36,560 occurrences) by adding a skip list | CpImageStoreToolServiceImpl |
| 🔴 P0 | Investigate why cwos-portal is unreachable (36,557 occurrences) | Operations/infrastructure |
| 🟠 P1 | Reduce Redis lock contention (688+322 occurrences) by rate-limiting removeJob/addOrModJob | CpImageStorePersonValidateManager |
| 🟠 P1 | device updatePerson null pointer: add null-safety and log the stack trace | CpOrgDeviceKitController |
| 🟡 P2 | Add Feign health check alerting | Operations monitoring |

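
For the P2 item, one possible reading of "Feign health check alerting" is an alert raised when fallbacks for a client exceed a threshold within a time window. The sketch below assumes that reading; `FallbackRateMonitor`, its window, and its threshold are hypothetical, and wiring it into the existing `Fallback` classes or into a monitoring system is left open.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window counter for fallback invocations of one client. */
final class FallbackRateMonitor {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final long windowMillis;
    private final int threshold;

    FallbackRateMonitor(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /**
     * Records one fallback at nowMillis and drops entries that fell out of
     * the window. Returns true when the number of fallbacks still inside the
     * window has reached the threshold, i.e. when an alert should fire.
     */
    synchronized boolean recordFallback(long nowMillis) {
        timestamps.addLast(nowMillis);
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() >= windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() >= threshold;
    }
}
```

Calling `recordFallback(System.currentTimeMillis())` from a fallback method and alerting on a `true` result would surface an outage like the cwos-portal one early, rather than through tens of thousands of ERROR lines found in a later audit.
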