一次Nacos服务掉线排查

Nacos用户方发来一个问题,说是压测过程中,半夜发现服务掉线了。然后给了我一些日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
2025-11-05 02:03:58.342 WARN [com.alibaba.nacos.client.naming.grpc.redo.0:c.a.n.c.naming] Grpc Connection is disconnect, skip current redo task
2025-11-05 02:04:00.638 INFO [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] Grpc connection connect

异常。。。。。

2025-11-05 02:17:24.549 WARN [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] Grpc connection disconnect, mark to redo
2025-11-05 02:17:24.550 WARN [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] mark to redo completed
2025-11-05 02:17:24.550 INFO [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] Grpc connection connect
2025-11-05 02:17:26.354 INFO [com.alibaba.nacos.client.naming.grpc.redo.0:c.a.n.c.naming] Redo instance operation REGISTER for xxxx
2025-11-05 02:17:26.362 INFO [com.alibaba.nacos.client.naming.grpc.redo.0:c.a.n.c.naming] Redo subscriber operation REGISTER for xxxx


异常。。。。。。

2025-11-05 02:50:04.844 WARN [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] Grpc connection disconnect, mark to redo
2025-11-05 02:50:04.845 WARN [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] mark to redo completed
2025-11-05 02:50:04.845 INFO [com.alibaba.nacos.client.remote.worker:c.a.n.c.naming] Grpc connection connect //注意这里,和上面日志相比,少了Redo instance 和 Redo subscriber


2025-11-05 02:50:08.656 INFO [com.alibaba.nacos.client.naming.updater.0:c.a.n.c.naming] removed ips(1) service: xxxxx
2025-11-05 02:50:08.657 INFO [com.alibaba.nacos.client.naming.updater.0:c.a.n.c.naming] current ips:(0) service: xxxxx -> []

联想到Nacos(2.0.4)存在服务断线重连机制,所以怀疑该机制失效。找到源码所在位置:

NamingGrpcRedoService

1
2
3
4
5
public NamingGrpcRedoService(NamingGrpcClientProxy clientProxy) {
this.redoExecutor = new ScheduledThreadPoolExecutor(REDO_THREAD, new NameThreadFactory(REDO_THREAD_NAME));
this.redoExecutor.scheduleWithFixedDelay(new RedoScheduledTask(clientProxy, this), DEFAULT_REDO_DELAY,
DEFAULT_REDO_DELAY, TimeUnit.MILLISECONDS);
}

RedoScheduledTask

1
2
3
4
5
6
7
8
9
10
11
12
13
@Override
public void run() {
if (!redoService.isConnected()) {
LogUtils.NAMING_LOGGER.warn("Grpc Connection is disconnect, skip current redo task");
return;
}
try {
redoForInstances();
redoForSubscribes();
} catch (Exception e) {
LogUtils.NAMING_LOGGER.warn("Redo task run with unexpected exception: ", e);
}
}

这是个定时任务,正常应该定时执行才对,为啥会停止执行。然后立马想到去看线程dump。得到如下:

1
2
3
4
5
6
7
8
9
10
11
12
13

"com.alibaba.nacos.client.naming.grpc.redo.0" #42 daemon prio=5 os_prio=0 cpu=50.83ms elapsed=55226.59s tid=0x0000ffffa9df7000 nid=0x2e waiting on condition [0x0000ffff1f1fd000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000f7c9f8d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2057)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

定时任务等待执行任务应该是TIMED_WAITING状态,这里这个WAITING状态引起了大家注意。

然后定位到JDK源码,发现是队列任务不存在了,导致的WAITING状态。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
public RunnableScheduledFuture<?> take() throws InterruptedException {
final ReentrantLock lock = this.lock;
lock.lockInterruptibly();
try {
for (;;) {
RunnableScheduledFuture<?> first = queue[0];
if (first == null)
available.await(); //WAITING状态
else {
long delay = first.getDelay(NANOSECONDS);
if (delay <= 0L)
return finishPoll(first);
first = null; // don't retain ref while waiting
if (leader != null)
available.await();
else {
Thread thisThread = Thread.currentThread();
leader = thisThread;
try {
available.awaitNanos(delay);//TIMED_WAIT状态
} finally {
if (leader == thisThread)
leader = null;
}
}
}
}
} finally {
if (leader == null && queue[0] != null)
available.signal();
lock.unlock();
}
}

这就搞的我很莫名其妙,任务为啥会自动消失,查看了代码也没发现有其他redoExecutor的调用代码。

陷入迷茫。。。。。。

灵光乍现,是不是任务执行过程中,线程被interrupt,线程池shutdown,任务抛出异常自己没捕获。写一个简单的程序,逐个测试,最终发现抛出Error会导致线程池的线程状态,变为WAITING。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
ScheduledExecutorService redoExecutor = new ScheduledThreadPoolExecutor(1, new CustomThreadFactory("redo"));
Runnable runnable = new Runnable() {
@Override
public void run() {
try {
System.out.println("redo");
Thread.sleep(3000);
throw new OutOfMemoryError("test");
} catch (Exception e) {
e.printStackTrace();
}
}
};
ScheduledFuture scheduledFuture = redoExecutor.scheduleWithFixedDelay(runnable, 3000,
3000, TimeUnit.MILLISECONDS);
System.in.read();

想到之前看RocketMq源码,有设置全局的默认异常处理,看是否能捕获.

1
2
3
4
5
6
7
8
9
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
@Override
public void uncaughtException(Thread t, Throwable e) {
System.err.println("捕获到未处理的异常!");
System.err.println("线程: " + t.getName());
System.err.println("异常: " + e.getClass().getName() + ": " + e.getMessage());
e.printStackTrace();
}
});

线程池的线程,设定异常捕获器

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public class CustomThreadFactory extends NamedThreadFactory {
public CustomThreadFactory(String prefix) {
super(prefix);
}
public Thread newThread(Runnable r) {
Thread thread = super.newThread(r);
thread.setUncaughtExceptionHandler((t, e)->{
System.err.println("捕获到未处理的异常!");
System.err.println("线程: " + t.getName());
System.err.println("异常: " + e.getClass().getName() + ": " + e.getMessage());
e.printStackTrace();
});
thread.interrupt();
return thread;
}
}

发现不起作用

其实还是怀疑应用当时有oom了,据说压测压的cpu都100%占用了。不过我这个推演的结果,测试那边也是认可的。就这样吧。