一次线上debug的过程

事情的缘由是这样的,很久之前,我写了一个爬虫,然后爬虫幸福的运行在服务器上,突然有一天,我收到一份邮件,上面写着爬虫失效,然后
肯定先是ssh上去看看进程.

先top一下看状况,一切良好,java进程的运行比较平稳,所以排除了死循环的错误,或者是NPE退出的错误.

然后线上使用jstack查了下虚拟机栈,然后就发现了如下一段奇葩的栈:

"pool-6-thread-629" prio=10 tid=0x00007fcfb4004000 nid=0x51e0 runnable [0x00007fcf6c6c4000]
   java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
    ...
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
"pool-6-thread-628" prio=10 tid=0x00007fcfc4004000 nid=0x51df runnable [0x00007fcf6f2f0000]
   java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    ...
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
"pool-6-thread-627" prio=10 tid=0x00007fcff8003000 nid=0x51dd runnable [0x00007fcf6fffd000]
   java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
    ...

因为我开的是40的fixedThreadPool,所以出现629的线程数,特别奇怪.所以初步判定是线程池中没有了线程,导致爬虫无法工作,而且这几个阻塞的线程
也可以认为是阻塞在了socket的读写上.因为我是自己设计过的线程池,我会对这种超时的任务进行删除,所以又往下看了下我的监控线程池栈.栈如下:

"pool-7-thread-6" prio=10 nid=0x547c waiting on condition [0x00007fd0845b8000]
   java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00000000a0324a60> (a AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
 ...
 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
 at java.lang.Thread.run(Thread.java:662)
"pool-7-thread-5" prio=10 nid=0x547a waiting on condition [0x00007fd0847ba000]
   java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00000000a0324a60> (a AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
 ...
 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
 at java.lang.Thread.run(Thread.java:662)
"pool-7-thread-4" prio=10 nid=0x5478 waiting on condition [0x00007fd0849bc000]
   java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00000000a0324a60> (a AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
 ...
 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
 at java.lang.Thread.run(Thread.java:662)

这边表现很正常,这边对于出现的waiting on condition的,则是我的任务队列的锁.问题来了,为什么会锁在任务队列上?我特地用jmap,dump了线上运行程序的
内存.然后用jvisualVM进行查询,这绝对是个不爽的过程,因为dump下的内存快照有800MB.

因为锁死在任务队列上,所以我们来看内存中,这个队列到底怎样.

队列

可以发现,队列中确实是有数据,然后根据任务进行查询,发现果然是淘宝的任务.

任务

因为之前测试的时候发现淘宝的任务可能存在超时,然后会进行重做任务,所以,立即去查我的代码.终于发现了问题.问题代码如下:

@Override
public void run() {
    if(times++ > 3){
    // 超过三次不做
        logger.info("任务失败次数过多!" + ((DPXTask)task).toString());
        return;
    }
    FutureTask<String> futureTask = new FutureTask<String>(task);
    try {
        workers.submit(futureTask).get(waitTime, TimeUnit.SECONDS);
    } catch (ExecutionException e) {
        logger.info("任务出错:" + ((DPXTask) task).toString() + e.getMessage());
        // taskqueue.add((Runnable) task); -- 这是错误代码
        run(); // 这才对
    } catch (TimeoutException e) {
        logger.info("任务超时:" + ((DPXTask) task).toString() + e.getMessage());
        run();
    } catch (Exception e) {
        logger.info("任务错误:" + ((DPXTask) task).toString() + e.getMessage());
        run();
    }
}

可以看到,错误的代码是因为直接朝taskqueue写入了任务,从而使得监控任务失效,脱离了监控任务的管理范围,导致了线程的泄漏.从而导致线
程池无法产生新的线程,于是,把代码进行修改,然后上线,到目前为止,运行平稳.

不过也暴露了问题,就是日志的作用没有体现,一开始在对日志的处理中,基本没有发现对debug有益的信息,因为第一次打日志.所以,对日志的处理,之后
会完善.

1 Reply to “一次线上debug的过程”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.