Author: 牛小宝
In performance testing of Java applications, many problems can be uncovered simply by looking at thread stacks. jstack is the JVM's built-in tool for dumping thread stacks; it is lightweight, easy to use, and running it has little impact on the application's performance. Used flexibly, jstack can surface many hidden performance problems and is an invaluable aid when tracking them down.
A thread stack is also called a thread call stack. A Java thread dump is a snapshot of the state of all threads (and locks) in the JVM at a given instant: the running state of every thread, each thread's call stack, and which locks are held or waited on. From this snapshot you can tell exactly what every thread is doing and what it is waiting for at that moment.
You can capture an application's thread stacks at runtime with jstack, for example:
qaperf@hzbxs-haitao-perftest2:~$ jps
24347 Bootstrap
17677 Jps
29113 jar
qaperf@hzbxs-haitao-perftest2:~$ jstack 24347 >> jstack.log
For a Java application, quite a few common performance problems can be tracked down starting from the thread stack. Before walking through concrete cases, it helps to know the thread states that appear in a dump.
To tell whether a thread is working hard or idling, look at its state. The states you will meet most often are RUNNABLE, BLOCKED, WAITING, and TIMED_WAITING.
RUNNABLE: from the JVM's point of view the thread is running. A running thread normally consumes CPU, but not every RUNNABLE thread does. For example, a thread doing network I/O is effectively suspended, yet because the suspension happens in native code the JVM does not see it, so the thread does not move to WAITING the way it would after an explicit call to wait() (or to TIMED_WAITING after sleep()); it only uses a little CPU when data actually arrives.
WAITING / TIMED_WAITING: these two states mean the thread has been suspended and is waiting to be woken up. If a timeout was set the state is TIMED_WAITING; if not, it is WAITING, and the thread can only leave the waiting state when lock.notify() or lock.notifyAll() is called or it receives an interrupt. In a dump these states come with extra detail worth noting, such as "in Object.wait()", "waiting on condition", or "parking", which tells you what the thread is waiting for.
BLOCKED: the thread is blocked, usually waiting to enter a critical section ("waiting for monitor entry"). This state deserves close attention.
Threads in TIMED_WAITING, WAITING, or BLOCKED consume no CPU. For a RUNNABLE thread you have to look at what the code is doing at that point: pure computation burns CPU, while a thread sitting in native network I/O (as described above) does not. The sketch below shows how the three non-running states arise.
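A minimal, self-contained sketch (my own illustration, not from the original case studies): run it, take a jstack dump, and "blocked-demo" shows up as BLOCKED (waiting for monitor entry), "waiting-demo" as WAITING (in Object.wait()), and "sleeping-demo" as TIMED_WAITING (sleeping).

public class ThreadStateDemo {
    private static final Object MONITOR = new Object();
    private static final Object WAIT_LOCK = new Object();

    public static void main(String[] args) throws Exception {
        Thread holder = new Thread(() -> {
            synchronized (MONITOR) {            // holds MONITOR forever
                sleepForever();
            }
        }, "lock-holder");

        Thread blocked = new Thread(() -> {
            synchronized (MONITOR) { }          // BLOCKED: waiting for monitor entry
        }, "blocked-demo");

        Thread waiting = new Thread(() -> {
            synchronized (WAIT_LOCK) {
                try { WAIT_LOCK.wait(); }       // WAITING: in Object.wait(), no timeout
                catch (InterruptedException ignored) { }
            }
        }, "waiting-demo");

        Thread sleeping = new Thread(ThreadStateDemo::sleepForever, "sleeping-demo"); // TIMED_WAITING

        holder.start();
        Thread.sleep(100);                      // let the holder grab MONITOR first
        blocked.start();
        waiting.start();
        sleeping.start();
        holder.join();                          // keep the JVM alive so jstack can be run
    }

    private static void sleepForever() {
        try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
    }
}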
Problem: while setting up a test environment, Tomcat refused to start after the application was deployed, yet the logs showed no errors; it simply felt as if startup was stuck somewhere. To see what Tomcat was actually doing and why startup never completed, we looked at the call relationships in the thread stack. Running jstack against the Tomcat process gave the following:
"localhost-startStop-1" daemon prio=10 tid=0x0000000002da2000 nid=0x630b in Object.wait() [0x00007f1863538000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000007d6924800> (a org.apache.curator.framework.state.ConnectionStateManager)
at java.lang.Object.wait(Object.java:503)
at org.apache.curator.framework.state.ConnectionStateManager.blockUntilConnected(ConnectionStateManager.java:215)
- locked <0x00000007d6924800> (a org.apache.curator.framework.state.ConnectionStateManager)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.blockUntilConnected(CuratorFrameworkImpl.java:223)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.blockUntilConnected(CuratorFrameworkImpl.java:229)
at com.netease.kaola.kschedule.client.curator.CuratorSupport.initZK(CuratorSupport.java:81)
at com.netease.kaola.kschedule.client.curator.CuratorSupport.check(CuratorSupport.java:293)
- locked <0x00000007d64af060> (a java.lang.Class for com.netease.kaola.kschedule.client.curator.CuratorSupport)
at com.netease.kaola.kschedule.client.curator.CuratorSupport.checkExists(CuratorSupport.java:113)
at com.netease.kaola.kschedule.client.job.JobManagerImpl.isSchedulerStop(JobManagerImpl.java:218)
at com.netease.kaola.kschedule.client.job.JobManagerImpl.startScheduler(JobManagerImpl.java:134)
at com.netease.kaola.kschedule.client.KScheduleClientFactory.init(KScheduleClientFactory.java:90)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
...
Analysis: look at the thread state first: WAITING (on object monitor). The thread has called Object.wait() and is suspended, waiting to be woken up, and no timeout was set, so Tomcat will wait forever unless something wakes it. What is it waiting for? The call stack shows com.netease.kaola.kschedule.client.curator.CuratorSupport.initZK: kschedule initializes ZooKeeper at startup, and that is where the application was stuck. Once the cause was known the fix was simple: the kschedule configuration pointed at a private ZooKeeper IP that the application could not reach; changing it to the machine-room IP solved the problem.
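As a side note, a minimal sketch (my own, not the project's actual init code) of how this kind of indefinite hang can be turned into a fast failure, using Curator's timed blockUntilConnected overload instead of the no-arg version seen in the stack trace:

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkInitSketch {
    public static CuratorFramework initZk(String connectString) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3));
        client.start();
        // blockUntilConnected() with no arguments waits forever; the timed overload
        // returns false on timeout so startup can fail loudly instead of hanging.
        if (!client.blockUntilConnected(30, TimeUnit.SECONDS)) {
            client.close();
            throw new IllegalStateException("ZooKeeper unreachable at " + connectString);
        }
        return client;
    }
}

With a bounded wait, a wrong ZooKeeper address surfaces as a clear startup error instead of a silent hang.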
Problem: while testing one scenario we saw very long response times, again with nothing in the error logs. Following the call chain level by level showed that 80% of the time was spent in DAO-layer methods, so the first suspect was the SQL. A DBA colleague captured the SQL for us, but reported that it ran quickly and the execution plan was fine. So where was the time going? With no obvious cause, we checked the thread stack to see what the system was doing once it entered the DAO layer. The jstack output:
"DubboServerHandler-10.165.184.51:20881-thread-200" daemon prio=10 tid=0x00007f2fd6208800 nid=0x504b waiting on condition [0x00007f2fc0280000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000078172f2c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at com.alibaba.druid.pool.DruidDataSource.pollLast(DruidDataSource.java:1487)
at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1086)
at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:953)
at com.alibaba.druid.filter.FilterChainImpl.dataSource_connect(FilterChainImpl.java:4544)
at com.alibaba.druid.filter.logging.LogFilter.dataSource_getConnection(LogFilter.java:827)
at com.alibaba.druid.filter.FilterChainImpl.dataSource_connect(FilterChainImpl.java:4540)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:931)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:923)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:100)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
at org.mybatis.spring.transaction.SpringManagedTransaction.openConnection(SpringManagedTransaction.java:81)
at org.mybatis.spring.transaction.SpringManagedTransaction.getConnection(SpringManagedTransaction.java:67)
at org.apache.ibatis.executor.BaseExecutor.getConnection(BaseExecutor.java:279)
at org.apache.ibatis.executor.SimpleExecutor.prepareStatement(SimpleExecutor.java:72)
at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:59)
...
Analysis: look at the thread state first. The dump contained a large number of Dubbo threads in TIMED_WAITING, and "waiting on condition" shows they were waiting for some condition to be signalled; such threads are effectively sleeping and are normally woken by a timeout. Some TIMED_WAITING is perfectly normal (threads waiting on I/O, for example), but this many threads in that state calls for an explanation. Reading the stacks showed that every one of them was waiting for Druid to hand out a connection from the pool, which looks very much like connection-pool exhaustion. After increasing the number of connections in the database pool, TPS tripled.
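For reference, a minimal sketch of sizing a Druid pool programmatically (my own illustration; the values are placeholders, not the ones actually used, and in the real application the data source is presumably wired up through Spring configuration rather than code like this):

import com.alibaba.druid.pool.DruidDataSource;

public class PoolConfigSketch {
    public static DruidDataSource newDataSource() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://db-host:3306/app");   // placeholder connection settings
        ds.setUsername("app");
        ds.setPassword("secret");
        ds.setInitialSize(20);
        ds.setMinIdle(20);
        ds.setMaxActive(100);   // raise this when threads queue up in DruidDataSource.pollLast()
        ds.setMaxWait(3000);    // ms; fail fast instead of parking indefinitely for a connection
        return ds;
    }
}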
Problem: again during scenario testing, response times slowed down and showed severe spikes. Going through the usual suspected bottlenecks one by one turned up nothing, so jstack came in handy once more. The thread stack first:
"DubboServerHandler-10.165.184.34:20880-thread-199" daemon prio=10 tid=0x00007f3cb4196000 nid=0x6048 waiting for monitor entry [0x00007f3cabb79000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:204)
- waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.info(Category.java:666)
at netease.qp.spellcheck.Facade.logInfo(Facade.java:26)
at netease.qp.kaola.service.impl.SearchPhraseServiceImpl.getSearchGoodsResult(SearchPhraseServiceImpl.java:60)
at com.alibaba.dubbo.common.bytecode.Wrapper1.invokeMethod(Wrapper1.java)
at com.alibaba.dubbo.rpc.proxy.javassist.JavassistProxyFactory$1.doInvoke(JavassistProxyFactory.java:46)
at com.alibaba.dubbo.rpc.proxy.AbstractProxyInvoker.invoke(AbstractProxyInvoker.java:72)
at com.alibaba.dubbo.rpc.protocol.InvokerWrapper.invoke(InvokerWrapper.java:53)
at com.alibaba.dubbo.rpc.protocol.dubbo.filter.TraceFilter.invoke(TraceFilter.java:78)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
at com.alibaba.dubbo.monitor.support.MonitorFilter.invoke(MonitorFilter.java:65)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
at com.alibaba.dubbo.rpc.filter.TimeoutFilter.invoke(TimeoutFilter.java:42)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
at com.alibaba.dubbo.rpc.filter.ExceptionFilter.invoke(ExceptionFilter.java:64)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
at com.alibaba.dubbo.rpc.filter.ContextFilter.invoke(ContextFilter.java:60)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
at com.alibaba.dubbo.rpc.filter.GenericFilter.invoke(GenericFilter.java:112)
...
Analysis: the thread is in BLOCKED state, which always deserves close attention. Looking further, almost every Dubbo thread was blocked, all of them "waiting to lock <0x000000078c312718>". So what on earth is <0x000000078c312718>? Searching the dump for that address gives:
Line 57: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 119: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 169: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 207: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 257: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 295: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 345: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 407: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 588: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 686: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 790: - locked <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 840: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 954: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 995: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 1105: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 1143: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
Line 1197: - waiting to lock <0x000000078c312718> (a org.apache.log4j.spi.RootLogger)
...
The search shows that the lock is held by log4j (one thread, at line 790 of the dump, has it locked), which blocks every other thread trying to write a log entry through log4j. Googling similar reports revealed this is a known log4j issue; it can be avoided by upgrading log4j or by trimming the logging. After the corresponding change, performance improved dramatically. For more on log4j tuning I recommend 侯姐's article: http://doc.hz.netease.com/pages/viewpage.action?pageId=26313263
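One mitigation worth mentioning (my own sketch, assuming log4j 1.x; it is not the team's actual fix) is to route the root logger's appenders through an AsyncAppender, so business threads only enqueue events and the logger monitor seen in the dump is held very briefly:

import java.util.Enumeration;
import org.apache.log4j.Appender;
import org.apache.log4j.AsyncAppender;
import org.apache.log4j.Logger;

public class AsyncLoggingSetup {
    @SuppressWarnings("unchecked")
    public static void install() {
        Logger root = Logger.getRootLogger();
        AsyncAppender async = new AsyncAppender();
        async.setBufferSize(8192);    // events buffered before the overflow policy kicks in
        async.setBlocking(false);     // drop excess events instead of blocking callers
        // Wrap every existing appender behind the async one.
        for (Enumeration<Appender> e = root.getAllAppenders(); e.hasMoreElements(); ) {
            async.addAppender(e.nextElement());
        }
        root.removeAllAppenders();
        root.addAppender(async);
    }
}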
Problem: high CPU usage is a very common symptom. First confirm that it is user CPU (us%) that is high; if so, jstack can show which business code is burning the CPU.
Analysis: first use top to find the PID of the process with high CPU usage:
Tasks: 107 total, 1 running, 106 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.8 us, 1.9 sy, 0.0 ni, 87.6 id, 0.0 wa, 0.0 hi, 1.5 si, 0.2 st
KiB Mem: 4056796 total, 3879320 used, 177476 free, 61940 buffers
KiB Swap: 0 total, 0 used, 0 free, 430180 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
51334 appops 20 0 2723m 1.0g 6224 S 46.3 27.0 354:29.58 java
1823 qatest 20 0 707m 21m 6624 S 0.3 0.5 10:40.24 PM2 v2.4.2: God
1829 qatest 20 0 869m 93m 6096 S 0.3 2.4 12:06.25 node /home/qate
2817 nobody 20 0 101m 26m 504 S 0.3 0.7 1:08.11 nginx
2907 nhids 20 0 26084 2624 1976 S 0.3 0.1 1:13.52 nhids-agentd
1 root 20 0 10656 836 704 S 0.0 0.0 0:14.46 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:13.24 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:58.47 kworker/0:0
PID 51334 clearly uses the most CPU. Run
top -p 51334 -H
to get per-thread statistics for that process:
top - 10:32:05 up 13 days, 19:10, 1 user, load average: 0.82, 0.57, 0.28
Threads: 53 total, 0 running, 53 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.3 us, 2.6 sy, 0.0 ni, 86.2 id, 0.0 wa, 0.0 hi, 1.5 si, 0.4 st
KiB Mem: 4056796 total, 3890980 used, 165816 free, 59588 buffers
KiB Swap: 0 total, 0 used, 0 free, 362916 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
51402 appops 20 0 2723m 1.0g 6224 S 6.3 27.0 43:33.60 java
51408 appops 20 0 2723m 1.0g 6224 S 3.7 27.0 30:02.94 java
51517 appops 20 0 2723m 1.0g 6224 S 2.7 27.0 16:14.26 java
54362 appops 20 0 2723m 1.0g 6224 S 2.7 27.0 16:13.93 java
63901 appops 20 0 2723m 1.0g 6224 S 2.7 27.0 16:11.07 java
64115 appops 20 0 2723m 1.0g 6224 S 2.7 27.0 16:09.38 java
50039 appops 20 0 2723m 1.0g 6224 S 2.7 27.0 15:43.05 java
Suppose the thread with PID 51402 is the busy one. Convert the PID to hexadecimal, 0xc8ca (for example with printf '%x\n' 51402); this value is the thread's nid in the jstack output:
"kafka-producer-network-thread | foreground.-_-.apm_collector_test" daemon prio=10 tid=0x0000000002cab800 nid=0xc8ca runnable [0x00007fa1ca46d000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x0000000781725140> (a sun.nio.ch.Util$2)
- locked <0x0000000781725130> (a java.util.Collections$UnmodifiableSet)
- locked <0x0000000781724378> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.kafka.common.network.Selector.select(Selector.java:328)
at org.apache.kafka.common.network.Selector.poll(Selector.java:218)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:192)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122)
Reading that thread's stack tells you which part of the code is consuming the CPU. Remember to run jstack several times, since a single snapshot can be misleading. The performance team's show-busy-java-threads.sh script makes the whole procedure simple and easy; search the wiki if you need it.
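As a complement to the top -H plus jstack routine, here is a minimal sketch (my own, using only the standard java.lang.management API) that lists a JVM's threads with their accumulated CPU time from inside the process; note the IDs it reports are JVM thread IDs, not the OS nids shown by top:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);   // -1 if unsupported or the thread has died
            ThreadInfo info = mx.getThreadInfo(id);
            if (cpuNanos > 0 && info != null) {
                System.out.printf("%-50s cpu=%dms state=%s%n",
                        info.getThreadName(), cpuNanos / 1_000_000, info.getThreadState());
            }
        }
    }
}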
Thread stacks can also reveal deadlocks, infinite loops, and other performance problems. When chasing a performance issue, take a few extra jstack snapshots; you may be surprised by what you find.
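To illustrate the deadlock case (my own demo, not one from the article): run the sketch below and jstack will append a "Found one Java-level deadlock" section to the dump, naming both threads and the two monitors involved.

public class DeadlockDemo {
    private static final Object A = new Object();
    private static final Object B = new Object();

    public static void main(String[] args) {
        new Thread(() -> lockInOrder(A, B), "deadlock-1").start();   // takes A, then wants B
        new Thread(() -> lockInOrder(B, A), "deadlock-2").start();   // takes B, then wants A
    }

    private static void lockInOrder(Object first, Object second) {
        synchronized (first) {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }  // let both grab their first lock
            synchronized (second) {
                System.out.println(Thread.currentThread().getName() + " got both locks");
            }
        }
    }
}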
This article comes from the NetEase Practitioner Community and is published with the authorization of the author, 牛小宝.