JDK提供的许多内置的工具可以帮助我们监控应用,排查故障,本文用一个示例程序模拟死锁,CPU占用和内存泄漏的情况,然后展示使用工具排查的步骤思路

示例程序

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

/**
 * 模拟死锁,内存,CPU占用
 * <p>
 * 使用jdk工具排查
 * <p>
 * -Xms32m -Xmx32m
 */
public class JdkTools {

    private final static Object lock1 = new JdkTools();

    private final static Object lock2 = new JdkTools();

    private final static List<char[]> chars = new ArrayList<>();

    private final static List<DemoObject> objects = new ArrayList<>();

    private static void acquire1() {
        synchronized (lock1) {
            System.out.println(Thread.currentThread().getId() + "-acquire-lock1");
            try {
                Thread.sleep(1000L);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            synchronized (lock2) {
                System.out.println(Thread.currentThread().getId() + "-acquire-lock2");
            }
        }
    }

    private static void acquire2() {
        synchronized (lock2) {
            System.out.println(Thread.currentThread().getId() + "-acquire-lock2");
            synchronized (lock1) {
                System.out.println(Thread.currentThread().getId() + "-acquire-lock1");
            }
        }
    }


    private static void deadLock() {
        new Thread(JdkTools::acquire1, "JdkTools.deadLock-thread1").start();
        new Thread(JdkTools::acquire2, "JdkTools.deadLock-thread2").start();

    }


    private static void consumeCPU() {
        new Thread(() -> {
            int i = 0;
            while (true) {
                i++;
            }
        }, "JdkTools-consumeCPU-thread").start();
    }

    private static void outOfMemory() {
        int n = 0;
        while (true) {
            try {
                Thread t = new Thread(() -> {
                    int size = 1024 * 1024;
                    while (true) {
                        chars.add(new char[size]);
                        for (int i = 0; i < size; i++) {
                            objects.add(new DemoObject());
                        }
                    }
                }, "JdkTools-outOfMemory-thread-" + n++);
                t.start();
                t.join();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try {
                    Thread.sleep(1000 * 20L);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                chars.clear();
                objects.clear();
            }
        }
    }

    public static void main(String[] args) {
        MemoryMXBean memoryMBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage usage = memoryMBean.getHeapMemoryUsage();
        int maxSize = 1024 * 1024 * 32;
        if (usage.getInit() > maxSize || usage.getMax() > maxSize) {
            System.out.println("Use: -Xms32m -Xmx32m");
            return;
        }
        deadLock();
        consumeCPU();
        outOfMemory();
    }


}

class DemoObject {
    int a = 10086;
    long[] longs = new long[1024];
}

运行程序,输出:

10-acquire-lock1
11-acquire-lock2
Exception in thread "JdkTools-outOfMemory-thread-0" java.lang.OutOfMemoryError: Java heap space
	at DemoObject.<init>(JdkTools.java:111)
	at JdkTools.lambda$outOfMemory$1(JdkTools.java:73)
	at JdkTools$$Lambda$4/142257191.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)

这里我们忽略输出日志,仅使用jdk工具来排查问题。

首先我们使用 jps -l 查看java进程

l表示输出程序main方法所在类的全限定名,v打印jvm变量,m打印main方法参数,另外q参数可以不打印类名

5826 org.jetbrains.idea.maven.server.RemoteMavenServer
5172 com.intellij.idea.Main
130822 org.jetbrains.kotlin.daemon.KotlinCompileDaemon
68376 com.sun.tools.hat.Main
130941 org.jetbrains.jps.cmdline.Launcher
73836 JdkTools
5197 com.intellij.idea.Main

我们可以知道JdkTools类运行的进程号是73836

死锁

使用 jstack 73836 可查看进程73836所有的线程栈,如果检测到死锁末尾会有相应输出:

Found one Java-level deadlock:
=============================
"JdkTools.deadLock-thread2":
  waiting to lock monitor 0x00007fc8a40062c8 (object 0x00000000fe042eb0, a JdkTools),
  which is held by "JdkTools.deadLock-thread1"
"JdkTools.deadLock-thread1":
  waiting to lock monitor 0x00007fc8a4004d78 (object 0x00000000fe042ec0, a JdkTools),
  which is held by "JdkTools.deadLock-thread2"

Java stack information for the threads listed above:
===================================================
"JdkTools.deadLock-thread2":
	at JdkTools.acquire2(JdkTools.java:42)
	- waiting to lock <0x00000000fe042eb0> (a JdkTools)
	- locked <0x00000000fe042ec0> (a JdkTools)
	at JdkTools$$Lambda$2/303563356.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)
"JdkTools.deadLock-thread1":
	at JdkTools.acquire1(JdkTools.java:33)
	- waiting to lock <0x00000000fe042ec0> (a JdkTools)
	- locked <0x00000000fe042eb0> (a JdkTools)
	at JdkTools$$Lambda$1/471910020.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.

可以看出,线程JdkTools.deadLock-thread2获取了一个类型为JdkTools,地址为0x00000000fe042ec0的锁,等待一个类型为JdkTools,地址为0x00000000fe042eb0的锁,目前运行的代码行为JdkTools.java:42;

而线程JdkTools.deadLock-thread1正好相反,获取了一个类型为JdkTools,地址为0x00000000fe042eb0的锁,等待一个类型为JdkTools,地址为0x00000000fe042ec0的锁,目前运行的代码行为JdkTools.java:33,于是发生了死锁。

查找CPU占用率高的线程

线程是调度的基本单位,进程是资源分配的基本单位,线程在linux中也叫轻量级进程,使用top的H参数可以以线程的维度查看系统资源占用情况

使用 top -Hp 73836 可以看到在进程73836创建的线程中,占用CPU最高的是线程id是73858。同时我们这里也可以看出,CPU占用不同,但是内存占用都是一样的,间接反映了内存是以进程为单位分配的

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                             
 73858 zeng      20   0 3029304  67296  16220 R 97.7  1.1  30:01.47 java                                                                                                
 73838 zeng      20   0 3029304  67296  16220 S  0.7  1.1   0:01.27 java                                                                                                
 73841 zeng      20   0 3029304  67296  16220 S  0.7  1.1   0:01.32 java                                                                                                
 73845 zeng      20   0 3029304  67296  16220 S  0.7  1.1   0:01.28 java                                                                                                
 73840 zeng      20   0 3029304  67296  16220 S  0.3  1.1   0:01.27 java
 ...

jstack可以获取当前所有的线程栈,其中记录了每个线程的id,所以我们可以用这个线程id结合jstack匹配,找到对应线程栈,不过jstack记录的是16进制的我们需要把它转换为16进制

jstack 73836 | grep -A 10 `printf '%x' 73858`
"JdkTools-consumeCPU-thread" #12 prio=5 os_prio=0 tid=0x00007fc8ec192000 nid=0x12082 runnable [0x00007fc8cfffe000]
   java.lang.Thread.State: RUNNABLE
	at JdkTools.lambda$consumeCPU$0(JdkTools.java:59)
	at JdkTools$$Lambda$3/135721597.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)

"JdkTools.deadLock-thread2" #11 prio=5 os_prio=0 tid=0x00007fc8ec190800 nid=0x12081 waiting for monitor entry [0x00007fc8d41de000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at JdkTools.acquire2(JdkTools.java:42)
	- waiting to lock <0x00000000fe042eb0> (a JdkTools)
	- locked <0x00000000fe042ec0> (a JdkTools)

于是我们知道,是JdkTools-consumeCPU-thread线程长时间占用CPU,目前执行到JdkTools.java:59

排查内存泄漏

线程间是共享堆内存的,线程结束了,其对应的工作内存会释放,栈上引用的堆上的内存由于引用不可达,也会被JVM自动回收。

一般出现内存泄漏是线程不结束,一些永远无法获取到的对象绑定到线程上,如ThreadLocal,或者静态字段引用的对象等。

可以认为内存泄漏都是业务代码逻辑造成的,一些可以不持有的对象却一直持有导致JVM无法垃圾回收

如果发生了内存泄漏,我们可以找到占用内存多的对象,然后分析哪些对象持有了这个对象的引用,最终定位问题

我们使用jmap和jhat结合分析内存泄漏情况:

  1. 将堆栈dump到文件 jmap -dump:live,format=b,file=dump.dat 73836
  2. 使用jhat分析dump文件,通过http查看分析 jhat dump.dat
    Reading from dump.dat...
    Dump file created Thu Nov 07 17:43:42 CST 2019
    Snapshot read, resolving...
    Resolving 14123 objects...
    Chasing references, expect 2 dots..
    Eliminating duplicate references..
    Snapshot resolved.
    Started HTTP server on port 7000
    Server is ready.
    

访问 http://127.0.0.1:7000 即可进程追踪引用关系,查看Show heap histogram,可以看到 类-实例数-占用字节数 的列表,默认按照 占用字节数 排序

Class	Instance Count	Total Size
class [J	2876	23589856
class [C	1410	2210504
class [Ljava.lang.Object;	777	88456
class java.lang.Class	661	60812
class DemoObject	2874	34488
class [B	10	25214
class java.lang.String	1397	16764

class[J 表示long的数组类型,有2876个实例,点击进入

class [J
...
References to this object:
[J@0xff08a320 (8208 bytes) : ??
[J@0xfe8d3318 (8208 bytes) : ??
[J@0xfe752748 (8208 bytes) : ??
...

在 References to this object(引用该对象的对象,可以理解为每个实例都引用了对应类的对象,数组也不例外): 下方可以看到该类型对应的所有实例,点击0xff08a320进入

Object at 0xff08a320
References to this object:
DemoObject@0xfef2e8f0 (28 bytes) : field longs

可以看到DemoObject的longs字段引用了这个long数组,点击DemoObject@0xfef2e8f0,看看谁引用了DemoObject

Object at 0xfef2e8f0

instance of DemoObject@0xfef2e8f0 (28 bytes)

Class:
class DemoObject

Instance data members:
a (I) : 10086
longs (L) : [J@0xff08a320 (8208 bytes) 

References to this object:
[Ljava.lang.Object;@0xfe068f38 (33328 bytes) : Element 1797 of [Ljava.lang.Object;@0xfe068f38
...

从上面可以得知0xfe068f38这个对象数组引用了DemoObject,继续点击0xfe068f38对象数组,查看谁引用了它

Object at 0xfe068f38

Array of 4164 objects
0 : DemoObject@0xfe06ff90 (28 bytes) 
1 : DemoObject@0xfe06ff78 (28 bytes) 
2 : DemoObject@0xfe06ff60 (28 bytes) 
3 : DemoObject@0xfe06ff48 (28 bytes) 
4 : DemoObject@0xfe06ff30 (28 bytes) 
5 : DemoObject@0xfe06ff18 (28 bytes) 
...

References to this object:
java.util.ArrayList@0xfe025b68 (32 bytes) : field elementData


由于这个对象数组有4164个元素,我们需要滑到最下面,java.util.ArrayList@0xfe025b68这个对象持有对象数组,有点眉目了,我们继续点击java.util.ArrayList@0xfe025b68

Object at 0xfe025b68

instance of java.util.ArrayList@0xfe025b68 (32 bytes)

Class:
class java.util.ArrayList

Instance data members:
elementData (L) : [Ljava.lang.Object;@0xfe068f38 (33328 bytes) 
modCount (I) : 541773
size (I) : 2874

References to this object:
class JdkTools (108 bytes) : static field objects

查看最后一行,class JdkTools (108 bytes) : static field objects,到这里已经明确了,是JdkTools的静态变量objects导致OOM,接下来我们就可以从业务层分析原因,为什么会占用如此高的内存了

其他

参考文章和扩展阅读:

java命令--jstack 工具

https://www.cnblogs.com/kongzhongqijing/articles/3630264.html

使用Java监控工具出现 Can't attach to the process

https://www.cnblogs.com/wscit/p/6803354.html

JVM性能调优监控命令jps、jinfo、jstat、jmap+jhat、jstack使用详解

https://www.cnblogs.com/barrywxx/p/8576390.html

JDK内置工具使用(jps、jstack、jmap、jstat)

https://www.cnblogs.com/heishuichenzhou/p/10609085.html

JVM调优总结 -Xms -Xmx -Xmn -Xss

https://www.cnblogs.com/ceshi2016/p/8447989.html

java动态获取jvm参数

https://blog.csdn.net/liudezhicsdn/article/details/51058504

Java一个对象占用多少字节

https://www.cnblogs.com/zhi-leaf/p/11589711.html

JNI基础:JNI数据类型和类型描述符

https://blog.csdn.net/afei__/article/details/80899758

jmap命令详解

https://www.cnblogs.com/xingzc/p/5778014.html

Linux 常用命令 | top 详解

https://www.cnblogs.com/xiaote/p/10976362.html

Why do -Xmx and Runtime.maxMemory not agree

https://stackoverflow.com/questions/23701207/why-do-xmx-and-runtime-maxmemory-not-agree