[讨论] JVM Crash 请教原因

aronlulu 2012-03-15
服务器上的JVM最近崩溃了两次,crash日志错误显示都是CMS垃圾回收时的错误,请教圈子里的各位高手有没有遇到过,日志片段如下:
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b1a0304e26c, pid=6278, tid=1090754880
#
# JRE version: 6.0_29-b11
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.4-b02 mixed mode linux-amd64 coo
mpressed oops)
# Problematic frame:
# V  [libjvm.so+0x4c926c]  void MarkSweep::adjust_pointer<unsigned>(unsigned*, bb
ool)+0x1c
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp


Register to memory mapping:

RAX=0x0000000000000100 is an unknown value
RBX=0x0000000575dbca00 is an oop

[error occurred during error reporting (printing register info), id 0xb]

Stack: [0x0000000040f39000,0x000000004103a000],  sp=0x0000000041038690,  free spp
ace=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x4c926c]  void MarkSweep::adjust_pointer<unsigned>(unsigned*, booo
l)+0x1c
V  [libjvm.so+0x4bfdcc]  instanceRefKlass::oop_adjust_pointers(oopDesc*)+0x6c
V  [libjvm.so+0x395378]  CompactibleFreeListSpace::adjust_pointers()+0x138
V  [libjvm.so+0x47d0b0]  AdjustPointersClosure::do_space(Space*)+0x10
V  [libjvm.so+0x3be1f6]  ConcurrentMarkSweepGeneration::space_iterate(SpaceClosuu
re*, bool)+0x16
V  [libjvm.so+0x47bd19]  Generation::adjust_pointers()+0x29
V  [libjvm.so+0x4705b0]  GenAdjustPointersClosure::do_generation(Generation*)+0xx
10
V  [libjvm.so+0x46e905]  GenCollectedHeap::generation_iterate(GenCollectedHeap:::
GenClosure*, bool)+0x35
V  [libjvm.so+0x4704b3]  GenMarkSweep::mark_sweep_phase3(int)+0x1d3
V  [libjvm.so+0x46f9da]  GenMarkSweep::invoke_at_safepoint(int, ReferenceProcesss
or*, bool)+0x34a
V  [libjvm.so+0x3c0119]  CMSCollector::do_compaction_work(bool)+0x1f9
V  [libjvm.so+0x3bfb30]  CMSCollector::acquire_control_and_collect(bool, bool)+00
x160
V  [libjvm.so+0x3bf838]  ConcurrentMarkSweepGeneration::collect(bool, bool, unsii
gned long, bool)+0xf8
V  [libjvm.so+0x46caba]  GenCollectedHeap::do_collection(bool, bool, unsigned loo
ng, bool, int)+0x51a
V  [libjvm.so+0x46dfc3]  GenCollectedHeap::do_full_collection(bool, int)+0x53
V  [libjvm.so+0x85fcb6]  VM_GenCollectFull::doit()+0x66
V  [libjvm.so+0x86d77a]  VM_Operation::evaluate()+0x4a
V  [libjvm.so+0x86cd62]  VMThread::evaluate_operation(VM_Operation*)+0x82
V  [libjvm.so+0x86cfd3]  VMThread::loop()+0x193
V  [libjvm.so+0x86cade]  VMThread::run()+0x6e
V  [libjvm.so+0x710b8f]  java_start(Thread*)+0x13f

VM_Operation (0x0000000042850910): GenCollectFull, mode: safepoint, requested byy
 thread 0x00002aaabea9d800

---------------  P R O C E S S  ---------------

Java Threads: ( => current thread )
  0x00002aaac09e0000 JavaThread "Unknown thread" daemon [_thread_blocked, id=8366
2, stack(0x0000000073d64000,0x0000000073e65000)]
  0x00002aaac10c5800 JavaThread "?. daemon [_thread_blocked, id=7641, stack(0x000
00000073c63000,0x0000000073d64000)]
  0x00002aaabf20e000
[error occurred during error reporting (printing all threads), id 0xb]

VM state:at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread:  ([mutex/lock_event])
[0x0000000040112970] Threads_lock - owner thread: 0x00002aaabe4e9000
[0x0000000040112ef0] Heap_lock - owner thread: 0x00002aaabea9d800

Heap
 par new generation   total 76672K, used 69206K [0x0000000570000000, 0x0000000577
5330000, 0x0000000575330000)
  eden space 68160K,  98% used [0x0000000570000000, 0x00000005741d3db8, 0x0000000
0574290000)
  from space 8512K,  21% used [0x0000000574ae0000, 0x0000000574ca1b10, 0x00000000
575330000)
  to   space 8512K,   0% used [0x0000000574290000, 0x0000000574290000, 0x00000000
574ae0000)
 concurrent mark-sweep generation total 10400576K, used 94399K [0x000000057533000
00, 0x00000007f0000000, 0x00000007f0000000)
 concurrent-mark-sweep perm gen total 262144K, used 92913K [0x00000007f0000000,  
0x0000000800000000, 0x0000000800000000)

Code Cache  [0x00002aaaab08f000, 0x00002aaaab39f000, 0x00002aaaae08f000)
 total_blobs=1323 nmethods=868 adapters=408 free_code_cache=47236352 largest_free
e_block=12928


RednaxelaFX 2012-03-15
把靠近文件顶部的siginfo那行贴出来。很明显你有些不希望公开的信息在crash log里,但现在的信息看起来不够用。

现在能看出来的是,full GC在修正指针的阶段,在处理到一个弱引用的discovered字段时挂掉了。但没有足够信息进一步跟下去。
public abstract class Reference<T> {
    private T referent;		/* Treated specially by GC */

    ReferenceQueue<? super T> queue;

    Reference next;
    transient private Reference<T> discovered; 	/* used by VM */
}

template <class T> void specialized_oop_adjust_pointers(instanceRefKlass *ref, oop obj) {
  T* referent_addr = (T*)java_lang_ref_Reference::referent_addr(obj);
  MarkSweep::adjust_pointer(referent_addr);
  T* next_addr = (T*)java_lang_ref_Reference::next_addr(obj);
  MarkSweep::adjust_pointer(next_addr);
  T* discovered_addr = (T*)java_lang_ref_Reference::discovered_addr(obj);
  MarkSweep::adjust_pointer(discovered_addr); // <= 这里
  debug_only(trace_reference_gc("instanceRefKlass::oop_adjust_pointers", obj,
                                referent_addr, next_addr, discovered_addr);)
}

int instanceRefKlass::oop_adjust_pointers(oop obj) {
  int size = size_helper();
  instanceKlass::oop_adjust_pointers(obj);

  if (UseCompressedOops) {
    specialized_oop_adjust_pointers<narrowOop>(this, obj); // <= 这里
  } else {
    specialized_oop_adjust_pointers<oop>(this, obj);
  }
  return size;
}


(gdb) disass 0x4bfd60
Dump of assembler code for function instanceRefKlass::oop_adjust_pointers(oopDesc*):
0x00000000004bfd60 <+0>:      push   %rbp
0x00000000004bfd61 <+1>:      mov    %rsp,%rbp
0x00000000004bfd64 <+4>:      sub    $0x10,%rsp
0x00000000004bfd68 <+8>:      mov    %rbx,(%rsp)
0x00000000004bfd6c <+12>:     mov    %r12,0x8(%rsp)
0x00000000004bfd71 <+17>:     mov    %rsi,%rbx
0x00000000004bfd74 <+20>:     mov    0x8(%rdi),%r12d
0x00000000004bfd78 <+24>:     callq  0x4b6ca0 <instanceKlass::oop_adjust_pointers(oopDesc*)>
0x00000000004bfd7d <+29>:     mov    0x70b55c(%rip),%rdx        # 0xbcb2e0
0x00000000004bfd84 <+36>:     sar    $0x3,%r12d
0x00000000004bfd88 <+40>:     cmpb   $0x0,(%rdx)
0x00000000004bfd8b <+43>:     je     0x4bfde0 <+128>
0x00000000004bfd8d <+45>:     mov    0x70b4dc(%rip),%r9        # 0xbcb270
0x00000000004bfd94 <+52>:     xor    %esi,%esi
0x00000000004bfd96 <+54>:     movslq (%r9),%r8
0x00000000004bfd99 <+57>:     lea    (%r8,%rbx,1),%rdi
0x00000000004bfd9d <+61>:     callq  0x4c9250 <void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool)>
0x00000000004bfda2 <+66>:     mov    0x70c4af(%rip),%rdi        # 0xbcc258
0x00000000004bfda9 <+73>:     movslq (%rdi),%rsi
0x00000000004bfdac <+76>:     lea    (%rsi,%rbx,1),%rdi
0x00000000004bfdb0 <+80>:     xor    %esi,%esi
0x00000000004bfdb2 <+82>:     callq  0x4c9250 <void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool)>
0x00000000004bfdb7 <+87>:     mov    0x710892(%rip),%rcx        # 0xbd0650
0x00000000004bfdbe <+94>:     xor    %esi,%esi
0x00000000004bfdc0 <+96>:     movslq (%rcx),%rdx
0x00000000004bfdc3 <+99>:     lea    (%rdx,%rbx,1),%rdi
0x00000000004bfdc7 <+103>:    callq  0x4c9250 <void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool)> ;; <= 这里
0x00000000004bfdcc <+108>:    mov    %r12d,%eax
0x00000000004bfdcf <+111>:    mov    (%rsp),%rbx
0x00000000004bfdd3 <+115>:    mov    0x8(%rsp),%r12
0x00000000004bfdd8 <+120>:    leaveq 
0x00000000004bfdd9 <+121>:    retq   
0x00000000004bfdda <+122>:    xchg   %ax,%ax
0x00000000004bfddd <+125>:    xchg   %ax,%ax
0x00000000004bfde0 <+128>:    mov    0x70b489(%rip),%rdi        # 0xbcb270
0x00000000004bfde7 <+135>:    movslq (%rdi),%rsi
0x00000000004bfdea <+138>:    lea    (%rsi,%rbx,1),%rdi
0x00000000004bfdee <+142>:    xor    %esi,%esi
0x00000000004bfdf0 <+144>:    callq  0x3ac1e0 <void MarkSweep::adjust_pointer<oopDesc*>(oopDesc**, bool)>
0x00000000004bfdf5 <+149>:    mov    0x70c45c(%rip),%rcx        # 0xbcc258
0x00000000004bfdfc <+156>:    xor    %esi,%esi
0x00000000004bfdfe <+158>:    movslq (%rcx),%rdx
0x00000000004bfe01 <+161>:    lea    (%rdx,%rbx,1),%rdi
0x00000000004bfe05 <+165>:    callq  0x3ac1e0 <void MarkSweep::adjust_pointer<oopDesc*>(oopDesc**, bool)>
0x00000000004bfe0a <+170>:    mov    0x71083f(%rip),%r11        # 0xbd0650
0x00000000004bfe11 <+177>:    xor    %esi,%esi
0x00000000004bfe13 <+179>:    movslq (%r11),%r10
0x00000000004bfe16 <+182>:    lea    (%r10,%rbx,1),%rdi
0x00000000004bfe1a <+186>:    callq  0x3ac1e0 <void MarkSweep::adjust_pointer<oopDesc*>(oopDesc**, bool)>
0x00000000004bfe1f <+191>:    jmp    0x4bfdcc <+108>
End of assembler dump.

template <class T> inline void MarkSweep::adjust_pointer(T* p, bool isroot) {
  T heap_oop = oopDesc::load_heap_oop(p);
  if (!oopDesc::is_null(heap_oop)) {
    oop obj     = oopDesc::decode_heap_oop_not_null(heap_oop);
    oop new_obj = oop(obj->mark()->decode_pointer());
    assert(new_obj != NULL ||                         // is forwarding ptr?
           obj->mark() == markOopDesc::prototype() || // not gc marked?
           (UseBiasedLocking && obj->mark()->has_bias_pattern()) ||
                                                      // not gc marked?
           obj->is_shared(),                          // never forwarded?
           "should be forwarded");
    if (new_obj != NULL) {
      assert(Universe::heap()->is_in_reserved(new_obj),
             "should be in object space");
      oopDesc::encode_store_heap_oop_not_null(p, new_obj);
    }
  }
  VALIDATE_MARK_SWEEP_ONLY(track_adjusted_pointer(p, isroot));
}

(gdb) disass 0x4c926c
Dump of assembler code for function void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool):
0x00000000004c9250 <+0>:       mov    (%rdi),%eax                # rdi = p, eax = heap_oop
0x00000000004c9252 <+2>:       push   %rbp
0x00000000004c9253 <+3>:       mov    %rsp,%rbp
0x00000000004c9256 <+6>:       test   %eax,%eax                  # oopDesc::is_null(heap_oop)
0x00000000004c9258 <+8>:       je     0x4c92a1 <+81>
0x00000000004c925a <+10>:      mov    0x70045f(%rip),%rsi        # 0xbc96c0 rsi = <Universe::_narrow_oop>
0x00000000004c9261 <+17>:      mov    %eax,%edx
0x00000000004c9263 <+19>:      mov    0x8(%rsi),%ecx             # ecx = <Universe::_narrow_oop._shift>
0x00000000004c9266 <+22>:      shl    %cl,%rdx
0x00000000004c9269 <+25>:      add    (%rsi),%rdx                # rdx = oopDesc::decode_heap_oop_not_null(heap_oop)
0x00000000004c926c <+28>:      mov    (%rdx),%rcx                # rcx = new_obj = obj->mark()->decode_pointer()
0x00000000004c926f <+31>:      mov    0x700c0a(%rip),%rdx        # 0xbc9e80 rdx = <UseBiasedLocking>
0x00000000004c9276 <+38>:      cmpb   $0x0,(%rdx)
0x00000000004c9279 <+41>:      je     0x4c928a <+58>
0x00000000004c927b <+43>:      mov    %rcx,%r8
0x00000000004c927e <+46>:      xor    %edx,%edx
0x00000000004c9280 <+48>:      and    $0x7,%r8d
0x00000000004c9284 <+52>:      cmp    $0x5,%r8
0x00000000004c9288 <+56>:      je     0x4c9291 <+65>
0x00000000004c928a <+58>:      mov    %rcx,%rdx
0x00000000004c928d <+61>:      and    $0xfffffffffffffffc,%rdx
0x00000000004c9291 <+65>:      test   %rdx,%rdx
0x00000000004c9294 <+68>:      je     0x4c92a1 <+81>
0x00000000004c9296 <+70>:      sub    (%rsi),%rdx
0x00000000004c9299 <+73>:      mov    0x8(%rsi),%ecx
0x00000000004c929c <+76>:      shr    %cl,%rdx
0x00000000004c929f <+79>:      mov    %edx,(%rdi)
0x00000000004c92a1 <+81>:      leaveq 
0x00000000004c92a2 <+82>:      retq   
End of assembler dump.


有core dump么?

我会比较希望这个跟之前有修复的一个CMS漏做reference processing的问题相关
Bug 7112034
但之前我还没见过那个问题导致crash的,只是会让内存使用量增加而已
aronlulu 2012-03-15
---------------  T H R E A D  ---------------

Current thread (0x00002aaabe4e9000):  VMThread [stack: 0x0000000040f39000,0x00000
00004103a000] [id=6293]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000
000000800

Registers:
RAX=0x0000000000000100, RBX=0x0000000575dbca00, RCX=0x0000000000000003, RDX=0x000
00000000000800
RSP=0x0000000041038690, RBP=0x0000000041038690, RSI=0x00002b1a03626ee0, RDI=0x000
00000575dbca18
R8 =0x0000000000000003, R9 =0x00002b1a0375bfb4, R10=0x0000000575dbca28, R11=0x000
00000000000001
R12=0x0000000000000005, R13=0x000000057b5effe8, R14=0x00002b1a03614c70, R15=0x000
002b1a0360a530
RIP=0x00002b1a0304e26c, EFLAGS=0x0000000000010206, CSGSFS=0x0000000000000033, ERR
R=0x0000000000000004
  TRAPNO=0x000000000000000e

是这段么?core.dump有 但是由于信息安全原因,我不能down文件,只能远程shell看到这个文件信息,我只能通过strings命令看到一些core文件的碎片信息,还有就是就这台服务器宕过两次,其他服务器从来没出现过。
PS:如果需要core中的关键信息,我可以grep出来。
RednaxelaFX 2012-03-15
si_addr后面的值的位数不对诶。是0x0000000000000800么?
aronlulu 2012-03-15
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000000000800
被编辑器换行了。
RednaxelaFX 2012-03-15
从寄存器的状态看,出错的时候是一个在old gen里的弱引用对象里的“discovered”字段的值有问题,值为0x00000100,不是一个合法的压缩指针值。

这说明GC堆内的数据有损坏。但我不确定是reference processing时把值弄坏的,还是GC的别的步骤弄坏的。

如果你遇到的问题能稳定重现,那么请试试看在VM启动参数里加上:
-XX:+UnlockDiagnosticVMOptions -XX:VerifyGCLevel=1 -XX:+VerifyBeforeGC

这样在每次full GC前都会做一次heap verification。会使得full GC的暂停时间变长,但不这样做也很难知道到底怎么回事了。
aronlulu 2012-03-15
RednaxelaFX 写道
从寄存器的状态看,出错的时候是一个在old gen里的弱引用对象里的“discovered”字段的值有问题,值为0x00000100,不是一个合法的压缩指针值。

这说明GC堆内的数据有损坏。但我不确定是reference processing时把值弄坏的,还是GC的别的步骤弄坏的。

如果你遇到的问题能稳定重现,那么请试试看在VM启动参数里加上:
-XX:+UnlockDiagnosticVMOptions -XX:VerifyGCLevel=1 -XX:+VerifyBeforeGC

这样在每次full GC前都会做一次heap verification。会使得full GC的暂停时间变长,但不这样做也很难知道到底怎么回事了。

感谢解答,按照你说的,这个值损坏,应该是JVM内部运行原因?
与外部环境有关系么?这台服务器与其他还有个区别就是这台服务器上跑的其他乱七八糟的东西比较多。
如果这个是JVM内部原因算不算CMS回收的bug呢,是不是CMS方式回收不稳定,这样的话就麻烦了。
请指教。。。。
RednaxelaFX 2012-03-15
除非你有啥很不乖的native库对GC堆内的数据做了手脚,不然的话能造成GC堆内数据损坏的只有JVM自身的bug。 <- 我是这么猜的。
现在我猜不出来这个bug的来源是什么地方。

不过既然你有core dump,在上面稍微再挖掘一些信息出来吧。可以么?

请你出问题的那台机器上试试执行一个类似调试器的东西来帮忙收集信息。如果实在无法在那台机器上执行,那至少也请在配置环境详细(JDK版本必须一致)。

这样做:假定你已经设定了正确的JAVA_HOME环境变量。那么:
java -cp $JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB
执行后应该看到:
hsdb> 

的提示符。这个时候输入:
attach /usr/bin/java core.123

其中attach后的第一个参数是启动这个Java程序的“java”命令的全路径,第二个参数是你的core dump文件的全路径。按回车之后,正确连接应该能看到:
Opening core file, please wait...
hsdb> 

此时core dump就正确打开了。然后请执行这个命令:
mem 0x575dbca00 5


输出的内容应该是类似这样:
0x00000006040a6a30: 0x0000000000000001 
0x00000006040a6a38: 0x00000000c0807db0 
0x00000006040a6a40: 0x0000000000000000 
0x00000006040a6a48: 0x0000000000000000 
0x00000006040a6a50: 0x0000000000000001 


接下来要拜托你看的是执行这个命令:
inspect 0x575dbca00


你看到的内容应该是类似这种形式:
instance of Oop for java/lang/ref/Reference @ 0x00000006040a6a30 @ 0x00000006040a6a30 (size = 32)
_mark: 1
referent: null null
queue: null null
next: null null
discovered: null null

想知道在你的core dump里这对象到底是什么,拜托执行了上面两个命令后把输出的内容发一下。

最后用quit命令可以退出该程序。
aronlulu 2012-03-15
执行下来的结果是这样的:
Opening core file, please wait...
hsdb> mem 0x575dbca00 5
0x0000000575dbca00: 0x00000005757f6fe3 
0x0000000575dbca08: 0xaeaba677fe11037e 
0x0000000575dbca10: 0x00000000aeabbb02 
0x0000000575dbca18: 0xaeabb82400000100 
0x0000000575dbca20: 0xaeafec2eaec580f6 

hsdb> inspect 0x575dbca00
Error: sun.jvm.hotspot.oops.UnknownOopException
hsdb> inspect 0x575dbca00
Error: sun.jvm.hotspot.oops.UnknownOopException
hsdb> 
RednaxelaFX 2012-03-15
嗯,这个对象果然是损坏的。但我不理解为什么…
先说明我看到了什么。
从crash地点的代码以及寄存器的状态,可以判断那个名为“discovered”的字段所在的地址是0x0000000575dbca18。而这个字段是位于(离对象头)偏移量24的位置。依次推算出正在处理的弱引用对象的起始地址是上面说的0x0000000575dbca00。

mem命令打印出了0x0000000575dbca00地址之后的若干字节的内容。

inspect命令本来应该能看到这个对象到底是什么,每个字段的内容是什么。
但这个对象损坏了导致它的状态无法显示出来。

请问能再用一次CLHSDB,在attach之后分别对这几个地址也执行mem和inspect么?
0x7f0881bf0 (来自0 + 0xfe11037e << 3) => 应该是一个klass,看看是什么
0x5755d33b8 (来自0 + 0xaeaba677 << 3) => 应该是一个被弱引用的Java对象
0x5755dc120 (来自0 + 0xaeabb824 << 3) => 这个不知道应该是什么,但很明显在old gen里
Global site tag (gtag.js) - Google Analytics