[讨论] JVM Crash 请教原因
aronlulu
2012-03-15
服务器上的JVM最近崩溃了两次,crash日志错误显示都是CMS垃圾回收时的错误,请教圈子里的各位高手有没有遇到过,日志片段如下:
# A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00002b1a0304e26c, pid=6278, tid=1090754880 # # JRE version: 6.0_29-b11 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.4-b02 mixed mode linux-amd64 coo mpressed oops) # Problematic frame: # V [libjvm.so+0x4c926c] void MarkSweep::adjust_pointer<unsigned>(unsigned*, bb ool)+0x1c # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp Register to memory mapping: RAX=0x0000000000000100 is an unknown value RBX=0x0000000575dbca00 is an oop [error occurred during error reporting (printing register info), id 0xb] Stack: [0x0000000040f39000,0x000000004103a000], sp=0x0000000041038690, free spp ace=1021k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x4c926c] void MarkSweep::adjust_pointer<unsigned>(unsigned*, booo l)+0x1c V [libjvm.so+0x4bfdcc] instanceRefKlass::oop_adjust_pointers(oopDesc*)+0x6c V [libjvm.so+0x395378] CompactibleFreeListSpace::adjust_pointers()+0x138 V [libjvm.so+0x47d0b0] AdjustPointersClosure::do_space(Space*)+0x10 V [libjvm.so+0x3be1f6] ConcurrentMarkSweepGeneration::space_iterate(SpaceClosuu re*, bool)+0x16 V [libjvm.so+0x47bd19] Generation::adjust_pointers()+0x29 V [libjvm.so+0x4705b0] GenAdjustPointersClosure::do_generation(Generation*)+0xx 10 V [libjvm.so+0x46e905] GenCollectedHeap::generation_iterate(GenCollectedHeap::: GenClosure*, bool)+0x35 V [libjvm.so+0x4704b3] GenMarkSweep::mark_sweep_phase3(int)+0x1d3 V [libjvm.so+0x46f9da] GenMarkSweep::invoke_at_safepoint(int, ReferenceProcesss or*, bool)+0x34a V [libjvm.so+0x3c0119] CMSCollector::do_compaction_work(bool)+0x1f9 V [libjvm.so+0x3bfb30] CMSCollector::acquire_control_and_collect(bool, bool)+00 x160 V [libjvm.so+0x3bf838] ConcurrentMarkSweepGeneration::collect(bool, bool, unsii gned long, bool)+0xf8 V [libjvm.so+0x46caba] GenCollectedHeap::do_collection(bool, bool, unsigned loo ng, bool, int)+0x51a V [libjvm.so+0x46dfc3] GenCollectedHeap::do_full_collection(bool, int)+0x53 V [libjvm.so+0x85fcb6] VM_GenCollectFull::doit()+0x66 V [libjvm.so+0x86d77a] VM_Operation::evaluate()+0x4a V [libjvm.so+0x86cd62] VMThread::evaluate_operation(VM_Operation*)+0x82 V [libjvm.so+0x86cfd3] VMThread::loop()+0x193 V [libjvm.so+0x86cade] VMThread::run()+0x6e V [libjvm.so+0x710b8f] java_start(Thread*)+0x13f VM_Operation (0x0000000042850910): GenCollectFull, mode: safepoint, requested byy thread 0x00002aaabea9d800 --------------- P R O C E S S --------------- Java Threads: ( => current thread ) 0x00002aaac09e0000 JavaThread "Unknown thread" daemon [_thread_blocked, id=8366 2, stack(0x0000000073d64000,0x0000000073e65000)] 0x00002aaac10c5800 JavaThread "?. daemon [_thread_blocked, id=7641, stack(0x000 00000073c63000,0x0000000073d64000)] 0x00002aaabf20e000 [error occurred during error reporting (printing all threads), id 0xb] VM state:at safepoint (normal execution) VM Mutex/Monitor currently owned by a thread: ([mutex/lock_event]) [0x0000000040112970] Threads_lock - owner thread: 0x00002aaabe4e9000 [0x0000000040112ef0] Heap_lock - owner thread: 0x00002aaabea9d800 Heap par new generation total 76672K, used 69206K [0x0000000570000000, 0x0000000577 5330000, 0x0000000575330000) eden space 68160K, 98% used [0x0000000570000000, 0x00000005741d3db8, 0x0000000 0574290000) from space 8512K, 21% used [0x0000000574ae0000, 0x0000000574ca1b10, 0x00000000 575330000) to space 8512K, 0% used [0x0000000574290000, 0x0000000574290000, 0x00000000 574ae0000) concurrent mark-sweep generation total 10400576K, used 94399K [0x000000057533000 00, 0x00000007f0000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 92913K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) Code Cache [0x00002aaaab08f000, 0x00002aaaab39f000, 0x00002aaaae08f000) total_blobs=1323 nmethods=868 adapters=408 free_code_cache=47236352 largest_free e_block=12928 |
|
RednaxelaFX
2012-03-15
把靠近文件顶部的siginfo那行贴出来。很明显你有些不希望公开的信息在crash log里,但现在的信息看起来不够用。
现在能看出来的是,full GC在修正指针的阶段,在处理到一个弱引用的discovered字段时挂掉了。但没有足够信息进一步跟下去。 public abstract class Reference<T> { private T referent; /* Treated specially by GC */ ReferenceQueue<? super T> queue; Reference next; transient private Reference<T> discovered; /* used by VM */ } template <class T> void specialized_oop_adjust_pointers(instanceRefKlass *ref, oop obj) { T* referent_addr = (T*)java_lang_ref_Reference::referent_addr(obj); MarkSweep::adjust_pointer(referent_addr); T* next_addr = (T*)java_lang_ref_Reference::next_addr(obj); MarkSweep::adjust_pointer(next_addr); T* discovered_addr = (T*)java_lang_ref_Reference::discovered_addr(obj); MarkSweep::adjust_pointer(discovered_addr); // <= 这里 debug_only(trace_reference_gc("instanceRefKlass::oop_adjust_pointers", obj, referent_addr, next_addr, discovered_addr);) } int instanceRefKlass::oop_adjust_pointers(oop obj) { int size = size_helper(); instanceKlass::oop_adjust_pointers(obj); if (UseCompressedOops) { specialized_oop_adjust_pointers<narrowOop>(this, obj); // <= 这里 } else { specialized_oop_adjust_pointers<oop>(this, obj); } return size; } (gdb) disass 0x4bfd60 Dump of assembler code for function instanceRefKlass::oop_adjust_pointers(oopDesc*): 0x00000000004bfd60 <+0>: push %rbp 0x00000000004bfd61 <+1>: mov %rsp,%rbp 0x00000000004bfd64 <+4>: sub $0x10,%rsp 0x00000000004bfd68 <+8>: mov %rbx,(%rsp) 0x00000000004bfd6c <+12>: mov %r12,0x8(%rsp) 0x00000000004bfd71 <+17>: mov %rsi,%rbx 0x00000000004bfd74 <+20>: mov 0x8(%rdi),%r12d 0x00000000004bfd78 <+24>: callq 0x4b6ca0 <instanceKlass::oop_adjust_pointers(oopDesc*)> 0x00000000004bfd7d <+29>: mov 0x70b55c(%rip),%rdx # 0xbcb2e0 0x00000000004bfd84 <+36>: sar $0x3,%r12d 0x00000000004bfd88 <+40>: cmpb $0x0,(%rdx) 0x00000000004bfd8b <+43>: je 0x4bfde0 <+128> 0x00000000004bfd8d <+45>: mov 0x70b4dc(%rip),%r9 # 0xbcb270 0x00000000004bfd94 <+52>: xor %esi,%esi 0x00000000004bfd96 <+54>: movslq (%r9),%r8 0x00000000004bfd99 <+57>: lea (%r8,%rbx,1),%rdi 0x00000000004bfd9d <+61>: callq 0x4c9250 <void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool)> 0x00000000004bfda2 <+66>: mov 0x70c4af(%rip),%rdi # 0xbcc258 0x00000000004bfda9 <+73>: movslq (%rdi),%rsi 0x00000000004bfdac <+76>: lea (%rsi,%rbx,1),%rdi 0x00000000004bfdb0 <+80>: xor %esi,%esi 0x00000000004bfdb2 <+82>: callq 0x4c9250 <void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool)> 0x00000000004bfdb7 <+87>: mov 0x710892(%rip),%rcx # 0xbd0650 0x00000000004bfdbe <+94>: xor %esi,%esi 0x00000000004bfdc0 <+96>: movslq (%rcx),%rdx 0x00000000004bfdc3 <+99>: lea (%rdx,%rbx,1),%rdi 0x00000000004bfdc7 <+103>: callq 0x4c9250 <void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool)> ;; <= 这里 0x00000000004bfdcc <+108>: mov %r12d,%eax 0x00000000004bfdcf <+111>: mov (%rsp),%rbx 0x00000000004bfdd3 <+115>: mov 0x8(%rsp),%r12 0x00000000004bfdd8 <+120>: leaveq 0x00000000004bfdd9 <+121>: retq 0x00000000004bfdda <+122>: xchg %ax,%ax 0x00000000004bfddd <+125>: xchg %ax,%ax 0x00000000004bfde0 <+128>: mov 0x70b489(%rip),%rdi # 0xbcb270 0x00000000004bfde7 <+135>: movslq (%rdi),%rsi 0x00000000004bfdea <+138>: lea (%rsi,%rbx,1),%rdi 0x00000000004bfdee <+142>: xor %esi,%esi 0x00000000004bfdf0 <+144>: callq 0x3ac1e0 <void MarkSweep::adjust_pointer<oopDesc*>(oopDesc**, bool)> 0x00000000004bfdf5 <+149>: mov 0x70c45c(%rip),%rcx # 0xbcc258 0x00000000004bfdfc <+156>: xor %esi,%esi 0x00000000004bfdfe <+158>: movslq (%rcx),%rdx 0x00000000004bfe01 <+161>: lea (%rdx,%rbx,1),%rdi 0x00000000004bfe05 <+165>: callq 0x3ac1e0 <void MarkSweep::adjust_pointer<oopDesc*>(oopDesc**, bool)> 0x00000000004bfe0a <+170>: mov 0x71083f(%rip),%r11 # 0xbd0650 0x00000000004bfe11 <+177>: xor %esi,%esi 0x00000000004bfe13 <+179>: movslq (%r11),%r10 0x00000000004bfe16 <+182>: lea (%r10,%rbx,1),%rdi 0x00000000004bfe1a <+186>: callq 0x3ac1e0 <void MarkSweep::adjust_pointer<oopDesc*>(oopDesc**, bool)> 0x00000000004bfe1f <+191>: jmp 0x4bfdcc <+108> End of assembler dump. template <class T> inline void MarkSweep::adjust_pointer(T* p, bool isroot) { T heap_oop = oopDesc::load_heap_oop(p); if (!oopDesc::is_null(heap_oop)) { oop obj = oopDesc::decode_heap_oop_not_null(heap_oop); oop new_obj = oop(obj->mark()->decode_pointer()); assert(new_obj != NULL || // is forwarding ptr? obj->mark() == markOopDesc::prototype() || // not gc marked? (UseBiasedLocking && obj->mark()->has_bias_pattern()) || // not gc marked? obj->is_shared(), // never forwarded? "should be forwarded"); if (new_obj != NULL) { assert(Universe::heap()->is_in_reserved(new_obj), "should be in object space"); oopDesc::encode_store_heap_oop_not_null(p, new_obj); } } VALIDATE_MARK_SWEEP_ONLY(track_adjusted_pointer(p, isroot)); } (gdb) disass 0x4c926c Dump of assembler code for function void MarkSweep::adjust_pointer<unsigned int>(unsigned int*, bool): 0x00000000004c9250 <+0>: mov (%rdi),%eax # rdi = p, eax = heap_oop 0x00000000004c9252 <+2>: push %rbp 0x00000000004c9253 <+3>: mov %rsp,%rbp 0x00000000004c9256 <+6>: test %eax,%eax # oopDesc::is_null(heap_oop) 0x00000000004c9258 <+8>: je 0x4c92a1 <+81> 0x00000000004c925a <+10>: mov 0x70045f(%rip),%rsi # 0xbc96c0 rsi = <Universe::_narrow_oop> 0x00000000004c9261 <+17>: mov %eax,%edx 0x00000000004c9263 <+19>: mov 0x8(%rsi),%ecx # ecx = <Universe::_narrow_oop._shift> 0x00000000004c9266 <+22>: shl %cl,%rdx 0x00000000004c9269 <+25>: add (%rsi),%rdx # rdx = oopDesc::decode_heap_oop_not_null(heap_oop) 0x00000000004c926c <+28>: mov (%rdx),%rcx # rcx = new_obj = obj->mark()->decode_pointer() 0x00000000004c926f <+31>: mov 0x700c0a(%rip),%rdx # 0xbc9e80 rdx = <UseBiasedLocking> 0x00000000004c9276 <+38>: cmpb $0x0,(%rdx) 0x00000000004c9279 <+41>: je 0x4c928a <+58> 0x00000000004c927b <+43>: mov %rcx,%r8 0x00000000004c927e <+46>: xor %edx,%edx 0x00000000004c9280 <+48>: and $0x7,%r8d 0x00000000004c9284 <+52>: cmp $0x5,%r8 0x00000000004c9288 <+56>: je 0x4c9291 <+65> 0x00000000004c928a <+58>: mov %rcx,%rdx 0x00000000004c928d <+61>: and $0xfffffffffffffffc,%rdx 0x00000000004c9291 <+65>: test %rdx,%rdx 0x00000000004c9294 <+68>: je 0x4c92a1 <+81> 0x00000000004c9296 <+70>: sub (%rsi),%rdx 0x00000000004c9299 <+73>: mov 0x8(%rsi),%ecx 0x00000000004c929c <+76>: shr %cl,%rdx 0x00000000004c929f <+79>: mov %edx,(%rdi) 0x00000000004c92a1 <+81>: leaveq 0x00000000004c92a2 <+82>: retq End of assembler dump. 有core dump么? 我会比较希望这个跟之前有修复的一个CMS漏做reference processing的问题相关 Bug 7112034 但之前我还没见过那个问题导致crash的,只是会让内存使用量增加而已 |
|
aronlulu
2012-03-15
--------------- T H R E A D --------------- Current thread (0x00002aaabe4e9000): VMThread [stack: 0x0000000040f39000,0x00000 00004103a000] [id=6293] siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000 000000800 Registers: RAX=0x0000000000000100, RBX=0x0000000575dbca00, RCX=0x0000000000000003, RDX=0x000 00000000000800 RSP=0x0000000041038690, RBP=0x0000000041038690, RSI=0x00002b1a03626ee0, RDI=0x000 00000575dbca18 R8 =0x0000000000000003, R9 =0x00002b1a0375bfb4, R10=0x0000000575dbca28, R11=0x000 00000000000001 R12=0x0000000000000005, R13=0x000000057b5effe8, R14=0x00002b1a03614c70, R15=0x000 002b1a0360a530 RIP=0x00002b1a0304e26c, EFLAGS=0x0000000000010206, CSGSFS=0x0000000000000033, ERR R=0x0000000000000004 TRAPNO=0x000000000000000e 是这段么?core.dump有 但是由于信息安全原因,我不能down文件,只能远程shell看到这个文件信息,我只能通过strings命令看到一些core文件的碎片信息,还有就是就这台服务器宕过两次,其他服务器从来没出现过。 PS:如果需要core中的关键信息,我可以grep出来。 |
|
RednaxelaFX
2012-03-15
si_addr后面的值的位数不对诶。是0x0000000000000800么?
|
|
aronlulu
2012-03-15
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000000000800
被编辑器换行了。 |
|
RednaxelaFX
2012-03-15
从寄存器的状态看,出错的时候是一个在old gen里的弱引用对象里的“discovered”字段的值有问题,值为0x00000100,不是一个合法的压缩指针值。
这说明GC堆内的数据有损坏。但我不确定是reference processing时把值弄坏的,还是GC的别的步骤弄坏的。 如果你遇到的问题能稳定重现,那么请试试看在VM启动参数里加上: -XX:+UnlockDiagnosticVMOptions -XX:VerifyGCLevel=1 -XX:+VerifyBeforeGC 这样在每次full GC前都会做一次heap verification。会使得full GC的暂停时间变长,但不这样做也很难知道到底怎么回事了。 |
|
aronlulu
2012-03-15
RednaxelaFX 写道 从寄存器的状态看,出错的时候是一个在old gen里的弱引用对象里的“discovered”字段的值有问题,值为0x00000100,不是一个合法的压缩指针值。
这说明GC堆内的数据有损坏。但我不确定是reference processing时把值弄坏的,还是GC的别的步骤弄坏的。 如果你遇到的问题能稳定重现,那么请试试看在VM启动参数里加上: -XX:+UnlockDiagnosticVMOptions -XX:VerifyGCLevel=1 -XX:+VerifyBeforeGC 这样在每次full GC前都会做一次heap verification。会使得full GC的暂停时间变长,但不这样做也很难知道到底怎么回事了。 感谢解答,按照你说的,这个值损坏,应该是JVM内部运行原因? 与外部环境有关系么?这台服务器与其他还有个区别就是这台服务器上跑的其他乱七八糟的东西比较多。 如果这个是JVM内部原因算不算CMS回收的bug呢,是不是CMS方式回收不稳定,这样的话就麻烦了。 请指教。。。。 |
|
RednaxelaFX
2012-03-15
除非你有啥很不乖的native库对GC堆内的数据做了手脚,不然的话能造成GC堆内数据损坏的只有JVM自身的bug。 <- 我是这么猜的。
现在我猜不出来这个bug的来源是什么地方。 不过既然你有core dump,在上面稍微再挖掘一些信息出来吧。可以么? 请你出问题的那台机器上试试执行一个类似调试器的东西来帮忙收集信息。如果实在无法在那台机器上执行,那至少也请在配置环境详细(JDK版本必须一致)。 这样做:假定你已经设定了正确的JAVA_HOME环境变量。那么: java -cp $JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB 执行后应该看到: hsdb> 的提示符。这个时候输入: attach /usr/bin/java core.123 其中attach后的第一个参数是启动这个Java程序的“java”命令的全路径,第二个参数是你的core dump文件的全路径。按回车之后,正确连接应该能看到: Opening core file, please wait... hsdb> 此时core dump就正确打开了。然后请执行这个命令: mem 0x575dbca00 5 输出的内容应该是类似这样: 0x00000006040a6a30: 0x0000000000000001 0x00000006040a6a38: 0x00000000c0807db0 0x00000006040a6a40: 0x0000000000000000 0x00000006040a6a48: 0x0000000000000000 0x00000006040a6a50: 0x0000000000000001 接下来要拜托你看的是执行这个命令: inspect 0x575dbca00 你看到的内容应该是类似这种形式: instance of Oop for java/lang/ref/Reference @ 0x00000006040a6a30 @ 0x00000006040a6a30 (size = 32) _mark: 1 referent: null null queue: null null next: null null discovered: null null 想知道在你的core dump里这对象到底是什么,拜托执行了上面两个命令后把输出的内容发一下。 最后用quit命令可以退出该程序。 |
|
aronlulu
2012-03-15
执行下来的结果是这样的:
Opening core file, please wait... hsdb> mem 0x575dbca00 5 0x0000000575dbca00: 0x00000005757f6fe3 0x0000000575dbca08: 0xaeaba677fe11037e 0x0000000575dbca10: 0x00000000aeabbb02 0x0000000575dbca18: 0xaeabb82400000100 0x0000000575dbca20: 0xaeafec2eaec580f6 hsdb> inspect 0x575dbca00 Error: sun.jvm.hotspot.oops.UnknownOopException hsdb> inspect 0x575dbca00 Error: sun.jvm.hotspot.oops.UnknownOopException hsdb> |
|
RednaxelaFX
2012-03-15
嗯,这个对象果然是损坏的。但我不理解为什么…
先说明我看到了什么。 从crash地点的代码以及寄存器的状态,可以判断那个名为“discovered”的字段所在的地址是0x0000000575dbca18。而这个字段是位于(离对象头)偏移量24的位置。依次推算出正在处理的弱引用对象的起始地址是上面说的0x0000000575dbca00。 mem命令打印出了0x0000000575dbca00地址之后的若干字节的内容。 inspect命令本来应该能看到这个对象到底是什么,每个字段的内容是什么。 但这个对象损坏了导致它的状态无法显示出来。 请问能再用一次CLHSDB,在attach之后分别对这几个地址也执行mem和inspect么? 0x7f0881bf0 (来自0 + 0xfe11037e << 3) => 应该是一个klass,看看是什么 0x5755d33b8 (来自0 + 0xaeaba677 << 3) => 应该是一个被弱引用的Java对象 0x5755dc120 (来自0 + 0xaeabb824 << 3) => 这个不知道应该是什么,但很明显在old gen里 |