jvm停止分配内存的诡异问题

blueswind8306 2012-01-18
今天线上一后台服务器报警,上去一看,resin服务不动了,gc.log也不打日志了,jstat -gcutil看到很诡异的日志:
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT  
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
100.00 100.00 100.00  64.23  15.90   1222  847.177    31   48.168  895.345
并且一直保持这个状态,难道MinorGC无法晋升到老生代了?但老生代还有很大空间啊(我配的老生代空间是6G,日志看还有36%=2.1G),当然也有可能是老生代的内存碎片太多导致无法晋升,但是如果是这样应该会持续Full GC才对,但是gc.log最后一条信息是几个小时前的几次MinorGC:
2012-01-18T16:03:19.770+0800: 100838.404: [GC 100838.404: [ParNew: 6389760K->798720K(6389760K), 2.4953480 secs] 8077569K->3170588K(12513280K), 2.4955040 secs] [Times: user=15.69 sys=0.70, real=2.49 secs]
2012-01-18T16:15:19.578+0800: 101558.212: [GC 101558.212: [ParNew: 6389760K->798720K(6389760K), 2.8422020 secs] 8761628K->3887766K(12513280K), 2.8423530 secs] [Times: user=18.94 sys=0.31, real=2.85 secs]
2012-01-18T17:03:09.411+0800: 104428.044: [GC 104428.044: [ParNew: 6389760K->798720K(6389760K), 0.6051860 secs] 9478806K->3909680K(12513280K), 0.6053150 secs] [Times: user=3.69 sys=0.00, real=0.60 secs]
2012-01-18T17:03:28.672+0800: 104447.305: [GC 104447.305: [ParNew(到这里就终止了)

想到用jstack看看线程栈的情况,但jstack抛了异常,放弃了。

又想到用jstat -J-Djstat.showUnsupported=true -snap [pid]命令看看,dump出了以下内容:
java.ci.totalTime=11292479
java.cls.loadedClasses=4262
java.cls.sharedLoadedClasses=0
java.cls.sharedUnloadedClasses=0
java.cls.unloadedClasses=14
java.property.java.class.path="/usr/java/jdk1.6.0_06/lib/tools.jar:/opt/resin/lib/pro.jar:/opt/resin/lib/resin.jar:/opt/resin/lib/jaxrpc-15.jar:/opt/resin/lib/jta-101.jar:/opt/resin/lib/saaj.jar:/opt/resin/lib/activation.jar:/opt/resin/lib/jstl-11.jar:/opt/resin/lib/jms-11.jar:/opt/resin/lib/script-10.jar:/opt/resin/lib/j2ee-management-10.jar:/opt/resin/lib/j2ee-deploy-10.jar:/opt/resin/lib/jpa-15.jar:/opt/resin/lib/license.jar:/opt/resin/lib/jaxb-api.jar:/opt/resin/lib/jca-15.jar:/opt/resin/lib/webutil.jar:/opt/resin/lib/jsdk-15.jar:/opt/resin/lib/jsf-12.jar:/opt/resin/lib/javamail-14.jar:/opt/resin/lib/ejb-15.jar:/opt/resin/lib/resin-util.jar:/opt/resin/lib/portlet-10.jar:/opt/resin/lib/quercus.jar"
java.property.java.endorsed.dirs="/usr/java/jdk1.6.0_06/jre/lib/endorsed"
java.property.java.ext.dirs="/usr/java/jdk1.6.0_06/jre/lib/ext:/usr/java/packages/lib/ext"
java.property.java.home="/usr/java/jdk1.6.0_06/jre"
java.property.java.library.path="/usr/java/jdk1.6.0_06/jre/lib/amd64/server:/usr/java/jdk1.6.0_06/jre/lib/amd64:/usr/java/jdk1.6.0_06/jre/../lib/amd64:/opt/resin/libexec64:/usr/java/jdk1.6.0_06/jre/lib/amd64/server:/usr/java/jdk1.6.0_06/jre/lib/amd64:/usr/java/jdk1.6.0_06/jre/../lib/amd64:/opt/resin/libexec64:/usr/java/packages/lib/amd64:/lib:/usr/lib"
java.property.java.vm.info="mixed mode"
java.property.java.vm.name="Java HotSpot(TM) 64-Bit Server VM"
java.property.java.vm.specification.name="Java Virtual Machine Specification"
java.property.java.vm.specification.vendor="Sun Microsystems Inc."
java.property.java.vm.specification.version="1.0"
java.property.java.vm.vendor="Sun Microsystems Inc."
java.property.java.vm.version="10.0-b22"
java.rt.vmArgs="-Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djava.awt.headless=true -Dresin.home=/opt/resin/ -Xmx13000m -Xms13000m -Xmn7020m -Xss256k -XX:PermSize=200m -XX:MaxPermSize=200m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:SurvivorRatio=7 -XX:TargetSurvivorRatio=100 -XX:MaxTenuringThreshold=15 -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+UseCMSCompactAtFullCollection -XX:CMSMaxAbortablePrecleanTime=30000 -XX:CMSScheduleRemarkEdenPenetration=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+CMSClassUnloadingEnabled -Xloggc:./log/gc.log"
java.rt.vmFlags=""
java.threads.daemon=41
java.threads.live=42
java.threads.livePeak=48
java.threads.started=709
sun.ci.compilerThread.0.compiles=570
sun.ci.compilerThread.0.method=""
sun.ci.compilerThread.0.time=196
sun.ci.compilerThread.0.type=1
sun.ci.compilerThread.1.compiles=581
sun.ci.compilerThread.1.method=""
sun.ci.compilerThread.1.time=188
sun.ci.compilerThread.1.type=1
sun.ci.lastFailedMethod="com/caucho/db/store/Store allocateBlock"
sun.ci.lastFailedType=2
sun.ci.lastInvalidatedMethod=""
sun.ci.lastInvalidatedType=0
sun.ci.lastMethod="java/lang/ref/Finalizer access$100"
sun.ci.lastSize=5
sun.ci.lastType=1
sun.ci.nmethodCodeSize=2053280
sun.ci.nmethodSize=4052688
sun.ci.osrBytes=8039
sun.ci.osrCompiles=25
sun.ci.osrTime=268909
sun.ci.standardBytes=331951
sun.ci.standardCompiles=1125
sun.ci.standardTime=11023570
sun.ci.threads=2
sun.ci.totalBailouts=1
sun.ci.totalCompiles=1150
sun.ci.totalInvalidates=0
sun.cls.classInitTime=678610
sun.cls.classLinkedTime=1385121
sun.cls.classVerifyTime=1402102
sun.cls.initializedClasses=2887
sun.cls.isUnsyncloadClassSet=0
sun.cls.jniDefineClassNoLockCalls=7
sun.cls.jvmDefineClassNoLockCalls=66
sun.cls.jvmFindLoadedClassNoLockCalls=5903
sun.cls.linkedClasses=3466
sun.cls.loadInstanceClassFailRate=0
sun.cls.loadedBytes=9208632
sun.cls.methodBytes=6105748
sun.cls.nonSystemLoaderLockContentionRate=21
sun.cls.sharedLoadedBytes=0
sun.cls.sharedUnloadedBytes=0
sun.cls.systemLoaderLockContentionRate=0
sun.cls.time=927943
sun.cls.unloadedBytes=23024
sun.cls.unsafeDefineClassCalls=65
sun.gc.cause="Allocation Failure"
sun.gc.collector.0.invocations=1222
sun.gc.collector.0.lastEntryTime=104447305185
sun.gc.collector.0.lastExitTime=104428649403
sun.gc.collector.0.name="PCopy"
sun.gc.collector.0.time=847176958
sun.gc.collector.1.invocations=31
sun.gc.collector.1.lastEntryTime=97289733523
sun.gc.collector.1.lastExitTime=97290932380
sun.gc.collector.1.name="CMS"
sun.gc.collector.1.time=48167741
sun.gc.generation.0.agetable.bytes.00=0
sun.gc.generation.0.agetable.bytes.01=74474616
sun.gc.generation.0.agetable.bytes.02=674104
sun.gc.generation.0.agetable.bytes.03=287575792
sun.gc.generation.0.agetable.bytes.04=3859456
sun.gc.generation.0.agetable.bytes.05=80594824
sun.gc.generation.0.agetable.bytes.06=2151296
sun.gc.generation.0.agetable.bytes.07=1021584
sun.gc.generation.0.agetable.bytes.08=89248
sun.gc.generation.0.agetable.bytes.09=247992
sun.gc.generation.0.agetable.bytes.10=768
sun.gc.generation.0.agetable.bytes.11=16520
sun.gc.generation.0.agetable.bytes.12=125728
sun.gc.generation.0.agetable.bytes.13=2856
sun.gc.generation.0.agetable.bytes.14=7816
sun.gc.generation.0.agetable.bytes.15=816
sun.gc.generation.0.agetable.size=16
sun.gc.generation.0.capacity=7361003520
sun.gc.generation.0.maxCapacity=7361003520
sun.gc.generation.0.minCapacity=7361003520
sun.gc.generation.0.name="new"
sun.gc.generation.0.space.0.capacity=5725224960
sun.gc.generation.0.space.0.initCapacity=0
sun.gc.generation.0.space.0.maxCapacity=5725224960
sun.gc.generation.0.space.0.name="eden"
sun.gc.generation.0.space.0.used=5725224960
sun.gc.generation.0.space.1.capacity=817889280
sun.gc.generation.0.space.1.initCapacity=0
sun.gc.generation.0.space.1.maxCapacity=817889280
sun.gc.generation.0.space.1.name="s0"
sun.gc.generation.0.space.1.used=817889280
sun.gc.generation.0.space.2.capacity=817889280
sun.gc.generation.0.space.2.initCapacity=0
sun.gc.generation.0.space.2.maxCapacity=817889280
sun.gc.generation.0.space.2.name="s1"
sun.gc.generation.0.space.2.used=817889280
sun.gc.generation.0.spaces=3
sun.gc.generation.0.threads=20
sun.gc.generation.1.capacity=6270484480
sun.gc.generation.1.maxCapacity=6270484480
sun.gc.generation.1.minCapacity=6270484480
sun.gc.generation.1.name="old"
sun.gc.generation.1.space.0.capacity=6270484480
sun.gc.generation.1.space.0.initCapacity=6270484480
sun.gc.generation.1.space.0.maxCapacity=6270484480
sun.gc.generation.1.space.0.name="old"
sun.gc.generation.1.space.0.used=4029159736
sun.gc.generation.1.spaces=1
sun.gc.generation.2.capacity=209715200
sun.gc.generation.2.maxCapacity=209715200
sun.gc.generation.2.minCapacity=209715200
sun.gc.generation.2.name="perm"
sun.gc.generation.2.space.0.capacity=209715200
sun.gc.generation.2.space.0.initCapacity=209715200
sun.gc.generation.2.space.0.maxCapacity=209715200
sun.gc.generation.2.space.0.name="perm"
sun.gc.generation.2.space.0.used=33343328
sun.gc.generation.2.spaces=1
sun.gc.lastCause="unknown GCCause"
sun.gc.policy.collectors=2
sun.gc.policy.desiredSurvivorSize=817889280
sun.gc.policy.generations=3
sun.gc.policy.maxTenuringThreshold=15
sun.gc.policy.name="ParNew:CMS"
sun.gc.policy.tenuringThreshold=15
sun.gc.tlab.alloc=689282504
sun.gc.tlab.allocThreads=10
sun.gc.tlab.fastWaste=0
sun.gc.tlab.fills=138
sun.gc.tlab.gcWaste=1455036
sun.gc.tlab.maxFastWaste=0
sun.gc.tlab.maxFills=128
sun.gc.tlab.maxGcWaste=331802
sun.gc.tlab.maxSlowAlloc=5
sun.gc.tlab.maxSlowWaste=533
sun.gc.tlab.slowAlloc=6
sun.gc.tlab.slowWaste=540
sun.os.hrt.frequency=1000000
sun.os.hrt.ticks=109150965158
sun.perfdata.majorVersion=2
sun.perfdata.minorVersion=0
sun.perfdata.overflow=0
sun.perfdata.size=32768
sun.perfdata.timestamp=151469
sun.perfdata.used=15008
sun.property.sun.boot.class.path="/usr/java/jdk1.6.0_06/jre/lib/resources.jar:/usr/java/jdk1.6.0_06/jre/lib/rt.jar:/usr/java/jdk1.6.0_06/jre/lib/sunrsasign.jar:/usr/java/jdk1.6.0_06/jre/lib/jsse.jar:/usr/java/jdk1.6.0_06/jre/lib/jce.jar:/usr/java/jdk1.6.0_06/jre/lib/charsets.jar:/usr/java/jdk1.6.0_06/jre/classes"
sun.property.sun.boot.library.path="/usr/java/jdk1.6.0_06/jre/lib/amd64"
sun.rt._sync_ContendedLockAttempts=37695448
sun.rt._sync_Deflations=70953
sun.rt._sync_EmptyNotifications=0
sun.rt._sync_FailedSpins=0
sun.rt._sync_FutileWakeups=3812372
sun.rt._sync_Inflations=70969
sun.rt._sync_MonExtant=4096
sun.rt._sync_MonInCirculation=0
sun.rt._sync_MonScavenged=0
sun.rt._sync_Notifications=335755
sun.rt._sync_Parks=36210278
sun.rt._sync_PrivateA=0
sun.rt._sync_PrivateB=0
sun.rt._sync_SlowEnter=0
sun.rt._sync_SlowExit=0
sun.rt._sync_SlowNotify=0
sun.rt._sync_SlowNotifyAll=0
sun.rt._sync_SuccessfulSpins=0
sun.rt.applicationTime=103548349475
sun.rt.createVmBeginTime=1326772961367
sun.rt.createVmEndTime=1326772961518
sun.rt.internalVersion="Java HotSpot(TM) 64-Bit Server VM (10.0-b22) for linux-amd64 JRE (1.6.0_06-b02), built on Mar 25 2008 01:03:02 by "java_re" with gcc 3.2.2 (SuSE Linux)"
sun.rt.interruptedBeforeIO=0
sun.rt.interruptedDuringIO=0
sun.rt.javaCommand="com.caucho.server.resin.Resin -socketwait 41419 -resin-home /opt/resin/ start"
sun.rt.jvmCapabilities="1000000000000000000000000000000000000000000000000000000000000000"
sun.rt.jvmVersion=167772182
sun.rt.safepointSyncTime=789364
sun.rt.safepointTime=898814852
sun.rt.safepoints=13071
sun.rt.threadInterruptSignaled=0
sun.rt.vmInitDoneTime=1326772961454
sun.threads.vmOperationTime=896074670

注意到有一行:
sun.ci.lastFailedMethod="com/caucho/db/store/Store allocateBlock"
还有一行:
sun.gc.cause="Allocation Failure"

无奈下,kill -9杀掉了这个进程(正常停停不掉),换了一个比较新版本的jdk(update26)。不知道大牛们能不能帮忙分析一下这个问题的原因,难道是jdk老版本的bug吗(出问题的jdk版本是1.6.0_06)?
RednaxelaFX 2012-01-20
看你的描述似乎是卡在做young GC的过程中了。当时你不应该那么快kill掉,至少应该先用pstack命令看看线程栈的状况。jstack无法连接上去时,试试jstack -F。
现在这个描述感觉要定位问题有点信息不足。

话说这个VM启动参数设置得也太奇怪了。
请问所有参数都有确实测试过它们在你的使用场景里的效果么?还是随便乱抄来的?
s0和s1都满掉跟你的TargetSurvivorRatio=100有可能有关系。

虽然跟这帖的问题不直接相关,但 SoftRefLRUPolicyMSPerMB=0 会使得所有SoftReference在每次GC都立即失效。你确定这是你想要的行为么?

sun.gc.cause显示的是当前的GC的起因;如果这里写着No GC就是当前没有在做GC。写着Allocation Failure说明现在正在做GC;这是很正常的,所有young GC的起因都会写这个。

sun.ci开头的属性是跟JIT编译相关的。这部分你可以不用关心。

升级JDK之后如果仍然出现同样的问题请把pstack的输出贴出来。
blueswind8306 2012-01-20
jstack -F当时用了,但也没有打印出线程堆栈信息。
pstack确实没有用,下次一定记得看看。。。

jvm参数是一个大并发前端应用经过很多轮调优后的最终定稿,整个调优过程我写了篇博文:
http://blueswind8306.iteye.com/admin/blogs/1194438

但是这个应用不同,它是一个后端应用,它只是定期从数据库中批量读出数据然后写入文件。之前简单的从那个应用中拷贝过来jvm参数,出问题后我也意识到之前的jvm参数是很不合适的,所以现在的jvm参数改为如下设置:
-Xmx8000m -Xms8000m -Xmn3000m -Xss256k -XX:PermSize=128m -XX:MaxPermSize=128m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:SurvivorRatio=6 -XX:TargetSurvivorRatio=80 -XX:MaxTenuringThreshold=15 -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+UseCMSCompactAtFullCollection -XX:CMSMaxAbortablePrecleanTime=30000 -XX:CMSScheduleRemarkEdenPenetration=1 -server -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+CMSClassUnloadingEnabled -XX:+PrintTenuringDistribution -Xloggc:./log/gc.log

TargetSurvivorRatio=100这个参数只是为了最大化的利用S0/S1的空间,因为之前看过毕玄的一篇博文:说说MaxTenuringThreshold这个参数,按照我的理解,TargetSurvivorRaito=100就等于设置desired_survivor_size=survivor_capacity,这样的坏处就是动态调整MaxTenuringThreshold的策略等于被关闭了,如果从Eden晋升到Survivor的内存变大,就可能会直接晋升到Old。而我这个应用经过jstat -gcutil的观察,Survivor在用的空间占Survivor最大空间的比例不大。所以等于浪费了一些Survivor的空间为了保证即使在Eden->Survivor变大的情况下也可以适应。不知道我的理解是不是正确的。。。

SoftRefLRUPolicyMSPerMB=0这个配置是之前遗留下来的配置,从代码review来看是没有什么地方使用SoftRef的,所以就沿用了此参数

这两天我通过jmap -histo发现了一个很大的问题是这个应用每个小时都会去数据库的一张过千万的表select * 到一个hashmap中,导致hashmap过大,现在已经fixed了。回想了一下,出问题的时间也应该是在这个过程,综合考虑可能是那个时间点Eden满导致MinorGC的复制回收,而在复制S0到S1的过程中又出现Old碎片过多导致无法晋升到Old,从而触发Full GC,而Full GC为什么没有抛OutOfMemoryError就搞不懂了。。。
Global site tag (gtag.js) - Google Analytics