[Discussion] Long JVM pauses before YGC

hpt321 2013-07-10
The JVM takes a very long time to stop all threads before a YGC, sometimes around 20 seconds; at peak times the pauses add up to roughly 3 minutes per hour.

JDK version:
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
Operating system:
x86_64 GNU/Linux


With the JVM flags -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1, the console printed the following:
vmop  [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count

370337.312: GenCollectForAllocation          [    1070          2              3    ]      [  8830     0  8831     1    24    ]  2

Printed before the YGC:
Total time for which application threads were stopped: 8.8328410 seconds

The YGC itself takes very little time; nearly all of it is spent getting every thread to reach the safepoint and stop (note spin 8830 ms and sync 8831 ms in the line above, which matches the 8.83 s stop).

The JVM flags are configured as follows:
-server -Xms8192M -Xmx8192M -Xmn1500M -Xss256k -XX:PermSize=256M -XX:MaxPermSize=256M -XX:SurvivorRatio=8 -XX:-UseBiasedLocking -XX:MonitorBound=16384 -XX:+UseSpinning -XX:PreBlockSpin=1 -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=55 -XX:CMSMaxAbortablePrecleanTime=5 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/xmail/jvm_heap.dump -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1

The same problem also occurs with -XX:+UseMembar.

Hoping the experts here can take a look.
hpt321 2013-07-10
To add, the operating system is:
CentOS release 5.4 (Final)
hpt321 2013-07-10
The problem is still there even without these four flags: -XX:-UseBiasedLocking -XX:MonitorBound=16384 -XX:+UseSpinning -XX:PreBlockSpin=1.
RednaxelaFX 2013-07-10
The most likely cause is that somewhere in your code there is
for (int i = 0; i < ...; i++) { }

or a similar loop. Such a loop is called a "counted loop": it has an explicit loop counter variable with an explicit initial value, end value, and stride. The JIT may optimize it so that there is no safepoint poll at the end of the loop body; if the loop runs for many iterations and its body calls no other methods (or the calls it makes get inlined), reaching a safepoint can become very slow.

Unfortunately there is currently no particularly convenient way to point out where such loops are. You could start by searching your own code for loops of this shape; if you find one, one workaround is to split the single loop into an equivalent pair of nested loops (sketched below), so that the safepoint poll at the end of one of the loops may be kept, reducing the wait to enter the safepoint.
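
A minimal sketch of the loop shape in question and the nested-loop rewrite (the class name, chunk size, and array size here are made up for illustration, not taken from this thread):

public class CountedLoopSketch {
    // Counted loop: int counter with explicit start, end, and stride. The
    // JIT may compile it with no safepoint poll on the back edge, so a
    // thread running many iterations here cannot stop until the loop exits.
    static long sumCounted(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // The suggested rewrite: equivalent nested loops. The back edge of one
    // of the two loops may keep its safepoint poll, so time-to-safepoint is
    // bounded by one chunk (the 4096 chunk size is arbitrary).
    static long sumChunked(int[] data) {
        long sum = 0;
        for (int start = 0; start < data.length; start += 4096) {
            int end = Math.min(start + 4096, data.length);
            for (int i = start; i < end; i++) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[10000000];
        System.out.println(sumCounted(data) + " " + sumChunked(data));
    }
}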

I have tried implementing UseCountedLoopSafepoints, but my tests so far still hit bugs, so I haven't published it. Even if I did, it most likely wouldn't be backported to JDK 6, so I'm afraid you'll have to work around the problem in your own code.
hpt321 2013-07-10
Thanks RednaxelaFX, at least now I have a direction. I'll post any progress in this thread.
RednaxelaFX 2013-07-10
hpt321 wrote:
Thanks RednaxelaFX, at least now I have a direction. I'll post any progress in this thread.

Thanks, looking forward to the results of your further investigation.
hpt321 2013-07-11
I went through all of our code and found only the two snippets below that match the "for (int i = 0; i < ...; i++) { }" shape and call no other methods.

The outer loop runs at most 100,000 times.

for (int n = 0; n < i; n++) {
    before[n] = by[n]; // element-by-element array copy
}

// pairwise exchange sort over pids[]; O(n^2) iterations, no calls in the body
for (int k = 0; k < pids.length; k++) {
    for (int j = k + 1; j < pids.length; j++) {
        if (pids[k] > pids[j]) {
            temp = pids[k];
            pids[k] = pids[j];
            pids[j] = temp;
        }
    }
}


I tested the second snippet in a test environment at a scale of 1,000,000; the safepoint pause problem did appear, but spin was at most 40 ms.
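
For reference, a harness of the kind described might look like this (a sketch only; the random data, the allocating thread, and the class name are assumptions, not details from this thread):

import java.util.Random;

public class SafepointRepro {
    // The suspect loop: an O(n^2) exchange sort whose body makes no calls,
    // so the compiled loops may carry no safepoint polls on their back edges.
    static void bubbleSort(int[] pids) {
        int temp;
        for (int k = 0; k < pids.length; k++) {
            for (int j = k + 1; j < pids.length; j++) {
                if (pids[k] > pids[j]) {
                    temp = pids[k];
                    pids[k] = pids[j];
                    pids[j] = temp;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Allocate on another thread so YGC safepoints are requested while
        // the sort is running.
        Thread allocator = new Thread(new Runnable() {
            byte[] sink; // keep a reference so the allocation isn't optimized away
            public void run() {
                for (;;) {
                    sink = new byte[1024 * 1024];
                }
            }
        });
        allocator.setDaemon(true);
        allocator.start();

        int[] pids = new int[1000000]; // the 1M scale mentioned above; note the
                                       // O(n^2) sort runs very long at this size
        Random r = new Random(42);
        for (int i = 0; i < pids.length; i++) {
            pids[i] = r.nextInt();
        }
        bubbleSort(pids); // run with -XX:+PrintSafepointStatistics and watch spin
    }
}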

Also, how can I tell from the code whether a method has been inlined? What I found on Google says the method has to be declared final.

The material I'd read earlier says "spin" refers to spin locks. In the "[time: spin block sync cleanup vmop]" columns printed by "-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1", what does spin mean?
RednaxelaFX 2013-07-11
hpt321 wrote:
Also, how can I tell from the code whether a method has been inlined? What I found on Google says the method has to be declared final.

final can help the JIT compiler decide to inline, but it is not a prerequisite. Use -XX:+PrintCompilation -XX:+PrintInlining to inspect inlining decisions (but in JDK 6's HotSpot VM you only get accurate inlining information from fastdebug or jvmg builds; in the product build the PrintInlining output is very incomplete).
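
For example (command shape only; YourMainClass is a placeholder), against a fastdebug build:

java -XX:+PrintCompilation -XX:+PrintInlining YourMainClass

On JDK 7 and later product builds, PrintInlining is a diagnostic flag, so it additionally needs -XX:+UnlockDiagnosticVMOptions in front of it.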

hpt321 wrote:
The material I'd read earlier says "spin" refers to spin locks. In the "[time: spin block sync cleanup vmop]" columns printed by "-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1", what does spin mean?

The spin value printed by PrintSafepointStatistics is the spinning done by SafepointSynchronize while it brings each thread to the safepoint. It has nothing to do with the spin in thread locking / biased locking, so setting those flags naturally has no effect on safepoint spinning either (UseSpinning and friends control thread-locking spin).

Concretely, in safepoint.cpp it begins at line 238 with
begin_statistics(nof_threads, still_running);

and ends at line 327 with
update_statistics_on_spin_end();

The comment on the spinning done in between reads:
      // Spin to avoid context switching.
      // There's a tension between allowing the mutators to run (and rendezvous)
      // vs spinning.  As the VM thread spins, wasting cycles, it consumes CPU that
      // a mutator might otherwise use profitably to reach a safepoint.  Excessive
      // spinning by the VM thread on a saturated system can increase rendezvous latency.
      // Blocking or yielding incur their own penalties in the form of context switching
      // and the resultant loss of $ residency.
      //
      // Further complicating matters is that yield() does not work as naively expected
      // on many platforms -- yield() does not guarantee that any other ready threads
      // will run.   As such we revert yield_all() after some number of iterations.
      // Yield_all() is implemented as a short unconditional sleep on some platforms.
      // Typical operating systems round a "short" sleep period up to 10 msecs, so sleeping
      // can actually increase the time it takes the VM thread to detect that a system-wide
      // stop-the-world safepoint has been reached.  In a pathological scenario such as that
      // described in CR6415670 the VMthread may sleep just before the mutator(s) become safe.
      // In that case the mutators will be stalled waiting for the safepoint to complete and the
      // the VMthread will be sleeping, waiting for the mutators to rendezvous.  The VMthread
      // will eventually wake up and detect that all mutators are safe, at which point
      // we'll again make progress.
      //
      // Beware too that that the VMThread typically runs at elevated priority.
      // Its default priority is higher than the default mutator priority.
      // Obviously, this complicates spinning.
      //
      // Note too that on Windows XP SwitchThreadTo() has quite different behavior than Sleep(0).
      // Sleep(0) will _not yield to lower priority threads, while SwitchThreadTo() will.
      //
      // See the comments in synchronizer.cpp for additional remarks on spinning.
      //
      // In the future we might:
      // 1. Modify the safepoint scheme to avoid potentally unbounded spinning.
      //    This is tricky as the path used by a thread exiting the JVM (say on
      //    on JNI call-out) simply stores into its state field.  The burden
      //    is placed on the VM thread, which must poll (spin).
      // 2. Find something useful to do while spinning.  If the safepoint is GC-related
      //    we might aggressively scan the stacks of threads that are already safe.
      // 3. Use Solaris schedctl to examine the state of the still-running mutators.
      //    If all the mutators are ONPROC there's no reason to sleep or yield.
      // 4. YieldTo() any still-running mutators that are ready but OFFPROC.
      // 5. Check system saturation.  If the system is not fully saturated then
      //    simply spin and avoid sleep/yield.
      // 6. As still-running mutators rendezvous they could unpark the sleeping
      //    VMthread.  This works well for still-running mutators that become
      //    safe.  The VMthread must still poll for mutators that call-out.
      // 7. Drive the policy on time-since-begin instead of iterations.
      // 8. Consider making the spin duration a function of the # of CPUs:
      //    Spin = (((ncpus-1) * M) + K) + F(still_running)
      //    Alternately, instead of counting iterations of the outer loop
      //    we could count the # of threads visited in the inner loop, above.
      // 9. On windows consider using the return value from SwitchThreadTo()
      //    to drive subsequent spin/SwitchThreadTo()/Sleep(N) decisions.
hpt321 2013-07-12
Thanks for the detailed reply, RednaxelaFX. I'll try the fastdebug build on a test machine first.
hpt321 2013-07-15
@RednaxelaFX, if inside the for loop I call Thread.sleep(1) (sleep 1 ms) every 10,000 iterations, will the thread stop at the sleep and enter the safepoint?
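
The pattern being asked about would look roughly like this (a sketch of the question only, with made-up names; whether it actually bounds the time-to-safepoint is exactly what is being asked):

static void loopWithSleep(int n) throws InterruptedException {
    for (int i = 0; i < n; i++) {
        // ... loop body with no other method calls ...
        if (i % 10000 == 0) {
            Thread.sleep(1); // a real, non-inlined call into the VM; the
                             // question is whether the thread can stop here
        }
    }
}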