[Reference] [Dalvik VM] A pile of notes on the Dalvik VM's JIT compiler (dumping... work in progress)

RednaxelaFX 2010-02-11
Android's Dalvik VM gained a JIT compiler in the second half of last year, which should improve the performance of Java programs on Android. For now it only supports three CPU targets: armv5te, armv5te-vfp, and armv7-a.

A post on android-platform last November announced it:
Dalvik JIT Compiler
Bill Buzbee wrote:
Nov 17 2009, 6:53 am
From: Bill Buzbee <buz...@android.com>
Date: Mon, 16 Nov 2009 14:53:50 -0800 (PST)
Local: Tues, Nov 17 2009 6:53 am
Subject: Dalvik JIT Compiler

As some of you have noticed, the latest Android Open Source Project tree (eclair) includes source code for a Dalvik JIT compiler.  The Dalvik team has been actively investigating what kind of JIT would work best over a wide range of memory- and power-constrained portable Android devices, and the code in AOSP master is an old snapshot of what we consider a promising proof-of-concept.  It is a trace-based JIT, compiling only hot code traces rather than method-at-a-time strategy typically found on server-class JITs.  It attempts to minimize heap usage, and it requires no persistent storage.  The goal is to give a quick performance boost using very little heap and battery.

The JIT has progressed significantly since the snapshot in AOSP eclair, and we're working on pushing out a more current version. Meanwhile, if you'd like to play with the prototype, you can build it by creating a buildspec.mk file in your AOSP root which includes the line "WITH_JIT := true".

Note that the prototype JIT had not been extensively tested at the time the snapshot was taken, so you can expect some breakage.  Also, it contains few optimizations other than the basic elimination of the interpreter's fetch/decode cycle.  We're looking forward to getting a newer version into the AOSP tree.

Bill Buzbee, Ben Cheng & the rest of the Dalvik team

Ben Cheng wrote:
Nov 21 2009, 3:34 am
From: Ben Cheng <bcch...@android.com>
Date: Fri, 20 Nov 2009 11:34:09 -0800
Local: Sat, Nov 21 2009 3:34 am
Subject: Re: Dalvik JIT Compiler

Hi Tomei,

Please see my replies inline:

On Fri, Nov 20, 2009 at 10:05 AM, Tomei <tomei.ninge...@gmail.com> wrote:
> Hi Ben,

> For "switchable between JIT and interpreter at bytecode level" -- is
> this a just convenience way for developing/debugging the JIT, or part
> of a fundamental design (i.e., to support single-stepping inside the
> debugger). I hope it's the former, or else it seems performance gain
> will be quite limited.

It is just a convenient way to develop/debug the JIT. Once a trace is cut for compilation, the JIT'ed instructions are allowed to move across the original bytecode boundaries. For example, we have implemented ld/st scheduling to aggressively hoist loads and sink stores until inserted scheduling barriers or memory instructions that cannot be disambiguated. Similarly, redundant stores to the Dalvik registers can also be eliminated so that not all computed results are written back to memory.

We use the "easy switch" feature to handle heavy bytecode instructions like array initializations. Instead of inflating the codegen routine to handle it inline, we issue a single-step instruction so that all compiled instructions prior to that are optimized as usual but memory state is guaranteed to be updated prior to the array initialization instruction. Then we use the interpreter to handle it (most of the time will be spent in memcpy after all), and offer a chance to jump back to the same compilation unit and continue from the next instruction in the trace.

> How much overhead does the trace collection add to interpretation?
> There's a large part of the framework code that's executed exactly
> once during app initialization. Will app start-up be hurt? With method-
> based JIT it's quite easy to turn off JIT-sampling until the app has
> painted the first screen, etc. However, with trace-based JIT, I
> imagine that such a strategy would be much harder to do.

We have two levels of trace collection mechanism in the interpreter. The first level hashes the branch target address to update a counter in a low-cost way, and the second level kicks in only after the first-level counter has saturated to show hotness. As a result, the overhead introduced to the interpreter should be very small.

> What would be the worst case scenario for trace-based compilation? For
> example, if I have a large switch statement like this

>   switch (x) { // case 0 .. 255
>   case 1:  very simple code; break;
>   case 2:  very simple code; break;
>   case 3:  very simple code; break;
>   ...
>   case 255:  very simple code; break;
>  }

> can this be handled well?

I am glad you use this code snippet as an example as it shows a benefit of the trace based JIT mechanism. When a switch statement turns hot, we compile all the cases first (ie form an empty dispatch table) where each case occupies a fixed-amount and small overhead in the code cache. Later whenever a corresponding "very simple code; break;" turns hot we compile it and chain the entry in the dispatch table to the newly constructed compilation. In this way we don't waste space and CPU cycles on rarely exercised code in a big switch statement. However, this optimization didn't make it to the Eclair snapshot in time so it is currently absent in the JIT until the next code drop.

Cheers,
-Ben

Thanks!


At Google I/O 2010 this May there will be a session, A JIT Compiler for Android's Dalvik VM, which will cover the design and implementation of this JIT.

This JIT compiler is a tracing JIT (also called a trace-based JIT): it decides what to compile mainly at the granularity of a "trace"; it currently also supports whole-method compilation.
Two other currently popular JIT compilers are also tracing JITs: TraceMonkey in the newer Firefox, and LuaJIT 2.0.
On tracing JITs, the paper describing TraceMonkey is well worth recommending: Trace-based Just-in-Time Type Specialization for Dynamic Languages.
Andreas Gal describes tracing as a very convenient way to evolve an interpreter into a compiler, particularly well suited to resource-constrained embedded systems; see his doctoral dissertation, Efficient Bytecode Verification and Compilation in a Virtual Machine.
A tracing compiler is very similar in effect to another term, the region-based compiler (RBC), but tracing is (probably) easier to implement. IBM's JVM, for example, implements an RBC.
For RBC there is an early paper, Region-Based Compilation: An Introduction and Motivation, which shows an early form of the idea.

In a compiler that translates Java source into JVM bytecode (such as javac or ECJ), a "compilation unit" (CompilationUnit) means one Java source file. The Dalvik VM's JIT also has a struct named "CompilationUnit", which must not be confused with the source-level notion: here it means one "trace".
Many early JIT compilers compiled at the granularity of a function or method, using function/method inlining to reduce call overhead and enlarge the scope of optimization. But a single function/method can contain both hot and cold paths; compiling at method granularity may waste compilation time and code space on the cold paths without any speedup in return. For this reason, many JIT compilers record the execution frequency of branches within a method and compile only the hot paths, turning the cold paths into "uncommon traps": if a cold path is actually reached, execution falls back to the interpreter or some other backup execution mode.
A tracing JIT can capture the hot execution paths through loop-heavy code even more simply and effectively. (<< to be expanded here later)
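
To make the hot/cold distinction concrete, here is a contrived C function (my own illustration, not from Dalvik): a method-granularity JIT would spend compile time and code-cache space on the error branch too, while a trace JIT records only the hot loop body and leaves the cold branch as a side exit back to the interpreter.

#include <stdio.h>

/* Contrived example: the loop body is the hot path; the error branch is
 * almost never taken.  A method-at-a-time JIT compiles both; a trace JIT
 * records only the hot loop and turns the cold branch into a side exit
 * back to the interpreter (the "uncommon trap" idea). */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] < 0) {                 /* cold: bad input, rarely true */
            fprintf(stderr, "negative input at %d\n", i);
            return -1;
        }
        s += a[i];                      /* hot: executed n times */
    }
    return s;
}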

The compiler uses two levels of intermediate representation: MIR (middle-level intermediate representation) and LIR (low-level intermediate representation). MIR and LIR nodes each form linked lists, organized under BasicBlock and CompilationUnit respectively.
The compilation pipeline is:
0. Create a CompilationUnit object to hold the state needed for one compilation;
1. Decode the Dalvik bytecode from the dex file into DecodedInstruction records and create the corresponding MIR nodes;
2. Locate basic-block boundaries, create the corresponding BasicBlock objects, and put the MIRs into them;
3. Establish the control-flow relationships, connecting the basic blocks into a control-flow graph (CFG), and add the blocks used for restoring interpreter state and for exception handling;
4. Add all the basic blocks to the CompilationUnit;
5. Lower MIR to LIR (with local and global optimizations);
6. Generate machine code from the LIR.
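
As a rough structural sketch of how these pieces relate (the field names below are simplified stand-ins, not the actual definitions in the Dalvik source):

/* Simplified sketch of the JIT's IR containers (hypothetical field names,
 * loosely modeled on the shapes described above). */
typedef struct MIR {                /* one node per decoded Dalvik instruction */
    /* DecodedInstruction dalvikInsn;   decoded form from step 1 */
    struct MIR *next;               /* MIRs form a linked list per basic block */
} MIR;

typedef struct LIR {                /* one node per target (ARM) instruction */
    struct LIR *next;               /* LIRs form a linked list per trace */
} LIR;

typedef struct BasicBlock {
    MIR *firstMIR, *lastMIR;        /* the block's MIR list (step 2) */
    struct BasicBlock *taken;       /* CFG edges: branch-taken successor ... */
    struct BasicBlock *fallThrough; /* ... and fall-through successor (step 3) */
} BasicBlock;

typedef struct CompilationUnit {    /* here: one trace, NOT a Java source file */
    BasicBlock **blockList;         /* all blocks of the trace (step 4) */
    int numBlocks;
    LIR *firstLIR;                  /* result of MIR->LIR lowering (step 5) */
} CompilationUnit;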

For a Java program to run on Android, after being compiled from source to JVM bytecode it must also be converted by dx from JVM bytecode to Dalvik VM bytecode. dx already performs some optimizations during this conversion, which helps explain why the Dalvik JIT does essentially no optimization at the MIR level.
That said, the Dalvik JIT is still far from complete: the only local optimization is redundant load/store elimination, and the only global optimization is redundant branch elimination. dx itself does not do optimizations like loop-invariant code motion, so even with the JIT the code Dalvik generates will not be of high quality; the most visible win at the moment is simply that the interpreter's instruction-dispatch overhead is gone.
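
To illustrate the one local optimization that is present, redundant load/store elimination, here is a hand-written C analogy (fp standing in for the in-memory Dalvik register frame; my illustration, not generated code):

/* C analogy of redundant load/store elimination inside one trace.
 * fp points at the in-memory Dalvik register frame (v0 = fp[0], etc.). */

/* Naive template code: every bytecode loads and stores through memory. */
void trace_naive(int *fp) {
    fp[2] = fp[0] + fp[1];   /* add-int v2, v0, v1 */
    fp[3] = fp[2] * fp[0];   /* mul-int v3, v2, v0: reloads v2 and v0 */
}

/* After the optimization: v2 lives in a machine register (local t) inside
 * the trace; memory write-backs happen only for values live at trace exit. */
void trace_optimized(int *fp) {
    int t = fp[0] + fp[1];
    fp[3] = t * fp[0];
    fp[2] = t;               /* write-back deferred to the end of the trace */
}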
RednaxelaFX 2010-02-11
To support the JIT, the Dalvik VM's original interpreter was also modified, adding new entry points and profiling hooks.
For example, vm/mterp/armv5te/header.S gains some new macros:
#if defined(WITH_JIT)
#define GET_JIT_ENABLED(_reg)       ldr     _reg,[rGLUE,#offGlue_jitEnabled]
#define GET_JIT_PROF_TABLE(_reg)    ldr     _reg,[rGLUE,#offGlue_pJitProfTable]
#endif
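
rGLUE points at the interpreter's per-thread glue/state struct, and pJitProfTable is the first-level profiling table Ben Cheng described above: branch-target addresses are hashed into a table of small counters, and only a saturated counter triggers the heavier trace-selection machinery. A minimal C sketch of that scheme (table size, hash, threshold, and function names are all made up, not Dalvik's):

#include <stdint.h>

#define PROF_TABLE_SIZE 2048          /* made-up size; must be a power of two */
#define HOTNESS_THRESHOLD 40          /* made-up saturation threshold */

static uint8_t jitProfTable[PROF_TABLE_SIZE];

void startTraceSelection(const uint16_t *dalvikPC);  /* hypothetical hook */

/* Called on a taken backward branch: hash the branch-target Dalvik PC into
 * a small counter; when it saturates, hand off to trace selection. */
void onBackwardBranch(const uint16_t *dalvikPC) {
    uint32_t h = ((uintptr_t)dalvikPC >> 1) & (PROF_TABLE_SIZE - 1);
    if (++jitProfTable[h] >= HOTNESS_THRESHOLD) {
        jitProfTable[h] = 0;          /* reset and escalate */
        startTraceSelection(dalvikPC);
    }
}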

These macros are then used in the handlers of some instructions, for example the compare-and-branch instructions:
%verify "branch taken"
%verify "branch not taken"
    /*
     * Generic two-operand compare-and-branch operation.  Provide a "revcmp"
     * fragment that specifies the *reverse* comparison to perform, e.g.
     * for "if-le" you would use "gt".
     *
     * For: if-eq, if-ne, if-lt, if-ge, if-gt, if-le
     */
    /* if-cmp vA, vB, +CCCC */
    mov     r0, rINST, lsr #8           @ r0<- A+
    mov     r1, rINST, lsr #12          @ r1<- B
    and     r0, r0, #15
    GET_VREG(r3, r1)                    @ r3<- vB
    GET_VREG(r2, r0)                    @ r2<- vA
    mov     r9, #4                      @ r9<- BYTE branch dist for not-taken
    cmp     r2, r3                      @ compare (vA, vB)
    b${revcmp}  1f                      @ branch to 1 if comparison failed
    FETCH_S(r9, 1)                      @ r9<- branch offset, in code units
    movs    r9, r9, asl #1              @ convert to bytes, check sign
    bmi     common_backwardBranch       @ yes, do periodic checks
1:
#if defined(WITH_JIT)
    GET_JIT_PROF_TABLE(r0)
    FETCH_ADVANCE_INST_RB(r9)           @ update rPC, load rINST
    b        common_testUpdateProfile
#else
    FETCH_ADVANCE_INST_RB(r9)           @ update rPC, load rINST
    GET_INST_OPCODE(ip)                 @ extract opcode from rINST
    GOTO_OPCODE(ip)                     @ jump to next instruction
#endif

The comment documenting the newly added entry points, from vm/interp/InterpDefs.h:
/*
 * There are six entry points from the compiled code to the interpreter:
 * 1) dvmJitToInterpNormal: find if there is a corresponding compilation for
 *    the new dalvik PC. If so, chain the originating compilation with the
 *    target then jump to it.
 * 2) dvmJitToInterpInvokeNoChain: similar to 1) but don't chain. This is
 *    for handling 1-to-many mappings like virtual method call and
 *    packed switch.
 * 3) dvmJitToInterpPunt: use the fast interpreter to execute the next
 *    instruction(s) and stay there as long as it is appropriate to return
 *    to the compiled land. This is used when the jit'ed code is about to
 *    throw an exception.
 * 4) dvmJitToInterpSingleStep: use the portable interpreter to execute the
 *    next instruction only and return to pre-specified location in the
 *    compiled code to resume execution. This is mainly used as debugging
 *    feature to bypass problematic opcode implementations without
 *    disturbing the trace formation.
 * 5) dvmJitToTraceSelect: if there is a single exit from a translation that
 *    has already gone hot enough to be translated, we should assume that
 *    the exit point should also be translated (this is a common case for
 *    invokes).  This trace exit will first check for a chaining
 *    opportunity, and if none is available will switch to the debug
 *    interpreter immediately for trace selection (as if threshold had
 *    just been reached).
 * 6) dvmJitToPredictedChain: patch the chaining cell for a virtual call site
 *    to a predicted callee.
 */
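
Since "chaining" shows up in 1), 5) and 6) above (and in a question further down the thread), a word on it: each trace exit is emitted as a small "chaining cell" that initially branches back into the interpreter; once a translation exists for the exit's target, the cell is patched into a direct branch to that translation, so execution can flow from trace to trace without returning to the interpreter. A hypothetical sketch of the idea (not the actual Dalvik data layout):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of trace chaining.  Each trace exit owns one cell;
 * branchInsn starts out as a branch back into the interpreter glue
 * (e.g. dvmJitToInterpNormal). */
typedef struct ChainingCell {
    uint32_t branchInsn;                /* the patchable branch instruction */
    const uint16_t *targetDalvikPC;     /* where execution should continue */
} ChainingCell;

uint32_t encodeBranchTo(void *target);              /* hypothetical encoder */
void *lookupTranslation(const uint16_t *dalvikPC);  /* hypothetical lookup */

/* Called from the interpreter glue when a chained exit is taken and a
 * translation for the target now exists: rewrite the cell so future
 * executions jump translation-to-translation, skipping the interpreter. */
void chainExit(ChainingCell *cell) {
    void *code = lookupTranslation(cell->targetDalvikPC);
    if (code != NULL)
        cell->branchInsn = encodeBranchTo(code);  /* patch; then flush icache */
}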


=====================

The common entry point of the mterp interpreter is dvmMterpStdRun(), defined in entry.S in mterp's platform-specific code.

=====================

The instruction templates used by the JIT compiler can be found under vm/compiler/template/out/CompilerTemplateAsmXXX. So the Dalvik JIT apparently is also template-based in its final code generation.
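
"Template-based" means each operation has a pre-assembled, fixed machine-code fragment, so code generation is mostly copying templates into the code cache and patching operands. A toy sketch of the idea (names and table contents are made up for illustration):

#include <stddef.h>
#include <string.h>

/* Toy sketch of template-based code generation: one pre-assembled machine
 * code fragment per operation, emitted by memcpy into the code cache. */
typedef struct Template {
    const unsigned char *code;   /* pre-assembled fragment */
    size_t size;
} Template;

extern const Template gTemplates[256];   /* hypothetical: indexed by opcode */

unsigned char *emitOp(unsigned char *codeCachePtr, unsigned op) {
    const Template *t = &gTemplates[op];
    memcpy(codeCachePtr, t->code, t->size);  /* copy the canned fragment */
    /* ...then patch operand fields (register numbers, offsets) in place... */
    return codeCachePtr + t->size;
}
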
xuehui.hf 2010-04-15
I don't know much about JITs; could you recommend some learning materials?
RednaxelaFX 2010-04-18
xuehui.hf wrote:
I don't know much about JITs; could you recommend some learning materials?

What level of material are you after? Understanding the concepts and goals of JIT compilation, or the internal implementation details? For the former you can read wikis or the books recommended below; for the latter you basically have to read papers... When I find time I will put together a list of papers I think are worth reading.

Among the books I have read, not a single one is devoted specifically to JIT compilation. The books that cover concrete VM implementations never cover the JIT compiler in detail; they mostly cover interpreters. General VM overview books do mention JIT compilers, not in much depth, but some of them explain the concepts and ideas well. So I can recommend Virtual Machines: Versatile Platforms for Systems and Processes (Chinese edition: 《虚拟机——系统与进程的通用平台》); its material on binary translation is essentially the same thing as JIT compilation under a different name. And if you can get hold of Shared Source CLI 2.0 Internals, it covers the implementation details of the simple JIT compiler in SSCLI 2.0.

If you want to understand how JIT compilers are implemented, it helps to first pick up some traditional compiler background, so the many abbreviations that appear in the literature are less of a struggle. Either the Dragon Book or the Tiger Book is a fine starting point.

=================================================================

Here is how a 2000 paper from Sun Labs characterizes JIT compilers:
Mixed-mode Bytecode Execution
David Detlefs, Ole Agesen wrote:
JIT compilers form a continuum, from very fast compilers producing mediocre code to slow compilers producing highly optimized code. Still, it is possible to divide this continuum roughly in two, calling one side true JITs and the other traditional compilers. This naming conveys the generally true fact that fast JIT compilers are written from scratch for use as dynamic compilers in virtual machines. Compilation speed is usually the paramount concern for such compilers. On the other hand, many of the slower, highly-optimizing compilers in use in VMs are actually traditional static compilers adapted and pressed into use as JIT compilers.

The continuum continues, of course, within these categories. Template-based JIT compilers produce a fixed code pattern for each virtual machine bytecode, doing almost no optimization but running extremely fast. More aggressive JIT compilers introduce some optimizations, but only those that preserve compilation speed. Often this means performing only optimizations that can be accomplished in a single linear code-generation pass.

The paper is quite old and some of its arguments may be dated by now, but the two paragraphs quoted above still hold up.

In short, just-in-time (JIT) compilation is dynamic compilation/translation of code: some form of source is translated into target code while the program is running. The "source" may be textual source code in a high-level language, but more commonly it is bytecode, i.e. virtual machine instructions; the target code is usually instructions of the actual machine.

The biggest difference between a JIT and a traditional static compiler is that the former compiles while the user's program is running, whereas the latter finishes compiling before the program runs. This changes many of the trade-offs: a JIT can afford only a limited compilation budget, while a traditional static compiler can be thought of as having an unlimited one. Early JIT compilers, constrained by that budget, could only generate mediocre code.

Many of today's dynamic compilers no longer qualify as "JIT" in the original sense. "Just-in-time" means "just barely in time": a function/method (or some other compilation unit) is compiled upon its first execution, so the generated target code arrives just in time for that execution. In today's mixed-mode VMs and dynamic compilation systems, a function/method may start out in the interpreter and only get dynamically compiled some time later, which by the original meaning is already too late, not "just in time". But "JIT" has largely become a conventional label, so there is no need to fuss over this small detail.

The two paragraphs quoted above describe JIT compilers as a continuum: at one end, compilers that are very fast but produce only mediocre code; at the other, slow compilers that produce highly optimized code.
Within that continuum there are many sub-categories. The simplest class of fast JIT compilers is template-based: each bytecode maps to a fixed target-code pattern. "Compilation" in such a JIT is merely the process of stringing pre-made templates together according to the source program, with essentially no analysis and no optimization.

Interpreters form a continuum too, from the simplest and slowest line interpreters to relatively fast context-threading, inline-threading, and similar designs. The fast end of the interpreter continuum meets the template-based JIT compilers: both generate new target code at run time to execute the user's program. Widening the view, then, everything from interpreters to dynamic compilers also forms one continuum. My own view is that the dividing line between an interpreter and a dynamic compiler is whether the instruction dispatch of the user's program is done in software or directly by hardware: once an interpreter is optimized to the point where the software-side dispatch overhead for the user's program is completely gone, I would call it a compilation system.
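
To make "instruction dispatch done in software" concrete, the toy switch-based interpreter below spends one hard-to-predict indirect jump per executed bytecode on the switch itself; threaded interpreters and template-based JITs exist precisely to shrink or remove that cost. (Toy bytecode set, made up for illustration.)

#include <stdint.h>

/* Toy switch-based interpreter: the "software instruction dispatch" is the
 * switch itself, one hard-to-predict indirect jump per executed bytecode. */
enum { OP_PUSH, OP_ADD, OP_HALT };

int interpret(const uint8_t *code) {
    int stack[64], *sp = stack;
    for (;;) {
        switch (*code++) {        /* the dispatch: compiles to an indirect jump */
        case OP_PUSH: *sp++ = *code++;       break;  /* push immediate */
        case OP_ADD:  sp--; sp[-1] += sp[0]; break;  /* pop two, push sum */
        case OP_HALT: return sp[-1];                 /* result on top */
        }
    }
}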

The point about the dividing line between interpreters and compilers is probably too abstract as stated... After I finish a talk next weekend I will post the slide deck, which contains examples.
nkhanxh 2011-12-26
Actually I feel the best thing about a JIT is not merely that it saves the extra instructions of instruction dispatch; even better is that it makes branches more predictable.
Otherwise the pipeline penalties would be unbearable for a modern CPU.

The dispatch loop itself is very hard to predict. Direct threading seems to relieve this a little, but not by much: there is still a jump, and that jump is indexed by bytecode.op, so it too is unpredictable.
niho 2013-08-28
My impression is that the Dalvik JIT's handling of loops is rather weak. compileLoop calls a function, dvmCompilerFilterLoopBlocks, that supports neither nested loops nor the case where the first block is not inside the loop body. So how does the JIT handle those cases? I have never managed to figure it out...
pzz2011 2014-11-08
@RednaxelaFX, what is a chaining cell? I can't figure it out.