As is well known, Java programs run on the JVM, but the JVM performs a whole series of optimizations at runtime. These include call-site "linking", automatic inlining, loop unrolling, branch prediction, reflection, implicit conversion between primitives and objects, tail recursion, exception handling and, of course, the locking mechanisms used in multithreaded code. Understanding the assumptions the JVM makes when optimizing a program is essential for writing correct, efficient code. The talk compares many code snippets, distills ten cases that affect performance, and uses benchmarks to measure the differences, giving the reader a concrete feel for how different ways of writing a program affect its performance.


1. An introduction to JVM performance

2. Performance-talk disclaimer: EVERYTHING IS A LIE!! Please keep in mind: The JVM's performance model is an implementation detail you cannot rely on. Performance is hard to get right and it is difficult to measure. We look at HotSpot in this talk; other JVMs might behave differently. Occasionally, implementations are performant without appearing to be.

3. How is Java code executed?

    Java source code → javac → byte code → JVM → machine code → processor

Optimizations are applied almost exclusively after handing responsibility to the JVM. This makes them difficult to trace, as the JVM is often seen as a black box. Other compilers, such as for example scalac, might however apply optimizations such as resolving tail recursion into ordinary loops.

4. HotSpot: interpretation and tiered compilation

- level 0: interpreter (templating, no profiling)
- level 1: C1 ("client"), no profiling (trivial methods)
- level 2: C1, simple profiling (used when C2 is busy)
- level 3: C1, advanced profiling
- level 4: C2 ("server"), profile-based optimization

Mostly, steady-state performance is of interest. Compilation only of "hot spots", with a single method as the smallest compilation unit.

5. A central building block: call sites

    class Foo {
        void bar() {
            System.out.println("Hello!");
        }
    }

    void doSomething(Foo val) {
        val.bar(); // a call site, that is, a specific method call instruction in the code (an indirection)
    }

Other than in many languages, in Java most method calls are virtual. The question is: how does the JVM reason about what code to execute? Method invocation is a very common task for a JVM, it better be fast!

6. Virtual method tables (vtables / itables)

    class Foo {
        void bar() {
            System.out.println("Hello!");
        }
    }

    class Sub extends Foo {
        @Override void bar() {
            System.out.println("Woops!");
        }
    }

vtable of class Foo:

    #  Method           Code
    1  hashCode()       0x234522
    2  equals(Object)   0x65B4A6
    3  toString()       0x588252
    …  …                …
    8  bar()            Foo's implementation

vtable of class Sub:

    #  Method           Code
    1  hashCode()       0x234522
    2  equals(Object)   0x65B4A6
    3  toString()       0x588252
    …  …                …
    8  bar()            Sub's implementation

Single inheritance allows for index-based lookup of a method implementation. But resolving this triple indirection on every method call is still too slow!
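A minimal sketch of the dispatch that such a table enables (class names follow the slide; the method bodies return strings instead of printing so the result is observable, and `dispatch` is my name for the call-site method):

```java
public class VtableDemo {

    static class Foo {
        String bar() { return "Hello!"; }
    }

    static class Sub extends Foo {
        @Override String bar() { return "Woops!"; }
    }

    // A single call site: which implementation runs is decided by the
    // receiver's runtime class (conceptually, a vtable slot lookup).
    static String dispatch(Foo val) {
        return val.bar();
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new Foo())); // Hello!
        System.out.println(dispatch(new Sub())); // Woops!
    }
}
```

The same instruction `val.bar()` executes different code for `Foo` and `Sub`, which is exactly why the JVM needs a fast lookup strategy here.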

7. Inline caches

    class Foo {
        void bar() {
            System.out.println("Hello!");
        }
    }

    void doSomething(Foo val) {
        val.bar();  [cache: val => Foo: address]  // cached link
    }

Inline caches observe instance classes and remember the address of a class's method implementation. This avoids the lookup in a virtual method table. Smalltalk is a prominent user of such caches. But this double indirection is still too slow!

8. Monomorphic ("linked") call site

    class Foo {
        void bar() {
            System.out.println("Hello!");
        }
    }

    void doSomething(Foo val) {
        [assert: val => Foo]
        [goto: method address]  // direct link
    }

The JVM is based on making optimistic assumptions and adding traps for when these assumptions are not met ("adaptive runtime"). Heuristics show that most call sites only ever observe a single class ("monomorphic"). These same heuristics also show that non-monomorphic call sites often observe many types ("megamorphic"). The JVM has created a profile for this call site. It is now optimistic about what instances it will observe.

9. Call-site states

    monomorphic → bimorphic → polymorphic → megamorphic

- monomorphic (about 90% of call sites): direct link
- bimorphic: conditional direct link
- polymorphic: "home of rumors"
- megamorphic (typical for data structures): vtable lookup (but dominant targets)

A call site's profile is generated at runtime and it is adapted after collecting sufficient information. In general, the JVM tries to be optimistic (optimization, moving left) and becomes more pessimistic once it must (deoptimization, moving right). This is an adaptive approach; native programs cannot do this.

10. Inlining

Before (a monomorphic call site):

    class Foo {
        void bar() {
            System.out.println("Hello!");
        }
    }

    void doSomething(Foo val) {
        [assert: val => Foo]
        [goto: method address]
    }

After inlining:

    void doSomething(Foo val) {
        [assert: val => Foo]
        System.out.println("Hello!");
    }

Inlining is often considered an "uber optimization" as it gives the JVM more code to optimize as a single block. The C1 compiler does only little inlining after performing "class hierarchy analysis" (CHA). The C2 compiler inlines monomorphic and bimorphic call sites (with a conditional jump) and the dominant target (> 90%) of a megamorphic call site. Small methods (< 35 bytes) are always inlined. Huge methods are never inlined.

11. Call receiver profiling: every type matters!

    List<String> list = ... // either ArrayList or LinkedList
    list.size(); // a bimorphic call site

    // a new class turns the call site into a megamorphic state
    new ArrayList<String>() {{
        add("foo");
        add("bar");
    }};

When the JVM profiles call sites or conducts class hierarchy analysis, it takes the receiver type at a call site into consideration; it does not analyze whether a method is actually overridden. For this reason, every type matters (even when calling final methods). You might wonder why this is not optimized: looking up an object's class is an order-one operation; examining a class hierarchy is not. The JVM needs to choose a trade-off when optimizing, and analyzing the hierarchy does not pay off (educated guess). "Double brace initialization", however, often introduces new (obsolete) types at call sites. Often enough, this results in vtable/itable lookups!
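To make the double-brace pitfall concrete, here is a minimal sketch (class and method names are mine): the double-brace form compiles to an anonymous subclass of `ArrayList`, so every occurrence adds a new receiver type to any call site the list later flows through.

```java
import java.util.ArrayList;
import java.util.List;

public class DoubleBraceDemo {

    // Plain initialization: the runtime class stays java.util.ArrayList.
    static List<String> plain() {
        List<String> list = new ArrayList<>();
        list.add("foo");
        list.add("bar");
        return list;
    }

    // "Double brace initialization": the {{ ... }} is an instance
    // initializer of an anonymous subclass of ArrayList.
    static List<String> doubleBrace() {
        return new ArrayList<String>() {{
            add("foo");
            add("bar");
        }};
    }

    public static void main(String[] args) {
        // Equal contents, but two distinct receiver classes.
        System.out.println(plain().getClass());       // java.util.ArrayList
        System.out.println(doubleBrace().getClass()); // an anonymous subclass
    }
}
```

Both lists are equal element-wise, yet a profiled call site that sees both of them observes two types instead of one.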

12. Microoptimizing method dispatch

    interface Foo {
        void m();
    }

    class Sub1 implements Foo {
        @Override public void m() { ... }
    }
    class Sub2 implements Foo {
        @Override public void m() { ... }
    }
    class Sub3 implements Foo {
        @Override public void m() { ... }
    }

    void doSomething(Foo foo) {
        foo.m();
    }

If all three types are observed, this call site is megamorphic. A target is only inlined if it is dominant (> 90%).

Idea: avoid dynamic dispatch but emulate it at the call site ("call by id"). Fields are never resolved dynamically, and static call sites always have an explicit target.

    class Foo {
        int id; // 1, 2, 3
        static void sub1() { ... }
        static void sub2() { ... }
        static void sub3() { ... }
    }

    void doSomething(Foo foo) {
        switch (foo.id) {
            case 1: Foo.sub1(); break;
            case 2: Foo.sub2(); break;
            case 3: Foo.sub3(); break;
            default: throw new IllegalStateException();
        }
    }

Do not microoptimize unless you must! The improvement is minimal. In general: static/private > class virtual (null check) > interface virtual (null + type check). This is true for all dispatchers (C2, C1, interpreter).

Source: http://shipilev.net/blog/2015/black-magic-method-dispatch/

13. Call site specialization

    static void log(Object... args) {
        System.out.println("Log: ");
        for (Object arg : args) {
            System.out.println(arg.toString());
        }
    }

    void doSomething() {
        log("foo", 4, new Object());
    }

After inlining:

    void doSomething() {
        System.out.println("Log: ");
        Object[] args = new Object[]{"foo", 4, new Object()};
        for (Object arg : args) {
            System.out.println(arg.toString());
        }
    }

After unrolling the entire loop (it is now of a fixed size):

    void doSomething() {
        System.out.println("Log: ");
        System.out.println("foo".toString());
        System.out.println(new Integer(4).toString());
        System.out.println(new Object().toString());
    }

Thanks to inlining (and loop unrolling), additional call sites are introduced. This way, formerly megamorphic call sites can become monomorphic after duplication. Generally, optimizations allow for new optimizations. This is especially true for inlining.

14. ONE TYPE GOOD! MANY TYPES BAD! The Hulk performance rule #1

15. All programs are typed!

Types (which are not the same as classes) allow us to identify "things" in our programs that are similar. If nothing in your program has similarities, there might be something wrong. Thus, even machines for dynamic languages look for types (e.g. V8, Nashorn).

    var foo = {};
    foo.x = foo;
    foo.y = 42;

    var bar = {};
    bar.y = 42;
    bar.x = bar;

V8 tracks such shapes as "hidden classes": starting from the empty shape *, foo transitions via x to the shape {x, y}, while bar transitions via y to the shape {y, x} — two distinct hidden classes for structurally "equal" objects.

If your program has no structure, how should an optimizer find any? Any "dynamic program" is typed, but it is so implicitly. In the end, you simply did not make this structure explicit.

16. Branch prediction

A conditional control flow is referred to as a branch. Can the outcome of the conditional instruction below be predicted (by the processor)?

    int size = 20_000;
    int maximum = 100;
    int[] values = randomValues(size, maximum);
    Arrays.sort(values);

    int sum = 0;
    for (int i = 0; i < 1_000; i++) {
        for (int value : values) {
            if (value > 50) {
                sum += value;
            } else {
                sum -= value;
            }
        }
    }

Because the array is sorted, the branch takes the same direction in long runs and the processor predicts it well. Warning: this example is too simple; the VM (loop interchange, conditional moves) has become smarter than that. After adding more "noise", the example would however work.

An unfortunate example where the above problem applies are (currently!) Java 8 streams, which build on (internal) iteration and conditionals (i.e. filters). If the VM fails to inline such a stream expression (under a polluted profile), streams can be a performance bottleneck.
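The slide's `randomValues` helper is not shown, so the sketch below supplies its own stand-in (fixed seed, names are mine). It demonstrates the key point that sorting changes only the branch pattern, never the result:

```java
import java.util.Arrays;
import java.util.Random;

public class BranchDemo {

    // Stand-in for the slides' randomValues(size, maximum) helper.
    static int[] randomValues(int size, int maximum) {
        Random random = new Random(42); // fixed seed for repeatability
        int[] values = new int[size];
        for (int i = 0; i < size; i++) {
            values[i] = random.nextInt(maximum);
        }
        return values;
    }

    // The branch-heavy loop from the slide. On a sorted array the
    // "value > 50" branch is false for a long run, then true for a
    // long run, which branch predictors handle almost perfectly.
    static long branchySum(int[] values) {
        long sum = 0;
        for (int value : values) {
            if (value > 50) {
                sum += value;
            } else {
                sum -= value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = randomValues(20_000, 100);
        long unsorted = branchySum(values);
        Arrays.sort(values); // affects timing, not the result
        long sorted = branchySum(values);
        System.out.println(unsorted == sorted); // true
    }
}
```

Any timing difference between the two calls would come purely from prediction behavior (modulo the VM smartness caveat above), since the computed sum is identical.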

17. Loop peeling (in combination with branch specialization)

    int[][] matrix = ...
    for (int[] row : matrix) {
        boolean first = true;
        for (int value : row) {
            if (first) {
                first = false;
                System.out.println("Row: ");
            }
            System.out.print(value + " ");
        }
        System.out.println(" --- ");
    }

Peeling the first iteration lets the compiler specialize the branch away; the condition is only ever true on that first iteration:

    int[][] matrix = ...
    for (int[] row : matrix) {
        if (row.length > 0) {
            System.out.println("Row: ");
            System.out.print(row[0] + " ");
            for (int index = 1; index < row.length; index++) {
                System.out.print(row[index] + " ");
            }
        }
        System.out.println(" --- ");
    }

Disclaimer: There is much more "loop stuff".

18. PREDICTION GOOD! RANDOM BAD! The Hulk performance rule #2

Keep in mind: Obviously, any application contains an inherent unpredictability that cannot be removed. Performant programs should however not add more complexity than necessary, as this burdens modern processors, which prefer processing long, predictable pipes of instructions.

19. Escape analysis

    List<String> list = ...;
    for (String s : list) {
        System.out.println(s);
    }

desugars to:

    List<String> list = ...;
    Iterator<String> it = list.iterator(); // object allocation
    while (it.hasNext()) {                 // the iterator's scope
        System.out.println(it.next());
    }

Any heap-allocated object needs to be garbage collected at some point. Even worse, accessing an object on the heap implies an indirection which should be avoided. If the JVM can prove that the iterator never leaves its scope, the allocation can be elided.

Escape analysis is difficult (expensive) to conduct. By avoiding long scopes, i.e. writing short methods, an object's scope is easier to determine. This will most likely improve in future JVM implementations.
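A small sketch of the two cases (method names are mine): in `totalLength` the desugared iterator stays method-local, so escape analysis may elide the heap allocation; in `leakingIterator` the object escapes and must live on the heap.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class EscapeDemo {

    // The Iterator allocated by the for-each loop never leaves this
    // short method; escape analysis can prove it does not escape and
    // may replace the allocation with stack-held scalars.
    static int totalLength(List<String> list) {
        int total = 0;
        for (String s : list) { // desugars to list.iterator()
            total += s.length();
        }
        return total;
    }

    // Returning the iterator makes it escape: it must be heap-allocated.
    static Iterator<String> leakingIterator(List<String> list) {
        return list.iterator();
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("foo", "bar", "baz");
        System.out.println(totalLength(list)); // 9
    }
}
```

Whether the allocation is actually elided is up to the JIT; the point is that the short, self-contained method makes the analysis tractable.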

20. STACK GOOD! HEAP BAD! The Hulk performance rule #3

21. Dead-code elimination

    int size = 20_000;
    int[] values = randomValues(size);

    long start = System.currentTimeMillis();
    int sum = 0;
    for (int value : values) {
        sum += value;
    }
    long end = System.currentTimeMillis();
    System.out.println("Took " + (end - start) + " ms");

Since sum is never used, the JIT compiler may eliminate the loop entirely, so the benchmark measures nothing. Also, the outcome might depend on the JVM's collected code profile that was gathered before the benchmark is run. Also, the measured time represents wall-clock time, which is not a good choice for measuring small amounts of time.

22. A better benchmark

    void run() {
        int size = 500_000;
        // warm up the JIT compiler first
        for (int i = 0; i < 10_000; i++) {
            doBenchmark(randomValues(size));
        }
        int[] values = randomValues(size);
        System.out.println("This time is for real!");
        doBenchmark(values);
    }

    void doBenchmark(int[] values) {
        long start = System.nanoTime();
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        long end = System.nanoTime();
        System.out.println("Ignore: " + sum); // consume the result to prevent dead-code elimination
        System.out.println("Took " + (end - start) + " ns");
    }

23. A good benchmark: JMH

    class Sum {

        int[] values;

        @Setup
        void setup() {
            values = randomValues(size);
        }

        @Benchmark
        int sum() {
            int sum = 0;
            for (int value : values) {
                sum += value;
            }
            return sum;
        }
    }

In general, avoid measuring loops.

24. Measuring the right thing, the right way

Measuring the performance of two operational blocks separately does not normally resemble the performance of both blocks if executed subsequently. The actual performance might be better or worse (due to "profile pollution")! Best example for such "volume contractions": repeated operations. The more the JIT has to chew on, the more the compiler can usually optimize.
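The "profile pollution" effect can be sketched with a shared call site (a hypothetical example; class and method names are mine). Measured in isolation, each list type would see a monomorphic call site; executed subsequently, they share a profile:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ProfilePollution {

    // The list.size() call site inside this method is monomorphic as
    // long as it only ever sees one receiver class, bimorphic once a
    // second class shows up.
    static int size(List<String> list) {
        return list.size();
    }

    public static void main(String[] args) {
        // Warming up with one type trains the call site's profile...
        for (int i = 0; i < 100_000; i++) {
            size(new ArrayList<>());
        }
        // ...so a later measurement that feeds in another type also
        // pays for the deoptimization/reprofiling triggered here.
        System.out.println(size(new LinkedList<>())); // 0
    }
}
```

This is why running blocks back to back can be faster or slower than the sum of their isolated measurements.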

25. HARNESS GOOD! SELF-MADE BAD! The Hulk performance rule #4

26. On-stack replacement

    public static void main(String[] args) {
        int size = 500_000;
        long start = System.nanoTime();
        int sum = 0;
        for (int value : randomValues(size)) {
            sum += value;
        }
        long end = System.nanoTime();
        System.out.println("Took " + (end - start) + " ns");
    }

On-stack replacement allows the compilation of methods that are already running. If you need it, you did something wrong. (It mainly tackles awkward benchmarks.)

27. ON-STACK REPLACEMENT? OVERRATED! The Hulk performance rule #5

However: If the VM must deoptimize a running method, this also implies an on-stack replacement of the running, compiled method. Normally, such deoptimization is however not referred to as on-stack replacement.

28. Algorithmic complexity

    void sort(int[] values) {
        while (!isSorted(values)) {
            shuffle(values);
        }
    }

Data structures are a sort of algorithm!

- ArrayList: grows by array copying; random access of elements.
- LinkedList: grows by adding an element; chained access of elements.

The JVM does not reason about algorithms. That is your job!
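The ArrayList/LinkedList contrast can be made concrete with indexed access (a sketch; names are mine): `get(i)` is O(1) on ArrayList but O(n) on LinkedList, so the same loop is O(n) for one and O(n²) for the other — a difference no JIT optimization will recover for you.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListComplexity {

    // get(i) is array indexing on ArrayList, but walks the chain from
    // the nearest end on LinkedList. Same result, different complexity.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < 1_000; i++) {
            arrayList.add(i);
            linkedList.add(i);
        }
        // Identical sums, very different scaling behavior as n grows.
        System.out.println(sumByIndex(arrayList) == sumByIndex(linkedList)); // true
    }
}
```

For the LinkedList, iterating with for-each (which uses its iterator) restores O(n) behavior; choosing the traversal to match the data structure is exactly the kind of reasoning the JVM leaves to you.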

29. THINK GOOD! GUESS BAD! The Hulk performance rule #6