rednaxelafx · December 30, 2010 05:21
 Java source code:
 k = i + j;

 May compile to Java bytecode:
 iload_0 
 iload_1 
 iadd 
 istore_2

 And may turn into Dalvik VM code:
 add-int v2, v1, v0

 Compare the HotSpot Client VM's interpreter in JDK 6u18 with Dalvik's interpreter in Android 2.0, both on x86.
 Unrolling each interpreter's fetch-dispatch-execute loop for the program above gives the following code
 traces:

 HotSpot's interpreter (client mode default config):
 ;;-------------iload_0-------------
 mov   eax, dword ptr [edi]
 movzx ebx, byte ptr [esi + 1]
 inc   esi
 jmp   dword ptr [ebx*4 + 6DB188C8]
 ;;-------------iload_1-------------
 push  eax
 mov   eax, dword ptr [edi - 4]
 movzx ebx, byte ptr [esi + 1]
 inc   esi
 jmp   dword ptr [ebx*4 + 6DB188C8]
 ;;--------------iadd---------------
 pop   edx
 add   eax, edx
 movzx ebx, byte ptr [esi + 1]
 inc   esi
 jmp   dword ptr [ebx*4 + 6DB188C8]
 ;;------------istore_2-------------
 mov   dword ptr [edi - 8], eax
 movzx ebx, byte ptr [esi + 1]
 inc   esi
 jmp   dword ptr [ebx*4 + 6DB19CC8]

 Dalvik's interpreter:
 ;;------------add-int--------------
 movzx eax, byte ptr [edx + 2]
 movzx ecx, byte ptr [edx + 3]
 mov   eax, dword ptr [esi + eax*4]
 add   eax, dword ptr [esi + ecx*4]
 movzx ecx, bh
 movzx ebx, word ptr [edx + 4]
 lea   edx, dword ptr [edx + 4]
 mov   dword ptr [esi + ecx*4], eax
 movzx eax, bl                       ; GOTO_NEXT "computed next" version
 sal   eax, $$$handler_size_bits
 add   eax, edi
 jmp   eax


 If we strip off the fetch/dispatch part from the two code traces above, we'll get:

 HotSpot:
 ;;-------------iload_0-------------
 mov   eax, dword ptr [edi]
 ;;-------------iload_1-------------
 push  eax
 mov   eax, dword ptr [edi - 4]
 ;;--------------iadd---------------
 pop   edx
 add   eax, edx
 ;;------------istore_2-------------
 mov   dword ptr [edi - 8], eax

 Dalvik:
 ;;------------add-int--------------
 movzx eax, byte ptr [edx + 2]
 movzx ecx, byte ptr [edx + 3]
 mov   eax, dword ptr [esi + 4*eax]
 add   eax, dword ptr [esi + 4*ecx]
 movzx ecx, bh
 mov   dword ptr [esi + 4*ecx], eax

 Now we can see that in this example, counting only the instructions that actually carry out the user code's
 original semantics, HotSpot's and Dalvik's interpreters each use 6 x86 instructions.

 This means HotSpot doesn't lose performance in the "execution" part just because the JVM spec defines a
 stack-based instruction set. By caching the top-of-stack value in a register (1-top-of-stack caching, eax in
 the trace above), HotSpot can still make efficient use of machine registers during interpretation, even
 though it is emulating a stack-based abstract machine.
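
 The idea of 1-top-of-stack caching can be sketched in Java. This is only an illustration, not HotSpot code:
 the class name, the opcode numbering, and the operand-carrying ILOAD/ISTORE forms are made up for the example,
 and the cache state is tracked with a run-time flag, whereas the traces above suggest HotSpot selects it by
 entering each handler at a different point (note the leading "push eax" in iload_1). The local variable tos
 plays the role of eax.

```java
// Hypothetical sketch, not HotSpot code: a tiny stack interpreter that keeps the
// top-of-stack (TOS) value in a local variable, standing in for eax.
final class TosCachingSketch {
    // Made-up opcodes; ILOAD and ISTORE take a local-variable index as operand.
    static final int ILOAD = 0, IADD = 1, ISTORE = 2;

    /** Runs code against locals and returns how often the in-memory operand stack was touched. */
    static int run(int[] code, int[] locals) {
        int[] stack = new int[8];   // in-memory operand stack
        int sp = 0;                 // next free slot in stack
        int tos = 0;                // cached top-of-stack value ("eax")
        boolean tosValid = false;   // true while tos holds the current top of stack
        int stackAccesses = 0;

        for (int pc = 0; pc < code.length; ) {
            switch (code[pc++]) {
                case ILOAD:
                    if (tosValid) {             // spill the old TOS, like "push eax"
                        stack[sp++] = tos;
                        stackAccesses++;
                    }
                    tos = locals[code[pc++]];
                    tosValid = true;
                    break;
                case IADD:
                    if (!tosValid) {            // refill the cache from memory
                        tos = stack[--sp];
                        stackAccesses++;
                    }
                    tos += stack[--sp];         // like "pop edx; add eax, edx"
                    stackAccesses++;
                    tosValid = true;
                    break;
                case ISTORE:
                    if (!tosValid) {            // refill the cache from memory
                        tos = stack[--sp];
                        stackAccesses++;
                    }
                    locals[code[pc++]] = tos;   // consumes the cached value
                    tosValid = false;
                    break;
                default:
                    throw new IllegalArgumentException("unknown opcode");
            }
        }
        return stackAccesses;
    }

    public static void main(String[] args) {
        int[] locals = {3, 4, 0};   // i, j, k
        int accesses = run(new int[] {ILOAD, 0, ILOAD, 1, IADD, ISTORE, 2}, locals);
        System.out.println("k = " + locals[2] + ", operand stack accesses = " + accesses);
    }
}
```

 For k = i + j the sketch touches the in-memory operand stack twice (one spill, one pop), matching the single
 push/pop pair in the HotSpot trace. The same loop without the cached tos would need six accesses: two pushes
 for the loads, two pops and one push for the add, and one pop for the store.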

 On the other hand, Dalvik's interpreter (on x86) keeps all of its "virtual registers" in the stack frame,
 which lives in memory and is therefore slower to access than HotSpot's register-cached TOS (top-of-stack)
 value. Of course, Dalvik's interpreter could be tuned further to squeeze out even more performance, but with
 so few registers available on x86, that's going to be pretty hard. It would be easier with more free
 registers, as on x86-64 or some RISC processors.

 But because the JVM needs more bytecode instructions than Dalvik to do the same work (four versus one in
 this example), HotSpot's interpreter pays more interpretation overhead in the "fetch-dispatch" part than
 Dalvik's does.
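
 The difference in dispatch count can be made concrete with a small Java sketch. It is a toy model, not
 either VM's real code: the class name, the opcode numbers, and the int[] encodings are assumptions made for
 the example (real iload_0 and friends carry no operand, and real Dalvik code units are 16 bits wide). Each
 trip around the loop stands for one fetch-dispatch sequence.

```java
// Hypothetical sketch: count how many times each interpreter loop has to
// fetch and dispatch an instruction to compute k = i + j.
final class DispatchCountSketch {
    static final int ILOAD = 0, IADD = 1, ISTORE = 2;   // made-up stack-machine opcodes
    static final int ADD_INT = 0;                       // made-up register-machine opcode

    /** Stack-based loop; returns the number of dispatches performed. */
    static int runStack(int[] code, int[] locals) {
        int[] stack = new int[8];
        int sp = 0, dispatches = 0;
        for (int pc = 0; pc < code.length; ) {
            dispatches++;                               // one fetch-dispatch per bytecode
            switch (code[pc++]) {
                case ILOAD:  stack[sp++] = locals[code[pc++]]; break;
                case IADD:   { int rhs = stack[--sp]; stack[sp - 1] += rhs; break; }
                case ISTORE: locals[code[pc++]] = stack[--sp]; break;
                default: throw new IllegalArgumentException("unknown opcode");
            }
        }
        return dispatches;
    }

    /** Register-based loop; returns the number of dispatches performed. */
    static int runRegister(int[] code, int[] regs) {
        int dispatches = 0;
        for (int pc = 0; pc < code.length; ) {
            dispatches++;                               // one fetch-dispatch per instruction
            switch (code[pc++]) {
                case ADD_INT: {                         // add-int vDst, vA, vB
                    int dst = code[pc++], a = code[pc++], b = code[pc++];
                    regs[dst] = regs[a] + regs[b];
                    break;
                }
                default: throw new IllegalArgumentException("unknown opcode");
            }
        }
        return dispatches;
    }

    public static void main(String[] args) {
        int[] locals = {3, 4, 0};                       // i, j, k
        int[] regs = {3, 4, 0};                         // v0, v1, v2
        int s = runStack(new int[] {ILOAD, 0, ILOAD, 1, IADD, ISTORE, 2}, locals);
        int r = runRegister(new int[] {ADD_INT, 2, 1, 0}, regs);
        System.out.println("stack: " + s + " dispatches, register: " + r + " dispatch");
    }
}
```

 Both loops leave the same sum in slot 2, but the stack version goes through the dispatch step four times
 and the register version once, which is the overhead difference described above.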

 ------------------------------------------------------------------------------------------------

 It's interesting to look at Sun JDK 1.1.8's interpreter as well. Running the example shown above, and again
 counting just the "execution" part, we'd get:
 ;;-------------iload_0-------------
 mov   ebx, dword ptr [ebp]
 ;;-------------iload_1-------------
 mov   ecx, dword ptr [ebp + 4]
 ;;--------------iadd---------------
 add   ebx, ecx
 ;;------------istore_2-------------
 mov   dword ptr [ebp + 8], ebx

 That's 2 memory reads and 1 memory write, exactly what you'd get if the example were written in C and
 compiled without optimization, which is not bad for an interpreter. This is also the effect of multi-state
 top-of-stack caching.