Go defer opencoded

2021-10-29

Introduction

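In Go 1.13, every time a defer statement is executed, the runtime creates a _defer record (a struct holding the deferred function's address, its arguments, the argument size, and so on) and inserts it at the head of a defer linked list that hangs off the current g.

[Figure: the defer linked list attached to the current g]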

The full definition of the _defer struct (as found in the Go 1.17 runtime used for the experiment below):

// A _defer holds an entry on the list of deferred calls.
// If you add a field here, add code to clear it in freedefer and deferProcStack
// This struct must match the code in cmd/compile/internal/reflectdata/reflect.go:deferstruct
// and cmd/compile/internal/gc/ssa.go:(*state).call.
// Some defers will be allocated on the stack and some on the heap.
// All defers are logically part of the stack, so write barriers to
// initialize them are not required. All defers must be manually scanned,
// and for heap defers, marked.
type _defer struct {
	siz     int32 // includes both arguments and results
	started bool
	heap    bool
	// openDefer indicates that this _defer is for a frame with open-coded
	// defers. We have only one defer record for the entire frame (which may
	// currently have 0, 1, or more defers active).
	openDefer bool
	sp        uintptr  // sp at time of defer
	pc        uintptr  // pc at time of defer
	fn        *funcval // can be nil for open-coded defers
	_panic    *_panic  // panic that is running defer
	link      *_defer

	// If openDefer is true, the fields below record values about the stack
	// frame and associated function that has the open-coded defer(s). sp
	// above will be the sp for the frame, and pc will be address of the
	// deferreturn call in the function.
	fd   unsafe.Pointer // funcdata for the function associated with the frame
	varp uintptr        // value of varp for the stack frame
	// framepc is the current pc associated with the stack frame. Together,
	// with sp above (which is the sp associated with the stack frame),
	// framepc/sp can be used as pc/sp pair to continue a stack trace via
	// gentraceback().
	framepc uintptr
}

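At each return point the compiler inserts a call to runtime.deferreturn, which walks the list from the head and runs the deferred function recorded in each _defer (because it starts at the head, the last defer statement runs first). Reconstructing the call from the record means the runtime first has to prepare the context the deferred function needs (arguments, call stack, and so on), and that costs time: about 35 ns per defer in Go 1.13, down from about 50 ns in Go 1.12 because 1.13 can keep the _defer record on the stack. If the deferred calls were instead expanded inline at compile time, the cost would be only about 6 ns. To keep defer from being a standing performance complaint, Go 1.14 therefore introduced the open-coded optimization.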
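The list behaviour described above can be modelled in a few lines of ordinary Go. This is only a toy sketch: deferRecord and goroutine are hypothetical stand-ins for runtime._defer and runtime.g, keeping just the fn and link fields, and none of the real runtime bookkeeping (argument copying, sp/pc checks, panics) is represented.

```go
package main

import "fmt"

// deferRecord is a hypothetical, stripped-down stand-in for runtime._defer.
type deferRecord struct {
	fn   func()
	link *deferRecord
}

// goroutine is a hypothetical stand-in for runtime.g; it owns the list head.
type goroutine struct {
	deferHead *deferRecord
}

// pushDefer models a defer statement: the new record becomes the list head.
func (g *goroutine) pushDefer(fn func()) {
	g.deferHead = &deferRecord{fn: fn, link: g.deferHead}
}

// runDefers models what happens at return: records are taken from the head,
// unlinked, and then run, so the most recent defer runs first.
func (g *goroutine) runDefers() {
	for d := g.deferHead; d != nil; d = g.deferHead {
		g.deferHead = d.link
		d.fn()
	}
}

// runOrder registers three deferred functions and reports the run order.
func runOrder() []string {
	var order []string
	g := &goroutine{}
	for _, name := range []string{"first", "second", "third"} {
		name := name // capture the current value
		g.pushDefer(func() { order = append(order, name) })
	}
	g.runDefers()
	return order
}

func main() {
	fmt.Println(runOrder()) // prints [third second first]
}
```

Inserting at the head and consuming from the head is what gives defer its last-in, first-out order without any extra sorting.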

The open-coded optimization

The design is described in the open-coded defers proposal; here we mainly look at how the compiler uses it to speed up defer.

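If a defer statement is inside a loop, it cannot be optimized.
If a defer statement is inside a conditional, a defer bit is needed to record whether it was reached (unless the condition can be evaluated at compile time, in which case the if statement itself is optimized away).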
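To make the two rules concrete, here is a small sketch with one function of each shape. The names trace, note, conditional and inLoop are made up for illustration. Both functions behave identically whichever strategy the compiler picks; only the generated code differs. As far as I know, building with go build -gcflags=-d=defer makes the compiler report which kind of defer it chose for each statement, but treat that flag as an assumption to check against your toolchain.

```go
package main

import "fmt"

// trace records the order in which deferred calls run.
var trace []string

func note(s string) { trace = append(trace, s) }

// conditional has one unconditional defer and one defer under an if.
// The number of defer statements is fixed, so each can be tracked by
// one defer bit; this is the shape open-coding is designed for.
func conditional(cond bool) {
	defer note("always")
	if cond {
		defer note("maybe")
	}
}

// inLoop defers inside a loop. How many deferred calls exist is only
// known at run time, so a fixed set of bits cannot describe them and
// the function is expected to use defer records instead.
func inLoop(n int) {
	for i := 0; i < n; i++ {
		defer note(fmt.Sprintf("iter %d", i))
	}
}

func main() {
	conditional(true)
	inLoop(3)
	fmt.Println(trace)
}
```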

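The flag-bit logic for conditionals is as follows (source above the separator, what the compiler conceptually generates below it):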

defer f1(a)
if cond {
 defer f2(b)
}
body...
==========================================================================================================
deferBits |= 1<<0
tmpF1 = f1
tmpA = a
if cond {
 deferBits |= 1<<1
 tmpF2 = f2
 tmpB = b
}
body...
exit:
if deferBits & (1<<1) != 0 {
 deferBits &^= 1<<1
 tmpF2(tmpB)
}
if deferBits & (1<<0) != 0 {
 deferBits &^= 1<<0
 tmpF1(tmpA)
}

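In short: when a defer statement executes, its flag bit is set to 1 and the function pointer and arguments are saved. Before the function exits, the flag bits are tested in reverse order; a bit that is 1 means the corresponding deferred function must run, but the bit is cleared to 0 first and only then is the function called.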
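The lowering can also be written out as runnable Go and compared with the defer version. This is a hand-written sketch of the idea, not compiler output: lowered, withDefer and record are illustrative names, and the real compiler keeps deferBits and the saved values in stack slots rather than ordinary local variables.

```go
package main

import "fmt"

// lowered is a hand-expanded version of withDefer using a bitmask.
func lowered(cond bool, f1, f2 func(int), a, b int) {
	var deferBits uint8
	var tmpF1, tmpF2 func(int)
	var tmpA, tmpB int

	// defer f1(a): set bit 0, save function and argument.
	deferBits |= 1 << 0
	tmpF1, tmpA = f1, a
	if cond {
		// defer f2(b): set bit 1, save function and argument.
		deferBits |= 1 << 1
		tmpF2, tmpB = f2, b
	}

	// body...

	// exit: test the bits in reverse order, clearing each before the call.
	if deferBits&(1<<1) != 0 {
		deferBits &^= 1 << 1
		tmpF2(tmpB)
	}
	if deferBits&(1<<0) != 0 {
		deferBits &^= 1 << 0
		tmpF1(tmpA)
	}
}

// withDefer is the same function written with ordinary defer statements.
func withDefer(cond bool, f1, f2 func(int), a, b int) {
	defer f1(a)
	if cond {
		defer f2(b)
	}
}

// record runs fn and returns the calls it made, in order.
func record(fn func(bool, func(int), func(int), int, int), cond bool) []string {
	var out []string
	f1 := func(x int) { out = append(out, fmt.Sprintf("f1(%d)", x)) }
	f2 := func(x int) { out = append(out, fmt.Sprintf("f2(%d)", x)) }
	fn(cond, f1, f2, 1, 2)
	return out
}

func main() {
	for _, cond := range []bool{true, false} {
		fmt.Println(cond, record(lowered, cond), record(withDefer, cond))
	}
}
```

For both values of cond the two versions make the same calls in the same order, which is the property the optimization has to preserve.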

Verification

package main

import "math/rand"

func main() {
	if 500 > rand.Int() {
		defer named()
	}
	if 600 > rand.Int() {
		defer named()
	}
	if 700 > rand.Int() {
		defer named()
	}
}

func named() (result int) {
	result = 1
	return
}

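The test program places three deferred calls inside three conditionals.
First, let's see what happens without optimization.
Build with go build -gcflags="-N -l" main.go (-N disables optimizations, -l disables inlining), then inspect the assembly with gdb: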

Dump of assembler code for function main.main:
   0x00000000004633e0 <+0>:     lea    -0x88(%rsp),%r12
   0x00000000004633e8 <+8>:     cmp    0x10(%r14),%r12
   0x00000000004633ec <+12>:    jbe    0x46351f <main.main+319>
   0x00000000004633f2 <+18>:    sub    $0x108,%rsp
   0x00000000004633f9 <+25>:    mov    %rbp,0x100(%rsp)
   0x0000000000463401 <+33>:    lea    0x100(%rsp),%rbp
   0x0000000000463409 <+41>:    call   0x462e60 <math/rand.Int>
   0x000000000046340e <+46>:    mov    %rax,0x8(%rsp)
   0x0000000000463413 <+51>:    cmp    $0x1f4,%rax
   0x0000000000463419 <+57>:    jl     0x46341d <main.main+61>
   0x000000000046341b <+59>:    jmp    0x463463 <main.main+131>
   0x000000000046341d <+61>:    movl   $0x0,0xb0(%rsp)
   0x0000000000463428 <+72>:    lea    0x19c39(%rip),%rcx        # 0x47d068
   0x000000000046342f <+79>:    mov    %rcx,0xc8(%rsp)
   0x0000000000463437 <+87>:    lea    0xb0(%rsp),%rax
   0x000000000046343f <+95>:    nop
   0x0000000000463440 <+96>:    call   0x42b380 <runtime.deferprocStack>
   0x0000000000463445 <+101>:   test   %eax,%eax
   0x0000000000463447 <+103>:   jne    0x46344d <main.main+109>
   0x0000000000463449 <+105>:   jmp    0x46344b <main.main+107>
   0x000000000046344b <+107>:   jmp    0x463465 <main.main+133>
   0x000000000046344d <+109>:   nop
   0x000000000046344e <+110>:   call   0x42bfe0 <runtime.deferreturn>
   0x0000000000463453 <+115>:   mov    0x100(%rsp),%rbp
   0x000000000046345b <+123>:   add    $0x108,%rsp
   0x0000000000463462 <+130>:   ret    
   0x0000000000463463 <+131>:   jmp    0x463465 <main.main+133>
   0x0000000000463465 <+133>:   call   0x462e60 <math/rand.Int>
   0x000000000046346a <+138>:   mov    %rax,0x8(%rsp)
   0x000000000046346f <+143>:   cmp    $0x258,%rax
   0x0000000000463475 <+149>:   jl     0x463479 <main.main+153>
   0x0000000000463477 <+151>:   jmp    0x4634b5 <main.main+213>
   0x0000000000463479 <+153>:   movl   $0x0,0x60(%rsp)
   0x0000000000463481 <+161>:   lea    0x19be8(%rip),%rcx        # 0x47d070
   0x0000000000463488 <+168>:   mov    %rcx,0x78(%rsp)
   0x000000000046348d <+173>:   lea    0x60(%rsp),%rax
   0x0000000000463492 <+178>:   call   0x42b380 <runtime.deferprocStack>
   0x0000000000463497 <+183>:   test   %eax,%eax
   0x0000000000463499 <+185>:   jne    0x46349f <main.main+191>
   0x000000000046349b <+187>:   jmp    0x46349d <main.main+189>
   0x000000000046349d <+189>:   jmp    0x4634b7 <main.main+215>
   0x000000000046349f <+191>:   nop
   0x00000000004634a0 <+192>:   call   0x42bfe0 <runtime.deferreturn>
   0x00000000004634a5 <+197>:   mov    0x100(%rsp),%rbp
   0x00000000004634ad <+205>:   add    $0x108,%rsp
   0x00000000004634b4 <+212>:   ret    
   0x00000000004634b5 <+213>:   jmp    0x4634b7 <main.main+215>
   0x00000000004634b7 <+215>:   call   0x462e60 <math/rand.Int>
   0x00000000004634bc <+220>:   mov    %rax,0x8(%rsp)
   0x00000000004634c1 <+225>:   cmp    $0x2bc,%rax
   0x00000000004634c7 <+231>:   jl     0x4634cb <main.main+235>
   0x00000000004634c9 <+233>:   jmp    0x463507 <main.main+295>
   0x00000000004634cb <+235>:   movl   $0x0,0x10(%rsp)
   0x00000000004634d3 <+243>:   lea    0x19b9e(%rip),%rcx        # 0x47d078
   0x00000000004634da <+250>:   mov    %rcx,0x28(%rsp)
   0x00000000004634df <+255>:   lea    0x10(%rsp),%rax
   0x00000000004634e4 <+260>:   call   0x42b380 <runtime.deferprocStack>
   0x00000000004634e9 <+265>:   test   %eax,%eax
   0x00000000004634eb <+267>:   jne    0x4634f1 <main.main+273>
   0x00000000004634ed <+269>:   jmp    0x4634ef <main.main+271>
   0x00000000004634ef <+271>:   jmp    0x463509 <main.main+297>
   0x00000000004634f1 <+273>:   nop
   0x00000000004634f2 <+274>:   call   0x42bfe0 <runtime.deferreturn>
   0x00000000004634f7 <+279>:   mov    0x100(%rsp),%rbp
   0x00000000004634ff <+287>:   add    $0x108,%rsp
   0x0000000000463506 <+294>:   ret    
   0x0000000000463507 <+295>:   jmp    0x463509 <main.main+297>
   0x0000000000463509 <+297>:   nop
   0x000000000046350a <+298>:   call   0x42bfe0 <runtime.deferreturn>
   0x000000000046350f <+303>:   mov    0x100(%rsp),%rbp
   0x0000000000463517 <+311>:   add    $0x108,%rsp
   0x000000000046351e <+318>:   ret    
   0x000000000046351f <+319>:   nop
   0x0000000000463520 <+320>:   call   0x4554c0 <runtime.morestack_noctxt>
   0x0000000000463525 <+325>:   jmp    0x4633e0 <main.main>
End of assembler dump.

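Here the calls at +96, +178 and +260 all go to runtime.deferprocStack, which finishes initializing the stack-allocated _defer record and links it onto the defer list.
Now let's look at the optimized code. Build with go build main.go (without the -gcflags="-N -l" argument); the optimized assembly is: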

Dump of assembler code for function main.main:
   # Stack check: if there is not enough room, jump to the stack-growth path
   ==============================================================================================
   0x0000000000463300 <+0>:     cmp    0x10(%r14),%rsp
   0x0000000000463304 <+4>:     jbe    0x46340f <main.main+271>
   
   # main's stack frame: 0x30 (48) bytes, i.e. 40 bytes of locals plus the saved rbp
   ==============================================================================================
   0x000000000046330a <+10>:    sub    $0x30,%rsp                # adjust the stack pointer
   0x000000000046330e <+14>:    mov    %rbp,0x28(%rsp)           # save the caller's frame pointer
   0x0000000000463313 <+19>:    lea    0x28(%rsp),%rbp           # set up the new frame pointer
   
   ==============================================================================================
   0x0000000000463318 <+24>:    movups %xmm15,0x10(%rsp)
   0x000000000046331e <+30>:    movups %xmm15,0x18(%rsp)
   
   ==============================================================================================
   0x0000000000463324 <+36>:    movb   $0x0,0x7(%rsp)            # zero the byte at 0x7(%rsp) (the 8th byte); this byte holds the defer bits
   
   # First if: if 500 > rand.Int(). If the random number is >= 500, jump straight to +76
   ==============================================================================================
   0x0000000000463329 <+41>:    call   0x462d80 <math/rand.Int>
   0x000000000046332e <+46>:    mov    %rax,0x8(%rsp)
   0x0000000000463333 <+51>:    cmp    $0x1f4,%rax
   0x0000000000463339 <+57>:    jge    0x46334c <main.main+76>
   
   ==============================================================================================
   0x000000000046333b <+59>:    lea    0x19d16(%rip),%rcx        # 0x47d058
   0x0000000000463342 <+66>:    mov    %rcx,0x20(%rsp)
   
   # Set the first defer bit (0x1)
   ==============================================================================================
   0x0000000000463347 <+71>:    movb   $0x1,0x7(%rsp)            # 0x7(%rsp) is now 00000001
   
   # Second if: if 600 > rand.Int()
   ==============================================================================================
   0x000000000046334c <+76>:    call   0x462d80 <math/rand.Int>
   0x0000000000463351 <+81>:    mov    0x8(%rsp),%rcx            # note: 0x8(%rsp) still holds the first rand result (stored at +46)
   0x0000000000463356 <+86>:    cmp    $0x1f4,%rcx               # compared against 500 a second time; is there room for further optimization here?
   0x000000000046335d <+93>:    setl   %cl                       # cl = 1 if rcx is less than 500
   0x0000000000463360 <+96>:    cmp    $0x258,%rax               # compare the second rand result with 600
   0x0000000000463366 <+102>:   jge    0x46337b <main.main+123>  # if it is >= 600, jump straight to +123
   
   ==============================================================================================
   0x0000000000463368 <+104>:   lea    0x19cf1(%rip),%rax        # 0x47d060
   0x000000000046336f <+111>:   mov    %rax,0x18(%rsp)
   
   # Set the second defer bit (0x2)
   ==============================================================================================
   0x0000000000463374 <+116>:   or     $0x2,%ecx                 # ecx = 00000010 | ecx
   0x0000000000463377 <+119>:   mov    %cl,0x7(%rsp)             # store the defer bits at 0x7(%rsp)
   0x000000000046337b <+123>:   mov    %cl,0x6(%rsp)             # store the defer bits at 0x6(%rsp)
   
   # Third if: if 700 > rand.Int()
   ==============================================================================================
   0x000000000046337f <+127>:   nop
   0x0000000000463380 <+128>:   call   0x462d80 <math/rand.Int>
   0x0000000000463385 <+133>:   cmp    $0x2bc,%rax
   0x000000000046338b <+139>:   jge    0x4633a7 <main.main+167>
   
   ==============================================================================================
   0x000000000046338d <+141>:   lea    0x19cd4(%rip),%rax        # 0x47d068
   0x0000000000463394 <+148>:   mov    %rax,0x10(%rsp)
   
   # Set the third defer bit (0x4)
   ==============================================================================================
   0x0000000000463399 <+153>:   movzbl 0x6(%rsp),%ecx            # load 8 bits (b) into 32 bits (l), zero-filling (z) the upper 24 bits
   0x000000000046339e <+158>:   or     $0x4,%ecx                 # ecx = 00000100 | ecx
   0x00000000004633a1 <+161>:   mov    %cl,0x7(%rsp)             # 0x7(%rsp) now holds the defer bits after all three ifs have been handled
   0x00000000004633a5 <+165>:   jmp    0x4633ac <main.main+172>
   ==============================================================================================
   0x00000000004633a7 <+167>:   movzbl 0x6(%rsp),%ecx            # load 8 bits (b) into 32 bits (l), zero-filling (z) the upper 24 bits
   
   # If the third bit is 0, skip the call to dwrap·3
   ==============================================================================================
   0x00000000004633ac <+172>:   test   $0x4,%cl                  
   0x00000000004633af <+175>:   je     0x4633ca <main.main+202>  
   
   # Call main.main·dwrap·3
   ==============================================================================================
   0x00000000004633b1 <+177>:   and    $0xfffffffb,%ecx          # ecx = 1011 & ecx: clear the third bit before the call
   0x00000000004633b4 <+180>:   mov    %cl,0x6(%rsp)             # save the defer bits at 0x6(%rsp)
   0x00000000004633b8 <+184>:   mov    %cl,0x7(%rsp)             # save the defer bits at 0x7(%rsp)
   0x00000000004633bc <+188>:   nopl   0x0(%rax)        
   0x00000000004633c0 <+192>:   call   0x463500 <main.main·dwrap·3>
   0x00000000004633c5 <+197>:   movzbl 0x6(%rsp),%ecx
   
   # If the second bit is 0, skip the call to dwrap·2
   ==============================================================================================
   0x00000000004633ca <+202>:   test   $0x2,%cl
   0x00000000004633cd <+205>:   je     0x4633e4 <main.main+228>
   
   # Call main.main·dwrap·2
   ==============================================================================================
   0x00000000004633cf <+207>:   and    $0xfffffffd,%ecx          # ecx = 1101 & ecx: clear the second bit before the call
   0x00000000004633d2 <+210>:   mov    %cl,0x6(%rsp)             # save the defer bits at 0x6(%rsp)
   0x00000000004633d6 <+214>:   mov    %cl,0x7(%rsp)             # save the defer bits at 0x7(%rsp)
   0x00000000004633da <+218>:   call   0x4634a0 <main.main·dwrap·2>
   0x00000000004633df <+223>:   movzbl 0x6(%rsp),%ecx
   
   # If the first bit is 0, skip the call to dwrap·1
   ==============================================================================================
   0x00000000004633e4 <+228>:   test   $0x1,%cl
   0x00000000004633e7 <+231>:   je     0x4633f5 <main.main+245>
   
   # Call main.main·dwrap·1
   ==============================================================================================
   0x00000000004633e9 <+233>:   and    $0xfffffffe,%ecx           # ecx = 1110 & ecx: clear the first bit before the call
   0x00000000004633ec <+236>:   mov    %cl,0x7(%rsp)              # save the defer bits at 0x7(%rsp)
   0x00000000004633f0 <+240>:   call   0x463440 <main.main·dwrap·1>
   
   # main returns
   ==============================================================================================
   0x00000000004633f5 <+245>:   mov    0x28(%rsp),%rbp            # restore the caller's frame pointer
   0x00000000004633fa <+250>:   add    $0x30,%rsp                 # release the stack frame
   0x00000000004633fe <+254>:   ret    
   
   ==============================================================================================
   0x00000000004633ff <+255>:   nop
   0x0000000000463400 <+256>:   call   0x42bf20 <runtime.deferreturn>
   
   ==============================================================================================
   0x0000000000463405 <+261>:   mov    0x28(%rsp),%rbp
   0x000000000046340a <+266>:   add    $0x30,%rsp
   0x000000000046340e <+270>:   ret
   
   ==============================================================================================
   0x000000000046340f <+271>:   call   0x455400 <runtime.morestack_noctxt>
   0x0000000000463414 <+276>:   jmp    0x463300 <main.main>
End of assembler dump.

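Let's walk through main. From the code above, the byte at 0x7(%rsp) is evidently the defer bits described earlier. There is no call to runtime.deferproc or runtime.deferprocStack, which shows that the deferred calls really were expanded at compile time. Each time a condition holds, the corresponding flag bit is set. Take the second condition as an example: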

0x0000000000463360 <+96>:    cmp    $0x258,%rax               # compare the second rand result with 600
0x0000000000463366 <+102>:   jge    0x46337b <main.main+123>  # if it is >= 600, jump straight to +123

If the random number is greater than or equal to 600, the code that sets the second bit (at +116) is skipped and that bit stays 0:

0x0000000000463374 <+116>:   or     $0x2,%ecx                 # ecx = 00000010 | ecx
0x0000000000463377 <+119>:   mov    %cl,0x7(%rsp)             # store the defer bits at 0x7(%rsp)
0x000000000046337b <+123>:   mov    %cl,0x6(%rsp)             # store the defer bits at 0x6(%rsp)

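Open question: why is a second slot at 0x6(%rsp) also needed to hold the defer bits here? (One plausible reading, not verified against the compiler source: 0x7(%rsp) is the dedicated defer-bits slot that the runtime can inspect if a panic occurs, while 0x6(%rsp) is an ordinary spill slot that keeps the same value alive across the calls.)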

Next, the code inserted before the function's ret tests each flag bit in reverse order, 0x4 -> 0x2 -> 0x1. Here is how the 0x2 bit is tested:

0x00000000004633ca <+202>:   test   $0x2,%cl
0x00000000004633cd <+205>:   je     0x4633e4 <main.main+228>

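At this point cl holds the defer bits. The test instruction computes the bitwise AND of its two operands and sets the flags accordingly: if the second bit of cl is 1, the result is nonzero and the ZF flag is 0, so je is not taken and execution falls through to the instructions that call the second deferred function: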

0x00000000004633cf <+207>:   and    $0xfffffffd,%ecx          # ecx = 1101 & ecx: clear the second bit before the call
0x00000000004633d2 <+210>:   mov    %cl,0x6(%rsp)             # save the defer bits at 0x6(%rsp)
0x00000000004633d6 <+214>:   mov    %cl,0x7(%rsp)             # save the defer bits at 0x7(%rsp)
0x00000000004633da <+218>:   call   0x4634a0 <main.main·dwrap·2>
0x00000000004633df <+223>:   movzbl 0x6(%rsp),%ecx

Before the actual call, the second bit is cleared to 0. Then main.main·dwrap·2 is called at +218. Expanding that function:

Dump of assembler code for function main.main·dwrap·2:
   # Stack check: if there is not enough room, jump to the stack-growth path
   ==============================================================================================
   0x0000000000463440 <+0>:     cmp    0x10(%r14),%rsp
   0x0000000000463444 <+4>:     jbe    0x46346e <main.main·dwrap·2+46>
   
   # main.main·dwrap·2 stack frame: no locals, only 8 bytes for the saved rbp
   ==============================================================================================
   0x0000000000463446 <+6>:     sub    $0x8,%rsp                   # adjust the stack pointer
   0x000000000046344a <+10>:    mov    %rbp,(%rsp)                 # save the caller's frame pointer
   0x000000000046344e <+14>:    lea    (%rsp),%rbp                 # set up the new frame pointer
   
   ==============================================================================================
   0x0000000000463452 <+18>:    mov    0x20(%r14),%r12
   0x0000000000463456 <+22>:    test   %r12,%r12                         
   0x0000000000463459 <+25>:    jne    0x463475 <main.main·dwrap·2+53>  # taken when r12 != 0 (r12 was loaded from 0x20(%r14), likely g._panic)
   
   ==============================================================================================
   0x000000000046345b <+27>:    nopl   0x0(%rax,%rax,1)
   0x0000000000463460 <+32>:    call   0x463420 <main.named>
   
   # Release the main.main·dwrap·2 stack frame
   ==============================================================================================
   0x0000000000463465 <+37>:    mov    (%rsp),%rbp
   0x0000000000463469 <+41>:    add    $0x8,%rsp
   0x000000000046346d <+45>:    ret    
   
   ==============================================================================================
   0x000000000046346e <+46>:    call   0x455400 <runtime.morestack_noctxt>
   0x0000000000463473 <+51>:    jmp    0x463440 <main.main·dwrap·2>
   
   ==============================================================================================
   0x0000000000463475 <+53>:    lea    0x10(%rsp),%r13
   0x000000000046347a <+58>:    nopw   0x0(%rax,%rax,1)
   0x0000000000463480 <+64>:    cmp    %r13,(%r12)    
   0x0000000000463484 <+68>:    jne    0x46345b <main.main·dwrap·2+27>
   0x0000000000463486 <+70>:    mov    %rsp,(%r12)
   0x000000000046348a <+74>:    jmp    0x46345b <main.main·dwrap·2+27>
End of assembler dump.

Whichever path is taken, the real main.named is eventually called (offsets within the wrapper):

25 -> 53 -> 68 -> 27 -> 32(main.named)
25 -> 53 -> 74 -> 27 -> 32(main.named)
25 -> 27 -> 32(main.named)

Finally, disassembling main.named shows that it does indeed return 1 (in the eax register):

Dump of assembler code for function main.named:
   0x0000000000463420 <+0>:     mov    $0x1,%eax
   0x0000000000463425 <+5>:     ret    
End of assembler dump.

Summary

How the flag bits change

Finally, here is how the defer bits change, assuming all three if conditions hold:
0000 -> rand.Int() -> 0001 -> rand.Int() -> 0011 -> rand.Int() -> 0111 -> named() -> 0011 -> named() -> 0001 -> named() -> 0000
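Each bit in the sequence above is cleared before its deferred function is called, not after. My understanding is that this ordering matters when a deferred call panics: the bit is already 0, so the panicking function is not run a second time while the remaining defers still run. The sketch below only demonstrates the language-level behaviour, which holds regardless of how the compiler implements defer; the function names are made up for illustration.

```go
package main

import "fmt"

// calls records the order in which the deferred functions actually run.
var calls []string

// cleanup is deferred first, so it runs last. It recovers the panic
// raised by failing.
func cleanup() {
	calls = append(calls, "cleanup")
	if r := recover(); r != nil {
		calls = append(calls, fmt.Sprintf("recovered: %v", r))
	}
}

// failing is deferred second, so it runs first, and it panics.
func failing() {
	calls = append(calls, "failing")
	panic("boom")
}

// run has two unconditional defers and no loop, the shape that is
// eligible for open-coding.
func run() {
	defer cleanup()
	defer failing()
}

func main() {
	run()
	fmt.Println(calls) // prints [failing cleanup recovered: boom]
}
```

failing appears exactly once in the output even though it panicked partway through the function's exit sequence.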

Avoiding the defer-list recursion

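As the experiment shows, the optimized code contains no call to runtime.deferproc or runtime.deferprocStack, and on a normal return it also bypasses runtime.deferreturn (the call at +256 sits after the ret at +254 and is not on the normal return path). Once execution enters the defer-list recursion (runtime.jmpdefer, a form of tail recursion), maintaining the context of each deferred function takes a great many instructions (creating and destroying _defer records, and so on). This was a major reason why defer in early Go versions was criticized as slow.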

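After open-coding, the cost is close to that of a direct function call (plus the upkeep of the flag bits). Note also that the defer bits occupy a single byte, i.e. 8 bits, so at most 8 defer statements per function are supported; with more than 8 the function falls back to the defer-list mode (and a defer inside a loop cannot be optimized at all).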
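A rough way to see the difference on your own machine is a micro-benchmark like the sketch below. It uses testing.Benchmark from the standard library so it can run as an ordinary program. The three function shapes are my own choice for illustration, and the claim that deferredInLoop falls back to defer records is an assumption based on the loop rule above. Absolute numbers depend on the Go version and hardware, so none are quoted here.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the work from being optimized away entirely.
var sink int

//go:noinline
func work() { sink++ }

// direct calls work without defer.
func direct() { work() }

// deferred uses a single unconditional defer, the open-coding candidate.
func deferred() { defer work() }

// deferredInLoop defers inside a one-iteration loop, which is assumed to
// force the defer-record path even though it runs only once.
func deferredInLoop() {
	for i := 0; i < 1; i++ {
		defer work()
	}
}

func main() {
	cases := []struct {
		name string
		fn   func()
	}{
		{"direct", direct},
		{"deferred", deferred},
		{"deferredInLoop", deferredInLoop},
	}
	for _, c := range cases {
		fn := c.fn
		r := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				fn()
			}
		})
		fmt.Printf("%-16s %s\n", c.name, r)
	}
}
```

All three variants go through the same indirect call in the benchmark loop, so that overhead is shared and the differences between rows are what is worth comparing.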

(Test environment: go1.17.2 linux/amd64)