Key Techniques for Go Development: Could Not Recover

Could Not Recover

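In C/C++, the most painful experience is discovering after release that a wild pointer or an out-of-bounds memory access crashes the program in a place where a crash seems impossible. The most exasperating is a log statement written long ago, for example a %s that formats an integer as a string, which crashes on the day it finally executes. The most frustrating is that, whatever the cause, a crash affects every user of the service.

Suppose instead that pointers and memory are managed by the runtime, so that user code cannot access or free them incorrectly. This costs some CPU, but it avoids these headaches in a fast-changing business. Exceptions in modern high-level languages such as Java, Python, and JS, and panic-recover in Go, are all mechanisms of this kind.

After all, trading some CPU for not crashing during rapid iteration is a good deal however you count it.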

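What Can Be Recovered

Go has Defer, Panic, and Recover. defer is generally used to release resources or to catch a panic. panic aborts the normal flow of execution: it runs all deferred functions of the current function, then returns to the caller and continues panicking there. Calling the panic function explicitly starts this process, and so do certain runtime errors. Finally, recover regains control during a panic and resumes normal execution.

Note that recover is only effective when called directly by a deferred function; it has no effect in a function that the deferred function calls. The following example does not recover: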

package main

import "fmt"

func main() {
    f := func() {
        if r := recover(); r != nil {
            fmt.Println(r)
        }
    }

    defer func() {
        f()
    } ()

    panic("ok")
}

The program still panics when run:

$ go run t.go
panic: ok

goroutine 1 [running]:
main.main()
    /Users/winlin/temp/t.go:16 +0x6b
exit status 2
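For contrast, here is a minimal sketch in which recover is called directly inside the deferred closure, so the panic is caught. The helper name safeCall is hypothetical and only serves the illustration; it turns a panic into an ordinary error through a named return value.

```go
package main

import "fmt"

// safeCall is a hypothetical helper: it runs fn and converts a panic
// into an error. recover is called directly by the deferred closure,
// which is the only place where it takes effect.
func safeCall(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	err := safeCall(func() { panic("ok") })
	fmt.Println(err) // prints "recovered: ok"
}
```

Because the deferred closure assigns to the named result err, the caller sees the panic as a normal error value and keeps running.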

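Some failures cannot be caught: the program exits on its own and no recover can intercept them. Ordinary panics, however, can be caught, for example a slice index out of range, a nil pointer dereference, an integer division by zero, or a send on a closed chan.

Below is a slice out-of-range example, which recover catches: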

package main

import (
  "fmt"
)

func main() {
  defer func() {
    if r := recover(); r != nil {
      fmt.Println(r)
    }
  }()

  b := []int{0, 1}
  fmt.Println("Hello, playground", b[2])
}

Below is a nil pointer dereference example, which recover catches:

package main

import (
  "bytes"
  "fmt"
)

func main() {
  defer func() {
    if r := recover(); r != nil {
      fmt.Println(r)
    }
  }()

  var b *bytes.Buffer
  fmt.Println("Hello, playground", b.Bytes())
}

Below is a division-by-zero example, which recover catches:

package main

import (
  "fmt"
)

func main() {
  defer func() {
    if r := recover(); r != nil {
      fmt.Println(r)
    }
  }()

  var v int
  fmt.Println("Hello, playground", 1/v)
}

Below is an example of sending on a closed chan, which recover catches:

package main

import (
  "fmt"
)

func main() {
  defer func() {
    if r := recover(); r != nil {
      fmt.Println(r)
    }
  }()

  c := make(chan bool)
  close(c)
  c <- true
}

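Recover Best Practices

After recover, code usually checks whether the recovered value is an error, since specific errors may need special handling, and usually also logs it or raises an alert. Here is an example of recover: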

package main

import (
    "fmt"
)

type Handler interface {
    Filter(err error, r interface{}) error
}

type Logger interface {
    Ef(format string, a ...interface{})
}

// Handle panic by hdr, which filter the error.
// Finally log err with logger.
func HandlePanic(hdr Handler, logger Logger) error {
    return handlePanic(recover(), hdr, logger)
}

type hdrFunc func(err error, r interface{}) error

func (v hdrFunc) Filter(err error, r interface{}) error {
    return v(err, r)
}

type loggerFunc func(format string, a ...interface{})

func (v loggerFunc) Ef(format string, a ...interface{}) {
    v(format, a...)
}

// Handle panic by hdr, which filter the error.
// Finally log err with logger.
func HandlePanicFunc(hdr func(err error, r interface{}) error,
    logger func(format string, a ...interface{}),
) error {
    var f Handler
    if hdr != nil {
        f = hdrFunc(hdr)
    }

    var l Logger
    if logger != nil {
        l = loggerFunc(logger)
    }

    return handlePanic(recover(), f, l)
}

func handlePanic(r interface{}, hdr Handler, logger Logger) error {
    if r != nil {
        err, ok := r.(error)
        if !ok {
            err = fmt.Errorf("r is %v", r)
        }

        if hdr != nil {
            err = hdr.Filter(err, r)
        }

        if err != nil && logger != nil {
            logger.Ef("panic err %+v", err)
        }

        return err
    }

    return nil
}

func main() {
    func() {
        defer HandlePanicFunc(nil, func(format string, a ...interface{}) {
            fmt.Println(fmt.Sprintf(format, a...))
        })

        panic("ok")
    }()

    logger := func(format string, a ...interface{}) {
        fmt.Println(fmt.Sprintf(format, a...))
    }
    func() {
        defer HandlePanicFunc(nil, logger)

        panic("ok")
    }()
}
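The filtering step above can also be sketched in a smaller form. The example below is an assumption-laden illustration, not part of the code above: errAbort, toError, and run are hypothetical names. It converts the recovered value to an error and deliberately ignores one sentinel error, in the spirit of a Handler whose Filter returns nil for errors that should not be reported.

```go
package main

import (
	"errors"
	"fmt"
)

// errAbort is a hypothetical sentinel meaning "stop quietly, do not report".
var errAbort = errors.New("abort")

// toError converts a recovered value into an error and filters out errAbort.
func toError(r interface{}) error {
	if r == nil {
		return nil
	}
	err, ok := r.(error)
	if !ok {
		err = fmt.Errorf("panic: %v", r)
	}
	if errors.Is(err, errAbort) {
		return nil // filtered: treated as a normal exit
	}
	return err
}

// run executes fn and reports any unfiltered panic as an error.
func run(fn func()) (err error) {
	// recover is called directly by the deferred closure.
	defer func() { err = toError(recover()) }()
	fn()
	return nil
}

func main() {
	fmt.Println(run(func() { panic(errAbort) })) // prints "<nil>"
	fmt.Println(run(func() { panic("boom") }))   // prints "panic: boom"
	// Runtime panics carry a value that implements error, so they pass through as-is.
	fmt.Println(run(func() { var m map[string]int; m["a"] = 1 }))
}
```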

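When a library needs to start a goroutine, how should it recover?

If a panic cannot happen, no recover is needed, for example a goroutine in tls.go: errChannel <- conn.Handshake()
If a panic can happen and it is reasonably clear how to recover, the library can invoke a user callback or let the user set a logger, for example the request-handling goroutine in http/server.go: if err := recover(); err != nil && err != ErrAbortHandler {
If the library has no idea how to handle the recover, for example a cache library where discarding data could cause problems, then the user should start the goroutine; the library returns the abnormal data and the error, and the user decides how to recover and how to retry.
If the library knows exactly how to recover, for example by ignoring the panic and carrying on, or by logging through a logger, then it handles the panic with the usual panic-recover logic.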

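What Cannot Be Recovered

Some failures cannot be caught, including (but not limited to):

Thread Limit: the program exceeded the limit on operating system threads; see the explanation below.
Concurrent Map Writers: a race condition in which goroutines write to a map at the same time; see the example below. The standard library's sync.Map is the recommended way to solve this.

Example code in which concurrent map writes crash the program (the runtime reports fatal error: concurrent map writes, and the deferred recover never sees it):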

package main

import (
    "fmt"
    "time"
)

func main() {
    m := map[string]int{}
    p := func() {
        defer func() {
            if r := recover(); r != nil {
                fmt.Println(r)
            }
        }()
        for {
            m["t"] = 0
        }
    }

    go p()
    go p()
    time.Sleep(1 * time.Second)
}

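Note: when the program is built with -race, other data races are detected and reported as well. The flag is meant for data race detection, not deadlock detection, and it carries a heavy CPU and memory cost, so use it with care.

Remark: errors raised through throw in the runtime generally cannot be recovered. A search of Go1.11 found 690 call sites of throw.

Go1.2 introduced ThreadLimit, a cap on the number of threads a program may use. Exceeding it aborts the program with a fatal error that cannot be recovered: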

fatal error: thread exhaustion

runtime stack:
runtime.throw(0x10b60fd, 0x11)
    /usr/local/Cellar/go/1.8.3/libexec/src/runtime/panic.go:596 +0x95
runtime.mstart()
    /usr/local/Cellar/go/1.8.3/libexec/src/runtime/proc.go:1132

The default is 10,000 operating system threads; the limit can be changed with SetMaxThreads in the runtime/debug package.

SetMaxThreads sets the maximum number of operating system threads that the Go program can use. If it attempts to use more than this many, the program crashes. SetMaxThreads returns the previous setting. The initial setting is 10,000 threads.

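In other words, this function caps the number of OS threads the program may use, and the program crashes if it goes over the cap. The return value is the previous setting, which starts at 10,000.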

The limit controls the number of operating system threads, not the number of goroutines. A Go program creates a new thread only when a goroutine is ready to run but all the existing threads are blocked in system calls, cgo calls, or are locked to other goroutines due to use of runtime.LockOSThread.

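Note that the limit applies to operating system threads, not to the number of goroutines. Starting a goroutine does not always start a new OS thread; a new thread is created only when a goroutine is ready to run but all existing threads are blocked in system calls or cgo calls, or are locked to other goroutines by an explicit call to runtime.LockOSThread.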

SetMaxThreads is useful mainly for limiting the damage done by programs that create an unbounded number of threads. The idea is to take down the program before it takes down the operating system.

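This is a last line of defense: it kills the misbehaving program before the program kills the system.

As a simple example, limit the program to 10 threads and use runtime.LockOSThread to bind each goroutine to an OS thread. The program exits before 10 goroutines have been created, because the runtime needs threads of its own. See the example below (Playground: ThreadLimit):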

package main

import (
  "fmt"
  "runtime"
  "runtime/debug"
  "sync"
  "time"
)

func main() {
  nv := 10
  ov := debug.SetMaxThreads(nv)
  fmt.Println(fmt.Sprintf("Change max threads %d=>%d", ov, nv))

  var wg sync.WaitGroup
  c := make(chan bool, 0)
  for i := 0; i < 10; i++ {
    fmt.Println(fmt.Sprintf("Start goroutine #%v", i))

    wg.Add(1)
    go func() {
      c <- true
      defer wg.Done()
      runtime.LockOSThread()
      time.Sleep(10 * time.Second)
      fmt.Println("Goroutine quit")
    }()

    <- c
    fmt.Println(fmt.Sprintf("Start goroutine #%v ok", i))
  }

  fmt.Println("Wait for all goroutines about 10s...")
  wg.Wait()

  fmt.Println("All goroutines done")
}

The output is:

Change max threads 10000=>10
Start goroutine #0
Start goroutine #0 ok
......
Start goroutine #6
Start goroutine #6 ok
Start goroutine #7
runtime: program exceeds 10-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw(0xffdef, 0x11)
    /usr/local/go/src/runtime/panic.go:616 +0x100
runtime.checkmcount()
    /usr/local/go/src/runtime/proc.go:542 +0x100
......
    /usr/local/go/src/runtime/proc.go:1830 +0x40
runtime.startm(0x1040e000, 0x1040e000)
    /usr/local/go/src/runtime/proc.go:2002 +0x180

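This run shows that with the limit set to 10 OS threads, the runtime itself occupied 3 of them, 7 were available to user-level goroutines, and the program crashed when it needed an 8th.

Note that the result differs between Go versions; with Go1.8, for example, the crash sometimes comes after starting only 4 or 5 goroutines.

Adding recover does not help either, as the example below shows. This mechanism is a last line of defense, a bottom line that cannot be crossed. One of our online services once triggered it because too many goroutines were blocked.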

package main

import (
  "fmt"
  "runtime"
  "runtime/debug"
  "sync"
  "time"
)

func main() {
  defer func() {
    if r := recover(); r != nil {
      fmt.Println("main recover is", r)
    }
  } ()

  nv := 10
  ov := debug.SetMaxThreads(nv)
  fmt.Println(fmt.Sprintf("Change max threads %d=>%d", ov, nv))

  var wg sync.WaitGroup
  c := make(chan bool, 0)
  for i := 0; i < 10; i++ {
    fmt.Println(fmt.Sprintf("Start goroutine #%v", i))

    wg.Add(1)
    go func() {
      c <- true

      defer func() {
        if r := recover(); r != nil {
          fmt.Println("goroutine recover is", r)
        }
      } ()

      defer wg.Done()
      runtime.LockOSThread()
      time.Sleep(10 * time.Second)
      fmt.Println("Goroutine quit")
    }()

    <- c
    fmt.Println(fmt.Sprintf("Start goroutine #%v ok", i))
  }

  fmt.Println("Wait for all goroutines about 10s...")
  wg.Wait()

  fmt.Println("All goroutines done")
}

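How can a program avoid being killed for exceeding the thread limit? Threads usually block in system calls, so when exactly does that happen? And what role does GOMAXPROCS play?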

The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.

GOMAXPROCS sets the maximum number of CPUs that can be executing simultaneously and returns the previous setting. If n < 1, it does not change the current setting. The number of logical CPUs on the local machine can be queried with NumCPU. This call will go away when the scheduler improves.

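So GOMAXPROCS only sets the number of threads that execute user-level code in parallel, that is, the threads actually running Go code. When threads block in system calls, more OS threads are started. The article Number of threads used by goroutine explains this parameter clearly: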

There is no direct correlation. Threads used by your app may be less than, equal to or more than 10.

So if your application does not start any new goroutines, threads count will be less than 10.

If your app starts many goroutines (>10) where none is blocking (e.g. in system calls), 10 operating system threads will execute your goroutines simultaneously.

If your app starts many goroutines where many (>10) are blocked in system calls, more than 10 OS threads will be spawned (but only at most 10 will be executing user-level Go code).

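With GOMAXPROCS set to 10: if fewer than 10 goroutines are started, there are fewer than 10 OS threads. If there are many goroutines but none blocks in system calls, only 10 threads execute in parallel. If there are many goroutines and more than 10 of them block in system calls at once, more than 10 OS threads are created, but at most 10 of them are actively executing user-level code.

So when does a goroutine block in a system call? The example Why does it not create many threads when many goroutines are blocked in writing explains it clearly: although GOMAXPROCS is set to 1, 12 threads are actually started, one OS thread per goroutine. Run the code below (Writing Large Block):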

package main

import (
  "io/ioutil"
  "os"
  "runtime"
  "strconv"
  "sync"
)

func main() {
  runtime.GOMAXPROCS(1)
  data := make([]byte, 128*1024*1024)

  var wg sync.WaitGroup
  for i := 0; i < 10; i++ {
    wg.Add(1)
    go func(n int) {
      defer wg.Done()
      for {
        ioutil.WriteFile("testxxx"+strconv.Itoa(n), []byte(data), os.ModePerm)
      }
    }(i)
  }

  wg.Wait()
}

The output is:

Mac chengli.ycl$ time go run t.go
real    1m44.679s
user    0m0.230s
sys 0m53.474s

Although GOMAXPROCS is set to 1, 12 OS threads were actually created.

Most of the time is spent in sys, that is, in system calls.

So I think the syscalls were exiting too quickly in your original test to show the effect you were expecting.

The explanation in Effective Go:

Goroutines are multiplexed onto multiple OS threads so if one should block, such as while waiting for I/O, others continue to run. Their design hides many of the complexities of thread creation and management.

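So if a program crashes because it exceeds the thread limit, use Linux tools at the point of the bottleneck to collect system call statistics and see which system calls caused too many threads to be created.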
