golang内存管理之内存分配

一、知识准备

GMP运行时调度模型

go原生支持并发，不需要像Java那样需要显示地开启一个线程，也不像Python那样需要程序员显示地切换协程，引发回调地狱。对于go来说，只需要 go func()的方式，就可以启动一个协程goroutine，依靠调度器来处理goroutine之间的切换，以及操作系统线程的调度。GMP就是运行时的调度模型，原先只有GM，但是后来发现因为需要获取全局锁导致效率很低，所以加了一层P

G：goroutine的缩写，保存着goroutine的上下文，使用go关键字即可创建一个gorountine

M：machine，对应操作系统的线程，每个M都需要绑定到P上

P：Processor，保存着M运行G所需要的资源，每个P有一个本地队列存放G任务，每个P还有自己的mcache，即本地线程缓存。

运行的过程如下图所示，图中有4个P, 其中M1~M3正在运行G并且运行后会从拥有的P的运行队列继续获取G。除了每个P都有本地队列之外，还有一个全局的队列，当本地队列的G超过256会放到全局队列等待执行

image.png

基于tcmalloc的内存分配

tcmalloc是Google推出的内存管理算法，由于减少了锁的获取和释放，因此比起其他的内存分配算法效率更高，适合高并发的场景。

image.png

tcmalloc的思想是对于较小的内存（<=32K），优先在Thread Cache分配，在Thread Cache的页中，根据申请内存的大小，又分成多块，这样申请2KB和3KB所占用的锁不同，减少了锁竞争。对于较大的页（>32K），会在Heap上用页级分配器进行分配

Small Object

image.png

对于每个较小的对象，会映射到size-class，每个size-class都有一个空闲链表，比如说961-1024的对象都映射到1024。

当分配对象时，先把请求的大小映射到size-class，然后看当前线程的thread cache，如果size-class对应的链表不为空，我们将链表的第一位移除并且返回。如果链表空，那就Central list获取size-class替换到thread local的list，返回最新获取的对象。

如果central list也是空，那就从central page allocator分配，把它分成相应size-class的objects，把这些objects替换到central free list，并且把这些objects移动到thread-local free list

2. large object

对于>32K的对象，根据需要的page，在heap相应的page上申请，比如说需要5page，就从第5page申请，如果第五pages是空的，就需要从第六页申请，如果申请不到，就需要从操作系统申请。

image.png

二、逃逸分析

package main
 
import ()
 
func foo() *int {
    var x int
    return &x
}
 
func bar() int {
    x := new(int)
    *x = 1
    return *x
}
 
func main() {}

对于foo中的x变量会在堆上分配，而对于在bar中的x变量会在栈上分配，这是因为经过编译器的分析，foo中的x会通过取地址从而逃逸出去。

在官网 (golang.org) FAQ

From a correctness standpoint, you don’t need to know. Each variable in Go exists as long as there are references to it. The storage location chosen by the implementation is irrelevant to the semantics of the language.

The storage location does have an effect on writing efficient programs. When possible, the Go compilers will allocate variables that are local to a function in that function’s stack frame. However, if the compiler cannot prove that the variable is not referenced after the function returns, then the compiler must allocate the variable on the garbage-collected heap to avoid dangling pointer errors. Also, if a local variable is very large, it might make more sense to store it on the heap rather than the stack.

In the current compilers, if a variable has its address taken, that variable is a candidate for allocation on the heap. However, a basic escape analysis recognizes some cases when such variables will not live past the return from the function and can reside on the stack.

三、数据结构

go的内存管理基于tcmalloc，总体上如下图所示。

mcache: per-P cache，可以认为是 local cache。
mcentral: 全局 cache，mcache 不够用的时候向 mcentral 申请。
mheap: 当 mcentral 也不够用的时候，通过 mheap 向操作系统申请

image.png

mcache

type mcache struct {
 
    nextSample uintptr // trigger heap sample after allocating this many bytes
 
    scanAlloc  uintptr // bytes of scannable heap allocated
 
    tiny       uintptr
 
    tinyoffset uintptr
 
    tinyAllocs uintptr

 
    // The rest is not accessed on every malloc.
 
 
    alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass
 
    stackcache [_NumStackOrders]stackfreelist
 
    flushGen uint32
 
}

alloc [_NumSizeClasses]*mspan是一个大小为 67 的指针（指针指向 mspan ）数组（_NumSizeClasses = 67），每个数组元素用来包含特定大小的块。当要分配内存大小时，为 object 在 alloc 数组中选择合适的元素来分配。67 种块大小为 0，8 byte, 16 byte, …

image.png

mspan
mcache分配内存的单位是mspan，mspan的数据结构如下：

type mspan struct {

 next *mspan     // next span in list, or nil if none

 prev *mspan     // previous span in list, or nil if none

 list *mSpanList // For debugging. TODO: Remove.

 startAddr uintptr // address of first byte of span aka s.base()

 npages    uintptr // number of pages in span


 freeindex uintptr

 nelems uintptr // number of object in the span.

 allocCache uint64

 allocBits  *gcBits

 allocCount  uint16        // number of allocated objects

 spanclass   spanClass     // size class and noscan (uint8)

 state       mSpanStateBox // mSpanInUse etc; accessed atomically (get/set methods)


}

next, prev: 指针域，因为 mspan 一般都是以链表形式使用。
npages: mspan 的大小为 page 大小的整数倍。
sizeclass: 0 ~ _NumSizeClasses 之间的一个值，这个解释了我们的疑问。比如，sizeclass = 3，那么这个 mspan 被分割成 32 byte 的块。
elemsize: 通过 sizeclass 或者 npages 可以计算出来。比如 sizeclass = 3, elemsize = 32 byte。对于大于 32Kb 的内存分配，都是分配整数页，elemsize = page_size * npages。
nelems: span 中包块的总数目。
freeindex: 0 ~ nelemes-1，表示分配到第几个块。

mcentral

image.png

上面这个图很清晰地表明了mcentral的结构。

type mcentral struct {
    lock      mutex
    sizeclass int32
    nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
    empty     mSpanList // list of spans with no free objects (or cached in an mcache)
}
 
type mSpanList struct {
    first *mspan
    last  *mspan
}

mheap

mheap是一个全局变量，在系统初始化时调用mallocinit初始化，因为多个线程都会竞争，所以需要lock加锁

type mheap struct {
 lock      mutex
 free      [_MaxMHeapList]mSpanList // free lists of given length
 freelarge mSpanList                // free lists length >= _MaxMHeapList
 busy      [_MaxMHeapList]mSpanList // busy lists of large objects of given length
 busylarge mSpanList                // busy lists of large objects length >= _MaxMHeapList
 sweepgen  uint32                   // sweep generation, see comment in mspan
 sweepdone uint32                   // all spans are swept

 spans []*mspan


 sweepSpans [2]gcSweepBuf

 _ uint32 // align uint64 fields on 32-bit for atomics

 largefree  uint64                  // bytes freed for large objects (>maxsmallsize)
 nlargefree uint64                  // number of frees for large objects (>maxsmallsize)
 nsmallfree [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)

 // range of addresses we might see in the heap
 bitmap         uintptr // Points to one byte past the end of the bitmap
 bitmap_mapped  uintptr
 arena_start    uintptr
 arena_used     uintptr // always mHeap_Map{Bits,Spans} before updating
 arena_end      uintptr
 arena_reserved bool

 central [_NumSizeClasses]struct {
     mcentral mcentral
     pad      [sys.CacheLineSize]byte
 }

 spanalloc             fixalloc // allocator for span*
 cachealloc            fixalloc // allocator for mcache*
 specialfinalizeralloc fixalloc // allocator for specialfinalizer*
 specialprofilealloc   fixalloc // allocator for specialprofile*
 speciallock           mutex    // lock for special record allocators.
}

var mheap_ mheap

image.png

*   allspans []*mspan: 所有的 spans 都是通过 mheap_ 申请，所有申请过的 mspan 都会记录在 allspans。结构体中的 lock 就是用来保证并发安全的。注释中有关于 STW 的说明，这个之后会在 Golang 的 GC 文章中细说。

*   central [_NumSizeClasses]…: 这个就是之前介绍的 mcentral ，每种大小的块对应一个 mcentral。mcentral 上面介绍过了。pad 可以认为是一个字节填充，为了避免伪共享（false sharing）问题的。False Sharing 可以参考 [False Sharing - wikipedia](https://en.wikipedia.org/wiki/False_sharing)，这里就不细说了。

*   sweepgen, sweepdone: GC 相关。（Golang 的 GC 策略是 Mark & Sweep, 这里是用来表示 sweep 的，这里就不再深入了。）

*   free [_MaxMHeapList]mSpanList: 这是一个 SpanList 数组，每个 SpanList 里面的 mspan 由 1 ~ 127 (_MaxMHeapList - 1) 个 page 组成。比如 free[3] 是由包含 3 个 page 的 mspan 组成的链表。free 表示的是 free list，也就是未分配的。对应的还有 busy list。

*   freelarge mSpanList: mspan 组成的链表，每个元素（也就是 mspan）的 page 个数大于 127。对应的还有 busylarge。

*   spans []*mspan: 记录 arena 区域页号（page number）和 mspan 的映射关系。

*   arena_start, arena_end, arena_used: 要解释这几个变量之前要解释一下 arena。arena 是 Golang 中用于分配内存的连续虚拟地址区域。由 mheap 管理，堆上申请的所有内存都来自 arena。那么如何标志内存可用呢？操作系统的常见做法用两种：一种是用链表将所有的可用内存都串起来；另一种是使用位图来标志内存块是否可用。结合上面一条 spans，内存的布局是下面这样的。

四、内存初始化

mallocinit初始化

func mallocinit() {
    //一些系统检测代码，略去
    var p, bitmapSize, spansSize, pSize, limit uintptr
    var reserved bool
 
    // limit = runtime.memlimit();
    // See https://golang.org/issue/5049
    // TODO(rsc): Fix after 1.1.
    limit = 0
   
    //系统指针大小 PtrSize = 8，表示这是一个 64 位系统。
    if sys.PtrSize == 8 && (limit == 0 || limit > 1<<30) {
        //这里的 arenaSize, bitmapSize, spansSize 分别对应 mheap 那一小节里面提到 arena 区大小，bitmap 区大小，spans 区大小。
        arenaSize := round(_MaxMem, _PageSize)
        bitmapSize = arenaSize / (sys.PtrSize * 8 / 2)
        spansSize = arenaSize / _PageSize * sys.PtrSize
        spansSize = round(spansSize, _PageSize)
        //尝试从不同地址开始申请
        for i := 0; i <= 0x7f; i++ {
            switch {
            case GOARCH == "arm64" && GOOS == "darwin":
                p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
            case GOARCH == "arm64":
                p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
            default:
                p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
            }
            pSize = bitmapSize + spansSize + arenaSize + _PageSize
            //向 OS 申请大小为 pSize 的连续的虚拟地址空间
            p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
            if p != 0 {
                break
            }
        }
    }
    //这里是 32 位系统代码对应的操作，略去。
    ...
     
    p1 := round(p, _PageSize)
 
    spansStart := p1
    mheap_.bitmap = p1 + spansSize + bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1 + (spansSize + bitmapSize)
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1 + (spansSize + bitmapSize)
    mheap_.arena_reserved = reserved
 
    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }
 
    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    _g_ := getg()
    _g_.m.mcache = allocmcache()
}

mheap初始化

func (h *mheap) init(spansStart, spansBytes uintptr) {
    h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
    h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
    h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
    h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)
 
    h.spanalloc.zero = false
 
    // h->mapcache needs no init
    for i := range h.free {
        h.free[i].init()
        h.busy[i].init()
    }
 
    h.freelarge.init()
    h.busylarge.init()
    for i := range h.central {
        h.central[i].mcentral.init(int32(i))
    }
 
    sp := (*slice)(unsafe.Pointer(&h.spans))
    sp.array = unsafe.Pointer(spansStart)
    sp.len = 0
    sp.cap = int(spansBytes / sys.PtrSize)
}

mcache初始化

func schedinit() {
    ...
    mallocinit()
    ...
     
    if procs > _MaxGomaxprocs {
        procs = _MaxGomaxprocs
    }
    if procresize(procs) != nil {
        ...
    }
}
 
 
func procresize(nprocs int32) *p {
    ...
    // initialize new P's
    for i := int32(0); i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p)
            pp.id = i
            pp.status = _Pgcstop
            pp.sudogcache = pp.sudogbuf[:0]
            for i := range pp.deferpool {
                pp.deferpool[i] = pp.deferpoolbuf[i][:0]
            }
            atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
        }
        // P mcache 初始化
        if pp.mcache == nil {
            if old == 0 && i == 0 {
                if getg().m.mcache == nil {
                    throw("missing mcache?")
                }
                // P[0] 分配给主 Goroutine
                pp.mcache = getg().m.mcache // bootstrap
            } else {
                // P[0] 之外的 P 申请 mcache
                pp.mcache = allocmcache()
            }
        }
        ...
    }
    ...
}

五、内存分配

object size > 32K，则使用 mheap 直接分配。
object size < 16 byte，使用 mcache 的小对象分配器 tiny 直接分配。（其实 tiny 就是一个指针，暂且这么说吧。）
object size > 16 byte && size <=32K byte 时，先使用 mcache 中对应的 size class 分配。
如果 mcache 对应的 size class 的 span 已经没有可用的块，则向 mcentral 请求。
如果 mcentral 也没有可用的块，则向 mheap 申请，并切分。
如果 mheap 也没有合适的 span，则想操作系统申请。

堆上分配调用mallocgc

func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}
 
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    ...
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    if size <= maxSmallSize {
        // object size <= 32K
        if noscan && size < maxTinySize {
            // 小于 16 byte 的小对象分配
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySizeClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySizeClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            // object size >= 16 byte  && object size <= 32K byte
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            span := c.alloc[sizeclass]
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(sizeclass)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        //object size > 32K byte
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }
}