处理一个 socket: too many open files 问题

大概情况是这样的, 前几天做完离职交接, 今天看到前公司的微信群里消息爆炸了, 用户 无法在微信小程序中下单. 接手后端项目的小胖已经重启了数据库(不晓得他为啥一进服务器 就先把数据库重启了), 说还是不行.

后来另一位同事重启了 consumer 服务, 然后服务就恢复了.

查看了 docker log, 发现了一个 ``dial tcp 127.0.0.1:8202: socket: too many open files 的问题, 但是请求的 http server 也是布署在同一台服务器上, 它们也共享了同一个 docker network, 通常是 不会出现这个问题的, 之后就猜到可能是因为本地的 TCP 端口号被占完了, 或者分配给该进程 的文件句柄数被用光了, http client 无法再发起新的 socket 连接导致的.

整个问题还是太清晰, 先理一下后台的服务布署情况:
services

其中, postal 服务提供了 GET /api/status 这个接口, 用于获取主控的状态. 该接口被 business, consumerstaff 使用.

/proc/CONSUMER_PID/fd/ 目录里面, 可以看到有数百个未关闭的 socket fd, 因为 这个服务刚被重启过, 才只剩这么多的. 同时去看了 /proc/BUSINESS_PID/fd/ 目录, 发现 里面有超过 12000 个 socket fd, 这些都是不正常的. 而 /proc/STAFF_PID/fd/ 里面只有 数百个, 因为 GET /api/status 这个接口在 staff 服务里不经常被调用.

因为是线上服务器, 可以选用访问量比较少的后台服务staff 来调试, 通过 staff 服务 请求了一下 postal 服务的 GET /api/status 接口, 发现 /proc/STAFF_PID/fd/ 里面 就多了一个 socket fd, 这种现象很稳定, 每次请求都会多一次.

问题可以大致定位到了, 估计就是 staff 里的 http client 那部分代码:

url := fmt.Sprintf("%s/api/stats/%s", settings.Current.BoxServer, box)
transport := &http.Transport{
    TLSHandshakeTimeout: TlsTimeout,
}

client := http.Client{
    Timeout:   HttpTimeout,
    Transport: transport,
}

resp, err := client.Get(url)
if err != nil {
    return nil, err
}
defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
    return nil, err
}

上面的代码也很简单, 没有多余的操作, 而且也依照 golang docs 里面要求的 关闭了 response body.

核心的问题是在:

transport := &http.Transport{
    TLSHandshakeTimeout: TlsTimeout,
}

这里, 只调整了 Transport 结构体的 TLS 握时超时的选项, 其它都是默认值.

Demo

服务端, server.go:

package main

import (
    "log"
    "net/http"
    "github.com/gin-gonic/gin"
)

func getStatus(c *gin.Context) {
    c.JSON(http.StatusOK, gin.H{
        "Status": "ok",
    })
}

func main() {
    route := gin.Default()
    api := route.Group("/api")
    api.GET("/status", getStatus)

    if err := route.Run("127.0.0.1:4200"); err != nil {
        log.Panic(err)
    }
}

客户端代码, client.go:

package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "time"
)

func request() {
    log.Println("request()")
    url := "http://127.0.0.1:4200/api/status"

    transport := &http.Transport{
        // ResponseHeaderTimeout: 15 * time.Second,
    }

    client := http.Client{
        // Timeout:   30,
        Transport: transport,
    }

    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return
    }
    log.Println(len(string(body)))

    time.Sleep(1 * time.Millisecond)
}

func main() {
    for i := 0; i < 10000; i ++ {
        go request()
    }

    time.Sleep(1 * time.Hour)
}

以上的代码是在本地运行的, 同样在 client 那里可以看到打印的报错:

2019/09/19 15:26:34 Get http://127.0.0.1:4200/api/status: dial tcp 127.0.0.1:4200: socket: too many open files

然后看一下该进程打开了文件数:

$ ls /proc/5689/limits | wc -l
1024

这个值要比之前在服务器上面出现的要低很多, 后来确认是本地的linux 系统没有优化配置, 一个进程默认只允许打开 1024 个文件, 下面也可以看到这样的 resource limit 信息:

$ cat /proc/5689/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             63754                63754                processes 
Max open files            1024                 1048576              files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       63754                63754                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us  

分析 Transport

以上问题的核心就是 golang 的 http client 并没有及时关闭 socket 连接导致的, 这个 要看它的 Transport 究竟怎么实现的了.

先看看 golang 的 transport 源码 (src/net/http/transport.go):

// Transport is an implementation of RoundTripper that supports HTTP,
// HTTPS, and HTTP proxies (for either HTTP or HTTPS with CONNECT).
//
// By default, Transport caches connections for future re-use.
// This may leave many open connections when accessing many hosts.
// This behavior can be managed using Transport's CloseIdleConnections method
// and the MaxIdleConnsPerHost and DisableKeepAlives fields.
//
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.
//
// A Transport is a low-level primitive for making HTTP and HTTPS requests.
// For high-level functionality, such as cookies and redirects, see Client.
//
// Transport uses HTTP/1.1 for HTTP URLs and either HTTP/1.1 or HTTP/2
// for HTTPS URLs, depending on whether the server supports HTTP/2,
// and how the Transport is configured. The DefaultTransport supports HTTP/2.
// To explicitly enable HTTP/2 on a transport, use golang.org/x/net/http2
// and call ConfigureTransport. See the package docs for more about HTTP/2.
//
// Responses with status codes in the 1xx range are either handled
// automatically (100 expect-continue) or ignored. The one
// exception is HTTP status code 101 (Switching Protocols), which is
// considered a terminal status and returned by RoundTrip. To see the
// ignored 1xx responses, use the httptrace trace package's
// ClientTrace.Got1xxResponse.
//
// Transport only retries a request upon encountering a network error
// if the request is idempotent and either has no body or has its
// Request.GetBody defined. HTTP requests are considered idempotent if
// they have HTTP methods GET, HEAD, OPTIONS, or TRACE; or if their
// Header map contains an "Idempotency-Key" or "X-Idempotency-Key"
// entry. If the idempotency key value is an zero-length slice, the
// request is treated as idempotent but the header is not sent on the
// wire.
type Transport struct {
    idleMu     sync.Mutex
    wantIdle   bool                                // user has requested to close all idle conns
    idleConn   map[connectMethodKey][]*persistConn // most recently used at end
    idleConnCh map[connectMethodKey]chan *persistConn
    idleLRU    connLRU

    reqMu       sync.Mutex
    reqCanceler map[*Request]func(error)

    altMu    sync.Mutex   // guards changing altProto only
    altProto atomic.Value // of nil or map[string]RoundTripper, key is URI scheme

    connCountMu          sync.Mutex
    connPerHostCount     map[connectMethodKey]int
    connPerHostAvailable map[connectMethodKey]chan struct{}

    // Proxy specifies a function to return a proxy for a given
    // Request. If the function returns a non-nil error, the
    // request is aborted with the provided error.
    //
    // The proxy type is determined by the URL scheme. "http",
    // "https", and "socks5" are supported. If the scheme is empty,
    // "http" is assumed.
    //
    // If Proxy is nil or returns a nil *URL, no proxy is used.
    Proxy func(*Request) (*url.URL, error)

    // DialContext specifies the dial function for creating unencrypted TCP connections.
    // If DialContext is nil (and the deprecated Dial below is also nil),
    // then the transport dials using package net.
    //
    // DialContext runs concurrently with calls to RoundTrip.
    // A RoundTrip call that initiates a dial may end up using
    // a connection dialed previously when the earlier connection
    // becomes idle before the later DialContext completes.
    DialContext func(ctx context.Context, network, addr string) (net.Conn, error)

    // Dial specifies the dial function for creating unencrypted TCP connections.
    //
    // Dial runs concurrently with calls to RoundTrip.
    // A RoundTrip call that initiates a dial may end up using
    // a connection dialed previously when the earlier connection
    // becomes idle before the later Dial completes.
    //
    // Deprecated: Use DialContext instead, which allows the transport
    // to cancel dials as soon as they are no longer needed.
    // If both are set, DialContext takes priority.
    Dial func(network, addr string) (net.Conn, error)

    // DialTLS specifies an optional dial function for creating
    // TLS connections for non-proxied HTTPS requests.
    //
    // If DialTLS is nil, Dial and TLSClientConfig are used.
    //
    // If DialTLS is set, the Dial hook is not used for HTTPS
    // requests and the TLSClientConfig and TLSHandshakeTimeout
    // are ignored. The returned net.Conn is assumed to already be
    // past the TLS handshake.
    DialTLS func(network, addr string) (net.Conn, error)

    // TLSClientConfig specifies the TLS configuration to use with
    // tls.Client.
    // If nil, the default configuration is used.
    // If non-nil, HTTP/2 support may not be enabled by default.
    TLSClientConfig *tls.Config

    // TLSHandshakeTimeout specifies the maximum amount of time waiting to
    // wait for a TLS handshake. Zero means no timeout.
    TLSHandshakeTimeout time.Duration

    // DisableKeepAlives, if true, disables HTTP keep-alives and
    // will only use the connection to the server for a single
    // HTTP request.
    //
    // This is unrelated to the similarly named TCP keep-alives.
    DisableKeepAlives bool

    // DisableCompression, if true, prevents the Transport from
    // requesting compression with an "Accept-Encoding: gzip"
    // request header when the Request contains no existing
    // Accept-Encoding value. If the Transport requests gzip on
    // its own and gets a gzipped response, it's transparently
    // decoded in the Response.Body. However, if the user
    // explicitly requested gzip it is not automatically
    // uncompressed.
    DisableCompression bool

    // MaxIdleConns controls the maximum number of idle (keep-alive)
    // connections across all hosts. Zero means no limit.
    MaxIdleConns int

    // MaxIdleConnsPerHost, if non-zero, controls the maximum idle
    // (keep-alive) connections to keep per-host. If zero,
    // DefaultMaxIdleConnsPerHost is used.
    MaxIdleConnsPerHost int

    // MaxConnsPerHost optionally limits the total number of
    // connections per host, including connections in the dialing,
    // active, and idle states. On limit violation, dials will block.
    //
    // Zero means no limit.
    //
    // For HTTP/2, this currently only controls the number of new
    // connections being created at a time, instead of the total
    // number. In practice, hosts using HTTP/2 only have about one
    // idle connection, though.
    MaxConnsPerHost int

    // IdleConnTimeout is the maximum amount of time an idle
    // (keep-alive) connection will remain idle before closing
    // itself.
    // Zero means no limit.
    IdleConnTimeout time.Duration

    // ResponseHeaderTimeout, if non-zero, specifies the amount of
    // time to wait for a server's response headers after fully
    // writing the request (including its body, if any). This
    // time does not include the time to read the response body.
    ResponseHeaderTimeout time.Duration

    // ExpectContinueTimeout, if non-zero, specifies the amount of
    // time to wait for a server's first response headers after fully
    // writing the request headers if the request has an
    // "Expect: 100-continue" header. Zero means no timeout and
    // causes the body to be sent immediately, without
    // waiting for the server to approve.
    // This time does not include the time to send the request header.
    ExpectContinueTimeout time.Duration

    // TLSNextProto specifies how the Transport switches to an
    // alternate protocol (such as HTTP/2) after a TLS NPN/ALPN
    // protocol negotiation. If Transport dials an TLS connection
    // with a non-empty protocol name and TLSNextProto contains a
    // map entry for that key (such as "h2"), then the func is
    // called with the request's authority (such as "example.com"
    // or "example.com:1234") and the TLS connection. The function
    // must return a RoundTripper that then handles the request.
    // If TLSNextProto is not nil, HTTP/2 support is not enabled
    // automatically.
    TLSNextProto map[string]func(authority string, c *tls.Conn) RoundTripper

    // ProxyConnectHeader optionally specifies headers to send to
    // proxies during CONNECT requests.
    ProxyConnectHeader Header

    // MaxResponseHeaderBytes specifies a limit on how many
    // response bytes are allowed in the server's response
    // header.
    //
    // Zero means to use a default limit.
    MaxResponseHeaderBytes int64

    // nextProtoOnce guards initialization of TLSNextProto and
    // h2transport (via onceSetNextProtoDefaults)
    nextProtoOnce sync.Once
    h2transport   h2Transport // non-nil if http2 wired up
}

核心的是这几句话:

// By default, Transport caches connections for future re-use.
// This may leave many open connections when accessing many hosts.
// This behavior can be managed using Transport's CloseIdleConnections method
// and the MaxIdleConnsPerHost and DisableKeepAlives fields.
//
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.

这些话指出了我们之前代码的问题: 应该定义一个全局的 transport 结构体, 在多个 goroutine 之间共享.

改动后的 client.go 代码如下:

package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "time"
)

var globalTransport *http.Transport

func init() {
    globalTransport = &http.Transport{}
}

func request() {
    log.Println("request()")
    url := "http://127.0.0.1:4200/api/status"

    client := http.Client{
        // Timeout:   30,
        Transport: globalTransport,
    }

    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return
    }
    log.Println(len(string(body)))

    time.Sleep(10 * time.Millisecond)
}

func main() {
    for i := 0; i < 10000; i++ {
        go request()
        time.Sleep(1 * time.Millisecond)
    }

    time.Sleep(1 * time.Hour)
}

再运行测试代码, 发现它打开的 socket fd 数量恢复到个位数了, 一切正常.

$ ls /proc/19100/fd -l
total 0
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 0 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 1 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 2 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 3 -> 'socket:[820276]'
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 4 -> 'anon_inode:[eventpoll]'

另外的问题

当时还用 tcpdump 抓取了 postal 服务的网络数据, 发现它里面有大量的请求在重传:

001.cap.png

这个问题目前还没有确认出现在哪里.

引用:https://blog.biofan.org/2019/09/go-http-client/
参考:https://blog.csdn.net/Roy_70/article/details/78423880

1、增大允许打开的文件数——命令方式
ulimit -n 2048

这样就可以把当前用户的最大允许打开文件数量设置为2048了,但这种设置方法在重启后会还原为默认值。
ulimit -n命令非root用户只能设置到4096。
想要设置到8192需要sudo权限或者root用户。

2、增大允许打开的文件数——修改系统配置文件

vim /etc/security/limits.conf

  • 在最后加入
* soft nofile 4096  
* hard nofile 4096  

或者只加入

 * - nofile 8192

最前的 * 表示所有用户,可根据需要设置某一用户,例如

roy soft nofile 8192  
roy hard nofile 8192  

注意”nofile”项有两个可能的限制措施。就是项下的hard和soft。 要使修改过得最大打开文件数生效,必须对这两种限制进行设定。 如果使用”-“字符设定, 则hard和soft设定会同时被设定。

参考:https://kebingzao.com/2018/06/26/golang-too-many-file/
golang 踩坑之 - 服务的文件句柄超出系统限制(too many open files)

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,951评论 6 497
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,606评论 3 389
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 160,601评论 0 350
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,478评论 1 288
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,565评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,587评论 1 293
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,590评论 3 414
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,337评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,785评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,096评论 2 330
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,273评论 1 344
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,935评论 5 339
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,578评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,199评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,440评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,163评论 2 366
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,133评论 2 352

推荐阅读更多精彩内容