用 Valgrind 剖析 MySQL 内存使用

前言

最近在线上遇到一个实例经常 OOM(Out Of Memory)，通过监控看，内存使用在短时间内快速上涨，最终超过 CGroup 上限。从以往经验来看，这种一般是连接数短时间内爆涨（几百~上千）导致的，因为每个连接对应 MySQL 内的一个用户线程，每个用户线程都挂了一 THD 对象用来维护上下文，以及各种 thread local cache，可能在几百 KB或者 MB 级别，但是从监控数据看最多也就几十个连接；还有一种可能是 Memory Leak，但是这种通常内存是慢慢涨上去的，不是爆涨，但是不排除这种可能。

问题分析

对于这种陌生的问题，第一个想法是能否复现（可以复现的问题都不是问题），观察了下用户的语句和表结构：

就一张表，用的是分区表，有 149 个分区，引擎是 TokuDB；
语句非常简单，都是 SELECT + ORDER BY + LIMIT，但是 Where 条件里没有指定分区字段，因此所有分区都会被访问。

根据这2个特征，线下很容易就复现出来了。

用 64 个并发线程来压测，可以看到物理内存(RSS)从 500M 左右迅速涨到 20G 左右，tokudb_cache_size 配置为 2.4G 左右。

04:14:38 PM       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
04:14:39 PM     32552      0.00      0.00 5255548 293956   0.06  mysqld
04:14:40 PM     32552      0.00      0.00 5255548 293956   0.06  mysqld
04:14:41 PM     32552  70980.00      0.00 6537852 568712   0.11  mysqld
04:14:42 PM     32552 2369116.00      0.00 17990268 10008148   1.89  mysqld
04:14:43 PM     32552 1347716.00      0.00 23970428 8793920   1.66  mysqld
04:14:44 PM     32552 669286.00      0.00 24920700 11250280   2.13  mysqld
04:14:45 PM     32552 1378265.00      0.00 29225596 14673244   2.77  mysqld
04:14:46 PM     32552 1164854.00      0.00 29225596 18549660   3.50  mysqld
04:14:47 PM     32552 117668.00      0.00 29225596 19010792   3.59  mysqld
04:14:48 PM     32552  30837.00      0.00 29225596 19133736   3.61  mysqld
04:14:49 PM     32552  74417.00      0.00 29225596 19422264   3.67  mysqld
04:14:50 PM     32552  32878.00      0.00 29225596 19553356   3.69  mysqld
04:14:51 PM     32552  73059.00      0.00 29225596 19836520   3.75  mysqld
04:14:52 PM     32552  32665.00      0.00 29229692 19966784   3.77  mysqld
04:14:53 PM     32552  68573.00      0.00 29545084 20232064   3.82  mysqld
04:14:54 PM     32552  33559.00      0.00 29749884 20365884   3.85  mysqld
04:14:55 PM     32552 781718.00      0.00 30765692 16992452   3.21  mysqld
04:14:56 PM     32552 1052088.00      0.00 31748732 15055824   2.84  mysqld
04:14:57 PM     32552 1137534.00      0.00 32707196 17160700   3.24  mysqld
04:14:58 PM     32552 797537.00      0.00 32707196 20223936   3.82  mysqld
04:14:59 PM     32552  10240.00      0.00 32707196 20160464   3.81  mysqld
04:15:00 PM     32552  34351.00      0.00 32707196 20116904   3.80  mysqld

Valgrind 是一个非常好用的内存检测工具，通过它我们可以快速来定位 Memory Leak 问题，我们用 valgrind --tool=memcheck 来重新启动 mysqld，然后跑复现压测，但是并没有检测到内存泄漏。

那么究竟是哪里把内存吃掉了呢？这就该本文的主角了——Massif 登场了，和 memcheck 类似，massif 也是 Valgrind 套装中一个工具，主要用来做 memory profiling：

Massif is a heap profiler. It measures how much heap memory your program uses.

用起来也比较简单 valgrind --tool=massif prog 即可，压测完后，会生成一个 massif.out.<pid> 的文件，这是一个 txt 文件，虽然可以直接打开看，但是为了更好的看结果，valgrind 提供了另一个工具 ms_print 来展示，官方认为将数据采集和展示分开，是为了将来更好更方便的对展示工具做扩展 :)

    GB
22.80^                                       :
     |                      :@@:::::@::::::#::
     |                 @@::::@ :::::@: ::::#::                     ::::::@::::
     |                 @@: ::@ :::::@: ::::#::                  :::@:::::@::::
     |                 @@: ::@ :::::@: ::::#::              @::::::@:::::@::::
     |                @@@: ::@ :::::@: ::::#::::        ::::@::::::@:::::@::::
     |                @@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |                @@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |               :@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |    ::::::::::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |    :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |    :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |    :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |    :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     |  :::::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     | :: :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     | @: :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     | @: :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     | @: :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
     | @: :::::: :::::@@@: ::@ :::::@: ::::#::: :::@::::::::@::::::@:::::@::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   281.7

Number of snapshots: 82
 Detailed snapshots: [1, 15, 16, 17, 22, 28, 34 (peak), 41, 54, 64, 74]

massif 会对程序运行过程中的内存使用做 snapshot，上图中纵坐标代表内存使用量，横坐标可以看作是一种时间单位，默认是 CPU 指令条数；图中每一列代表一次内存 snapshot，不同的字符有不同的含义，其中 : 表示一次正常的 snapshot，@ 表示一次 detailed snapshot，# 表示本次采集的峰值(peak 一定是 detailed)。

我们这里只关心 detailed snapshot，因为它包含内存使用的 stack trace 信息，通过这个我们就知道当前的内存使用，是哪些代码路径引起的。

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  2  8,139,138,982    7,192,367,296    7,068,742,047   123,625,249            0
  3 12,518,919,899    7,674,937,104    7,519,173,315   155,763,789            0
  4 18,630,130,824   14,528,934,960   14,433,389,357    95,545,603            0
  5 23,053,916,332   14,528,940,728   14,433,394,482    95,546,246            0
  6 28,667,255,552   14,530,496,688   14,434,920,681    95,576,007            0
  7 31,442,322,447   14,530,513,504   14,434,933,623    95,579,881            0
  8 36,910,366,437   14,530,583,456   14,434,985,999    95,597,457            0
  9 40,736,846,313   14,530,640,472   14,435,028,200    95,612,272            0
 10 46,739,261,418   14,530,667,992   14,435,049,267    95,618,725            0
 11 50,685,776,389   14,530,668,784   14,435,049,971    95,618,813            0
 12 57,194,337,509   14,530,670,152   14,435,051,187    95,618,965            0
 13 60,902,677,929   14,551,221,560   14,456,544,294    94,677,266            0
 14 65,578,137,351   15,263,985,032   15,014,333,631   249,651,401            0
 15 69,870,684,301   19,339,915,904   17,864,686,175 1,475,229,729            0
92.37% (17,864,686,175B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.92.37% (17,864,686,175B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->39.91% (7,718,060,098B) 0x973463: my_malloc (my_malloc.c:xxx)
| ->28.51% (5,513,971,756B) 0xBE989B: ha_tokudb::unpack_blobs(unsigned char*, unsigned char const*, unsigned int, bool) (hatoku_defines.h:xxx)
| | ->19.31% (3,734,605,089B) 0xCAE6BD: c_getf_set_callback(unsigned int, void const*, unsigned int, void const*, void*, bool) (ydb_cursor.cc:xxx)
| | | ->19.31% (3,734,605,089B) 0xC8B1B5: ft_cursor_search_eq_k_x_getf(unsigned int, void const*, unsigned int, void const*, void*, bool) (cursor.cc:38
2)
| | |   ->19.31% (3,734,605,089B) 0xC4A000: ft_search_node(ft_handle*, ftnode*, ft_search*, int, int (*)(unsigned int, void const*, unsigned int, void
const*, void*, bool), void*, bool*, ft_cursor*, unlockers*, ancestors*, pivot_bounds const&, bool) (ft-ops.cc:xxx)
| | |     ->19.30% (3,732,546,926B) 0xC494AA: ft_search_node(ft_handle*, ftnode*, ft_search*, int, int (*)(unsigned int, void const*, unsigned int, voi
d const*, void*, bool), void*, bool*, ft_cursor*, unlockers*, ancestors*, pivot_bounds const&, bool) (ft-ops.cc:xxx)
| | |     | ->19.30% (3,732,546,926B) 0xC494AA: ft_search_node(ft_handle*, ftnode*, ft_search*, int, int (*)(unsigned int, void const*, unsigned int, void const*, void*, bool), void*, bool*, ft_cursor*, unlockers*, ancestors*, pivot_bounds const&, bool) (ft-ops.cc:xxx)

->30.41% (5,882,175,012B) 0x9735BC: my_realloc (my_malloc.c:xxx)
| ->26.92% (5,206,093,356B) 0xBE989B: ha_tokudb::unpack_blobs(unsigned char*, unsigned char const*, unsigned int, bool) (hatoku_defines.h:xxx)
| | ->24.27% (4,693,513,693B) 0xCAE6BD: c_getf_set_callback(unsigned int, void const*, unsigned int, void const*, void*, bool) (ydb_cursor.cc:xxx)
| | | ->24.27% (4,693,513,693B) 0xC8B1B5: ft_cursor_search_eq_k_x_getf(unsigned int, void const*, unsigned int, void const*, void*, bool) (cursor.cc:xxx)
| | |   ->24.27% (4,693,513,693B) 0xC4A000: ft_search_node(ft_handle*, ftnode*, ft_search*, int, int (*)(unsigned int, void const*, unsigned int, void const*, void*, bool), void*, bool*, ft_cursor*, unlockers*, ancestors*, pivot_bounds const&, bool) (ft-ops.cc:xxx)
| | |     ->24.27% (4,693,513,693B) 0xC494AA: ft_search_node(ft_handle*, ftnode*, ft_search*, int, int (*)(unsigned int, void const*, unsigned int, void const*, void*, bool), void*, bool*, ft_cursor*, unlockers*, ancestors*, pivot_bounds const&, bool) (ft-ops.cc:xxx)

从上面的 stack trace 中可以看到，在第15个 snapshot中, 有70%+(39.91% + 30.41%) 的内存是通过 ha_tokudb::unpack_blobs 这里占用的。

这个栈的逻辑是将 TokuDB engine 返回的数据，转换成 MySQL server 层的数据格式，而在 blob 字段转换过程中，会用到临时的 buffer(blob_buf)，问题就出在这个 buffer 使用上。这个 buffer 是弹性的，但是只能往上弹，并且只有在表 close 时才会释放。

加了调试代码后，我们发现 blob_buff 非常大，平均在 2M 左右，最大可以到20M左右，每一个 TokuDB handler 都有这样的一个buffer，而用户的 sql 又会打开所有的分区表(149个)，这样一个用户连接就会占 149*2M = 298M 左右的内存，64 个连接差不多用掉 18G 左右的内存。

至此问题已经非常清楚了，一个简单的解法就是尽快释放 blob_buffer，而不是等到 close 表的时候才释放。

总结

从上面的分析过程可以看出，massif 给出内存使用信息是关键，帮助我们快速定位到内存使用大头的stack。如果没有 massif，我们可能就要在内存 malloc 的地方埋点打印信息，然后人肉汇总分析。

valgrind massif 是一个通用的工具，本文是以 MySQL 为例分享一个 case，如果有其它程序也出现内存使用异常的，也可以拿来 profiling 一把。

Tips：

massif 默认是统计 *alloc/new 这些相对来说 high level 的函数的，如果程序里用了 mmap/brk 这些函数，是不是会统计的，需要使用 --pages-as-heap=yes 这个参数开关；
stack 使用默认不会统计，--stacks=yes 可以打开，如果用了 --pages-as-heap=yes，就不需要 --stack=yes，因为已经包含了；
程序 gcc 编译参数中应该包括 -g 选项。

官方文档有详细的关于各个参数的介绍，参数并不多，在使用前建议看一看。

Have fun!

fungo / 2017-07-23
Published under (CC) BY-NC-SA in categories technology MySQL tagged with MySQL Memory valgrind