Memcached Memory Management

Posted by 猪小花1号 on 2018-09-11 08:57

Author: 吕宗胜


1. Introduction to Memcached

Memcached is a distributed in-memory object caching system used to reduce database load and improve system performance. The memory management described in this article is based on Memcached 1.4.24. Older versions manage memory somewhat differently, and this article does not cover them.

2. The Memcached Model

Before diving into the source code of Memcached's memory management, let's first introduce some of its key concepts.

  1. Slab

    A Slab is a block of memory allocated by Memcached, 1 MB in size by default. The Slab is the smallest unit in which Memcached allocates memory.

  2. Chunk

    While the Slab is the smallest unit Memcached allocates, each Slab is further divided into Chunks. The Chunk is the smallest unit in which Memcached stores data, and one Chunk holds at most one object. All Chunks within a single Slab have the same size.

  3. Item

    An Item is the actual piece of data stored in Memcached. The Item itself is a fairly complex structure: besides the object's key-value pair and its expiration time, it carries several other fields, which are described in detail below. Memcached stores each Item in one Chunk of the corresponding Slab.

  4. SlabClass

    From the concepts above we know that Memcached allocates memory in Slabs and divides each Slab into equal-sized Chunks. How, then, are Items of different sizes stored? A SlabClass groups Slabs by Chunk size: each SlabClass corresponds to one kind of Slab, and every Slab belonging to the same SlabClass is divided into Chunks of the same size.

Here is a reasonably vivid analogy. Memcached's memory allocation can be compared to the ruled exercise books we used at school. A Slab is one page of the book, a Chunk is one of the squares on that page, an Item is the character we write, and a SlabClass is a whole exercise book. All the squares in one book are the same size, so to write characters of different sizes we simply pick the book whose squares fit the character best.
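
To make the "pick the right notebook" step concrete: given an item's total size, Memcached looks up the smallest slab class whose chunk is large enough. The lookup lives in slabs_clsid() in slabs.c; the sketch below is a paraphrase of that linear scan rather than the verbatim 1.4.24 source.

/* Paraphrased sketch of slabs_clsid(): return the id of the smallest slab
 * class whose chunk can hold "size" bytes, or 0 if nothing is big enough. */
unsigned int slabs_clsid(const size_t size) {
    int res = POWER_SMALLEST;           /* slab class ids start at 1 */

    if (size == 0)
        return 0;
    while (size > slabclass[res].size)  /* walk up until the chunk fits */
        if (res++ == power_largest)     /* won't fit even in the biggest class */
            return 0;
    return res;
}

Since the number of slab classes is small, this linear scan is cheap.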

3. Memcached Data Structures

The most fundamental data structure in Memcached is slabclass_t, which describes a single slabclass. Its definition is as follows:

typedef struct {
    unsigned int size;      /* size of the chunks in this class; fixed once set */
    unsigned int perslab;   /* how many chunks (objects) fit in one slab */

    void *slots;            /* linked list of all free chunks in this slabclass */
    unsigned int sl_curr;   /* total number of free items, i.e. the length of slots */

    unsigned int slabs;     /* number of slabs allocated for this slabclass so far */

    void **slab_list;       /* array of slab pointers */
    unsigned int list_size; /* capacity of slab_list (not the number of allocated slabs) */

    unsigned int killing;   /* index+1 of dying slab, or zero if none */
    size_t requested;       /* number of bytes actually requested from this class */
} slabclass_t;

Here slots, slabs and slab_list deserve special attention. slots is the list of all free Chunks in the SlabClass and is the entry point of Memcached's memory allocation. Free chunks come from two sources: the free Chunks of a newly allocated Slab, and Chunks reclaimed from expired Items. slabs records how many Slabs of this SlabClass have actually been given memory so far. slab_list is an array of pointers to all Slabs of the SlabClass; note that the array may have more entries than the number of Slabs that have actually been allocated.
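
To make the role of slots concrete, here is a simplified sketch modeled on the free-list path of do_slabs_alloc(), not the verbatim source; the helper name alloc_from_slots is made up for illustration. Allocation pops the first free chunk off slots, clears its ITEM_SLABBED flag and updates the counters.

/* Simplified sketch of the free-list path in do_slabs_alloc(); the function
 * name is hypothetical. Pops the first free chunk off slots and returns it. */
static void *alloc_from_slots(slabclass_t *p, const size_t size) {
    item *it;

    if (p->sl_curr == 0)                /* no free chunk left in this class */
        return NULL;

    it = (item *)p->slots;              /* pop the head of the free list */
    p->slots = it->next;
    if (it->next) it->next->prev = 0;

    it->it_flags &= ~ITEM_SLABBED;      /* no longer "free on a slab" */
    it->refcount = 1;                   /* seed the reference count */
    p->sl_curr--;                       /* one fewer free chunk */
    p->requested += size;               /* account for the requested bytes */
    return (void *)it;
}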

Next, let's look at the structure of Item, the object in which Memcached stores its data:

typedef struct _stritem {
    /* Protected by LRU locks */
    struct _stritem *next;      /* next item in a linked list: either the next free
                                 * item on slots, or the next item in an LRU queue */
    struct _stritem *prev;
    /* Rest are protected by an item lock */
    struct _stritem *h_next;    /* next item in the hash chain (items with the same hash value) */
    rel_time_t      time;       /* time of the most recent access */
    rel_time_t      exptime;    /* expiration time */
    int             nbytes;     /* size of the data */
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    uint8_t         slabs_clsid;/* which slab class this item belongs to */
    uint8_t         nkey;       /* length of the key */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        uint64_t cas;
        char end;
    } data[];
    /* if it_flags & ITEM_CAS (ITEM_CAS == 2) we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;

The Item is the basic unit in which Memcached stores data; here the data[] array deserves special attention. Its contents consist of four parts: an optional 8-byte CAS value, the key terminated by '\0', the " flags length\r\n" suffix, and finally the binary data itself, terminated by \r\n.
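
The offsets into data[] are computed by helper macros in memcached.h; paraphrased (whitespace adjusted), they look roughly like the following. The "+ 1" skips the '\0' that terminates the key.

/* Paraphrased from memcached.h: locate the key, the suffix and the payload
 * inside item->data[], skipping the optional 8-byte CAS value. */
#define ITEM_key(item)    (((char*)&((item)->data)) \
        + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

#define ITEM_suffix(item) ((char*)&((item)->data) + (item)->nkey + 1 \
        + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

#define ITEM_data(item)   ((char*)&((item)->data) + (item)->nkey + 1 \
        + (item)->nsuffix \
        + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))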

Having covered SlabClass and Item, let's look at Memcached's LRU implementation. When memory runs short, data inevitably has to be swapped out, and Memcached, being an in-memory store, uses an LRU policy to decide what to replace. Each SlabClass maintains its own LRU queue, with head and tail pointers marking the two ends of the queue; the Item at the tail is the least recently used and is the first to be evicted.
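
A minimal sketch of that per-class queue, assuming the heads[] and tails[] arrays indexed by slab-class id that items.c keeps (the helper below is simplified from item_link_q(), not copied verbatim): new items are linked at the head, and eviction scans from the tail.

/* Simplified sketch of per-slabclass LRU linking, modeled on item_link_q():
 * every class id owns a doubly linked queue with its own head and tail. */
static item *heads[LARGEST_ID];   /* newest item of each LRU queue */
static item *tails[LARGEST_ID];   /* oldest item of each LRU queue */

static void lru_link_head(item *it, const int id) {
    it->prev = 0;
    it->next = heads[id];               /* newest item goes to the front */
    if (it->next) it->next->prev = it;
    heads[id] = it;
    if (tails[id] == 0) tails[id] = it; /* the first item is also the tail */
}

Eviction then starts from tails[id], which is exactly where lru_pull_tail() below begins its search.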

This gives a fairly simple picture of how a slabclass manages its memory, but the actual LRU implementation is considerably more complex and finer-grained. LRU in Memcached works per SlabClass rather than across the whole cache, which is a consequence of how its memory is allocated and managed; memory can, however, be rebalanced between SlabClasses through slab reassignment.

Memcached's way of allocating and managing memory avoids fragmentation, but the trade-off is wasted memory. For example: suppose an Item is 50 KB and the best-fitting Chunk for it is 80 KB; storing it wastes 30 KB.
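
A few lines of arithmetic make the cost of that hypothetical 50 KB item in an 80 KB chunk explicit:

#include <stdio.h>

/* Worked example for the fragmentation discussion above: a 50 KB item stored
 * in an 80 KB chunk wastes 30 KB, i.e. 37.5% of the chunk. */
int main(void) {
    const double item_kb  = 50.0;    /* size of the stored item */
    const double chunk_kb = 80.0;    /* best-fitting chunk in its slab class */
    double wasted = chunk_kb - item_kb;
    printf("wasted: %.1f KB (%.1f%% of the chunk)\n",
           wasted, 100.0 * wasted / chunk_kb);
    return 0;
}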

4. Source Code Analysis

With the memory model described above in mind, let's now look at how Memcached allocates memory at the code level.

Initialization of the slabclass array:

/* The slabclass array is initialized at system start-up */
static slabclass_t slabclass[MAX_NUMBER_OF_SLAB_CLASSES];
void slabs_init(const size_t limit, const double factor, const bool prealloc) {
    int i = POWER_SMALLEST - 1; // index from which the slabclass array starts being filled
    unsigned int size = sizeof(item) + settings.chunk_size; // smallest chunk size: the item header (metadata) plus chunk_size bytes of user data

    mem_limit = limit; // maximum amount of memory that may be allocated

    if (prealloc) { // this branch runs only when memory preallocation is requested
        /* Allocate everything in a big chunk with malloc */
        mem_base = malloc(mem_limit);
        if (mem_base != NULL) {
            mem_current = mem_base;
            mem_avail = mem_limit;
        } else {
            fprintf(stderr, "Warning: Failed to allocate requested memory in"
                    " one large chunk.\nWill allocate in smaller chunks\n");
        }
    }

    memset(slabclass, 0, sizeof(slabclass));

    while (++i < MAX_NUMBER_OF_SLAB_CLASSES-1 && size <= settings.item_size_max / factor) {
        /* Make sure objects stored in Memcached are always aligned; by default 8-byte alignment is used */
        if (size % CHUNK_ALIGN_BYTES)
            size += CHUNK_ALIGN_BYTES - (size % CHUNK_ALIGN_BYTES);

        /* Record the chunk size of this slabclass and how many chunks fit in one slab */
        slabclass[i].size = size;
        slabclass[i].perslab = settings.item_size_max / slabclass[i].size;
        /* The chunk size of the next slabclass grows by the factor parameter;
         * tuning factor controls how Memcached partitions its storage */
        size *= factor;
        if (settings.verbose > 1) {
            fprintf(stderr, "slab class %3d: chunk size %9u perslab %7u\n",
                    i, slabclass[i].size, slabclass[i].perslab);
        }
    }

    /* power_largest is the last slabclass; its chunk size is item_size_max, so it can hold
     * the largest object we accept. One slab (1 MB by default) holds exactly one such item. */
    power_largest = i;
    slabclass[power_largest].size = settings.item_size_max;
    slabclass[power_largest].perslab = 1;
    if (settings.verbose > 1) {
        fprintf(stderr, "slab class %3d: chunk size %9u perslab %7u\n",
                i, slabclass[i].size, slabclass[i].perslab);
    }

    /* for the test suite:  faking of how much we've already malloc'd */
    {
        char *t_initial_malloc = getenv("T_MEMD_INITIAL_MALLOC");
        if (t_initial_malloc) {
            mem_malloced = (size_t)atol(t_initial_malloc);
        }

    }

    /* If preallocation is requested, allocate one slab for each slabclass up front */
    if (prealloc) {
        slabs_preallocate(power_largest);
    }
}
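
The prealloc branch above calls slabs_preallocate(), which the article does not list. Paraphrased, and matching the behavior noted in the comment above (one slab per slabclass), it does roughly the following; treat this as a sketch rather than the verbatim 1.4.24 source.

/* Paraphrased sketch of slabs_preallocate(): allocate one slab up front for
 * every slab class so that early stores do not fail with a confusing
 * out-of-memory error. */
static void slabs_preallocate(const unsigned int maxslabs) {
    int i;
    unsigned int prealloc = 0;

    for (i = POWER_SMALLEST; i < MAX_NUMBER_OF_SLAB_CLASSES; i++) {
        if (++prealloc > maxslabs)
            return;
        if (do_slabs_newslab(i) == 0) {
            fprintf(stderr, "Error while preallocating slab memory!\n");
            exit(1);
        }
    }
}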

From the slabclass initialization code we can see that each slabclass stores items of a fixed size, and that the item size grows from one slabclass to the next according to the factor parameter (1.25 by default). The value of factor affects memory utilization and needs to be tuned for the application.
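
A small standalone program makes the effect of factor visible by replaying the sizing loop above. It assumes a 48-byte item header purely for illustration; the real header size is sizeof(item) and depends on the build, so the chunk sizes printed by a live server started with -vv will differ slightly.

#include <stdio.h>

/* Standalone sketch of the sizing loop in slabs_init(): print the chunk-size
 * progression for a given growth factor. The 48-byte header is an assumption
 * made for illustration only. */
int main(void) {
    const double factor = 1.25;                      /* default growth factor (-f) */
    const unsigned int item_size_max = 1024 * 1024;  /* one slab is 1 MB */
    unsigned int size = 48 + 48;   /* assumed item header + default chunk_size */
    int i = 0;

    while (size <= item_size_max / factor) {
        if (size % 8)                                /* CHUNK_ALIGN_BYTES == 8 */
            size += 8 - (size % 8);
        printf("slab class %3d: chunk size %9u perslab %7u\n",
               ++i, size, item_size_max / size);
        size *= factor;
    }
    printf("slab class %3d: chunk size %9u perslab %7u\n",
           ++i, item_size_max, 1u);                  /* the final, largest class */
    return 0;
}

Raising factor (the -f command-line option) produces fewer, more widely spaced classes; lowering it produces finer-grained classes at the cost of more classes to manage.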

The slabclass initialization above refers to a number of global settings; they are defined as follows:

static void settings_init(void) {
    settings.use_cas = true;
    settings.access = 0700;
    settings.port = 11211;
    settings.udpport = 11211;
    /* By default this string should be NULL for getaddrinfo() */
    settings.inter = NULL;
    settings.maxbytes = 64 * 1024 * 1024; /* default is 64MB */
    settings.maxconns = 1024;         /* to limit connections-related memory to about 5MB */
    settings.verbose = 0;
    settings.oldest_live = 0;
    settings.oldest_cas = 0;          /* supplements accuracy of oldest_live */
    settings.evict_to_free = 1;       /* push old items out of cache when memory runs out */
    settings.socketpath = NULL;       /* by default, not using a unix socket */
    settings.factor = 1.25;
    settings.chunk_size = 48;         /* initial chunk_size: minimum space for key + value + flags */
    settings.num_threads = 4;         /* N workers */
    settings.num_threads_per_udp = 0;
    settings.prefix_delimiter = ':';
    settings.detail_enabled = 0;
    settings.reqs_per_event = 20;
    settings.backlog = 1024;
    settings.binding_protocol = negotiating_prot;
    settings.item_size_max = 1024 * 1024; /* 1 MB limit: the maximum size of an object Memcached will cache */
    settings.maxconns_fast = false;
    settings.lru_crawler = false;
    settings.lru_crawler_sleep = 100;
    settings.lru_crawler_tocrawl = 0;
    settings.lru_maintainer_thread = false;
    settings.hot_lru_pct = 32;
    settings.warm_lru_pct = 32;
    settings.expirezero_does_not_evict = false;
    settings.hashpower_init = 0;
    settings.slab_reassign = false;
    settings.slab_automove = 0;
    settings.shutdown_command = false;
    settings.tail_repair_time = TAIL_REPAIR_TIME_DEFAULT;
    settings.flush_enabled = true;
    settings.crawls_persleep = 1000;
}

Having covered SlabClass initialization, let's look at how each individual slab is allocated.

static int do_slabs_newslab(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    /* If slab reassignment is enabled, always allocate slabs of the maximum size so pages can be moved between classes */
    int len = settings.slab_reassign ? settings.item_size_max
        : p->size * p->perslab;
    char *ptr;

    if ((mem_limit && mem_malloced + len > mem_limit && p->slabs > 0)) {
        mem_limit_reached = true;
        MEMCACHED_SLABS_SLABCLASS_ALLOCATE_FAILED(id);
        return 0;
    }
    /* grow_slab_list makes sure slab_list has room for one more slab pointer: if slabs < list_size there is a free entry, otherwise the array is doubled (it starts at 16 entries) */
    if ((grow_slab_list(id) == 0) ||
        ((ptr = memory_allocate((size_t)len)) == 0)) {

        MEMCACHED_SLABS_SLABCLASS_ALLOCATE_FAILED(id);
        return 0;
    }

    memset(ptr, 0, (size_t)len);
    /* Split the newly allocated slab (ptr) into free chunks and put them on the free list */
    split_slab_page_into_freelist(ptr, id);

    p->slab_list[p->slabs++] = ptr;
    mem_malloced += len;
    MEMCACHED_SLABS_SLABCLASS_ALLOCATE(id);

    return 1;
}
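
do_slabs_newslab relies on grow_slab_list to make sure slab_list has a free entry before a new slab pointer is stored. Paraphrased (not the verbatim source), that helper looks roughly like this:

/* Paraphrased sketch of grow_slab_list(): once every entry of slab_list is in
 * use, double the array (it starts out with 16 entries). */
static int grow_slab_list(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    if (p->slabs == p->list_size) {              /* the pointer array is full */
        size_t new_size = (p->list_size != 0) ? p->list_size * 2 : 16;
        void *new_list = realloc(p->slab_list, new_size * sizeof(void *));
        if (new_list == 0) return 0;             /* out of memory: report failure */
        p->list_size = new_size;
        p->slab_list = new_list;
    }
    return 1;
}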

When a SlabClass requests a new Slab, it first makes sure slab_list has room for it (growing the array once slabs reaches list_size) and then allocates the memory for the Slab. Once the Slab has been allocated, Memcached divides it into individual free Chunks. Let's look at how those free chunks are produced.

/* Split a freshly allocated slab into p->perslab free chunks */
static void split_slab_page_into_freelist(char *ptr, const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    int x;
    for (x = 0; x < p->perslab; x++) {
        do_slabs_free(ptr, 0, id);   /* put this chunk onto the free list */
        ptr += p->size;              /* advance to the next chunk */
    }
}

/* Initialize a single chunk as a free item */
static void do_slabs_free(void *ptr, const size_t size, unsigned int id) {
    slabclass_t *p;
    item *it;

    assert(id >= POWER_SMALLEST && id <= power_largest);
    if (id < POWER_SMALLEST || id > power_largest)
        return;

    MEMCACHED_SLABS_FREE(size, id, ptr);
    p = &slabclass[id];
    /* Push the newly created free item onto the head of the free list (slots) */
    it = (item *)ptr;
    it->it_flags |= ITEM_SLABBED;
    it->slabs_clsid = 0;
    it->prev = 0;
    it->next = p->slots;
    if (it->next) it->next->prev = it;
    p->slots = it;

    p->sl_curr++;
    p->requested -= size;
    return;
}

So every time Memcached obtains a new slab, it divides the slab into items and records them in a linked list. At this point all of Memcached's memory has been initialized: the slabclass array has been built, each slabclass has its slab_list (with a default initial capacity of 16 slab pointers), the chunk size within each slabclass is fixed, the number of chunks per slab is fixed, and every chunk of a newly allocated slab has been initialized as a free Item. All the free Items of a slabclass are linked together on the slots list; this list spans all slabs of the class, so every free chunk in any of its slabs can be found there, in the order it was added.
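
As a small illustration of that layout, a hypothetical debug helper (not part of Memcached) could walk the slots list of one class and check that the number of free chunks matches the sl_curr counter:

/* Hypothetical debug helper, not part of Memcached: walk the free list of one
 * slab class and verify that its length matches the sl_curr counter. */
static void check_free_list(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    unsigned int count = 0;
    item *it;

    for (it = (item *)p->slots; it != NULL; it = it->next)
        count++;                      /* every node on slots is one free chunk */
    if (count != p->sl_curr)
        fprintf(stderr, "slabclass %u: slots has %u entries but sl_curr is %u\n",
                id, count, p->sl_curr);
}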

Having covered slab allocation, let's now focus on how an Item obtains its memory. This function can be considered the core entry point of Memcached's memory allocation; the code is as follows:

#define HOT_LRU 0
#define WARM_LRU 64
#define COLD_LRU 128
#define NOEXP_LRU 192
item *do_item_alloc(char *key, const size_t nkey, const int flags,
                    const rel_time_t exptime, const int nbytes,
                    const uint32_t cur_hv) {
    int i;
    uint8_t nsuffix;
    item *it = NULL;
    char suffix[40];
    unsigned int total_chunks;
    size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix); /* compute the total size of the item */
    if (settings.use_cas) {
        ntotal += sizeof(uint64_t);
    }

    /* Pick the slabclass according to the total size to be stored; slabclass ids start at 1 */
    unsigned int id = slabs_clsid(ntotal);
    if (id == 0)
        return 0;

    /* If no memory is available, attempt a direct LRU juggle/eviction */
    /* This is a race in order to simplify lru_pull_tail; in cases where
     * locked items are on the tail, you want them to fall out and cause
     * occasional OOM's, rather than internally work around them.
     * This also gives one fewer code path for slab alloc/free
     */
    for (i = 0; i < 5; i++) {
        /* First try to reclaim memory; lru_maintainer_thread is false by default */
        if (!settings.lru_maintainer_thread) {
            lru_pull_tail(id, COLD_LRU, 0, false, cur_hv);
        }
        /* Allocate memory for the Item from the slab allocator */
        it = slabs_alloc(ntotal, id, &total_chunks);
        if (settings.expirezero_does_not_evict)
            total_chunks -= noexp_lru_size(id);
        /* If allocation failed, pull items off the LRU tails to free memory, then retry */
        if (it == NULL) {
            if (settings.lru_maintainer_thread) {
                lru_pull_tail(id, HOT_LRU, total_chunks, false, cur_hv);
                lru_pull_tail(id, WARM_LRU, total_chunks, false, cur_hv);
                lru_pull_tail(id, COLD_LRU, total_chunks, true, cur_hv);
            } else {
                lru_pull_tail(id, COLD_LRU, 0, true, cur_hv);
            }
        } else {
            break;
        }
    }

    if (i > 0) {
        pthread_mutex_lock(&lru_locks[id]);
        itemstats[id].direct_reclaims += i;
        pthread_mutex_unlock(&lru_locks[id]);
    }
    /* Not enough memory to store the item */
    if (it == NULL) {
        pthread_mutex_lock(&lru_locks[id]);
        itemstats[id].outofmemory++;
        pthread_mutex_unlock(&lru_locks[id]);
        return NULL;
    }

    assert(it->slabs_clsid == 0);
    //assert(it != heads[id]);

    /* Refcount is seeded to 1 by slabs_alloc() */
    it->next = it->prev = it->h_next = 0;
    /* Items are initially loaded into the HOT_LRU. This is '0' but I want at
     * least a note here. Compiler (hopefully?) optimizes this out.
     */
    if (settings.lru_maintainer_thread) {
        if (exptime == 0 && settings.expirezero_does_not_evict) {
            id |= NOEXP_LRU;
        } else {
            id |= HOT_LRU;
        }
    } else {
        /* There is only COLD in compat-mode */
        id |= COLD_LRU;
    }
    it->slabs_clsid = id;

    DEBUG_REFCNT(it, '*');
    it->it_flags = settings.use_cas ? ITEM_CAS : 0;
    it->nkey = nkey;
    it->nbytes = nbytes;
    memcpy(ITEM_key(it), key, nkey);
    it->exptime = exptime;
    memcpy(ITEM_suffix(it), suffix, (size_t)nsuffix);
    it->nsuffix = nsuffix;
    return it;
}

As this code shows, item allocation first consults the LRU and frees expired Items from it, then allocates memory through slabs_alloc; if there still is not enough memory, it frees LRU memory again, this time allowed to evict least-recently-used objects that have not yet expired. Let's first look at the implementation of lru_pull_tail:

/* Returns number of items removed, expired, or evicted.
 * Callable from worker threads or the LRU maintainer thread */
static int lru_pull_tail(const int orig_id, const int cur_lru,
        const unsigned int total_chunks, const bool do_evict, const uint32_t cur_hv) {
    item *it = NULL;
    int id = orig_id;
    int removed = 0;
    if (id == 0)
        return 0;

    int tries = 5;
    item *search;
    item *next_it;
    void *hold_lock = NULL;
    unsigned int move_to_lru = 0;
    uint64_t limit;

    id |= cur_lru;
    pthread_mutex_lock(&lru_locks[id]);
    search = tails[id];
    /* We walk up *only* for locked items, and if bottom is expired. */
    for (; tries > 0 && search != NULL; tries--, search=next_it) {
        /* we might relink search mid-loop, so search->prev isn't reliable */
        next_it = search->prev;
        if (search->nbytes == 0 && search->nkey == 0 && search->it_flags == 1) {
            /* We are a crawler, ignore it. */
            tries++;
            continue;
        }
        uint32_t hv = hash(ITEM_key(search), search->nkey);
        /* Attempt to hash item lock the "search" item. If locked, no
         * other callers can incr the refcount. Also skip ourselves. */
        if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)
            continue;
        /* Now see if the item is refcount locked */
        if (refcount_incr(&search->refcount) != 2) {
            /* Note pathological case with ref'ed items in tail.
             * Can still unlink the item, but it won't be reusable yet */
            itemstats[id].lrutail_reflocked++;
            /* In case of refcount leaks, enable for quick workaround. */
            /* WARNING: This can cause terrible corruption */
            if (settings.tail_repair_time &&
                    search->time + settings.tail_repair_time < current_time) {
                itemstats[id].tailrepairs++;
                search->refcount = 1;
                /* An item whose refcount was reset to 1 is unlinked so it can be reclaimed */
                do_item_unlink_nolock(search, hv);
                item_trylock_unlock(hold_lock);
                continue;
            }
        }

        /* Expired or flushed */
        if ((search->exptime != 0 && search->exptime < current_time)
            || is_flushed(search)) {
            itemstats[id].reclaimed++;
            if ((search->it_flags & ITEM_FETCHED) == 0) {
                itemstats[id].expired_unfetched++;
            }
            /* refcnt 2 -> 1 */
            do_item_unlink_nolock(search, hv);
            /* refcnt 1 -> 0 -> item_free */
            do_item_remove(search);
            item_trylock_unlock(hold_lock);
            removed++;

            /* If all we're finding are expired, can keep going */
            continue;
        }

        /* If we're HOT_LRU or WARM_LRU and over size limit, send to COLD_LRU.
         * If we're COLD_LRU, send to WARM_LRU unless we need to evict
         */
        switch (cur_lru) {
            case HOT_LRU:
                limit = total_chunks * settings.hot_lru_pct / 100;
                /* note: control falls through into the WARM_LRU case */
            case WARM_LRU:
                limit = total_chunks * settings.warm_lru_pct / 100;
                if (sizes[id] > limit) {
                    itemstats[id].moves_to_cold++;
                    move_to_lru = COLD_LRU;
                    do_item_unlink_q(search);
                    it = search;
                    removed++;
                    break;
                } else if ((search->it_flags & ITEM_ACTIVE) != 0) {
                    /* Only allow ACTIVE relinking if we're not too large. */
                    itemstats[id].moves_within_lru++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    do_item_update_nolock(search);
                    do_item_remove(search);
                    item_trylock_unlock(hold_lock);
                } else {
                    /* Don't want to move to COLD, not active, bail out */
                    it = search;
                }
                break;
            case COLD_LRU:
                it = search; /* No matter what, we're stopping */
                if (do_evict) {
                    if (settings.evict_to_free == 0) {
                        /* Don't think we need a counter for this. It'll OOM. */
                        break;
                    }
                    itemstats[id].evicted++;
                    itemstats[id].evicted_time = current_time - search->time;
                    if (search->exptime != 0)
                        itemstats[id].evicted_nonzero++;
                    if ((search->it_flags & ITEM_FETCHED) == 0) {
                        itemstats[id].evicted_unfetched++;
                    }
                    do_item_unlink_nolock(search, hv);
                    removed++;
                } else if ((search->it_flags & ITEM_ACTIVE) != 0
                        && settings.lru_maintainer_thread) {
                    itemstats[id].moves_to_warm++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    move_to_lru = WARM_LRU;
                    do_item_unlink_q(search);
                    removed++;
                }
                break;
        }
        if (it != NULL)
            break;
    }

    pthread_mutex_unlock(&lru_locks[id]);

    if (it != NULL) {
        if (move_to_lru) {
            it->slabs_clsid = ITEM_clsid(it);
            it->slabs_clsid |= move_to_lru;
            item_link_q(it);
        }
        do_item_remove(it);
        item_trylock_unlock(hold_lock);
    }

    return removed;
}


This article comes from the NetEase Practitioner Community (网易实践者社区) and is published with the authorization of its author, 吕宗胜.