Memcached Memory Management

Posted by 猪小花1号 on 2018-09-11 08:57

Author: 吕宗胜


1. Introduction to Memcached

Memcached is a distributed in-memory object caching system used to reduce database load and improve system performance. The memory management described in this article is based on Memcached 1.4.24. Older versions manage memory somewhat differently, and this article does not cover them.

2. The Memcached Model

Before diving into the source code of Memcached's memory management, let's first introduce some of its key concepts.

  1. Slab

    A Slab is a block of memory allocated by Memcached, 1 MB in size by default. The Slab is the smallest unit in which Memcached allocates memory.

  2. Chunk

    While the Slab is the smallest unit Memcached allocates, each Slab is further divided into Chunks. The Chunk is the smallest unit in which Memcached stores data, and one Chunk holds at most one object. All Chunks within a single Slab have the same size.

  3. Item

    An Item is the actual piece of data stored in Memcached. The Item itself is a fairly complex structure: besides the object's key-value pair and its expiration time, it carries several other fields, which are described in detail below. Memcached stores each Item in one Chunk of the corresponding Slab.

  4. SlabClass

    From the concepts above we know that Memcached allocates memory in Slabs and divides each Slab into equal-sized Chunks. How, then, are Items of different sizes stored? A SlabClass groups Slabs by Chunk size: each SlabClass corresponds to one kind of Slab, and every Slab belonging to the same SlabClass is divided into Chunks of the same size.

Here is a reasonably vivid analogy. Memcached's memory allocation can be compared to the ruled exercise books we used at school. A Slab is one page of the book, a Chunk is one of the squares on that page, an Item is the character we write, and a SlabClass is a whole exercise book. All the squares in one book are the same size, so to write characters of different sizes we simply pick the book whose squares fit the character best.
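
To make the "pick the right notebook" step concrete: given an item's total size, Memcached looks up the smallest slab class whose chunk is large enough. The lookup lives in slabs_clsid() in slabs.c; the sketch below is a paraphrase of that linear scan rather than the verbatim 1.4.24 source.

/* Paraphrased sketch of slabs_clsid(): return the id of the smallest slab
 * class whose chunk can hold "size" bytes, or 0 if nothing is big enough. */
unsigned int slabs_clsid(const size_t size) {
    int res = POWER_SMALLEST;           /* slab class ids start at 1 */

    if (size == 0)
        return 0;
    while (size > slabclass[res].size)  /* walk up until the chunk fits */
        if (res++ == power_largest)     /* won't fit even in the biggest class */
            return 0;
    return res;
}

Since the number of slab classes is small, this linear scan is cheap.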

3. Memcached Data Structures

The most fundamental data structure in Memcached is slabclass_t, which describes a single slabclass. Its definition is as follows:

typedef struct {
    unsigned int size;      /* size of the chunks in this class; fixed once set */
    unsigned int perslab;   /* how many chunks (objects) fit in one slab */

    void *slots;            /* linked list of all free chunks in this slabclass */
    unsigned int sl_curr;   /* total number of free items, i.e. the length of slots */

    unsigned int slabs;     /* number of slabs allocated for this slabclass so far */

    void **slab_list;       /* array of slab pointers */
    unsigned int list_size; /* capacity of slab_list (not the number of allocated slabs) */

    unsigned int killing;   /* index+1 of dying slab, or zero if none */
    size_t requested;       /* number of bytes actually requested from this class */
} slabclass_t;

Here slots, slabs and slab_list deserve special attention. slots is the list of all free Chunks in the SlabClass and is the entry point of Memcached's memory allocation. Free chunks come from two sources: the free Chunks of a newly allocated Slab, and Chunks reclaimed from expired Items. slabs records how many Slabs of this SlabClass have actually been given memory so far. slab_list is an array of pointers to all Slabs of the SlabClass; note that the array may have more entries than the number of Slabs that have actually been allocated.
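
To make the role of slots concrete, here is a simplified sketch modeled on the free-list path of do_slabs_alloc(), not the verbatim source; the helper name alloc_from_slots is made up for illustration. Allocation pops the first free chunk off slots, clears its ITEM_SLABBED flag and updates the counters.

/* Simplified sketch of the free-list path in do_slabs_alloc(); the function
 * name is hypothetical. Pops the first free chunk off slots and returns it. */
static void *alloc_from_slots(slabclass_t *p, const size_t size) {
    item *it;

    if (p->sl_curr == 0)                /* no free chunk left in this class */
        return NULL;

    it = (item *)p->slots;              /* pop the head of the free list */
    p->slots = it->next;
    if (it->next) it->next->prev = 0;

    it->it_flags &= ~ITEM_SLABBED;      /* no longer "free on a slab" */
    it->refcount = 1;                   /* seed the reference count */
    p->sl_curr--;                       /* one fewer free chunk */
    p->requested += size;               /* account for the requested bytes */
    return (void *)it;
}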

Next, let's look at the structure of Item, the object in which Memcached stores its data:

typedef struct _stritem {
    /* Protected by LRU locks */
    struct _stritem *next;      /* next item in a linked list: either the next free
                                 * item on slots, or the next item in an LRU queue */
    struct _stritem *prev;
    /* Rest are protected by an item lock */
    struct _stritem *h_next;    /* next item in the hash chain (items with the same hash value) */
    rel_time_t      time;       /* time of the most recent access */
    rel_time_t      exptime;    /* expiration time */
    int             nbytes;     /* size of the data */
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    uint8_t         slabs_clsid;/* which slab class this item belongs to */
    uint8_t         nkey;       /* length of the key */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        uint64_t cas;
        char end;
    } data[];
    /* if it_flags & ITEM_CAS (ITEM_CAS == 2) we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;

The Item is the basic unit in which Memcached stores data; here the data[] array deserves special attention. Its contents consist of four parts: an optional 8-byte CAS value, the key terminated by '\0', the " flags length\r\n" suffix, and finally the binary data itself, terminated by \r\n.
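
The offsets into data[] are computed by helper macros in memcached.h; paraphrased (whitespace adjusted), they look roughly like the following. The "+ 1" skips the '\0' that terminates the key.

/* Paraphrased from memcached.h: locate the key, the suffix and the payload
 * inside item->data[], skipping the optional 8-byte CAS value. */
#define ITEM_key(item)    (((char*)&((item)->data)) \
        + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

#define ITEM_suffix(item) ((char*)&((item)->data) + (item)->nkey + 1 \
        + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

#define ITEM_data(item)   ((char*)&((item)->data) + (item)->nkey + 1 \
        + (item)->nsuffix \
        + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))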

Having covered SlabClass and Item, let's look at Memcached's LRU implementation. When memory runs short, data inevitably has to be swapped out, and Memcached, being an in-memory store, uses an LRU policy to decide what to replace. Each SlabClass maintains its own LRU queue, with head and tail pointers marking the two ends of the queue; the Item at the tail is the least recently used and is the first to be evicted.
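
A minimal sketch of that per-class queue, assuming the heads[] and tails[] arrays indexed by slab-class id that items.c keeps (the helper below is simplified from item_link_q(), not copied verbatim): new items are linked at the head, and eviction scans from the tail.

/* Simplified sketch of per-slabclass LRU linking, modeled on item_link_q():
 * every class id owns a doubly linked queue with its own head and tail. */
static item *heads[LARGEST_ID];   /* newest item of each LRU queue */
static item *tails[LARGEST_ID];   /* oldest item of each LRU queue */

static void lru_link_head(item *it, const int id) {
    it->prev = 0;
    it->next = heads[id];               /* newest item goes to the front */
    if (it->next) it->next->prev = it;
    heads[id] = it;
    if (tails[id] == 0) tails[id] = it; /* the first item is also the tail */
}

Eviction then starts from tails[id], which is exactly where lru_pull_tail() below begins its search.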

This gives a fairly simple picture of how a slabclass manages its memory, but the actual LRU implementation is considerably more complex and finer-grained. LRU in Memcached works per SlabClass rather than across the whole cache, which is a consequence of how its memory is allocated and managed; memory can, however, be rebalanced between SlabClasses through slab reassignment.

Memcached's way of allocating and managing memory avoids fragmentation, but the trade-off is wasted memory. For example: suppose an Item is 50 KB and the best-fitting Chunk for it is 80 KB; storing it wastes 30 KB.
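
A few lines of arithmetic make the cost of that hypothetical 50 KB item in an 80 KB chunk explicit:

#include <stdio.h>

/* Worked example for the fragmentation discussion above: a 50 KB item stored
 * in an 80 KB chunk wastes 30 KB, i.e. 37.5% of the chunk. */
int main(void) {
    const double item_kb  = 50.0;    /* size of the stored item */
    const double chunk_kb = 80.0;    /* best-fitting chunk in its slab class */
    double wasted = chunk_kb - item_kb;
    printf("wasted: %.1f KB (%.1f%% of the chunk)\n",
           wasted, 100.0 * wasted / chunk_kb);
    return 0;
}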

4. Source Code Analysis

With the memory model described above in mind, let's now look at how Memcached allocates memory at the code level.

Initialization of the slabclass array:

/* The slabclass array is initialized at system start-up */
static slabclass_t slabclass[MAX_NUMBER_OF_SLAB_CLASSES];
void slabs_init(const size_t limit, const double factor, const bool prealloc) {
    int i = POWER_SMALLEST - 1; // index from which the slabclass array starts being filled
    unsigned int size = sizeof(item) + settings.chunk_size; // smallest chunk size: the item header (metadata) plus chunk_size bytes of user data

    mem_limit = limit; // maximum amount of memory that may be allocated

    if (prealloc) { // this branch runs only when memory preallocation is requested
        /* Allocate everything in a big chunk with malloc */
        mem_base = malloc(mem_limit);
        if (mem_base != NULL) {
            mem_current = mem_base;
            mem_avail = mem_limit;
        } else {
            fprintf(stderr, "Warning: Failed to allocate requested memory in"
                    " one large chunk.\nWill allocate in smaller chunks\n");
        }
    }

    memset(slabclass, 0, sizeof(slabclass));

    while (++i < MAX_NUMBER_OF_SLAB_CLASSES-1 && size <= settings.item_size_max / factor) {
        /* Make sure objects stored in Memcached are always aligned; by default 8-byte alignment is used */
        if (size % CHUNK_ALIGN_BYTES)
            size += CHUNK_ALIGN_BYTES - (size % CHUNK_ALIGN_BYTES);

        /* Record the chunk size of this slabclass and how many chunks fit in one slab */
        slabclass[i].size = size;
        slabclass[i].perslab = settings.item_size_max / slabclass[i].size;
        /* The chunk size of the next slabclass grows by the factor parameter;
         * tuning factor controls how Memcached partitions its storage */
        size *= factor;
        if (settings.verbose > 1) {
            fprintf(stderr, "slab class %3d: chunk size %9u perslab %7u\n",
                    i, slabclass[i].size, slabclass[i].perslab);
        }
    }

    /* power_largest is the last slabclass; its chunk size is item_size_max, so it can hold
     * the largest object we accept. One slab (1 MB by default) holds exactly one such item. */
    power_largest = i;
    slabclass[power_largest].size = settings.item_size_max;
    slabclass[power_largest].perslab = 1;
    if (settings.verbose > 1) {
        fprintf(stderr, "slab class %3d: chunk size %9u perslab %7u\n",
                i, slabclass[i].size, slabclass[i].perslab);
    }

    /* for the test suite:  faking of how much we've already malloc'd */
    {
        char *t_initial_malloc = getenv("T_MEMD_INITIAL_MALLOC");
        if (t_initial_malloc) {
            mem_malloced = (size_t)atol(t_initial_malloc);
        }

    }

    /* If preallocation is requested, allocate one slab for each slabclass up front */
    if (prealloc) {
        slabs_preallocate(power_largest);
    }
}
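
The prealloc branch above calls slabs_preallocate(), which the article does not list. Paraphrased, and matching the behavior noted in the comment above (one slab per slabclass), it does roughly the following; treat this as a sketch rather than the verbatim 1.4.24 source.

/* Paraphrased sketch of slabs_preallocate(): allocate one slab up front for
 * every slab class so that early stores do not fail with a confusing
 * out-of-memory error. */
static void slabs_preallocate(const unsigned int maxslabs) {
    int i;
    unsigned int prealloc = 0;

    for (i = POWER_SMALLEST; i < MAX_NUMBER_OF_SLAB_CLASSES; i++) {
        if (++prealloc > maxslabs)
            return;
        if (do_slabs_newslab(i) == 0) {
            fprintf(stderr, "Error while preallocating slab memory!\n");
            exit(1);
        }
    }
}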

From the slabclass initialization code we can see that each slabclass stores items of a fixed size, and that the item size grows from one slabclass to the next according to the factor parameter (1.25 by default). The value of factor affects memory utilization and needs to be tuned for the application.
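
A small standalone program makes the effect of factor visible by replaying the sizing loop above. It assumes a 48-byte item header purely for illustration; the real header size is sizeof(item) and depends on the build, so the chunk sizes printed by a live server started with -vv will differ slightly.

#include <stdio.h>

/* Standalone sketch of the sizing loop in slabs_init(): print the chunk-size
 * progression for a given growth factor. The 48-byte header is an assumption
 * made for illustration only. */
int main(void) {
    const double factor = 1.25;                      /* default growth factor (-f) */
    const unsigned int item_size_max = 1024 * 1024;  /* one slab is 1 MB */
    unsigned int size = 48 + 48;   /* assumed item header + default chunk_size */
    int i = 0;

    while (size <= item_size_max / factor) {
        if (size % 8)                                /* CHUNK_ALIGN_BYTES == 8 */
            size += 8 - (size % 8);
        printf("slab class %3d: chunk size %9u perslab %7u\n",
               ++i, size, item_size_max / size);
        size *= factor;
    }
    printf("slab class %3d: chunk size %9u perslab %7u\n",
           ++i, item_size_max, 1u);                  /* the final, largest class */
    return 0;
}

Raising factor (the -f command-line option) produces fewer, more widely spaced classes; lowering it produces finer-grained classes at the cost of more classes to manage.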

The slabclass initialization above refers to a number of global settings; they are defined as follows:

static void settings_init(void) {
    settings.use_cas = true;
    settings.access = 0700;
    settings.port = 11211;
    settings.udpport = 11211;
    /* By default this string should be NULL for getaddrinfo() */
    settings.inter = NULL;
    settings.maxbytes = 64 * 1024 * 1024; /* default is 64MB */
    settings.maxconns = 1024;         /* to limit connections-related memory to about 5MB */
    settings.verbose = 0;
    settings.oldest_live = 0;
    settings.oldest_cas = 0;          /* supplements accuracy of oldest_live */
    settings.evict_to_free = 1;       /* push old items out of cache when memory runs out */
    settings.socketpath = NULL;       /* by default, not using a unix socket */
    settings.factor = 1.25;
    settings.chunk_size = 48;         /* initial chunk_size: minimum space for key + value + flags */
    settings.num_threads = 4;         /* N workers */
    settings.num_threads_per_udp = 0;
    settings.prefix_delimiter = ':';
    settings.detail_enabled = 0;
    settings.reqs_per_event = 20;
    settings.backlog = 1024;
    settings.binding_protocol = negotiating_prot;
    settings.item_size_max = 1024 * 1024; /* 1 MB limit: the maximum size of an object Memcached will cache */
    settings.maxconns_fast = false;
    settings.lru_crawler = false;
    settings.lru_crawler_sleep = 100;
    settings.lru_crawler_tocrawl = 0;
    settings.lru_maintainer_thread = false;
    settings.hot_lru_pct = 32;
    settings.warm_lru_pct = 32;
    settings.expirezero_does_not_evict = false;
    settings.hashpower_init = 0;
    settings.slab_reassign = false;
    settings.slab_automove = 0;
    settings.shutdown_command = false;
    settings.tail_repair_time = TAIL_REPAIR_TIME_DEFAULT;
    settings.flush_enabled = true;
    settings.crawls_persleep = 1000;
}

Having covered SlabClass initialization, let's look at how each individual slab is allocated.

static int do_slabs_newslab(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    /* If slab reassignment is enabled, always allocate slabs of the maximum size so pages can be moved between classes */
    int len = settings.slab_reassign ? settings.item_size_max
        : p->size * p->perslab;
    char *ptr;

    if ((mem_limit && mem_malloced + len > mem_limit && p->slabs > 0)) {
        mem_limit_reached = true;
        MEMCACHED_SLABS_SLABCLASS_ALLOCATE_FAILED(id);
        return 0;
    }
    /* grow_slab_list makes sure slab_list has room for one more slab pointer: if slabs < list_size there is a free entry, otherwise the array is doubled (it starts at 16 entries) */
    if ((grow_slab_list(id) == 0) ||
        ((ptr = memory_allocate((size_t)len)) == 0)) {

        MEMCACHED_SLABS_SLABCLASS_ALLOCATE_FAILED(id);
        return 0;
    }

    memset(ptr, 0, (size_t)len);
    /* Split the newly allocated slab (ptr) into free chunks and put them on the free list */
    split_slab_page_into_freelist(ptr, id);

    p->slab_list[p->slabs++] = ptr;
    mem_malloced += len;
    MEMCACHED_SLABS_SLABCLASS_ALLOCATE(id);

    return 1;
}
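
do_slabs_newslab relies on grow_slab_list to make sure slab_list has a free entry before a new slab pointer is stored. Paraphrased (not the verbatim source), that helper looks roughly like this:

/* Paraphrased sketch of grow_slab_list(): once every entry of slab_list is in
 * use, double the array (it starts out with 16 entries). */
static int grow_slab_list(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    if (p->slabs == p->list_size) {              /* the pointer array is full */
        size_t new_size = (p->list_size != 0) ? p->list_size * 2 : 16;
        void *new_list = realloc(p->slab_list, new_size * sizeof(void *));
        if (new_list == 0) return 0;             /* out of memory: report failure */
        p->list_size = new_size;
        p->slab_list = new_list;
    }
    return 1;
}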

When a SlabClass requests a new Slab, it first makes sure slab_list has room for it (growing the array once slabs reaches list_size) and then allocates the memory for the Slab. Once the Slab has been allocated, Memcached divides it into individual free Chunks. Let's look at how those free chunks are produced.

/* Split a freshly allocated slab into p->perslab free chunks */
static void split_slab_page_into_freelist(char *ptr, const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    int x;
    for (x = 0; x < p->perslab; x++) {
        do_slabs_free(ptr, 0, id);   /* put this chunk onto the free list */
        ptr += p->size;              /* advance to the next chunk */
    }
}

/* Initialize a single chunk as a free item */
static void do_slabs_free(void *ptr, const size_t size, unsigned int id) {
    slabclass_t *p;
    item *it;

    assert(id >= POWER_SMALLEST && id <= power_largest);
    if (id < POWER_SMALLEST || id > power_largest)
        return;

    MEMCACHED_SLABS_FREE(size, id, ptr);
    p = &slabclass[id];
    /* Push the newly created free item onto the head of the free list (slots) */
    it = (item *)ptr;
    it->it_flags |= ITEM_SLABBED;
    it->slabs_clsid = 0;
    it->prev = 0;
    it->next = p->slots;
    if (it->next) it->next->prev = it;
    p->slots = it;

    p->sl_curr++;
    p->requested -= size;
    return;
}

So every time Memcached obtains a new slab, it divides the slab into items and records them in a linked list. At this point all of Memcached's memory has been initialized: the slabclass array has been built, each slabclass has its slab_list (with a default initial capacity of 16 slab pointers), the chunk size within each slabclass is fixed, the number of chunks per slab is fixed, and every chunk of a newly allocated slab has been initialized as a free Item. All the free Items of a slabclass are linked together on the slots list; this list spans all slabs of the class, so every free chunk in any of its slabs can be found there, in the order it was added.
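
As a small illustration of that layout, a hypothetical debug helper (not part of Memcached) could walk the slots list of one class and check that the number of free chunks matches the sl_curr counter:

/* Hypothetical debug helper, not part of Memcached: walk the free list of one
 * slab class and verify that its length matches the sl_curr counter. */
static void check_free_list(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    unsigned int count = 0;
    item *it;

    for (it = (item *)p->slots; it != NULL; it = it->next)
        count++;                      /* every node on slots is one free chunk */
    if (count != p->sl_curr)
        fprintf(stderr, "slabclass %u: slots has %u entries but sl_curr is %u\n",
                id, count, p->sl_curr);
}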

Having covered slab allocation, let's now focus on how an Item obtains its memory. This function can be considered the core entry point of Memcached's memory allocation; the code is as follows:

#define HOT_LRU 0
#define WARM_LRU 64
#define COLD_LRU 128
#define NOEXP_LRU 192
item *do_item_alloc(char *key, const size_t nkey, const int flags,
                    const rel_time_t exptime, const int nbytes,
                    const uint32_t cur_hv) {
    int i;
    uint8_t nsuffix;
    item *it = NULL;
    char suffix[40];
    unsigned int total_chunks;
    size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix); /* compute the total size of the item */
    if (settings.use_cas) {
        ntotal += sizeof(uint64_t);
    }

    /* Pick the slabclass according to the total size to be stored; slabclass ids start at 1 */
    unsigned int id = slabs_clsid(ntotal);
    if (id == 0)
        return 0;

    /* If no memory is available, attempt a direct LRU juggle/eviction */
    /* This is a race in order to simplify lru_pull_tail; in cases where
     * locked items are on the tail, you want them to fall out and cause
     * occasional OOM's, rather than internally work around them.
     * This also gives one fewer code path for slab alloc/free
     */
    for (i = 0; i < 5; i++) {
        /* First try to reclaim memory; lru_maintainer_thread is false by default */
        if (!settings.lru_maintainer_thread) {
            lru_pull_tail(id, COLD_LRU, 0, false, cur_hv);
        }
        /* Allocate memory for the Item from the slab allocator */
        it = slabs_alloc(ntotal, id, &total_chunks);
        if (settings.expirezero_does_not_evict)
            total_chunks -= noexp_lru_size(id);
        /* If allocation failed, pull items off the LRU tails to free memory, then retry */
        if (it == NULL) {
            if (settings.lru_maintainer_thread) {
                lru_pull_tail(id, HOT_LRU, total_chunks, false, cur_hv);
                lru_pull_tail(id, WARM_LRU, total_chunks, false, cur_hv);
                lru_pull_tail(id, COLD_LRU, total_chunks, true, cur_hv);
            } else {
                lru_pull_tail(id, COLD_LRU, 0, true, cur_hv);
            }
        } else {
            break;
        }
    }

    if (i > 0) {
        pthread_mutex_lock(&lru_locks[id]);
        itemstats[id].direct_reclaims += i;
        pthread_mutex_unlock(&lru_locks[id]);
    }
    /* Not enough memory to store the item */
    if (it == NULL) {
        pthread_mutex_lock(&lru_locks[id]);
        itemstats[id].outofmemory++;
        pthread_mutex_unlock(&lru_locks[id]);
        return NULL;
    }

    assert(it->slabs_clsid == 0);
    //assert(it != heads[id]);

    /* Refcount is seeded to 1 by slabs_alloc() */
    it->next = it->prev = it->h_next = 0;
    /* Items are initially loaded into the HOT_LRU. This is '0' but I want at
     * least a note here. Compiler (hopefully?) optimizes this out.
     */
    if (settings.lru_maintainer_thread) {
        if (exptime == 0 && settings.expirezero_does_not_evict) {
            id |= NOEXP_LRU;
        } else {
            id |= HOT_LRU;
        }
    } else {
        /* There is only COLD in compat-mode */
        id |= COLD_LRU;
    }
    it->slabs_clsid = id;

    DEBUG_REFCNT(it, '*');
    it->it_flags = settings.use_cas ? ITEM_CAS : 0;
    it->nkey = nkey;
    it->nbytes = nbytes;
    memcpy(ITEM_key(it), key, nkey);
    it->exptime = exptime;
    memcpy(ITEM_suffix(it), suffix, (size_t)nsuffix);
    it->nsuffix = nsuffix;
    return it;
}

As this code shows, item allocation first consults the LRU and frees expired Items from it, then allocates memory through slabs_alloc; if there still is not enough memory, it frees LRU memory again, this time allowed to evict least-recently-used objects that have not yet expired. Let's first look at the implementation of lru_pull_tail:

/* Returns number of items removed, expired, or evicted.
 * Callable from worker threads or the LRU maintainer thread */
static int lru_pull_tail(const int orig_id, const int cur_lru,
        const unsigned int total_chunks, const bool do_evict, const uint32_t cur_hv) {
    item *it = NULL;
    int id = orig_id;
    int removed = 0;
    if (id == 0)
        return 0;

    int tries = 5;
    item *search;
    item *next_it;
    void *hold_lock = NULL;
    unsigned int move_to_lru = 0;
    uint64_t limit;

    id |= cur_lru;
    pthread_mutex_lock(&lru_locks[id]);
    search = tails[id];
    /* We walk up *only* for locked items, and if bottom is expired. */
    for (; tries > 0 && search != NULL; tries--, search=next_it) {
        /* we might relink search mid-loop, so search->prev isn't reliable */
        next_it = search->prev;
        if (search->nbytes == 0 && search->nkey == 0 && search->it_flags == 1) {
            /* We are a crawler, ignore it. */
            tries++;
            continue;
        }
        uint32_t hv = hash(ITEM_key(search), search->nkey);
        /* Attempt to hash item lock the "search" item. If locked, no
         * other callers can incr the refcount. Also skip ourselves. */
        if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)
            continue;
        /* Now see if the item is refcount locked */
        if (refcount_incr(&search->refcount) != 2) {
            /* Note pathological case with ref'ed items in tail.
             * Can still unlink the item, but it won't be reusable yet */
            itemstats[id].lrutail_reflocked++;
            /* In case of refcount leaks, enable for quick workaround. */
            /* WARNING: This can cause terrible corruption */
            if (settings.tail_repair_time &&
                    search->time + settings.tail_repair_time < current_time) {
                itemstats[id].tailrepairs++;
                search->refcount = 1;
                /* An item whose refcount was reset to 1 is unlinked so it can be reclaimed */
                do_item_unlink_nolock(search, hv);
                item_trylock_unlock(hold_lock);
                continue;
            }
        }

        /* Expired or flushed */
        if ((search->exptime != 0 && search->exptime < current_time)
            || is_flushed(search)) {
            itemstats[id].reclaimed++;
            if ((search->it_flags & ITEM_FETCHED) == 0) {
                itemstats[id].expired_unfetched++;
            }
            /* refcnt 2 -> 1 */
            do_item_unlink_nolock(search, hv);
            /* refcnt 1 -> 0 -> item_free */
            do_item_remove(search);
            item_trylock_unlock(hold_lock);
            removed++;

            /* If all we're finding are expired, can keep going */
            continue;
        }

        /* If we're HOT_LRU or WARM_LRU and over size limit, send to COLD_LRU.
         * If we're COLD_LRU, send to WARM_LRU unless we need to evict
         */
        switch (cur_lru) {
            case HOT_LRU:
                limit = total_chunks * settings.hot_lru_pct / 100;
                /* note: control falls through into the WARM_LRU case */
            case WARM_LRU:
                limit = total_chunks * settings.warm_lru_pct / 100;
                if (sizes[id] > limit) {
                    itemstats[id].moves_to_cold++;
                    move_to_lru = COLD_LRU;
                    do_item_unlink_q(search);
                    it = search;
                    removed++;
                    break;
                } else if ((search->it_flags & ITEM_ACTIVE) != 0) {
                    /* Only allow ACTIVE relinking if we're not too large. */
                    itemstats[id].moves_within_lru++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    do_item_update_nolock(search);
                    do_item_remove(search);
                    item_trylock_unlock(hold_lock);
                } else {
                    /* Don't want to move to COLD, not active, bail out */
                    it = search;
                }
                break;
            case COLD_LRU:
                it = search; /* No matter what, we're stopping */
                if (do_evict) {
                    if (settings.evict_to_free == 0) {
                        /* Don't think we need a counter for this. It'll OOM. */
                        break;
                    }
                    itemstats[id].evicted++;
                    itemstats[id].evicted_time = current_time - search->time;
                    if (search->exptime != 0)
                        itemstats[id].evicted_nonzero++;
                    if ((search->it_flags & ITEM_FETCHED) == 0) {
                        itemstats[id].evicted_unfetched++;
                    }
                    do_item_unlink_nolock(search, hv);
                    removed++;
                } else if ((search->it_flags & ITEM_ACTIVE) != 0
                        && settings.lru_maintainer_thread) {
                    itemstats[id].moves_to_warm++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    move_to_lru = WARM_LRU;
                    do_item_unlink_q(search);
                    removed++;
                }
                break;
        }
        if (it != NULL)
            break;
    }

    pthread_mutex_unlock(&lru_locks[id]);

    if (it != NULL) {
        if (move_to_lru) {
            it->slabs_clsid = ITEM_clsid(it);
            it->slabs_clsid |= move_to_lru;
            item_link_q(it);
        }
        do_item_remove(it);
        item_trylock_unlock(hold_lock);
    }

    return removed;
}


This article comes from the NetEase Practitioner Community (网易实践者社区) and is published with the authorization of its author, 吕宗胜.