An In-Depth Analysis of MySQL Group Commit Implementation (Part 1)

达芬奇密码, 2018-08-15 11:14

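Anyone who has run products on Oracle MySQL 5.5 or earlier probably still remembers how sharply performance dropped when both sync_binlog and innodb_flush_log_at_trx_commit were set to 1. Together, these two parameters ensure that when a transaction commits, its binary log and redo log have both been flushed to disk. On the one hand, this guarantees that committed transactions are not lost during crash recovery; on the other hand, the cost of the sync severely limits the transaction commit rate.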

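For the sake of data reliability, many products had no choice but to accept poor performance and to trade performance off against reliability. This in turn pushed MySQL developers to look at optimizing the binary log sync. InnoDB's redo log sync was made far more efficient by group commit, but doing the same for the binary log is considerably more complex. Kristian Nielsen of MariaDB was the first to implement binary log group commit, in MariaDB 5.3; Oracle MySQL adopted the feature in 5.6.6 and optimized it further. Combining theory with source code, this post tries to explain, at the level of the actual implementation, how MySQL binary log group commit works.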

Note: the source code quoted in this post is taken from version 5.5.40 for Oracle MySQL 5.5, version 5.6.23 for Oracle MySQL 5.6, and version 5.5.30 v5 for InnoSQL.

Why introduce group commit?

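For performance reasons, almost every transactional database system uses WAL (write-ahead logging): every update is recorded in a transaction log, and it is this log that guarantees durability. An update only has to modify the relevant data in memory and write the transaction log to external storage before the commit can be acknowledged to the user; the dirty pages in memory are flushed to external storage asynchronously.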

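In MySQL's InnoDB transactional engine, the redo log is written through the operating system's buffered I/O, so every transaction commit needs an fsync() to make sure the buffered contents are actually flushed to the file on disk. On disk devices, however, fsync is a very expensive operation, especially on ordinary mechanical disks: even at 15000 rpm, the maximum IOPS under fully random access is only about 200.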
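As a rough back-of-the-envelope check on that figure, the sketch below estimates random IOPS as one seek plus half a revolution per I/O. The function names and the 3 ms average seek time are assumptions made for illustration; they do not come from the article or from any MySQL source.

```cpp
#include <cassert>

// Average rotational latency in milliseconds: on average the head waits
// half a revolution for the target sector.
double avgRotationalLatencyMs(double rpm) {
    assert(rpm > 0.0);
    return 0.5 * 60.0 * 1000.0 / rpm;
}

// Rough upper bound on random IOPS for a single spindle.
// avgSeekMs is an assumed drive parameter, not a figure from the article.
double maxRandomIops(double rpm, double avgSeekMs) {
    assert(avgSeekMs >= 0.0);
    return 1000.0 / (avgSeekMs + avgRotationalLatencyMs(rpm));
}
```

With 15000 rpm and the assumed 3 ms seek, each random I/O costs about 5 ms, which is consistent with the ceiling of roughly 200 IOPS quoted above. If every commit needs its own fsync, the commit rate is bounded by the same number.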

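InnoDB implements group commit on the redo log, that is, several transaction commits are made durable by a single fsync. All committing threads that need an fsync enter a queue; the thread at the head of the queue performs the fsync while the others wait on a condition. When the head finishes the fsync, the other threads are woken and complete their commits, so they do essentially no work themselves. Threads that arrive while the previous head is still in fsync form a new queue, and once that fsync completes, the head of the new queue performs the fsync on behalf of its whole queue, and so on. This greatly reduces the number of fsync calls; on mechanical disks the main saving is in seek time.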
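The batching effect described above can be illustrated with a small deterministic simulation. This is only a sketch under simplifying assumptions (a single log device, a fixed fsync duration, commits identified by their arrival time); the function name is hypothetical and it is not how InnoDB itself is written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of fsync calls needed when commits arriving at the given times
// (milliseconds, sorted ascending) are batched by group commit.
// fsyncMs is the assumed duration of one fsync.
int countGroupCommitFsyncs(const std::vector<double>& arrivalsMs, double fsyncMs) {
    assert(std::is_sorted(arrivalsMs.begin(), arrivalsMs.end()));
    assert(fsyncMs > 0.0);
    int fsyncs = 0;
    double diskFreeAt = 0.0;  // time at which the previous fsync finishes
    std::size_t i = 0;
    while (i < arrivalsMs.size()) {
        // The head of the queue starts once it has arrived and the disk is free.
        double start = std::max(diskFreeAt, arrivalsMs[i]);
        // Every commit that has arrived by then is covered by the same fsync.
        while (i < arrivalsMs.size() && arrivalsMs[i] <= start) ++i;
        diskFreeAt = start + fsyncMs;
        ++fsyncs;
    }
    return fsyncs;
}
```

For example, ten commits arriving one millisecond apart need only three fsync calls when each fsync takes 5 ms, instead of ten without grouping. The slower the device, the larger each group becomes.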

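The principle behind group commit on the InnoDB redo log is easy to understand. But MySQL has another very important log, the binary log. Can the same principle be applied to it directly? Clearly, the implementation is far more complex. By design, MySQL is a pluggable database system split into a Server layer and a Storage layer, customarily called the upper layer and the lower layer. The upper layer mainly handles SQL parsing, execution plans, replication and similar functions, while the lower layer implements data storage. MySQL allows several storage engines to coexist, and to handle transaction commits that span engines it introduced distributed transactions and the two-phase commit protocol. The transaction commit logic is implemented in the ha_commit_trans method in handler.cc.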

［MySQL 5.5］

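int ha_commit_trans(THD *thd, bool all){
   ......
   // Phase 1: ask every participating storage engine to prepare.
   for (; ha_info && !error; ha_info= ha_info->next()){
        if ((err= ht->prepare(ht, thd, all)))
        {
          ......
        }
    }
   ......
   // Write the transaction to the binary log, which acts as the coordinator.
   cookie= tc_log->log_xid(thd, xid);
   ......
   // Phase 2: commit in every storage engine.
   error=ha_commit_one_phase(thd, all) ? (cookie ? 2 : 1) : 0;
   ......
}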

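The code shows that the upper layer first calls the prepare method of each lower-layer engine in turn. Once all engines have prepared, it calls log_xid; from log.cc we learn that the main job of log_xid is to flush each thread's binlog cache into the binary log file. Finally, ha_commit_one_phase calls the commit method of each engine. In effect, MySQL uses the binary log as the coordinator, and the binary log is authoritative: a transaction recorded in the binary log means every engine has finished prepare, so crash recovery should commit it; conversely, a transaction missing from the binary log means the upper layer never confirmed that all engines had prepared, so crash recovery should roll it back. In workloads where losing a committed transaction is unacceptable, MySQL has to fsync the binary log on every commit, because the binary log is also written through buffered I/O. This is a bottleneck similar to the one on the InnoDB redo log and it hurts performance badly, but it is far from the worst part.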
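The recovery rule in the paragraph above (commit what the binary log recorded, roll back the rest) can be sketched as follows. The types and the function name are hypothetical and chosen for illustration only; the real server works on XID structures and engine callbacks rather than plain integers.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

enum class RecoveryAction { Commit, Rollback };

// Decide the fate of every transaction that an engine reports as prepared
// after a crash. binlogXids is the set of transaction ids found in the
// binary log, which plays the role of the coordinator's record.
std::map<std::uint64_t, RecoveryAction> resolvePrepared(
        const std::vector<std::uint64_t>& preparedXids,
        const std::set<std::uint64_t>& binlogXids) {
    std::map<std::uint64_t, RecoveryAction> decisions;
    for (std::uint64_t xid : preparedXids) {
        // Present in the binlog: all engines had prepared, so commit.
        // Absent: the coordinator never confirmed it, so roll back.
        bool logged = binlogXids.count(xid) > 0;
        decisions[xid] = logged ? RecoveryAction::Commit : RecoveryAction::Rollback;
    }
    return decisions;
}
```

Under this rule a prepared transaction whose binlog write was lost in the crash is rolled back, which is why the binary log itself has to be made durable before commit is acknowledged.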

Looking into the InnoDB code, we find InnoDB's prepare implementation, innobase_xa_prepare, in ha_innodb.cc. It contains the following code:

［MySQL 5.5］

/* For ibbackup to work the order of transactions in binlog
  and InnoDB must be the same. Consider the situation

    thread1> prepare; write to binlog; ...
     
    thread2> prepare; write to binlog; commit
    thread1>	     ... commit

  To ensure this will not happen we're taking the mutex on
  prepare, and releasing it on commit.

  Note: only do it for normal commits, done via ha_commit_trans.
  If 2pc protocol is executed by external transaction
  coordinator, it will be just a regular MySQL client
  executing XA PREPARE and XA COMMIT commands.
  In this case we cannot know how many minutes or hours
  will be between XA PREPARE and XA COMMIT, and we don't want
  to block for undefined period of time. */
  
mysql_mutex_lock(&prepare_commit_mutex);

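To keep the order of transactions recorded in the binary log consistent with the order in which the engine commits them, InnoDB acquires prepare_commit_mutex in its prepare method and releases it only after commit finishes. This means that all three steps of a transaction commit (prepare, write binlog, commit) are executed serially across threads, which in turn reduces InnoDB redo log group commit to a mere decoration. This is the famous Group Commit Bug.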

Group commit implementation based on MySQL 5.5

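Implementing group commit for the binary log poses two main challenges. First, the write order of the binary log must match the write order of the lower-layer redo log, which means that, across threads, the upper-layer binary log writes and the lower-layer redo log writes must be serialized. Second, the fsync of the redo log must not be serialized with the fsync of the binary log, otherwise InnoDB redo log group commit stops working. The first requirement is about reliability and the second about performance; the best solution satisfies both.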

As early as 2010, Kristian Nielsen of MariaDB produced the first binary log group commit design based on MySQL 5.5, implemented in MariaDB. InnoSQL implemented similar functionality following Kristian Nielsen's design.

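Overall, the idea is that prepare in the lower layer no longer acquires prepare_commit_mutex; consistency between the upper-layer binary log and the lower-layer redo log is instead guaranteed by the group commit logic of the binary log. The InnoDB commit method is split into two methods, commit_ordered and commit: the former does all of the engine's commit work except the redo log fsync, and the latter performs the redo log fsync.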

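The binary log group commit logic resembles InnoDB redo log group commit. Every thread entering the write-binary-log stage first queues up, and the head of the queue writes and fsyncs the binary log for all threads. Every other thread, once it has joined the queue and confirmed that a leader exists, waits on a condition until the leader wakes it. Unlike InnoDB group commit, the leader also has to remember the order of the threads in the queue and, after the binary log fsync finishes, call commit_ordered for each thread in that order to complete the redo log write. Because the binary log writes and the redo log writes are both done by the leader alone, in the same order, consistency between the two is guaranteed.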

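Once all of this is done, the leader wakes every thread waiting in the queue. Each woken thread then calls its own commit method to fsync the redo log. The threads run concurrently in this stage, so InnoDB redo log group commit takes effect.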
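The leader's pass over one queue, as described above, can be modeled in a few lines. This is a single-threaded sketch that only records the order of events; the struct and function names are hypothetical, transactions are reduced to integer ids, and the followers' concurrent redo log fsync is not modeled.

```cpp
#include <cassert>
#include <vector>

// Everything the leader's pass over one queue produces.
struct GroupCommitTrace {
    std::vector<int> binlogOrder;  // order in which transactions reach the binary log
    std::vector<int> engineOrder;  // order in which commit_ordered runs in the engine
    int binlogFsyncs = 0;          // number of binary log fsyncs issued
};

// Hypothetical stand-in for the leader. `queue` holds transaction ids in
// the order in which their threads joined the queue.
GroupCommitTrace leaderCommitsGroup(const std::vector<int>& queue) {
    assert(!queue.empty());  // the leader is always a member of its own queue
    GroupCommitTrace trace;
    // Step 1: write every queued transaction's binlog cache, in queue order.
    for (int trx : queue) trace.binlogOrder.push_back(trx);
    // Step 2: a single fsync covers the whole group.
    trace.binlogFsyncs += 1;
    // Step 3: run the fast, ordered part of the engine commit in the same order.
    for (int trx : queue) trace.engineOrder.push_back(trx);
    // Step 4 (not modeled): followers are woken and fsync the redo log concurrently.
    return trace;
}
```

Because one thread performs steps 1 and 3 over the same sequence, the binary log order and the engine commit order cannot diverge, and the group pays for only one binary log fsync regardless of its size.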

Let us now take the InnoSQL code as an example and look at the concrete implementation of binary log group commit on MySQL 5.5. In handler.cc, the main logic of the ha_commit_trans method is:

［InnoSQL］

int ha_commit_trans(THD *thd, bool all){
   ......
   // Phase 1: prepare in every participating engine.
   for (; ha_info && !error; ha_info= ha_info->next()){
        if ((err= ht->prepare(ht, thd, all)))
        {
          ......
        }
    }
   ......
   // Group commit of the binary log.
   tc_log->log_and_order(thd, xid, all, need_commit_ordered);
   ......
   // Second stage of the engine commit.
   error = commit_one_phase_low(thd, all, trans, is_real_trans) ? 2 : 0;
   ......
}

Following the calls made by log_and_order in log.cc, we arrive at bool MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry), whose main logic is:

bool MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
{
  ......
  is_leader= queue_for_group_commit(entry, wfc);
  ......
  /*
    The first in the queue handle group commit for all; the others just wait
    to be signalled when group commit is done.
  */
  if (is_leader)
    trx_group_commit_leader(entry);
  else if (!entry->queued_by_other)
    entry->thd->wait_for_wakeup_ready();
  else
  {
    /*
      If we were queued by another prior commit, then we are woken up
      only when the leader has already completed the commit for us.
      So nothing to do here then.
    */
  }
  ......
}

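As the logic shows, a thread entering the commit stage first checks whether it is the leader: the first thread to enter an empty queue becomes the queue's leader. The leader does all the work, while the other threads do nothing but wait for the leader to wake them. Our focus is therefore the trx_group_commit_leader method: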

void MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader){
    // Queue lock
    mysql_mutex_lock(&LOCK_group_commit_queue);
    ......
    // An external parameter lets the queue wait for more commits, making groups more effective
    if (opt_binlog_commit_wait_count)
      wait_for_sufficient_commits();
    ......
    mysql_mutex_unlock(&LOCK_group_commit_queue);
    // Walk the queue and flush each thread's binlog cache into the binary log file
    for(current = queue; current != NULL; current = current->next)
    {
        current->error = write_transaction(current, commit_id);
    }

    ......
    // Perform a single fsync
    flush_and_sync(&synced);
    ......
    mysql_mutex_lock(&LOCK_commit_ordered);
    ......
    // Call the engine's commit_ordered method for each queued thread; it mainly writes the redo log in the engine layer.
    // Note: current has to point at the head of the queue again here (the reset is elided in this excerpt).
    while(current != NULL)
    {
      group_commit_entry *next;
      run_commit_ordered(current->thd, current->all);
      next = current->next;
      current = next;
    }
    ......
    mysql_mutex_unlock(&LOCK_commit_ordered);
    ......
}
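The leader election that precedes this routine (the first thread to find the queue empty becomes the leader) can be sketched with a mutex-protected queue. This is a minimal illustration, not the InnoSQL or MariaDB data structure: the class and method names are hypothetical and transactions are reduced to integer ids.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical commit queue; names are illustrative, not taken from the server source.
class CommitQueue {
public:
    // Adds a transaction id. Returns true if the caller found the queue
    // empty and is therefore the leader of this group.
    bool enqueue(int trxId) {
        std::lock_guard<std::mutex> guard(lock_);
        bool isLeader = pending_.empty();
        pending_.push_back(trxId);
        return isLeader;
    }

    // Called by the leader: takes the whole group in arrival order and
    // leaves an empty queue, so the next arrival starts a new group.
    std::vector<int> takeGroup() {
        std::lock_guard<std::mutex> guard(lock_);
        std::vector<int> group;
        group.swap(pending_);
        assert(pending_.empty());
        return group;
    }

private:
    std::mutex lock_;
    std::vector<int> pending_;
};
```

A thread that gets true from enqueue would go on to do the leader's work for everything returned by takeGroup, while a thread that gets false would simply wait to be woken. In this sketch the waiting and wake-up are left out.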

In ha_innodb.cc we can look at the definition of the commit_ordered method:

/*****************************************************************//**
Perform the first, fast part of InnoDB commit.
Doing it in this call ensures that we get the same commit order here
as in binlog and any other participating transactional storage engines.
Note that we want to do as little as really needed here, as we run
under a global mutex. The expensive fsync() is done later, in
innobase_commit(), without a lock so group commit can take place.
Note also that this method can be called from a different thread than
the one handling the rest of the transaction. */
static
void
innobase_commit_ordered(
/*====================*/
 handlerton *hton, /*!< in: Innodb handlerton */
 THD*	thd,	/*!< in: MySQL thread handle of the user for whom
   the transaction should be committed */
 bool	all)

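The commit_ordered method completes every redo log commit task that has to stay consistent with the upper-layer binary log, but it does not perform the fsync. The fsync is left to the second commit stage, the innobase_commit function, which is called from commit_one_phase_low in handler.cc (see the main logic of ha_commit_trans above). After each thread has gone through log_and_order, the threads execute commit_one_phase_low concurrently, and this is where InnoDB group commit takes effect during the lower-layer fsync.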

To be continued in: An In-Depth Analysis of MySQL Group Commit Implementation (Part 2)

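This article comes from the NetEase Practitioner Community (网易实践者社区) and is published with the authorization of its author, 郭忆.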