[Big Data: Data Warehouse] A Defect in the Kudu Java Client Driver

Scenario:
  1. Seven physical machines in total: six form the Kudu cluster, each running one kudu-tserver, and one of the six additionally hosts a kudu-master (Kudu 1.0.1); the remaining machine runs YCSB as the load generator (YCSB v0.11.0).
  2. 10-gigabit switch and 10-gigabit NICs, i.e., a full 10GbE network.
  3. One large table hash-partitioned across 64 buckets, preloaded with 100 million rows, with 3 replicas (a sketch of how such a table can be created follows this list).
  4. Test command: ./bin/ycsb run kudu -P workloads/workloade -p kudu_master_addresses=10.120.232.25:7051 -p table=usertable3 -p insertstart=0 -threads 16 -p kudu_sync_ops=true -s -p kudu_pre_split_num_tablets=64 (workloade is YCSB's scan-heavy workload: roughly 95% short-range scans and 5% inserts).
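For reference, a table like the one in item 3 can be created through the Kudu Java client roughly as follows. This is a minimal sketch, not the actual YCSB setup code: the single string key column "key" is a hypothetical simplification of the real usertable schema, while the bucket count and replica count match the scenario above.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.kudu.ColumnSchema;
    import org.apache.kudu.Schema;
    import org.apache.kudu.Type;
    import org.apache.kudu.client.CreateTableOptions;
    import org.apache.kudu.client.KuduClient;

    public class CreateUsertable {
        public static void main(String[] args) throws Exception {
            // Connect to the master from the test command above.
            KuduClient client = new KuduClient.KuduClientBuilder("10.120.232.25:7051").build();
            try {
                // Hypothetical minimal schema: one string primary-key column.
                List<ColumnSchema> columns = new ArrayList<>();
                columns.add(new ColumnSchema.ColumnSchemaBuilder("key", Type.STRING)
                        .key(true)
                        .build());
                Schema schema = new Schema(columns);

                // 64 hash buckets on the key column and 3 replicas,
                // matching item 3 of the scenario.
                CreateTableOptions options = new CreateTableOptions()
                        .addHashPartitions(Arrays.asList("key"), 64)
                        .setNumReplicas(3);

                client.createTable("usertable3", schema, options);
            } finally {
                client.close();
            }
        }
    }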

Test results:
  1. In theory, with the table's 64 buckets spread evenly across 6 machines and the data generated at random, no data skew should occur. In practice, however, one machine's resource consumption was extremely heavy while the other five stayed steady, which raised the suspicion of uneven data distribution. Yet disk usage was roughly the same on every machine, so it looked as though the entire scan load had landed on a single machine!
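Besides comparing disk usage, one way to rule out placement skew is to ask the client where each tablet's replicas actually live. Below is a minimal sketch using KuduTable.getTabletsLocations from the kudu-client Java library (present in the 1.0-era API, later deprecated in favor of scan tokens); if the 64 tablets and their leaders come back spread across all six tservers, the table layout itself is not the problem.

    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduTable;
    import org.apache.kudu.client.LocatedTablet;

    public class ListTabletLocations {
        public static void main(String[] args) throws Exception {
            KuduClient client = new KuduClient.KuduClientBuilder("10.120.232.25:7051").build();
            try {
                KuduTable table = client.openTable("usertable3");
                // Fetch the locations of all tablets, with a 10s deadline.
                for (LocatedTablet tablet : table.getTabletsLocations(10000)) {
                    StringBuilder replicas = new StringBuilder();
                    for (LocatedTablet.Replica replica : tablet.getReplicas()) {
                        replicas.append(replica.getRpcHost()).append(':')
                                .append(replica.getRpcPort())
                                .append('(').append(replica.getRole()).append(") ");
                    }
                    System.out.println(new String(tablet.getTabletId()) + " -> " + replicas);
                }
            } finally {
                client.close();
            }
        }
    }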
Anomaly analysis:
  • Monitoring on the one machine with heavy resource consumption:
     command ‘top’:
     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     32383 kudu      20   0 21.735g 1.641g  32404 S  176.0  1.3 714:52.34 kudu-tserver
     command ‘sar -n DEV 1’:
     Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
     Average:         eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
     Average:         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
     Average:         eth2   4735.46 185349.84    447.21  273590.82      0.00      0.00      0.00
     Average:         eth3      0.00      0.00      0.00      0.00      0.00      0.00      0.00
     Average:           lo     59.84     59.84    108.97    108.97      0.00      0.00      0.00
  • Monitoring on the five machines with steady resource consumption:
     command ‘top’:
     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     21684 kudu      20   0 18.456g 2.536g  32772 S   6.2  2.0 263:12.00 kudu-tserver
command ‘sar -n DEV 1’:
     10:31:52 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
     10:31:53 AM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
     10:31:53 AM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
     10:31:53 AM      eth2    188.75    188.75     30.13     31.23      0.00      0.00      0.00
     10:31:53 AM      eth3      0.00      0.00      0.00      0.00      0.00      0.00      0.00
     10:31:53 AM        lo     18.75     18.75     10.71     10.71      0.00      0.00      0.00

Several rounds of testing showed the same pattern: CPU and network bandwidth consumption were concentrated on a single machine! I was completely baffled. Along the way I suspected a version mismatch in the Kudu client driver and swapped in version 0.9.1, but the problem persisted. I went through all the public material (the Kudu documents and the Kudu design docs), searched the Kudu JIRA, and contacted the only Kudu committer in China, all to no avail...

A few days later, I unexpectedly received a reply from Todd (the Kudu project lead):

Hi Helifu
Is it possible that you created this table initially with only one tablet, and then later tried the same job with the presplit option? I'm wondering if somehow your table has only one tablet. You can check the Kudu master web UI to see.
The other thing to note is that currently the "scan" workload is not well implemented in the YCSB client. The Java client doesn't support a client-side LIMIT and so the scans all fetch significantly more data than expected. So, the scan performance measured by YCSB is not very accurate, and we haven't really tested it. It's possible there is some other bug which is causing the scan workload to always read from the same part of the table.
Does the issue occur with the other workloads like get/insert/load?
-Todd
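Todd's second point deserves a closer look. Because the Java client at the time had no working LIMIT, a scan kept pulling batches until its range was exhausted, so the only way to bound it was to count rows yourself and abandon the scanner early. Here is a minimal sketch of that workaround, assuming a projected string column "key" (hypothetical; the real YCSB client projects the usertable fields):

    import java.util.Arrays;

    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduScanner;
    import org.apache.kudu.client.KuduTable;
    import org.apache.kudu.client.RowResult;
    import org.apache.kudu.client.RowResultIterator;

    public class LimitedScan {
        public static void main(String[] args) throws Exception {
            KuduClient client = new KuduClient.KuduClientBuilder("10.120.232.25:7051").build();
            try {
                KuduTable table = client.openTable("usertable3");
                KuduScanner scanner = client.newScannerBuilder(table)
                        .setProjectedColumnNames(Arrays.asList("key")) // hypothetical column
                        .build();

                final int limit = 100; // the LIMIT we want but cannot push down
                int seen = 0;
                // Each nextRows() call may still pull a full batch from the
                // server; all we can do is stop asking for further batches.
                while (seen < limit && scanner.hasMoreRows()) {
                    RowResultIterator batch = scanner.nextRows();
                    for (RowResult row : batch) {
                        if (++seen >= limit) {
                            break;
                        }
                    }
                }
                scanner.close(); // release the server-side scanner early
            } finally {
                client.close();
            }
        }
    }

Even with this workaround, every batch already shipped from the server still counts against CPU and network, which is exactly why the YCSB scan numbers were larger than the requested ranges would suggest.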

Well, I fell into this pit; I hope those who come after me can steer clear of it!

This article comes from the NetEase Practitioner Community and is published with the authorization of its author, He Lifu.