Simple Debugging Steps and Tips for Problems on the Legacy Network

把生命浪费在美好事物上 2018-08-07 09:38

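Our legacy network has been in service for a long time and its code is basically stable, but connectivity failures still come up frequently after a new node is brought online or a physical machine reboots. SAs and QAs sometimes do not know how to troubleshoot them, so here is a short summary of the steps. I hope it helps everyone and also lightens the load on our network team. To make troubleshooting easier, explanations are kept to a minimum and given as steps instead.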

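1. Check the physical network configuration (SAs should check this themselves)

This mainly applies to flat and svlan networks.

svlan:

1) Find the VLAN ID from the VM UUID

neutron port-list --device-id <vm_uuid>  # gives the port ID and subnet ID

neutron subnet-show <subnet_uuid>  # gives the network ID

neutron net-show <net_uuid> |grep physical_vlan  # gives the VLAN ID

2) Find the host the VM runs on

neutron port-list --device-id <vm_uuid>  # gives the port ID and subnet ID

neutron port-show <port_uuid>  # gives the host the VM runs on (with nova command access, nova show <vm_uuid> gives it directly)

3) Check the configuration on the host using the VLAN ID

cat /etc/neutron/plugins/ml2/ml2_conf.ini|grep svlan_phy_dev_pattern  # gives the physical NIC

sudo ip link list|grep <phy_dev.vlanid>  # checks whether the VLAN sub-interface exists; if it does not, something is definitely wrong

sudo ip -s link show dev <phy_dev.vlanid>  # if the tx and rx counters look abnormal (for example 0), there is a connectivity problem

sudo ovs-vsctl show|less  # search for <phy_dev.vlanid>: it should be present, report no errors, and carry the correct tag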

Example:

Port "eth0.322"  # this one is broken: the device does not exist
    tag: 322
    Interface "eth0.322"
        error: "could not open network device eth0.322 (No such device)"

Port "eth0.306"  # this one is normal
    tag: 306
    Interface "eth0.306"
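Searching the ovs-vsctl show output by eye gets tedious on hosts with many ports. The following is a minimal sketch, not part of the original procedure, that lists every interface followed by an error line. The function name find_broken_interfaces is hypothetical, and the sketch assumes the output layout shown in the example above.

```shell
#!/bin/sh
# Print every interface that `ovs-vsctl show` reports an error for.
# Reads `ovs-vsctl show` output on stdin.
find_broken_interfaces() {
    awk '
        # Remember the most recent interface name, without its quotes.
        $1 == "Interface" { iface = $2; gsub(/"/, "", iface) }
        # An error line belongs to the interface seen just before it.
        $1 == "error:"    { print iface }
    '
}

# Demo on the sample from the article. On a real host one would run:
#   sudo ovs-vsctl show | find_broken_interfaces
find_broken_interfaces <<'EOF'
        Port "eth0.322"
            tag: 322
            Interface "eth0.322"
                error: "could not open network device eth0.322 (No such device)"
        Port "eth0.306"
            tag: 306
            Interface "eth0.306"
EOF
# prints: eth0.322
```

An empty result only means that OVS could open every device; the tag and the rx/tx counters still need to be checked as described in step 3.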

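4) Fix

If the VLAN sub-interface is not up, bring it up and restart the OVS agent:

ifdown <phy_dev.vlanid>

ifup <phy_dev.vlanid>

service neutron-plugin-openvswitch-agent restart

If the VLAN sub-interface is already up but its counters are abnormal, contact the data center to have the network connection checked.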

flat:

1) Find the physical_network from the VM UUID

neutron port-list --device-id <vm_uuid>  # gives the port ID and subnet ID

neutron subnet-show <subnet_uuid>  # gives the network ID

neutron net-show <net_uuid> |grep physical_network  # gives the bridge suffix

2) Find the host the VM runs on

neutron port-list --device-id <vm_uuid>  # gives the port ID and subnet ID

neutron port-show <port_uuid>  # gives the host the VM runs on (with nova command access, nova show <vm_uuid> gives it directly)

3) Check the bridge on the host

cat /etc/neutron/plugins/ml2/ml2_conf.ini|grep bridge_mapping  # gives the bridge prefix

sudo ovs-vsctl show|less  # search for the bridge and check that its NIC exists

sudo ip -s link show <public_dev>  # if the tx and rx counters look abnormal (for example 0), there is a connectivity problem

Example:

neutron net-show <net_uuid> |grep physical_network  # returns public-nlb

cat /etc/neutron/plugins/ml2/ml2_conf.ini|grep bridge_mapping  # returns public:br-wan, so the bridge should be br-wan-nlb

sudo ovs-vsctl show|less  # produces the following:

Bridge br-wan-nlb  # if this bridge is missing, something is definitely wrong
    fail_mode: secure
    Port "eth2.114"
        Interface "eth2.114"  # if No such device appears here, the physical NIC has a problem
    Port br-wan-nlb
        Interface br-wan-nlb
            type: internal
    Port phy-br-wan-nlb
        Interface phy-br-wan-nlb
            type: patch
            options: {peer=int-br-wan-nlb}  # if nonexistent-peer appears here, connectivity between br-int and br-wan is broken
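The bridge name in this example is built from two pieces: the bridge prefix from the bridge mapping and the suffix from the physical_network. The sketch below reproduces that derivation. The function name bridge_for_physnet is hypothetical, and the prefix-plus-hyphen matching rule is an assumption inferred from the single example above, not taken from the agent's code.

```shell
#!/bin/sh
# bridge_for_physnet <physical_network> <prefix:bridge>
# Assumed rule (inferred from the example): physical_network "<prefix>-<suffix>"
# maps to bridge "<bridge>-<suffix>". Returns 1 when the prefix does not match.
bridge_for_physnet() {
    physnet=$1
    prefix=${2%%:*}     # text before the colon, e.g. "public"
    bridge=${2#*:}      # text after the colon, e.g. "br-wan"
    case $physnet in
        "$prefix"-*)
            suffix=${physnet#"$prefix"-}
            printf '%s-%s\n' "$bridge" "$suffix" ;;
        *)
            return 1 ;;
    esac
}

bridge_for_physnet public-nlb public:br-wan      # prints: br-wan-nlb
bridge_for_physnet nlb-dg-1 public:br-wan || echo "no matching bridge mapping"
```

The printed name is the one to search for in the ovs-vsctl show output.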

4) Fix

If the bridge is not up, bring it up and restart the OVS agent:

ifup --allow=ovs br-wan-nlb

service neutron-plugin-openvswitch-agent restart

If the physical NIC is abnormal, handle it the same way as for svlan.

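2. Check that the relevant services started correctly (SAs should check this themselves)

1) The openvswitch service

ps aux|grep vswitchd; ps aux|grep ovsdb

2) The openvswitch kernel module

lsmod|grep openvswitch  # the key thing is that vport_vxlan shows up

modinfo openvswitch  # we require the DKMS build, with a path like /lib/modules/3.10.45-openstack-amd64/updates/dkms/openvswitch.ko

3) The OVS agent service

ps aux|grep openvswitch-agent

neutron agent-list|grep <agent-host>  # if xxx appears, either the process is down or the heartbeat between agent and server is failing

4) Fix

If the problem is openvswitch.ko, reinstall OVS.

If the problem is the OVS processes, restart the openvswitch service (service openvswitch-switch restart).

If the problem is the OVS agent, restart the OVS agent service (service neutron-plugin-openvswitch-agent restart).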
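When several hosts are involved, grepping neutron agent-list one host at a time is slow. The following sketch lists every agent whose alive column shows xxx. The function name dead_agents is hypothetical, and the sketch assumes the table has host and alive columns, as the stock neutron client prints; a customized client may label them differently.

```shell
#!/bin/sh
# Reads `neutron agent-list` table output on stdin and prints
# "host: agent_type" for every agent whose alive column is xxx.
dead_agents() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # The first row starting with "|" is the header: locate the columns.
        /^[|]/ && !seen_header {
            for (i = 1; i <= NF; i++) {
                name = trim($i)
                if (name == "host")       h = i
                if (name == "alive")      a = i
                if (name == "agent_type") t = i
            }
            seen_header = 1
            next
        }
        # Data rows: report the ones marked xxx.
        /^[|]/ && h && a && trim($a) == "xxx" {
            print trim($h) ": " (t ? trim($t) : "unknown")
        }
    '
}

# Demo on made-up rows. On a real deployment one would run:
#   neutron agent-list | dead_agents
dead_agents <<'EOF'
+------+--------------------+-------+-------+----------------+
| id   | agent_type         | host  | alive | admin_state_up |
+------+--------------------+-------+-------+----------------+
| aaaa | Open vSwitch agent | node1 | :-)   | True           |
| bbbb | DHCP agent         | node2 | xxx   | True           |
+------+--------------------+-------+-------+----------------+
EOF
# prints: node2: DHCP agent
```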

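3. Check whether the VM itself is healthy (whoever reports the problem should check this themselves)

When you hit a connectivity problem, the first thing to do is look at the VM.

1) Check that the VM booted correctly

If a web console is available, log in directly from the console whenever possible.

Otherwise, ask an SA or QA to check the serial console log first: cat /data/nova/instances/<vm_uuid>/console.log |less  # if the boot is stuck at some point, ask the colleagues responsible for hosts to take a look

2) Check that the VM obtained its IP correctly

If step 1 did not get you into the VM, see whether you can log in over VNC after resetting the password from the back end (note: for a production user, talk to the customer first, and do not set a password that is too simple).

nova root-password <vm_uuid>  # admin users only

nova get-vnc-console <vm_uuid> novnc  # admin users only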

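2.1) Inside the VM, run ip a; if no IP was obtained, go to 2.2

2.2) Run dhclient -v <ethx> on the NIC in question; if it obtains an IP, go to 2.3

2.3) Run ip route get <destination_ip> and check that the traffic leaves through the correct interface.

For example, traffic to the Internet should leave through the public NIC if the VM has a public IP, and through the private-network NIC if it does not;

traffic to the data-center network should leave through the data-center NIC.

2.4) Check that the relevant gateway is reachable

ping <destination_gateway>  # if the ping fails, there is a connectivity problem

ip neigh list|grep <destination_gateway>  # if the neighbor entry shows failed or incomplete, layer-2 connectivity is broken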
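The neighbor check in 2.4 can be scripted so that it reports only unhealthy entries. This is a minimal sketch: the function name bad_neighbors is hypothetical, and it assumes the usual ip neigh line layout, with the address first, the device name third, and the state last.

```shell
#!/bin/sh
# bad_neighbors <gateway_ip>
# Reads `ip neigh` output on stdin and prints "ip dev state" for entries
# of the given gateway whose state is FAILED or INCOMPLETE.
bad_neighbors() {
    awk -v gw="$1" '
        $1 == gw {
            state = toupper($NF)          # the state is the last field
            if (state == "FAILED" || state == "INCOMPLETE")
                print $1, $3, state       # $3 is the device name after "dev"
        }
    '
}

# Demo on made-up entries. Inside a VM one would run:
#   ip neigh list | bad_neighbors <destination_gateway>
bad_neighbors 10.0.0.1 <<'EOF'
10.0.0.1 dev eth0  FAILED
10.0.1.1 dev eth1 lladdr fa:16:3e:11:22:33 REACHABLE
EOF
# prints: 10.0.0.1 eth0 FAILED
```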

2.5) Check that the NIC is working

ip -s link show <ethx>  # if the rx or the tx count is 0, the NIC has a problem

tc -s -d qdisc show dev <ethx>  # if the backlog packet count is not 0, the NIC has a problem
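The rx/tx check appears in several steps of this procedure, so it can be worth scripting. The sketch below pulls the packet counters out of ip -s link show output and returns a non-zero status when either direction is 0. The function name link_counters is hypothetical, and the sketch assumes the common layout in which the counters follow the RX: and TX: header lines, with packets in the second column.

```shell
#!/bin/sh
# Reads `ip -s link show <dev>` output on stdin, prints the rx and tx packet
# counts, and exits with status 1 if either of them is 0.
link_counters() {
    awk '
        BEGIN { rx = 0; tx = 0 }
        # The counters are on the line after each RX:/TX: header line.
        $1 == "RX:" { getline; rx = $2 }
        $1 == "TX:" { getline; tx = $2 }
        END {
            print "rx_packets=" rx, "tx_packets=" tx
            exit (rx == 0 || tx == 0)
        }
    '
}

# Demo on made-up output. On a real machine one would run:
#   ip -s link show <ethx> | link_counters
link_counters <<'EOF' || echo "a counter is 0: suspect the NIC or its uplink"
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
    link/ether fa:16:3e:00:00:01 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    1024       8        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
EOF
```

A counter of 0 on an interface that was only just created is not necessarily a fault, so the result is a hint to investigate rather than a verdict.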

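4. Check the subnet's AZ and the DHCP agent (for cases where no IP is obtained)

This applies to images that obtain their IP through DHCP. Some production images use static IP injection and are not covered here.

4.1) The AZ must match the nova node

neutron net-show <net_uuid> |grep support_azs  # the AZ here must be a first-level AZ

4.2) DHCP must be enabled on the subnet

neutron subnet-show <subnet_uuid> |grep enable_dhcp  # if this is false, DHCP is not enabled, and dhclient cannot work

4.3) The DHCP agent's scheduling strategy must be correct

neutron dhcp-agent-list-hosting-net <net_uuid>  # lists the nodes hosting the DHCP agents; it is enough for any one DHCP agent node to receive the packets

neutron agent-show <dhcp_agent_uuid>  # if the scheduling strategy is AZ-based, check that the network's AZ matches this DHCP agent's AZ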

Example:

schedule_strategies {
"options": {
"azs": [
"xiaoshan1"
]
"strategy_type": "az"  # the strategy can be az or tenant; with tenant, only the matching tenant can use this DHCP agent
}
started_at 2017-05-25 13:07:38
topic dhcp_agent

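4.4) The DHCP server must be running

neutron agent-show <dhcp_agent_uuid> |grep alive  # if this shows false, the agent service has a problem

ps aux|grep <net_uuid>  # run this on the DHCP agent node serving the network; if the corresponding dnsmasq is not running, there is a problem

sudo ip netns exec qdhcp-<net_uuid> ip a  # if the tap NIC was not created here, there is a problem

4.5) Check that the physical NIC is working (flat and svlan networks)

The method is the same as in step 1; the only difference is that you check the host running the DHCP agent instead of the host running the VM.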

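5. Check the router and the L3 agent (for VM heartbeat problems)

This generally applies to private networks.

5.1) Get the router ID

neutron router-list --tenant-id <tenantid>

5.2) Find the master router

neutron router-port-list <routerid>  # if the result shows two masters, or no master at all, there is a problem

5.3) Check that the agent scheduling strategy is correct

neutron l3-agent-list-hosting-router <routerid>

neutron agent-show <l3_agent_uuid> |grep schedule_strategies  # judged the same way as for the DHCP agent; the strategy is either az or tenant

5.4) Check that the L3 agent is running

neutron agent-show <l3_agent_uuid> |grep alive  # if this shows false, the agent service has a problem

ps aux|grep <router_uuid>  # run this on the L3 agent hosting the master router. If keepalived is not running, there is a problem; if the openvpn process is not running, VPN login to VMs is affected

sudo ip netns exec qrouter-<router_uuid> ip a  # if the tap NIC was not created here, there is a problem; if the tun NIC is missing, VPN login to VMs is affected

5.5) Check that the public NIC is healthy

cat /etc/neutron/vpn_agent.ini|grep isolate_l3  # if this is true, then:

sudo ip netns exec l3agent ip a  # check that the public NIC is present; its IP address should equal the value from cat /etc/neutron/vpn_agent.ini|grep sslvpn_domain_ip

5.6) Problems reaching the data-center network through the L3 gateway

cat /etc/neutron/vpn_agent.ini|grep acl_whitelist  # check that the whitelist contains the destination segment you want to reach

cat /etc/neutron/vpn_agent.ini|grep enable_acl  # alternatively, access is allowed when this value is false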
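One pitfall with the cat ... |grep pattern used throughout this article is that grep also matches commented-out lines, which can make a disabled setting look active. The sketch below reads a single key while skipping comments. The function name ini_get is hypothetical, the sample values in the demo are invented for illustration, and the sketch ignores section headers, so it assumes the key appears only once in the file.

```shell
#!/bin/sh
# ini_get <file> <key>
# Prints the value of the first uncommented "key = value" line.
ini_get() {
    awk -F'=' -v key="$2" '
        /^[ \t]*[#;]/ { next }                       # skip commented lines
        NF >= 2 {
            k = $1; gsub(/[ \t]/, "", k)             # key without blanks
            if (k == key) {
                v = substr($0, index($0, "=") + 1)   # text after the first "="
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }
    ' "$1"
}

# Demo on a temporary file with invented values. On a real node one would run:
#   ini_get /etc/neutron/vpn_agent.ini enable_acl
conf=$(mktemp)
cat > "$conf" <<'EOF'
[DEFAULT]
# enable_acl = false
enable_acl = true
acl_whitelist = 10.120.0.0/16
EOF
ini_get "$conf" enable_acl       # prints: true
ini_get "$conf" acl_whitelist    # prints: 10.120.0.0/16
rm -f "$conf"
```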

6. Miscellaneous

6.1) Check that the VM port is healthy

neutron port-show <port_uuid>  # if the status is not active, the port is abnormal.

To recover: neutron port-update <port_uuid> admin_state_up=false; neutron port-update <port_uuid> admin_state_up=true;

6.2) Check that the VM port on the host is healthy

sudo ovs-vsctl -- --columns=name find interface external_ids:iface-id=<vm_portuuid>  # gives the tap NIC name

sudo ovs-vsctl list interface tapxxx |grep ofport  # -1 means the port is abnormal; check whether the tap device for the VM was actually created (ip link list|grep tapxxx).

sudo ovs-vsctl list port tapxxx|grep tag  # an empty value or 4095 is abnormal.
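The two rules in 6.2 (ofport must not be -1, tag must be neither empty nor 4095) can be wrapped in a small helper. This is a sketch only: the function name check_vif is hypothetical. It takes the two values as arguments, which could be obtained with ovs-vsctl get, where an empty tag is printed as [].

```shell
#!/bin/sh
# check_vif <ofport> <tag>
# Applies the two rules from step 6.2 and returns 1 if either is violated.
check_vif() {
    ofport=$1
    tag=$2
    if [ "$ofport" = "-1" ]; then
        echo "ofport is -1: the device is missing from the datapath"
        return 1
    fi
    case $tag in
        ''|'[]'|4095)
            echo "tag is empty or 4095: the port has no usable local VLAN"
            return 1 ;;
    esac
    echo "ok"
}

# On a real host the values could be fetched like this:
#   check_vif "$(sudo ovs-vsctl get interface tapxxx ofport)" \
#             "$(sudo ovs-vsctl get port tapxxx tag)"
check_vif 5 1                                  # prints: ok
check_vif 5 4095 || echo "port needs attention"
```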

6.3) Capture packets on the port with tcpdump

6.4) binding_failed problems

First, check that the OVS agent on the host is alive;

next, check that the support AZ of the network matches the support AZ of the agent;

for a flat network, also check that the agent's bridge mapping / bridge mapping pattern matches the network's physical network.

Example:

Part of the ovs agent show output:

configurations {
"bridge_mapping_patterns": {
"public": "br-wan"

Part of the net show output:

segments | {"network_id": "2425bbc7-caa6-4c4e-a454-9ab5b680d114", "network_type": "flat", "id": "61bc12ea-5f09-47d4-b6a6-62c711361162",
"provider:physical_network": "nlb-dg-1", "mtu": 1500}

The physical_network here does not match the bridge mapping pattern, so the binding is certain to fail.

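This article comes from the NetEase Practitioner Community and is published with the authorization of the author, Chen Yuefang (陈跃芳).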