scaling bridge fdb database


本文讨论了大型网桥fdb数据库的性能和操作挑战,以及解决这些问题的方法。我们将讨论解决方案,如FDB DST端口故障转移,以加快收敛速度,从控制平面更新FDB的更快API,以及减少具有用于桥接隧道解决方案(如VXLAN)的轻量级隧道端点的FDB DST端口数量。





1. Scaling bridge forwarding database Roopa Prabhu, Nikolay Aleksandrov

2.Agenda ● Linux bridge forwarding database (FDB): quick overview ● Linux bridge deployments at scale: focus on multihoming ● Scaling bridge database: challenges and solutions 2

3.Bridge FDB entries FDB • Flood and learn (most basic <M1> dev swp1 vlan 10 <M2> dev swp2 vlan 10 case) • End point Orchestrator/provisioning controller based FDB switch1 programming bridge • Control plane learning: ▪ Local or distributed • [<Mac> <vlan> <dst_port>] swp1 swp2 H1 <M1> H2 <M2> 3

4.Bridge FDB entries: network virtualization (overlay: eg vxlan) ● Overlay macs point to overlay termination end-points ● Eg Vxlan tunnel termination endpoints (VTEPS) ○ Vxlan FDB extends bridge FDB ○ Vxlan FDB carries remote dst info ○ [ <mac> <vni> <remote_dst list> ] ■ Where remote_dst_list = remote overlay endpoint ip’s ■ Pkt is replicated to list of remote_dsts 4

5. Bridge FDB entries: overlay example ● switch1: M1 and M2 are local macs. M3 and M4 are remote macs Vxlan Overlay FDB <M1> dev swp1 vlan 10 FDB <M2> dev swp2 vlan 10 <M3> dev swp1 vlan 10 <M3> dev vxlan-10 vlan 10 <M4> dev swp2 vlan 10 <M4> dev vxlan-10 vlan 10 <M1> dev vxlan-10 vlan 10 <M2> dev vxlan-10 vlan 10 switch1 Vxlan FDB switch2 <M3> vxlan-10 dst Vxlan FDB <M1> vxlan-10 dst bridge <M4> dev vxlan-10 dst bridge <M2> dev vxlan-10 dst swp1 swp2 vxlan-10 swp1 swp2 vxlan-10 H1 <M1> H2 <M2> H3 <M3> H4 <M4> 5

6.Bridge FDB database scale 6

7.Bridging scale on a data center switch • layer-2 gateway • Bridging accelerated by hardware ▪ HW support for more than 100k entries ▪ Learning in hardware at line rate ▪ Flooding in hardware and software • IGMP snooping + optimized multicast forwarding • Bridging larger L2 domains with overlays (eg vxlan) • Multihoming: Bridging with distributed state 7

8. Layer-2 gateway in a datacenter architecture SPINE Layer-2 gateway LEAF (TOR) Layer2-3 boundary 8

9.Bridge FDB performance parameters at scale • Learning • Adding, deleting and updating FDB entries • Reduce flooding • Optimized Broadcast-Multicast-Unknown unicast handling • Network convergence on link failure events • Mac moves 9

10.Multihoming 10

11.Multihoming • Multihoming is the practice of connecting host or a network to more than one network (device) ▪ To increase reliability and performance • For the purpose of this discussion, let’s just say its a “Cluster of switches running Linux” providing redundancy to hosts 11

12.Common functions of a multihoming solution • Provide redundant paths to multihomed end-points • Faster network convergence in event of failures: ▪ Establish alternate redundant paths and move to them faster • Distributed state: ▪ Reduce flooding of unknown unicast, broadcast and multicast traffic regardless of which switch is active: • By keeping forwarding database in sync between peers • By Keeping multicast forwarding database in sync between peers 12

13.Multihoming: dedicated link ● Dedicated physical link peerlink (peerlink) between switch1 switch2 switches to sync multihoming state ● Hosts are connected to both switches Host1 Host2 ● Non-standard multihoming control plane 13

14.Multihoming: bridge: dedicated link ● Peerlink is a bridge port bridge bridge peerlink peerlink ● FDB entries to host swp1 swp2 swp1 swp2 switch1 switch2 point to host port <M1> dev swp1 eth0 eth1 eth0 eth1 ● FDB entry on swp1 bond0 bond0 failure, moved to H1 M1 H2 M2 peerlink: <M1> dev peerlink 14

15.Network convergence during failures • Multihoming Control plane reprogrames the FDB database: ▪ Update FDB entries to point to peer switch link ▪ Uses bridge FDB replace ▪ Restore when network failure is fixed • Problems: ▪ Too many FDB updates and netlink notifications ▪ Affects convergence 15

16.Bridge port backup port • For Faster network convergence: ▪ peer link is the static backup port for all host bridge ports ▪ Make peer link the backup port at config time: • bridge seamlessly redirects traffic to backup port ▪ Patch [1] does just that 16

17.Per Bridge backup port [1] Before: After: $bridge fdb show Bridge port swp1 has peerlink as backup port: mac1 dev swp1 $ip link set dev swp1 type bridge_slave /* On swp1 link failure event, control plane backup_port peerlink updates each fdb entry to point to peerlink */ $bridge fdb show $bridge fdb show mac1 dev swp1 mac1 dev peerlink /* On swp1 link failure event, kernel implicitly forwards traffic to backup port peerlink. No change to fdb entry */ $bridge fdb show mac1 dev swp1 17

18.Future enhancements Debuggability: • FDB dumps to carry indication that backup port is active 18

19.Multihoming: network overlay overlay overlay switch0 switch1 switch2 host1 host2 host2 19

20.Multihoming with network virtualization • E-VPN RFC [2]: BGP based multihoming control plane • No dedicated link between the clustered switches in a multihomed environment • Dedicated switch peer-link is now replaced by the overlay • Eg a vxlan tunnel port in a vxlan environment • More than 2 switches in a cluster • In the active-active case, more than one remote dst in the underlay: • mac <remote-end-point-underlay-ip-list> • Requires mac ECMP (FDB entry mac pointing to ecmp group containing remote dsts) 20

21.Multihoming: network overlay overlay overlay bridge bridge bridge switch0 vxlan0 switch0 vxlan0 switch0 vxlan0 swp3 swp1 swp2 swp1 swp2 swp1 swp2 switch1 switch2 switch3 eth0 eth1 eth0 eth1 eth0 eth1 H1 bond0 M1 H2 bond0 M2 H3 bond0 M3 21

22.Control plane strategies for faster convergence • Designated forwarder: avoid duplicating pkts [2,3] • Split horizon checks [4] • Aliasing: Instead of distributing all macs and withdrawing during failures infer from membership advertisements [5] 22

23.Forwarding database changes for faster convergence • Backup port: to redirect traffic to network overlay on failure [1] • Mac dst groups (for faster updates to FDB entries): ▪ FDB entry points to dst group (dst is an overlay end-point) ▪ Dst group is a list of vteps with paths to the MAC ▪ Think FDB entries as routes: • Ability to update dst groups separately is a huge win • Similar to recent updates to the routing API [6] 23

24.New way to look at overlay FDB entry: dst groups Current vxlan fwding database New proposed vxlan fwding database Eg: Vxlan fdb entry: Eg: Vxlan fdb entry: mac, vni dst_grp_id remote vni, mac, vni remote_ip Dst group db: remote vni, remote_ip dst_grp remote vni, (id) remote_ip remote vni, remote vni, remote_ip remote_ip remote vni, remote_ip 24

25.Fdb database API update New fdb netlink attribute to link an fdb entry to a dst group: • NDA_DST_GRP 25


27.Other considerations for the dst group api • Investigating possible re-use of route nexthop API [6] 27

28.Acknowledgements We would like to thank Wilson Kok, Anuradha Karuppiah, Vivek Venkataraman and Balki Ramakrishnan for discussion, knowledge and requirements for building better Multihoming solutions on Linux. 28

29.References [1] net: bridge: add support for backup port: [2] E-VPN Multihoming: [3] E-VPN Multihoming: Fast convergence: [4] E-VPN multihoming split horizon: [5] E-VPN Aliasing and Backup Path: [6] Nexthop groups: 29