Trouble connecting with Mellanox ConnectX-3 at 56GbE



NetworkManager: 1.20.6-1.fc31.x86_64
Kernel: 5.3.11-300.fc31.x86_64
Driver: mlx4_en 4.0-0

I am having trouble with my Mellanox ConnectX-3 cards (multiple hosts) connecting at 56GbE (56000baseCR4/Full).

I managed to discover that if I set "carrier-wait-timeout=60000" they would connect... on the second attempt.
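
For anyone who wants to try the same workaround: as far as I know, carrier-wait-timeout is a per-device option in NetworkManager.conf and is given in milliseconds. A minimal sketch, assuming a drop-in file; the file name and the section suffix are hypothetical, and matching on the interface name is just one way to select the device:

```ini
# /etc/NetworkManager/conf.d/90-carrier-wait.conf  (hypothetical file name)
[device-carrier-wait]
# Assumption: match only the ConnectX-3 port by interface name
match-device=interface-name:eth1p0
# Wait up to 60 s for carrier before giving up on the device
carrier-wait-timeout=60000
```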

It seems to take about 40 seconds for the card to link up with the switch. Once it's up, it's solid and reliable. My ConnectX-4 cards link up right away. The ConnectX-3 cards didn't ship with 56GbE support; that support was enabled by a later firmware update. So it's possible that the long link-up time has something to do with the fact that the card wasn't originally designed to run at that speed.

If I'm not running NetworkManager (either with no networking enabled at all, or with the old sysv network scripts), the port links up after ~40s and stays linked. If I'm running NetworkManager, it links at the ~40s mark, immediately goes down, links up again ~40 seconds later, and then stays linked. All in all it takes about 100s to get a stable link.

Something odd is happening during that first link detection that causes the link to be dropped immediately, and NM is clearly involved, since I don't see this behavior without NM. Here's what I'm seeing in the kernel log:

[   19.024920] mlx4_en: eth1p0: Steering Mode 1
[   58.727579] mlx4_en: eth1p0: Link Up
[   58.731211] IPv6: ADDRCONF(NETDEV_CHANGE): eth1p0: link becomes ready
[   58.821563] mlx4_en: eth1p0: Steering Mode 1
[   58.941926] mlx4_en: eth1p0: Link Down
[  100.860086] mlx4_en: eth1p0: Link Up

For some reason, it appears that NM is triggering link negotiation 100ms after link comes up. The "Steering Mode 1" line appears to happen at the start of that process.
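
To make the timing easier to see, here is a small Python sketch that computes the gap between consecutive kernel log lines. It is only an illustration: the function names are made up for this example, and it assumes the usual "[ seconds.micros] message" dmesg format shown above.

```python
import re

# Kernel log excerpt copied from above.
DMESG = """\
[   19.024920] mlx4_en: eth1p0: Steering Mode 1
[   58.727579] mlx4_en: eth1p0: Link Up
[   58.731211] IPv6: ADDRCONF(NETDEV_CHANGE): eth1p0: link becomes ready
[   58.821563] mlx4_en: eth1p0: Steering Mode 1
[   58.941926] mlx4_en: eth1p0: Link Down
[  100.860086] mlx4_en: eth1p0: Link Up
"""

# Matches "[   58.727579] message" and captures the timestamp and the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(text):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events


def deltas(events):
    """Return (seconds_since_previous_line, message) for each line after the first."""
    return [
        (round(t2 - t1, 6), msg)
        for (t1, _), (t2, msg) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for gap, msg in deltas(parse(DMESG)):
        print(f"+{gap:10.6f}s  {msg}")
```

On the excerpt above, the second "Steering Mode 1" comes roughly 94 ms after the first "Link Up", and "Link Down" follows about 120 ms after that.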

I have a log captured at debug. It's 200KB so not sure how best to provide that should someone be willing to help me.

The info level log looks like this:

[1574567719.8345] manager: (eth1p0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/4)
[1574567719.8350] device (eth1p0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
[1574567747.1065] device (eth1p0): carrier: link connected
[1574567747.1071] device (eth1p0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
[1574567747.1075] policy: auto-activating connection 'System eth1p0' (3f05b6c0-424b-3e57-6c0a-7810da99d066)
[1574567747.1078] device (eth1p0): Activation: starting connection 'System eth1p0' (3f05b6c0-424b-3e57-6c0a-7810da99d066)
[1574567747.1079] device (eth1p0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
[1574567747.1081] manager: NetworkManager state is now CONNECTING
[1574567747.1083] device (eth1p0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
[1574567747.1618] device (eth1p0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
[1574567747.2026] device (eth1p0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
[1574567747.2116] device (eth1p0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
[1574567747.2118] device (eth1p0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
[1574567747.2120] manager: NetworkManager state is now CONNECTED_LOCAL
[1574567747.2129] manager: NetworkManager state is now CONNECTED_SITE
[1574567747.2130] policy: set 'System eth1p0' (eth1p0) as default for IPv4 routing and DNS
[1574567747.2140] device (eth1p0): Activation: successful, device activated.
[1574567747.2147] manager: NetworkManager state is now CONNECTED_GLOBAL
[1574567747.2152] manager: startup complete
[1574567788.9278] device (eth1p0): carrier: link connected
[1574567807.7456] dhcp6 (eth1p0): activation: beginning transaction (timeout in 45 seconds)
[1574567807.7464] policy: set 'System eth1p0' (eth1p0) as default for IPv6 routing and DNS


Any tips for resolving this or ideas for additional debugging would be greatly appreciated.

--Larkin


