Trouble connecting with Mellanox ConnectX-3 at 56GbE
- From: Larkin Lowrey <llowrey nuclearwinter com>
- To: networkmanager-list gnome org
- Subject: Trouble connecting with Mellanox ConnectX-3 at 56GbE
- Date: Sun, 24 Nov 2019 14:36:08 -0500
NetworkManager: 1.20.6-1.fc31.x86_64
Kernel: 5.3.11-300.fc31.x86_64
Driver: mlx4_en 4.0-0
I am having trouble with my Mellanox ConnectX-3 cards (multiple hosts)
connecting at 56GbE (56000baseCR4/Full).
I managed to discover that if I set "carrier-wait-timeout=60000" they
would connect... on the second attempt.
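For reference, one way to apply that setting is a device section in
NetworkManager.conf. This is only a sketch: the drop-in file name, the
section suffix, and matching on the interface name are assumptions for
illustration. The value is in milliseconds.

```
# /etc/NetworkManager/conf.d/99-carrier-wait.conf  (file name is an assumption)
[device-mlx4-carrier]
# Match only the affected port; the interface name is taken from the logs.
match-device=interface-name:eth1p0
# Wait up to 60 s for carrier before giving up on the device.
carrier-wait-timeout=60000
```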
It seems to take about 40 seconds for the card to link up with the
switch. Once it's up it's solid and reliable. My ConnectX-4 cards link
up right away. The ConnectX-3 cards didn't ship with 56GbE support but
had that support enabled in a firmware update. So, it's possible that
the long link-up has something to do with the fact that the card wasn't
originally designed to run at that speed.
If I'm not running NetworkManager, and either don't have any networking
enabled or am running the old sysv network scripts, the port links up
after ~40s and stays linked. If I'm running NetworkManager, it links at
the ~40s mark, immediately goes down, links up again ~40 seconds later,
and then stays linked. All in all it takes about 100s to get a stable
link.
Something odd is happening during that first link detection which causes
the link to be dropped immediately, and NM is clearly involved since I
don't see this behavior without NM. Here's what I'm seeing:
[ 19.024920] mlx4_en: eth1p0: Steering Mode 1
[ 58.727579] mlx4_en: eth1p0: Link Up
[ 58.731211] IPv6: ADDRCONF(NETDEV_CHANGE): eth1p0: link becomes ready
[ 58.821563] mlx4_en: eth1p0: Steering Mode 1
[ 58.941926] mlx4_en: eth1p0: Link Down
[ 100.860086] mlx4_en: eth1p0: Link Up
For some reason, it appears that NM is triggering link negotiation about
100ms after the link comes up. The "Steering Mode 1" line appears to
mark the start of that process.
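To put numbers on the gaps in the kernel log above, here is a small
Python sketch that parses the mlx4_en lines and prints the time between
consecutive events. The function names (parse_events, deltas) are made
up for illustration, and the regex assumes the dmesg format shown above.

```python
import re

# Kernel log excerpt copied from the message above.
DMESG = """\
[ 19.024920] mlx4_en: eth1p0: Steering Mode 1
[ 58.727579] mlx4_en: eth1p0: Link Up
[ 58.731211] IPv6: ADDRCONF(NETDEV_CHANGE): eth1p0: link becomes ready
[ 58.821563] mlx4_en: eth1p0: Steering Mode 1
[ 58.941926] mlx4_en: eth1p0: Link Down
[ 100.860086] mlx4_en: eth1p0: Link Up
"""

# Matches only driver lines: "[ <seconds>] mlx4_en: <iface>: <message>".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+mlx4_en: (\S+): (.+)$")


def parse_events(text):
    """Return (timestamp, interface, message) for each mlx4_en line."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3).strip()))
    return events


def deltas(events):
    """Return (seconds, previous message, next message) for adjacent events."""
    return [
        (round(b[0] - a[0], 3), a[2], b[2])
        for a, b in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for seconds, prev_msg, next_msg in deltas(parse_events(DMESG)):
        print(f"{seconds:8.3f}s  {prev_msg} -> {next_msg}")
```

On the excerpt above, this puts the second "Steering Mode 1" roughly
94ms after the first "Link Up", and each link-up roughly 40 seconds
after the preceding event.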
I have a log captured at debug level. It's 200KB, so I'm not sure how
best to provide it should someone be willing to help me.
The info level log looks like this:
[1574567719.8345] manager: (eth1p0): new Ethernet device
(/org/freedesktop/NetworkManager/Devices/4)
[1574567719.8350] device (eth1p0): state change: unmanaged ->
unavailable (reason 'managed', sys-iface-state: 'external')
[1574567747.1065] device (eth1p0): carrier: link connected
[1574567747.1071] device (eth1p0): state change: unavailable ->
disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
[1574567747.1075] policy: auto-activating connection 'System eth1p0'
(3f05b6c0-424b-3e57-6c0a-7810da99d066)
[1574567747.1078] device (eth1p0): Activation: starting connection
'System eth1p0' (3f05b6c0-424b-3e57-6c0a-7810da99d066)
[1574567747.1079] device (eth1p0): state change: disconnected -> prepare
(reason 'none', sys-iface-state: 'managed')
[1574567747.1081] manager: NetworkManager state is now CONNECTING
[1574567747.1083] device (eth1p0): state change: prepare -> config
(reason 'none', sys-iface-state: 'managed')
[1574567747.1618] device (eth1p0): state change: config -> ip-config
(reason 'none', sys-iface-state: 'managed')
[1574567747.2026] device (eth1p0): state change: ip-config -> ip-check
(reason 'none', sys-iface-state: 'managed')
[1574567747.2116] device (eth1p0): state change: ip-check -> secondaries
(reason 'none', sys-iface-state: 'managed')
[1574567747.2118] device (eth1p0): state change: secondaries ->
activated (reason 'none', sys-iface-state: 'managed')
[1574567747.2120] manager: NetworkManager state is now CONNECTED_LOCAL
[1574567747.2129] manager: NetworkManager state is now CONNECTED_SITE
[1574567747.2130] policy: set 'System eth1p0' (eth1p0) as default for
IPv4 routing and DNS
[1574567747.2140] device (eth1p0): Activation: successful, device activated.
[1574567747.2147] manager: NetworkManager state is now CONNECTED_GLOBAL
[1574567747.2152] manager: startup complete
[1574567788.9278] device (eth1p0): carrier: link connected
[1574567807.7456] dhcp6 (eth1p0): activation: beginning transaction
(timeout in 45 seconds)
[1574567807.7464] policy: set 'System eth1p0' (eth1p0) as default for
IPv6 routing and DNS
Any tips for resolving this or ideas for additional debugging would be
greatly appreciated.
--Larkin