On Wed, 2017-11-22 at 16:18 +0100, Piotr Figiel wrote:
Hi, I'm looking into couple issues we stumbled on with pppd and NetworkManager. It's tested on NetworkManager 1.8.4, ModemManager 1.6.8 and pppd 2.4.7. In general there are sporadic issues with reconnecting when pppd exits because of LCP echo timeout and when it exits due to other reasons. In simplest scenario with active pppd connection when pppd crashes/is forcibly killed and nm pppd plug-in doesn't have a chance to send the status update via D-Bus to NM the connection is not deactivated (nmcli c shows it's up). I've done some analysis and NM does notice the pppd process exiting, it enters nm-device-modem.c:ppp_failed() but because the "may-fail" ipv4 setting defaults to TRUE not much happens (the connection losses IP address and gateway setting but is still active). The behavior doesn't seem correct as according to the nm-settings(5) the connection should fail if none of ipv4 and ipv6 are available (even when they have may-fail setting to TRUE). There are actually more issues with nm-device-modem.c:ppp_failed(). Normally when pppd session ends (e.g. because of LCP echo timeout) it seems there are 2 messages sent from pppd which trigger ppp_failed(): SetState with NM_PPP_STATUS_DISCONNECT and NM_PPP_STATUS_DEAD. Additionally pppd process exiting triggers the ppp_failed() for the 3rd time. In such conditions NM deactivates connection with error: <warn> [1511357452.8500] device (ttyACM0): PPP failure in unexpected state 100 This seems to be a side effect of the may-fail logic and the fact that multiple events result in call to ppp_failed() but this side effect deactivates the device and triggers reconnect later on (ie fixes the issue). This though doesn't work always like that: I observed that sporadically the nm pppd plugin doesn't send the NM_PPP_STATUS_DEAD state (observed with dbus-monitor). I think it's related with g_dbus_proxy_call() being used in the plugin code, which is asynchronous and it may sometimes not complete before pppd exit. In such case and some amount of bad luck ppp_failed() is called because of NM_PPP_STATUS_DISCONNECT and pppd exiting only (i.e. only two times). If in such case ip4_state is configured (with may-fail) and e.g. ip6_state == IP_CONF then the connection will not be deactivated and reconnection won't be triggered. I think for the LCP echo timeout error case setting may-fail to FALSE on ipv4 on ppp connections should solve the issue. But if anything from above makes sense to you it seems there are at least couple bugs there. I think the simplest solution would be to replace nm-device-modem.c:ppp_failed() contents with: nm_device_state_changed (device, NM_DEVICE_STATE_FAILED, reason); Or is there any case when pppd failure should leave the connection active? Additionally perhaps g_dbus_proxy_call in nm-pppd-plugin.c:nm_phasechange() could be changed to sync counterpart.
Hi Piotr, may-fail is supposed to behave as you explained, and in general we should implement the logic, that a failure of one address family does not necessarily fail the entire connection. I think, at the very least, the if-else-if in ppp_failed() https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/wwan/nm-device-modem.c?id=ca26d1ae07a4e8a7036051301d0103ab0822fd44#n89 should be if (ip4_state_in_conf()) schedule_ip4_timeout() else if (ip4_state_done()) ip4_method_failed() /*BREAK*/ if (ip6_state_in_conf()) schedule_ip6_timeout() else if (ip6_state_done()) ip6_method_failed() But I think (and I think with this I agree with you), if ppp fails, then the entire ppp connection failed. Not just one address family. So, your device_state_change(STATE_FAILED) makes sense to me. best, Thomas
Attachment:
signature.asc
Description: This is a digitally signed message part