Re: issues with pppd error handling



On Wed, 2017-11-22 at 16:18 +0100, Piotr Figiel wrote:
Hi,
I'm looking into couple issues we stumbled on with pppd and
NetworkManager. It's tested on NetworkManager 1.8.4, ModemManager
1.6.8 and pppd 2.4.7. In general there are sporadic issues with
reconnecting when pppd exits because of LCP echo timeout and when it
exits due to other reasons.

In simplest scenario with active pppd connection when pppd crashes/is
forcibly killed and nm pppd plug-in doesn't have a chance to send the
status update via D-Bus to NM the connection is not deactivated
(nmcli
c shows it's up).
I've done some analysis and NM does notice the pppd process exiting,
it enters nm-device-modem.c:ppp_failed() but because the "may-fail"
ipv4 setting defaults to TRUE not much happens (the connection losses
IP address and gateway setting but is still active).
The behavior doesn't seem correct as according to the nm-settings(5)
the connection should fail if none of ipv4 and ipv6 are available
(even when they have may-fail setting to TRUE).

There are actually more issues with nm-device-modem.c:ppp_failed().
Normally when pppd session ends (e.g. because of LCP echo timeout) it
seems there are 2 messages sent from pppd which trigger ppp_failed():
SetState with NM_PPP_STATUS_DISCONNECT and NM_PPP_STATUS_DEAD.
Additionally pppd process exiting triggers the ppp_failed() for the
3rd time. In such conditions NM deactivates connection with error:
<warn>  [1511357452.8500] device (ttyACM0): PPP failure in unexpected
state 100
This seems to be a side effect of the may-fail logic and the fact
that
multiple events result in call to ppp_failed() but this side effect
deactivates the device and triggers reconnect later on (ie fixes the
issue).

This though doesn't work always like that: I observed that
sporadically the nm pppd plugin doesn't send the NM_PPP_STATUS_DEAD
state (observed with dbus-monitor). I think it's related with
g_dbus_proxy_call() being used in the plugin code, which is
asynchronous and it may sometimes not complete before pppd exit.
In such case and some amount of bad luck ppp_failed() is called
because of NM_PPP_STATUS_DISCONNECT and pppd exiting only (i.e. only
two times). If in such case ip4_state is configured (with may-fail)
and e.g. ip6_state == IP_CONF then the connection will not be
deactivated and reconnection won't be triggered.

I think for the LCP echo timeout error case setting may-fail to FALSE
on ipv4 on ppp connections should solve the issue. But if anything
from above makes sense to you it seems there are at least couple bugs
there. I think the simplest solution would be to replace
nm-device-modem.c:ppp_failed() contents with:

nm_device_state_changed (device, NM_DEVICE_STATE_FAILED, reason);

Or is there any case when pppd failure should leave the connection
active?
Additionally perhaps g_dbus_proxy_call in
nm-pppd-plugin.c:nm_phasechange() could be changed to sync
counterpart.

Hi Piotr,

may-fail is supposed to behave as you explained, and in general we
should implement the logic, that a failure of one address family does
not necessarily fail the entire connection.

I think, at the very least, the if-else-if in ppp_failed()
https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/wwan/nm-device-modem.c?id=ca26d1ae07a4e8a7036051301d0103ab0822fd44#n89
should be

   if (ip4_state_in_conf())
       schedule_ip4_timeout()
   else if (ip4_state_done())
       ip4_method_failed()
   /*BREAK*/
   if (ip6_state_in_conf())
       schedule_ip6_timeout()
   else if (ip6_state_done())
       ip6_method_failed()

But I think (and I think with this I agree with you), if ppp fails,
then the entire ppp connection failed. Not just one address family.
So, your device_state_change(STATE_FAILED) makes sense to me.



best,
Thomas

Attachment: signature.asc
Description: This is a digitally signed message part



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]