Re: About "Received DHCP event from unexpected PID xxxx (expected xxxx)"



2009/2/20 Dan Williams <dcbw redhat com>
On Fri, 2009-02-20 at 19:22 +0800, cee1 wrote:
> Hi, Dan
> I've read some code snippets of dhcpcd: it tries to read the pidfile,
> and then writes(the pid to pidfile) & flock the pidfile.
>
> Image a situation:
> 1) NM forks a DHCP process
> 2) The DHCP process begins to execute
> 3) The DHCP process reads the pidfile
> 4) The DHCP process blocks (since it doing some I/O), and OS chooses
> NM to run again
> 5) NM decieds to start another DHCP routine, and it finds no pidfile
> exists, so it forks another DHCP process

NM decides to start another DHCP process for the same device, or a
different device?  It's normal that there should be more than one DHCP
client running when multiple interfaces are being used, but there should
only ever be one DHCP client per interface.

Dan
Sorry for not making myself clear.
Say using dhcpcd as the DHCP backend, the following sequence of operations causes more than one dhcpcd processes running for the same net interface.

1) NM forks a dhcpcd process for wlan0, OS runs the dhcpcd process, stops the execution of NM (assuming the cpu has a single core, without SMT or HT enabled)
2) The dhcpcd process begins to execute, it tries to read the file "/var/run/dhcpcd-wlan0.pid", this I/O operation makes it block
3) OS chooses NM to run again
4) NM decides to fork another dhcpcd process for wla0
5) NM tries to read the file "/var/run/dhcpcd-wlan0.pid", but it gets nothing, so the second dhcpcd process for wlan0 is forked


> 6) Now there are two DHCP processes running
>
> May this is the reason why NM forgets some DHCP processes.
>
> 2009/2/20 cee1 <fykcee1 gmail com>
>         Hi, Dan
>         I've tried 0.7.1-rc1, and modify the code slightly:
>         <nm_dhcp_client_start>: kill the DHCP client according
>         device->pid. In my test, the problem seems to be fixed.
>
>         Is it a little reasonable that killing a DHCP client by
>         device->pid?
>         First, the logical is consistent with
>         <nm_dhcp_manager_cancel_transaction_real>, which kills a DHCP
>         client by device->pid.
>         Second, it seems more reliablable.
>         Third, although it is possible to kill the DHCP client which
>         is not created by NM, by pidfile,
>         that DHCP client may not use the same pidfile. (since the code
>         in <nm-dhcp-dhclient.c> specified the path of the pidfile)
>
>         BTW, I still see some log message like "NetworkManager: <WARN>
>         nm_dhcp_manager_handle_event(): Received DHCP event from
>         unexpected PID 16219 (expected 16221)" (but I'm sure the
>         process has exited eailier.) I guess NM received the data from
>         the process eailier, but was busy, and then it had chance to
>         process the data, but the process had exited. Am I right?
>
>         I'm runing neither SELinux nor AppArmor. I will try to find
>         out the reason why the pidfile disappears while there is still
>         a DHCP client running : a) The DHCP client fails to create
>         pidfile ? b) The DHCP client delete the pidfile when it is
>         running? c) ...
>
>         2009/2/20 Dan Williams <dcbw redhat com>
>
>
>                 On Thu, 2009-02-19 at 20:37 +0800, cee1 wrote:
>                 > Hi, Dan
>                 >      I've reviewed  the code: when starts a new dhcp
>                 process (on the
>                 > same net interface), the previous one will be
>                 terminated. So the there
>                 > should always be one dhcp process for the same net
>                 interface, am I
>                 > right?
>
>
>                 Correct.
>
>                 >      But in my test (still with tag NM-0.7.0 &
>                 dhclient-3.1.0),
>                 > switching between APs, some log message shows NM
>                 can't get pid from
>                 > pidfile (the pidfile doesn't  exist), but the
>                 *previous* dhcp process
>                 > is actully alive . This is really strange.
>
>
>                 Yeah, it's strange.  Are you running something like
>                 SELinux or AppArmor?
>
>                 Dan
>
>
>                 > 2009/2/19 cee1 <fykcee1 gmail com>
>                 >         Hi, Dan
>                 >         Thanks for the replay.
>                 >         The code of NetworkManger is downloaded from
>                 gnome's ftp
>                 >         server with version 0.7.0.
>                 >         I've tried this NM-0.7 with both
>                 dhcpcd-4.0.10 and
>                 >         dhclient-3.1.0.
>                 >
>                 >         Following your advice, I add some log output
>                 code, and find
>                 >         the device->state in function
>                 >         <nm_dhcp_manager_begin_transaction> is
>                 DHC_NBI (0) or DHC_END
>                 >         (15).
>                 >
>                 >         In the case of dhcpcd-4.0.10, I see some log
>                 messages like:
>                 >         "wlan0: dhcpcd already running on pid 1626
>                 >         (/var/run/dhcpcd-wlan0.pid)", and NM forgets
>                 some dhcpcd
>                 >         processes.
>                 >         Also I see log messages like "dhcpcd[1630]:
>                 write signal 15:
>                 >         Bad file descriptor", does this suggests
>                 that the dhcpcd
>                 >         process can't write the pid file??
>                 >
>                 >         In the case of dhclient-3.1.0, (in my
>                 testing period), no
>                 >         <dhcp_watch_cb> called (doesn't like the
>                 case of
>                 >         dhcpcd-4.0.10), but still, log messages like
>                 >         "nm_dhcp_manager_handle_event(): Received
>                 DHCP event from
>                 >         unexpected PID 19924 (expected 0)" appear.
>                 >         I see only one log message like "dhclient:
>                 There is already a
>                 >         pid file /usr/var/run/dhclient-wlan0.pid
>                 with pid 19911".
>                 >         This time, NM forgets dhclient processes
>                 more easily. When I
>                 >         switch connections between APs, the number
>                 of dhclient
>                 >         processes increases to 4 or 5 in a short
>                 time (which doesn't
>                 >         like the case of dhcpcd-4.0.10).
>                 >
>                 >         Do these suggest both of the two dhcp
>                 clients above having
>                 >         problems?
>                 >         Is the pidfile
>                 >
>                 reliable?(<nm_dhcp_manager_cancel_transaction_real>
>                 uses
>                 >         device->pid to kill the dhcp process, but
>                 <nm_dhcp_client>
>                 >         uses pidfile)
>                 >
>                 >         BTW, thanks for your work, I will try the
>                 code with the tag
>                 >         0.7.1rc1 later.
>                 >
>                 >         2009/2/19 Dan Williams <dcbw redhat com>
>                 >
>                 >
>                 >                 On Wed, 2009-02-18 at 17:10 +0800,
>                 cee1 wrote:
>                 >                 > Hi,
>                 >                 >      I'm using
>                 NetwokManager-0.7(using dhcpcd) with
>                 >                 wifi, and see a
>                 >                 > lot of messages like  "Received
>                 DHCP event from
>                 >                 unexpected PID xxxx
>                 >                 > (expected xxxx)" in the syslog.
>                 >
>                 >
>                 >                 What version of NM specifically?
>                  The final NM release
>                 >                 from late
>                 >                 November 2008, or a snapshot from
>                 some other point in
>                 >                 time?
>                 >
>                 >                 >      I also see more than one
>                 dhcpcd processes
>                 >                 running. Sometimes,
>                 >                 > only one dhcpcd process lefts at
>                 last, but sometimes
>                 >                 not. (searching
>                 >                 > google, find the same problem
>                 someone has reported:
>                 >                 >
>                 >
>                 https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/255829)
>                 >                 >      Following the source
>                 code(NM-0.7), in the
>                 >                 function
>                 >                 > <nm_dhcp_client_start>
>                 (nm-dhcp-dhcpcd.c), it kills
>                 >                 the previous
>                 >                 > dhcpcd process and saves the new
>                 one's pid in
>                 >                 device->pid. But the
>                 >                 > function <dhcp_watch_cb>, knowing
>                 the termination of
>                 >                 the previous
>                 >                 > dhcpd process, sets device->pid to
>                 zero. So the
>                 >                 current dhcp process's
>                 >                 > pid is lost, and
>                 <nm_dhcp_manager_handle_event> will
>                 >                 print log like
>                 >                 > "Received DHCP event from
>                 unexpected PID 8812
>                 >                 (expected 0)".
>                 >
>                 >
>                 >                 Obviously the previous one isn't
>                 getting cleaned up
>                 >                 correctly.  It might
>                 >                 be that dhcpcd returns a state NM
>                 doesn't expect, and
>                 >                 thus isn't getting
>                 >                 cleaned up in
>                 nm_dhcp_manager_begin_transaction():
>                 >
>                 >                        if (state_is_bound
>                 (device->state) ||
>                 >                 (device->state == DHC_START)) {
>                 >                                /* Cancel any DHCP
>                 transaction already
>                 >                 in progress */
>                 >
>                  nm_dhcp_manager_cancel_transaction_real
>                 >                 (device);
>                 >                        }
>                 >
>                 >                 Basically, it should *always* be the
>                 case that a
>                 >                 previous dhcp client
>                 >                 instance is cleaned up before
>                 starting a new one, so
>                 >                 the watch callback
>                 >                 should certainly be removed before
>                 starting a new dhcp
>                 >                 client run.
>                 >
>                 >                 >      Also the previous dhcpcd
>                 process's time_out
>                 >                 handler may expire,
>                 >                 > and affects the current running
>                 dhcpcd process?
>                 >
>                 >
>                 >                 The timeout gets cleared whenever
>                 DHCP options get
>                 >                 returned, or when the
>                 >                 dhcp device gets destroyed, or when
>                 >                 nm_dhcp_manager_cancel_transaction()
>                 >                 is called to clean up any ongoing
>                 DHCP transaction.
>                 >                 cancel_transaction() gets called
>                 when the device is
>                 >                 deactivated.
>                 >
>                 >                 In any case, I've fixed up a few
>                 issues with the DHCP
>                 >                 client handling
>                 >                 code and made it somewhat more
>                 robust against this
>                 >                 issue.  If you can
>                 >                 possibly test the latest git, or at
>                 least figure out
>                 >                 what state the DHCP
>                 >                 client is left in (device->state)
>                 when
>                 >                 nm_dhcp_manager_begin_transaction()
>                 gets called, that
>                 >                 would be great.
>                 >
>                 >                 Dan
>                 >
>                 >                 >      It seems the function
>                 >                 <supplicant_interface_acquire>
>                 >                 > (nm-device-wifi.c) registers a
>                 function
>                 >                 >
>                 <supplicant_iface_connection_state_cb> as the
>                 >                 handler of the signal
>                 >                 > "connection-state" .
>                 >                 > And:
>                 >                 >
>                 <supplicant_iface_connection_state_cb> shcedules
>                 >                 >
>                 <supplicant_iface_connection_state_cb_handler>,
>                 >                 which calls
>                 >                 >
>                 >
>                 <nm_device_activate_schedule_stage3_ip_config_start>,
>                 >                 which shcedules
>                 >                 >
>                 <nm_device_activate_stage3_ip_config_start> , which
>                 >                 calls
>                 >                 > <real_act_stage3_ip_config_start>,
>                 which calls
>                 >                 >
>                 <nm_dhcp_manager_begin_transaction>.
>                 >                 > So it is possiable that more than
>                 one dhcpcd
>                 >                 processes running
>                 >                 > meanwhile.
>                 >                 >
>                 >                 >     To fix this, shoud I do:
>                 >                 > 1) Protect the corresponding data
>                 struct
>                 >                 > or
>                 >                 > 2) Add a clean step when a second
>                 dhcpcd process to
>                 >                 be started
>                 >                 >
>                 >                 >      Any ideas? Thanks.
>                 >
>                 >                 >
>                 _______________________________________________
>                 >                 > NetworkManager-list mailing list
>                 >                 > NetworkManager-list gnome org
>                 >                 >
>                 >
>                 http://mail.gnome.org/mailman/listinfo/networkmanager-list
>                 >
>                 >
>                 >
>
>
>
>




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]