Re: About "Received DHCP event from unexpected PID xxxx (expected xxxx)"



Hi, Dan
I've read some code snippets of dhcpcd: it tries to read the pidfile, and then writes(the pid to pidfile) & flock the pidfile.

Image a situation:
1) NM forks a DHCP process
2) The DHCP process begins to execute
3) The DHCP process reads the pidfile
4) The DHCP process blocks (since it doing some I/O), and OS chooses NM to run again
5) NM decieds to start another DHCP routine, and it finds no pidfile exists, so it forks another DHCP process
6) Now there are two DHCP processes running

May this is the reason why NM forgets some DHCP processes.

2009/2/20 cee1 <fykcee1 gmail com>
Hi, Dan
I've tried 0.7.1-rc1, and modify the code slightly: <nm_dhcp_client_start>: kill the DHCP client according device->pid. In my test, the problem seems to be fixed.

Is it a little reasonable that killing a DHCP client by device->pid?
First, the logical is consistent with <nm_dhcp_manager_cancel_transaction_real>, which kills a DHCP client by device->pid.
Second, it seems more reliablable.
Third, although it is possible to kill the DHCP client which is not created by NM, by pidfile,
that DHCP client may not use the same pidfile. (since the code in <nm-dhcp-dhclient.c> specified the path of the pidfile)

BTW, I still see some log message like "NetworkManager: <WARN>  nm_dhcp_manager_handle_event(): Received DHCP event from unexpected PID 16219 (expected 16221)" (but I'm sure the process has exited eailier.) I guess NM received the data from the process eailier, but was busy, and then it had chance to process the data, but the process had exited. Am I right?

I'm runing neither SELinux nor AppArmor. I will try to find out the reason why the pidfile disappears while there is still a DHCP client running : a) The DHCP client fails to create pidfile ? b) The DHCP client delete the pidfile when it is running? c) ...

2009/2/20 Dan Williams <dcbw redhat com>

On Thu, 2009-02-19 at 20:37 +0800, cee1 wrote:
> Hi, Dan
>      I've reviewed  the code: when starts a new dhcp process (on the
> same net interface), the previous one will be terminated. So the there
> should always be one dhcp process for the same net interface, am I
> right?

Correct.

>      But in my test (still with tag NM-0.7.0 & dhclient-3.1.0),
> switching between APs, some log message shows NM can't get pid from
> pidfile (the pidfile doesn't  exist), but the *previous* dhcp process
> is actully alive . This is really strange.

Yeah, it's strange.  Are you running something like SELinux or AppArmor?

Dan

> 2009/2/19 cee1 <fykcee1 gmail com>
>         Hi, Dan
>         Thanks for the replay.
>         The code of NetworkManger is downloaded from gnome's ftp
>         server with version 0.7.0.
>         I've tried this NM-0.7 with both dhcpcd-4.0.10 and
>         dhclient-3.1.0.
>
>         Following your advice, I add some log output code, and find
>         the device->state in function
>         <nm_dhcp_manager_begin_transaction> is DHC_NBI (0) or DHC_END
>         (15).
>
>         In the case of dhcpcd-4.0.10, I see some log messages like:
>         "wlan0: dhcpcd already running on pid 1626
>         (/var/run/dhcpcd-wlan0.pid)", and NM forgets some dhcpcd
>         processes.
>         Also I see log messages like "dhcpcd[1630]: write signal 15:
>         Bad file descriptor", does this suggests that the dhcpcd
>         process can't write the pid file??
>
>         In the case of dhclient-3.1.0, (in my testing period), no
>         <dhcp_watch_cb> called (doesn't like the case of
>         dhcpcd-4.0.10), but still, log messages like
>         "nm_dhcp_manager_handle_event(): Received DHCP event from
>         unexpected PID 19924 (expected 0)" appear.
>         I see only one log message like "dhclient: There is already a
>         pid file /usr/var/run/dhclient-wlan0.pid with pid 19911".
>         This time, NM forgets dhclient processes more easily. When I
>         switch connections between APs, the number of dhclient
>         processes increases to 4 or 5 in a short time (which doesn't
>         like the case of dhcpcd-4.0.10).
>
>         Do these suggest both of the two dhcp clients above having
>         problems?
>         Is the pidfile
>         reliable?(<nm_dhcp_manager_cancel_transaction_real> uses
>         device->pid to kill the dhcp process, but <nm_dhcp_client>
>         uses pidfile)
>
>         BTW, thanks for your work, I will try the code with the tag
>         0.7.1rc1 later.
>
>         2009/2/19 Dan Williams <dcbw redhat com>
>
>
>                 On Wed, 2009-02-18 at 17:10 +0800, cee1 wrote:
>                 > Hi,
>                 >      I'm using NetwokManager-0.7(using dhcpcd) with
>                 wifi, and see a
>                 > lot of messages like  "Received DHCP event from
>                 unexpected PID xxxx
>                 > (expected xxxx)" in the syslog.
>
>
>                 What version of NM specifically?  The final NM release
>                 from late
>                 November 2008, or a snapshot from some other point in
>                 time?
>
>                 >      I also see more than one dhcpcd processes
>                 running. Sometimes,
>                 > only one dhcpcd process lefts at last, but sometimes
>                 not. (searching
>                 > google, find the same problem someone has reported:
>                 >
>                 https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/255829)
>                 >      Following the source code(NM-0.7), in the
>                 function
>                 > <nm_dhcp_client_start> (nm-dhcp-dhcpcd.c), it kills
>                 the previous
>                 > dhcpcd process and saves the new one's pid in
>                 device->pid. But the
>                 > function <dhcp_watch_cb>, knowing the termination of
>                 the previous
>                 > dhcpd process, sets device->pid to zero. So the
>                 current dhcp process's
>                 > pid is lost, and <nm_dhcp_manager_handle_event> will
>                 print log like
>                 > "Received DHCP event from unexpected PID 8812
>                 (expected 0)".
>
>
>                 Obviously the previous one isn't getting cleaned up
>                 correctly.  It might
>                 be that dhcpcd returns a state NM doesn't expect, and
>                 thus isn't getting
>                 cleaned up in nm_dhcp_manager_begin_transaction():
>
>                        if (state_is_bound (device->state) ||
>                 (device->state == DHC_START)) {
>                                /* Cancel any DHCP transaction already
>                 in progress */
>                                nm_dhcp_manager_cancel_transaction_real
>                 (device);
>                        }
>
>                 Basically, it should *always* be the case that a
>                 previous dhcp client
>                 instance is cleaned up before starting a new one, so
>                 the watch callback
>                 should certainly be removed before starting a new dhcp
>                 client run.
>
>                 >      Also the previous dhcpcd process's time_out
>                 handler may expire,
>                 > and affects the current running dhcpcd process?
>
>
>                 The timeout gets cleared whenever DHCP options get
>                 returned, or when the
>                 dhcp device gets destroyed, or when
>                 nm_dhcp_manager_cancel_transaction()
>                 is called to clean up any ongoing DHCP transaction.
>                 cancel_transaction() gets called when the device is
>                 deactivated.
>
>                 In any case, I've fixed up a few issues with the DHCP
>                 client handling
>                 code and made it somewhat more robust against this
>                 issue.  If you can
>                 possibly test the latest git, or at least figure out
>                 what state the DHCP
>                 client is left in (device->state) when
>                 nm_dhcp_manager_begin_transaction() gets called, that
>                 would be great.
>
>                 Dan
>
>                 >      It seems the function
>                 <supplicant_interface_acquire>
>                 > (nm-device-wifi.c) registers a function
>                 > <supplicant_iface_connection_state_cb> as the
>                 handler of the signal
>                 > "connection-state" .
>                 > And:
>                 > <supplicant_iface_connection_state_cb> shcedules
>                 > <supplicant_iface_connection_state_cb_handler>,
>                 which calls
>                 >
>                 <nm_device_activate_schedule_stage3_ip_config_start>,
>                 which shcedules
>                 > <nm_device_activate_stage3_ip_config_start> , which
>                 calls
>                 > <real_act_stage3_ip_config_start>, which calls
>                 > <nm_dhcp_manager_begin_transaction>.
>                 > So it is possiable that more than one dhcpcd
>                 processes running
>                 > meanwhile.
>                 >
>                 >     To fix this, shoud I do:
>                 > 1) Protect the corresponding data struct
>                 > or
>                 > 2) Add a clean step when a second dhcpcd process to
>                 be started
>                 >
>                 >      Any ideas? Thanks.
>
>                 > _______________________________________________
>                 > NetworkManager-list mailing list
>                 > NetworkManager-list gnome org
>                 >
>                 http://mail.gnome.org/mailman/listinfo/networkmanager-list
>
>
>





[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]