Re: About "Received DHCP event from unexpected PID xxxx (expected xxxx)"
- From: Dan Williams <dcbw redhat com>
- To: cee1 <fykcee1 gmail com>
- Cc: networkmanager-list gnome org
- Subject: Re: About "Received DHCP event from unexpected PID xxxx (expected xxxx)"
- Date: Sun, 22 Feb 2009 10:48:03 -0500
On Sun, 2009-02-22 at 15:39 +0800, cee1 wrote:
> 2009/2/22 Dan Williams <dcbw redhat com>
>
> On Sat, 2009-02-21 at 16:38 +0800, cee1 wrote:
> > 2009/2/20 Dan Williams <dcbw redhat com>
> > On Fri, 2009-02-20 at 19:22 +0800, cee1 wrote:
> > > Hi, Dan
> >
> > > I've read some code snippets of dhcpcd: it tries
> to read the
> > pidfile,
> > > and then writes(the pid to pidfile) & flock the
> pidfile.
> > >
> > > Image a situation:
> > > 1) NM forks a DHCP process
> > > 2) The DHCP process begins to execute
> > > 3) The DHCP process reads the pidfile
> > > 4) The DHCP process blocks (since it doing some
> I/O), and OS
> > chooses
> > > NM to run again
> > > 5) NM decieds to start another DHCP routine, and
> it finds no
> > pidfile
> > > exists, so it forks another DHCP process
> >
> >
> > NM decides to start another DHCP process for the
> same device,
> > or a
> > different device? It's normal that there should be
> more than
> > one DHCP
> > client running when multiple interfaces are being
> used, but
> > there should
> > only ever be one DHCP client per interface.
> >
> > Dan
> >
> >
> > Sorry for not making myself clear.
> > Say using dhcpcd as the DHCP backend, the following sequence
> of
> > operations causes more than one dhcpcd processes running for
> the same
> > net interface.
> >
> > 1) NM forks a dhcpcd process for wlan0, OS runs the dhcpcd
> process,
> > stops the execution of NM (assuming the cpu has a single
> core, without
> > SMT or HT enabled)
> > 2) The dhcpcd process begins to execute, it tries to read
> the file
> > "/var/run/dhcpcd-wlan0.pid", this I/O operation makes it
> block
> > 3) OS chooses NM to run again
> > 4) NM decides to fork another dhcpcd process for wla0
>
>
> What would make NM decide to fork another dhcpcd process for
> wlan0?
>
> NM calls nm_dhcp_manager_begin_transaction() from
> real_act_stage3_ip_config_start(), and the dhcpcd process is
> created.
> But there's a timeout of 45 seconds between when the DHCP
> client is
> started, and when NM decides its taken too long and kills it.
> Certainly
> within that timeout, the OS has scheduled dhcpcd again.
>
> Or, the user can switch connections, which will cause the old
> connection
> to be torn down.
> What causes the old connection to be torn down? Is it "if (device->pid
> && (device->state < DHC_ABEND))" ? if so, is the condition (in
> <nm_dhcp_client_start>) "if (g_file_get_contents (device->pid_file,
> &pid_contents, NULL, NULL))" always false?
Either the user can switch connections from the menu, which will cause
the old connection to be torn down, or the device can loose carrier (for
ethernet) or association to the access point (wifi) or the PPP link can
drop (for 3G).
The "if (device->pid && (device->state < DHC_ABEND))" is only executed
*after* the decision to tear down the connection has been made, and its
purpose is to kill any existing DHCP client process that may still be
running when the connection is torn down.
Dan
> This ends up in
> nm_dhcp_manager_cancel_transaction_real(), which calls
> nm_dhcp_client_stop(), which explicitly kills the old dhcpcd
> process
> using the PID from the original g_spawn_async() call, which is
> always
> correct...
>
> So I guess I'm just not seeing what would cause NM to try to
> start two
> dhcpcd processes at the same time. There could well be a bug,
> but I
> don't quite follow your explanation yet. Any chance you could
> clarify
> why your NM is starting a second dhcpcd?
>
> Dan
>
>
> > 5) NM tries to read the file "/var/run/dhcpcd-wlan0.pid",
> but it gets
> > nothing, so the second dhcpcd process for wlan0 is forked
> >
> >
> >
> > > 6) Now there are two DHCP processes running
> > >
> > > May this is the reason why NM forgets some DHCP
> processes.
> > >
> > > 2009/2/20 cee1 <fykcee1 gmail com>
> > > Hi, Dan
> > > I've tried 0.7.1-rc1, and modify the code
> slightly:
> > > <nm_dhcp_client_start>: kill the DHCP
> client
> > according
> > > device->pid. In my test, the problem seems
> to be
> > fixed.
> > >
> > > Is it a little reasonable that killing a
> DHCP client
> > by
> > > device->pid?
> > > First, the logical is consistent with
> > > <nm_dhcp_manager_cancel_transaction_real>,
> which
> > kills a DHCP
> > > client by device->pid.
> > > Second, it seems more reliablable.
> > > Third, although it is possible to kill the
> DHCP
> > client which
> > > is not created by NM, by pidfile,
> > > that DHCP client may not use the same
> pidfile.
> > (since the code
> > > in <nm-dhcp-dhclient.c> specified the path
> of the
> > pidfile)
> > >
> > > BTW, I still see some log message like
> > "NetworkManager: <WARN>
> > > nm_dhcp_manager_handle_event(): Received
> DHCP event
> > from
> > > unexpected PID 16219 (expected
> 16221)" (but I'm sure
> > the
> > > process has exited eailier.) I guess NM
> received the
> > data from
> > > the process eailier, but was busy, and
> then it had
> > chance to
> > > process the data, but the process had
> exited. Am I
> > right?
> > >
> > > I'm runing neither SELinux nor AppArmor. I
> will try
> > to find
> > > out the reason why the pidfile disappears
> while
> > there is still
> > > a DHCP client running : a) The DHCP client
> fails to
> > create
> > > pidfile ? b) The DHCP client delete the
> pidfile when
> > it is
> > > running? c) ...
> > >
> > > 2009/2/20 Dan Williams <dcbw redhat com>
> > >
> > >
> > > On Thu, 2009-02-19 at 20:37 +0800,
> cee1
> > wrote:
> > > > Hi, Dan
> > > > I've reviewed the code:
> when starts
> > a new dhcp
> > > process (on the
> > > > same net interface), the
> previous one will
> > be
> > > terminated. So the there
> > > > should always be one dhcp
> process for the
> > same net
> > > interface, am I
> > > > right?
> > >
> > >
> > > Correct.
> > >
> > > > But in my test (still with
> tag
> > NM-0.7.0 &
> > > dhclient-3.1.0),
> > > > switching between APs, some log
> message
> > shows NM
> > > can't get pid from
> > > > pidfile (the pidfile doesn't
> exist), but
> > the
> > > *previous* dhcp process
> > > > is actully alive . This is
> really strange.
> > >
> > >
> > > Yeah, it's strange. Are you
> running
> > something like
> > > SELinux or AppArmor?
> > >
> > > Dan
> > >
> > >
> > > > 2009/2/19 cee1
> <fykcee1 gmail com>
> > > > Hi, Dan
> > > > Thanks for the replay.
> > > > The code of
> NetworkManger is
> > downloaded from
> > > gnome's ftp
> > > > server with version
> 0.7.0.
> > > > I've tried this NM-0.7
> with both
> > > dhcpcd-4.0.10 and
> > > > dhclient-3.1.0.
> > > >
> > > > Following your advice, I
> add some
> > log output
> > > code, and find
> > > > the device->state in
> function
> > > >
> > <nm_dhcp_manager_begin_transaction> is
> > > DHC_NBI (0) or DHC_END
> > > > (15).
> > > >
> > > > In the case of
> dhcpcd-4.0.10, I
> > see some log
> > > messages like:
> > > > "wlan0: dhcpcd already
> running on
> > pid 1626
> > > >
> (/var/run/dhcpcd-wlan0.pid)", and
> > NM forgets
> > > some dhcpcd
> > > > processes.
> > > > Also I see log messages
> like
> > "dhcpcd[1630]:
> > > write signal 15:
> > > > Bad file descriptor",
> does this
> > suggests
> > > that the dhcpcd
> > > > process can't write the
> pid file??
> > > >
> > > > In the case of
> dhclient-3.1.0, (in
> > my
> > > testing period), no
> > > > <dhcp_watch_cb> called
> (doesn't
> > like the
> > > case of
> > > > dhcpcd-4.0.10), but
> still, log
> > messages like
> > > >
> "nm_dhcp_manager_handle_event():
> > Received
> > > DHCP event from
> > > > unexpected PID 19924
> (expected 0)"
> > appear.
> > > > I see only one log
> message like
> > "dhclient:
> > > There is already a
> > > > pid
> > file /usr/var/run/dhclient-wlan0.pid
> > > with pid 19911".
> > > > This time, NM forgets
> dhclient
> > processes
> > > more easily. When I
> > > > switch connections
> between APs,
> > the number
> > > of dhclient
> > > > processes increases to 4
> or 5 in a
> > short
> > > time (which doesn't
> > > > like the case of
> dhcpcd-4.0.10).
> > > >
> > > > Do these suggest both of
> the two
> > dhcp
> > > clients above having
> > > > problems?
> > > > Is the pidfile
> > > >
> > >
> > reliable?(<nm_dhcp_manager_cancel_transaction_real>
> > > uses
> > > > device->pid to kill the
> dhcp
> > process, but
> > > <nm_dhcp_client>
> > > > uses pidfile)
> > > >
> > > > BTW, thanks for your
> work, I will
> > try the
> > > code with the tag
> > > > 0.7.1rc1 later.
> > > >
> > > > 2009/2/19 Dan Williams
> > <dcbw redhat com>
> > > >
> > > >
> > > > On Wed,
> 2009-02-18 at
> > 17:10 +0800,
> > > cee1 wrote:
> > > > > Hi,
> > > > > I'm using
> > > NetwokManager-0.7(using dhcpcd)
> with
> > > > wifi, and see a
> > > > > lot of
> messages like
> > "Received
> > > DHCP event from
> > > > unexpected PID
> xxxx
> > > > > (expected
> xxxx)" in the
> > syslog.
> > > >
> > > >
> > > > What version of
> NM
> > specifically?
> > > The final NM release
> > > > from late
> > > > November 2008,
> or a
> > snapshot from
> > > some other point in
> > > > time?
> > > >
> > > > > I also
> see more
> > than one
> > > dhcpcd processes
> > > > running.
> Sometimes,
> > > > > only one
> dhcpcd process
> > lefts at
> > > last, but sometimes
> > > > not. (searching
> > > > > google, find
> the same
> > problem
> > > someone has reported:
> > > > >
> > > >
> > >
> >
> https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/255829)
> > > > > Following
> the
> > source
> > > code(NM-0.7), in the
> > > > function
> > > > >
> <nm_dhcp_client_start>
> > > (nm-dhcp-dhcpcd.c), it kills
> > > > the previous
> > > > > dhcpcd process
> and saves
> > the new
> > > one's pid in
> > > > device->pid. But
> the
> > > > > function
> > <dhcp_watch_cb>, knowing
> > > the termination of
> > > > the previous
> > > > > dhcpd process,
> sets
> > device->pid to
> > > zero. So the
> > > > current dhcp
> process's
> > > > > pid is lost,
> and
> > > <nm_dhcp_manager_handle_event>
> will
> > > > print log like
> > > > > "Received DHCP
> event
> > from
> > > unexpected PID 8812
> > > > (expected 0)".
> > > >
> > > >
> > > > Obviously the
> previous one
> > isn't
> > > getting cleaned up
> > > > correctly. It
> might
> > > > be that dhcpcd
> returns a
> > state NM
> > > doesn't expect, and
> > > > thus isn't
> getting
> > > > cleaned up in
> > >
> nm_dhcp_manager_begin_transaction():
> > > >
> > > > if
> (state_is_bound
> > > (device->state) ||
> > > > (device->state
> ==
> > DHC_START)) {
> > > >
> /* Cancel
> > any DHCP
> > > transaction already
> > > > in progress */
> > > >
> > >
> nm_dhcp_manager_cancel_transaction_real
> > > > (device);
> > > > }
> > > >
> > > > Basically, it
> should
> > *always* be the
> > > case that a
> > > > previous dhcp
> client
> > > > instance is
> cleaned up
> > before
> > > starting a new one, so
> > > > the watch
> callback
> > > > should certainly
> be
> > removed before
> > > starting a new dhcp
> > > > client run.
> > > >
> > > > > Also the
> previous
> > dhcpcd
> > > process's time_out
> > > > handler may
> expire,
> > > > > and affects
> the current
> > running
> > > dhcpcd process?
> > > >
> > > >
> > > > The timeout gets
> cleared
> > whenever
> > > DHCP options get
> > > > returned, or
> when the
> > > > dhcp device gets
> > destroyed, or when
> > > >
> > nm_dhcp_manager_cancel_transaction()
> > > > is called to
> clean up any
> > ongoing
> > > DHCP transaction.
> > > >
> cancel_transaction() gets
> > called
> > > when the device is
> > > > deactivated.
> > > >
> > > > In any case,
> I've fixed up
> > a few
> > > issues with the DHCP
> > > > client handling
> > > > code and made it
> somewhat
> > more
> > > robust against this
> > > > issue. If you
> can
> > > > possibly test
> the latest
> > git, or at
> > > least figure out
> > > > what state the
> DHCP
> > > > client is left
> in
> > (device->state)
> > > when
> > > >
> > nm_dhcp_manager_begin_transaction()
> > > gets called, that
> > > > would be great.
> > > >
> > > > Dan
> > > >
> > > > > It seems
> the
> > function
> > > >
> > <supplicant_interface_acquire>
> > > > >
> (nm-device-wifi.c)
> > registers a
> > > function
> > > > >
> > >
> <supplicant_iface_connection_state_cb> as
> > the
> > > > handler of the
> signal
> > > > >
> "connection-state" .
> > > > > And:
> > > > >
> > >
> <supplicant_iface_connection_state_cb>
> > shcedules
> > > > >
> > >
> > <supplicant_iface_connection_state_cb_handler>,
> > > > which calls
> > > > >
> > > >
> > >
> >
> <nm_device_activate_schedule_stage3_ip_config_start>,
> > > > which shcedules
> > > > >
> > >
> > <nm_device_activate_stage3_ip_config_start> , which
> > > > calls
> > > > >
> > <real_act_stage3_ip_config_start>,
> > > which calls
> > > > >
> > >
> <nm_dhcp_manager_begin_transaction>.
> > > > > So it is
> possiable that
> > more than
> > > one dhcpcd
> > > > processes
> running
> > > > > meanwhile.
> > > > >
> > > > > To fix
> this, shoud I
> > do:
> > > > > 1) Protect the
> > corresponding data
> > > struct
> > > > > or
> > > > > 2) Add a clean
> step when
> > a second
> > > dhcpcd process to
> > > > be started
> > > > >
> > > > > Any
> ideas? Thanks.
> > > >
> > > > >
> > >
> > _______________________________________________
> > > > >
> NetworkManager-list
> > mailing list
> > > > >
> > NetworkManager-list gnome org
> > > > >
> > > >
> > >
> >
> http://mail.gnome.org/mailman/listinfo/networkmanager-list
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]