Oneway Call Dies in ORBit2



We need help with a problem we found in ORBit2 while integrating our software. My co-worker, Donna, describes how ORBit2 stops transmitting data on a oneway call:

To Whom It May Concern,

I am having a write problem with ORBit2.

I am running a CORBA client on an x86 Linux machine. The CORBA server is running on an IXDP425 (ARM processor) Linux machine. I am sending more than 4 Mbit/s of data through this CORBA connection (I have also seen the problem when sending only 1 Mbit/s for about 20 seconds). The remote procedure call from client to server is a one-way call.

Things go well for a while and then the CORBA Client simply stops sending any data.

I have placed printf calls in the ORBit2 source code to try to determine what is going on. Here is what I have pieced together.

Procedure link_connection_writev() (found in ORBit2-2.7.6/linc2/src/linc-connection.c) is being called to write the data.  This procedure calls write_data_T(), which calls the writev() system call.  When things are going well, write_data_T() calls writev() and all of the data is written on this call.

When things start going wrong, write_data_T() calls writev() and only some of the data is written, so write_data_T() loops around and immediately calls writev() again. This second call fails with EAGAIN. Because this is a one-way call, the connection options include LINK_CONNECTION_NONBLOCKING, so write_data_T() returns LINK_IO_QUEUED_DATA to link_connection_writev().



static glong
write_data_T (LinkConnection *cnx, QueuedWrite *qw)
{
                                :                               :                              :
                                :                               :                              :
			n = writev (cnx->priv->fd, qw->vecs,
				    MIN (qw->nvecs, LINK_IOV_MAX));

		d_printf ("wrote %d bytes\n", n);

		if (n < 0) {
#ifdef LINK_SSL_SUPPORT
			if (cnx->options & LINK_CONNECTION_SSL) {
				gulong rv;
					
				rv = SSL_get_error (cnx->priv->ssl, n);
					
				if ((rv == SSL_ERROR_WANT_READ || 
				     rv == SSL_ERROR_WANT_WRITE) &&
				    cnx->options & LINK_CONNECTION_NONBLOCKING)
					return LINK_IO_QUEUED_DATA;
				else
					return LINK_IO_FATAL_ERROR;
			} else
#endif
			{
				if (errno == EINTR)
					continue;

				/* ---- Note code between marks ---- */
				else if (errno == EAGAIN &&
					 (cnx->options & LINK_CONNECTION_NONBLOCKING))
					return LINK_IO_QUEUED_DATA;
				/* ---- End note ---- */

				else if (errno == EBADF)
					g_warning ("Serious fd usage error %d", cnx->priv->fd);
				
				return LINK_IO_FATAL_ERROR; /* Unhandlable error */
			}
                                :                               :                              :
                                :                               :                              :
}
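
The sequence described above (a short write followed immediately by EAGAIN) is ordinary behavior for a non-blocking descriptor whose kernel buffer has filled up, and it can be reproduced without ORBit2. The following is a minimal sketch under that assumption. The names set_nonblocking(), probe_partial_write(), and struct write_probe are hypothetical, made up for this illustration, and are not part of ORBit2 or linc.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical record of two back-to-back writev() calls. */
struct write_probe {
    ssize_t first;        /* bytes accepted by the first writev() */
    ssize_t second;       /* return value of the immediate retry */
    int     second_errno; /* errno of the retry if it failed, else 0 */
};

/* Write len bytes with one writev(), then retry the remainder at once,
 * the way a tight write loop would. Returns 0 if both calls were made. */
static int probe_partial_write(int fd, size_t len, struct write_probe *out)
{
    struct iovec vec;
    char *buf;

    assert(out != NULL);
    buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'x', len);

    vec.iov_base = buf;
    vec.iov_len = len;
    do {
        out->first = writev(fd, &vec, 1);
    } while (out->first < 0 && errno == EINTR);
    if (out->first < 0) {
        free(buf);
        return -1;
    }

    /* Retry immediately with whatever the kernel did not accept. */
    vec.iov_base = buf + out->first;
    vec.iov_len = len - (size_t)out->first;
    do {
        out->second = writev(fd, &vec, 1);
    } while (out->second < 0 && errno == EINTR);
    out->second_errno = (out->second < 0) ? errno : 0;

    free(buf);
    return 0;
}
```

When this is run against the write end of a fresh non-blocking pipe or socket, with a length larger than the kernel buffer and nothing reading the other end, the first call should report a short count and the retry should fail with EAGAIN. That is the same sequence write_data_T() sees.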


At this point, link_connection_writev() queues the data that was not written onto the connection's private write queue (cnx->priv->write_queue) and returns LINK_IO_QUEUED_DATA to its caller, which indicates that everything is OK. From then on, link_connection_writev() never calls write_data_T() again, because the connection's private write queue is never cleared. It keeps returning LINK_IO_QUEUED_DATA to its caller, indicating that everything is OK. The application is still running (just not sending data), but its memory usage climbs very quickly and it never recovers.

To confirm that it never recovers, I stopped all data traffic as soon as write_data_T() stopped being called. An hour later I tried to send one 64-byte packet through this interface, but write_data_T() was still not being called. The application was still running.


LinkIOStatus
link_connection_writev (LinkConnection       *cnx,
			struct iovec         *vecs,
			int                   nvecs,
			const LinkWriteOpts  *opt_write_opts)
{
	QueuedWrite qw;
	int         status;

	CNX_LOCK (cnx);
	link_connection_ref_T (cnx);

	if (link_thread_safe ()) {
		d_printf ("Thread safe writev\n");
		if (cnx->status == LINK_CONNECTING) {
			queue_flattened_T_R (cnx, vecs, nvecs, TRUE);
			link_connection_unref_unlock (cnx);
			return LINK_IO_QUEUED_DATA;
		}
	} else if (cnx->options & LINK_CONNECTION_NONBLOCKING)
		link_connection_wait_connected (cnx);

	if (cnx->status == LINK_DISCONNECTED) {
		link_connection_unref_unlock (cnx);
		return LINK_IO_FATAL_ERROR;
	}

	/* ---- Note code between marks ---- */
	if (cnx->priv->write_queue) {
		/* FIXME: we should really retry the write here, but we'll
		 * get a POLLOUT for this lot at some stage anyway */
		queue_flattened_T_R (cnx, vecs, nvecs, FALSE);
		link_connection_unref_unlock (cnx);
		return LINK_IO_QUEUED_DATA;
	}
	/* ---- End note ---- */

	qw.vecs  = vecs;
	qw.nvecs = nvecs;

 continue_write:
	status = write_data_T (cnx, &qw);

	if (status == LINK_IO_QUEUED_DATA) {

		/* ---- Note code between marks ---- */
		if (link_thread_safe ()) {
			queue_flattened_T_R (cnx, qw.vecs, qw.nvecs, TRUE);
			link_connection_unref_unlock (cnx);
			return LINK_IO_QUEUED_DATA;
		}
		/* ---- End note ---- */

		/* Queue data & listen for buffer space */
		link_watch_set_condition
			(cnx->priv->tag,
			 LINK_ERR_CONDS | LINK_IN_CONDS | G_IO_OUT);
                                :                               :                              :
                                :                               :                              :
}

Looking through the code, I could not figure out how this is supposed to recover. I did not understand the "POLLOUT" comment in link_connection_writev(), and I could not find any path by which the connection's private write queue gets cleared. I do not know whether my setup is causing this problem. Please advise.
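
For context on the "POLLOUT" comment: POLLOUT is the poll() event reporting that a descriptor can accept more data, and G_IO_OUT is GLib's name for the same condition. The comment seems to rely on the usual pattern in which queued data is flushed when that event fires. The sketch below shows the pattern in isolation. It is not ORBit2's implementation, and struct pending, pending_push(), and pending_flush_on_pollout() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a connection's private write queue. */
struct pending {
    char   data[256];
    size_t len;  /* bytes queued */
    size_t off;  /* bytes already written */
};

/* Hypothetical helper: switch fd to non-blocking mode. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Queue bytes that could not be written right away. */
static void pending_push(struct pending *p, const void *buf, size_t len)
{
    assert(p->len + len <= sizeof p->data);
    memcpy(p->data + p->len, buf, len);
    p->len += len;
}

/* Wait up to timeout_ms for POLLOUT, then try to drain the queue.
 * Returns 1 if the queue is now empty, 0 if data is still queued,
 * and -1 on a fatal error. */
static int pending_flush_on_pollout(int fd, struct pending *p, int timeout_ms)
{
    struct pollfd pfd;
    int ready;

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return (errno == EINTR) ? 0 : -1;
    if (ready == 0)
        return 0;                       /* still no buffer space */
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
        return -1;

    while (p->off < p->len) {
        ssize_t n = write(fd, p->data + p->off, p->len - p->off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;               /* buffer filled again; wait for next POLLOUT */
            return -1;
        }
        p->off += (size_t)n;
    }

    p->len = 0;                         /* queue cleared; direct writes can resume */
    p->off = 0;
    return 1;
}
```

In this pattern the queue is only ever cleared from the POLLOUT handler. If nothing polls the descriptor for POLLOUT, or the handler never runs, queued data simply accumulates, which would match the symptoms described above.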

Thank You,
Donna




