avik.ghosh
8/11/2004 12:40:00 AM
avik.ghosh@gmail.com (Avik Ghosh) wrote in message news:<6f04c0dd.0408101023.5b8db4af@posting.google.com>...
> Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<4118DDE5.53F678E6@xemaps.com>...
> > Avik Ghosh wrote:
> > >
> > > Joe Seigh <jseigh_01@xemaps.com> wrote in message news:<4118A958.34EABF79@xemaps
> > > > What do you mean flush the data in the common buffer?
> > > >
> > >
> > > Th#2 copies the data from the common buffer into another buffer which
> > > only it operates. It then attempts to write out this data onto one of
> > > the sockets that it manages. If only part of the data is sent, it
> > > inserts the socket into the select fd_set for writing. It then signals
> > > Th#1 by sending the 'done' message on the pipe, to indicate that it
> > > has 'flushed the buffer', i.e., it has handled the data.
> > >
> > > By 'buffer', I mean a simple struct which has a malloc()ed char *
> > > pointer, and integers to indicate the current length and the malloc
> > > size.
> > >
> > > I will try to run the same application on Solaris today to see if I
> > > face the same race condition.
> > >
> >
> > You're talking messages but read() and write() don't operate on messages, they
> > operate on a byte stream. There's nothing wrong with using read or write as
> > long as you realize there is no correspondence between the sizes of data that
> > you write and the sizes that you read except that the sum of what you read is
> > always less than or equal to the size that you write.
> >
> > Joe Seigh
>
>
> Hi,
>
> Sorry, I should have been a bit more clear. A 'message' is just a
> stream of bytes, as you mention. It is a message from the point of
> view of the application layer on top, complete with header etc. The
> Th#1 and Th#2 ( which are fast becoming old acquaintances ) that I
> mention only know bytestreams.
>
> In a nutshell, the design is this :
>
> Th#2 has a number of sockets that it reads from and writes to. One of these
> is a socket whose peer is Th#1, and when a brief byte sequence is read
> from this socket, Th#2 knows to copy data from a buffer ( into which
> data has been copied by Th#1 prior to sending the byte sequence ) to
> another buffer and to acknowledge receipt to Th#1 by writing another
> byte sequence onto a pipe. This data is then written out onto a socket
> as part of Th#2's standard processing loop.
>
> I should mention that the Th#2 loop is part of a standard messaging
> library that has been in production for years, and is quite stable.
> Only standard read/write/select calls are used. It runs on Solaris,
> Linux and Windows, so there is no special Unix magic involved.
>
> In the application that I mention, I have encapsulated the main event
> loop ( Th#2 ) in a thread. The communication between this event loop
> thread and the main thread is using the socket/pipe combination that I
> have described.
>
> I compile using _REENTRANT for good measure, besides -Wall,
> -Wmissing-prototypes and other switches.
>
> Am I correct in assuming read() and write(), along with select() can
> be safely used with pthreads ? Do I have to do something special, like
> masking signals? ( As I mention, I do not handle any signals, other
> than ignoring SIGPIPE. )
>
> One thing I noticed about the strace output :
>
> When Th#1 is sending a stream of messages to Th#2 ( signalling back
> and forth as above )
>
> I see several calls to kill(pid, RTMIN) ( where pid is the process id
> of Th#2 ) interspersed with the write() and read() calls to the socket
> and pipe respectively. This does not seem to cause any problem, as the
> application continues correctly.
>
> However, the application hangs the moment Th#2 calls rt_sigsuspend(),
> immediately following a successful call to rt_sigprocmask(SIG_SETMASK,
> NULL, [RTMIN], 8)
>
> I feel I must be missing something obvious, like compiling without
> using some special flags or something.
>
> Thanks again for your interest.
>
> Avik.
Right, I have found the problem and, thankfully, it is in my code.

What I had omitted to mention is that the event processing thread
( Th#2 ) handles application-level timers ( also done through
select() ) as well as sockets.

In my application, there is a timer which kicks in every now and then.
This timer obtains a lock, does some processing and releases it.

The race occurs when a large number of exchanges are taking place
between Th#1 and Th#2 as described above, and the timer expires just
after Th#1 has signalled over the socket, but before Th#2 has been
woken up from select().

The select() call returns because Th#1 has signalled that data has to
be sent, but by this time the timer is also due. Th#2 runs the timer
code first, and it deadlocks trying to get the lock ( this lock is
held by Th#1 ).
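
For anyone who runs into the same pattern, here is a minimal sketch of
one possible way to keep a timer callback from blocking the event loop.
It is not the fix applied in the application, which is not described
here. It assumes that Th#1 holds the lock while it waits for the 'done'
message, and every name in it ( app_lock, on_timer, timer_deferred,
run_demo ) is made up for illustration. The idea is to use
pthread_mutex_trylock() in the timer and to back off if the lock is
busy, so that the loop can go on to service the socket and send 'done'.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* All names here are hypothetical; none come from the application. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;
static bool timer_deferred = false;  /* set when the timer had to back off */
static int  timer_runs = 0;          /* counts completed timer bodies */

/* Timer callback, run on the event loop thread ( Th#2 ).
 * Returns true if the timer body ran, false if it was deferred. */
static bool on_timer(void)
{
    int rc = pthread_mutex_trylock(&app_lock);
    if (rc != 0) {
        /* Usually EBUSY: the lock is held elsewhere. Do not block the
         * loop; remember to retry after the sockets have been serviced. */
        timer_deferred = true;
        return false;
    }
    timer_runs++;                    /* stand-in for the real timer work */
    timer_deferred = false;
    pthread_mutex_unlock(&app_lock);
    return true;
}

/* Single-threaded walk-through of the scenario. The calling thread takes
 * the lock to stand in for Th#1. Returns 0 if the timer backed off while
 * the lock was held and ran once after it was released. */
int run_demo(void)
{
    int before = timer_runs;

    if (pthread_mutex_lock(&app_lock) != 0)
        return -1;
    bool ran_while_locked = on_timer();   /* must back off, not block */
    bool was_deferred = timer_deferred;
    pthread_mutex_unlock(&app_lock);      /* "Th#1" has received 'done' */

    bool ran_after_unlock = on_timer();   /* the retry now succeeds */

    if (ran_while_locked || !was_deferred)
        return -1;
    if (!ran_after_unlock || timer_deferred)
        return -1;
    return (timer_runs == before + 1) ? 0 : -1;
}
```

In a real loop, the deferred timer would be retried on the next pass
after select() returns. Another option, under the same assumption, is to
service the ready descriptors before running expired timers, so that
Th#1 gets its 'done' message and can release the lock before the timer
asks for it.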
Sorry about the false alarm - but I feel more confident about my
entire application after crawling all over it since yesterday !
Thanks,
Avik.