[NTLUG:Discuss] pthreads on Linux

cbbrowne@hex.net cbbrowne at hex.net
Sat Aug 11 13:24:40 CDT 2001


Peter Koren wrote:
> Steve Baker wrote:
> > 
> > Chris Cox wrote:
> > >
> > > I know previously Linus and others claimed that threads were not
> > > needed in Linux because they wouldn't buy that much in performance.
> > 
> > For once, Linus was wrong.  We've used threads at work and they certainly
> > do help THE RIGHT KIND OF APPLICATION.
> > 
> > However, it's not that this is a kind of magic wand.  If you have only
> > one CPU and your application is just doing heavy CPU/memory work, then
> > splitting it into two threads can only hurt performance.
> > 
> > But if your application mixes calculations with I/O, then of course
> > threading helps - while one thread is doing I/O, the other can use
> > the CPU for doing calculations - and vice-versa.
> > 
> 
> My code is mainly floating point number crunching with only enough I/O
> to keep the user aware of how the run is proceeding.
> 
> > Then if you have multiple CPU's, a single threaded application can't
> > possibly gain any benefit from that second CPU - so you pretty much
> > HAVE to have threads to take advantage of it.
> 
> My system is a dual CPU machine running an SMP kernel (2.4.3).

Which indicates that this particular situation sits right at the edge between 
being:
a) able to benefit somewhat from threading, or
b) just barely _unable_ to benefit.

The place where threading is going to win _BIG_ is when the task involves hefty 
asynchronous activity, where a bunch of threads can easily block, consuming 
minimal resources, whilst other threads work hard.

Your scenario certainly isn't one like that; you are trying to keep CPUs 
working hard.

On the other hand, threading allows pushing work to multiple CPUs, which is 
what you're doing, so you _can_ get some benefit from them.

The "gripping hand" likely is that those multiple CPUs are probably feeling 
starved for memory bandwidth.  Threading helps, in your application, if it 
allows one CPU to be busy computing whilst the other one is busy grabbing 
values from RAM, and vice-versa.

It sounds as though that's not happening cooperatively enough:  the CPUs are 
probably blocking each other a significant proportion of the time as they try 
to access memory.

If you had some way of synchronizing things so that a CPU could check to see 
if memory bus is busy, and instead do some on-register calculations, that 
would probably improve performance.

> I suspect you are correct about being main memory bandwidth bound. I
> have a few isolated global variables accessed in the thread, but also
> two global floating point arrays each comprised of 17,500 doubles and
> some smaller arrays. Anything which is written to by another thread is
> protected by a mutex or semaphore. One of the large arrays is read only 
> and the other is write only. No array element is written to by more
> than one thread.

If a CPU has to wait 150ns to pull a couple of bytes of data, that stalls 
computation for a while.  Losing situation.

In effect, this shows why benefiting from threading is a Pretty Hard Problem...

> The answers given by both the threaded and non-threaded versions are
> consistent. They are not identical, because the algorithm is sensitive to
> calculation order, owing to the calls to the random number generator used
> by the algorithm.
> 
> Unfortunately, from a threading standpoint, each thread must randomly
> access memory in the large global arrays, so that there seems to be no
> way to arrange memory optimally for this algorithm.

... And rearranging accesses may not be relevant, if what happens is that the 
CPUs keep butting heads just trying to get to the memory bus ...

> The only approach that might have some payoff would be to reduce the
> number of simultaneous threads. This might improve things if the problem
> is related to the thread manager thrashing. This would break the natural
> structure of the code which uses NP parallel threads where NP is the
> population size of each generation. But then threads are expected to be
> painful.

The win would seem to be if you could make sure that most of the computations 
took place in registers, and diminish the number of times they have to go out 
and actually hit RAM, because it's the registers that are really independent.

Of course, that also assumes that the loops involved can fit into the 
instruction cache...
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www.cbbrowne.com/info/spreadsheets.html
Found in a TOPS-10 MCO:
    Quotation for the day: "a counter that doesn't exist can't get messed up."
    "Once in a blue moon" is defined as the creation of a new SFD or the
renaming of an old one.
