[NTLUG:Discuss] 64/32 question

Steve Baker steve at sjbaker.org
Tue Sep 23 23:10:06 CDT 2008


MadHat Unspecific wrote:
> Can you provide any references on this? Specifically that 4G on a 32bit
> Linux system will experience performance hits.   I understand that you
> have to have specific kernel settings to access all 4G, but once this is
> done, are you saying there are performance hits?
>   
This is my understanding of the problem - I may have glossed over a few
details - but in "broad brush" terms, I think it's correct:

The problem is that you only have 32 address lines and 2^32 is 4Gigs.

The hardware "Memory Mapping Unit" (MMU) is effectively a big lookup
table that can translate any "block" address to any physical location in 
RAM.

When the system is running an application, the program wants to see its
own chunk of memory positioned in the address space so that it runs from
zero upwards.  While that application is running, the MMU maps chunks of
physical memory into the addresses from zero up to however much memory
has been allocated to the application.  Physical memory gets chopped up
as the system allocates and deallocates blocks - so the actual physical
memory the application is using is scattered throughout the physical RAM
chips, with the MMU descrambling it to make it LOOK like a nice, clean,
contiguous block.
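
To make the "big lookup table" idea concrete, here's a toy C sketch of
that translation.  The 4Kbyte page size is the usual x86 figure, but the
table contents are completely made up - a real MMU does this in hardware
(and in several levels) - so treat it purely as an illustration:

/* Toy model of MMU translation: a per-process "page table" maps virtual
 * page numbers to physical page numbers, so the process sees a contiguous
 * range starting at zero even though the physical pages are scattered.
 * The table contents below are invented for the example. */
#include <stdio.h>

#define PAGE_SIZE 4096u                 /* 4Kbyte pages, as on x86 */

/* virtual page N lives in physical page page_table[N] (made-up values) */
static const unsigned page_table[] = { 917, 12, 4401, 88 };

static unsigned virt_to_phys(unsigned vaddr)
{
    unsigned vpage  = vaddr / PAGE_SIZE;    /* which virtual page     */
    unsigned offset = vaddr % PAGE_SIZE;    /* where inside that page */
    return page_table[vpage] * PAGE_SIZE + offset;
}

int main(void)
{
    unsigned vaddr;

    /* the process thinks its memory starts at zero and is contiguous... */
    for (vaddr = 0; vaddr < 4 * PAGE_SIZE; vaddr += PAGE_SIZE)
        printf("virtual 0x%08x -> physical 0x%08x\n", vaddr, virt_to_phys(vaddr));
    return 0;
}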
 
But when the application (or a hardware interrupt) hands control over to
the kernel, the kernel wants to see all of the physical memory as one nice
long contiguous chunk so that it can easily read and write all of it - no
matter which application is running and no matter how memory is currently
scrambled for that application's benefit.  The kernel can't afford to have
memory continually shuffling around - it would be impossible to maintain
things like disk caches if the order of the addresses changed every time
an application took control of the CPU.

("swap-space" memory also plays a part in this - but I won't complicate
matters by explaining that)

So, the following three situations arise (and each has different kernel
compilation settings):

* When you have 2Gbytes of physical RAM or less, the kernel can hand out
something close to 2Gbytes to a particularly hungry application and still
have 2Gbytes of address space on the input side of the MMU in which to lay
out that same physical RAM in an order that lets the kernel access all of
it nice and linearly.  Every block of physical RAM appears at two places
in the address space - one that is handy for the application that's
running, and another that's convenient for the kernel.  (There's a toy
sketch of this "two views of the same RAM" idea a little further down.)

When you are multitasking, the kernel has to reload the bottom half of
the MMU whenever it switches from one application program to another -
but that's a relatively infrequent event (on the timescale of a modern
computer, at least).

* When you have (say) 3Gbytes - you can still get by.  If you never let any
ONE application have more than 1Gbyte then there is still room to lay
out 1Gbyte for the application that's running and still have all 3 Gigs of
physical RAM laid out nicely for the kernel.  Sadly - if you upgrade the
RAM in your computer from 2Gbytes to 3Gbytes - then ironically, each
application can only see 1Gbyte instead of 2 - so in a sense, you actually
downgraded the amount of RAM for large monolithic programs!

BUT...

* If you insist on having 4Gbytes of physical RAM then there is a big
problem.  The kernel needs to be able to access all of physical RAM, but
that fills up the entire address space - so the application can't have any
address space at all, because the kernel needs all of it.  Clearly that's
just not going to work.  So there is a kludge (of sorts) to get around it
- but it entails using different kernel code.  What happens is that every
single call from the application into the kernel, and every single
hardware interrupt, has to remap memory by reloading all of the data in
the MMU.  Hence, when the application is running, it can see all 4Gbytes
of RAM - and when it calls the kernel, that memory is rapidly rearranged
to be the way the kernel wants to see it.  When the kernel returns to the
application, the MMU has to be reloaded again to lay out the memory the
way this application likes it.  Probably this messes up the cache too -
so cache coherency goes to the wall whenever there is an interrupt or
kernel call (I'm not 100% clear on this point though).

This last approach lets the application use almost all of the 4Gbytes
of RAM - but results in much slower response to hardware interrupts
and much slower kernel calls.  In an intense hardware-interrupt
environment (maybe a server with a bunch of hard drives and a ton
of busy network connections) - that overhead can become serious.
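
Going back to the first case (2Gbytes or less), here is the toy sketch of
the "two views of the same RAM" idea I promised above.  It uses the 2G/2G
split from this description - the PAGE_OFFSET constant below is purely
illustrative, not taken from any real kernel configuration:

/* Sketch of "every block of physical RAM is at two places in the address
 * space": the application reaches it through its per-process page table
 * (somewhere below 2Gbytes), while the kernel sees all of physical RAM
 * mapped linearly at a fixed offset (here, above 2Gbytes).  The constants
 * are invented for the example. */
#include <stdio.h>

#define PAGE_OFFSET 0x80000000u  /* assumed start of the kernel's linear map (2G/2G split) */

static unsigned kernel_virt_from_phys(unsigned phys) { return phys + PAGE_OFFSET; }
static unsigned phys_from_kernel_virt(unsigned virt) { return virt - PAGE_OFFSET; }

int main(void)
{
    unsigned phys  = 0x01234000u;   /* some arbitrary physical page */
    unsigned kvirt = kernel_virt_from_phys(phys);

    /* Same physical page, two addresses: wherever the application's page
     * table happens to put it, and a fixed, predictable spot in the
     * kernel's linear map - which is what lets the kernel get at all of
     * physical RAM without reshuffling anything. */
    printf("physical 0x%08x <-> kernel virtual 0x%08x\n", phys, kvirt);
    printf("and back again:    0x%08x\n", phys_from_kernel_virt(kvirt));
    return 0;
}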

In a regular desktop PC, the effect on performance really depends on
your application.  If you have a big number-cruncher that hardly ever
makes kernel calls - then you may not care.   But if you have programs
that do a lot of kernel operations then the performance hit can be
very significant.
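
If you want a feel for what a single kernel round-trip costs on your own
machine, you can time a loop of a cheap system call.  This is only a rough
sketch - it measures the baseline syscall cost on whatever kernel you're
running, it doesn't recreate the 4Gbyte remapping situation - but it shows
why "lots of kernel calls" multiplies whatever per-call overhead there is:

/* Rough micro-benchmark of system call cost.  Build with something like
 * "gcc -O2 syscall_cost.c -lrt" (older glibc wants -lrt for clock_gettime). */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iterations = 1000000;
    struct timespec start, end;
    double elapsed_ns;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        syscall(SYS_getpid);   /* raw syscall, so the C library can't short-circuit it */
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
               + (end.tv_nsec - start.tv_nsec);
    printf("%ld syscalls in %.0f ns (%.1f ns each)\n",
           iterations, elapsed_ns, elapsed_ns / iterations);
    return 0;
}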

So, to summarize, your choices are:

Up to 2Gbytes of physical RAM:  Everything goes very fast and each
application can use almost all of the 2Gbytes.

Above 2Gbytes - but less than 4Gbytes:  Everything goes fast but if
you have N Gigs of RAM then each application can only see at
most (4-N) Gigs.

At 4Gbytes: The machine runs more slowly - but each application
can again use almost all of physical RAM if it needs to.

In a 64 bit machine (with a hypothetical full 64 bit MMU) you can
comfortably accommodate 16 exabytes of address space - and up to
8 exabytes of physical RAM would work just fine without any
kludges.  Problem solved for MANY more cycles of Moore's Law!

In practice, a true 64 bit MMU is a lot of circuitry and it's unnecessary
because nobody can afford 8 exabytes of anything.  Hence (I believe)
current 64 bit MMUs top out at maybe 64Gbytes - so you can have
32Gbytes of physical RAM and still have things work nicely.
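
If you want to check the arithmetic in those last two paragraphs, here's a
throwaway C program that prints the sizes involved (the 2^36 line is only
there because 64Gbytes happens to correspond to 36 address bits):

/* Back-of-the-envelope address-space arithmetic - nothing kernel-specific. */
#include <stdio.h>

int main(void)
{
    unsigned long long four_g   = 1ULL << 32;  /* 2^32 bytes - the whole 32 bit address space */
    unsigned long long sixty4_g = 1ULL << 36;  /* 2^36 bytes - 64Gbytes needs 36 address bits */

    /* 2^64 doesn't fit in a 64 bit integer, so compute it as 2 * 2^63 in floating point */
    long double full64 = 2.0L * (long double)(1ULL << 63);

    printf("2^32 bytes = %llu (%llu Gbytes)\n", four_g,   four_g   >> 30);
    printf("2^36 bytes = %llu (%llu Gbytes)\n", sixty4_g, sixty4_g >> 30);
    printf("2^64 bytes = %.0Lf (%.0Lf exabytes, counting 2^60 bytes per exabyte)\n",
           full64, full64 / (long double)(1ULL << 60));
    return 0;
}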

  -- Steve.



