#1
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
Hello,
The question is: can an x86/x64 CPU/memory system be changed into a barrel processor? I shall provide an idea here and then you guys figure out whether it would be possible or not. What I would want as a programmer is something like the following:

1. Request memory contents/addresses with an instruction which does not block, for example:

EnqueueReadRequest address1

Then it should be possible to "machine gun" these requests like so:

EnqueueReadRequest address1
EnqueueReadRequest address2
EnqueueReadRequest address3
EnqueueReadRequest address4
EnqueueReadRequest address5

2. Block on the response queue and get the memory contents:

DequeueReadResponse register1
(do something with register1, perhaps enqueue another read request)
DequeueReadResponse register2
DequeueReadResponse register3

If the queues act in order, then this would be sufficient. Otherwise extra information would be necessary to know which is which. So if the queues were out of order, the dequeue would need to provide which address the contents were for:

DequeueReadResponse content_register1, address_register2

The same would be done for writing as well:

EnqueueWriteRequest address1, content_register
EnqueueWriteRequest address2, content_register
EnqueueWriteRequest address3, content_register

There could then also be a response queue which notifies the thread when certain memory addresses were written:

DequeueWriteResponse register1 (in-order design)

or

DequeueWriteResponse content_register1, address_register2 (out-of-order design)

There could also be some special instructions which would return queue status without blocking... like queue empty count, queue fill count, queue max count, and perhaps a queue up count which could be used to change queue status in case something happened to the queue. For example, each queue has a maximum amount of entries available.

The enqueueing/dequeueing instructions mentioned above would block until they succeed (meaning their request is placed on the queue, or a response is removed from the queue). The counting instructions would not block. This way the CPU would have at least 4 queues:

1. Read Request Queue
2. Read Response Queue
3. Write Request Queue
4. Write Response Queue

Each queue would have a certain maximum size. Each queue has counters to indicate how many free entries there are and how many taken entries there are. These are also queryable via instructions and do not block the thread; the counters are protected via hardware mutexes or so because of enqueueing and dequeueing, but as long as nothing is happening these counters should be able to return properly:

GetReadRequestQueueEmptyCount register
GetReadRequestQueueFullCount register
GetReadResponseQueueEmptyCount register
GetReadResponseQueueFillCount register
GetWriteRequestQueueEmptyCount register
GetWriteRequestQueueFullCount register
GetWriteResponseQueueEmptyCount register
GetWriteResponseQueueFillCount register

All instructions should be shareable by threads... so that, for example, one thread might be posting read requests and another thread might be retrieving those read responses. Otherwise the first thread might block because the read request queue is full, with nobody servicing the response queue.

Alternatively, the instructions could also be made non-blocking and return a status code to indicate whether the operation succeeded or not. However, an additional code or mode would then also be necessary to specify whether an instruction should be blocking or non-blocking... which might make things a bit too complex, but this is a hardware-maker decision. In case sharing among many threads is too difficult, impossible, or too slow, then non-blocking might be better: the thread can then cycle around read responses and see if anything came in so it can do something... however, this would lead to high CPU usage...

So for efficiency's sake blocking is preferred, or perhaps a context switch until the thread no longer blocks. It would then still be necessary for the thread to somehow deal with responses... so this does seem to need multiple threads working together in the blocking situation.

The memory system/chips would probably also need some modifications to be able to deal with these memory requests and return responses. Perhaps also special wiring/protocols to be able to pipeline/transfer as many of these requests/responses back and forth as possible.

So what do you think of a "barrel"-like addition to current AMD/Intel x86/x64 CPUs and their memory systems?!? Possible or not?!?

The idea described above is a bit messy... but it's the idea that counts... if CPU manufacturers are interested I might work it out some more to see how it would flesh out/work exactly.

Bye,
Skybuck.
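As a thought experiment, the proposed queues can be modeled in software. The C sketch below is purely illustrative: the ring-buffer layout, the QUEUE_MAX of 16, and the function names (mirroring the instruction names above) are all assumptions, and real hardware would implement these as instructions rather than library calls. It shows one read-request-style queue with blocking left to the caller (a 0 return means "would block") and the non-blocking fill/empty count queries.

```c
/* Hypothetical software model of one of the four proposed queues.
   Names follow the post's instruction names; sizes are made up. */
#include <stdint.h>
#include <stddef.h>

#define QUEUE_MAX 16  /* "each queue has a maximum amount of entries" */

typedef struct {
    uintptr_t addr[QUEUE_MAX];
    uint64_t  data[QUEUE_MAX];
    size_t head, tail, count;
} MemQueue;

/* Returns 0 when the queue is full -- the caller blocks or retries,
   matching the blocking/non-blocking distinction discussed above. */
int Enqueue(MemQueue *q, uintptr_t a, uint64_t d) {
    if (q->count == QUEUE_MAX) return 0;
    q->addr[q->tail] = a;
    q->data[q->tail] = d;
    q->tail = (q->tail + 1) % QUEUE_MAX;
    q->count++;
    return 1;
}

/* Returns 0 when the queue is empty. Out-of-order responses would be
   matched to requests via the returned address, as in
   DequeueReadResponse content_register1, address_register2. */
int Dequeue(MemQueue *q, uintptr_t *a, uint64_t *d) {
    if (q->count == 0) return 0;
    *a = q->addr[q->head];
    *d = q->data[q->head];
    q->head = (q->head + 1) % QUEUE_MAX;
    q->count--;
    return 1;
}

/* Non-blocking status queries, like GetReadRequestQueueFillCount etc. */
size_t FillCount(const MemQueue *q)  { return q->count; }
size_t EmptyCount(const MemQueue *q) { return QUEUE_MAX - q->count; }
```

In this model, two threads sharing a queue (one enqueueing requests, one dequeueing responses) would additionally need a mutex or atomics around head/tail/count, which is what the post's "hardware mutexes or so" alludes to.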
#2
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message b.home.nl... Hello, Question is: Can a x86/x64 cpu/memory system be changed into a barrel processor ? I shall provide an idea here and then you guys figure out if it would be possible or not. What I would want as a programmer is something like the following: 1. Request memory contents/addresses with an instruction which does not block, for example: EnqueueReadRequest address1 Then it should be possible to "machine gun" these requests like so: EnqueueReadRequest address1 EnqueueReadRequest address2 EnqueueReadRequest address3 EnqueueReadRequest address4 EnqueueReadRequest address5 2. Block on response queue and get memory contents DequeueReadResponse register1 do something with register1, perhaps enqueue another read request DequeueReadResponse register2 DequeueReadResponse register3 If the queues act in order... then this would be sufficient. Otherwise extra information would be necessary to know which is what. So if queues would be out of order then the dequeue would need to provide which address the contents where for. DeQueueReadResponse content_register1, address_register2 The same would be done for writing as well: EnqueueWriteRequest address1, content_register EnqueueWriteRequest address2, content_register EnqueueWriteRequest address3, content_register There could then also be a response queue which notifies the thread when certain memory addresses where written. DequeueWriteResponse register1 (in order design) or DequeueWriteResponse content_register1, address_register2 (out of order design) There could also be some special instructions which would return queue status without blocking... Like queue empty count, queue full count, queue max count and perhaps a queue up count which could be used to change queue status in case something happened to the queue. For example each queue has a maximum ammount of entries available. 
The queueing/dequeuing instructions mentioned above would block until they succeed (meaning their request is placed on queue or response removed from queue) The counting instructions would not block. This way the cpu would have 4 queues at least: 1. Read Request Queue 2. Read Response Queue 3. Write Request Queue 4. Write Response Queue Each queue would have a certain maximum size. Each queue has counters to indicate how much "free entries there are" and how much "taken entries there are". For example, these are also querieable via instructions and do not block the thread, the counters are protected via hardware mutexes or so because of queieing and dequeing but as long as nothing is happening these counters should be able to return properly. Little correct: full should have been fill: GetReadRequestQueueEmptyCount register GetReadRequestQueueFillCount register GetReadResponseQueueEmptyCount register GetReadResponseQueueFillCount register GetWriteRequestQueueEmptyCount register GetWriteRequestQueueFillCount register GetWriteResponseQueueEmptyCount register GetWriteResponseQueueFillCount register All instructions should be shareable by threads... so that for example one thread might be postings read requests and another thread might be retrieving those read responses. Otherwise the first thread might block because of read request full, and nobody responding to response queue. Alternatively perhaps the instructions could also be made non-blocking, and return a status code to indicate if they operation succeeded or not, however then an additional code or mode would also be necessary to specify if it should be blocking or non-blocking... which might make things a bit too complex, but this is hardware-maker decision... in case many threads sharing is too difficult or impossible or too slow then non-blocking might be better, the thread can then cycle around read responses and see if anything came in so it can do something... however this would lead to high cpu usage... 
so for efficiency sake blocking is preferred, or perhaps a context switch until the thread no longer blocks. It would then still be necessary for the thread to somehow deal with responses... so this this seem to need multiple threads to work together for the blocking situation. The memory system/chips would probably also need some modifications to be able to deal with these memory requests and return responses. Perhaps also special wiring/protocols to be able to "pipeline"/"transfer as much of these requests/responses back and forth. So what you think of a "barrel" like addition to current amd/intel x86/x64 cpu's and there memory systems ?!? Possible or not ?!? This idea described above is a bit messy... but it's the idea that counts... if cpu manufacturers interested I might work it out some more to see how it would flesh out/work exactly Bye, Skybuck. |
#3
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message
b.home.nl...
> Can a x86/x64 cpu/memory system be changed into a barrel processor ?
[deletia]

Not directly, but they... sort of... already are.

The high-end Intel and AMD x86 CPUs are all superscalar designs, which means that internally the CPU is viewed as a collection of "resources" -- ALUs, instruction decoders, memory read units, memory write units, etc. -- that there are (typically) multiple instances of each of these resources, and that the CPU scheduler tries very hard to always keep all the resources busy. This effectively means that multiple instructions can be executed simultaneously, which effectively implements the "AddRequest, AddRequest, GetResponse, GetResponse" protocol that you'd like.

Now, add on the hyper-threading that's been around for a number of years now, and I'd say you have a result that, in practice, is not that far from a barrel processor. In fact, it's probably better insofar as popular metrics such as performance/(# of transistors * clock rate * power) or somesuch, in that the dynamic scheduling a superscalar CPU performs is often more efficient than a straight barrel implementation when you're running "general purpose" code such as a web browser or word processor. (Although I would expect that barrel CPUs have instructions that provide "hints" to the scheduler -- to suggest it not switch threads, or to keep or flush the caches, or whatever -- just as superscalar CPUs do... but also recall that when HT was added to Intel's x86 CPUs, for certain workloads HT actually slowed down overall throughput a bit too...)

As I think you've surmised, the trick to achieving high performance with CPUs is to prevent stalls. This is of course a non-trivial problem, and companies like Intel and AMD invest enormous resources into trying to get just a little bit better performance out of their designs; you can be certain that someone at these companies has very carefully considered which aspects of a barrel processor design they might "borrow" to improve their performance.

---Joel
#4
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
The only thing my program needs to do is fire off memory requests.
However, it seems the x86 CPU blocks on the first memory request and does nothing else. This is an AMD X2 3800+ processor. Perhaps newer processors don't have this problem anymore, but I would seriously doubt that.

So unless you come up with any proof, I am going to dismiss your story as complex-non-relevant-bull****.

It's not so hard to write a program which requests random memory accesses. You apparently should try it sometime.

Bye,
Skybuck.
#5
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message
b.home.nl...
> The only thing my program needs to do is fire off memory requests.
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.

Hmm, it shouldn't do that, assuming there aren't any dependencies between the next handful of instructions and the first one there. (But note that if you perform a load operation and the data isn't in the caches, it takes *many tens to hundreds* of CPU cycles to fetch the data from external DRAM; hence you *will* stall. There actually are instructions in the x86 architecture these days for "warming up" the cache by pre-fetching data, though -- this can help a lot when you know in advance you'll need data, e.g., a few hundred cycles from now; if you're looping over big sets of data, you just pre-fetch the next block while you work on the current one.)

A program that requests random memory accesses will very quickly stall for a long time (after the first couple of instructions), as you quickly exhaust the number of "memory read" resources available and have near-constant cache misses. Few real-world programs exhibit behavior that bad AFAIK, although I expect that some large database applications (that have to run through multiple indices for each request, where the indices and/or data are too big for the caches) might approach it.

---Joel
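The "pre-fetch the next block while you work on the current one" pattern described above can be sketched in C with the GCC/Clang `__builtin_prefetch` intrinsic, which compiles down to the x86 PREFETCH instructions. The prefetch distance of 64 elements is an illustrative guess, not a recommendation: the right distance depends on the machine and on how much work is done per element.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 64  /* elements ahead -- illustrative, tune per machine */

/* Sum an array, hinting the cache about data we'll need a few hundred
   cycles from now. The prefetch is only a hint: the result is identical
   with or without it; only the cache-miss latency changes. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* args: address, 0 = prefetch for read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        s += a[i];
    }
    return s;
}
```

Note this only helps when the address is known well in advance; as the later posts in this thread point out, it cannot help a dependent chain where each address comes out of the previous load.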
#6
Can a x86/x64 cpu/memory system be changed into a barrel processor?
Joel Koltner wrote:
"Skybuck Flying" wrote in message b.home.nl... The only thing my program needs to do is fire off memory requests. However it seems the x86 cpu blocks on the first memory request and does nothing else. Hmm, it shouldn't do that, assuming there aren't any dependencies between the next handful of instructions and the first one there. (But note that if you perform a load operation and the data isn't in the caches, it takes *many tens to hundreds* of CPU cycles to fetch the data from external DRAM; hence you *will* stall. There actually are instructions in the x86 architecture these days for "warming up" the cache by pre-fetching data, though -- this can help a lot when you know in advance you'll need data, e.g., a few hundred cycles from now; if you're looping over big sets of data, you just pre-fetch the next block while you work on the current one.) A program that requests random memory accesses will very quickly stall for a long time (after the first couple of instructions), as you quickly exhaust the number of "memory read" resources available and have near-constant cache misses. Few real-world pograms exhibit behavior that bad AFAIK, although I expect that some large database applications (that have to run through multiple indices for each request, where the indices and/or data are too big for the caches) might approach it. ---Joel The Intel processor also has prefetch options, and works with both incrementing memory access patterns or decrementing patterns. Using a "warm up" option is one thing, but the processor should also be able to handle prefetch on its own. Perhaps AMD has something similar ? Since this is posted to comp.arch, someone there should know. Skybuck's processor has an integrated memory controller, so there are possibilities. http://blogs.utexas.edu/jdm4372/2010...ead-read-only/ Both Intel and AMD, will have documentation on their website, addressing the need to optimize programs to run on the respective processors. 
And that is a good place for a programmer to start, to find the secrets of getting best performance. Paul |
#7
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying
wrote:
> The only thing my program needs to do is fire off memory requests.
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.

How do you know? The whole point about out-of-order execution is that it is transparent to the software, so it is not possible to write a program whose behaviour depends on whether blocking occurs or not. If you have a logic analyzer and you think you have results that prove in-order behaviour, then you'll have to provide more details.

That said, such things are well outside my comfort zone, so I personally won't be able to help.
#8
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
On Jun 10, 3:44 am, "Ken Hagan" wrote:
> On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying wrote:
>> The only thing my program needs to do is fire off memory requests.
>> However it seems the x86 cpu blocks on the first memory request and does
>> nothing else.

The CPU will not block if all of the outstanding accesses are to write-back cacheable memory.

> How do you know? The whole point about out-of-order execution is that it
> is transparent to the software,

No, the whole point of precise exceptions is to be transparent to software. The point of OoO is to improve performance; adding precise exceptions to OoO gives you high performance and is relatively transparent to software (but not entirely).

> so it is not possible to write a program
> whose behaviour depends on whether blocking occurs or not.

One can EASILY detect blocking (or not) by comparing the wall clock time on multi-million memory access codes. One can infer the latencies to the entire cache hierarchy, including main memory, and whether or not main memory accesses are being processed with concurrency.

Mitch
#9
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
Already tried prefetching for RAM; it's pretty useless...

Especially for random access, especially for dependencies, especially when the software doesn't yet know what to ask next. However, the problem may be parallelized. However, the CPU still blocks. Therefore making it parallel doesn't help. Only threading helps, but with two or four cores that doesn't impress.

I might read the article later on, but I fear I will be wasting my time. I scanned it a little bit; the code assumes in-sequence memory... pretty lame, it has nothing to do with the R in RAM. Also my memory seeks are very short, 4 to 6 bytes, therefore fetching more is pretty useless.

Bye,
Skybuck.

"Paul" wrote in message ...
> The Intel processor also has prefetch options, and works with both
> incrementing memory access patterns and decrementing patterns.
[snip]
#10
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message
. home.nl...
> Programmers try to write programs so they be fast.

Well, yes and no -- these days, the vast majority of programs are at least *initially* written more from the point of view of trying to get them to be maintainable and correct (bug-free). After that has occurred, if there are any significant performance bottlenecks (and in many programs there may not be, because the app is waiting on the human user for input or the Internet or something else quite slow), programmers go back and work on those performance-critical areas.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" -- Donald Knuth -- see: http://en.wikipedia.org/wiki/Program_optimization, where there are other good quotes as well, such as "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity."

Also keep in mind that since the vast majority of code is now written in a high-level language, it's primarily the purview of a *compiler* to generate "reasonably" efficient code -- it has much more intimate knowledge of the particular CPU architecture being targeted than the programmer usually does; most programmers should be concentrating on efficient *algorithms* rather than all these low-level details regarding parallelism, caching, etc. I mean, when I first started programming and learned C (back in the early '90s), you were often told a couple of "tricks" to make the code run faster at the expense of readability. Today the advice is completely the opposite: make the code as readable as possible; a good compiler will generally create output that's just as efficient as the old-school code ever was.

> Do not think that slow programs would be released.

"Slow" is kinda relative, though. As much as it pains me and others around here at times, it's hard to argue that just because a program is truly glacial on a 233MHz Pentium (the original "minimum hardware requirement" for Windows XP), if it's entirely snappy on a 2.4GHz CPU it's not *really* "slow."

> No better hardware then no better software.

Hardware has gotten better, and software design is still struggling to catch up: there's still no widely-adopted standard that has "taken over the world" insofar as programming efficiently for multi-core CPUs. Take a look at something like the GreenArrays CPUs: http://greenarraychips.com/ -- 144 CPU cores, and no fall-off-the-log easy method to get all of them to execute in parallel for many standard procedural algorithms.

> Ask yourself one very important big question: What does the R stand for
> in RAM ?

Notice that your motherboard is populated with SDRAM, which is a rather different beast than "old school" RAM -- it's not nearly as "random" as you might like, at least insofar as what provides the maximum bandwidth.

---Joel
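The point about SDRAM rewarding sequential access can be illustrated with a small C sketch: sum the same buffer once in address order and once in a shuffled order. The sizes here are illustrative, and the sketch does no timing itself -- the idea is that the two passes compute the identical sum, yet on real hardware the sequential pass runs much faster, since it streams whole cache lines and consecutive DRAM columns while the shuffled pass tends to take a cache miss (and often a DRAM row activation) per element.

```c
#include <stdlib.h>

/* Sum buf[] in the order given by idx[]: pass the identity for a
   sequential sweep, or a shuffle for a "random" one. Same result,
   very different memory-access behavior. */
unsigned long long sum_by_index(const unsigned int *buf,
                                const size_t *idx, size_t n) {
    unsigned long long s = 0;
    for (size_t i = 0; i < n; i++) s += buf[idx[i]];
    return s;
}

/* Fisher-Yates shuffle of an identity index array. */
size_t *shuffled_indices(size_t n) {
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    return idx;
}
```

Timing these two orders on a buffer much larger than the caches (with the wall-clock approach from earlier in the thread) is one way to see for yourself how un-"random" SDRAM's best-case bandwidth really is.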