#1
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
Hello,
The question is: can an x86/x64 CPU/memory system be changed into a barrel processor? I shall provide an idea here and then you guys figure out whether it would be possible or not. What I would want as a programmer is something like the following:

1. Request memory contents/addresses with an instruction which does not block, for example:

EnqueueReadRequest address1

Then it should be possible to "machine gun" these requests like so:

EnqueueReadRequest address1
EnqueueReadRequest address2
EnqueueReadRequest address3
EnqueueReadRequest address4
EnqueueReadRequest address5

2. Block on the response queue and get the memory contents:

DequeueReadResponse register1
(do something with register1, perhaps enqueue another read request)
DequeueReadResponse register2
DequeueReadResponse register3

If the queues act in order, then this would be sufficient. Otherwise extra information would be necessary to know which is which. So if the queues were out of order, the dequeue would need to provide which address the contents were for:

DequeueReadResponse content_register1, address_register2

The same would be done for writing as well:

EnqueueWriteRequest address1, content_register
EnqueueWriteRequest address2, content_register
EnqueueWriteRequest address3, content_register

There could then also be a response queue which notifies the thread when certain memory addresses were written:

DequeueWriteResponse register1 (in-order design)

or

DequeueWriteResponse content_register1, address_register2 (out-of-order design)

There could also be some special instructions which would return queue status without blocking... like queue empty count, queue fill count, queue max count, and perhaps a queue up count which could be used to change queue status in case something happened to the queue. For example, each queue has a maximum amount of entries available.

The enqueueing/dequeueing instructions mentioned above would block until they succeed (meaning their request is placed on the queue, or a response is removed from the queue). The counting instructions would not block. This way the CPU would have at least 4 queues:

1. Read Request Queue
2. Read Response Queue
3. Write Request Queue
4. Write Response Queue

Each queue would have a certain maximum size. Each queue has counters to indicate how many free entries there are and how many taken entries there are. These are also queryable via instructions and do not block the thread; the counters are protected via hardware mutexes or so because of enqueueing and dequeueing, but as long as nothing is happening these counters should be able to return properly:

GetReadRequestQueueEmptyCount register
GetReadRequestQueueFullCount register
GetReadResponseQueueEmptyCount register
GetReadResponseQueueFillCount register
GetWriteRequestQueueEmptyCount register
GetWriteRequestQueueFullCount register
GetWriteResponseQueueEmptyCount register
GetWriteResponseQueueFillCount register

All instructions should be shareable by threads... so that, for example, one thread might be posting read requests and another thread might be retrieving those read responses. Otherwise the first thread might block because the read request queue is full, with nobody servicing the response queue.

Alternatively, the instructions could also be made non-blocking and return a status code to indicate whether the operation succeeded or not. However, an additional code or mode would then also be necessary to specify whether an instruction should be blocking or non-blocking... which might make things a bit too complex, but this is a hardware-maker decision. In case sharing among many threads is too difficult, impossible, or too slow, then non-blocking might be better: the thread can then cycle around read responses and see if anything came in so it can do something... however, this would lead to high CPU usage...

So for efficiency's sake blocking is preferred, or perhaps a context switch until the thread no longer blocks. It would then still be necessary for the thread to somehow deal with responses... so this does seem to need multiple threads working together in the blocking situation.

The memory system/chips would probably also need some modifications to be able to deal with these memory requests and return responses. Perhaps also special wiring/protocols to be able to pipeline/transfer as many of these requests/responses back and forth as possible.

So what do you think of a "barrel"-like addition to current AMD/Intel x86/x64 CPUs and their memory systems?!? Possible or not?!?

The idea described above is a bit messy... but it's the idea that counts... if CPU manufacturers are interested I might work it out some more to see how it would flesh out/work exactly.

Bye,
Skybuck.
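As a thought experiment, the proposed queues can be modeled in software. The C sketch below is purely illustrative: the ring-buffer layout, the QUEUE_MAX of 16, and the function names (mirroring the instruction names above) are all assumptions, and real hardware would implement these as instructions rather than library calls. It shows one read-request-style queue with blocking left to the caller (a 0 return means "would block") and the non-blocking fill/empty count queries.

```c
/* Hypothetical software model of one of the four proposed queues.
   Names follow the post's instruction names; sizes are made up. */
#include <stdint.h>
#include <stddef.h>

#define QUEUE_MAX 16  /* "each queue has a maximum amount of entries" */

typedef struct {
    uintptr_t addr[QUEUE_MAX];
    uint64_t  data[QUEUE_MAX];
    size_t head, tail, count;
} MemQueue;

/* Returns 0 when the queue is full -- the caller blocks or retries,
   matching the blocking/non-blocking distinction discussed above. */
int Enqueue(MemQueue *q, uintptr_t a, uint64_t d) {
    if (q->count == QUEUE_MAX) return 0;
    q->addr[q->tail] = a;
    q->data[q->tail] = d;
    q->tail = (q->tail + 1) % QUEUE_MAX;
    q->count++;
    return 1;
}

/* Returns 0 when the queue is empty. Out-of-order responses would be
   matched to requests via the returned address, as in
   DequeueReadResponse content_register1, address_register2. */
int Dequeue(MemQueue *q, uintptr_t *a, uint64_t *d) {
    if (q->count == 0) return 0;
    *a = q->addr[q->head];
    *d = q->data[q->head];
    q->head = (q->head + 1) % QUEUE_MAX;
    q->count--;
    return 1;
}

/* Non-blocking status queries, like GetReadRequestQueueFillCount etc. */
size_t FillCount(const MemQueue *q)  { return q->count; }
size_t EmptyCount(const MemQueue *q) { return QUEUE_MAX - q->count; }
```

In this model, two threads sharing a queue (one enqueueing requests, one dequeueing responses) would additionally need a mutex or atomics around head/tail/count, which is what the post's "hardware mutexes or so" alludes to.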
#2
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message b.home.nl... Hello, Question is: Can a x86/x64 cpu/memory system be changed into a barrel processor ? I shall provide an idea here and then you guys figure out if it would be possible or not. What I would want as a programmer is something like the following: 1. Request memory contents/addresses with an instruction which does not block, for example: EnqueueReadRequest address1 Then it should be possible to "machine gun" these requests like so: EnqueueReadRequest address1 EnqueueReadRequest address2 EnqueueReadRequest address3 EnqueueReadRequest address4 EnqueueReadRequest address5 2. Block on response queue and get memory contents DequeueReadResponse register1 do something with register1, perhaps enqueue another read request DequeueReadResponse register2 DequeueReadResponse register3 If the queues act in order... then this would be sufficient. Otherwise extra information would be necessary to know which is what. So if queues would be out of order then the dequeue would need to provide which address the contents where for. DeQueueReadResponse content_register1, address_register2 The same would be done for writing as well: EnqueueWriteRequest address1, content_register EnqueueWriteRequest address2, content_register EnqueueWriteRequest address3, content_register There could then also be a response queue which notifies the thread when certain memory addresses where written. DequeueWriteResponse register1 (in order design) or DequeueWriteResponse content_register1, address_register2 (out of order design) There could also be some special instructions which would return queue status without blocking... Like queue empty count, queue full count, queue max count and perhaps a queue up count which could be used to change queue status in case something happened to the queue. For example each queue has a maximum ammount of entries available. 
The queueing/dequeuing instructions mentioned above would block until they succeed (meaning their request is placed on queue or response removed from queue) The counting instructions would not block. This way the cpu would have 4 queues at least: 1. Read Request Queue 2. Read Response Queue 3. Write Request Queue 4. Write Response Queue Each queue would have a certain maximum size. Each queue has counters to indicate how much "free entries there are" and how much "taken entries there are". For example, these are also querieable via instructions and do not block the thread, the counters are protected via hardware mutexes or so because of queieing and dequeing but as long as nothing is happening these counters should be able to return properly. Little correct: full should have been fill: GetReadRequestQueueEmptyCount register GetReadRequestQueueFillCount register GetReadResponseQueueEmptyCount register GetReadResponseQueueFillCount register GetWriteRequestQueueEmptyCount register GetWriteRequestQueueFillCount register GetWriteResponseQueueEmptyCount register GetWriteResponseQueueFillCount register All instructions should be shareable by threads... so that for example one thread might be postings read requests and another thread might be retrieving those read responses. Otherwise the first thread might block because of read request full, and nobody responding to response queue. Alternatively perhaps the instructions could also be made non-blocking, and return a status code to indicate if they operation succeeded or not, however then an additional code or mode would also be necessary to specify if it should be blocking or non-blocking... which might make things a bit too complex, but this is hardware-maker decision... in case many threads sharing is too difficult or impossible or too slow then non-blocking might be better, the thread can then cycle around read responses and see if anything came in so it can do something... however this would lead to high cpu usage... 
so for efficiency sake blocking is preferred, or perhaps a context switch until the thread no longer blocks. It would then still be necessary for the thread to somehow deal with responses... so this this seem to need multiple threads to work together for the blocking situation. The memory system/chips would probably also need some modifications to be able to deal with these memory requests and return responses. Perhaps also special wiring/protocols to be able to "pipeline"/"transfer as much of these requests/responses back and forth. So what you think of a "barrel" like addition to current amd/intel x86/x64 cpu's and there memory systems ?!? Possible or not ?!? This idea described above is a bit messy... but it's the idea that counts... if cpu manufacturers interested I might work it out some more to see how it would flesh out/work exactly Bye, Skybuck. |
#3
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message
b.home.nl...
> Can a x86/x64 cpu/memory system be changed into a barrel processor ?
[deletia]

Not directly, but they... sort of... already are.

The high-end Intel and AMD x86 CPUs are all superscalar designs, which means that internally the CPU is viewed as a collection of "resources" -- ALUs, instruction decoders, memory read units, memory write units, etc. -- that there are (typically) multiple instances of each of these resources, and that the CPU scheduler tries very hard to always keep all the resources busy. This effectively means that multiple instructions can be executed simultaneously, which effectively implements the "AddRequest, AddRequest, GetResponse, GetResponse" protocol that you'd like.

Now, add on the hyper-threading that's been around for a number of years now, and I'd say you have a result that, in practice, is not that far from a barrel processor. In fact, it's probably better insofar as popular metrics such as performance/(# of transistors * clock rate * power) or somesuch, in that the dynamic scheduling a superscalar CPU performs is often more efficient than a straight barrel implementation when you're running "general purpose" code such as a web browser or word processor. (Although I would expect that barrel CPUs have instructions that provide "hints" to the scheduler -- to suggest it not switch threads, or to keep or flush the caches, or whatever -- just as superscalar CPUs do... but also recall that when HT was added to Intel's x86 CPUs, for certain workloads HT actually slowed down overall throughput a bit too...)

As I think you've surmised, the trick to achieving high performance with CPUs is to prevent stalls. This is of course a non-trivial problem, and companies like Intel and AMD invest enormous resources into trying to get just a little bit better performance out of their designs; you can be certain that someone at these companies has very carefully considered which aspects of a barrel processor design they might "borrow" to improve their performance.

---Joel
#4
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
The only thing my program needs to do is fire off memory requests.
However, it seems the x86 CPU blocks on the first memory request and does nothing else. This is an AMD X2 3800+ processor. Perhaps newer processors don't have this problem anymore, but I would seriously doubt that.

So unless you come up with any proof, I am going to dismiss your story as complex-non-relevant-bull****.

It's not so hard to write a program which requests random memory accesses. You apparently should try it sometime.

Bye,
Skybuck.
#5
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message
b.home.nl...
> The only thing my program needs to do is fire off memory requests.
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.

Hmm, it shouldn't do that, assuming there aren't any dependencies between the next handful of instructions and the first one there. (But note that if you perform a load operation and the data isn't in the caches, it takes *many tens to hundreds* of CPU cycles to fetch the data from external DRAM; hence you *will* stall. There actually are instructions in the x86 architecture these days for "warming up" the cache by pre-fetching data, though -- this can help a lot when you know in advance you'll need data, e.g., a few hundred cycles from now; if you're looping over big sets of data, you just pre-fetch the next block while you work on the current one.)

A program that requests random memory accesses will very quickly stall for a long time (after the first couple of instructions), as you quickly exhaust the number of "memory read" resources available and have near-constant cache misses. Few real-world programs exhibit behavior that bad AFAIK, although I expect that some large database applications (that have to run through multiple indices for each request, where the indices and/or data are too big for the caches) might approach it.

---Joel
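The "pre-fetch the next block while you work on the current one" pattern described above can be sketched in C with the GCC/Clang `__builtin_prefetch` intrinsic, which compiles down to the x86 PREFETCH instructions. The prefetch distance of 64 elements is an illustrative guess, not a recommendation: the right distance depends on the machine and on how much work is done per element.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 64  /* elements ahead -- illustrative, tune per machine */

/* Sum an array, hinting the cache about data we'll need a few hundred
   cycles from now. The prefetch is only a hint: the result is identical
   with or without it; only the cache-miss latency changes. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* args: address, 0 = prefetch for read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        s += a[i];
    }
    return s;
}
```

Note this only helps when the address is known well in advance; as the later posts in this thread point out, it cannot help a dependent chain where each address comes out of the previous load.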
#6
Can a x86/x64 cpu/memory system be changed into a barrel processor?
Joel Koltner wrote:
"Skybuck Flying" wrote in message b.home.nl... The only thing my program needs to do is fire off memory requests. However it seems the x86 cpu blocks on the first memory request and does nothing else. Hmm, it shouldn't do that, assuming there aren't any dependencies between the next handful of instructions and the first one there. (But note that if you perform a load operation and the data isn't in the caches, it takes *many tens to hundreds* of CPU cycles to fetch the data from external DRAM; hence you *will* stall. There actually are instructions in the x86 architecture these days for "warming up" the cache by pre-fetching data, though -- this can help a lot when you know in advance you'll need data, e.g., a few hundred cycles from now; if you're looping over big sets of data, you just pre-fetch the next block while you work on the current one.) A program that requests random memory accesses will very quickly stall for a long time (after the first couple of instructions), as you quickly exhaust the number of "memory read" resources available and have near-constant cache misses. Few real-world pograms exhibit behavior that bad AFAIK, although I expect that some large database applications (that have to run through multiple indices for each request, where the indices and/or data are too big for the caches) might approach it. ---Joel The Intel processor also has prefetch options, and works with both incrementing memory access patterns or decrementing patterns. Using a "warm up" option is one thing, but the processor should also be able to handle prefetch on its own. Perhaps AMD has something similar ? Since this is posted to comp.arch, someone there should know. Skybuck's processor has an integrated memory controller, so there are possibilities. http://blogs.utexas.edu/jdm4372/2010...ead-read-only/ Both Intel and AMD, will have documentation on their website, addressing the need to optimize programs to run on the respective processors. 
And that is a good place for a programmer to start, to find the secrets of getting best performance. Paul |
#7
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying
wrote:
> The only thing my program needs to do is fire off memory requests.
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.

How do you know? The whole point about out-of-order execution is that it is transparent to the software, so it is not possible to write a program whose behaviour depends on whether blocking occurs or not. If you have a logic analyzer and you think you have results that prove in-order behaviour, then you'll have to provide more details.

That said, such things are well outside my comfort zone, so I personally won't be able to help.
#8
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
On Jun 10, 3:44 am, "Ken Hagan" wrote:
> On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying wrote:
>> The only thing my program needs to do is fire off memory requests.
>> However it seems the x86 cpu blocks on the first memory request and does
>> nothing else.

The CPU will not block if all of the outstanding accesses are to write-back cacheable memory.

> How do you know? The whole point about out-of-order execution is that it
> is transparent to the software,

No, the whole point of precise exceptions is to be transparent to software. The point of OoO is to improve performance; adding precise exceptions to OoO gives you high performance and is relatively transparent to software (but not entirely).

> so it is not possible to write a program
> whose behaviour depends on whether blocking occurs or not.

One can EASILY detect blocking (or not) by comparing the wall clock time on multi-million memory access codes. One can infer the latencies to the entire cache hierarchy, including main memory, and whether or not main memory accesses are being processed with concurrency.

Mitch
#9
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
Already tried prefetching for RAM; it's pretty useless...

Especially for random access, especially for dependencies, especially when the software doesn't yet know what to ask next. However, the problem may be parallelized. However, the CPU still blocks. Therefore making it parallel doesn't help. Only threading helps, but with two or four cores that doesn't impress.

I might read the article later on, but I fear I will be wasting my time. I scanned it a little bit; the code assumes in-sequence memory... pretty lame, it has nothing to do with the R in RAM. Also my memory seeks are very short, 4 to 6 bytes, therefore fetching more is pretty useless.

Bye,
Skybuck.

"Paul" wrote in message ...
> The Intel processor also has prefetch options, and works with both
> incrementing memory access patterns and decrementing patterns.
[snip]
#10
Can a x86/x64 cpu/memory system be changed into a barrel processor ?
"Skybuck Flying" wrote in message
. home.nl...
> Programmers try to write programs so they be fast.

Well, yes and no -- these days, the vast majority of programs are at least *initially* written more from the point of view of trying to get them to be maintainable and correct (bug-free). After that has occurred, if there are any significant performance bottlenecks (and in many programs there may not be, because the app is waiting on the human user for input or the Internet or something else quite slow), programmers go back and work on those performance-critical areas.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" -- Donald Knuth -- see: http://en.wikipedia.org/wiki/Program_optimization, where there are other good quotes as well, such as "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity."

Also keep in mind that since the vast majority of code is now written in a high-level language, it's primarily the purview of a *compiler* to generate "reasonably" efficient code -- it has much more intimate knowledge of the particular CPU architecture being targeted than the programmer usually does; most programmers should be concentrating on efficient *algorithms* rather than all these low-level details regarding parallelism, caching, etc. I mean, when I first started programming and learned C (back in the early '90s), you were often told a couple of "tricks" to make the code run faster at the expense of readability. Today the advice is completely the opposite: make the code as readable as possible; a good compiler will generally create output that's just as efficient as the old-school code ever was.

> Do not think that slow programs would be released.

"Slow" is kinda relative, though. As much as it pains me and others around here at times, it's hard to argue that just because a program is truly glacial on a 233MHz Pentium (the original "minimum hardware requirement" for Windows XP), if it's entirely snappy on a 2.4GHz CPU it's not *really* "slow."

> No better hardware then no better software.

Hardware has gotten better, and software design is still struggling to catch up: there's still no widely-adopted standard that has "taken over the world" insofar as programming efficiently for multi-core CPUs. Take a look at something like the GreenArrays CPUs: http://greenarraychips.com/ -- 144 CPU cores, and no fall-off-the-log easy method to get all of them to execute in parallel for many standard procedural algorithms.

> Ask yourself one very important big question: What does the R stand for
> in RAM ?

Notice that your motherboard is populated with SDRAM, which is a rather different beast than "old school" RAM -- it's not nearly as "random" as you might like, at least insofar as what provides the maximum bandwidth.

---Joel
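The point about SDRAM rewarding sequential access can be illustrated with a small C sketch: sum the same buffer once in address order and once in a shuffled order. The sizes here are illustrative, and the sketch does no timing itself -- the idea is that the two passes compute the identical sum, yet on real hardware the sequential pass runs much faster, since it streams whole cache lines and consecutive DRAM columns while the shuffled pass tends to take a cache miss (and often a DRAM row activation) per element.

```c
#include <stdlib.h>

/* Sum buf[] in the order given by idx[]: pass the identity for a
   sequential sweep, or a shuffle for a "random" one. Same result,
   very different memory-access behavior. */
unsigned long long sum_by_index(const unsigned int *buf,
                                const size_t *idx, size_t n) {
    unsigned long long s = 0;
    for (size_t i = 0; i < n; i++) s += buf[idx[i]];
    return s;
}

/* Fisher-Yates shuffle of an identity index array. */
size_t *shuffled_indices(size_t n) {
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    return idx;
}
```

Timing these two orders on a buffer much larger than the caches (with the wall-clock approach from earlier in the thread) is one way to see for yourself how un-"random" SDRAM's best-case bandwidth really is.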