#21
Tony Hill writes:

> On 06 Dec 2004 11:17:19 +0100, Per Ekman wrote:
>> Yousuf Khan writes:
>>> Yeah, but that's why I think AMD insists on calling their
>>> multiprocessor connection scheme SUMO (Sufficiently Uniform Memory
>>> Organization) rather than NUMA. It's not worth headaching over such
>>> small differences in latency, is basically what they're saying.
>> It's a bit of a crap argument, isn't it? Even if the latency is
>> small, the fact that it's a NUMA system impacts performance
>> (potentially by a lot), since the available memory bandwidth is
>> coupled to where you place your data.
> It does, but the difference is small, usually less than 10% and often
> much closer to 0%.

And sometimes 50%...

> When well over 90% of your memory accesses are coming from cache
> anyway and (assuming a totally random distribution in a strictly UMA
> setup) 50% of your memory accesses are going to be local, most of the
> performance difference is lost in the noise.

What you are saying is that if the application runs well on a NUMA
machine, it will run well on the Opteron. The problem is that if the
application _doesn't_ run well on a NUMA box, then it becomes important
to know that this machine is in fact a NUMA machine.

> Besides, remember that even in a classic UMA environment (i.e. a 2P or
> 4P Xeon server... or even a single-processor system) you STILL have
> differences in latency depending on where in memory your data resides,
> due to open vs. closed pages, TLB misses, etc.

I'm not talking about latency here, I'm talking about bandwidth. Are
your main memory references all going to the same memory controller, or
are they shared among the controllers in the system? That's the crux of
the matter. The classic example is OpenMP-parallelized STREAM.
Parallelize all the loops except the data initialization loop on a
system with hard memory affinity (such as Linux), then parallelize
_all_ the loops, and explain how the difference is "not worth
headaching over".

> Most users don't use their computer to run STREAM though. Even in the
> HPC community where memory bandwidth is king, STREAM is still a rather
> extreme case.

I admit I'm from the HPC sector, and memory bandwidth is very important
to many applications here. STREAM is an extreme case in some sense, but
the performance differences it can uncover are by no means
insignificant for real codes. Shared-memory MPI is one example, clearly
not uncommon software in HPC.

>> Bottom line IMO is that pretending that the system isn't NUMA is
>> doing customers a disservice.
> I've said it before and I'll say it again: hardware is cheap, software
> is expensive. It would be a true disservice to your customers to tell
> them to spend thousands upon thousands of dollars changing all their
> software for a small improvement in performance equal to a few hundred
> dollars of hardware costs.

It's a disservice not to let the customers decide for themselves. If
their code is memory-bandwidth limited and runs poorly on NUMA systems,
then you're doing them a disservice by selling them a system where they
_have_ to change their software to get it to run well, without telling
them about it.

>> They should know that treating the system as a UMA one is a bad idea.
> Spending lots of money to make all your software NUMA-aware is a bad
> idea when treating it as UMA and throwing a tiny amount of extra
> hardware at the job will do the trick. That's all that AMD is getting
> at. Besides, they do recognize that it is NUMA; they are just saying
> you don't NEED to worry about that if you don't want to, because the
> vast majority of the time the performance difference is lost in the
> noise.

It's a pretty strange argument in my eyes: "If you ignore the
applications that run poorly because of property X, then it makes sense
to downplay property X." True, but not helpful if you have such an
application.

*p
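To make the STREAM point above concrete, here is a deliberately crude
model (a sketch; the bandwidth figure and linear scaling are
assumptions, not measurements) of why the initialization loop matters
under Linux's first-touch page placement: whichever thread first writes
a page gets that page on its local node, so a serial init puts every
array behind one memory controller while a parallel init spreads them
across all of them.

```python
# Toy model of aggregate STREAM bandwidth on a multi-socket Opteron
# under first-touch page placement.  "local_bw" is an assumed
# per-controller figure; real systems also pay HyperTransport and
# coherence costs that this sketch ignores.

def aggregate_bandwidth(n_cpus, local_bw, parallel_init):
    """Peak aggregate streaming bandwidth (GB/s) under the toy model."""
    if parallel_init:
        # Init loop parallelized the same way as the compute loops:
        # each thread's slice of the arrays is local, so bandwidth
        # scales with the number of memory controllers.
        return n_cpus * local_bw
    # Serial init: every page lands on node 0, and all threads
    # contend for that single controller.
    return local_bw

local_bw = 5.3  # GB/s, roughly dual-channel DDR333 (assumed figure)
print(aggregate_bandwidth(4, local_bw, parallel_init=True))   # scales 4x
print(aggregate_bandwidth(4, local_bw, parallel_init=False))  # stuck at 1x
```

The "not worth headaching over" latency gap is a few tens of
nanoseconds; the bandwidth gap in this scenario is the full factor of
the socket count, which is Per's point.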
#22
In article , Tony Hill wrote:

> It could be that Intel still has a reasonable amount of inventory of
> their old "Northwood" P4 chips and they want to clear those out first,
> but that certainly doesn't seem to be the case looking at Intel's
> pricing structure and what is being sold by the major OEMs (Intel
> seems to be pushing Prescott VERY hard here).

A friend recently (a month ago, IIRC) wanted a Northwood for his DIY
computer, but he found that none of the usual suspects around here had
them in stock. Eventually he called the importer, who said that they
were out of stock and not getting any more either: buy a Prescott
instead.

Long story short, I'm not quite sure what the actual answer is, but
excessive inventory of 32-bit chips doesn't seem to make sense from
what I've seen. Considering the rate at which chips depreciate, I guess
manufacturers think pretty hard about what they can do to minimize
inventory.

--
Janne Blomqvist
#23
George Macdonald wrote:

> On Sun, 05 Dec 2004 17:37:16 GMT, Rob Stow wrote:
>> George Macdonald wrote:
>>> On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan wrote:
>>>> I found this whitepaper from HP to be pretty good; it is
>>>> surprisingly candid, considering HP was the co-inventor of the
>>>> Itanium. It does a pretty good job of explaining and summarizing
>>>> the similarities and differences between AMD64 and EM64T, and their
>>>> comparison to the Itanium's IA64 instruction set. AMD64 and EM64T
>>>> are "broadly compatible", but IA64 is a different animal
>>>> altogether.
>>>> Yousuf Khan
>>>> http://h200001.www2.hp.com/bc/docs/s.../c00238028.pdf
>>> Hmm, and the following quote: "However, the latency difference
>>> between local and remote accesses is actually very small because the
>>> memory controller is integrated into and operates at the core speed
>>> of the processor, and because of the fast interconnect between
>>> processors." is relevant to another discussion here. I wish we could
>>> get a firm answer on this one.
>> Not sure if this is exactly what you are looking for in the way of a
>> "firm answer", but the latencies in an Opteron system are:
>>
>> 0 hops:  80 ns uniprocessor (local access)
>>         100 ns multiprocessor (local access, with cache snooping on
>>                the other processors)
>> 1 hop:  115 ns
>> 2 hops: 150 ns
>> 3 hops: 190 ns
>>
>> I couldn't find my original source for those numbers, and the two-
>> and three-hop numbers above are a little higher than I remembered
>> them being. This time around I got them from this thread:
>> http://www.aceshardware.com/forum?read=80030960
>> That thread refers to this article:
>> http://www.digit-life.com/articles2/amd-hammer-family/
>> which gives slightly different numbers for a 2 GHz Opteron with
>> DDR333:
>>
>> Uniprocessor system: 45 ns
>> Dual-processor system: 0-hop 69 ns, 1-hop 117 ns
>> Four-processor system: 0-hop 100 ns, 1-hop 118 ns, 2-hop 136 ns
>>
>> I don't know if any of the numbers above are for cache misses or if
>> they are averages that include both hits and misses.
> Thanks for the data, but no - I guess I should have highlighted better
> what I was getting at: "the memory controller is integrated into and
> operates at the core speed of the processor", which is what was being
> discussed/disputed in another thread. I haven't been able to find any
> hard data from AMD on where the clock domain boundaries are in the
> Opteron/Athlon64, but if the memory controller is not operating at
> "core speed" it's now at the stage of Internet Folklore.

Ah, that one is much easier to answer. ;-) Straight from the horse's
mouth: http://www.amd.com/us-en/Processors/...7983%2C00.html

"By running at the processor's core frequency, an integrated memory
controller greatly increases bandwidth directly available to the
processor at significantly reduced latencies."
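For what it's worth, those digit-life four-way numbers let you work out
the expected latency under uniformly random page placement. In a
four-socket square HyperTransport topology, each node sees one node
(itself) at 0 hops, two at 1 hop, and one at 2 hops; the topology
assumption is mine, the latencies are the ones quoted above.

```python
# Expected main-memory latency under random page placement, weighting
# each hop-count latency by how many nodes sit at that distance.

def expected_latency(latency_by_hops, nodes_at_hops):
    """Average latency (ns) over a uniform distribution of pages."""
    total_nodes = sum(nodes_at_hops.values())
    return sum(latency_by_hops[h] * n
               for h, n in nodes_at_hops.items()) / total_nodes

# 4P Opteron, 2 GHz, DDR333 figures from the digit-life article:
four_way = expected_latency({0: 100, 1: 118, 2: 136},
                            {0: 1, 1: 2, 2: 1})
print(four_way)  # 118.0 ns
```

So a random placement averages 118 ns against a 100 ns local access,
an 18% penalty, which is consistent with the "small latency
difference" claim; the bandwidth argument elsewhere in the thread is a
separate matter.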
#24
keith wrote:

> I'd say that because in small systems (fewer than 8 CPUs), Opterons
> are coherent in hardware and thus sufficiently tightly coupled to be
> called UMA, as far as the user is concerned.

Yes, exactly my point: it's more or less UMA in the up-to-8-processor
range. After that, you can start thinking of it as NUMA. But having up
to 8 processors treated as UMA is quite a lot.

Yousuf Khan
#25
Per Ekman wrote:

> It's a bit of a crap argument, isn't it? Even if the latency is small,
> the fact that it's a NUMA system impacts performance (potentially by a
> lot), since the available memory bandwidth is coupled to where you
> place your data.

Actually, there was a story here not so long ago where one of the Linux
distros had been optimized with NUMA assumptions, and it actually ran
/slower/ than a non-NUMA kernel. In other words, the Linux kernel might
have spent more time making complex decisions about memory placement
than it was actually going to save from the latencies.

Accessing memory through the HyperTransport links should not be any
worse than the traditional front-side bus arrangement in Intel
processors. So it will match the Intel architectures that way, at the
very least. And whenever it goes through its own local memory
controller, it blows the Intel architectures away.

Yousuf Khan
#26
On Mon, 06 Dec 2004 18:39:46 GMT, Rob Stow wrote:

> George Macdonald wrote:
>> On Sun, 05 Dec 2004 17:37:16 GMT, Rob Stow wrote:
>>> snip
>> Thanks for the data, but no - I guess I should have highlighted
>> better what I was getting at: "the memory controller is integrated
>> into and operates at the core speed of the processor", which is what
>> was being discussed/disputed in another thread. I haven't been able
>> to find any hard data from AMD on where the clock domain boundaries
>> are in the Opteron/Athlon64, but if the memory controller is not
>> operating at "core speed" it's now at the stage of Internet Folklore.
> Ah, that one is much easier to answer. ;-) Straight from the horse's
> mouth: http://www.amd.com/us-en/Processors/...7983%2C00.html
> "By running at the processor's core frequency, an integrated memory
> controller greatly increases bandwidth directly available to the
> processor at significantly reduced latencies."

Ah, so there we have it... assuming this has been approved by the
technical folks. :-)

BTW, I notice that AMD seems to be cutting back on the depth of info in
their technical docs - the Product Data Sheets now consist of one
page... a far cry from the excruciating detail on cache operation etc.
we used to get.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" -
Who, me??
#27
George Macdonald wrote:

> On Mon, 06 Dec 2004 18:39:46 GMT, Rob Stow wrote:
>> snip
> Ah, so there we have it... assuming this has been approved by the
> technical folks. :-)
> BTW, I notice that AMD seems to be cutting back on the depth of info
> in their technical docs - the Product Data Sheets now consist of one
> page... a far cry from the excruciating detail on cache operation
> etc. we used to get.

The "Product Data Sheets" are indeed so brief as to be virtually
useless, but there is still a wealth of PDFs that provide details about
just about everything. The useless Product Data Sheet heads the list of
"AMD Opteron™ Processor Tech Docs" at
http://www.amd.com/us-en/Processors/...9_9003,00.html
but the other PDFs there have mind-numbing details about every little
thing that does not give away trade secrets. For example, read the
"BIOS and Kernel Developer's Guide for AMD Athlon™ 64 and AMD Opteron™
Processors".
#28
"John Savard" wrote in message ...

OK, I won't trim the wonderful newsgroup list, all of whose readers are
breathlessly awaiting my immortal prose....

> On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan wrote, in part:
>> I found this whitepaper from HP to be pretty good; it is surprisingly
>> candid, considering HP was the co-inventor of the Itanium. It does a
>> pretty good job of explaining and summarizing the similarities and
>> differences between AMD64 and EM64T, and their comparison to the
>> Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
>> compatible", but IA64 is a different animal altogether.
>> http://h200001.www2.hp.com/bc/docs/s...c00238028/c002 38028.pdf
> I would have preferred if you had given the URL of a page with a
> *link* on it to this manual. That would make it easier to
> back-navigate for other items of related interest, and it would have
> meant that the manual could be downloaded with a right-click without
> waiting for the browser plug-in to display the whole manual.

What braindamaged newsreader are you using that won't let you
right-click the link in the newsreader? Even OE does that. So quit
whining and switch to a decent newsreader.

> On page 13, under the heading "Power Considerations", I noticed a real
> whopper. Or, at least, what _seemed_ to me to be a real whopper
> initially. It is true that for a given implementation, a higher clock
> speed means more power consumption. It takes more power to make gates
> switch faster.

Probably referring to that esoteric equation P = s*f*0.5*C*V^2 (s being
the switching activity factor) which you may have encountered. Or
perhaps I = C dV/dt.

> However, if a higher clock speed is obtained by splitting the pipeline
> into more itty-bitty pieces, for the same level of instruction
> latency, then one still has the same number of gates, each consuming
> the same amount of power. (Except for the overhead of the pipelining
> process... and one more thing to be noted later.)

If one adds pipe stages, one has more gates, more latches, and more
clock drivers. And the power per gate goes up because of the higher
frequency.

> What is the point of splitting up a pipeline into smaller pieces? Is
> it to put more megahertz in the ad copy? No, it is so that more
> instructions can be executing, in different stages, at once. (Which
> means that a Pentium IV ought to have explicit vector instructions.
> Yes, it has a separate instruction cache and data cache, but there's
> still only one bus to *main memory*, and caches do have to get filled
> from somewhere.)

Actually, one reason for Intel to "superpipeline" was to jack up the
frequency for the ad copy. You lost me with the "Pentium IV ought to
have explicit vector instructions" leap.

> Since CMOS gates only consume power when they are changing state,
> unused elements of a non-pipelined ALU are not consuming power, so it
> may well be that a 14-stage pipelined ALU can consume twice as much
> power as a 7-stage pipelined ALU.

Or maybe 4 times, if the frequency is doubled.

> But that will be because twice as much of it is in use, not because it
> is going "twice as fast".

Clearly they are using "twice as fast" to mean "double the frequency".
Why do you find that so hard to understand?

> Since they are still sort of right, even if for the wrong reason,
> perhaps all I am criticizing is an oversimplification here. But I
> think that this can lead to a profound misconception of how
> microprocessors work.

What ARE you talking about?

> John Savard
> http://home.ecn.ab.ca/~jsavard/index.html

Del Cecchi
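Del's "maybe 4 times" follows directly from the dynamic-power equation
quoted in the post: doubling the frequency and doubling the switched
capacitance (more stages means more latches and clock drivers) each
contribute a factor of two. A quick sanity check, with all component
values made up purely for illustration:

```python
# Dynamic CMOS power, P = s * f * 0.5 * C * V^2, where s is the
# switching activity factor, f the clock frequency, C the switched
# capacitance, and V the supply voltage.

def dynamic_power(activity, freq_hz, cap_farads, volts):
    return 0.5 * activity * freq_hz * cap_farads * volts ** 2

# Illustrative (invented) values for a 7-stage design...
base = dynamic_power(0.2, 2.0e9, 50e-9, 1.5)
# ...and a 14-stage design: double the clock AND double the switched
# capacitance from the extra pipeline latches.
deeper = dynamic_power(0.2, 4.0e9, 100e-9, 1.5)

print(deeper / base)  # factor of 4, as Del suggests
```

Note the voltage term is quadratic, which is why in practice vendors
chase lower supply voltages at least as hard as lower capacitance.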
#29
Yousuf Khan writes:

> Per Ekman wrote:
>> It's a bit of a crap argument, isn't it? Even if the latency is
>> small, the fact that it's a NUMA system impacts performance
>> (potentially by a lot), since the available memory bandwidth is
>> coupled to where you place your data.
> Actually, there was a story here not so long ago where one of the
> Linux distros had been optimized with NUMA assumptions, and it
> actually ran /slower/ than a non-NUMA kernel. In other words, the
> Linux kernel might have spent more time making complex decisions
> about memory placement than it was actually going to save from the
> latencies.

And the conclusion was that a multi-CPU Opteron system must then be
UMA, rather than that the NUMA "optimizations" were crap?

> Accessing memory through the HyperTransport links should not be any
> worse than the traditional front-side bus arrangement in Intel
> processors. So it will match the Intel architectures that way, at the
> very least.

If your 4-way system with 4 separate memory controllers matches the
memory bandwidth of a shared-bus 1P system, that's good enough??
*boggle*

> And whenever it goes through its own local memory controller, it
> blows the Intel architectures away.

Latency-wise, perhaps; bandwidth-wise I doubt it, but feel free to
prove me wrong. You can't tout the scaling advantage of a NUMA approach
while pretending that it isn't NUMA. Well, you can, but IMO it's
dishonest and self-defeating.

I _like_ the Opteron systems; they have great performance and I work
with them daily. But I would hate them if I didn't know that they were
NUMA, because important codes would run terribly slowly under UMA
assumptions. NUMA is _NOT_ only about latency!

*p
#30
Yousuf Khan wrote:

> Per Ekman wrote:
>> It's a bit of a crap argument, isn't it? Even if the latency is
>> small, the fact that it's a NUMA system impacts performance
>> (potentially by a lot), since the available memory bandwidth is
>> coupled to where you place your data.
> Actually, there was a story here not so long ago where one of the
> Linux distros had been optimized with NUMA assumptions, and it
> actually ran /slower/ than a non-NUMA kernel. In other words, the
> Linux kernel might have spent more time making complex decisions
> about memory placement than it was actually going to save from the
> latencies. Accessing memory through the HyperTransport links should
> not be any worse than the traditional front-side bus arrangement in
> Intel processors. So it will match the Intel architectures that way,
> at the very least. And whenever it goes through its own local memory
> controller, it blows the Intel architectures away.
> Yousuf Khan

As Per correctly points out, it's not just latency but also bandwidth
that is key in NUMA. For example, if I have a process on Opteron A that
is reading data from Opteron B, then I can only read at HyperTransport
rate, which is something like 2.7 GB/s in one direction (that's the
best data-rate number I have for HT - if someone has a better one I'd
be happy to use it - and note that this is user data, not protocol
overhead). This is quite different from the local bandwidth of the
memory directly connected to Opteron A. Hence NUMA awareness is
important.

This is different, for example, from the SGI Altix Bx2 system, where an
Itanium can read at the full bus rate (local or remote). Clearly the
Altix architecture made a different set of tradeoffs, to match the
requirements of SGI's customer base.

BTW, I think Opterons have some nice characteristics and for certain
workloads are attractive, but they are not the best solution to all
problems, as some people seem to want to claim.

Another example would be making sure that people understand that when
the Opteron goes dual core, unless you double the memory bandwidth
available, you effectively cut the bandwidth per core in half. This
will impact some workloads quite dramatically. Has AMD made public
statements about supporting higher local bandwidth for the dual-core
chip?

Cheers, Mike
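The dual-core arithmetic above is worth writing down, if only as a
back-of-envelope sketch. The 2.7 GB/s HT figure is the one quoted in
the post; the 5.3 GB/s local figure is an assumed dual-channel DDR333
number, not a measurement.

```python
# Per-core streaming bandwidth if the controller's total local
# bandwidth stays fixed while the core count on the die grows.

def per_core_bandwidth(local_bw_gbs, cores):
    """Fair-share local bandwidth per core, GB/s."""
    return local_bw_gbs / cores

HT_ONE_WAY_GBS = 2.7   # remote-read ceiling over one HT link (quoted)
LOCAL_BW_GBS = 5.3     # assumed dual-channel DDR333 local bandwidth

print(per_core_bandwidth(LOCAL_BW_GBS, 1))  # single core: full share
print(per_core_bandwidth(LOCAL_BW_GBS, 2))  # dual core: half share
```

On these assumed numbers, a dual-core chip's per-core local share
(about 2.65 GB/s) drops to roughly the one-way HT rate, so Mike's point
stands: without a faster local memory system, "local" stops being much
better than "remote" for bandwidth-bound codes.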