#1
Pentium 4 HT vs dual XEON
Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading, an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual Xeon 2.4 GHz with a 533 MHz FSB and DDR266?
#2
"David Wang" wrote in message ...
> Mike mike@nospam wrote:
>> Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading, an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual Xeon 2.4 GHz with a 533 MHz FSB and DDR266?
> Depends on your workload. What does your workload look like?
> -- davewang202(at)yahoo(dot)com

I have an earlier P4 without HT running at 1.5 GHz, and the CPU is the bottleneck for a server app. I could use a second CPU, but not the expense of one. If Hyper-Threading really does appear and work as a second CPU, it would likely be enough while saving me money in the process. I'm not sure what is reality and what is hype. I would also hate to spend an extra $1000 on a dual Xeon only to find that the performance increase is just a few percent over an HT P4.
#3
If you are using it as a workstation, then I would say the 2.8C would probably be faster for most tasks. For a server with many users connected, a dual Xeon box would allow for items like PCI-X, real SCSI, etc. that would make it a more complete server platform.

"Mike" mike@nospam wrote in message ...
> Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading, an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual Xeon 2.4 GHz with a 533 MHz FSB and DDR266?
#4
Mike mike@nospam wrote:
> "David Wang" wrote in message ...
>> Mike mike@nospam wrote:
>>> Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading, an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual Xeon 2.4 GHz with a 533 MHz FSB and DDR266?
>> Depends on your workload. What does your workload look like?
> I have an earlier P4 without HT running at 1.5 GHz, and the CPU is the bottleneck for a server app.

"The CPU being the bottleneck" is still not enough of a characterization to know whether a dual Xeon or an HT P4 would be better.

> I could use a second CPU, but not the expense of one. If Hyper-Threading really does appear and work as a second CPU, it would likely be enough while saving me money in the process. I'm not sure what is reality and what is hype. I would also hate to spend an extra $1000 on a dual Xeon only to find that the performance increase is just a few percent over an HT P4.

Reality or hype depends on your workload. Think of it this way. The P4 has the advantage of raw memory bandwidth. On workloads (applications) that heavily utilize the memory system, the P4 with multiple threads/processes will outrun a dual Xeon system. On threads/applications that do not hit the memory system as hard and are more cache friendly, the dual Xeon system will be better.

So it goes back to my original statement. Depends on your workload. What does your workload look like? Is it cache-bound and compute-intensive, or memory-bound and light on computation? Dual Xeons will be better for the former, and the P4 HT will be better for the latter.

In either case, the P4 HT will be the cheaper alternative, and likely the "good enough" alternative, but if your workload is cache-bound and compute-intensive, the dual Xeon will be better.

-- davewang202(at)yahoo(dot)com
#5
"David Wang" wrote in message ...
> Mike mike@nospam wrote:
>> "David Wang" wrote in message ...
>>> Mike mike@nospam wrote:
>>>> Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading, an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual Xeon 2.4 GHz with a 533 MHz FSB and DDR266?
>>> Depends on your workload. What does your workload look like?
>> I have an earlier P4 without HT running at 1.5 GHz, and the CPU is the bottleneck for a server app.
> "The CPU being the bottleneck" is still not enough of a characterization to know whether a dual Xeon or an HT P4 would be better.
>> I could use a second CPU, but not the expense of one. If Hyper-Threading really does appear and work as a second CPU, it would likely be enough while saving me money in the process. I'm not sure what is reality and what is hype. I would also hate to spend an extra $1000 on a dual Xeon only to find that the performance increase is just a few percent over an HT P4.
> Reality or hype depends on your workload. Think of it this way. The P4 has the advantage of raw memory bandwidth. On workloads (applications) that heavily utilize the memory system, the P4 with multiple threads/processes will outrun a dual Xeon system. On threads/applications that do not hit the memory system as hard and are more cache friendly, the dual Xeon system will be better. So it goes back to my original statement. Depends on your workload. What does your workload look like?

I'm not sure how else I should measure the bottleneck other than seeing the CPU at 95% utilization in top. The application uses most of my memory, but no additional memory is used while the CPU stays high for a while, and there is no additional disk activity.

> Is it cache-bound and compute-intensive, or memory-bound and light on computation? Dual Xeons will be better for the former, and the P4 HT will be better for the latter.
> In either case, the P4 HT will be the cheaper alternative, and likely the "good enough" alternative, but if your workload is cache-bound and compute-intensive, the dual Xeon will be better.

Thanks again.
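One way to go beyond watching utilization in top is to time representative pieces of the workload on each candidate machine. The sketch below is only an illustration: `best_time`, `streaming_pass`, and `compute_pass` are hypothetical names, and the two toy workloads merely stand in for a memory-heavy pass and a cache-friendly pass. They are not the poster's server application.

```python
import time


def best_time(workload, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return min(times)


def streaming_pass(size_bytes=64 * 1024 * 1024):
    """Memory-heavy stand-in: allocate a large buffer and copy it once."""
    buf = bytearray(size_bytes)
    return bytes(buf)  # reads and writes every byte, with almost no computation


def compute_pass(iterations=2_000_000):
    """Cache-friendly stand-in: repeated arithmetic on a couple of values."""
    acc = 0
    for i in range(iterations):
        acc = (acc * 31 + i) % 1_000_003
    return acc


if __name__ == "__main__":
    # Run the same script on both boxes and compare the two timings.
    print("streaming:", best_time(streaming_pass))
    print("compute:  ", best_time(compute_pass))
```

If one machine wins clearly on the streaming pass and the other on the compute pass, the real question becomes which of the two the actual application resembles more.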
#6
Mike mike@nospam wrote:
> "David Wang" wrote in message ...
>> Reality or hype depends on your workload. Think of it this way. The P4 has the advantage of raw memory bandwidth. On workloads (applications) that heavily utilize the memory system, the P4 with multiple threads/processes will outrun a dual Xeon system. On threads/applications that do not hit the memory system as hard and are more cache friendly, the dual Xeon system will be better. So it goes back to my original statement. Depends on your workload. What does your workload look like?
> I'm not sure how else I should measure the bottleneck other than seeing the CPU at 95% utilization in top. The application uses most of my memory, but no additional memory is used while the CPU stays high for a while, and there is no additional disk activity.

1) Talk to others who run workloads similar to yours, and/or
2) Run your workload on a dual Xeon and a P4 HT box to compare.

Sorry to say, no one can tell which platform will be faster for that one workload specifically unless they've characterized it. Just watching the "95% utilization" in top doesn't do much for you.

This is a classic case where there are two different systems, each with its own areas of strength. Which system is "faster" for your workload depends on which area of strength your workload benefits from the most.

To give an example: suppose you're doing a lot of graphics manipulation where the task is just to chew through a 100+ MB image and twiddle each pixel in the image a tiny bit. This task will be memory intensive, and the P4 with HT will be faster. OTOH, if the tasks are a couple of compute-intensive programs whose respective working sets fit in the processor cache(s) and do not access memory much, then the Xeons will be faster, since both CPUs will be running as fast as they can independently.

-- davewang202(at)yahoo(dot)com
#7
David Schwartz wrote:
> "David Wang" wrote in message ...
>> Think of it this way. The P4 has the advantage of raw memory bandwidth. On workloads (applications) that heavily utilize the memory system, the P4 with multiple threads/processes will outrun a dual Xeon system. On threads/applications that do not hit the memory system as hard and are more cache friendly, the dual Xeon system will be better.
> In a single-CPU system, the FSB speed pretty much only matters if it's slower than main memory. The only bandwidth-intensive data going over the FSB are memory accesses and graphics data. It's pretty hard to saturate a 533 MHz FSB on a single-CPU machine.

The P4 is HT enabled, so there are now two threaded contexts issuing memory references. If both threads are doing nothing except moving multi-hundred-megabyte images around, it's not too hard to saturate the processor bus.

> On the other hand, on a multi-processor machine, all the inter-processor traffic goes over the FSB. The FSB speed will control the latency when one CPU requests data that's in the other's cache -- so the faster the better.

The problem here is that the Xeon platform has the 533 mbps FSB, while the P4 HT platform has the 800 mbps FSB. "Hitting each other's cache" depends on the workload. The comparison of which platform is "faster" still depends on the behaviour of the workload, which is what I had stated from the beginning.

-- davewang202(at)yahoo(dot)com
#8
"David Wang" wrote in message ...
> David Schwartz wrote:
> The P4 is HT enabled, so there are now two threaded contexts issuing memory references. If both threads are doing nothing except moving multi-hundred-megabyte images around, it's not too hard to saturate the processor bus.

Remember, the FSB can't move more data to and from memory than the memory bandwidth allows. Typical memory speeds are not sufficient to saturate a 533 MHz FSB; at least, I don't think they are. In other words, you're far more likely to max out your memory bandwidth before you max out your FSB.

> The problem here is that the Xeon platform has the 533 mbps FSB, while the P4 HT platform has the 800 mbps FSB. "Hitting each other's cache" depends on the workload. The comparison of which platform is "faster" still depends on the behaviour of the workload, which is what I had stated from the beginning.

There is no conceivable workload in which an 800 MHz FSB will outperform a 533 MHz FSB if the FSB is not a system bottleneck. That is, if the total usable bandwidth of your graphics and memory subsystems can be satisfied with a 533 MHz FSB, no workload will make an 800 MHz FSB faster (assuming a single physical CPU).

Now I do think that it's theoretically possible to max out a 533 MHz FSB with concurrent full-speed graphics blits and memory accesses, assuming you populated all of your motherboard's memory slots with fast memory, your motherboard had at least two DDR channels, and you had an AGP 8X graphics card. However, I doubt that any real-world workload could do this.

Yes, the CPU itself can certainly move more data than the FSB can. But if the total speed of everything downstream of the FSB is less than the FSB speed, the CPU would never be able to actually max out the FSB anyway.

To give you some real numbers: A 533 MHz FSB maxes out at 4.3 GB/s. AGP 8X maxes out at 2.1 GB/s. A DDR266 memory DIMM maxes out at 2.1 GB/s. A single 32-bit/33 MHz PCI bus maxes out at about 0.1 GB/s.

You do raise an interesting argument, however, that a machine with HT might be more likely to access these devices concurrently than one without. However, I don't believe this, because the number of execution units capable of memory access is limited, so in these cases the CPU core itself will not run as fast because those units are tied up by the memory.

Say a logical processor thread on an HT processor needs to read an area of memory that's not in the L1/L2 cache. The execution unit that's doing the read uses the FSB to read the data. That execution unit is tied up until the memory can deliver the data. The memory is slower than the FSB, so the time that execution unit is tied up is bound by the memory speed. Until the memory finishes, that execution unit can't issue any more reads (for either logical processor), so again the memory speed, not the FSB speed, will limit the processor's ability to issue requests over the FSB. The same argument applies to AGP accesses.

So in sum, I don't believe that the FSB speed will make a difference on practically any workload. Again, all this is null and void for machines with more than one physical processor. It's also less certain if you can find memory subsystems more than twice as fast as a single DDR266/PC2100 DIMM. (Though generally such implementations, at least those now available, don't ever come close to their advertised peak speeds. About as often as you run at your peak running speed.)

DS
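The peak figures quoted in this post follow from bus width times transfer rate. The sketch below reproduces that arithmetic; the bus widths and clock multipliers are the commonly published values for these interfaces and are not stated in the post itself, and `peak_gb_per_s` is a hypothetical helper name.

```python
def peak_gb_per_s(bus_width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s (10^9 bytes/s): bytes per transfer x transfers per second."""
    return bus_width_bits / 8 * mega_transfers_per_s * 1e6 / 1e9


# Assumed widths and rates (not given in the post): 64-bit FSB and DIMM,
# 32-bit AGP at 66.67 MHz x 8, 32-bit PCI at 33.33 MHz.
buses = {
    "533 MHz FSB": peak_gb_per_s(64, 533),
    "800 MHz FSB": peak_gb_per_s(64, 800),
    "AGP 8X": peak_gb_per_s(32, 66.67 * 8),
    "DDR266 DIMM": peak_gb_per_s(64, 266),
    "PCI 32-bit/33 MHz": peak_gb_per_s(32, 33.33),
}

if __name__ == "__main__":
    for name, gbs in buses.items():
        print(f"{name}: {gbs:.2f} GB/s")
```

These are theoretical peaks only; how much of each can actually be sustained is a separate question.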
#9
David Schwartz wrote:
> "David Wang" wrote in message ...
>> The P4 is HT enabled, so there are now two threaded contexts issuing memory references. If both threads are doing nothing except moving multi-hundred-megabyte images around, it's not too hard to saturate the processor bus.
> Remember, the FSB can't move more data to and from memory than the memory bandwidth allows. Typical memory speeds are not sufficient to saturate a 533 MHz FSB; at least, I don't think they are. In other words, you're far more likely to max out your memory bandwidth before you max out your FSB.

If you want to take this offline, we can talk about how, theoretically, we can saturate one or the other. It is true that DRAM BW tends to be saturated first.

>> The problem here is that the Xeon platform has the 533 mbps FSB, while the P4 HT platform has the 800 mbps FSB. "Hitting each other's cache" depends on the workload. The comparison of which platform is "faster" still depends on the behaviour of the workload, which is what I had stated from the beginning.
> There is no conceivable workload in which an 800 MHz FSB will outperform a 533 MHz FSB if the FSB is not a system bottleneck. That is, if the total usable bandwidth of your graphics and memory subsystems can be satisfied with a 533 MHz FSB, no workload will make an 800 MHz FSB faster (assuming a single physical CPU).

The 800 mbps FSB is coupled with the peak (6.4 GB/s) "dual channel" DDR SDRAM memory system. The 533 mbps FSB doesn't get anything that has a higher peak BW. There are some systems, such as the ServerWorks GC series, that have real independent memory controllers. Still, the "cheap bandwidth" available with the 800 mbps platform gives it a big advantage.

You went around in a circle in the logic above: you stated that if the 533 mbps FSB isn't a bottleneck, then the 800 mbps FSB won't be a bottleneck either, and this is self-explanatory. However, what I was stating was that, "depending on the workload" (and there are some such workloads), the edge does go to the 800 mbps FSB (with the associated higher-bandwidth memory system implicitly assumed). This is true even assuming only a single physical CPU.

> Now I do think that it's theoretically possible to max out a 533 MHz FSB with concurrent full-speed graphics blits and memory accesses, assuming you populated all of your motherboard's memory slots with fast memory, your motherboard had at least two DDR channels, and you had an AGP 8X graphics card. However, I doubt that any real-world workload could do this.

Unless you have seen all of the real-world workloads, it's not possible to make such a statement. There are some workloads that are heavily biased toward memory movement: once the data is moved onto the CPU, the CPU merely twiddles some bits, then moves the entire data stream back off the CPU again.

If you're talking about "real-world" workloads in terms of MS Office or Netscape or some home-office application, then I will readily grant you the point. None of these things even comes close to saturating a 533 mbps FSB, much less an 800 mbps FSB, but there are people running different types of workloads that they consider to be "real-world". Since the original poster did not disclose what he was running, nor why he was choosing between a dual Xeon and a P4-HT, I asked that he examine his workload more carefully and talk to others who have similar workloads.

> Yes, the CPU itself can certainly move more data than the FSB can. But if the total speed of everything downstream of the FSB is less than the FSB speed, the CPU would never be able to actually max out the FSB anyway.
> To give you some real numbers: A 533 MHz FSB maxes out at 4.3 GB/s. AGP 8X maxes out at 2.1 GB/s. A DDR266 memory DIMM maxes out at 2.1 GB/s. A single 32-bit/33 MHz PCI bus maxes out at about 0.1 GB/s.
> You do raise an interesting argument, however, that a machine with HT might be more likely to access these devices concurrently than one without. However, I don't believe this, because the number of execution units capable of memory access is limited, so in these cases the CPU core itself will not run as fast because those units are tied up by the memory.

I'm a bit lost by your statement here. I'm not sure what is being compared. To make myself clear, I was implicitly comparing a P4-HT platform with the 875 or 865 chipset, with the single controller but "dual channel" (not really, but close enough) DDR SDRAM.

The P4-HT platform has a peak BW of 6.4 GB/s on both the FSB and the memory interface, and it can sustain about half of that, or roughly 3.2 GB/s, in "McCalpin STREAM" bandwidth. It can sustain about 4.7 GB/s (IIRC) in a different bandwidth measurement benchmark. (If you're interested, I'll go look up the reference later.) The dual Xeon platform has the 4.3 GB/s peak BW on the FSB, and depending on the memory system, it can perhaps get about 2 to 2.5 GB/s of that in STREAM bandwidth. I think you can go even higher with dual-channel Rambus. Still, the P4-HT platform will have more of the "cheap bandwidth" on the high end, so the dual Xeon will be at a disadvantage if/when you're hitting something that's BW intensive.

> Say a logical processor thread on an HT processor needs to read an area of memory that's not in the L1/L2 cache. The execution unit that's doing the read uses the FSB to read the data. That execution unit is tied up until the memory can deliver the data. The memory is slower than the FSB, so the time that execution unit is tied up is bound by the memory speed. Until the memory finishes, that execution unit can't issue any more reads (for either logical processor), so again the memory speed, not the FSB speed, will limit the processor's ability to issue requests over the FSB. The same argument applies to AGP accesses.

That's not the way modern CPUs work. You don't ever "tie up" any execution resource on a memory miss. All modern OOO processors try to hide memory access latencies by trying to "do useful work". As a matter of fact, that's one of the big advertised advantages of Hyper-Threading (SMT).

The problem with OOO is that while you have a cache miss and you're out servicing the memory reference, the CPU tries to execute ahead, but eventually you run out of "valid" things to do, and the CPU has no choice but to stall and wait for the memory reference to return. With SMT/HT, you have two threaded contexts to pick instructions from, and the CPU can be kept "busy" more often, even while there are more memory references outstanding on the memory system.

Both threaded contexts can and do issue multiple memory references into the memory system even while the previous memory reference is still outstanding.

> So in sum, I don't believe that the FSB speed will make a difference on practically any workload. Again, all this is null and void for machines with more than one physical processor. It's also less certain if you can find memory subsystems more than twice as fast as a single DDR266/PC2100 DIMM. (Though generally such implementations, at least those now available, don't ever come close to their advertised peak speeds. About as often as you run at your peak running speed.)

I disagree.

-- davewang202(at)yahoo(dot)com
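The point about multiple outstanding memory references can be made concrete with Little's law: the amount of data that must be in flight to keep a bus busy equals bandwidth times latency. The sketch below is a rough illustration only. The 100 ns memory latency and 64-byte transfer size are assumptions picked for the example, not figures from this thread, and `outstanding_lines_needed` is a hypothetical name.

```python
def outstanding_lines_needed(bandwidth_gb_s, latency_ns, line_bytes=64):
    """Little's law: bytes in flight = bandwidth x latency.

    GB/s times ns gives bytes directly (1e9 bytes/s * 1e-9 s).
    line_bytes=64 and the latency passed in are illustrative assumptions.
    """
    bytes_in_flight = bandwidth_gb_s * latency_ns
    return bytes_in_flight / line_bytes


if __name__ == "__main__":
    for label, peak in (("533 FSB, 4.3 GB/s", 4.3), ("800 FSB, 6.4 GB/s", 6.4)):
        lines = outstanding_lines_needed(peak, latency_ns=100)
        print(f"{label}: about {lines:.1f} cache-line requests in flight")
```

Under these assumptions a single stalled context would need several independent misses in flight at once to approach the peak, which is why a second threaded context that keeps issuing references makes saturation easier to reach.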
#10
David Schwartz wrote:
> "David Wang" wrote in message ...
>> That's not the way modern CPUs work. You don't ever "tie up" any execution resource on a memory miss. All modern OOO processors try to hide memory access latencies by trying to "do useful work". As a matter of fact, that's one of the big advertised advantages of Hyper-Threading (SMT). The problem with OOO is that while you have a cache miss and you're out servicing the memory reference, the CPU tries to execute ahead, but eventually you run out of "valid" things to do, and the CPU has no choice but to stall and wait for the memory reference to return. With SMT/HT, you have two threaded contexts to pick instructions from, and the CPU can be kept "busy" more often, even while there are more memory references outstanding on the memory system.
> Okay, let me try to be clearer. The presumption is that we're doing as much memory/FSB access as we possibly can. So if the memory bus is the block, the other logical processor will very rapidly encounter a memory instruction, and it will have to wait for the bus too. All the instructions will wind up lined up behind the memory because the memory will be the bottleneck.

The point here is that each of the threaded contexts will try to execute instructions past memory references that miss the cache. With only a single threaded context, the CPU can only go so deep before it runs out of instructions to execute, so it will stop issuing further memory references, even if those references are completely independent of the outstanding memory references. Having HT allows the CPU to execute down multiple threaded instruction streams, and there could well be more memory references before both threads stall out completely waiting for some old reference to return with data.

> It's a tautology -- if the memory is the bottleneck, then the memory is the bottleneck and the instructions will proceed as fast as the memory can get the data to and from them. HT or no HT.

Having HT gives you multiple contexts within the same silicon so that there are more instructions to choose from and, as follows, possibly more (concurrent) memory references as well. The HT/SMT aspect of the processor will be able to (more easily) saturate the FSB/memory system.

> If one logical processor is doing a lot of memory accesses and gets stalled and the other one doesn't need the memory bus, then the other logical processor can keep running. So again, one processor is held up due to memory speed and the other runs full tilt. How will a faster FSB help?

The faster FSB is (presumed to be) tied to a memory system that supports higher memory bandwidth, so the queuing delays would be smaller on the 800 mbps FSB (because it has a higher-BW memory system) as compared to the 533 mbps FSB on the Xeon platform.

>> Both threaded contexts can and do issue multiple memory references into the memory system even while the previous memory reference is still outstanding.
> Yep, and if the memory is not the bottleneck for the CPU, then the FSB speed doesn't matter. All the data will be prefetched and lazily stored. But if the memory is the bottleneck, then a faster FSB again won't help because it won't make the memory faster.

We're not on the same page here. The P4-HT platform is assumed to have a memory system with higher DRAM bandwidth.

> I stand by my original statement. It is virtually impossible to max out a 533 MHz FSB on a machine with only one physical CPU. With dual-channel DDR SDRAM, you might be able to do it for very short periods of time on realistic loads, but frankly I doubt even that (because a 533 MHz FSB is always 533 MHz, whereas memory bandwidth numbers are peak numbers that can only be sustained for times on the order of microseconds).

The comparison here is not between one CPU and two CPUs on a 533 mbps FSB. The comparison here is between one P4 CPU with HT on an 800 mbps FSB and two physical Xeon processors on a shared 533 mbps FSB with a proportionally lower-bandwidth memory system.

On both the FSB and the DRAM interface, there are combinations of sequences of events that will prevent the data bus from being utilized at 100% "efficiency". It is true that the DRAM system is usually far less "efficient" and can only sustain a fraction of the peak throughput. If you look back a couple of posts, my reference to McCalpin's STREAM numbers implicitly acknowledges that fact. However, the FSB isn't free to achieve 100% "efficiency" either. Notably, R/W turnarounds require that a full cycle be burned between cache block transfers.

If we could simply take your statement that it is "virtually impossible to max out a 533 (mbps) interface with a single CPU" and replace it with "slightly less impossible to max out a 533 (mbps) interface with a single CPU that has two threaded contexts (SMT/HT)", then that could be reconciled closely enough with my original statement about which platform would be "higher performance" for the original poster.

Depends on your workload. What does your workload look like?

-- davewang202(at)yahoo(dot)com
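The recurring "depends on your workload" argument can be sketched as a simple bound: run time is limited by whichever of computation or memory traffic takes longer. The sustained bandwidth values below are the STREAM figures quoted earlier in the thread (about 3.2 GB/s for the P4 HT platform, 2 to 2.5 GB/s for the dual Xeon, taken here as 2.25). The compute rates, including a 20% Hyper-Threading gain, are made-up assumptions for illustration, as are the names `Platform`, `run_time`, and `faster`.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    compute_rate: float   # relative work units per second (assumed, not measured)
    sustained_bw: float   # sustained memory bandwidth in GB/s (STREAM-style)


def run_time(p, work_units, gigabytes_moved):
    """Lower bound on run time: the slower of compute and memory traffic."""
    return max(work_units / p.compute_rate, gigabytes_moved / p.sustained_bw)


def faster(a, b, work_units, gigabytes_moved):
    """Name of the platform with the smaller run-time bound (b wins ties)."""
    ta = run_time(a, work_units, gigabytes_moved)
    tb = run_time(b, work_units, gigabytes_moved)
    return a.name if ta < tb else b.name


# Assumption: one 2.8 GHz core with a 20% HT gain versus two 2.4 GHz cores.
p4_ht = Platform("P4 HT", compute_rate=2.8 * 1.2, sustained_bw=3.2)
dual_xeon = Platform("dual Xeon", compute_rate=2 * 2.4, sustained_bw=2.25)

if __name__ == "__main__":
    print(faster(p4_ht, dual_xeon, work_units=100, gigabytes_moved=1))
    print(faster(p4_ht, dual_xeon, work_units=1, gigabytes_moved=100))
```

With these assumed inputs, the compute-heavy case favors the dual Xeon and the memory-heavy case favors the P4 HT, matching the qualitative claim in the thread. The crossover point moves with the assumed rates, so the model says nothing about any specific real application.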