Pentium 4 HT vs dual XEON

#1 July 14th 03, 04:23 PM

Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading,
an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual
Xeon 2.4 GHz with a 533 MHz FSB and DDR266?

#2 July 14th 03, 06:14 PM

"David Wang" wrote in message
...
Mike mike@nospam wrote:
Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading,
an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual
Xeon 2.4 GHz with a 533 MHz FSB and DDR266?

Depends on your workload. What does your workload look like?

--
davewang202(at)yahoo(dot)com

I have an earlier P4 without HT running at 1.5ghz with my CPU being the
bottleneck for a server app. I could use a second CPU but not the expense
of it. If hyperthreading does actually appear and work as a second CPU it
should likely be enough while saving me $ in the process. Not sure what is
reality or hype. I also hate to have to spend an extra $1000 on a dual XEON
only to find out the performance increase was only a few percent over an HT
P4.

#3 July 15th 03, 12:18 AM

If you are using it as a workstation, then I would say the 2.8C would
probably be faster for most tasks. For a server with many users connected,
a dual Xeon box would allow for items like PCI-X, real SCSI, etc. that
would make it a more complete server platform.

"Mike" mike@nospam wrote in message
...
Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading,
an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual
Xeon 2.4 GHz with a 533 MHz FSB and DDR266?

#4 July 15th 03, 04:35 AM

Mike mike@nospam wrote:

"David Wang" wrote in message
...
Mike mike@nospam wrote:
Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading,
an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual
Xeon 2.4 GHz with a 533 MHz FSB and DDR266?

Depends on your workload. What does your workload look like?

I have an earlier P4 without HT running at 1.5ghz with my CPU being the
bottleneck for a server app.

"The CPU being the bottleneck" is still not enough of a characterization
to know if a Dual Xeon or a HT P4 would be better.

I could use a second CPU but not the expense
of it. If hyperthreading does actually appear and work as a second CPU it
should likely be enough while saving me $ in the process. Not sure what is
reality or hype. I also hate to have to spend an extra $1000 on a dual XEON
only to find out the performance increase was only a few percent over an HT
P4.

Reality or Hype depends on your workload.

Think of it this way. The P4 has the advantage of raw memory bandwidth.
On workloads (applications) that heavily utilize the memory system, the P4
with multiple threads/processes will outrun a dual Xeon system. On
threads/applications that do not hit the memory system as hard, and are
more cache friendly, the dual Xeon system will be better.

So it goes back to my original statement. Depends on your workload.
What does your workload look like?

Is it more cache-bound and compute-intensive, or memory-bound and light
on computation? Dual Xeons will be better for the former, and the P4 HT
will be better for the latter. In either case, the P4 HT will be
the cheaper alternative, and likely the "good enough" alternative,
but if your workload is cache-bound and compute-intensive, the dual
Xeon will be better.

--
davewang202(at)yahoo(dot)com

#5 July 15th 03, 05:59 AM

"David Wang" wrote in message
...
Mike mike@nospam wrote:

"David Wang" wrote in message
...
Mike mike@nospam wrote:
Can someone tell me whether a Pentium 4 2.8 GHz chip with Hyper-Threading,
an 800 MHz FSB, and DDR400 will equal or surpass the performance of a dual
Xeon 2.4 GHz with a 533 MHz FSB and DDR266?

Depends on your workload. What does your workload look like?

I have an earlier P4 without HT running at 1.5ghz with my CPU being the
bottleneck for a server app.

"The CPU being the bottleneck" is still not enough of a characterization
to know if a Dual Xeon or a HT P4 would be better.

I could use a second CPU but not the expense
of it. If hyperthreading does actually appear and work as a second CPU it
should likely be enough while saving me $ in the process. Not sure what is
reality or hype. I also hate to have to spend an extra $1000 on a dual XEON
only to find out the performance increase was only a few percent over an HT
P4.

Reality or Hype depends on your workload.

Think of it this way. The P4 has the advantage of raw memory bandwidth.
On workloads (applications) that heavily utilize the memory system, the P4
with multiple threads/processes will outrun a dual Xeon system. On
threads/applications that do not hit the memory system as hard, and are
more cache friendly, the dual Xeon system will be better.

So it goes back to my original statement. Depends on your workload.
What does your workload look like?

I'm not sure how else I should measure the bottleneck other than seeing
the CPU at 95% utilization in top. The application uses most of my memory,
but while the CPU stays high for a while, no additional memory is being
used and there is no additional disk activity.

Is it more cache-bound and compute-intensive, or memory-bound and light
on computation? Dual Xeons will be better for the former, and the P4 HT
will be better for the latter. In either case, the P4 HT will be
the cheaper alternative, and likely the "good enough" alternative,
but if your workload is cache-bound and compute-intensive, the dual
Xeon will be better.

thx again.

#6 July 15th 03, 06:18 AM

Mike mike@nospam wrote:

"David Wang" wrote in message
...

Reality or Hype depends on your workload.

Think of it this way. The P4 has the advantage of raw memory bandwidth.
On workloads (applications) that heavily utilize the memory system, the P4
with multiple threads/processes will outrun a dual Xeon system. On
threads/applications that do not hit the memory system as hard, and are
more cache friendly, the dual Xeon system will be better.

So it goes back to my original statement. Depends on your workload.
What does your workload look like?

I'm not sure how else I should measure the bottleneck other than seeing
the CPU at 95% utilization in top. The application uses most of my memory,
but while the CPU stays high for a while, no additional memory is being
used and there is no additional disk activity.

1) Talk to others that run workloads similar to yours.

and/or

2) Run your workload on a Dual Xeon and a P4 HT box to compare.
Sorry to say that no one can tell which platform will be faster
for that one workload specifically, unless they've characterized
it.

Just watching the "95% utilization" in top doesn't do much for you.
This is a classic case of two different systems, each with its own
areas of strength. Which system is "faster" for your workload depends
on which area of strength your workload would benefit from the most.

To give an example, suppose you're doing a lot of graphics manipulation
where the task is just to chew through a 100+ MB image and twiddle
each pixel in the image just a tiny bit. This task will be memory
intensive, and the P4 with HT will be faster. OTOH, if the tasks
are a couple of compute-intensive programs whose respective working
sets fit in the processor cache(s) and do not access memory much,
then the Xeons will be faster, since both CPUs will be running as
fast as they can independently.
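
The two workload shapes in that example can be sketched roughly as below. This is only an illustration, not a benchmark of either platform: the function names, the 1024-element working set, and the arithmetic in the compute loop are all made up for the sketch, and in Python the interpreter overhead dominates the compute loop, so the timings say little about the hardware. The point is the access pattern: one pass streams through a large buffer and touches each byte once, the other hammers a small working set that fits in cache.

```python
import time

# Lookup table mapping each byte value b to (b + 1) mod 256.
ADD_ONE = bytes((b + 1) % 256 for b in range(256))


def stream_pass(buf):
    """Memory-bound shape: touch every byte of a large buffer exactly once."""
    return buf.translate(ADD_ONE)


def compute_pass(iterations, working_set=1024):
    """Cache-friendly shape: many arithmetic steps over a small working set.

    The working-set size and the arithmetic are arbitrary (hypothetical).
    """
    data = list(range(working_set))
    total = 0
    for i in range(iterations):
        total = (total * 31 + data[i % working_set]) % 1_000_003
    return total


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    image = bytearray(100 * 1024 * 1024)  # stand-in for a 100 MB image
    _, t_stream = timed(stream_pass, image)
    _, t_compute = timed(compute_pass, 2_000_000)
    print(f"streaming pass: {t_stream:.3f} s, compute pass: {t_compute:.3f} s")
```

Running one copy of each pass per logical or physical CPU on both candidate machines would be closer to the comparison suggested in point 2 above, though a real test should use the actual server application.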

--
davewang202(at)yahoo(dot)com

#7 July 15th 03, 03:18 PM

David Schwartz wrote:

"David Wang" wrote in message
...

Think of it this way. The P4 has the advantage of raw memory bandwidth.
On workloads (applications) that heavily utilize the memory system, the P4
with multiple threads/processes will outrun a dual Xeon system. On
threads/applications that do not hit the memory system as hard, and are
more cache friendly, the dual Xeon system will be better.

In a single CPU system, the FSB speed pretty much only matters if it's
slower than main memory. The only bandwidth-intensive data going over the
FSB are memory accesses and graphics data. It's pretty hard to saturate a
533Mhz FSB on a single CPU machine.

The P4 is HT enabled, so there are now two threaded contexts issuing
memory references. If both threads are doing nothing except moving
multi-hundred megabyte images around, it's not too hard to saturate the
processor bus.

On the other hand, on a multi-processor machine, all the inter-processor
traffic goes over the FSB. The FSB speed will control the latency when one
CPU requests data that's in the other's cache -- so the faster the better.

The problem here is that the Xeon platform has the 533 mbps FSB, while
the P4 HT platform has the 800 mbps FSB. "Hitting each other's cache"
depends on the workload. The comparison of which platform is "faster"
still depends on the behaviour of the workload, which was what I had
stated from the beginning.

--
davewang202(at)yahoo(dot)com

#8 July 15th 03, 04:39 PM

"David Wang" wrote in message
...

David Schwartz wrote:

The P4 is HT enabled, so there are now two threaded contexts issuing
memory references. If both threads are doing nothing except moving
multi-hundred megabyte images around, it's not too hard to saturate the
processor bus.

Remember, the FSB can't move more data to and from memory than the
memory bandwidth. Typical memory speeds are not sufficient to saturate a
533Mhz FSB, at least, I don't think they are. In other words, you're far
more likely to max out your memory bandwidth before you max out your FSB.

The problem here is that the Xeon platform has the 533 mbps FSB, while
the P4 HT platform has the 800 mbps FSB. "Hitting each other's cache"
depends on the workload. The comparison of which platform is "faster"
still depends on the behaviour of the workload, which was what I had
stated from the beginning.

There is no conceivable workload in which an 800Mhz FSB will outperform
a 533Mhz FSB if the FSB is not a system bottleneck. That is, if the total
usable bandwidth of your graphics and memory subsystems can be satisfied
with a 533Mhz FSB, no workload will make an 800Mhz FSB faster (assuming a
single physical CPU).

Now I do think that it's theoretically possible to max out a 533Mhz FSB
with concurrent full-speed graphics blits and memory accesses, assuming you
populated all of your motherboards memory slots with fast memory, your
motherboard had at least two DDR channels, and you have an AGP8X graphics
card. However, I doubt that any real-world workload could do this.

Yes, the CPU itself can certainly move more data than the FSB can. But
if the total speed of everything downstream of the FSB is less than the FSB
speed, the cpu would never be able to actually max out the FSB anyway.

To give you some real numbers:

A 533 MHz FSB maxes out at 4.3 GB/s.
AGP 8X maxes out at 2.1 GB/s.
A DDR266 memory DIMM maxes out at 2.1 GB/s.
A single 32-bit/33 MHz PCI bus maxes out at about 0.13 GB/s.
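
For reference, these peak figures follow from transfer rate times bus width. The sketch below reproduces them; the bus widths and clock rates are the standard published figures for each interface (64-bit FSB and DIMM, 32-bit AGP and PCI), not something stated in this thread, and the function name is made up for the example.

```python
def peak_gb_per_s(transfers_per_s, bus_width_bytes):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_per_s * bus_width_bytes / 1e9


# (transfers per second, bus width in bytes); standard figures, assumed here.
BUSES = {
    "533 MHz FSB": (533.33e6, 8),        # 133 MHz clock, quad-pumped, 64-bit
    "AGP 8X": (66.67e6 * 8, 4),          # 66 MHz clock, 8 transfers/clock, 32-bit
    "DDR266 DIMM": (266.67e6, 8),        # 133 MHz clock, double data rate, 64-bit
    "PCI 32-bit/33 MHz": (33.33e6, 4),   # 33 MHz clock, 32-bit
}

if __name__ == "__main__":
    for name, (rate, width) in BUSES.items():
        print(f"{name}: {peak_gb_per_s(rate, width):.2f} GB/s")
```

These are theoretical ceilings only; sustained throughput on any of these buses is lower.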

You do raise an interesting argument, however, that a machine with HT
might be more likely to access these devices concurrently than one without.
However, I don't believe this, because the number of execution units
capable of memory access is limited, so in these cases the CPU core itself
will not run as fast because those units are tied up by the memory.

Say a logical processor on an HT processor needs to read an area of
memory that's not in the L1/L2 cache. The execution unit that's doing the
read uses the FSB to read the data. That execution unit is tied up until the
memory can get the data. The memory is slower than the FSB, so the time that
execution unit is tied up is bound by the memory speed. Until the memory
finishes, that execution unit can't issue any more reads (for either logical
processor), so again the memory speed will limit the processor's ability to
issue requests over the FSB, not the FSB speed. The same argument applies to
AGP accesses.

So in sum, I don't believe that the FSB speed will make a difference on
practically any workload. Again, all this is null and void for machines with
more than one physical processor. It's also less certain if you can find
memory subsystems more than twice as fast as a single DDR266/PC2100 DIMM.
(Though generally such implementations, at least those now available, don't
ever come close to their advertised peak speeds. About as often as you run
at your peak running speed.)

DS

#9 July 15th 03, 05:51 PM

David Schwartz wrote:

"David Wang" wrote in message
...

The P4 is HT enabled, so there are now two threaded contexts issuing
memory references. If both threads are doing nothing except moving
multi-hundred megabyte images around, it's not too hard to saturate the
processor bus.

Remember, the FSB can't move more data to and from memory than the
memory bandwidth. Typical memory speeds are not sufficient to saturate a
533Mhz FSB, at least, I don't think they are. In other words, you're far
more likely to max out your memory bandwidth before you max out your FSB.

If you want to take this offline, we can talk about how theoretically
we can saturate one or the other. It is true that DRAM BW tends to be
saturated first.

The problem here is that the Xeon platform has the 533 mbps FSB, while
the P4 HT platform has the 800 mbps FSB. "Hitting each other's cache"
depends on the workload. The comparison of which platform is "faster"
still depends on the behaviour of the workload, which was what I had
stated from the beginning.

There is no conceivable workload in which an 800Mhz FSB will outperform
a 533Mhz FSB if the FSB is not a system bottleneck. That is, if the total
usable bandwidth of your graphics and memory subsystems can be satisfied
with a 533Mhz FSB, no workload will make an 800Mhz FSB faster (assuming a
single physical CPU).

The 800 mbps FSB is coupled with the peak (6.4 GB/s) "dual channel" DDR
SDRAM memory system. The 533 mbps FSB doesn't get anything that has a
higher peak BW. There are some systems such as the Serverworks GC
series that have real independent memory controllers. Still, the "cheap
bandwidth" available with the 800 mbps platform gives it a big
advantage.

You went around in a circle in the logic above: you stated that if
the 533 mbps FSB isn't a bottleneck, then the 800 mbps FSB won't be a
bottleneck either, which is self-explanatory. However, what I was stating
was that, "depending on the workload", there are some workloads that
give the edge to the 800 mbps FSB (with the associated higher-bandwidth
memory system implicitly assumed). This is true even assuming only a
single physical CPU.

Now I do think that it's theoretically possible to max out a 533Mhz FSB
with concurrent full-speed graphics blits and memory accesses, assuming you
populated all of your motherboards memory slots with fast memory, your
motherboard had at least two DDR channels, and you have an AGP8X graphics
card. However, I doubt that any real-world workload could do this.

Unless you have seen all of the real-world workloads, it's not possible
to make such a statement. There are some workloads that are heavily biased
toward memory movement, and once the data is moved onto the CPU, the CPU
merely twiddles some bits, then moves the entire data stream back off of
the CPU again.

If you're talking about "real-world" workloads in terms of MS Office or
netscape or some home-office application, then I will readily grant you
the point. None of these things even come close to saturating a 533
mbps FSB, much less a 800 mbps FSB, but there are people running
different types of workloads that they consider to be "real-world".
Since the original poster did not disclose what he was running nor why
he was choosing between a dual Xeon and P4-HT, I asked that he should
examine his workload more carefully and talk to others that have similar
workloads.

Yes, the CPU itself can certainly move more data than the FSB can. But
if the total speed of everything downstream of the FSB is less than the FSB
speed, the cpu would never be able to actually max out the FSB anyway.

To give you some real numbers:

A 533 MHz FSB maxes out at 4.3 GB/s.
AGP 8X maxes out at 2.1 GB/s.
A DDR266 memory DIMM maxes out at 2.1 GB/s.
A single 32-bit/33 MHz PCI bus maxes out at about 0.13 GB/s.

You do raise an interesting argument, however, that a machine with HT
might be more likely to access these devices concurrently than one without.
However, I don't believe this, because the number of execution units
capable of memory access is limited, so in these cases the CPU core itself
will not run as fast because those units are tied up by the memory.

I'm a bit lost by your statement here. I'm not sure what is being
compared. To make myself clear, I was implicitly comparing a P4-HT
platform with the 875 or 865 chipset with the single controller but
"dual channel" (not really, but close enough) DDR SDRAM. The P4-HT
platform has a peak BW of 6.4 GB/s on both the FSB and the memory
interface, and it can sustain about half of that, or roughly 3.2GB/s
in "McCalpin-STREAM" bandwidth. It can sustain about 4.7 GB/s (IIRC)
via a different bandwidth measurement benchmark. (If you're interested,
I'll go lookup the reference later). The dual Xeon platform has the
4.3 GB/s peak BW on the FSB, and depending on the memory system, it
can perhaps get about 2 to 2.5 GB/s of that on STREAM bandwidth.
I think you can go even higher with the Dual channel Rambus.

Still, the P4-HT platform will have more of the "cheap bandwidth" on
the high end, so the dual Xeon will be at a disadvantage if/when
you're hitting something that's BW intensive.
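
To put the peak and sustained figures quoted above side by side, here is a small sketch that computes sustained bandwidth as a fraction of peak. The numbers are the ones given in this post (including the 4.7 GB/s figure recalled from memory); the dictionary layout and function name are only for illustration.

```python
def efficiency(sustained_gb_s, peak_gb_s):
    """Fraction of theoretical peak bandwidth that is actually sustained."""
    return sustained_gb_s / peak_gb_s


# Figures as quoted in the post above; none of them are new measurements.
PLATFORMS = {
    "P4-HT, 800 FSB, dual-channel DDR400": {"peak": 6.4, "sustained": [3.2, 4.7]},
    "Dual Xeon, 533 FSB": {"peak": 4.3, "sustained": [2.0, 2.5]},
}

if __name__ == "__main__":
    for name, figures in PLATFORMS.items():
        for sustained in figures["sustained"]:
            ratio = efficiency(sustained, figures["peak"])
            print(f"{name}: {sustained} of {figures['peak']} GB/s = {ratio:.0%}")
```

Both platforms sustain roughly half of their peak on these figures, so the P4-HT advantage comes from its higher ceiling rather than from better efficiency.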

Say a logical processor on an HT processor needs to read an area of
memory that's not in the L1/L2 cache. The execution unit that's doing the
read uses the FSB to read the data. That execution unit is tied up until the
memory can get the data. The memory is slower than the FSB, so the time that
execution unit is tied up is bound by the memory speed. Until the memory
finishes, that execution unit can't issue any more reads (for either logical
processor), so again the memory speed will limit the processor's ability to
issue requests over the FSB, not the FSB speed. The same argument applies to
AGP accesses.

That's not the way modern CPUs work. You don't ever "tie up" any execution
resource on a memory miss. All modern OOO processors try to hide
memory access latencies by trying to "do useful work". As a matter of
fact, that's one of the big advertised advantages of Hyperthreading
(SMT). The problem with OOO is that while you have a cache miss, and
you're out servicing the memory reference, the CPU tries to execute
ahead, but eventually you run out of "valid" things to do, and the CPU
has no choice but to stall and wait for the memory reference to return.
With SMT/HT, you have two threaded contexts to pick instructions from,
and the CPU can be kept "busy" more often, even while there are more
memory references outstanding on the memory system.

Both threaded contexts can and do issue multiple memory references
into the memory system even while the previous memory reference is still
outstanding on the memory system.
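
One way to see why the number of outstanding references matters is Little's law: sustained bandwidth is roughly the number of misses in flight times the bytes per miss, divided by the memory latency. The sketch below applies that relation; the 64-byte line size and 100 ns latency are illustrative assumptions, not measured values for either platform, and the function names are made up for the example.

```python
import math


def sustained_gb_per_s(outstanding_misses, line_bytes=64, latency_ns=100.0):
    """Little's law estimate: bytes in flight / latency. Bytes per ns equals GB/s.

    line_bytes and latency_ns are assumed, illustrative defaults.
    """
    return outstanding_misses * line_bytes / latency_ns


def misses_needed(target_gb_per_s, line_bytes=64, latency_ns=100.0):
    """Concurrent misses needed to sustain a target bandwidth under the same assumptions."""
    return math.ceil(target_gb_per_s * latency_ns / line_bytes)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} outstanding: {sustained_gb_per_s(n):.2f} GB/s")
    print("misses in flight to reach 4.3 GB/s:", misses_needed(4.3))
```

Under these assumed numbers a single blocking miss at a time would sustain well under 1 GB/s, which is why a core that keeps several independent misses in flight, from one or two threaded contexts, can push much closer to the bus limit.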

So in sum, I don't believe that the FSB speed will make a difference on
practically any workload. Again, all this is null and void for machines with
more than one physical processor. It's also less certain if you can find
memory subsystems more than twice as fast as a single DDR266/PC2100 DIMM.
(Though generally such implementations, at least those now available, don't
ever come close to their advertised peak speeds. About as often as you run
at your peak running speed.)

I disagree.

--
davewang202(at)yahoo(dot)com

#10 July 15th 03, 07:21 PM

David Schwartz wrote:

"David Wang" wrote in message
...

That's not the way modern CPUs work. You don't ever "tie up" any execution
resource on a memory miss. All modern OOO processors try to hide
memory access latencies by trying to "do useful work". As a matter of
fact, that's one of the big advertised advantages of Hyperthreading
(SMT). The problem with OOO is that while you have a cache miss, and
you're out servicing the memory reference, the CPU tries to execute
ahead, but eventually you run out of "valid" things to do, and the CPU
has no choice but to stall and wait for the memory reference to return.
With SMT/HT, you have two threaded contexts to pick instructions from,
and the CPU can be kept "busy" more often, even while there are more
memory references outstanding on the memory system.

Okay, let me try to be clearer. The presumption is that we're doing as
much memory/FSB access as we possibly can. So if the memory bus is the
block, the other logical processor will very rapidly encounter a memory
instruction, and it will have to wait for the bus too. All the instructions
will wind up lined behind the memory because the memory will be the
bottleneck.

The point here is that each of the threaded contexts will try to execute
instructions past memory references that miss the cache. With only a
single threaded context, the CPU can only go so deep before it runs out
of instructions to execute, so it will stop issuing further memory
references, even if those references are completely independent of the
outstanding memory references. Having HT allows the CPU to execute
down multiple threaded instruction streams, and there could well be
more memory references issued before both threads stall out completely,
waiting for some old reference to return with data.

It's a tautology -- if the memory is the bottleneck, then the memory is
the bottleneck and the instructions will proceed as fast as the memory can
get the data to and from them. HT or no HT.

Having HT gives you multiple contexts within the same silicon so that
there are more instructions to choose from and, it follows, possibly
more (concurrent) memory references as well. The HT/SMT aspect of the
processor makes it easier to saturate the FSB/memory system.

If one logical processor is doing a lot of memory accesses and gets
stalled and the other one doesn't need the memory bus, then the other
logical processor can keep running. So again, one processor is held up due
to memory speed and the other runs full tilt. How will a faster FSB help?

The faster FSB is (presumed to be) tied to a memory system that supports
higher memory bandwidth so that the queuing delays would be smaller on
the 800 mbps FSB (because it has a higher BW memory system), as compared
to the 533 mbps FSB on the Xeon platform.

Both threaded contexts can and do issue multiple memory references
into the memory system even while the previous memory reference is still
outstanding on the memory system.

Yep, and if the memory is not the bottleneck for the CPU, then the FSB
speed doesn't matter. All the data will be prefetched and lazily stored. But
if the memory is the bottleneck, then a faster FSB again won't help because
it won't make the memory faster.

We're not on the same page here. The P4-HT platform is assumed to have
a memory system with higher DRAM bandwidth.

I stand by my original statement. It is virtually impossible to max out
a 533Mhz FSB on a machine with only one physical CPU. With dual channel DDR
SDRAM, you might be able to do it for very short periods of time on
realistic loads, but frankly I doubt even that (because a 533Mhz FSB is
always 533Mhz, whereas memory bandwidth numbers are peak numbers that can
only be sustained for times on the order of microseconds).

The comparison here is not between one cpu and two cpu's on a 533 mbps
FSB. The comparison here is between one P4 CPU with HT on 800 mbps FSB
against two physical Xeon processors on the shared 533 mbps FSB with the
proportionally lower bandwidth memory system.

On both FSB as well as DRAM interface, there are combinations of
sequences of events that will prevent the data bus from being utilized
to the 100% "efficiency". It is true that the DRAM system is usually
far less "efficient" and could only sustain a fraction of the peak
throughput. If you look back a couple of posts, my references to
McCalpin's STREAM numbers implicitly acknowledge that fact. However,
the FSB isn't free to achieve 100% "efficiency" either. Notably,
R/W turnarounds would require that a full cycle be burned in between
cache block transfers.

Suppose we take your statement that it is "virtually impossible to
max out a 533 (mbps) interface with a single CPU" and replace it with
"slightly less impossible to max out a 533 (mbps) interface with a
single CPU that has two threaded contexts (SMT/HT)". That would then
line up closely enough with my original statement about which
platform would be "higher performance" for the original poster.

Depends on your workload. What does your workload look like?

--
davewang202(at)yahoo(dot)com
