#1
No linearity in benchmarks (7-zip, video encoding) but kernel compiles show linearity with cores
Hi, I've been watching a video on YT:
https://www.youtube.com/watch?v=HgrcLaixUmY

He's comparing the Pentium G4560 vs the 1400-OC. In the 7-Zip test you can clearly see the lack of linearity/scaling with cores against the i5. However, in

http://blog.stuffedcow.net/wp-conten...ad-scaling.png
http://blog.stuffedcow.net/2011/08/h...g-performance/

you can see a linear (1:1) improvement as new cores come online. I was wondering why this is so. The processors are different (Ivy Bridge) and the task is also different (kernel compiles), but what causes such a drastic lack of linearity? If the cores are the same/cloned then you ought to see the same performance across them, AND the i5 has the same core as the Pentium, just with HT enabled.

Anyone looking at the stuffedcow link would conclude that Hyper-Threading was of no great significance and was roughly equivalent to half a core (with 4 threads). Anyone looking at the YT video would rapidly conclude HT was super duper, since it enables the Pentium to catch up with the i5! Two very opposing results, and endless confusion as to why.
#2
veek wrote:
[quoted text snipped]

The purpose of Hyper-Threading is to give a computing core something to do while a memory access is blocking forward progress.

    Register Set #1     Register Set #2
          |                   |
          +------ Compute ----+
                  Engine

So what happens is: you could be running on Register Set #1. The Compute Engine makes a memory access, and it takes maybe 30 cycles at the high clock speed before the data comes back. The Compute Engine flips over to Register Set #2. Now, it just happens that the memory access Register Set #2 was waiting on has become available: a new cache line is loaded into the L1 cache, and the "computing" of the thread running on Register Set #2 continues. Eventually, the Register Set #2 algorithm makes a memory access that's not in the cache, the Compute Engine stalls, and it flips back to Register Set #1.
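The flipping between register sets described above can be sketched as a toy simulation. All the numbers here (a 30-cycle miss penalty, 10 compute cycles between misses) are made-up illustrations, not a model of any real CPU:

```python
# Toy model of SMT latency hiding: one compute engine, two hardware
# thread contexts. When the active context stalls on a memory miss,
# the engine flips to the other context instead of idling.

MISS_PENALTY = 30   # cycles before a missed load returns (assumed)
WORK_CHUNK = 10     # compute cycles between misses (assumed)
TOTAL_WORK = 100    # compute cycles each thread must finish

def run(num_contexts):
    done = [0] * num_contexts          # compute cycles completed per context
    ready_at = [0] * num_contexts      # cycle when each context's data arrives
    cycle = 0
    while min(done) < TOTAL_WORK:
        # pick a context whose data has arrived and which still has work
        runnable = [i for i in range(num_contexts)
                    if ready_at[i] <= cycle and done[i] < TOTAL_WORK]
        if not runnable:
            # every unfinished context is stalled: the engine idles
            cycle = min(ready_at[i] for i in range(num_contexts)
                        if done[i] < TOTAL_WORK)
            continue
        i = runnable[0]
        cycle += WORK_CHUNK                # compute until the next miss
        done[i] += WORK_CHUNK
        ready_at[i] = cycle + MISS_PENALTY # issue the next load
    return cycle

one_thread = run(1)    # engine idles during every miss
two_threads = run(2)   # the second context hides most of the stall
print(one_thread, two_threads)
```

With these assumed numbers, one context finishes 100 cycles of compute in 370 cycles, while two contexts finish 200 cycles of compute in 380: almost double the throughput from the same engine, which is the whole point of the second register set.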
In effect, you're getting slightly more use from the Compute Engine by making it work with "one outstanding miss" on memory fetch. The optimization can cause anywhere from -5% to +30% change in performance, depending on usage pattern. In some applications it's actually better to go into the BIOS and turn off Hyper-Threading. On some OSes, the problem is with the OS itself and how it treats cores: what it declares to software as "physical" or "virtual" cores. For example, if I had a 6C 12T processor and the OS is poorly designed, maybe it tells the application that this is a 12C processor.

*******

If 7-Zip is running on all cores, 7-Zip makes a lot of memory references. There will be lots of "blocking", because typical processor designs are "memory starved". And even at higher core counts, the core-to-core interconnect (ring bus or HyperTransport) causes long delays before memory accesses are satisfied. In extreme cases on a server, maybe it takes 2 us for a memory access to complete.

Analyzing the blockage on kernel compiles would be very difficult. I've seen people throw good money at storage subsystems, only to get no improvement in compile time. The best way to reduce compile times is with "build farms", using things like distcc or equivalent. And even those don't have infinite speedup, because distcc doesn't accelerate all stages of building, only a limited subset. So if you're an "amateur plumbing engineer", you'll only end up frustrated trying to make compiles scale well. You can do everything right, and it's still unexplainably slow.

*******

If you have a 6C 12T processor, you don't have "12 cores". You have 6.3 cores, or maybe 6.6 cores. That's the speedup. Hyper-Threading helps a lot when comparing a 1C 1T processor to a 1C 2T processor, and that has more to do with process scheduling in the OS than anything else. It's not the compute speed that differs in that case, it's the user's perception of "responsiveness".
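The "6.3 cores" figure above can be framed as a simple effective-core estimate: each physical core plus an assumed fractional gain from its sibling hardware thread. The yield values below are illustrative, taken from the -5% to +30% range mentioned earlier; real workloads vary:

```python
# Effective core count for an SMT processor: physical cores scaled by
# an assumed per-core throughput gain from the second hardware thread.
def effective_cores(physical, smt_yield):
    """smt_yield: extra throughput per core from the sibling thread,
    e.g. 0.05 = +5%, 0.30 = +30% (assumed, workload-dependent)."""
    return physical * (1 + smt_yield)

for y in (-0.05, 0.05, 0.10, 0.30):
    print(f"6C12T at {y:+.0%} SMT yield = {effective_cores(6, y):.1f} effective cores")
```

At a +5% yield a 6C 12T part behaves like 6.3 cores, and at +10% like 6.6, matching the figures in the post; only a pathological +100% yield would ever make it "12 cores".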
It just seems more responsive on the desktop. If you're swimming in cores, you're not going to see a difference between a 4C 4T processor and a 4C 8T processor. The 4C 8T will win on benchmarks, of course, by a small percentage.

To get the best performance from 7-Zip, you "oversubscribe". On a 6C 12T processor, you set the thread count to 24, and the result is that the entire CPU stays at 100%. However, to do that in Ultra mode, you need 24 x 600 MB of memory for the dictionaries the compressor threads are using, which is 14.4 GB of memory. If you're using Fastest mode, the memory used for dictionaries is next to nothing.

Maybe the compressor runs at 100 MB/sec in Fastest mode. If the archive being compressed consists of 4 KB files, the compressor can actually "starve" because the file-reading thread cannot keep up... If the machine has enough memory, you can run hashdeep first to preload the read cache, and then 7-Zip runs at full speed.

See what fun performance tuning is?

Good luck,
Paul
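The dictionary-memory arithmetic in Paul's post can be checked directly. The 600 MB per-thread figure and the 2x oversubscription are taken from the post above; actual per-thread memory depends on the dictionary size you pick in 7-Zip:

```python
# Memory needed to oversubscribe 7-Zip on a 6C 12T CPU in Ultra mode.
hw_threads = 12            # hardware threads on a 6C 12T processor
oversubscribe = 2          # run 2x the hardware thread count, per the post
dict_mb_per_thread = 600   # per-thread dictionary cost in Ultra mode (from the post)

threads = hw_threads * oversubscribe             # 24 compressor threads
total_gb = threads * dict_mb_per_thread / 1000   # total dictionary memory
print(f"{threads} threads x {dict_mb_per_thread} MB = {total_gb:.1f} GB")
```

That's the 14.4 GB figure: more than many desktops of the era had installed, which is why oversubscribing only made sense with Fastest mode or plenty of RAM.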
#3
Paul wrote:
[quoted text snipped]
ah! Understanding at last - thanks awfully! I also found this link:

https://scalibq.wordpress.com/2012/0...lti-threading/

He basically says that single-thread performance heavily influences multiprocessing (Amdahl's law), because it clears computational blocks more quickly: the weakest/slowest computation will make the faster ones wait, so single-thread performance becomes important in a multi-core design. The C++ concurrency book (Williams) also had some material on threads and context switching, but your explanation was more comprehensive and understandable.

I need to buy a new computer for the house, hence the research. A lot of review sites praise the G4560, and the benchmarks seem to validate their claims - it seems to catch up with the i5, and I was wondering how that was so. It's clear now..
If you have shared data (gaming, 7-Zip, encoding) then Hyper-Threading becomes very important, because you can run two different computations (each thread being a different computation) on the shared data, OR you could load a different register set and run a different thread (instruction cache) while the first one blocks on memory access. It's a mechanism to improve core utilization.

The i5 will do nicely when you have a lot of processes, so if you are multitasking - VirtualBox + Linux + GIMP - then you need the cores; really, the R5 would be a better fit in that case. Anyway, I'll probably buy the Pentium. It has an IGP, and for my use - movies, browsing, typing, and the occasional GIMP/Inkscape or CAD - it would be okay. The R3 would be double the price AND I would have to buy an external gfx card, which would be underused, and right now gfx card prices are sky-high because of Ethereum mining. Additionally, I can't go wrong with the Pentium - worst case, I can use it as an HTPC.
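The Amdahl's-law point raised above can be made concrete with a few lines: any serial (non-parallelizable) fraction of the work caps the speedup you can get from extra cores, which is one reason a 7-Zip-style benchmark stops scaling linearly while a near-perfectly-parallel kernel compile keeps scaling. The serial fractions below are illustrative, not measured:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
# where s is the serial fraction of the work and n the core count.
def amdahl_speedup(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# A perfectly parallel job (s = 0) scales 1:1 with cores, like the
# kernel-compile graph; even 10% serial work drags 4 cores well below 4x.
for s in (0.0, 0.05, 0.10):
    print(f"s={s:.2f}: 4 cores -> {amdahl_speedup(s, 4):.2f}x")
```

With s = 0 four cores give exactly 4.0x; with just 10% serial work they give only about 3.08x, and the gap widens as cores are added.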
#4
veek wrote:
[quoted text snipped]
https://www.techpowerup.com/231183/a...-at-reddit-ama

    "This is likely also when the company rolls out "Raven Ridge" initially
     as mobile Ryzen products (BGA packages, which will likely also be used
     in AIOs), and later as desktop socket AM4 parts."

That's a 4C 8T Ryzen with a Vega APU. The first ones will be BGA packages, and they will be soldered to laptop motherboards. Later, the same processors will be put in AM4 packages for use in retail motherboards.

There's a possibility that chip will fit in a motherboard like this one, which has video connectors. If Raven Ridge comes out in AM4, then you can use the graphics connectors on this motherboard.

    "BIOSTAR X370GTN AM4"
    https://www.newegg.com/Product/Produ...82E16813138452
    http://www.hardwarecanucks.com/forum...rd-review.html

    "However, this motherboard also supports AMD's new seventh generation
     Bristol Ridge APUs and it will surely also support the upcoming
     Zen-based Raven Ridge APUs."   <--- not verified

    "If you do install an APU, your video output choices will be limited
     to DVI-D or HDMI 1.4."

So don't commit to an upgrade too quickly just yet. AMD still has more stuff to "dribble" out. Who knows, maybe Intel will have to adjust their pricing a tiny bit.

Paul
#5
Paul wrote:
[quoted text snipped]
I found another article on the plans for the chip.

http://wccftech.com/amd-raven-ridge-...m-glofo-amkor/

It's amazing they're using a silicon interposer for a lower-end design. And the APU will have "local memory" in the form of an HBM stack.

Paul
#6
Paul wrote:
[quoted text snipped]
ah - yep, I can wait. I'd planned on waiting till month-end anyhow, pending the Kaby Lake release to stores. I think the real time frame is "until someone kicks me into spending money".