#1
No linearity in benchmarks (7-zip, video encoding) but kernel compiles show linearity with cores
Hi, I've been watching a video on YT:
https://www.youtube.com/watch?v=HgrcLaixUmY

He's comparing the Pentium G4560 vs the 1400-OC. In the 7-Zip test you can clearly see the lack of linearity/scaling with cores against the i5. However, in

http://blog.stuffedcow.net/wp-conten...ad-scaling.png
http://blog.stuffedcow.net/2011/08/h...g-performance/

you can see a linear (1:1) improvement as new cores come online. I was wondering why this is so. The processors are different (Ivy Bridge) and the task is also different (kernel compiles), but what causes such a drastic lack of linearity? If the cores are the same/cloned then you ought to see the same performance across them, AND the i5 has the same core as the Pentium, just with HT enabled.

Anyone looking at the stuffedcow link would conclude that Hyper-Threading was of no great significance and was roughly equivalent to half a core (with 4 threads). Anyone looking at the YT video would rapidly conclude HT was super duper, since it enables the Pentium to catch up with the i5! Two very opposing results, and endless confusion as to why.
#2
veek wrote:
[quoted text snipped]

The purpose of Hyper-Threading is to give a computing core something to do while a memory access is blocking forward progress.

    Register Set #1     Register Set #2
          |                   |
          +------ Compute ----+
                  Engine

So what happens is: you could be running on Register Set #1. The Compute Engine makes a memory access, and it takes maybe 30 cycles at the high clock speed before the data comes back. The Compute Engine flips over to Register Set #2. Now, it just happens that the memory access Register Set #2 was waiting on has become available: a new cache line is loaded into the L1 cache, and the "computing" of the thread running on Register Set #2 continues. Eventually, the Register Set #2 algorithm makes a memory access that's not in the cache, the Compute Engine stalls, and it flips back to Register Set #1.
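The flipping between register sets described above can be sketched as a toy simulation. All the numbers here (a 30-cycle miss penalty, 10 compute cycles between misses) are made-up illustrations, not a model of any real CPU:

```python
# Toy model of SMT latency hiding: one compute engine, two hardware
# thread contexts. When the active context stalls on a memory miss,
# the engine flips to the other context instead of idling.

MISS_PENALTY = 30   # cycles before a missed load returns (assumed)
WORK_CHUNK = 10     # compute cycles between misses (assumed)
TOTAL_WORK = 100    # compute cycles each thread must finish

def run(num_contexts):
    done = [0] * num_contexts          # compute cycles completed per context
    ready_at = [0] * num_contexts      # cycle when each context's data arrives
    cycle = 0
    while min(done) < TOTAL_WORK:
        # pick a context whose data has arrived and which still has work
        runnable = [i for i in range(num_contexts)
                    if ready_at[i] <= cycle and done[i] < TOTAL_WORK]
        if not runnable:
            # every unfinished context is stalled: the engine idles
            cycle = min(ready_at[i] for i in range(num_contexts)
                        if done[i] < TOTAL_WORK)
            continue
        i = runnable[0]
        cycle += WORK_CHUNK                # compute until the next miss
        done[i] += WORK_CHUNK
        ready_at[i] = cycle + MISS_PENALTY # issue the next load
    return cycle

one_thread = run(1)    # engine idles during every miss
two_threads = run(2)   # the second context hides most of the stall
print(one_thread, two_threads)
```

With these assumed numbers, one context finishes 100 cycles of compute in 370 cycles, while two contexts finish 200 cycles of compute in 380: almost double the throughput from the same engine, which is the whole point of the second register set.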
In effect, you're getting slightly more use from the Compute Engine by making it work with "one outstanding miss" on memory fetch. The optimization can cause anywhere from -5% to +30% change in performance, depending on usage pattern. In some applications it's actually better to go into the BIOS and turn off Hyper-Threading. On some OSes, the problem is with the OS itself and how it treats cores: what it declares to software as "physical" or "virtual" cores. For example, if I had a 6C 12T processor and the OS is poorly designed, maybe it tells the application that this is a 12C processor.

*******

If 7-Zip is running on all cores, 7-Zip makes a lot of memory references. There will be lots of "blocking", because typical processor designs are "memory starved". And even at higher core counts, the core-to-core interconnect (ring bus or HyperTransport) causes long delays before memory accesses are satisfied. In extreme cases on a server, maybe it takes 2 us for a memory access to complete.

Analyzing the blockage on kernel compiles would be very difficult. I've seen people throw good money at storage subsystems, only to get no improvement in compile time. The best way to reduce compile times is with "build farms", using things like distcc or equivalent. And even those don't have infinite speedup, because distcc doesn't accelerate all stages of building, only a limited subset. So if you're an "amateur plumbing engineer", you'll only end up frustrated trying to make compiles scale well. You can do everything right, and it's still unexplainably slow.

*******

If you have a 6C 12T processor, you don't have "12 cores". You have 6.3 cores, or maybe 6.6 cores. That's the speedup. Hyper-Threading helps a lot when comparing a 1C 1T processor to a 1C 2T processor, and that has more to do with process scheduling in the OS than anything else. It's not the compute speed that differs in that case, it's the user's perception of "responsiveness".
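The "6.3 cores" figure above can be framed as a simple effective-core estimate: each physical core plus an assumed fractional gain from its sibling hardware thread. The yield values below are illustrative, taken from the -5% to +30% range mentioned earlier; real workloads vary:

```python
# Effective core count for an SMT processor: physical cores scaled by
# an assumed per-core throughput gain from the second hardware thread.
def effective_cores(physical, smt_yield):
    """smt_yield: extra throughput per core from the sibling thread,
    e.g. 0.05 = +5%, 0.30 = +30% (assumed, workload-dependent)."""
    return physical * (1 + smt_yield)

for y in (-0.05, 0.05, 0.10, 0.30):
    print(f"6C12T at {y:+.0%} SMT yield = {effective_cores(6, y):.1f} effective cores")
```

At a +5% yield a 6C 12T part behaves like 6.3 cores, and at +10% like 6.6, matching the figures in the post; only a pathological +100% yield would ever make it "12 cores".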
It just seems more responsive on the desktop. If you're swimming in cores, you're not going to see a difference between a 4C 4T processor and a 4C 8T processor. The 4C 8T will win on benchmarks, of course, by a small percentage.

To get the best performance from 7-Zip, you "oversubscribe". On a 6C 12T processor, you set the thread count to 24, and the result is that the entire CPU stays at 100%. However, to do that in Ultra mode, you need 24 x 600 MB of memory for the dictionaries the compressor threads are using, which is 14.4 GB of memory. If you're using Fastest mode, the memory used for dictionaries is next to nothing.

Maybe the compressor runs at 100 MB/sec in Fastest mode. If the archive being compressed consists of 4 KB files, the compressor can actually "starve" because the file-reading thread cannot keep up... If the machine has enough memory, you can run hashdeep first to preload the read cache, and then 7-Zip runs at full speed.

See what fun performance tuning is?

Good luck,
Paul
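The dictionary-memory arithmetic in Paul's post can be checked directly. The 600 MB per-thread figure and the 2x oversubscription are taken from the post above; actual per-thread memory depends on the dictionary size you pick in 7-Zip:

```python
# Memory needed to oversubscribe 7-Zip on a 6C 12T CPU in Ultra mode.
hw_threads = 12            # hardware threads on a 6C 12T processor
oversubscribe = 2          # run 2x the hardware thread count, per the post
dict_mb_per_thread = 600   # per-thread dictionary cost in Ultra mode (from the post)

threads = hw_threads * oversubscribe             # 24 compressor threads
total_gb = threads * dict_mb_per_thread / 1000   # total dictionary memory
print(f"{threads} threads x {dict_mb_per_thread} MB = {total_gb:.1f} GB")
```

That's the 14.4 GB figure: more than many desktops of the era had installed, which is why oversubscribing only made sense with Fastest mode or plenty of RAM.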
#3
Paul wrote:
[quoted text snipped]
ah! Understanding at last - thanks awfully! I also found this link:

https://scalibq.wordpress.com/2012/0...lti-threading/

He basically says that single-thread performance heavily influences multiprocessing (Amdahl's law), because it clears computational blocks more quickly: the weakest/slowest computation will make the faster ones wait, so single-thread performance becomes important in a multi-core design. The C++ concurrency book (Williams) also had some material on threads and context switching, but your explanation was more comprehensive and understandable.

I need to buy a new computer for the house, hence the research. A lot of review sites praise the G4560, and the benchmarks seem to validate their claims - it seems to catch up with the i5, and I was wondering how that was so. It's clear now..
If you have shared data (gaming, 7-Zip, encoding) then Hyper-Threading becomes very important, because you can run two different computations (each thread being a different computation) on the shared data, OR you could load a different register set and run a different thread (instruction cache) while the first one blocks on memory access. It's a mechanism to improve core utilization.

The i5 will do nicely when you have a lot of processes, so if you are multitasking - VirtualBox + Linux + GIMP - then you need the cores; really, the R5 would be a better fit in that case. Anyway, I'll probably buy the Pentium. It has an IGP, and for my use - movies, browsing, typing, and the occasional GIMP/Inkscape or CAD - it would be okay. The R3 would be double the price AND I would have to buy an external gfx card, which would be underused, and right now gfx card prices are sky-high because of Ethereum mining. Additionally, I can't go wrong with the Pentium - worst case, I can use it as an HTPC.
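The Amdahl's-law point raised above can be made concrete with a few lines: any serial (non-parallelizable) fraction of the work caps the speedup you can get from extra cores, which is one reason a 7-Zip-style benchmark stops scaling linearly while a near-perfectly-parallel kernel compile keeps scaling. The serial fractions below are illustrative, not measured:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
# where s is the serial fraction of the work and n the core count.
def amdahl_speedup(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# A perfectly parallel job (s = 0) scales 1:1 with cores, like the
# kernel-compile graph; even 10% serial work drags 4 cores well below 4x.
for s in (0.0, 0.05, 0.10):
    print(f"s={s:.2f}: 4 cores -> {amdahl_speedup(s, 4):.2f}x")
```

With s = 0 four cores give exactly 4.0x; with just 10% serial work they give only about 3.08x, and the gap widens as cores are added.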
#4
veek wrote:
[quoted text snipped]
https://www.techpowerup.com/231183/a...-at-reddit-ama

    "This is likely also when the company rolls out "Raven Ridge" initially
     as mobile Ryzen products (BGA packages, which will likely also be used
     in AIOs), and later as desktop socket AM4 parts."

That's a 4C 8T Ryzen with a Vega APU. The first ones will be BGA packages, and they will be soldered to laptop motherboards. Later, the same processors will be put in AM4 packages for use in retail motherboards.

There's a possibility that chip will fit in a motherboard like this one, which has video connectors. If Raven Ridge comes out in AM4, then you can use the graphics connectors on this motherboard.

    "BIOSTAR X370GTN AM4"
    https://www.newegg.com/Product/Produ...82E16813138452
    http://www.hardwarecanucks.com/forum...rd-review.html

    "However, this motherboard also supports AMD's new seventh generation
     Bristol Ridge APUs and it will surely also support the upcoming
     Zen-based Raven Ridge APUs."   <--- not verified

    "If you do install an APU, your video output choices will be limited
     to DVI-D or HDMI 1.4."

So don't commit to an upgrade too quickly just yet. AMD still has more stuff to "dribble" out. Who knows, maybe Intel will have to adjust their pricing a tiny bit.

Paul
#5
Paul wrote:
[quoted text snipped]
I found another article on the plans for the chip.

http://wccftech.com/amd-raven-ridge-...m-glofo-amkor/

It's amazing they're using a silicon interposer for a lower-end design. And the APU will have "local memory" in the form of an HBM stack.

Paul
#6
Paul wrote:
[quoted text snipped]
ah - yep, I can wait. I'd planned on waiting till month-end anyhow, pending the Kaby Lake release to stores. I think the real time frame is "until someone kicks me into spending money".