#11
Skybuck Flying wrote:
> From what I can tell from this text it goes something like this:
> Pentium 4's can split instructions into multiple little instructions
> and execute them at the same time.

no. the Pentium 4 splits those instructions into micro-ops that flow serially through one pipeline. but there are 3 pipelines.

> I know Pentiums in general can even pipeline instructions... heck, an
> 80486 can do that !

no. the 80486 was a static processor. it did not decode instructions into micro-ops nor use a RISC-style architecture to execute them.

> Now you're saying... a gpu can do many pixel/color operations... like
> add/div/multiply whatever at the same time.

several unrelated operations across the screen. it does a *LOT* of work in one cycle because it's processing so many different pixels simultaneously.

> ( I believe this is what you or generally is called 'stages' ) And it
> can also pipeline these 'pixel instructions'.

each pipeline is dedicated to a single pixel. so in a sense, yes.

> And apparently the number of instructions that can be done in parallel
> or be pipelined matters for performance.

very much so. longer pipelines tend to greatly hurt performance on CPUs, because CPUs depend heavily these days on their caches and branch prediction units. when branch prediction fails, there is a pipeline stall and 22 cycles are lost (on the P4, in some cases). on a GPU, pipeline stalls like that don't occur: each pipeline is dedicated to a pixel and no branch prediction must be done, so no pipeline stalls are possible. on the other hand, more pipelines greatly increase performance on GPUs. on CPUs, 3 pipelines seems to be the "sweet spot": any more and there will be too many pipeline stalls; any fewer and the pipelines become too crowded. on CPUs, instructions can depend on instructions that may be in another pipeline, which is another source of pipeline stalls. on GPUs, there is no dependence between pipelines; each pipeline is independent of the others.

> So not much difference is there ?! Pentium's have pipelines and
> parallel instructions and gpu's have pipelines and parallel
> instructions. Where is the difference ?!

more than you think. a LOT more. Pentium 4s are limited in the degree to which they can execute parallel operations. GPUs are, for all intents and purposes, *UNLIMITED* in the degree to which they can be parallelized.

--
Charles Banas
#12
Well,

I just spent some time reading about the differences and similarities between CPU's and GPU's and how they work together =D

Performing graphics requires assloads of bandwidth ! Like many GByte/sec. Current CPU's can only do 2 GB/sec, which is too slow.

GPU's are becoming more generic. GPU's can store data inside their own memory. GPU's in the future can also have conditional jumps/program flow control, etc. So GPU's are starting to look more and more like CPU's.

It seems like Intel and AMD are falling a little bit behind when it comes to bandwidth to main memory. Maybe that's a result of the memory wars... like VRAM, DRAM, DDR, DDR-II, Rambit(?) and god knows what =D

Though Intel and AMD probably have a little more experience with making generic CPU's... or maybe lots of people have left and joined the GPU makers, lol. Or maybe the AMD and Intel people are getting old and are going to retire soon - brain drain.

However, AMD and Intel have always done their best to keep things COMPATIBLE... and that is where ATI and NVidia seem to fail horribly. My TNT2 was only 5 years old and it now can't play some games, lol =D There is something called Riva Tuner... it has NVxx emulation... maybe that works with my TNT2... I haven't tried it yet.

The greatest asset of GPU's is probably that they deliver a whole graphical architecture with them... though OpenGL and DirectX have that as well... the GPU material explains how to do vertex and pixel shading and all the other stuff around that level.

Though games still have to make sure to reduce the number of triangles that need to be drawn... with BSP's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with CPU's, since GPU's don't support/have them.

So my estimate would be:

1024x768x4 bytes * 70 Hz = 220.200.960 bytes per second = exactly 210 MB/sec

So as long as a programmer can simply draw to a frame buffer and have it flipped to the graphics card, this will work out just nicely... So far no need for XX GB/sec.

Of course the triangles still have to be drawn.... Take a beast like Doom III. How many triangles does it have at any given time... Thanks to BSP's, (possibly portal engines), view frustum clipping, etc., Doom III will only need to draw maybe 4000 to maybe 10000 triangles at any given time. ( It could be more... I'll find out how many triangles later. )

But I am beginning to see where the problem is. Suppose a player is 'zoomed' in or standing close to a wall... then it doesn't really matter how many triangles have to be drawn.... Even if only 2 triangles have to be drawn, the problem is as follows: all the pixels inside the triangles have to be interpolated... and apparently even interpolated pixels have to be shaded etc... Which makes me wonder if these shading calculations can be interpolated... maybe that would be faster. But that's probably not possible, otherwise it would already exist ?! Or somebody has to come up with a smart way to interpolate the shading etc. for the pixels.

So now the problem is that 1024x768 pixels have to be shaded... = 786.432 pixels ! That's a lot of pixels to shade !

There are only 2 normals needed, I think... for each triangle... and maybe with some smart code each pixel can have its own normal... or maybe each pixel needs its own normal... how does bump mapping work at this point ? In any case, let's assume the code has to work with 24 bytes for a normal (x,y,z in 64-bit floating point). The color is also r,g,b,a in 64-bit floating point: another 32 bytes for color. Maybe some other color has to be mixed in; I'll give it another 32 bytes... Well, maybe some other things too, so let's round it to 100 bytes per pixel:

786.432 pixels * 100 bytes = exactly 75 MB per frame * 70 Hz = 5250 MB/sec

So that's roughly 5.1 GB/sec that has to move through any processor just to do my insane lighting per pixel. Of course Doom III or my insane game... uses a million fricking vertices (3d points) plus some more stuff: vertex x,y,z, vertex normal x,y,z, vertex color r,g,b,a. So let's say another insane 100 bytes per vertex:

1 million vertices * 100 bytes * 70 Hz = 7.000.000.000

Which is roughly another 7 GB/sec for rotating, translating, storing the vertices, etc. So that's a lot of data moving through any processor/memory !

I still think that if AMD or Intel is smart... they will increase the bandwidth to main memory... so it reaches the terabyte age. And I think these graphics cards will stop existing, just like Windows graphics accelerator cards stopped existing... And then things will be back to normal =D Just do everything via software on a generic processor - much easier, I hope =D

Bye, bye,
Skybuck.
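The estimates above check out arithmetically. Here is the same arithmetic written out, under the post's own assumptions (1024x768 @ 70 Hz, a rounded ~100 bytes of working data per pixel and per vertex, 1 million vertices):

```python
# Redoing the bandwidth estimates from the post above.

MB = 1024 * 1024

# Plain frame flipping: 4 bytes per pixel, 70 frames per second.
framebuffer = 1024 * 768 * 4 * 70
print(framebuffer / MB)                      # 210.0 MB/sec

# Per-pixel shading traffic at an assumed 100 bytes per pixel.
pixels = 1024 * 768                          # 786432 pixels to shade
per_frame = pixels * 100
print(per_frame / MB, per_frame * 70 / MB)   # 75.0 MB/frame, 5250.0 MB/sec

# Vertex traffic at an assumed 100 bytes per vertex.
vertex_bytes = 1_000_000 * 100 * 70
print(vertex_bytes / 1e9)                    # 7.0 GB/sec (decimal GB)
```

Note the mixed units: 5250 MB/sec is about 5.1 GB/sec only when dividing by 1024, while the 7 GB/sec figure uses decimal billions.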
#13
> Though games still have to make sure to reduce the number of triangles
> that need to be drawn... with bsp's, view frustum clipping, backface
> culling, portal engines, and other things. Those can still be done
> fastest with cpu's, since gpu's dont support/have it.

That sort of design would slow down the GPU-based rendering pipeline. It is better to render primitives in large batches than to break the rendering into lots of smaller batches, even if a smaller number of primitives in total is being rendered.

Frustum clipping with the CPU? Never. The ideal is that the data is in the GPU's memory and is rendered from there directly; if you clip with the CPU, you will have to transfer the data to the GPU, which is a major slowdown. If you clip primitives, you either render them individually (completely braindead!) or you fill the data into a Vertex Buffer (Object) (and optionally an Index Buffer, or at least an index list for glDrawElements()).

Sometimes the vertex data must be synthesized and there is no feasible way to do the synthesis in a vertex program. Then, with DirectX 9, a very good way is to lock a vertex buffer with the NOOVERWRITE flag (this tells the M$ API that you won't overwrite vertices which might be being processed at the time, so it leaves the GPU free to do what it wants while you write into the buffer; A Very Good Thing). Then fill the buffer, burning CPU time... and memory bandwidth. When done, you unlock and then render primitives dereferencing the vertices in the region of the buffer you just filled. But it is still Order Of blablabla... faster to fill from static buffers.

The next generation of shaders will enable sampling from textures in the vertex program (thinking of VS 3.0 and hardware that supports this profile). This means floating-point data can be stored in textures and fetched dynamically from there. Bones for skinning can be stored in a texture. Height values for displacement mapping can be stored in textures. Or anything the imagination can think of... this will allow a new level of programmability on the GPU. This will also mean a slower pipeline, but, fear not, it won't come close to the levels of a CPU-based geometry pipeline... the base level of performance with the latest ATI and NVIDIA cards has increased a lot (though only NVIDIA for now can do VS 3.0).

Back to the topic at hand: clipping individual primitives just to "render less" is a BIG no-no. BSP trees are also a big no-no, because their job was to sort primitives "quickly" from a changing point of view. Sorting means broken batching. Broken batching means ****ing slow rendering (relatively speaking). It is one of the Big No's in realtime GPU graphics.

What is A Good Thing is to cull groups of primitives at a time. Say you have an 'object'; give it a bounding box or bounding sphere. If you can determine the box or sphere is not visible, then you don't have to render the whole 'object' at all. This is a Good Thing: no setup for the primitive 'collections', no setup for the transformations, etc. No primitives submitted to the geometry pipeline. Et cetera. A good thing.

Similarly, clipping to portals is redundant computation on the CPU. If clipping MUST be done, stenciling can work... but it consumes fillrate: you have to fill the portal once into the stencil buffer (the stencil refvalue can be increased every time a new portal is rendered so the stencil buffer doesn't need to be cleared... unless stenciling is used for something else, which is obviously an architectural problem and should be resolved depending on the engine and/or rendering goals, effects... blabla... the rule of thumb: the more generic the rendering system is going to be, the more performance it will eat to cover all contingencies).

The lesson: Keep Them Batches Large. Saving in the wrong place can hurt performance A LOT. 160 fps or 5 fps? The difference can be a ****up the rendering engine does. But when you know How **** Works, such mistakes are made less frequently; and when you know what the hardware can do but won't, at least you know there is a problem and can try to fix it.
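The "Keep Them Batches Large" point can be sketched with a toy cost model. The numbers here are made up for illustration (a flat per-batch submission overhead and a tiny per-triangle cost); the point is only that per-batch overhead dominates once batches get small:

```python
# Toy cost model for batched rendering. The overhead and per-triangle
# costs are assumed, illustrative figures, not measurements.

def frame_cost_us(total_triangles, batch_size,
                  per_batch_overhead_us=20.0, per_triangle_us=0.01):
    """Total submission cost: fixed overhead per batch plus a small
    per-triangle cost."""
    batches = -(-total_triangles // batch_size)   # ceiling division
    return batches * per_batch_overhead_us + total_triangles * per_triangle_us

# 100k triangles in 10 big batches vs 2000 tiny "pre-clipped" batches:
print(frame_cost_us(100_000, 10_000))   # 10 batches:   1200.0 us
print(frame_cost_us(100_000, 50))       # 2000 batches: 41000.0 us
```

Under this model, fine-grained CPU clipping that fragments the draw stream costs far more in batch overhead than it saves in triangles, which is the argument being made.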
#14
Hmmm, I was wrong about the number of vertices and triangles Doom 3 uses...

The Doom 3 alpha 0.02 uses about 130.000 vertices and 70.000 triangles during the Mars base levels. All 6 monsters together can mount up to 140.000 triangles !

Skybuck.
#15
"joe smith" wrote in message ... Though games still have to make sure to reduce the number of triangles that need to be drawn... with bsp's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with cpu's, since gpu's dont support/have it. That sort of design would slow down the GPU based rendering pipeline. It is better render primitives in large batches than break the rendering into lot of smaller batches even if there is smaller number of primitives in total being rendered. Frustum clipping with CPU? Never. The ideal is that the data is in GPU's memory and rendering from there directly, if you clip with CPU it means you will have to transfer the data into the GPU which is a major slowdown. If you clip primitives you either render them individually (completely braindead!) or you fill up the data into Vertex Buffer (Object) (and optionally Index Buffer, or index list for glDrawElements() atleast). Well Something has to determine what is visible and what is not... Something has to determine what is in 'view' and what is not... I know quake and other games used the cpu to do that, and bsp's and euhm backface culling and god knows what =D Now I believe you're saying that the GPU should be left to do that ? How does the GPU do that ? Skybuck. |
#16
Skybuck Flying wrote:
> Well I just spent some time reading about the differences and
> similarities between CPU's and GPU's and how they work together =D
> Performing graphics requires assloads of bandwidth ! Like many
> GByte/sec. Current CPU's can only do 2 GB/sec which is too slow.
> GPU's are becoming more generic. GPU's could store data inside their
> own memory.

What exactly do you think they do with the 100+ meg they already _have_? You think it's just sitting there to look pretty?

> GPU's in the future can also have conditional jumps/program flow
> control, etc.

What leads you to believe that they don't already do this?

> So GPU's are starting to look more and more like CPU's.

A GPU is a special-purpose CPU optimized for graphics use. There's no reason why you couldn't port Linux to a Radeon or a GeForce FX except the difficulty of connecting a disk to the thing.

> It seems like Intel and AMD are a little bit falling behind when it
> comes to bandwidth with the main memory. Maybe that's a result of
> memory wars like VRAM, DRAM, DDR, DDR-II, Rambit(?) and god knows
> what =D

No, the "memory wars" are an attempt to get reasonably priced fast memory.

> [snip speculation about brain drain, compatibility, and the TNT2]
>
> The greatest asset of GPU's is probably that they deliver a whole
> graphical architecture with them... though OpenGL and DirectX have
> that as well... the GPU material explains how to do vertex and pixel
> shading and all the other stuff around that level.

OpenGL and DirectX are feature-set standards. At one time GPUs were designed in the absence of standards; that has changed. The current consumer boards are optimized around the DirectX standard and the workstation boards around OpenGL. However, this is done with firmware: the GPUs are the same and can be microcoded for either.

> Though games still have to make sure to reduce the number of triangles
> that need to be drawn... with bsp's, view frustum clipping, backface
> culling, portal engines, and other things. Those can still be done
> fastest with cpu's, since gpu's dont support/have it.

And you base this assessment on what information?

> So my estimate would be:
>
> 1024x768x4 bytes * 70 Hz = 220.200.960 bytes per second = exactly
> 210 MB/sec
>
> So as long as a programmer can simply draw to a frame buffer and have
> it flipped to the graphics card this will work out just nicely...

If the programmer can do all the necessary calculations. If you're talking about a P4-3200, that would mean it would have to do every pixel in 64 calculations or less.

> [snip the Doom III triangle counts and the per-pixel shading and
> vertex bandwidth estimates]
>
> I still think that if AMD or Intel is smart... they will increase the
> bandwidth to main memory... so it reaches the terabyte age

Eventually that will happen. By that time the feature set of video processors will likely be very thoroughly standardized, and they'll be able to handle any image at a few thousand frames a second and cost 50 cents.

> And I think these graphics cards will stop existing just like Windows
> graphics accelerator cards stopped existing...

Huh? What is a Radeon or a GeForce FX if not a "Windows graphics accelerator card"? They're designed specifically to accelerate DirectX, which, in case you haven't checked recently, you will find to be a part of Windows.

> And then things will be back to normal =D Just do everything via
> software on a generic processor - much easier I hope =D

Nope. A lot easier to tell the GPU "draw me a sphere at thus-and-so coordinates" than it is to do all the calculations yourself.

> Bye, bye,
> Skybuck.

--
--John
Reply to jclarke at ae tee tee global dot net
(was jclarke at eye bee em dot net)
#17
Skybuck Flying wrote:
"joe smith" wrote in message ... Though games still have to make sure to reduce the number of triangles that need to be drawn... with bsp's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with cpu's, since gpu's dont support/have it. That sort of design would slow down the GPU based rendering pipeline. It is better render primitives in large batches than break the rendering into lot of smaller batches even if there is smaller number of primitives in total being rendered. Frustum clipping with CPU? Never. The ideal is that the data is in GPU's memory and rendering from there directly, if you clip with CPU it means you will have to transfer the data into the GPU which is a major slowdown. If you clip primitives you either render them individually (completely braindead!) or you fill up the data into Vertex Buffer (Object) (and optionally Index Buffer, or index list for glDrawElements() atleast). Well Something has to determine what is visible and what is not... Something has to determine what is in 'view' and what is not... I know quake and other games used the cpu to do that, and bsp's and euhm backface culling and god knows what =D Now I believe you're saying that the GPU should be left to do that ? They do now... those options in the ATI drivers ain't there for nothing. How does the GPU do that ? Magick Skybuck. |
#18
"Minotaur" wrote in message ... Skybuck Flying wrote: "joe smith" wrote in message ... Though games still have to make sure to reduce the number of triangles that need to be drawn... with bsp's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with cpu's, since gpu's dont support/have it. That sort of design would slow down the GPU based rendering pipeline. It is better render primitives in large batches than break the rendering into lot of smaller batches even if there is smaller number of primitives in total being rendered. Frustum clipping with CPU? Never. The ideal is that the data is in GPU's memory and rendering from there directly, if you clip with CPU it means you will have to transfer the data into the GPU which is a major slowdown. If you clip primitives you either render them individually (completely braindead!) or you fill up the data into Vertex Buffer (Object) (and optionally Index Buffer, or index list for glDrawElements() atleast). Well Something has to determine what is visible and what is not... Something has to determine what is in 'view' and what is not... I know quake and other games used the cpu to do that, and bsp's and euhm backface culling and god knows what =D Now I believe you're saying that the GPU should be left to do that ? They do now... those options in the ATI drivers ain't there for nothing. How does the GPU do that ? Magick Well such an answer ofcourse won't do for any serieus developer. The developer has to know how fast a card does this. So he can decide to use the card (gpu) or do it on the cpu :P Skybuck. |
#19
> Well, something has to determine what is visible and what is not...
> Something has to determine what is in 'view' and what is not... I know
> Quake and other games used the CPU to do that, with BSP's and euhm
> backface culling and god knows what =D

BSP trees are used for getting a perfectly sorted, zero-overlap set of primitives for the renderer. The GPU does not need that; in fact, performance is seriously hurt by such "optimization"... like I explained, sorting is a Big No. I must check: do you know what a BSP tree is and what it is commonly used for? And what the Quake engine uses it for, specifically? Or are you just throwing buzzwords around?

> Now I believe you're saying that the GPU should be left to do that ?

You believe wrong. I am saying you cull away large batches at a time. Whole objects and so on. Not individual primitives. That is da secret 2 sp3eD.

> How does the GPU do that ?

There is functionality in the GPU which can query how many pixels pass the visibility test... but this is a poor approach because it has a lot of latency. Doing visibility computation on the CPU is quite cheap. Depending on the system being implemented, of course. Got something specific in mind?
#20
> So he can decide to use the card (gpu) or do it on the cpu :P

That is work the CPU is better at. Using the GPU to determine if there are potentially visible pixels requires 1x overdraw fill cost for the bounding volume being rasterized. Then we might get results back that yes, the object is visible... so now you have the luxury of filling most of those pixels *again*; or, you filled all those pixels just to find out that the object isn't visible. Doesn't sound very attractive to me... does it sound very attractive to you? Didn't think so.

For the CPU, culling a bounding volume such as a box or sphere against a volume defined by planes is very cheap. The most commonly used volume is the frustum. Occluders can also be implemented using the same mathematics; just which primitives make the best occluders is an interesting problem. Generally the best approach is to recognize in a preprocess which surfaces are "large" in a contributing way... the level design and gameplay mechanics determine this to a large degree. With a decent portal rendering system it is a moot point. There are a lot of solutions to the visibility problem, more and less generic. I don't think that you are genuinely interested in any of this, though..
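The cheap CPU-side culling described above can be sketched in a few lines. This is a minimal illustration, not engine code: the frustum is given as inward-facing planes (a, b, c, d) with unit normals, where ax + by + cz + d >= 0 means "inside", and a sphere entirely behind any one plane means the whole object can be skipped without submitting a single primitive.

```python
# Minimal sketch of bounding-sphere vs. plane-volume culling.

def sphere_visible(center, radius, planes):
    """Conservative test: False only if the sphere is completely
    outside some plane; True means 'potentially visible'."""
    cx, cy, cz = center
    for a, b, c, d in planes:
        if a * cx + b * cy + c * cz + d < -radius:
            return False        # completely behind this plane: cull
    return True

# A toy "frustum": the unit cube, as six inward-facing planes
# (x >= -1, x <= 1, y >= -1, y <= 1, z >= -1, z <= 1).
planes = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1),
          (0, -1, 0, 1), (0, 0, 1, 1), (0, 0, -1, 1)]

print(sphere_visible((0, 0, 0), 0.5, planes))   # True: inside
print(sphere_visible((5, 0, 0), 0.5, planes))   # False: cull the object
```

Six dot products per object per frame is why this test is so much cheaper than any GPU-side pixel query, and it needs no readback latency.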