#11
Skybuck Flying wrote:
> From what I can tell from this text it goes something like this:
> Pentium 4's can split instructions into multiple little instructions
> and execute them at the same time.

no. the Pentium 4 splits those instructions into micro-ops that flow serially through one pipeline. but there are 3 pipelines.

> I know Pentiums in general can even pipeline instructions... heck, an
> 80486 can do that !

no. the 80486 was a static processor. it did not decode instructions into micro-ops nor use a RISC-style architecture to execute them.

> Now you're saying... a gpu can do many pixel/color operations... like
> add/div/multiply whatever at the same time.

several unrelated operations across the screen. it does a *LOT* of work in one cycle because it's processing so many different pixels simultaneously.

> ( I believe this is what you or generally is called 'stages' ) And it
> can also pipeline these 'pixel instructions'.

each pipeline is dedicated to a single pixel. so in a sense, yes.

> And apparently the number of instructions that can be done in parallel
> or be pipelined matters for performance.

very much so. longer pipelines tend to greatly hurt performance on CPUs, because CPUs depend heavily these days on their caches and branch prediction units. when branch prediction fails, there is a pipeline stall and 22 cycles are lost (on the P4, in some cases). on a GPU, pipeline stalls like that don't occur: each pipeline is dedicated to a pixel and no branch prediction must be done, so no pipeline stalls are possible. on the other hand, more pipelines greatly increase performance on GPUs. on CPUs, 3 pipelines seems to be the "sweet spot": any more and there will be too many pipeline stalls; any fewer and the pipelines become too crowded. on CPUs, instructions can depend on instructions that may be in another pipeline, which is another source of pipeline stalls. on GPUs, there is no dependence between pipelines; each pipeline is independent of the others.

> So not much difference is there ?! Pentium's have pipelines and
> parallel instructions and gpu's have pipelines and parallel
> instructions. Where is the difference ?!

more than you think. a LOT more. Pentium 4s are limited in the degree to which they can execute parallel operations. GPUs are, for all intents and purposes, *UNLIMITED* in the degree to which they can be parallelized.

--
Charles Banas
#12
Well,

I just spent some time reading about the differences and similarities between CPU's and GPU's and how they work together =D

Performing graphics requires assloads of bandwidth ! Like many GByte/sec. Current CPU's can only do 2 GB/sec, which is too slow.

GPU's are becoming more generic. GPU's can store data inside their own memory. GPU's in the future can also have conditional jumps/program flow control, etc. So GPU's are starting to look more and more like CPU's.

It seems like Intel and AMD are falling a little bit behind when it comes to bandwidth to main memory. Maybe that's a result of the memory wars... like VRAM, DRAM, DDR, DDR-II, Rambit(?) and god knows what =D

Though Intel and AMD probably have a little more experience with making generic CPU's... or maybe lots of people have left and joined the GPU makers, lol. Or maybe the AMD and Intel people are getting old and are going to retire soon - brain drain.

However, AMD and Intel have always done their best to keep things COMPATIBLE... and that is where ATI and NVidia seem to fail horribly. My TNT2 was only 5 years old and it now can't play some games, lol =D There is something called Riva Tuner... it has NVxx emulation... maybe that works with my TNT2... I haven't tried it yet.

The greatest asset of GPU's is probably that they deliver a whole graphical architecture with them... though OpenGL and DirectX have that as well... the GPU material explains how to do vertex and pixel shading and all the other stuff around that level.

Though games still have to make sure to reduce the number of triangles that need to be drawn... with BSP's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with CPU's, since GPU's don't support/have them.

So my estimate would be:

1024x768x4 bytes * 70 Hz = 220.200.960 bytes per second = exactly 210 MB/sec

So as long as a programmer can simply draw to a frame buffer and have it flipped to the graphics card, this will work out just nicely... So far no need for XX GB/sec.

Of course the triangles still have to be drawn.... Take a beast like Doom III. How many triangles does it have at any given time... Thanks to BSP's, (possibly portal engines), view frustum clipping, etc., Doom III will only need to draw maybe 4000 to maybe 10000 triangles at any given time. ( It could be more... I'll find out how many triangles later. )

But I am beginning to see where the problem is. Suppose a player is 'zoomed' in or standing close to a wall... then it doesn't really matter how many triangles have to be drawn.... Even if only 2 triangles have to be drawn, the problem is as follows: all the pixels inside the triangles have to be interpolated... and apparently even interpolated pixels have to be shaded etc... Which makes me wonder if these shading calculations can be interpolated... maybe that would be faster. But that's probably not possible, otherwise it would already exist ?! Or somebody has to come up with a smart way to interpolate the shading etc. for the pixels.

So now the problem is that 1024x768 pixels have to be shaded... = 786.432 pixels ! That's a lot of pixels to shade !

There are only 2 normals needed, I think... for each triangle... and maybe with some smart code each pixel can have its own normal... or maybe each pixel needs its own normal... how does bump mapping work at this point ? In any case, let's assume the code has to work with 24 bytes for a normal (x,y,z in 64-bit floating point). The color is also r,g,b,a in 64-bit floating point: another 32 bytes for color. Maybe some other color has to be mixed in; I'll give it another 32 bytes... Well, maybe some other things too, so let's round it to 100 bytes per pixel:

786.432 pixels * 100 bytes = exactly 75 MB per frame * 70 Hz = 5250 MB/sec

So that's roughly 5.1 GB/sec that has to move through any processor just to do my insane lighting per pixel. Of course Doom III or my insane game... uses a million fricking vertices (3d points) plus some more stuff: vertex x,y,z, vertex normal x,y,z, vertex color r,g,b,a. So let's say another insane 100 bytes per vertex:

1 million vertices * 100 bytes * 70 Hz = 7.000.000.000

Which is roughly another 7 GB/sec for rotating, translating, storing the vertices, etc. So that's a lot of data moving through any processor/memory !

I still think that if AMD or Intel is smart... they will increase the bandwidth to main memory... so it reaches the terabyte age. And I think these graphics cards will stop existing, just like Windows graphics accelerator cards stopped existing... And then things will be back to normal =D Just do everything via software on a generic processor - much easier, I hope =D

Bye, bye,
Skybuck.
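The estimates above check out arithmetically. Here is the same arithmetic written out, under the post's own assumptions (1024x768 @ 70 Hz, a rounded ~100 bytes of working data per pixel and per vertex, 1 million vertices):

```python
# Redoing the bandwidth estimates from the post above.

MB = 1024 * 1024

# Plain frame flipping: 4 bytes per pixel, 70 frames per second.
framebuffer = 1024 * 768 * 4 * 70
print(framebuffer / MB)                      # 210.0 MB/sec

# Per-pixel shading traffic at an assumed 100 bytes per pixel.
pixels = 1024 * 768                          # 786432 pixels to shade
per_frame = pixels * 100
print(per_frame / MB, per_frame * 70 / MB)   # 75.0 MB/frame, 5250.0 MB/sec

# Vertex traffic at an assumed 100 bytes per vertex.
vertex_bytes = 1_000_000 * 100 * 70
print(vertex_bytes / 1e9)                    # 7.0 GB/sec (decimal GB)
```

Note the mixed units: 5250 MB/sec is about 5.1 GB/sec only when dividing by 1024, while the 7 GB/sec figure uses decimal billions.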
#13
> Though games still have to make sure to reduce the number of triangles
> that need to be drawn... with bsp's, view frustum clipping, backface
> culling, portal engines, and other things. Those can still be done
> fastest with cpu's, since gpu's dont support/have it.

That sort of design would slow down the GPU-based rendering pipeline. It is better to render primitives in large batches than to break the rendering into lots of smaller batches, even if a smaller number of primitives in total is being rendered.

Frustum clipping with the CPU? Never. The ideal is that the data is in the GPU's memory and is rendered from there directly; if you clip with the CPU, you will have to transfer the data to the GPU, which is a major slowdown. If you clip primitives, you either render them individually (completely braindead!) or you fill the data into a Vertex Buffer (Object) (and optionally an Index Buffer, or at least an index list for glDrawElements()).

Sometimes the vertex data must be synthesized and there is no feasible way to do the synthesis in a vertex program. Then, with DirectX 9, a very good way is to lock a vertex buffer with the NOOVERWRITE flag (this tells the M$ API that you won't overwrite vertices which might be being processed at the time, so it leaves the GPU free to do what it wants while you write into the buffer; A Very Good Thing). Then fill the buffer, burning CPU time... and memory bandwidth. When done, you unlock and then render primitives dereferencing the vertices in the region of the buffer you just filled. But it is still Order Of blablabla... faster to fill from static buffers.

The next generation of shaders will enable sampling from textures in the vertex program (thinking of VS 3.0 and hardware that supports this profile). This means floating-point data can be stored in textures and fetched dynamically from there. Bones for skinning can be stored in a texture. Height values for displacement mapping can be stored in textures. Or anything the imagination can think of... this will allow a new level of programmability on the GPU. This will also mean a slower pipeline, but, fear not, it won't come close to the levels of a CPU-based geometry pipeline... the base level of performance with the latest ATI and NVIDIA cards has increased a lot (though only NVIDIA for now can do VS 3.0).

Back to the topic at hand: clipping individual primitives just to "render less" is a BIG no-no. BSP trees are also a big no-no, because their job was to sort primitives "quickly" from a changing point of view. Sorting means broken batching. Broken batching means ****ing slow rendering (relatively speaking). It is one of the Big No's in realtime GPU graphics.

What is A Good Thing is to cull groups of primitives at a time. Say you have an 'object'; give it a bounding box or bounding sphere. If you can determine the box or sphere is not visible, then you don't have to render the whole 'object' at all. This is a Good Thing: no setup for the primitive 'collections', no setup for the transformations, etc. No primitives submitted to the geometry pipeline. Et cetera. A good thing.

Similarly, clipping to portals is redundant computation on the CPU. If clipping MUST be done, stenciling can work... but it consumes fillrate: you have to fill the portal once into the stencil buffer (the stencil refvalue can be increased every time a new portal is rendered so the stencil buffer doesn't need to be cleared... unless stenciling is used for something else, which is obviously an architectural problem and should be resolved depending on the engine and/or rendering goals, effects... blabla... the rule of thumb: the more generic the rendering system is going to be, the more performance it will eat to cover all contingencies).

The lesson: Keep Them Batches Large. Saving in the wrong place can hurt performance A LOT. 160 fps or 5 fps? The difference can be a ****up the rendering engine does. But when you know How **** Works, such mistakes are made less frequently; and when you know what the hardware can do but won't, at least you know there is a problem and can try to fix it.
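The "Keep Them Batches Large" point can be sketched with a toy cost model. The numbers here are made up for illustration (a flat per-batch submission overhead and a tiny per-triangle cost); the point is only that per-batch overhead dominates once batches get small:

```python
# Toy cost model for batched rendering. The overhead and per-triangle
# costs are assumed, illustrative figures, not measurements.

def frame_cost_us(total_triangles, batch_size,
                  per_batch_overhead_us=20.0, per_triangle_us=0.01):
    """Total submission cost: fixed overhead per batch plus a small
    per-triangle cost."""
    batches = -(-total_triangles // batch_size)   # ceiling division
    return batches * per_batch_overhead_us + total_triangles * per_triangle_us

# 100k triangles in 10 big batches vs 2000 tiny "pre-clipped" batches:
print(frame_cost_us(100_000, 10_000))   # 10 batches:   1200.0 us
print(frame_cost_us(100_000, 50))       # 2000 batches: 41000.0 us
```

Under this model, fine-grained CPU clipping that fragments the draw stream costs far more in batch overhead than it saves in triangles, which is the argument being made.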
#14
Hmmm, I was wrong about the number of vertices and triangles Doom 3 uses...

The Doom 3 alpha 0.02 uses about 130.000 vertices and 70.000 triangles during the Mars base levels. All 6 monsters together can mount up to 140.000 triangles !

Skybuck.
#15
"joe smith" wrote in message ... Though games still have to make sure to reduce the number of triangles that need to be drawn... with bsp's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with cpu's, since gpu's dont support/have it. That sort of design would slow down the GPU based rendering pipeline. It is better render primitives in large batches than break the rendering into lot of smaller batches even if there is smaller number of primitives in total being rendered. Frustum clipping with CPU? Never. The ideal is that the data is in GPU's memory and rendering from there directly, if you clip with CPU it means you will have to transfer the data into the GPU which is a major slowdown. If you clip primitives you either render them individually (completely braindead!) or you fill up the data into Vertex Buffer (Object) (and optionally Index Buffer, or index list for glDrawElements() atleast). Well Something has to determine what is visible and what is not... Something has to determine what is in 'view' and what is not... I know quake and other games used the cpu to do that, and bsp's and euhm backface culling and god knows what =D Now I believe you're saying that the GPU should be left to do that ? How does the GPU do that ? Skybuck. |
#16
Skybuck Flying wrote:
> Well I just spent some time reading about the differences and
> similarities between CPU's and GPU's and how they work together =D
> Performing graphics requires assloads of bandwidth ! Like many
> GByte/sec. Current CPU's can only do 2 GB/sec which is too slow.
> GPU's are becoming more generic. GPU's could store data inside their
> own memory.

What exactly do you think they do with the 100+ meg they already _have_? You think it's just sitting there to look pretty?

> GPU's in the future can also have conditional jumps/program flow
> control, etc.

What leads you to believe that they don't already do this?

> So GPU's are starting to look more and more like CPU's.

A GPU is a special-purpose CPU optimized for graphics use. There's no reason why you couldn't port Linux to a Radeon or a GeForce FX except the difficulty of connecting a disk to the thing.

> It seems like Intel and AMD are a little bit falling behind when it
> comes to bandwidth with the main memory. Maybe that's a result of
> memory wars like VRAM, DRAM, DDR, DDR-II, Rambit(?) and god knows
> what =D

No, the "memory wars" are an attempt to get reasonably priced fast memory.

> [snip speculation about brain drain, compatibility, and the TNT2]
>
> The greatest asset of GPU's is probably that they deliver a whole
> graphical architecture with them... though OpenGL and DirectX have
> that as well... the GPU material explains how to do vertex and pixel
> shading and all the other stuff around that level.

OpenGL and DirectX are feature-set standards. At one time GPUs were designed in the absence of standards; that has changed. The current consumer boards are optimized around the DirectX standard and the workstation boards around OpenGL. However, this is done with firmware: the GPUs are the same and can be microcoded for either.

> Though games still have to make sure to reduce the number of triangles
> that need to be drawn... with bsp's, view frustum clipping, backface
> culling, portal engines, and other things. Those can still be done
> fastest with cpu's, since gpu's dont support/have it.

And you base this assessment on what information?

> So my estimate would be:
>
> 1024x768x4 bytes * 70 Hz = 220.200.960 bytes per second = exactly
> 210 MB/sec
>
> So as long as a programmer can simply draw to a frame buffer and have
> it flipped to the graphics card this will work out just nicely...

If the programmer can do all the necessary calculations. If you're talking about a P4-3200, that would mean it would have to do every pixel in 64 calculations or less.

> [snip the Doom III triangle counts and the per-pixel shading and
> vertex bandwidth estimates]
>
> I still think that if AMD or Intel is smart... they will increase the
> bandwidth to main memory... so it reaches the terabyte age

Eventually that will happen. By that time the feature set of video processors will likely be very thoroughly standardized, and they'll be able to handle any image at a few thousand frames a second and cost 50 cents.

> And I think these graphics cards will stop existing just like Windows
> graphics accelerator cards stopped existing...

Huh? What is a Radeon or a GeForce FX if not a "Windows graphics accelerator card"? They're designed specifically to accelerate DirectX, which, in case you haven't checked recently, you will find to be a part of Windows.

> And then things will be back to normal =D Just do everything via
> software on a generic processor - much easier I hope =D

Nope. A lot easier to tell the GPU "draw me a sphere at thus-and-so coordinates" than it is to do all the calculations yourself.

> Bye, bye,
> Skybuck.

--
--John
Reply to jclarke at ae tee tee global dot net
(was jclarke at eye bee em dot net)
#17
Skybuck Flying wrote:
"joe smith" wrote in message ... Though games still have to make sure to reduce the number of triangles that need to be drawn... with bsp's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with cpu's, since gpu's dont support/have it. That sort of design would slow down the GPU based rendering pipeline. It is better render primitives in large batches than break the rendering into lot of smaller batches even if there is smaller number of primitives in total being rendered. Frustum clipping with CPU? Never. The ideal is that the data is in GPU's memory and rendering from there directly, if you clip with CPU it means you will have to transfer the data into the GPU which is a major slowdown. If you clip primitives you either render them individually (completely braindead!) or you fill up the data into Vertex Buffer (Object) (and optionally Index Buffer, or index list for glDrawElements() atleast). Well Something has to determine what is visible and what is not... Something has to determine what is in 'view' and what is not... I know quake and other games used the cpu to do that, and bsp's and euhm backface culling and god knows what =D Now I believe you're saying that the GPU should be left to do that ? They do now... those options in the ATI drivers ain't there for nothing. How does the GPU do that ? Magick Skybuck. |
#18
"Minotaur" wrote in message ... Skybuck Flying wrote: "joe smith" wrote in message ... Though games still have to make sure to reduce the number of triangles that need to be drawn... with bsp's, view frustum clipping, backface culling, portal engines, and other things. Those can still be done fastest with cpu's, since gpu's dont support/have it. That sort of design would slow down the GPU based rendering pipeline. It is better render primitives in large batches than break the rendering into lot of smaller batches even if there is smaller number of primitives in total being rendered. Frustum clipping with CPU? Never. The ideal is that the data is in GPU's memory and rendering from there directly, if you clip with CPU it means you will have to transfer the data into the GPU which is a major slowdown. If you clip primitives you either render them individually (completely braindead!) or you fill up the data into Vertex Buffer (Object) (and optionally Index Buffer, or index list for glDrawElements() atleast). Well Something has to determine what is visible and what is not... Something has to determine what is in 'view' and what is not... I know quake and other games used the cpu to do that, and bsp's and euhm backface culling and god knows what =D Now I believe you're saying that the GPU should be left to do that ? They do now... those options in the ATI drivers ain't there for nothing. How does the GPU do that ? Magick Well such an answer ofcourse won't do for any serieus developer. The developer has to know how fast a card does this. So he can decide to use the card (gpu) or do it on the cpu :P Skybuck. |
#19
> Well, something has to determine what is visible and what is not...
> Something has to determine what is in 'view' and what is not... I know
> Quake and other games used the CPU to do that, with BSP's and euhm
> backface culling and god knows what =D

BSP trees are used for getting a perfectly sorted, zero-overlap set of primitives for the renderer. The GPU does not need that; in fact, performance is seriously hurt by such "optimization"... like I explained, sorting is a Big No. I must check: do you know what a BSP tree is and what it is commonly used for? And what the Quake engine uses it for, specifically? Or are you just throwing buzzwords around?

> Now I believe you're saying that the GPU should be left to do that ?

You believe wrong. I am saying you cull away large batches at a time. Whole objects and so on. Not individual primitives. That is da secret 2 sp3eD.

> How does the GPU do that ?

There is functionality in the GPU which can query how many pixels pass the visibility test... but this is a poor approach because it has a lot of latency. Doing visibility computation on the CPU is quite cheap. Depending on the system being implemented, of course. Got something specific in mind?
#20
> So he can decide to use the card (gpu) or do it on the cpu :P

That is work the CPU is better at. Using the GPU to determine if there are potentially visible pixels requires 1x overdraw fill cost for the bounding volume being rasterized. Then we might get results back that yes, the object is visible... so now you have the luxury of filling most of those pixels *again*; or, you filled all those pixels just to find out that the object isn't visible. Doesn't sound very attractive to me... does it sound very attractive to you? Didn't think so.

For the CPU, culling a bounding volume such as a box or sphere against a volume defined by planes is very cheap. The most commonly used volume is the frustum. Occluders can also be implemented using the same mathematics; just which primitives make the best occluders is an interesting problem. Generally the best approach is to recognize in a preprocess which surfaces are "large" in a contributing way... the level design and gameplay mechanics determine this to a large degree. With a decent portal rendering system it is a moot point. There are a lot of solutions to the visibility problem, more and less generic. I don't think that you are genuinely interested in any of this, though..
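The cheap CPU-side culling described above can be sketched in a few lines. This is a minimal illustration, not engine code: the frustum is given as inward-facing planes (a, b, c, d) with unit normals, where ax + by + cz + d >= 0 means "inside", and a sphere entirely behind any one plane means the whole object can be skipped without submitting a single primitive.

```python
# Minimal sketch of bounding-sphere vs. plane-volume culling.

def sphere_visible(center, radius, planes):
    """Conservative test: False only if the sphere is completely
    outside some plane; True means 'potentially visible'."""
    cx, cy, cz = center
    for a, b, c, d in planes:
        if a * cx + b * cy + c * cz + d < -radius:
            return False        # completely behind this plane: cull
    return True

# A toy "frustum": the unit cube, as six inward-facing planes
# (x >= -1, x <= 1, y >= -1, y <= 1, z >= -1, z <= 1).
planes = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1),
          (0, -1, 0, 1), (0, 0, 1, 1), (0, 0, -1, 1)]

print(sphere_visible((0, 0, 0), 0.5, planes))   # True: inside
print(sphere_visible((5, 0, 0), 0.5, planes))   # False: cull the object
```

Six dot products per object per frame is why this test is so much cheaper than any GPU-side pixel query, and it needs no readback latency.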