#1
AMD R600 Architecture and GPU Analysis (long read)
this is only like HALF the article: the entire thing is here http://www.beyond3d.com/content/reviews/16/1

AMD R600 Architecture and GPU Analysis
Published on 14th May 2007, written by Rys for Consumer Graphics - Last updated: 14th May 2007

Introduction

Given everything surrounding the current graphics world at the time of writing -- with big highlights that include the recent AMD/ATI merger, the introduction of a new programming shading model via DirectX, NVIDIA's introduction of G80, real-time graphics hardware in the new generation of consoles, and Intel's intent to come back to discrete -- the speculation and anticipation for AMD's next generation of Radeon hardware has reached levels never seen before. Four years in the making by a team of some 300 engineers, the chip takes the best bits of R5xx and Xenos, along with new technology, to create their next architecture. How it performs, and how it slides into the big graphics picture, means the base architecture and its derivative implementations will have an impact on the industry that will be felt a long time from launch day. If you're not excited about what we're about to explain and go over in the following pages, you haven't been paying attention to the state of the GPU union over the last year and a half, since we live in the most exciting graphics-related times since Voodoo Graphics blazed its real-time, mass-market consumer trail.

The engineers at the new AMD Graphics Products Group have been beavering away for the last few years on what they call their 2nd generation unified shader architecture. Based in part on what you can find today in the Xenos GPU inside the Xbox 360 console, AMD's D3D10 compliant hardware has been a long time coming. Obviously delayed and with product family teething troubles, R600, RV610 and RV630 -- the first implementations of the new architecture -- break cover today for the first time, at least officially!
We'll let you know the architecture basics first, before diving in for closer looks at some of the bigger things the architecture and the implementing GPUs do. As with our G80 analysis, we split things into three, covering architecture in this piece, before looking at image quality and performance in subsequent articles, to divide things up into manageable chunks for us to create and you to consume. AMD have embargoed performance analysis of RV610 and RV630 until next month, but we're allowed to talk about those GPUs in terms of architecture and their board-level implementations, so we'll do that later today, after our look at R600, the father of the family. Tank once said, "Damn, it's a very exciting time". Too true, Tank, too true. He also said shortly afterwards, "We got a lot to do, let's get to it". We'll heed his sage advice.

The Chip - R600

R600 is the father of the family, outgunning G80 as the biggest piece of mass market PC silicon ever created, in terms of its transistor count. It's not feature identical to the other variations, either, so we'll cover the differences as we get to them.

ATI R600 details
Foundry and process: 80nm @ TSMC, 720M transistors
Die size: 420mm² (20mm x 21mm)
Chip package: Flipchip
Basic pipeline config: 16 / 16 / 32 (Textures / Pixels / Z)
Memory config: 512-bit (8 x 64-bit)
API compliance: DX10.0
System interconnect: PCI Express x16
Display pipeline: Dual dual-link DVI, HDCP, HDMI

TSMC are AMD's foundry partner for this round of graphics processors once again, with R600 built on their 80HS node at 80nm. 720M transistors comprise the huge 20x21mm die, which contains all of the logic, including display and signal I/O. R600 is an implementation of AMD's 2nd generation unified shading architecture, fully threaded for computation, data sampling and filtering, and supporting Shader Model 4.0 as set out by Direct3D 10.
R600 also sports a hardware tessellation unit for programmable surface subdivision and certain high order surfaces; it sits outside of any major existing API, although it is programmable using them. R600 sports a 512-bit external memory bus, interfacing with an internal, bi-directional 1024-bit ring bus memory controller, with support for dozens of internal memory clients and GDDR3 or GDDR4 memories for the external store. Sticking with memory, the R600 architecture is cache-heavy internally, with SRAM logic making up a significant portion of the die area. The external bus interface to the PC host is PCI Express, access to that coming via a dedicated stop on the internal rings.

R600 sports AMD's next generation video decoding core, called the Unified Video Decoder or UVD for short. The UVD is designed to handle full H.264 AVC decode processing offload at maximum bitrates for both Blu-ray and HD DVD video at maximum resolution. In terms of power management, the chip supports clock throttling, voltage adjustment for p-states and entire unit shutdown depending on workload, combined by marketing under the umbrella of PowerPlay 7. We'll cover all of those things and more as the article(s) progress. The initial R600-based SKU is called Radeon HD 2900 XT, so we'll take a look at the reference board before we move on to the architecture discussion for the chip that powers it. We'll cover RV610 and RV630 separately later today, as mentioned.

The Radeon HD 2900 XT Reference Board

The only R600-based launch product is Radeon HD 2900 XT. To signal the intent of the products in terms of how they're able to process screen pixels, be that in motion video or real-time 3D graphics, AMD are dropping the X from the product family in favour of HD. The familiar XT, XTX (maybe) and PRO monikers for intra-family designation will stay, however.
Radeon HD 2900 XT sports two clock profiles, the 3D one seeing R600 clocked at 742MHz (the chip has a single master clock domain, across many slightly asynchronous subdomains), with 512MiB of GDDR3 memory clocked at 825MHz. The 2D profile sees things drop to 507MHz for the chip, with the memory run at 514MHz. Windowed applications are run in the 2D clock profile.

ATI Radeon HD 2900 XT 512MiB details
Board name: Radeon HD 2900 XT
Memory quantity: 512MiB
Chip: ATI R600
Core frequency: 742MHz
Memory frequency: 825MHz

Theoretical performance @ 742/825
Pixel fillrate: 11872 Mpixels/sec
Texture fillrate: 11872 Mtexels/sec
Z sample rate: 23744 Msamples/sec
AA sample rate: 47488 Msamples/sec
Geometry rate: 742 Mtris/sec
Memory bandwidth: 105.6 GB/sec

The table above will start to give some of the story away if you're willing to take the base clock and do some basic arithmetic. By virtue of the 512-bit external memory bus and Hynix HY5RS573225A FP-1 DRAM devices clocked at 825MHz (1.65GHz effective rate), HD 2900 XT has a peak theoretical bandwidth of 105.6GB/sec. One hundred and five point six, for those still blinking at the figure in numerical form. And while AMD aren't the first company to deliver a single board consumer solution with a peak bandwidth higher than 100GB/sec, given NVIDIA's soft launch of GeForce 8800 Ultra and its 103.6GB/sec peak via a 384-bit bus, they're definitely the first to try a 512-bit external bus on a consumer product.

Retail editions of the board are barely larger than AMD's outgoing Radeon X1950 XTX, PCB size wise, although the cooler is a good bit beefier. While it looks like your average double slot cooler for a high end board, replete with blower fan and rear exit for the exchanged heat, the mass of the sink attached to the board and the components it's cooling is significant. In fact, the board is the heaviest 'reference' hardware yet created, with reference used to define the cooler that the IHV officially specifies.
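The quoted rates fall straight out of the base clock and the per-clock unit counts; a quick sketch of the arithmetic (unit counts taken from the pipeline config quoted above):

```python
# Back-of-the-envelope check of the theoretical figures quoted above
# for Radeon HD 2900 XT (742MHz core, 825MHz GDDR3 on a 512-bit bus).

core_mhz = 742
mem_mhz = 825                      # DDR, so 1650MT/s effective

pixel_fill = core_mhz * 16         # 16 pixels/clock -> Mpixels/sec
z_rate = core_mhz * 32             # 32 Z samples/clock -> Msamples/sec
bandwidth_gb = (512 // 8) * (mem_mhz * 2) / 1000   # bus bytes x MT/s -> GB/sec

print(pixel_fill)     # 11872
print(z_rate)         # 23744
print(bandwidth_gb)   # 105.6
```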
It's a somewhat sad fact, for AMD and the consumer, that TSMC's 80HS process has pretty horrible static power leakage properties. Given the process properties, the die area, clocks for GPU and mem, and the memories themselves, it's no honest surprise to see the fastest launch board based on R600 coming with such a cooler. It's also no real surprise to see it come with a pair of external power input connectors, and one of those is the new 8-pin variant as well. That version of the connector gives the board another +12V input source at just over 6A max current. The board doesn't require you to fill that connector block up with power-giving pins, though, since it runs happily with a regular pair of 6-pin inputs. Look closely and you'll see the 8-pin block holds the 6-pin block, and there's only one orientation that can make it happen, so you don't connect the wrong pins and have things go horribly wrong.

A pair of dual-link DVI connectors dominates the backplane space that the first slot occupies, with the grille for the heat output dominating the second. The DVI ports (although it's unclear if it's just one of them or both) support HDMI output for video and audio via a supplied active converter. It appears that audio is either muxed into the video bitstream pushed out over the pins, or the spare pins on the dual-link connector (the version of HDMI that the hardware supports is analogous to single-link DVI for the video portion) are used for that, but it's an implementation detail at best.

Physicals

In terms of the board's physical properties when being used in anger, we've run our test sample just fine using a PSU that provides only 6-pin connectors. Calibrating against other boards on the same test platform, we estimate (because we're not 100% certain about other boards' peak power either) peak load power from the board, at least using our load condition of 3DMark05's GT3 @ 1920x1200 w/4xAA/16xAF, of less than 200W at stock frequencies.
We haven't tried overclocking enough to be sure of a peak power draw using the Overdrive feature. Using AMD's GPU overclocking tool, which is able to read the GPU's thermal diode, load temperatures for the GPU approach 90°C with our sample. Under the load condition generating that heat, the fan speed is such that board volume is higher than any of the other boards on test, and the pitch of the noise is annoyingly high with a whistling property to it, at least in our test system and to our ears. Added to that, the speed steppings of the fan make pitch changes very noticeable, and the board constantly alters fan speed in our test system, even just drawing the (admittedly Vista Aero Glass) desktop. So there's no static 2D fan speed that we can see, and the overall noise profile of the cooling solution is subjectively much worse than Radeon X1950 XTX, NVIDIA GeForce 8800 GTX or GeForce 8800 GTS.

So while the board's form factor is pleasing in the face of NVIDIA's highest-end products, simply because it's not as long and will therefore fit into more cases than its competitor, the cooler is a disappointment in terms of noise. It certainly seems effective when dealing with heat, though, and we're sure that was AMD's primary concern when engineering it. It's possible that things can get better for the product here without any hardware changes, via software. The fan's got more variable speed control than the driver seems to make use of, and a more gradual stepping function to change fan speed based on GPU load and temperature can likely be introduced. Here's hoping! Time to check out the test system that we used for our architecture and performance investigations.

AMD R600 Overview

Click for a bigger version. We use an image display overlay for image enlargement at Beyond3D (if your browser supports it), but it might be worth having the enlargement open in a new browser window or a new tab, so you can refer to it as the article goes by.
The image represents a (fairly heavily in places) simplified overview of R600. Datapaths aren't complete by any means, but they serve to show the usual flow of data from the front of the chip to its back end when final pixels are output to the framebuffer and drawn on your screen. Hopefully all major processing blocks and on-chip memories are shown. R600 is a unified, fully-threaded, self load-balancing shading architecture that complies with and exceeds the specification for DirectX Shader Model 4.0. The major design goals of the chip are high ALU throughput and maximum latency hiding, achieved via the shader core, the threading model and distributed memory accesses via the chip's memory controller. A quick glance at the architecture, comparing it to their previous generation flagship most of all, shows that the emphasis is on the shader core and maximising available memory bandwidth for 3D rendering (and non-3D applications too).

The shader core features ALUs that are single precision and IEEE 754 compliant in terms of rounding and precision for all math ops, with integer processing ability combined. Not all R600 SPUs are created equal, with a 5th more able ALU per SPU group that handles special function and some extra integer processing ops. R600, and the other GPUs in the same architecture family, also sports a programmable tessellation unit, very similar to the one found in the Xbox 360. While DirectX doesn't support it in any of its render stages, it's nonetheless programmable using that API with minimal extra code. The timeframe for that happening is unclear, though. That's the basics of the processor, so we'll start to look at the details, starting with the front end of the chip where the action starts.

Shader Core

Ah, the processing guts of the beast. With four such clusters, R600 sports a full 320 separate processing ALUs dedicated to shading, with the units arranged as follows.
Each cluster contains sixteen shader units, each containing five scalar sub-ALUs that perform the actual shading ops. Each ALU can run a separate op per clock, R600 exploiting instruction-level parallelism via VLIW. Having the compiler/assembler do a bunch of the heavy lifting in terms of instruction order and packing arguably reduces overall efficiency compared to something like G80's scalar architecture, which can run full speed with dependent scalar ops. Not all the ALUs are equal, either, with the 5th in the group able to do a bit more than the other four, and independently of those too.

Architecture Summary

Well well, graphics fans, it's finally here! Years in the making for AMD, via the hands of 300 or so engineers, hundreds of millions of dollars in expenditure, and unfathomable engineering experience from the contributing design teams at AMD, R600 finally officially breaks cover. We've been thinking about the architecture and GPU implementations for nearly a year now in a serious fashion, piecing together the first batches of information sieved from yon GPU information stream. As graphics enthusiasts, it's been a great experience to finally get our hands on it and put it through the mill of an arch analysis, after all those brain cycles spent thinking about it before samples were plugged in and drivers installed. So what do we think, after our initial fumblings with the shader core, texture filter hardware and ROPs? Arguably the most interesting bits and pieces the GPU and the boards that hold them provide, we've not been able to look at, either for time reasons, resource reasons, or because they simply fall outside this article's remit! That's not to say things like the UVD, HDMI implementation and the tessellator overshadow the rest of the chip and architecture, but they're significant possible selling points that'll have to await our judgement a little while longer.
What remains is a pretty slick engineering effort from the guys and gals at AMD's Graphics Products Group, via its birth at the former ATI. What you have is evolution rather than revolution in the shader core, AMD taking the last steps to fully superscalar with independent 5-way ALU blocks and a register file with seemingly no real-world penalty for scalar access. That's backed up by sampler hardware with new abilities and formats supported to chew on, with good throughput for common multi-channel formats. Both the threaded sampler and shader blocks are fed and watered by an evolution of their ring-bus memory controller. We've sadly not been able to go into too much detail on the MC, but mad props to AMD for building a 1024-bit bi-directional bus internally, fed by a 16-piece DRAM setup on the 512-bit external bus. Who said the main IHVs would never go to 512? AMD have built that controller in the same area as the old one (whoa, although that's helped by the process change), too. Using stacked pads and an increase in wire density, affording them the use of slower memory (which is more efficient due to clock delays when running at higher speeds), R600 in HD 2900 XT form gets to sucking over 100GB/sec peak theoretical bandwidth from the memories. That's worth a tip of an engineer's hat any day of the week.

Then we come to the ROP hardware, designed for high performance AA with high precision surface formats, at high resolution, with an increase in the basic MSAA ability to 8x. It's here that we see the lustre start to peel away slightly in terms of IQ and performance, with no fast hardware resolve for tiles that aren't fully compressed, and a first line of custom filters that can have a propensity to blur more than not.
Edge detect is honestly sweet, but the CFAA package feels like something tacked on recently to paper over the cracks, rather than something forward-looking (we'll end up at the point of fully-programmable MSAA one day in all GPUs) to pair with speedy hardware resolve and the usual base filters. AMD didn't move the game on in terms of absolute image quality when texture filtering, either. They're no longer leaders in the field of IQ, overtaken by NVIDIA's GeForce 8-series hardware.

Coming back to the front of the chip, the setup stage is where we find the tessellator. Not part of a formal DirectX spec until next time with DX11, it exists outside of the main 3D graphics API of our time, and we hope the ability to program it reliably comes sooner rather than later, since it's a key part of the architecture and didn't cost AMD much area. We'll have a good look at the tessellator pretty soon, working with AMD to delve deep into what the unit's capable of.

With a harder-to-compile-for shader core (although one with monstrous floating point peak figures), less per-clock sampler ability for almost all formats and channel widths, and a potential performance bottleneck with the current ROP setup, R600 has heavy competition in HD 2900 XT form. AMD pitch the SKU not at (or higher than) the GeForce 8800 GTX as many would have hoped, but at the $399 (and that's being generous at the time of writing) GeForce 8800 GTS 640MiB. And that wasn't on purpose, we reckon. If you asked ATI a year ago what they were aiming for with R600, the answer was simple domination over NVIDIA at the high end, as always. While we take it slow with our analysis -- and it's one where we've yet to heavily visit real world game scenarios, DX10 and GPGPU performance, video acceleration performance and quality, and the cooler side facets like the HDMI solution -- the Beyond3D crystal ball doesn't predict the domination that ATI would have expected a year or more ago.
Early word from colleagues at HEXUS, The Tech Report and Hardware.fr in that respect is one of mixed early performance that's 8800 GTS-esque or thereabouts overall, but also sometimes less than Radeon X1950 XTX in places. Our own early figures there show promise for AMD's new graphics baby, but not everywhere. It's been a long time since that's been something anyone's been able to say about a leading ATI, now AMD, graphics part. We'll know a fuller story as we move on to looking at IQ and performance a bit closer, with satellite pieces to take in the UVD and HDMI solution and the tessellator to come as well. However, after our look at the base architecture, we know that R600 has to work hard for its high-quality, high-resolution frames per second, but we also know AMD are going to work hard to make sure it gets there. We really look forward to the continued analysis of a sweet and sour graphics architecture in the face of stiff competition, and we'll have image quality for you in a day or two to keep things rolling. RV610 and RV630 details will follow later today.

The VLIW design packs a possible 6 instructions per-clock, per-shader unit (5 shading plus 1 branch) into the complete instructions it issues to the shader units, and those possible instruction slots have to match the capabilities of the hardware underneath. Each of the first 4 sub-ALUs is able to retire a finished single precision floating point MAD (or ADD or MUL) per clock, a dot product (dp, special cased by combining ALUs), and an integer ADD. In terms of float precision, the ALUs are 1 ULP for MAD, and 1/2 ULP for MUL and ADD. The ALUs are split in terms of gates for float and int logic, too. There's no shared 32-bit mantissa in the ALU to support both, and only one datapath in and out of the sub-ALU, so no parallel processing there. Denorms are clamped to 0 for both D3D9 and D3D10, but the hardware supports inf and NaN to IEEE 754 spec.
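Put the unit counts and the 742MHz core clock together and you get R600's headline single precision throughput; a back-of-the-envelope sketch (counting a MAD as two flops, as is conventional):

```python
# Peak single precision throughput implied by the figures above:
# 4 clusters x 16 shader units x 5 sub-ALUs, each able to retire a
# MAD (2 flops) per clock at 742MHz.

clusters, units_per_cluster, sub_alus = 4, 16, 5
alu_count = clusters * units_per_cluster * sub_alus
print(alu_count)                    # 320

core_ghz = 0.742
peak_gflops = alu_count * core_ghz * 2
print(round(peak_gflops, 1))        # ~475 GFLOPS
```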
The fifth, fatter unit (let's egotistically call it the RysUnit, since it shares my proportions compared to normal people, and I can be 'special' too) can't do dp ops, but is capable of integer division, multiply and bit shifting, and it also takes care of transcendental 'special' functions (like sin, cos, log, pow, exp, rcp, etc), at a rate of one retired instruction per clock (for most specials at least). It's also responsible for float-integer conversion. Unlike the other units, this one is actually FP40 internally (32-bit mantissa, 8-bit exponent). This allows for single-cycle MUL/MAD operations on INT32 operands under D3D10, which G80 needs 4 cycles for. It's certainly an advantage of having a VLIW architecture and multiple kinds of units. If you didn't follow that, the following should help.

Each cluster runs thread pairs of the same type in any given cycle, but each of those four clusters can run a different thread type if it needs to. The front-end of the chip handles the thread load balancing across the core as mentioned, and there's nothing stopping all running threads in a given cycle being all pixel, all vertex, or even... you guessed it: all geometry, although that might not be the case currently. More on that later.

For local memory access, the shader core can load/store from a huge register file that takes up more area on the die than the ALUs for the shader core that uses it. Accesses can happen in 'scalar' fashion, one 32-bit word at a time from the application writer's point of view, which along with the capability of co-issuing 5 completely random instructions (we tested using truly terrifying auto-generated shaders) makes ATI's claims of a superscalar architecture perfectly legit. Shading performance with more registers is also very good; indeed we've been able to measure that explicitly with shaders using variable numbers of registers, where there's no speed penalty for increasing them or using odd numbers.
It's arguably one of the highlights of the design so far, and likely a significant contributor to R600's potential GPGPU performance as well. Access to the register file is also cached, read and write, by an 8KiB multi-port cache. The cache lets the hardware virtualise the register file, effectively presenting any entry in the cache as any entry in the larger register file. It's unclear which miss/evict scheme they use, or if there's prefetching, but they'll want to maximise hits for the running threads of course. It seems the hardware will also use it for streamout to memory, letting the shader core bypass the colour buffer and ROPs on the way out to board memory, and the chip will also use it for R2VB and overflow storage for GS amplification, making it quite the useful little piece of on-chip memory.

While going 5-way scalar has allowed AMD more flexibility in instruction scheduling compared to their previous hardware, that flexibility arguably makes your compiler harder to write, not easier. So while as a driver writer you have more packing opportunities -- and I like to think of it almost like a game of Tetris when it comes to a GPU, but only with the thin blocks and with those being variable lengths, and you can sometimes break them up! -- those opportunities need handling in code, and your corner cases get harder to find. The end result here is a shader core with fairly monstrous peak floating point numbers, by virtue of the unit count in R600, its core clock and the register file of doom, but one where software will have a harder time driving it close to peak. That's not to say it's impossible, and indeed we've managed to write in-house shaders, short and long and with mixtures of channels, register counts and what have you, that run close to max theoretical throughput. However, it's a more difficult proposition for the driver tech team to take care of over the lifetime of the architecture, we argue, than their previous architecture.
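To make the Tetris analogy concrete, here's a toy in-order packer -- purely illustrative, not AMD's actual compiler: independent scalar ops get bundled into 5-wide VLIW words, and a dependency on a result in the current bundle forces a new one, leaving slots empty.

```python
# Toy model of VLIW slot packing: each op is (dest, sources). An op can
# co-issue with the ops already in the current bundle only if it doesn't
# read any result written in that bundle and a slot is still free.

def pack(ops, width=5):
    bundles = [[]]
    for dest, srcs in ops:
        current = bundles[-1]
        independent = not any(d in srcs for d, _ in current)
        if len(current) < width and independent:
            current.append((dest, srcs))
        else:
            bundles.append([(dest, srcs)])
    return bundles

# r2 reads r0, so it can't share a bundle with the op that produces r0
ops = [("r0", ["a", "b"]), ("r1", ["c", "d"]), ("r2", ["r0", "e"])]
bundles = pack(ops)
print(len(bundles))        # 2
print(len(bundles[0]))     # 2 ops co-issued; the other 3 slots go unused
```

A dependent chain of scalar ops is the worst case here: every op lands in its own bundle and four of the five slots are wasted each clock, which is exactly the efficiency risk the article describes relative to G80's scalar issue.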
In terms of memory access from the sampler hardware, sampler units aren't tied to certain clusters as such, but rather to certain positions inside the cluster. If you visualise the 16 shader units in a cluster as being four quads of units, each of the four samplers in R600 is tied to one of those quads, and that association holds across the whole shader core.
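One hypothetical way to express that association in index terms (the exact numbering is our assumption for illustration, not something the article specifies):

```python
# Sketch of the sampler/quad association described above: the 16 shader
# units in a cluster form four quads, and sampler i serves quad i in
# every cluster. The numbering scheme here is illustrative only.

def sampler_for_unit(unit_in_cluster):
    return unit_in_cluster // 4    # units 0-3 -> sampler 0, 4-7 -> 1, ...

mapping = [sampler_for_unit(u) for u in range(16)]
# groups of four: sampler 0 for units 0-3, up to sampler 3 for units 12-15
```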
#2
[lots of stuff clipped]
Alas, it appears a rushed chipset product has resulted in a power-hungry beast that had to be underclocked before it could be released, and also prevented the release of a contender for fastest videocard. Perhaps when it's retooled on a smaller process. rms
#3
On May 15, 1:53 pm, "rms" wrote:
[quoted text clipped]

There's a 65nm version in the works, possibly called R650, but I don't expect the R6xx architecture to shine until it's been refreshed and improved in the form of an R680 or R700, which will most likely share the same basic architecture, but revamped.
#4
On Tue, 15 May 2007 12:53:56 -0600, "rms"
wrote: [quoted text clipped]

It's a crap card and doesn't hold a candle to the G80. nVidia is king of the hill for the next year or so at least.
#5
Bababooey wrote:
On Tue, 15 May 2007 12:53:56 -0600, "rms" wrote: [quoted text clipped]

It's a crap card and doesn't hold a candle to the G80. nVidia is king of the hill for the next year or so at least.

Yeah... it's a great card for DX9 games.
#6
Thanks for that, RadeonR600... My brain hurts!
Why such an obsession with quiet fans? I'm fighting my GeForce to turn its fan up. I want the thing to last as reliably as possible... every graphics card I have had that has died has done so because of some fault in the cooling. I'd go for noisy & cool every time. All those enthusiastic words!! but no mention at all of reliability / long term life. When spending say £200+ / $400+ USD a person should know the thing is good for at least a couple of years of heavy use. All the words don't really tell me anything in terms of what performance / cost is. After years with ATI cards & now recently to NVIDIA for the 8800GTX I'm in 2 minds... wanting to hear my GeForce is the better... & wanting to hear about a really good new DX10 card. I would not recommend the 8800 to anyone because of the dreadful control panel & silly little driver bugs. The article should try to impress us with how wonderful the ATI drivers will be.

(\__/)
(='.'=)
(")_(")
mouse
#7
On Tue, 15 May 2007 12:53:56 -0600, "rms"
wrote: [quoted text clipped]

The only real question is how good a CPU it's ultimately going to be.