NVIDIA Tesla: new GPU computing product range
http://www.beyond3d.com/content/articles/77
NVIDIA Tesla: GPU computing gets its own brand
written by Tim for Professional Workstation

A Brief History of CUDA

When NVIDIA's G80 launched in November 2006, there was a brief mention of a new toolkit that would greatly simplify GPU computing development. Called CUDA (for Compute Unified Device Architecture), we knew at the time that it was a C derivative that would run on the GPU without using any 3D API as an intermediary. We also knew that the lead architect for CUDA was Ian Buck, a student of the legendary Pat Hanrahan at Stanford and one of the authors of the original BrookGPU paper. Considering its pedigree, we were very excited to see what he could do with G80.

While we waited to get our hands on CUDA, we saw three major advantages to NVIDIA's approach. First, by bypassing 3D APIs, there's no concern about future drivers breaking an application, a problem that has plagued GPU computing applications in the past; consider Folding@Home's initial release on R580 and the continued absence of G80 support as an example. Second, it makes GPU computing more accessible by allowing developers to write their applications in a potentially more familiar manner, as opposed to shoehorning their application to fit within a 3D API's paradigm. Finally, it allows developers to access portions of the chip that they wouldn't be able to use directly in a 3D API.

In February, NVIDIA released a beta of CUDA to the public. Our ideas about the advantages of the CUDA approach were confirmed, especially with the exposure of the parallel data cache to reduce DRAM accesses and accelerate algorithms that would have previously been limited by memory bandwidth or latency. Of course, it wasn't perfect; among other limitations, it still supported only single precision (32-bit) floating point, which is not suitable for many applications. Still, the beta showed enormous promise, and it was obvious that GPU computing would rapidly become a major part of NVIDIA's business.
Today, NVIDIA launches its third brand of GPU products, Tesla, for GPU computing.

The Tesla Lineup

At the moment, NVIDIA Tesla is primarily focused on the highest of the high-end, namely the oil, gas, and computational finance industries. That's important to keep in mind because the Tesla introduction has answered another question we had when we first looked at CUDA: would it be limited to professional cards, even though the consumer GeForce cards would be capable of using CUDA? The answer is a resounding no. CUDA will be available across all product lines, although eventually there will be some features specific to GPU computing that are only available through the Tesla brand. Instead, Tesla, like Quadro, will be positioned as a total solution. Workstations and software will be qualified to work with Tesla, with the same types of support as Quadro.

The basic unit of the current Tesla line, the Tesla C870, should be very familiar to anyone who's seen the GeForce 8800. It's essentially an 8800 GTX--a 575MHz core clock and 128 SPs at 1.35GHz--with 1.5GiB of GDDR3 RAM. Of course, it's not quite an 8800 GTX--there are no display outputs at all on the card, even though it has a new version of the NVIO chip.

NVIDIA states that the C870 has a peak performance of 518 GFlops. Careful readers might already realize the implications of this number: the missing MUL is apparently now available in CUDA, increasing theoretical peak performance by 50% over what was previously stated for the 8800 GTX in CUDA. However, the conditions necessary for the MUL in the SFU to be used are unknown, and we don't know whether or not the SFU MUL will ever be available in 3D applications. Still, the difference between a 50x and a 100x speedup is a lot less important than the difference between 1x and 50x, so we aren't too concerned about the missing MUL.

The C870 has an MSRP of $1299 and should be available in August.
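The 518 GFlops figure and the 50% increase can be checked with a few lines of arithmetic. This is only a sketch: it assumes a multiply-add counts as 2 floating-point operations per SP per clock and the extra MUL as 1 more, which is an assumption about how the peak is counted rather than something NVIDIA's statement spells out. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope peak throughput for the C870.
# Assumption (not stated in the article): a MAD counts as 2 flops per SP
# per clock, and the extra MUL adds 1 more.

def peak_gflops(stream_processors, shader_clock_ghz, flops_per_clock):
    """Theoretical peak in GFlops: SPs x clock (GHz) x flops per clock."""
    return stream_processors * shader_clock_ghz * flops_per_clock

SPS = 128               # stream processors, from the article
SHADER_CLOCK_GHZ = 1.35  # SP clock, from the article

mad_only = peak_gflops(SPS, SHADER_CLOCK_GHZ, 2)      # MAD only
mad_plus_mul = peak_gflops(SPS, SHADER_CLOCK_GHZ, 3)  # MAD plus the extra MUL

print(f"MAD only:    {mad_only:.1f} GFlops")
print(f"MAD + MUL:   {mad_plus_mul:.1f} GFlops")
print(f"Improvement: {(mad_plus_mul / mad_only - 1) * 100:.0f}%")
```

Under those assumptions, the MAD-plus-MUL case comes out at roughly the 518 GFlops NVIDIA quotes, and the ratio between the two cases is the 50% mentioned above.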
The second product in the current Tesla line is the Tesla D870, called the "Deskside Supercomputer." It's very similar to the Quadro Plex; it's two Tesla C870 cards in an external unit. Like the Quadro Plex, the D870 will connect to a host computer via an external PCIe 8x or 16x connection. Because it's simply two C870 cards in a more convenient form factor, the peak theoretical performance of the D870 is 1.036 TFlops.

Keep in mind that CUDA doesn't use any multi-chip interface like SLI. Instead, one thread on the CPU controls one CUDA device. So, in the case of the D870, there are two CUDA devices, and two CPU threads will be used to control them. As a result, if the data set can be spread across the two devices, there's a linear increase in speed. There's no overhead from SLI or anything other than PCIe bandwidth, so the D870 really will be about twice as fast as the C870. The D870 has an MSRP of $7500 and, like the C870, is scheduled for availability in August.

Finally, there's the utter beast of the family: the Tesla GPU Server. With availability targeted for Q4, it costs $12,000 and, with four G80 cards, is twice as fast as the D870. The really terrifying thing here, however, is the form factor. It fits four G80s into a 1U rack. Again, the peak performance is double that of the D870: 2.072 TFlops. The GPU Server will consume about 550W of power on average (with a peak of 800W) and, like the D870, will be connected to a host machine via an external PCI Express 8x or 16x interface. A host machine will be able to control a single GPU Server. Still, for markets that can leverage this kind of computing power, the GPU Server has to be unbelievably attractive, as it would fit into the existing server ecosystem without any special considerations.

But, just what are those markets, and why are they able to use GPU computing to such effect?
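The one-host-thread-per-device model described above for the D870 can be sketched in a few lines. This is an illustration only, not CUDA code: `run_on_device` is a hypothetical stand-in that does its work on the CPU, and `split` and `process` are names invented for this example. It shows the shape of the approach, where the data set is cut into one chunk per device and each chunk is driven by its own host thread.

```python
import threading

def run_on_device(device_id, chunk, results):
    # Hypothetical stand-in for driving one CUDA device from one host thread.
    # A real program would select the device and launch kernels here; this
    # sketch just sums squares on the CPU.
    results[device_id] = sum(x * x for x in chunk)

def split(data, parts):
    """Cut data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def process(data, num_devices):
    """Run one host thread per device, each on its own chunk of the data."""
    results = [None] * num_devices
    threads = [
        threading.Thread(target=run_on_device, args=(i, chunk, results))
        for i, chunk in enumerate(split(data, num_devices))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the per-device partial results on the host.
    return sum(results)

data = list(range(1000))
total = process(data, num_devices=2)  # two devices, as in the D870
print(total)
```

The linear speedup claimed for the D870 depends on the work splitting this cleanly; anything that forces the devices to exchange data has to go back through the host over PCIe.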
NVIDIA has been working closely with some developers since CUDA first became available, and we've seen some of the fruits of their labor.

Tesla isn't going to cause a major paradigm shift in every market overnight. It will, however, cause those shifts in individual industries as software is written to take advantage of CUDA. We had the chance to speak to a number of groups who have been working with CUDA over the past year to see what they've been able to do with it, and the results are impressive.

Headwave

One of the major markets that NVIDIA is targeting with Tesla is the oil and gas industry, and it's easy to see why. Most oil fields that have not yet been discovered are offshore, and as a result, the need for data processing capabilities has grown at an astronomical rate. Seismic data is being acquired at much higher resolutions due to the expense of attempting to tap an undersea oil field--about $150 million--along with new types of data, such as electromagnetics, that weren't even considered a decade ago. These processed data sets are often in the range of 50 gigabytes in size, down from between half a terabyte and 2.5 terabytes before processing.

Headwave is entering the seismic data processing market with a product that uses CUDA to great effect. Traditional algorithms run on a workstation can process a data set at 10-30 MB/s. For a terabyte data set, that's over three weeks of processing time at least. With massive data sets and an inherently parallelizable algorithm, Headwave is able to achieve speeds of 1-2 GB/s. That same terabyte data set can now be processed in less than a day on a single G80.

Evolved Machines

Most people have heard of neural networks in artificial intelligence, but many people don't realize that artificial neural networks as described in the literature have absolutely nothing to do with biological neurons.
Artificial neural networks essentially abstract away all of the physical aspects of the neuron in exchange for a greatly simplified model that seems to have reached its limits in terms of usefulness. Obviously, those physical aspects of the neuron are essential for more general purpose computation.

Dr. Paul Rhodes and his team at Evolved Machines are creating neural arrays, which are complete simulations of neural circuits. Because they take into account all of the physical characteristics of the neuron, they are vastly more complicated than the neural networks of old. A single neuron contains 2000 differential equations, each of which is updated 100,000 times per second. Each update takes 20 Flops. So, for a single neuron, we're already at 4 GFlops. Dr. Rhodes estimates that a neural circuit capable of performing sensory perception would be composed of 1000 to 2500 neurons, which takes us to between 4 and 10 TFlops. Until very recently, this was the type of application that would take a Top 500 supercomputing cluster, but it's suddenly possible with a rack or two.

Sensory perception is exactly what Evolved Machines is trying to achieve. They're building neural circuits that are capable of visual as well as olfactory recognition--yes, that's right, computers that smell. Already, they've found that their application is about 65 times faster on a single G80 than it is on a current x86 chip.

Theoretical and Computational Biophysics Group at UIUC

Evolved Machines weren't the only group demonstrating simulations of molecular biology. John Stone, a developer with the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign, was on hand to share his experiences with CUDA development. Stone has added CUDA support to the Nanoscale Molecular Dynamics package (NAMD). As with Headwave and Evolved Machines, parts of NAMD were sped up by orders of magnitude thanks to CUDA.
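Returning to the Evolved Machines estimate above, the arithmetic behind the 4 GFlops per neuron and 4 to 10 TFlops per circuit figures can be reproduced directly from the quoted numbers. The constant and function names below are invented for this sketch; the inputs are the ones given in the article.

```python
# Compute budget for the neural circuit simulation, using the article's figures.
EQUATIONS_PER_NEURON = 2000   # differential equations per neuron
UPDATES_PER_SECOND = 100_000  # updates of each equation per second
FLOPS_PER_UPDATE = 20         # floating-point operations per update

# Sustained floating-point operations per second needed for one neuron.
flops_per_neuron = EQUATIONS_PER_NEURON * UPDATES_PER_SECOND * FLOPS_PER_UPDATE

def circuit_tflops(neurons):
    """Sustained TFlops needed to simulate a circuit of `neurons` neurons."""
    return neurons * flops_per_neuron / 1e12

print(f"One neuron:   {flops_per_neuron / 1e9:.0f} GFlops")
print(f"1000 neurons: {circuit_tflops(1000):.0f} TFlops")
print(f"2500 neurons: {circuit_tflops(2500):.0f} TFlops")
```

Note that these are sustained requirements, while the Tesla figures quoted earlier are theoretical peaks, so the two are not directly comparable.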
Stone has documented much of his experience with CUDA in a lecture given to the ECE 498AL class at UIUC (more on that class later), and it's definitely worth a read for anyone considering using CUDA. Stone focuses on the algorithm that places ions in a simulated virus, and he points out a number of potential bottlenecks that are applicable to any GPU computing project. The number of arithmetic operations must be high enough to effectively hide memory latency, data structures must be modified to prevent branching, different memory regions must be used effectively, register usage has to be kept as low as possible... basically, it's all the rules that we've come to expect when writing 3D code on a GPU.

Of course, this isn't completely straightforward. Stone provides four implementations of the Coulombic potential kernel. The naive version that doesn't make any attempt to hide memory latency achieves 90 GFlops. The final version, which follows all the guidelines above and maximizes use of the parallel data cache, reaches 235 GFlops. Of course, the resulting code is significantly more daunting than the naive version, although once you've got a decent handle on CUDA, the code itself is really not that bad. However, it's very clear that the algorithm was carefully designed, implemented, profiled, and reimplemented several times, which is where the difficulty with CUDA arises.

The Need for Education

What people are going to discover, though, is that CUDA is hard. Writing the code isn't hard--CUDA really is just C with a few added keywords--but designing algorithms to really utilize the architecture can be fantastically difficult. One concern that NVIDIA has is that students in computer science won't get enough training with parallel algorithms and massively parallel architectures to be able to make the best use of CUDA. This certainly isn't unjustified. A year ago, if we were to mention a massively parallel architecture, we'd be talking about a supercomputing cluster.
Now, some of the same difficulties in designing software for a cluster apply to every G8x chip.

To try to improve the situation, David Kirk, chief scientist at NVIDIA, taught a class at UIUC on data-parallel programming using CUDA, the previously mentioned ECE 498AL. All of the materials for the class, including lecture slides, audio recordings of all lectures, and assignments, are freely available. We've gone through all of the materials to get a better understanding of CUDA ourselves, and we highly recommend them to anyone interested in data-parallel programming or CUDA. NVIDIA, and Kirk in particular, are hoping that ECE 498AL can be used as a template for classes in data-parallel programming at other universities.

Of course, many will claim that NVIDIA is pushing classes in data-parallel programming as a way to push CUDA, and there's certainly some element of truth to that. The problem with that view, though, is the lack of other widely available massively parallel machines. As Kirk told us, it's not possible to be entirely platform agnostic when teaching any low-level language like CUDA. Teaching C usually involves some explanation of what's happening on an x86, and teaching data-parallel programming using CUDA isn't too different. Most importantly, though, students need to be exposed to these types of architectures as early as possible. Massively parallel architectures won't be some fad that goes away in five years, and any exposure, even if it's centered around CUDA, is better than leaving students totally clueless once other vendors introduce similar products.

The Future

First, it's important to keep in mind Tesla's position in the marketplace. CUDA on consumer products isn't going away. Developing CUDA applications is perfectly possible on GeForce cards, and we expect that the importance of CUDA in mainstream applications as well as gaming will quickly grow with the number of G8x chips.
Tesla is instead aimed at the users of high-end GPU computing products that will be certified for use with specific Tesla products. However, that does not mean that there won't be Tesla-specific features. One of the problems preventing the adoption of GPU computing in some markets is the need for double precision. As we look to G92, which has been stated to be close to 1 TFlop for single precision processing as well as being capable of double precision processing, we can say that double precision on G92 will be limited to the Tesla line. While this will surely disappoint some, the need for double precision processing on a consumer GPU is questionable at best, and NVIDIA sees this as an excellent way to differentiate the two product lines. We expect that there will be other Tesla-specific features, such as ECC RAM, that simply don't make sense in the consumer market.

As far as CUDA goes, 1.0 should launch next week, and we'll have in-depth coverage of just what that means for developers. Among the new features are asynchronous kernel calls (freeing up the CPU while the GPU runs), improved FFT and BLAS libraries, and a Matlab library to offload whatever processing it can to the GPU. Most importantly, though, 64-bit versions of CUDA will be available, correcting a major complaint with the earlier betas. In addition, the separation between CUDA drivers and normal device drivers will soon end, making CUDA realistically available to end users. Finally, the specification for PTX, the intermediate assembly language used by CUDA, will be opened, allowing other languages to gain the same access to the chip as CUDA has. However, it also means that backends for different architectures could be developed, potentially changing the GPU computing game considerably. Keep an eye out for that.

All in all, 2007 looks to be the year when GPU computing starts to make inroads into the market.
With Tesla, NVIDIA is making a serious bet on the future of the company (and certainly looking at waging war with the CPU guys). Considering the disruptive effects that GPU computing could have on some markets, it's a very exciting time for the industry, and with GPU computing capabilities stealthily introduced to more and more computers, we're definitely looking forward to seeing just what happens.

Also, be sure to check out our interviews with David Kirk, NVIDIA Chief Scientist, and Andy Keane, General Manager of the GPU Computing group, regarding Tesla, the future of GPU computing, and better incorporating parallel programming into education.

Interviews:
Dave Kirk http://www.beyond3d.com/content/interviews/40/
Andy Keane http://www.beyond3d.com/content/interviews/41/