Nolan Goodnight Dale Beermann
September 26, 2002
| 1) |
The crossover point between geometry and rasterization
We modified the supplied gfxBench program to benchmark triangle and fill rates under various conditions. Our test system is a dual Pentium III 550 MHz with a GeForce4 Ti4200. We believe that the 2X AGP slot was a major limiting factor in many of our tests, and that a faster interface to the graphics card would have improved many of the results in this assignment.
The graph below shows our triangle rates using the system's main memory. There are four tests, all of which use memory allocated by the CPU. The legend indicates whether the triangles were colored (c), unlit (unl), lit (l), untextured (unt), or textured (t). All tests used right triangles indexed in a vertex array.
Our results are not exactly what we expected. We achieved a maximum of 11.55 million triangles per second at a triangle size (area) of about 55 pixels, but this peak occurred in an interesting place. As expected, the triangle rate does not change as the triangle size increases, up to a point. This point should be the crossover between geometry-limited and rasterization-limited rendering. The graph, however, shows an increase in triangle rate just before the drop-off. The only explanations we have are that the hardware applies some sort of optimization to triangles around this size, or that some other factor we are unaware of is involved. From these tests it is difficult to tell which.
For colored, lit, untextured triangles, the crossover point was at a size of about 84 pixels. Our lighting tests use 4 positional light sources, each with diffuse, ambient, and specular components. We needed this many lights to make sure that the hardware could not accelerate all of them. This gave us a lighting-limited case, but the data says a little more.
Lighting limited the triangle rate until the triangle size reached the same point on the graph as for the unlit triangles, at which point we assume the fill rate again became the limiting factor. It is important to note that different lighting setups will produce different data, each with different limiting factors depending on the graphics hardware. A GeForce can accelerate more than one light, so we wanted to make sure that we found a limiting case.
The crossover points for unlit, textured and lit, textured triangles are both at about 32 pixels. In this case we assume texturing to be the limiting factor. Another interesting result is that lit, textured triangles render faster than unlit, textured triangles. We are not certain why this happens. It could be the result of textures being cached and accessed very quickly, but without more extensive testing it is hard to know.
One of the more interesting results is that the curves converge. For lit, untextured and unlit, untextured triangles, lighting is essentially free beyond a size of 84 pixels. A similar convergence occurs for lit, textured and unlit, textured triangles. In this second case texturing seems to be the limiting factor, whereas in the first case fill rate seems to be the limiting factor. The data below confirms, to an extent, that texturing is limiting in the second case.
Although the data isn't entirely conclusive, we do see a small decrease in fill rate for textured triangles. Here lighting doesn't seem to have much of an impact. We used the same lighting model as above, hoping that it would "trick" the card and cause a significant slowdown, but we saw no measurable effect. From this we conclude that textured triangle rates don't hit the same bottleneck as untextured triangle rates because texturing, not fill rate, is the limiting factor. |
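The crossover reasoning used in this section can be sketched as a simple two-stage bottleneck model. This is only an illustration under assumed constant stage rates, not a description of how the GeForce4 actually behaves; the function names and the example rates are hypothetical, and the model does not account for the rise in triangle rate we observed just before the drop-off.

```python
def triangle_rate(area_px, geometry_rate, fill_rate):
    """Triangles per second under a two-stage bottleneck model.

    geometry_rate: assumed constant peak of the geometry stage (triangles/s).
    fill_rate:     assumed constant peak of the rasterizer (pixels/s).
    The slower of the two stages determines the overall rate.
    """
    if area_px <= 0:
        raise ValueError("area_px must be positive")
    return min(geometry_rate, fill_rate / area_px)


def crossover_area(geometry_rate, fill_rate):
    """Triangle area (pixels) at which both stages limit the rate equally."""
    return fill_rate / geometry_rate


if __name__ == "__main__":
    # Hypothetical rates chosen only for illustration, not measured values.
    geometry = 10e6   # triangles/s
    fill = 800e6      # pixels/s
    print("crossover area:", crossover_area(geometry, fill))
    for area in (20, 40, 80, 160, 320):
        print(area, triangle_rate(area, geometry, fill))
```

Below the crossover area the modeled rate stays flat at the geometry rate; above it the rate falls off in inverse proportion to the area.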
|
2) Pixel Transfer performance
The purpose of this benchmark is to determine the rate at which pixels stored in processor memory can be moved over the AGP bus and into the framebuffer. OpenGL supports a number of pixel formats (RGB, RGBA, etc.) and data types. Our tests are designed to reveal the trade-off between performance and how pixel data is stored in memory. The graph below illustrates our measured speeds (in Mpix/sec) for a number of standard GL pixel formats. In most cases ReadPixels is slower than DrawPixels. The exceptions are GL_BYTE and the GL_SHORT data types, for which ReadPixels is faster. We are unsure why this is. A spreadsheet containing all of our data is available at the bottom of this page.
Graph (1) shows the pixel rate (and bandwidth in MB/sec) for various GL data types. In each case the format is fixed at GL_RGBA. Clearly, the graphics card is optimized to operate on single-byte components. The second graph (2) compares varying pixel formats at a fixed data type. For this illustration we fixed the data type at the peak performer from the previous graph (GL_UNSIGNED_BYTE). Here the results show a maximum DrawPixels rate for the RGBA and RGB formats (as well as the reverse: BGR and BGRA). For both data sets we used an RGBA framebuffer with blending disabled.
Data Alignment
In OpenGL the UNPACK_ALIGNMENT and PACK_ALIGNMENT values affect the operation of any method that moves memory between the system and the graphics card. For this test we fixed the pixel format and data type and stepped through the four possible alignment options (pack and unpack). The graph below shows a plot of the four different cases. Note that changing the alignment seems to affect performance only in the high-performance regions, where we are either reading or writing RGB (RGBA), UNSIGNED_BYTE color components.
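To make the alignment options concrete, the following sketch computes how many bytes one image row occupies in client memory for a given alignment value, following the row-length rule in the OpenGL pixel storage specification. The function name is hypothetical, and the sketch ignores the other pixel-store parameters (row length, skip rows, skip pixels).

```python
def row_stride_bytes(width_px, components, bytes_per_component, alignment):
    """Bytes from the start of one row to the start of the next.

    alignment is the PACK_ALIGNMENT / UNPACK_ALIGNMENT value (1, 2, 4 or 8).
    If a component is at least as large as the alignment, rows are not padded;
    otherwise each row is padded up to a multiple of the alignment.
    """
    if alignment not in (1, 2, 4, 8):
        raise ValueError("alignment must be 1, 2, 4 or 8")
    row = width_px * components * bytes_per_component
    if bytes_per_component >= alignment:
        return row
    return -(-row // alignment) * alignment  # round up to a multiple


if __name__ == "__main__":
    # A 3-pixel-wide RGB, unsigned-byte row is 9 bytes before padding.
    for a in (1, 2, 4, 8):
        print(a, row_stride_bytes(3, 3, 1, a))
```

RGBA rows of byte components are always a multiple of 4 bytes, so alignments of 1, 2 and 4 produce the same layout for them; RGB rows are padded whenever the row size is not a multiple of the alignment.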
We tested the performance of glDrawPixels with various blending modes enabled. In each test, the source blending factor was set to the same value as the destination factor. We also tried mismatched combinations of source and destination factors, but the results did not vary notably. We conclude that the GeForce4 is highly efficient at hardware alpha blending, since not one of the blending modes decreased our DrawPixels performance. In fact, the results were virtually identical to those with blending disabled. Using RGB and GL_UNSIGNED_BYTE we were able to maintain around 34 Mpix/sec regardless of blending mode.
Depth Components
We found the difference between reading and writing depth components to be significant. Our results indicate that it is always faster to read depth components than it is to write them. The graph below illustrates this. From the graph it is clear that the only case where read and write speeds are comparable is when the data type is set to GL_FLOAT. With most of the data types, the glDrawPixels rate is appallingly slow.
Pixel Rate vs. Image Size
To test the relationship between image size and transfer rate we again chose one of the highest-performance data type and pixel format combinations (RGB). For each glDrawPixels iteration, we vary the image size by adding a fixed number to the allocated width and height. This is repeated, starting with an 8x8 image and ending with a 1000x1000 image. The graph below shows results for both glReadPixels and glDrawPixels. Interestingly, we obtain maximum read and write performance when the image size is around 250x250 pixels. It seems likely that the clear drop-off (at around 400x400) is due to the bandwidth limitations of the AGP bus. If true, the data suggests that the AGP bus on our test system has a maximum download bandwidth of around 80 MB/sec and a maximum upload rate of 50 MB/sec. These values are far lower than even the AGP 2X rating. In addition, these low rates conflict with the texture upload measurements we show in the next section. Such discrepancies suggest the presence of another bottleneck.
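The comparison against the AGP 2X rating can be made explicit with a little arithmetic. The sketch below converts a pixel rate to bandwidth and expresses a bandwidth as a fraction of the nominal AGP 2X peak, taken here as roughly 533 MB/sec (66 MHz, 32 bits wide, two transfers per clock). The function names are hypothetical, and the nominal peak is a theoretical figure that real transfers do not reach.

```python
AGP_2X_PEAK_MB_S = 533.0  # nominal peak; an approximation, not a measurement


def bandwidth_mb_s(mpix_per_s, components, bytes_per_component):
    """Convert a pixel rate in Mpix/sec to MB/sec for a given pixel layout."""
    return mpix_per_s * components * bytes_per_component


def fraction_of_agp_2x(mb_per_s):
    """Fraction of the nominal AGP 2X peak that a measured bandwidth uses."""
    return mb_per_s / AGP_2X_PEAK_MB_S


if __name__ == "__main__":
    # The two bandwidth estimates quoted in the text above.
    for label, mb in (("download", 80.0), ("upload", 50.0)):
        print(label, round(100 * fraction_of_agp_2x(mb), 1), "% of AGP 2X")
```

Under these assumptions the two estimates come to roughly 15% and 9% of the nominal peak, which is consistent with the suspicion that something other than the bus is the bottleneck.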
Texture Upload Bandwidth
To evaluate the rate at which we can move texture image data from processor memory to texture memory across the AGP bus, we simply call glTexImage2D() with varying format and data type parameters. Nearly all of the formats and types allowed for pixel transfers are also valid for creating texture objects. Therefore, the range of these tests closely matches that for moving pixels. The graph below shows the effect that data type has on texture image upload rates. One surprising result is the "apparent" speed at which UNSIGNED_INT and FLOAT components can be stored in texture memory. As with DrawPixels and ReadPixels, texture upload bandwidth is at a maximum with 32-bit-per-pixel data types (UNSIGNED_BYTE, UNSIGNED_INT_8_8_8_8, etc.). However, there is reasonably high performance with FLOAT components.
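The measurement approach described above (call the upload repeatedly and divide the data moved by the elapsed time) can be sketched generically. This is not our gfxBench code: the helper names are hypothetical, and a byte-buffer copy stands in for glTexImage2D() so that the sketch runs without a GL context. A real OpenGL measurement would also need to call glFinish() before reading the clock, because GL commands may complete asynchronously.

```python
import time


def texture_bytes(width, height, components, bytes_per_component):
    """Size in bytes of one tightly packed texture image."""
    return width * height * components * bytes_per_component


def measure_rate(operation, units_per_call, min_seconds=0.5):
    """Call operation() repeatedly for at least min_seconds.

    Returns units per second, where each call moves units_per_call units.
    """
    if min_seconds <= 0:
        raise ValueError("min_seconds must be positive")
    calls = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_seconds:
        operation()
        calls += 1
        elapsed = time.perf_counter() - start
    return calls * units_per_call / elapsed


if __name__ == "__main__":
    size = texture_bytes(256, 256, 4, 1)   # 256x256 RGBA, one byte per component
    source = bytearray(size)
    # Stand-in for the texture upload call: copy the image once per iteration.
    rate = measure_rate(lambda: bytes(source), size)
    print("MB/sec (stand-in copy):", rate / 1e6)
```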
We have chosen to display only a small portion of the benchmarks. A complete listing of our "pixel moving" results can be downloaded here.
|
| 3) |
The effect of triangle shape on rasterization performance
For this part of the assignment, we attempted to determine the triangle rate for varying ratios of triangle width to triangle height. One might assume that the ratio would have no effect on how triangles are rasterized. To test this, we changed gfxBench so that it could display triangles of the same area with varying height and width. We used triangles with an area of 5000 pixels and varied height and width from 5 to 2000 pixels. Two tests were run: one incremented the height of the triangle by 5 pixels each iteration and calculated the corresponding width, and the other did the reverse.
Our results show that these ratios do in fact make a difference in how quickly the hardware can rasterize the triangles. Triangles that are much wider than they are high cannot be rasterized as quickly as triangles of the same size with the opposite dimensions. Until the ratio of width to height reaches about 660, short, fat triangles are generally rasterized faster than tall, skinny ones. At this point, though, their triangle rate stops increasing and levels out, while the triangle rate for tall, skinny triangles keeps increasing. This means that the hardware rasterizes either many pixels in the vertical direction more quickly, or few pixels in the horizontal direction more quickly. Without knowing how the hardware scan-converts triangles, it is hard to know which is the case. Below we provide our results for this test. It is easy to see visually that short, fat triangles generally rasterize a little quicker than tall, skinny ones. |
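The constant-area sweep described in this section can be sketched as follows. The sketch assumes right triangles, so that area = width x height / 2; under that assumption a height of 5 pixels at an area of 5000 pixels gives a width of 2000 pixels, matching the range quoted above. The function names are hypothetical and this is not the gfxBench code itself.

```python
def right_triangle_width(area_px, height_px):
    """Width of a right triangle with the given area and height."""
    if height_px <= 0:
        raise ValueError("height_px must be positive")
    return 2.0 * area_px / height_px


def shape_sweep(area_px=5000.0, min_height=5, max_height=2000, step=5):
    """(height, width) pairs of constant area, stepping the height."""
    return [(h, right_triangle_width(area_px, h))
            for h in range(min_height, max_height + 1, step)]


if __name__ == "__main__":
    sweep = shape_sweep()
    print("first:", sweep[0], "last:", sweep[-1], "count:", len(sweep))
```

Swapping the roles of height and width in each pair gives the second test, in which the width is stepped and the height is computed.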
| 3) |
Below is a graph of fill rate versus triangle size. As would be expected, geometry processing dominates rendering for small triangles up to a point, beyond which the fill rate becomes the limiting factor: no matter how fast you can generate triangles, you have to wait for them to be filled. This happens at a triangle area of about 525 pixels, or 32 pixels on an edge for a right triangle. This is somewhere in between the values that we found for the crossover points in 1) and the second part of 3).
Again, the interesting thing about this graph is that there is a cyclic pattern to the results. We incremented the size of the triangles by 0.5 pixels each time to make sure we would end up with a consistent graph. Because we used such a fine scale, we can be sure that we are covering most possible sizes of triangles. We must assume that there is something in the hardware that works better for certain sizes of triangles. It could be that for triangles covering many partial pixels, the hardware must do more calculation to decide whether or not to display each pixel. Again, this implementation can differ between graphics cards, and it is hard to know exactly what is causing this. |
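The relationship between edge length and area used in this section can be checked with a short sketch. It assumes right triangles whose two legs have equal length, so that area = edge^2 / 2; the function names are hypothetical.

```python
import math


def right_triangle_area(edge_px):
    """Area of a right triangle whose two legs are both edge_px long."""
    return 0.5 * edge_px * edge_px


def edge_for_area(area_px):
    """Leg length of the equal-legged right triangle with the given area."""
    return math.sqrt(2.0 * area_px)


def edge_sweep(start, stop, step=0.5):
    """Edge lengths from start to stop inclusive, in fixed steps."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


if __name__ == "__main__":
    print("area at 32 px edge:", right_triangle_area(32))
    print("edge at 525 px area:", edge_for_area(525))
    for e in edge_sweep(31.0, 33.0):
        print(e, right_triangle_area(e))
```

Under this assumption a 32-pixel edge corresponds to 512 pixels of area and 525 pixels of area to an edge of about 32.4 pixels, so the two figures in the text agree to within the stated "about". Half of the steps in a 0.5-pixel sweep land on fractional edge lengths, which is where partially covered pixels would come into play.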
| 3) |
Using on-board video memory for increased geometry performance
The tests we did for this section are nearly identical to the tests for section one. The main difference is that instead of using regular memory allocated by the CPU (malloc'ed), we used NVidia's Vertex Array Range (VAR) extension. The extension gave us a lot of problems, mainly because if the environment in which it is used is not set up correctly, the extension is not enabled and regular CPU memory is used instead. It is important to note that NVidia provides an interface to allocate this memory, and depending on the read frequency, write frequency, and priority parameters, it is possible to allocate either video or AGP memory. As indicated below, there are ranges of parameters that can be used to indicate which type you would prefer:
We provide one graph for the effects these parameters have on performance. We ran multiple tests within these limits with no significant change in performance. The graph below shows our results for colored, unlit, untextured triangles. The only noticeable result is that we are again seeing cyclic patterns based on triangle size in some cases. It is possible that this is simply noise and inconsistency in our data, but the patterns occur too regularly and consistently to be coincidence. Similar results were found in 1) as well, for the decreasing part of the graph, and in the previous graph. We ran our tests with edge lengths increasing in steps of 0.5 pixels, so it is possible that there are some cases where the card has a little more trouble handling the triangles. |
| 3) |
More data for our results in this area is available in our output directory. These results, however, are a little frustrating: it is hard to know if we are even getting the type of memory that we want. We have found that many different factors affect the ability of VAR to operate correctly, and many more hours could be spent trying to figure out these factors. We would suspect that one type of memory would be faster than the other, but we have been unable to differentiate between the two.
For our tests, we used the exact same vertex array as we did for 1). The indices for the array were not placed in on-card memory (we use on-card, video, and AGP interchangeably here, as we can see no difference). If you try using on-card memory for the indices, it will not work. The color array, the texture coordinate array, and the texture itself were all placed in on-card memory. Because we could place the texture in on-card memory, our textured triangle rates received a significant speedup compared to when the texture was in main memory. We find that in this case texturing is not the bottleneck as it was for 1). Texturing is now much faster, and lighting has become the bottleneck.
The maximum speedup seen when using the vertex array range as opposed to CPU-allocated memory was 1.995 for colored, unlit, untextured triangles, or almost exactly a two-fold increase. However, the speedup for textured triangles was 2.36. We found advertised triangle rates for the Ti4600 of almost 50 MTri/s. Interestingly enough, a benchmark we found had a maximum of just over 12 MTri/s when using lighting [link]. We have achieved this same rate but cannot come close to the advertised triangle rate. Our highest triangle rate was 24.39 MTri/s.
Below we provide two graphs of our results. The first is a graph independent of malloc'ed memory, and the second is a graph combining the results from 1) and these trials. For more results, please look in our directory. Our source code, in tarred, gzipped form, can also be found in the directory. |
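The speedup and peak-rate comparisons quoted above are simple ratios, sketched here for reference. The function names are hypothetical, and treating the advertised figure of "almost 50 MTri/s" as exactly 50 is an assumption made only for illustration.

```python
def speedup(rate_with_var, rate_with_malloc):
    """Ratio of the VAR triangle rate to the malloc'ed-memory triangle rate."""
    if rate_with_malloc <= 0:
        raise ValueError("rate_with_malloc must be positive")
    return rate_with_var / rate_with_malloc


def fraction_of_advertised(measured_mtri_s, advertised_mtri_s):
    """Measured triangle rate as a fraction of an advertised peak rate."""
    if advertised_mtri_s <= 0:
        raise ValueError("advertised_mtri_s must be positive")
    return measured_mtri_s / advertised_mtri_s


if __name__ == "__main__":
    # 24.39 MTri/s is our highest measured rate; 50.0 is an assumed round
    # value standing in for the advertised "almost 50 MTri/s".
    print(round(fraction_of_advertised(24.39, 50.0), 3))
```

Under that assumption our best rate is a little under half of the advertised figure.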