|
1
|
- Depth Complexity in Object-Parallel Graphics Architectures
- Pixel Merging for Object-parallel Rendering: A Distributed Snooping
Algorithm
- Parallel Volume Rendering Using Binary-Swap Compositing
- A Comparison of Parallel Compositing Techniques on Shared Memory
Multiprocessors
|
|
2
|
- Michael Cox, Pat Hanrahan - 1992
- Analyzes the average depth complexity of a scene to find where a better
algorithm for compositing pixels after object-parallel rendering might
come from.
- Such an algorithm is presented in the next paper.
- Model is based on graphics primitive size distribution.
- Use RenderMan and a software implementation of SGI’s GL to generate
traces of the pixels generated during the rendering process (I won’t go
into this much).
|
|
3
|
- Vertical Parallelism
- also known as instruction-level parallelism and pipelining
- Horizontal Parallelism
- also known as data parallelism, multiprocessing, and parallel
processing.
- Two Types of Horizontal Parallelism:
- Image Parallelism
- also known as screen-space parallelism
- each processor is assigned a region of the screen and must render all
primitives that would fall in that region - partial local frame
buffers are created
- Object Parallelism
- also known as image composition
- primitives are assigned round-robin and a full frame buffer is
created
- The Problem: Pixel Merge
- pixels rendered by all processors must be merged in the end
|
|
4
|
- The object parallel architecture has three phases:
- object assignment to processors (round robin)
- done by the front-end network
- rendering (done locally on individual processors)
- performs local hidden surface removal with z-buffering
- produces a local frame buffer
- pixel merging (done by the back-end network)
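A minimal sketch of the first two phases, assuming each primitive has already been rasterized into a list of `(x, y, z, color)` pixels. The `"pixels"` field and both function names are hypothetical, chosen only for illustration; smaller z is taken to be closer to the eye.

```python
import math

def assign_round_robin(primitives, n):
    """Front end: primitive i goes to processor i mod n."""
    return [primitives[i::n] for i in range(n)]

def render_local(primitives, width, height):
    """Render one processor's primitives into its own full-size frame buffer.

    Local hidden surface removal is a plain z-buffer test per pixel.
    """
    depth = [[math.inf] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for prim in primitives:
        for x, y, z, c in prim["pixels"]:
            if z < depth[y][x]:        # closer than what is stored: overwrite
                depth[y][x] = z
                color[y][x] = c
    return depth, color

# Three toy primitives on a 2x1 screen, two processors.
prims = [
    {"pixels": [(0, 0, 5.0, "red")]},
    {"pixels": [(0, 0, 2.0, "green"), (1, 0, 3.0, "green")]},
    {"pixels": [(0, 0, 1.0, "blue")]},
]
local = [render_local(p, 2, 1) for p in assign_round_robin(prims, 2)]
```

Each entry of `local` is a full frame buffer that is mostly empty, which is what makes the later pixel merge the interesting part.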
|
|
5
|
- depth at (x,y)
- the number of primitives in a graphics scene to be rendered to this
specific (x,y) position
- depth complexity distribution or depth complexity
- in a graphics scene, this is the distribution of depth over the whole
image (all (x,y) coordinates).
- will use subsets of the scene and refer to the depth complexity of
these subsets as well
- e.g. the 1/n of the primitives a processor may receive
|
|
6
|
- Scenes with larger depth complexity require more back-end pixel traffic.
- Scenes require more processing for z-buffering when there is a large
depth complexity because of increased comparisons.
- Removing depth complexity at the local frame buffer will reduce the
bandwidth required by the back-end network.
- when done by local processors, it can be done in parallel
- depending on the back-end network, there is a limited amount of
parallelism that can be achieved (other papers acknowledge these
algorithms)
|
|
7
|
|
|
8
|
- Assumptions:
- screen coverage by primitives is uniformly and randomly distributed
- primitives are uniformly and randomly distributed to processors
- A starting point (my attempt to explain the math):
- The generating function for the probability distribution
- The probability that a primitive r with size k renders to (x,y) is
$k/(An)$
- this is just the size of the primitive divided by the product of the
frame buffer area A and the number of processors n
- $H_k(v) = \left(1 - \frac{k}{An}\right) + \frac{k}{An}\,v$
- The coefficient of $v^0$ is the probability that r does not
render to (x,y)
- The coefficient of $v^1$ is the probability that r does render
to (x,y)
|
|
9
|
- Generating functions can be multiplied to express convolution of the
probability density functions (yeah I copied that from the paper).
- So the coefficient of $v^2$ in $H_k(v)^2$ is the probability that two
primitives of size k both render to (x,y)
- Moving along, $H_k(v)^{N_k} = \left[\left(1 - \frac{k}{An}\right) + \frac{k}{An}\,v\right]^{N_k}$
- $N_k$ is the number of primitives with size k
- So the probability that d of them are rendered to (x,y) is the
coefficient of $v^d$
- Now we can find the generating function of the probability density of
depth complexity by taking the product over all possible primitive
sizes: $G(v) = \prod_k H_k(v)^{N_k}$.
|
|
10
|
- The probability $p_d$ that an arbitrary pixel in an arbitrary
processor’s local frame buffer has depth exactly d is the coefficient
of $v^d$ in G(v).
- Since they tell you to look in the appendix for this one, so will I.
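The coefficients of G(v) can also be obtained numerically by multiplying the per-primitive factors one at a time. The sketch below is not the paper's appendix derivation, just a direct polynomial product under the two uniformity assumptions from the earlier slide. The function name and the `size_counts` mapping (size k to count N_k) are hypothetical.

```python
def depth_pdf(size_counts, area, n):
    """Coefficients of G(v), the product over k of H_k(v)^N_k.

    size_counts maps a primitive size k (in pixels) to N_k.
    Returns a list p where p[d] is the probability of depth exactly d.
    """
    p = [1.0]                          # G(v) = 1 before any primitive is added
    for k, count in size_counts.items():
        q = k / (area * n)             # one size-k primitive lands on (x, y) here
        for _ in range(count):
            nxt = [0.0] * (len(p) + 1)
            for d, pd in enumerate(p):
                nxt[d] += pd * (1.0 - q)   # primitive misses: depth stays d
                nxt[d + 1] += pd * q       # primitive hits: depth becomes d + 1
            p = nxt
    return p

# Made-up scene: 50 primitives of 100 pixels, 10 of 400 pixels,
# a 10,000-pixel frame buffer, 4 processors.
pdf = depth_pdf({100: 50, 400: 10}, area=10_000, n=4)
mean_depth = sum(d * pd for d, pd in enumerate(pdf))
```

The mean of the resulting distribution equals the total primitive area divided by An, which is a quick sanity check on any scene fed into it.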
|
|
11
|
- The point?
- With the graphics primitive size distribution for a given scene and
the number of processors, they can predict the depth complexity
probability density function.
- Then they can also predict the expected depth complexity and variance,
and, what they are really aiming at, the expected number of active
pixels per local frame buffer.
- I omit the rest of the math
- Some notes: $p_0$ is the probability that no primitive renders
to some pixel
- Pr[arbitrary pixel (x,y) is active] is simply $(1 - p_0)$
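Since p_0 is the constant term of the generating function, it is the product of (1 - k/(An)) over all primitives, and by linearity of expectation the expected number of active pixels in one local frame buffer is A(1 - p_0). A small sketch under the same uniformity assumptions, where the function name and the `size_counts` mapping are hypothetical:

```python
def expected_active_pixels(size_counts, area, n):
    """Expected number of active pixels in one processor's local frame buffer."""
    p0 = 1.0
    for k, count in size_counts.items():
        p0 *= (1.0 - k / (area * n)) ** count   # every size-k primitive misses
    return area * (1.0 - p0)

# Made-up scene; only the trend across processor counts matters here.
scene = {100: 50, 400: 10}
few = expected_active_pixels(scene, area=10_000, n=2)
many = expected_active_pixels(scene, area=10_000, n=32)
```

With the same scene, `many` comes out smaller than `few`: spreading primitives over more processors leaves each local frame buffer sparser.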
|
|
12
|
- Answers to some questions:
- A sparse frame buffer is one for which a small portion of the frame
buffer has had primitives rendered to it.
- it is pointless to send all of the data across the wire when you only
need to send a small portion of it
- active vs. inactive pixels denote which ones have had primitives
rendered to them
|
|
13
|
- As the number of processors increases, the depth complexity goes down
- with 32 processors, for the Wash.ht scene, the depth complexity was
never greater than 1
- A problem: not all graphics primitives are uniformly distributed
- cube - for small numbers of processors, the model does not fit, but as
the number of processors increases, the primitives are distributed
better, which breaks up spatial coherence, and the model fits much
better
|
|
14
|
- Screen coverage steeply declines when primitives are distributed across
many processors.
- most of the frame buffers represent unused hardware.
- there is a potential here for reduced cost
- moving only those pixels rendered could help a lot
- Average active sample prediction was very good.
|
|
15
|
- The fit of the model for uniprocessors can be good, but is much better
for multiprocessors where primitives are distributed widely resulting in
less spatial coherence among primitives for a processor.
- Again, the model predicts the number of active samples for a scene
pretty well.
- Z-Buffering pixels locally has much less of an advantage when there are
few processors as opposed to when there are many.
- Basically, there has to be some way to take advantage of the sparse
frame buffers in not transmitting so much pixel data.
|
|
16
|
- Michael Cox, Pat Hanrahan - 1993
- Analyzes and compares a few different pixel merging algorithms and
presents a new distributed snooping algorithm.
- A good part of the paper is based on the previous one.
- Different types of parallelism are described, as well as the
pixel-merge problem
- Describe use of RenderMan and GL for tracing pixels
- Define the same bunch of terms
|
|
17
|
- Each node is assigned a subset of A (screen area) and is responsible for
any pixels in that subset.
- either as the node renders them or after it is done rendering all
pixels.
- could z-buffer pixels locally so it didn’t have to send so many off
- simulations show that few pixels will be deleted locally, so there
really isn’t much of a benefit
- After a node finishes rendering pixels, it sends them to the nodes
responsible for them.
- Each node then z-buffers the pixels it receives and the images from each
node can be tiled.
|
|
18
|
- Computing the network traffic is straightforward
- $d_a A$ pixels must be sent, received and compared; this is the
total network traffic
- this is simply the average depth of the pixels multiplied by the screen
area
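A toy version of this region-based merge, to make the traffic count concrete. The horizontal-band ownership rule and all names are assumptions for illustration; the slides do not say how A is divided. Every rendered pixel is routed once, so the traffic counter ends up at the total number of rendered pixels, i.e. average depth times screen area.

```python
def owner(y, n, height):
    """Hypothetical ownership rule: node j owns the j-th horizontal band."""
    return y * n // height

def merge_by_region(rendered, height):
    """rendered[i] is the list of (x, y, z, color) pixels produced by node i."""
    n = len(rendered)
    inbox = [[] for _ in range(n)]
    traffic = 0
    for pixels in rendered:
        for x, y, z, c in pixels:
            inbox[owner(y, n, height)].append((x, y, z, c))
            traffic += 1                       # one pixel routed to its owner
    image = {}
    for box in inbox:                          # each owner z-buffers its region
        for x, y, z, c in box:
            if (x, y) not in image or z < image[(x, y)][0]:
                image[(x, y)] = (z, c)
    return image, traffic                      # regions are disjoint: tiling

rendered = [[(0, 0, 5.0, "r"), (0, 1, 1.0, "r")], [(0, 0, 2.0, "g")]]
image, traffic = merge_by_region(rendered, height=2)
```

Because the owned regions do not overlap, collecting every owner's result into one dictionary is the same as tiling the per-node images.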
|
|
19
|
- Each processor has an associated full frame buffer it renders to.
- After rendering, the first node sends its pixels to the second node,
which performs z-buffering. The second node then sends its pixels to the
third, etc.
- Parallelism is taken advantage of, so processors aren’t waiting while
others are busy sending and z-buffering (once they’ve sent their pixels
they are inactive though).
- The final node will have the final image.
|
|
20
|
- The total number of pixels transferred will be nA, where n is the number
of processors in the system.
- bandwidth required on each link is only A
- In aggregate, O(nA) processing is required to read, compare, send and
receive pixels
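The chain can be sketched sequentially as below. This ignores the overlap between stages that the real pipeline exploits and only shows the data flow and the pixel count. Frame buffers are modeled as flat lists of `(z, color)` pairs, which is an assumption for illustration, as are the function names.

```python
import math

def zmerge(a, b):
    """Pixel-wise z-buffer merge of two frame buffers; smaller z wins."""
    return [pa if pa[0] <= pb[0] else pb for pa, pb in zip(a, b)]

def pipeline_composite(buffers):
    """Node 0 sends to node 1, which merges and sends to node 2, and so on."""
    acc = buffers[0]
    pixels_sent = 0
    for nxt in buffers[1:]:
        pixels_sent += len(acc)        # a full frame (A pixels) crosses the link
        acc = zmerge(acc, nxt)
    return acc, pixels_sent

EMPTY = (math.inf, None)               # inactive pixel
buffers = [
    [(3.0, "a"), EMPTY],
    [(1.0, "b"), (5.0, "b")],
    [(2.0, "c"), (4.0, "c")],
]
final, sent = pipeline_composite(buffers)
```

The sketch counts the n-1 transfers between nodes, so it reports (n-1)A pixels, the same order as the nA quoted above; each individual link still carries only A.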
|
|
21
|
- Each processor renders to a local frame buffer all the primitives it has
received.
- n processors share a global frame buffer to which pixels can be
broadcast.
- Processor 1 broadcasts all of its pixels to the global frame buffer
while everyone else “listens” to the broadcast.
- keeps track of active vs. inactive pixels
- Processors that have an active pixel at an (x,y) that another processor
broadcasts do a z-buffer check and discard their own pixel if they need
to.
- The global frame buffer z-buffers the incoming pixels and ends up with
the final image.
|
|
22
|
- Answers to some questions:
- Pixels that are broadcast include (screen-x, screen-y,
eye-coordinate-z, red, green, blue, *parent-primitive)
- Calculating expected network traffic:
- The first processor sends its pixels with probability 1, the next sends
each pixel with probability 1/2, etc.
- so the expected traffic at an (x,y) with depth d is
$1 + 1/2 + \dots + 1/d$, the d-th harmonic number
- from the last paper, they use the distribution of depth, $p_d$, to
find the expected traffic from all locations
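The 1, 1/2, 1/3, ... pattern can be checked exactly for small depths: a pixel is broadcast only if it is closer than everything already broadcast at that (x,y), so the number of broadcasts is the number of running minima in the sequence of depths. The sketch assumes the depth order of the d pixels is random with respect to processor order. `expected_traffic` is my reading of "expected traffic from all locations" as A times the sum of p_d times the d-th harmonic number, which is an assumption rather than a formula quoted from the paper.

```python
from fractions import Fraction
from itertools import permutations

def harmonic(d):
    """d-th harmonic number 1 + 1/2 + ... + 1/d (0 for d = 0)."""
    return sum((Fraction(1, i) for i in range(1, d + 1)), Fraction(0))

def broadcasts(z_values):
    """Pixels actually broadcast at one (x, y) when processors snoop."""
    best = None
    sent = 0
    for z in z_values:                 # processors broadcast in turn
        if best is None or z < best:   # not yet beaten by an earlier broadcast
            sent += 1
            best = z
    return sent

def expected_broadcasts(d):
    """Exact average of broadcasts() over all orderings of d distinct depths."""
    perms = list(permutations(range(d)))
    return Fraction(sum(broadcasts(p) for p in perms), len(perms))

def expected_traffic(depth_pdf, area):
    """Assumed form: area * sum over d of p_d * harmonic(d)."""
    return area * sum(p * float(harmonic(d)) for d, p in enumerate(depth_pdf))
```

For d = 4 the enumeration and the harmonic number agree, and larger depths add less and less traffic per extra layer.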
|
|
23
|
- They bound the equation for Estimated Traffic:
- Basically, the idea is that you take the lower bound (floor) and use it
for the probability that the pixel won’t be rendered (1-alpha).
Similarly, you take the upper bound (ceiling) and use it for the
probability that the pixel will be rendered.
- Alpha is the (average depth complexity - the floor of the average depth
complexity). I don’t know exactly what that means.
- So, as the depth increases, the expected traffic grows at about the same
rate as the log of the average depth, because the harmonic numbers grow
so slowly.
|
|
24
|
|
|
25
|
- Uniprocessor z-buffering
- $d_a A$ pixels must be read and compared
- $\log(d_a) A$ pixels must be written
- total cost is $O(d_a A + \log(d_a) A)$
|
|
26
|
- $\log(d_a) A$ is better than $d_a A$.
- There is still room for improvement. They mention that hardware
modification for snooping support could be desirable.
- The Zinnia scene exceeded expected traffic (not shown) because the
largest primitive, the background, appeared first in the database and
was thus given to the first processor, so every pixel had to be written
to the global frame buffer and then overwritten where necessary by other
pixels.
- The higher the depth complexity the better the reduction in traffic.
|
|
27
|
- Ma, Painter, Hansen, Krogh - 1994
- Present an algorithm where sub-volumes are raytraced independently to a
local frame buffer, and resulting images are composited in parallel.
- Based on the fact that there are many available high-performance
workstations that could be used for rendering.
- no need for nodes to communicate while rendering data
- A Good Question: How does raytracing do lighting effects such as
shadows when other primitives are located on other processors?
- Also go over image- and data-space subdivision.
- Data-space subdivision is usually applied to distributed computing
while image-space subdivision generally involves a shared-memory
multiprocessing environment
|
|
28
|
- Don’t mention how raytracing works in a distributed environment where
nodes don’t have access to all primitives.
|
|
29
|
- Local rendering is done on each processor independently (again, that
question…)
- only rays that fall within the subvolume are cast
- use an identical view position
- Note that you must use consistent sampling locations so that “we can
construct the original volume.”
- you could see volume boundaries
- get two primitives on boundaries
|
|
30
|
- Need to composite images in back to front order
- Porter and Duff - over
- In the naïve process many processors become idle
- With Binary Swap every processor participates in every step of the
compositing process
|
|
31
|
- The final image is constructed by simply tiling the images from each
processor
- “Sparsity” can be exploited.
- don’t composite parts of the image that aren’t there
- processors keep a bounding box of their non-blank sub-image area
- is a bounding box the best way here?
- The image could be very sparse, but two pixels far apart would still
create a big bounding box
- my suggestion - would it be possible to keep an ordered list of
pixels and do a cross product on the two sets?
- Number of processors had to be a power of two
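A compact simulation of binary swap on 1D images may help here. At each stage a processor pairs with the partner whose index differs in one bit, keeps half of its current region, hands the other half to the partner, and composites what it receives with "over". The sketch assumes premultiplied `(color, alpha)` pixels and that processor i lies in front of processor i + 1; a real implementation would get the front-to-back order from the view position and the volume subdivision. It also skips the bounding-box optimization.

```python
def over(front, back):
    """Porter-Duff 'over' for premultiplied (color, alpha) pixels."""
    fc, fa = front
    bc, ba = back
    return (fc + (1.0 - fa) * bc, fa + (1.0 - fa) * ba)

def binary_swap(images):
    """images[i] is processor i's full-size partial image (list of pixels)."""
    n = len(images)
    if n == 0 or n & (n - 1):
        raise ValueError("number of processors must be a power of two")
    size = len(images[0])
    local = [list(img) for img in images]
    region = [(0, size)] * n           # span each processor is responsible for
    bit = 1
    while bit < n:
        new_local, new_region = [], []
        for i in range(n):
            partner = i ^ bit
            lo, hi = region[i]
            mid = (lo + hi) // 2
            keep = (lo, mid) if (i & bit) == 0 else (mid, hi)
            mine = list(local[i])
            for x in range(*keep):     # composite the half received from partner
                a, b = local[i][x], local[partner][x]
                mine[x] = over(a, b) if i < partner else over(b, a)
            new_local.append(mine)
            new_region.append(keep)
        local, region = new_local, new_region
        bit <<= 1
    final = [None] * size              # tile the n fully composited pieces
    for i in range(n):
        lo, hi = region[i]
        final[lo:hi] = local[i][lo:hi]
    return final
```

Every processor does work in every one of the log2(n) stages, and the region it handles halves each time, which is why the largest transfers happen early.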
|
|
32
|
- Mathematical Massaging:
- after rendering each processor has approximately $p\,n^{-2/3}$
pixels so the total number of pixels is $p\,n^{1/3}$
- don’t say where they come up with the exponent
- some number of pixels will be deleted in compositing
- massaging …. pixels transmitted
is $\le 2.43\,n^{1/3} p$
- the way they come up with it doesn’t exactly make sense to me (anyone
else?)
|
|
33
|
- Direct Send
- assign a subset of the image pixels to each processor
- pixels transmitted is $n^{1/3} p\,(1 - 1/n)$
- may require n(n-1) messages to be transmitted
- every processor may have to send pixels to each of the other n-1
processors, and this happens for each of the n processors, giving
n(n-1) messages
- Binary Swap
- each processor sends log(n) messages, so the total number of messages
sent is nlog(n)
- can use nearest-neighbor paths when they exist, early in the
compositing when the number of pixels transmitted is largest
|
|
34
|
- Projection method
- rays propagate through a 3D grid decomposition of the volume data
- rays propagate back to front through the volumes from one processor
holding a sub-volume to the processor holding the neighboring
sub-volume
- transmits $O(n^{1/3} p)$ pixels like the other two methods
- Must move through $n^{1/3}$ processor nodes so the message
latency grows by $O(n^{1/3})$ as opposed to (n-1) for direct send
and log(n) for binary swap
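To compare the three methods side by side, the figures quoted on these two slides can simply be tabulated. The helper below is hypothetical and only restates those formulas; it takes log(n) as base 2 (the number of binary-swap stages), assumes n is a power of two, and treats p as whatever the slides mean by p.

```python
def compositing_costs(n, p):
    """Message, pixel, and latency figures from the slides for n processors."""
    stages = n.bit_length() - 1        # log2(n) when n is a power of two
    cube_root = n ** (1.0 / 3.0)
    return {
        "direct_send": {
            "messages": n * (n - 1),
            "pixels": cube_root * p * (1.0 - 1.0 / n),
            "latency_steps": n - 1,
        },
        "binary_swap": {
            "messages": n * stages,
            "pixels_upper_bound": 2.43 * cube_root * p,
            "latency_steps": stages,
        },
        "projection": {
            "latency_steps": cube_root,
        },
    }

costs = compositing_costs(64, 512 * 512)
```

For 64 processors this gives 4032 messages for direct send against 384 for binary swap, while the pixel totals stay within a small constant factor of each other.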
|
|
35
|
- Used the CM-5, a 1024 node supercomputer
- each node had 32MB of RAM and 64-bit wide vector units (unused).
- Also experimented with a shared network of workstations.
- Didn’t implement other algorithms, so they have no basis for comparison
of timed results. Simulations do show that with more processors
rendering time goes down while the percentage of time spent on
distribution increases.
|
|
36
|
- They say that binary swap can be faster.
- It does keep all the processors busy during compositing.
- Networked workstations did not show linear speedup because they were on
a shared network, and the data is somewhat inconsistent. A LAN doesn’t
scale either: more nodes doesn’t mean more bandwidth.
|
|
37
|
- Reinhard and Hansen - 2000
- Mention the Parallel Pipeline
- Review Binary Swap
- Review Direct Send
|
|
38
|
- The basic idea is that the images and z-buffers to be composited are
divided into P sub-images, one for each processor.
- The sub-images flow around a “ring” of processors.
- Each processor composites its sub-image k with the same sub-image that
belongs to processor $p_k$.
|
|
39
|
- There are P-1 steps
- each processor composites N/P pixels during each iteration
- time complexity is O(N)
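The slides are terse about the ring, so the following is only one plausible reading: sub-image k starts at processor k, is passed to the next processor at each step, and is merged there with that processor's own copy of sub-image k. The ring direction, the strip-shaped sub-images, and the use of z-buffer compositing on `(z, color)` pixels are all assumptions for illustration.

```python
def zmerge(a, b):
    """Pixel-wise z-buffer merge; smaller z wins."""
    return [pa if pa[0] <= pb[0] else pb for pa, pb in zip(a, b)]

def parallel_pipeline(buffers):
    """buffers[i] is processor i's full image, a list of N (z, color) pixels."""
    P = len(buffers)
    N = len(buffers[0])
    bounds = [(k * N // P, (k + 1) * N // P) for k in range(P)]

    def piece(i, k):
        lo, hi = bounds[k]
        return buffers[i][lo:hi]

    # held[i] = (k, pixels): the partially composited sub-image k now at node i
    held = [(i, piece(i, i)) for i in range(P)]
    for _ in range(P - 1):                                 # P-1 steps
        received = [held[(i - 1) % P] for i in range(P)]   # pass around the ring
        held = [(k, zmerge(pix, piece(i, k)))              # about N/P pixels each
                for i, (k, pix) in enumerate(received)]
    final = [None] * N
    for k, pix in held:                                    # tile finished pieces
        lo, hi = bounds[k]
        final[lo:hi] = pix
    return final
```

After P-1 steps each sub-image has visited every processor, and each processor has composited roughly N/P pixels per step, which is where the O(N) figure comes from.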
|
|
40
|
|
|
41
|
|
|
42
|
|
|
43
|
|
|
44
|
|