|
fast affine texture mapping with constant slopes / tiled arbitrary size mapping in 5 inst. per pixel
whenever it comes to mapping textures with a constant gradient (e.g. linear triangle fillers, rotozoomers etc.)
there's an obvious trick how things can be speeded up drastically. as the term constant gradient implies
something remains, ehh, constant and hence doesn't need to be recalculated....speaking some more detailed it
means that an area gets filled by drawing horizontal rows, while along those rows you keep casting through a
texture that needs to be mapped by using linear interpolation of texture u and v coordinates.
if you look at a triangle that gets cut by lines representing a texturing gradient, there's one striking thing
about it, i.e. all those lines will pass along parallel. in particular this condenses in a constant series of
texture offsets, which hence can be buffered. the same situation appears in a constant perspective rotozoomer
fx., too. it becomes obvious realizing that each row you will have to do something like:
for x = 0 to xsize-1 {
*(screen++) = texture[u & $ff00|(v>>8)];
u += delta_u;
v += delta_v
}
as the coefficients delta_u and delta_v do not alter from one row to another (constant gradient!) the offset series
produced along the line will be the same for *every* row. so obviously, some cycles can be saved by just interpolating
the first row while buffering the obtained offsets, which then will be used for all the following rows. the same
works for triangles while there you need to interpolate the longest row in order to have a list for all the other
(shorter) lines, too, of course. this method is commonly referred to as "prestepping".
each line you will have to readjust the starting offset by adapting your texture pointer. however the technique
has two little disadvantages: first of all, you start losing some subpixel accuracy because the texture u,v
coordinates will be rounded into a texture offset at the beginning of each row to be mapped, whereas it this
wasn't the case in an ordinary mapper, though this can only be noticed on large triangles. secondly, as there
are two offset readjustments (one at the beginning of each row, one along the row) the index exceeds the texture
size from time to time, which is why you need to store your textures twice, one after another, so that you run into
the 2nd one in a situation like this.
now, again and as mentioned in the first tutorial about unrolling loops this sort
of buffering will only pay off as long as you code it in a cache friendly manner. additionally, i assume it
doesn't pay off on the 040 or 060, at all - but i guess, you don't need to care about issues like these on
accelerators, anyway :). just go ahead and use your old 5 instruction mapper if you write your code for those
cpus.
for cached machines (Mega STE, Falcon030, TT030) i'd recommend a 2 instruction type of mapper:
; (a0) -> buffered texture offsets
; (a1) -> texture
; (a2) -> start of current row instide the screen (<= 8bpp in this case)
; d0.l = 0 !
move.w (a0)+,d0 ; Fetch offset
move.b (a1,d0.l),(a2)+ ; Write out pixel
this can be optimized canceling out the expensive i8(an,dn.l) addressing mode. instead of prestepping absolute
offsets you can initialise your buffer with deltas, i.e. offsets realative to their predecessor. your innerloop
becomes very tight that way (i'd recommend you to use 32 bit deltas and adda.l on the TT030 though, because of
its full 32 bit data bus and in order to cancel out the adda.w longword extension costing 2 cycles):
; a0->Relative texture offsets
; a1->Texture
; a2->Screen (<= 8bpp in this case)
move.b (a1),(a2)+ ; Write out pixel
adda.w (a0)+,a1 ; Interpolate
in case you want to gain maximum performance in your triangle mapper designed for uncached 68000 machines
(ST, STE, Mega ST) this 1 instruction mapper, which maps a row from the right to the left, can't be beaten
(mind that you have to patch in the offsets into the code before you start mapping, once - this has to be done
"backwards", as well. how the overhead for unrolled loops could look like can be read here):
; a0->Texture
; a1->Screen (upper end of current polygon row)
rept xsize
move.b patch(a0),-(a1) ; Write out pixel
endr
entry
on the 030 a last little hint to push the speed is using a partly unrolled loop that maps blocks of at least
8 pixels (taking advantage of movem.w (an)+,d0-d7) plus an alignment loop which is executed before. you
can even expand the number of pixels being put by the unrolled loop by extending your movem.w line by some
address-registers still available but please remember the whole thing has to fit into the cache unless you
want to slow down things again.
this is becoming a little bit off-topic but i thought it should put it here. listen up, this will
go down in history ;). i'm going to reaveal a little trick i seem to have come up with first *proud* to realise
a tiled mapper for textures of arbitrary pixel dimensions, an arbitrary power of 2 at least, that requires 5 instructions per pixel
whereas "ordinary" 5 inst./pixel mappers are used to rely on a tile size of 256*256 pixels.
the idea is based upon letting the mapper interpolate your u/v coordinates within a single register in order
to be able to swiftly and inexpensively shift the accordant integer bitfields into proper positions.
assume a texture of 128*32 pixels, here do you go:
; a0->Texture
; a1->Screen
;
; Capitals: integer field
; Lower case: fractional field
;
; d0 = VVVVV:vvvvvvvvvvv UUUUUUU:uuuuuuuuu
; d1 = d/dx (VVVVV:vvvvvvvvvvv UUUUUUU:uuuuuuuuu)
; d4 = width-1
moveq.l #9,d2 ; Number of fractional bits for u
.loop move.l d0,d3 ; Load uv into scratch register
lsr.w d2,d3 ; VVVVV:vvvvvvvvvvv ---------UUUUUUU
rol.l #5,d3 ; :vvvvvvvvvvv----- ----UUUUUUUVVVVV
move.b (a0,d3.w),(a1)+ ; Voilá
add.l d1,d0 ; uv += duv/dx
dbra d4,.loop
as you can see coordinates must be stored left-justified so the mapper performs tiling on overflows in the upper bit. for u
overflows will be carried into the fractional part of v which results in an non-visual error though so this is no problem.
the routine has two little speed drawbacks, i.e. it uses a rol.l instruction which is a bit costy (6 cycles) on the 030 and the
and the other one is the .w indexing copying the texel over into the screen which tends to be slightly slower than .l
addressing on the 030, as well.
keeping in mind it provides tiled mapping for arbitrary textures the mapper's performance is still quite good and very useable even
on a 030 based system even from a democoder's point of view though.
there's also a tradeoff between texture size and accuracy but as known from usual mappers even cutting down to 8 fractional bits
is enough to produce decent results in practice.
feel free to use my method but pleas give me some credit if yor decide do so ;).
- 2002-2005 ray//.tscc. -
|
|