wolfenstein 3d

  unrolling loops
  c2p part I (st)
  c2p part II (st)
  avoiding c2p (st)
  interlacing (st)
  fat mapping
  3d pipeline
  portal rendering
  8bpp color mixing
  fixedpoint math
  blitter (mst/ste)
  sample replay (st)
  blitter gouraud (falc)
  blitter fading (falc)
  arbitrary mapping
  frustrum clipping etc.

  mc68000 math lib
  32 bytes sin-gen
  24 bit tga-viewer
  blitter example
  lz77 packer
  lz78 packer
  protracker replayer
fast affine texture mapping with constant slopes / tiled arbitrary size mapping in 5 inst. per pixel

whenever it comes to mapping textures with a constant gradient (e.g. linear triangle fillers, rotozoomers etc.) there's an obvious trick how things can be speeded up drastically. as the term constant gradient implies something remains, ehh, constant and hence doesn't need to be recalculated....speaking some more detailed it means that an area gets filled by drawing horizontal rows, while along those rows you keep casting through a texture that needs to be mapped by using linear interpolation of texture u and v coordinates.

if you look at a triangle that gets cut by lines representing a texturing gradient, there's one striking thing about it, i.e. all those lines will pass along parallel. in particular this condenses in a constant series of texture offsets, which hence can be buffered. the same situation appears in a constant perspective rotozoomer fx., too. it becomes obvious realizing that each row you will have to do something like:

for x = 0 to xsize-1 {
    *(screen++) = texture[u & $ff00|(v>>8)];
    u += delta_u;
    v += delta_v

as the coefficients delta_u and delta_v do not alter from one row to another (constant gradient!) the offset series produced along the line will be the same for *every* row. so obviously, some cycles can be saved by just interpolating the first row while buffering the obtained offsets, which then will be used for all the following rows. the same works for triangles while there you need to interpolate the longest row in order to have a list for all the other (shorter) lines, too, of course. this method is commonly referred to as "prestepping".

each line you will have to readjust the starting offset by adapting your texture pointer. however the technique has two little disadvantages: first of all, you start losing some subpixel accuracy because the texture u,v coordinates will be rounded into a texture offset at the beginning of each row to be mapped, whereas it this wasn't the case in an ordinary mapper, though this can only be noticed on large triangles. secondly, as there are two offset readjustments (one at the beginning of each row, one along the row) the index exceeds the texture size from time to time, which is why you need to store your textures twice, one after another, so that you run into the 2nd one in a situation like this.
now, again and as mentioned in the first tutorial about unrolling loops this sort of buffering will only pay off as long as you code it in a cache friendly manner. additionally, i assume it doesn't pay off on the 040 or 060, at all - but i guess, you don't need to care about issues like these on accelerators, anyway :). just go ahead and use your old 5 instruction mapper if you write your code for those cpus.

for cached machines (Mega STE, Falcon030, TT030) i'd recommend a 2 instruction type of mapper:

; (a0) -> buffered texture offsets
; (a1) -> texture
; (a2) -> start of current row instide the screen (<= 8bpp in this case)
; d0.l = 0 !

    move.w (a0)+,d0        ; Fetch offset
    move.b (a1,d0.l),(a2)+ ; Write out pixel

this can be optimized canceling out the expensive i8(an,dn.l) addressing mode. instead of prestepping absolute offsets you can initialise your buffer with deltas, i.e. offsets realative to their predecessor. your innerloop becomes very tight that way (i'd recommend you to use 32 bit deltas and adda.l on the TT030 though, because of its full 32 bit data bus and in order to cancel out the adda.w longword extension costing 2 cycles):

; a0->Relative texture offsets
; a1->Texture
; a2->Screen (<= 8bpp in this case)

    move.b (a1),(a2)+ ; Write out pixel
    adda.w (a0)+,a1   ; Interpolate

in case you want to gain maximum performance in your triangle mapper designed for uncached 68000 machines (ST, STE, Mega ST) this 1 instruction mapper, which maps a row from the right to the left, can't be beaten (mind that you have to patch in the offsets into the code before you start mapping, once - this has to be done "backwards", as well. how the overhead for unrolled loops could look like can be read here):

; a0->Texture
; a1->Screen (upper end of current polygon row)

     rept xsize
          move.b  patch(a0),-(a1) ; Write out pixel

on the 030 a last little hint to push the speed is using a partly unrolled loop that maps blocks of at least 8 pixels (taking advantage of movem.w (an)+,d0-d7) plus an alignment loop which is executed before. you can even expand the number of pixels being put by the unrolled loop by extending your movem.w line by some address-registers still available but please remember the whole thing has to fit into the cache unless you want to slow down things again.

this is becoming a little bit off-topic but i thought it should put it here. listen up, this will go down in history ;). i'm going to reaveal a little trick i seem to have come up with first *proud* to realise a tiled mapper for textures of arbitrary pixel dimensions, an arbitrary power of 2 at least, that requires 5 instructions per pixel whereas "ordinary" 5 inst./pixel mappers are used to rely on a tile size of 256*256 pixels.

the idea is based upon letting the mapper interpolate your u/v coordinates within a single register in order to be able to swiftly and inexpensively shift the accordant integer bitfields into proper positions.

assume a texture of 128*32 pixels, here do you go:

; a0->Texture
; a1->Screen
; Capitals: integer field
; Lower case: fractional field
; d0 =       VVVVV:vvvvvvvvvvv UUUUUUU:uuuuuuuuu
; d1 = d/dx (VVVVV:vvvvvvvvvvv UUUUUUU:uuuuuuuuu)
; d4 = width-1

          moveq.l #9,d2           ; Number of fractional bits for u

.loop     move.l  d0,d3           ; Load uv into scratch register
          lsr.w   d2,d3           ; VVVVV:vvvvvvvvvvv ---------UUUUUUU
          rol.l   #5,d3           ; :vvvvvvvvvvv----- ----UUUUUUUVVVVV
          move.b  (a0,d3.w),(a1)+ ; Voilá
          add.l   d1,d0           ; uv += duv/dx

          dbra    d4,.loop

as you can see coordinates must be stored left-justified so the mapper performs tiling on overflows in the upper bit. for u overflows will be carried into the fractional part of v which results in an non-visual error though so this is no problem.
the routine has two little speed drawbacks, i.e. it uses a rol.l instruction which is a bit costy (6 cycles) on the 030 and the and the other one is the .w indexing copying the texel over into the screen which tends to be slightly slower than .l addressing on the 030, as well.
keeping in mind it provides tiled mapping for arbitrary textures the mapper's performance is still quite good and very useable even on a 030 based system even from a democoder's point of view though.
there's also a tradeoff between texture size and accuracy but as known from usual mappers even cutting down to 8 fractional bits is enough to produce decent results in practice.

feel free to use my method but pleas give me some credit if yor decide do so ;).

- 2002-2005 ray//.tscc. -