ray's 16/32 bit atari page

chunk to planar conversion - part II

well, now that you've learned how to code a chunk to planar conversion, it's time to focus on optimizing the whole thing, since the method presented in the first part of this tutorial isn't very fast.
let's try to find unnecessary steps that are performed during the conversion. first thing to mention: on the 68000 shifting gets slower with an increasing shift count. every bit shifted costs another 2 cycles, i.e. 2*n additional cycles for a shift by n bits. looking at the code suggested in the first part you might notice that we can at least save the lsl.l #2,d0 by simply preshifting your textures, or whatever you want to draw to your buffer, by those 2 bits. so the conversion source becomes:

            rept    160/8
            moveq.l #0,d0                   ; convert the first 4 pixels
            move.w  (a0)+,d0
            lsl.l   #4,d0                   ; notice: operate this as .l
            or.w    (a0)+,d0                ; because it may exceed the
            move.l  (a1,d0.l),d0            ; wordsize by 2 bits
            movep.l d0,0(a2)
            moveq.l #0,d0                   ; the next 4 pixels
            move.w  (a0)+,d0
            lsl.l   #4,d0
            or.w    (a0)+,d0
            move.l  (a1,d0.l),d0
            movep.l d0,1(a2)
            move.l  (a2)+,(a3)+             ; linedoubling
            move.l  (a2)+,(a3)+
            endr
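a quick way to convince yourself that preshifting doesn't change the table offset: the little python model below (plain arithmetic, not 68000 code) builds the offset once with the extra lsl.l #2 and once from preshifted words. the helper names are made up for illustration, and the $0a0b word layout of the buffer is an assumption taken from the offset-format used later in this text.

```python
def pixel_word(a, b):
    """two 4 bit pixels stored in one buffer word as $0a0b (assumed format)."""
    return (a << 8) | b

def offset_with_shift(w1, w2):
    """combine two pixel words, then lsl.l #2 to address longword entries."""
    return ((w1 << 4) | w2) << 2

def offset_preshifted(p1, p2):
    """same thing, but the words were already shifted left by 2 in the buffer."""
    return (p1 << 4) | p2

# largest offset that can occur: needs 18 bits, hence the .l operations
MAX_OFFSET = offset_preshifted(pixel_word(15, 15) << 2, pixel_word(15, 15) << 2)
```

the largest offset is $3fffc, two bits more than fit into a word, which is why the lsl and the table access have to be done as .l.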

next cue - linedoubling: unfortunately no machine except the falcon can do this in hardware, meaning the cpu has to copy the data of every single row that has been converted.
copying large blocks of memory isn't very nice, so we have to streamline this task as much as possible. first suggestion: use the blitter if it's available. second suggestion: don't copy at all, rather use a raster interrupt on every 2nd scanline to emulate hw-linedoubling by readjusting the videoaddress-counter. this works on machines that don't support hardware-linedoubling but allow write access to the videoaddress-counter (i.e. ste, mega ste and tt). it is, however, unsuited for vga-monitors (tt-resolutions, at least) because of the high horizontal frequency, and it's no option as soon as you need timer b for another purpose. last suggestion for faster linedoubling: use movem.l!
look at the source above: at the moment we use move.l (a2)+,(a3)+ twice to copy the 4 words we've just converted. what you can do instead is copy larger blocks using movem.l with as many registers as possible right after the cpu is done with one row. movem.l tends to be interrupt-unfriendly, as it takes many cycles before the next isr is able to interrupt, so don't be surprised by shaky rasters if you use movem-linedoubling. on the other hand movem pays off with more than two registers, because the cpu doesn't need to reread the move opcode for every single transfer:

    ofs     set     0
            rept    160/8                   ; our ordinary c2p for one row
            moveq.l #0,d0
            move.w  (a0)+,d0
            lsl.l   #4,d0
            or.w    (a0)+,d0
            move.l  (a1,d0.l),d0
            movep.l d0,ofs(a2)
            moveq.l #0,d0
            move.w  (a0)+,d0
            lsl.l   #4,d0
            or.w    (a0)+,d0
            move.l  (a1,d0.l),d0
            movep.l d0,1+ofs(a2)
    ofs     set     ofs+8
            endr

            rept    4                       ; here's the movem-trick
            movem.l (a2)+,d0-d6/a4-a6       ; copy 4*40 bytes
            movem.l d0-d6/a4-a6,160-40(a2)  ; (a2)+ has already advanced a2 by 40,
            endr                            ; so the displacement stays constant

and don't forget to increment a2 by 160 each row. speaking of incrementing: to optimize this for uncached machines, unroll the c2p for the whole screen. this way you save the cycles for the lea x(a2),a2 every row as well as for the dbra.
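if the pointer arithmetic of the movem-linedoubling looks confusing, here's a small python model of it. it's only a sketch: the screen is a plain bytearray, a2 is an index into it, and the function name is made up. each step reads 40 bytes through a post-incrementing pointer and stores them 160 bytes below the place they were read from.

```python
ROW = 160    # bytes per st low-res scanline
CHUNK = 40   # 10 registers * 4 bytes per movem.l

def double_row(screen, a2):
    """copy the row starting at a2 to the row below; returns a2 afterwards."""
    for _ in range(ROW // CHUNK):
        regs = screen[a2:a2 + CHUNK]        # movem.l (a2)+,<10 registers>
        a2 += CHUNK                         # ...the postincrement
        dest = a2 + ROW - CHUNK             # displacement 160-40 relative to a2
        screen[dest:dest + CHUNK] = regs    # movem.l <10 registers>,160-40(a2)
    return a2

screen = bytearray(3 * ROW)
screen[0:ROW] = bytes(i % 251 for i in range(ROW))   # some test pattern
end = double_row(screen, 0)
```

after the four steps a2 points to the start of the doubled row, which is why another 160 have to be added to reach the next row to convert.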

next step: a completely different approach. yes, it's possible to do the whole thing quicker while even using less memory. this works by building the planar data for 4 pixels within 2 fetches. for this you use 2 16kb tables holding the plane-data for 2 pixels each, with one of the tables being preshifted.
the offset-format will be $0a0b<<2 (preshifted colors again!), which means that only 256 entries of each table will actually be used (2 pixels -> 16*16), but still the memory consumption is much lower than with the old method. it's hard to find the appropriate words, so let's look at some code again (a1, a2: c2p tables, a0: buffer, a3: screen):

    ofs     set     0                       ; quicker this time
            rept    160/16
            movem.w (a0)+,d0-d7             ; fetch 8*2 pixels (movem quickly !)
            move.l  (a1,d0.l),d0            ; the first 2 pixels
            or.l    (a2,d1.l),d0            ; the next 2 pixels
            move.l  (a1,d2.l),d1            ; and so on...
            or.l    (a2,d3.l),d1
            move.l  (a1,d4.l),d2            ; you can use .l addressing
            or.l    (a2,d5.l),d2            ; as it's faster on 030s
            move.l  (a1,d6.l),d3            ; (movem.w extends to .l)
            or.l    (a2,d7.l),d3
            movep.l d0,ofs(a3)              ; put 8*2 doublepixels
            movep.l d1,1+ofs(a3)
            movep.l d2,8+ofs(a3)
            movep.l d3,9+ofs(a3)
    ofs     set     ofs+16
            endr

            rept    4                       ; linedoubling
            movem.l (a3)+,d0-d6/a4-a6
            movem.l d0-d6/a4-a6,160-40(a3)  ; (a3)+ has already advanced a3 by 40,
            endr                            ; so the displacement stays constant
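to make the two-table idea a bit more tangible, here's a python model of how such tables could be built and combined. the table layout isn't spelled out in this part, so the sketch assumes one: every entry is a longword holding one byte per bitplane (plane 0 in the top byte, which is where movep.l puts it), every pixel is doubled horizontally, and the two tables differ only by a 4 bit shift inside each plane byte. all function names are made up for illustration.

```python
def plane_long(a, b, shift):
    """plane data for the pixels a and b, each doubled horizontally."""
    value = 0
    for plane in range(4):
        bits = 0
        if (a >> plane) & 1:
            bits |= 0b1100
        if (b >> plane) & 1:
            bits |= 0b0011
        value |= (bits << shift) << (24 - 8 * plane)
    return value

def build_table(shift):
    """256 used entries, addressed by the offset-format $0a0b<<2."""
    return {((a << 8) | b) << 2: plane_long(a, b, shift)
            for a in range(16) for b in range(16)}

TAB1 = build_table(4)   # fills the upper nibble of every plane byte
TAB2 = build_table(0)   # fills the lower nibble of every plane byte

def convert4(w1, w2):
    """move.l (a1,d0.l),d0 followed by or.l (a2,d1.l),d0"""
    return TAB1[w1] | TAB2[w2]

def decode(long_value):
    """read the 8 screen pixels back out of the 4 plane bytes."""
    pixels = []
    for x in range(8):
        colour = 0
        for plane in range(4):
            byte = (long_value >> (24 - 8 * plane)) & 0xFF
            colour |= ((byte >> (7 - x)) & 1) << plane
        pixels.append(colour)
    return pixels

w1 = ((3 << 8) | 10) << 2     # pixels 3 and 10, preshifted
w2 = ((15 << 8) | 0) << 2     # pixels 15 and 0, preshifted
result = decode(convert4(w1, w2))
```

the highest offset used is $3c3c, so each table fits into 16kb even though only 256 of its longwords are ever read.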

this can be sped up once more with a very system-unfriendly method, i.e. overwriting the bios/xbios by placing your tables at absolute addresses in low memory. this way you can get rid of the expensive d8(an,dn.l) addressing mode. imagine the second table being located at $1010.l and the first one interleaved 64 bytes further up, in between (saving another 16kbytes)... with preshifted pixeldata +$10 your c2p becomes:

    ; don't forget to add 16 to your preshifted
    ; textures in order to obtain the $1010 offset.

    ofs     set     0
            rept    160/8
            movem.w (a0)+,a1-a4             ; fetch 8 pixels
            move.l  64(a1),d0
            or.l    (a2),d0
            move.l  64(a3),d1
            or.l    (a4),d1
            movep.l d0,ofs(a5)
            movep.l d1,1+ofs(a5)
    ofs     set     ofs+8
            endr

            rept    4                       ; linedoubling
            movem.l (a5)+,d0-d6/a1-a3       ; 10 registers = 40 bytes
            movem.l d0-d6/a1-a3,160-40(a5)  ; (a5)+ has already advanced a5 by 40,
            endr                            ; so the displacement stays constant
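why can two tables share the same memory at all? with the offset-format $0a0b<<2 the 16 entries belonging to one value of a take up just 64 bytes, and the next group only starts 1024 bytes later. the python sketch below checks that a second copy of this pattern, moved up by 64 bytes, never touches the first one. the names are made up, and it assumes 4 byte entries as in the listings.

```python
BASE = 0x1010          # absolute address of the second table
INTERLEAVE = 64        # the first table sits 64 bytes further up

def entry_addresses(base):
    """all bytes occupied by the 256 longword entries of one table."""
    used = set()
    for a in range(16):
        for b in range(16):
            start = base + (((a << 8) | b) << 2)
            used.update(range(start, start + 4))
    return used

second = entry_addresses(BASE)
first = entry_addresses(BASE + INTERLEAVE)
overlap = first & second
top = max(first | second)
```

the highest byte touched stays below $8000, so the words that movem.w loads into the address registers are never sign-extended to a negative address.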

this is as fast as it gets for a byte c2p, but depending on what's being displayed, or rather on what changes per frame, you can speed things up by further extensions. imagine most of your buffer stays clear, or shows a fixed background picture or color with a moving 3d object drawn on top. this means that only small portions of your buffer actually change. reconverting the unchanged parts would be a plain waste of clockcycles. to avoid this you use a method called "delta-clearing" that flips between two chunkybuffers every frame. the contents of the buffer currently being converted are compared against the buffer that was displayed last time (it's best to use cmpm.l (ax)+,(ay)+ for 4 double-pixels). if the values are equal the 4 pixels haven't changed, hence you can skip the costly memory accesses and the movep.l in those cases. but remember that this only pays off if large areas of the buffer remain unchanged. if you update the whole screen every frame, the cmpm.l and beq.s would cost too many cycles.
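in python the comparison loop could be modelled roughly like this. it's a sketch only: the buffers are byte strings, convert is a stand-in for the table lookups and the movep.l, and the screen is simply a list with one slot per group of 4 pixels.

```python
def delta_convert(current, previous, convert, screen):
    """convert only the 4-byte groups that differ from the last frame.
    returns the number of groups that actually got converted."""
    converted = 0
    for i in range(0, len(current), 4):
        group = current[i:i + 4]
        if group == previous[i:i + 4]:     # cmpm.l (ax)+,(ay)+ / beq.s skip
            continue
        screen[i // 4] = convert(group)    # table lookups + movep.l
        converted += 1
    return converted

previous = bytes(16)                        # last frame: all background
current = bytes([0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0])
screen = [None] * 4
count = delta_convert(current, previous,
                      lambda g: int.from_bytes(g, "big"), screen)
```

here only one of the four groups differs, so only one conversion takes place and the other three screen slots are left alone.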

my last hint if you want to gain speed in your c2p conversion is to use a nibblebuffer where every 4 bits represent one pixel. this way you get faster fetches (4 pixels in one word!) as well as quicker bufferclearing, because the buffer is only half as big as in the byte case. on the other hand your mapper, or whatever fills your chunky-buffer, naturally becomes a bit more complicated. with this nibble-c2p you'd use your old 256kb table again, but this time without swapped data for the 2nd and 3rd pixel, because you get the pixels in exactly the right order if you fetch them as one word.
if you let movem.w fetch the pixels from the buffer, please mind that it sign-extends every word to a longword. so your table needs to be remapped to suit the negative offsets that occur with the movem.w pixelfetch, and after having remapped it don't forget to point your table register to the middle of the table (lea c2p_table+$20000,a1 instead of lea c2p_table,a1). this has a great advantage, though: the moveq.l #0,dn becomes unnecessary. the template would look like this:

    ofs     set     0
            rept    160/32
            movem.w (a0)+,d0-d7             ; fetch 8*4 pixels !
            lsl.l   #2,d0                   ; longword alignment
            lsl.l   #2,d1
            lsl.l   #2,d2
            lsl.l   #2,d3
            lsl.l   #2,d4
            lsl.l   #2,d5
            lsl.l   #2,d6
            lsl.l   #2,d7
            move.l  (a1,d0.l),d0
            move.l  (a1,d1.l),d1
            move.l  (a1,d2.l),d2
            move.l  (a1,d3.l),d3
            move.l  (a1,d4.l),d4
            move.l  (a1,d5.l),d5
            move.l  (a1,d6.l),d6
            move.l  (a1,d7.l),d7
            movep.l d0,ofs(a2)              ; put 8*4 doublepixels
            movep.l d1,ofs+1(a2)
            movep.l d2,ofs+8(a2)
            movep.l d3,ofs+9(a2)
            movep.l d4,ofs+16(a2)
            movep.l d5,ofs+17(a2)
            movep.l d6,ofs+24(a2)
            movep.l d7,ofs+25(a2)
    ofs     set     ofs+32
            endr

            rept    4                       ; linedoubling
            movem.l (a2)+,d0-d6/a4-a6
            movem.l d0-d6/a4-a6,160-40(a2)  ; (a2)+ has already advanced a2 by 40,
            endr                            ; so the displacement stays constant
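the remapping of the table is easier to see in a model than to describe. the python sketch below mimics the sign-extension of movem.w and reorders a table so that signed offsets, taken from the middle of the table, hit the right entries. the table contents are just stand-in numbers and the function names are made up. only the $20000 midpoint comes from the text above.

```python
TABLE_BYTES = 0x40000          # 65536 longword entries = 256kb
MIDDLE = TABLE_BYTES // 2      # lea c2p_table+$20000,a1

def sign_extend16(word):
    """what movem.w does to every word it loads into a register."""
    return word - 0x10000 if word & 0x8000 else word

def remap(table):
    """reorder the table for signed offsets relative to its middle."""
    remapped = [0] * len(table)
    for word, entry in enumerate(table):
        offset = sign_extend16(word) << 2          # lsl.l #2,dn
        remapped[(MIDDLE + offset) >> 2] = entry
    return remapped

def lookup(remapped, word):
    """move.l (a1,dn.l),dn with a1 pointing to the middle of the table."""
    return remapped[(MIDDLE + (sign_extend16(word) << 2)) >> 2]

table = list(range(0x10000))   # stand-in for the real plane data
remapped = remap(table)
```

in other words the two halves of the table simply swap places: the entries for $8000-$ffff end up in front of the entries for $0000-$7fff.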

well, things can be optimized even more in many cases by simply not using a c2p at all. i'm gonna cover this in the next tutorial.

- 2002 ray//.tscc. -