well, now that you've learned how to code a chunky to planar conversion, it's time
to focus on optimizing the whole thing, since the method presented in the first part
of this tutorial isn't very quick.
let's try to find unnecessary steps that are performed during the operation. first thing
to mention: on the 68000 a shift gets slower the more bits it shifts. every shift
instruction costs 2*n cycles on top of its base time, n being the number of bits shifted.
looking at the code suggested in the first part you might notice that we can save the
lsl.l #2,d0, at least, by simply preshifting your textures, or whatever you want to draw
to your buffer, by those 2 bits. so the conversion source becomes:

        rept    160/8
        moveq.l #0,d0                   ; convert the first 4 pixels
        move.w  (a0)+,d0
        lsl.l   #4,d0                   ; notice: operate this as .l
        or.w    (a0)+,d0                ; because it may exceed the
        move.l  (a1,d0.l),d0            ; wordsize by 2 bits
        movep.l d0,0(a2)
        moveq.l #0,d0                   ; the next 4 pixels
        move.w  (a0)+,d0
        lsl.l   #4,d0
        or.w    (a0)+,d0
        move.l  (a1,d0.l),d0
        movep.l d0,1(a2)
        move.l  (a2)+,(a3)+             ; linedoubling
        move.l  (a2)+,(a3)+
        endr

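the 256kb table itself comes from part one and isn't repeated here. as a reminder of
what the loop above indexes, here's a minimal sketch in c. it assumes the layout implied
by the movep.l stores: one byte per bitplane, plane 0 in the top byte, every chunky pixel
doubled horizontally. the names (planar4, table_offset, build_table) are made up for this
illustration and the bit order is an assumption, not the original table generator.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t c2p_table[65536];       /* 65536 longs = 256kb */

/* planar long for 4 chunky pixels (4 bits each), every pixel doubled.
   assumed layout: plane 0 in bits 31-24 ... plane 3 in bits 7-0,
   which is the order in which movep.l writes its bytes to memory. */
static uint32_t planar4(unsigned p0, unsigned p1, unsigned p2, unsigned p3)
{
    unsigned pix[4] = { p0, p1, p2, p3 };
    uint32_t out = 0;
    for (int plane = 0; plane < 4; plane++) {
        uint32_t byte = 0;
        for (int i = 0; i < 4; i++) {
            if ((pix[i] >> plane) & 1u)
                byte |= 0xC0u >> (2 * i);   /* two adjacent bits per pixel */
        }
        out |= byte << (8 * (3 - plane));
    }
    return out;
}

/* byte offset the loop computes from two preshifted buffer words:
   w0 = $0a0b<<2, w1 = $0c0d<<2  ->  (w0<<4) | w1 = $acbd<<2 */
static uint32_t table_offset(uint16_t w0, uint16_t w1)
{
    assert(w0 <= 0x3C3C && w1 <= 0x3C3C);
    return ((uint32_t)w0 << 4) | w1;
}

/* the index nibbles arrive as a,c,b,d, so the 2nd and 3rd pixel are
   swapped back when the entry is generated */
static void build_table(void)
{
    for (unsigned i = 0; i < 65536; i++) {
        unsigned a = i >> 12, c = (i >> 8) & 15, b = (i >> 4) & 15, d = i & 15;
        c2p_table[i] = planar4(a, b, c, d);
    }
}
```

the highest offset, table_offset($3c3c,$3c3c), is $3fffc. that's 18 bits, which is why
the lsl.l #4 has to be done as a longword operation.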
next cue - linedoubling: unfortunately, we can't do this by hardware on any machine
except for the falcon, meaning we have to make the cpu copy the data of every single
row that has been converted.
copying big blocks of memory isn't very nice, so we'll have to care about streamlining
this task as much as possible. first suggestion: use the blitter if it's available. second
suggestion: don't copy at all, rather use a raster interrupt on every 2nd scanline to
emulate hw-linedoubling by readjusting the videoaddress-counter. this works on machines
that don't support hardware-linedoubling but allow write access to the videoaddress-counter
(i.e. ste, mega ste and tt). it is, however, unsuited for vga-monitors (tt-resolutions, at
least) because of the high horizontal frequency, and it's out as soon as you need timer b
for another purpose.
last suggestion for faster linedoubling: use movem.l !
look at the source above: at the moment we use move.l (a2)+,(a3)+ twice to copy the 4 words
we've just converted. what you can do instead is copy larger blocks using movem.l with as
many registers as possible, right after the cpu is done with one row. movem.l tends to be
interrupt-unfriendly as it takes many cycles before the next isr is able to interrupt, so
please don't be surprised by shaky rasters if you use movem-linedoubling. on the other hand
using movem pays off with more than two registers because the cpu doesn't need to reread
the move opcode for every single move:

ofs     set     0
        rept    160/8                   ; our ordinary c2p for one row
        moveq.l #0,d0
        move.w  (a0)+,d0
        lsl.l   #4,d0
        or.w    (a0)+,d0
        move.l  (a1,d0.l),d0
        movep.l d0,ofs(a2)
        moveq.l #0,d0
        move.w  (a0)+,d0
        lsl.l   #4,d0
        or.w    (a0)+,d0
        move.l  (a1,d0.l),d0
        movep.l d0,1+ofs(a2)
ofs     set     ofs+8
        endr

        rept    4                       ; here's the movem-trick
        movem.l (a2)+,d0-d6/a4-a6       ; copy 4*40 bytes
        movem.l d0-d6/a4-a6,160-40(a2)  ; a2 has already advanced by 40, so the
        endr                            ; displacement stays constant (no ofs here)

and don't forget to increment a2 by 160 each row. speaking about incrementing:
to optimize this for uncached machines, unroll the c2p for the whole screen -
this way you save the cycles for the lea x(a2),a2 every row as well as the
dbra.
next step: a completely different approach. yes, it's possible to do the whole
thing quicker while even using less memory. it works by building the planar data
for 4 pixels within 2 fetches. for this you use 2 16kb tables keeping the plane-data
for 2 pixels each, with one of the tables being preshifted.
the offset-format will be $0a0b<<2 (preshifted colors again!), which means that only
256 entries (2 pixels -> 16*16) of each table are actually used - but still the
memory consumption is much lower than with the old method. it's hard to find the appropriate
words so let's look at some code again (a1,a2 : c2p tables, a0 : buffer, a3 : screen):

ofs     set     0                       ; quicker this time
        rept    160/16
        movem.w (a0)+,d0-d7             ; fetch 8*2 pixels (movem quickly !)
        move.l  (a1,d0.l),d0            ; the first 2 pixels
        or.l    (a2,d1.l),d0            ; the next 2 pixels
        move.l  (a1,d2.l),d1            ; and so on...
        or.l    (a2,d3.l),d1
        move.l  (a1,d4.l),d2            ; you can use .l addressing
        or.l    (a2,d5.l),d2            ; as it's faster on 030s
        move.l  (a1,d6.l),d3            ; (movem.w extends to .l)
        or.l    (a2,d7.l),d3
        movep.l d0,ofs(a3)              ; put 8*2 doublepixels
        movep.l d1,1+ofs(a3)
        movep.l d2,8+ofs(a3)
        movep.l d3,9+ofs(a3)
ofs     set     ofs+16
        endr

        rept    4                       ; linedoubling
        movem.l (a3)+,d0-d6/a4-a6
        movem.l d0-d6/a4-a6,160-40(a3)  ; a3 has already advanced by 40, so the
        endr                            ; displacement stays constant (no ofs here)

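to make the two-table idea concrete, here's a small c model of how the tables could be
generated and combined. it's a sketch under assumptions: the planar layout is the same
doubled-pixel, plane-0-first format the movep.l stores imply, the table behind a1 is
taken to be the unshifted one and the table behind a2 the preshifted one, and all names
(hi_table, lo_table, pair_offset, convert4) are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAIR_TABLE_LONGS (0x3C3C / 4 + 1)     /* highest offset is $0f0f<<2 */

static uint32_t hi_table[PAIR_TABLE_LONGS];   /* pixels 0,1 -> bits 7-4 of each plane byte */
static uint32_t lo_table[PAIR_TABLE_LONGS];   /* pixels 2,3 -> bits 3-0 (the preshifted one) */

/* planar long for one doubled pixel pair, left-aligned in each plane byte */
static uint32_t planar2(unsigned p0, unsigned p1)
{
    uint32_t out = 0;
    for (int plane = 0; plane < 4; plane++) {
        uint32_t byte = 0;
        if ((p0 >> plane) & 1u) byte |= 0xC0u;
        if ((p1 >> plane) & 1u) byte |= 0x30u;
        out |= byte << (8 * (3 - plane));
    }
    return out;
}

/* the offset format from the text: $0a0b<<2 */
static uint16_t pair_offset(unsigned a, unsigned b)
{
    assert(a < 16 && b < 16);
    return (uint16_t)(((a << 8) | b) << 2);
}

static void build_pair_tables(void)
{
    for (unsigned a = 0; a < 16; a++) {
        for (unsigned b = 0; b < 16; b++) {
            uint32_t v = planar2(a, b);
            hi_table[pair_offset(a, b) / 4] = v;
            lo_table[pair_offset(a, b) / 4] = v >> 4;  /* low nibbles are empty, so
                                                          no bits cross a byte border */
        }
    }
}

/* what move.l (a1,d0.l),d0 / or.l (a2,d1.l),d0 leaves in d0 */
static uint32_t convert4(unsigned p0, unsigned p1, unsigned p2, unsigned p3)
{
    return hi_table[pair_offset(p0, p1) / 4] | lo_table[pair_offset(p2, p3) / 4];
}
```

each table spans 15424 bytes in this model, which is where the "16kb" comes from, even
though only 256 longs per table are ever read.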
this again can be sped up with a very system-unfriendly method, i.e. overwriting
the bios/xbios area by placing your tables at absolute addresses in low memory. this way
you can get rid of the expensive d8(an,dn.l) addressing mode. imagine the second table
is placed at $1010.l and the first one is interleaved with it, 64 bytes further up, in
the gaps the second one leaves (saving another 16kbytes). with pixeldata that is
preshifted and has $10 added to every byte, your c2p becomes:

; don't forget to add 16 to your preshifted
; textures in order to obtain the $1010 offset.

ofs     set     0
        rept    160/8
        movem.w (a0)+,a1-a4             ; fetch 8 pixels
        move.l  64(a1),d0
        or.l    (a2),d0
        move.l  64(a3),d1
        or.l    (a4),d1
        movep.l d0,ofs(a5)
        movep.l d1,1+ofs(a5)
ofs     set     ofs+8
        endr

        rept    4                       ; linedoubling
        movem.l (a5)+,d0-d7/a1-a2       ; 10 registers = 40 bytes
        movem.l d0-d7/a1-a2,160-40(a5)  ; a5 has already advanced by 40, so the
        endr                            ; displacement stays constant (no ofs here)

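the address arithmetic behind this trick is easy to get wrong, so here's a little c check
of it. it's only a sketch of the layout as described above (second table at $1010, first
table 64 bytes higher, every buffer byte being (pixel<<2)+$10). the function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BASE 0x1010u   /* absolute address of the second table (from the text) */
#define INTERLEAVE 64u       /* the first table sits 64 bytes above it */

/* one buffer word: two pixels, each byte preshifted by 2 and biased by $10 */
static uint16_t buffer_word(unsigned a, unsigned b)
{
    assert(a < 16 && b < 16);
    unsigned hi = (a << 2) + 0x10;
    unsigned lo = (b << 2) + 0x10;
    return (uint16_t)((hi << 8) | lo);
}

/* or.l (a2),d0 : the buffer word itself is the address.
   the largest word is $4c4c, so the sign extension of movem.w is harmless. */
static uint32_t second_table_addr(uint16_t w)
{
    assert(w >= TABLE_BASE && w < 0x8000u);
    return w;
}

/* move.l 64(a1),d0 */
static uint32_t first_table_addr(uint16_t w)
{
    return (uint32_t)w + INTERLEAVE;
}

/* returns 1 if any long of the first table collides with one of the second */
static int tables_overlap(void)
{
    for (unsigned a = 0; a < 16; a++)
        for (unsigned b = 0; b < 16; b++)
            for (unsigned c = 0; c < 16; c++)
                for (unsigned d = 0; d < 16; d++) {
                    uint32_t s = second_table_addr(buffer_word(a, b));
                    uint32_t f = first_table_addr(buffer_word(c, d));
                    if (s < f + 4 && f < s + 4)
                        return 1;
                }
    return 0;
}
```

within every 1024-byte stride the second table only occupies 64 bytes, so the first table
fits right behind it without touching it.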
this is as fast as it gets for a byte c2p, but depending on what's being displayed,
or let's say changed per frame, you can speed this up by further extensions. imagine
most of your buffer remains clear, or shows a background picture or color with a moving
3d object put on top of it. this means that only small portions of your buffer will
actually change. reconverting the unchanged parts would be a plain waste of clockcycles.
to avoid this you use a method called "delta-clearing" that flips between two chunkybuffers
every frame. the contents of the buffer to be converted currently are compared
against the buffer that has been displayed last time (it's best to use cmpm.l (ax)+,(ay)+
for 4 double-pixels). if the values are equal this means that the 4 pixels
haven't changed - hence you can skip the costly memory accesses and the movep.l in
those cases. but remember this only pays off if large areas of the buffer remain
unchanged. if you update the whole screen every time, the cmpm.l and beq.s would
cost too many cycles.
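in c terms the compare-and-skip loop boils down to something like the following sketch.
the callback stands in for the table lookups and the movep.l of one 4-pixel group, and
every name in it (convert_fn, convert_changed, mark_group) is made up for illustration.
it is not a drop-in replacement for the unrolled assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* stand-in for the table lookups + movep.l of one 4-pixel group */
typedef void (*convert_fn)(size_t group, uint32_t pixels, void *user);

/* compares the buffer to convert against the one shown last time, one long
   (4 byte-pixels) at a time, and converts only the groups that differ.
   returns the number of groups that were converted. */
static size_t convert_changed(const uint32_t *cur, const uint32_t *prev,
                              size_t groups, convert_fn convert, void *user)
{
    size_t converted = 0;
    assert(cur && prev && convert);
    for (size_t i = 0; i < groups; i++) {
        if (cur[i] == prev[i])          /* cmpm.l + beq.s: nothing changed here */
            continue;
        convert(i, cur[i], user);
        converted++;
    }
    return converted;
}

/* example callback: just records which groups got converted */
static void mark_group(size_t group, uint32_t pixels, void *user)
{
    (void)pixels;
    ((unsigned char *)user)[group] = 1;
}
```

the return value makes the trade-off visible: if it comes close to the total number of
groups every frame, the compares are pure overhead and the plain loop is the better choice.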
my last hint if you want to gain speed in your c2p conversion is to use a nibblebuffer
where every 4 bits represent one pixel. this way you get faster
fetches (4 pixels in one word!) as well as quicker bufferclearing because the buffer is
only half as big as in the byte case. on the other hand your mapper, or whatever
fills your chunky-buffer, naturally becomes a bit more complicated.
with this nibble-c2p you'd use your old 256kb table again, but without swapped data
for the 2nd and 3rd pixel this time, because you get the pixels in perfectly
right order if you fetch them as a word.
if you let movem.w get the pixels from our buffer, please mind that it sign-extends every
word to a longword, so your table needs to be remapped to suit the negative offsets that
occur with the movem.w pixelfetch. after having remapped it, don't forget to point your
table register to the middle of this table (lea c2p_table+$20000,a1 instead of
lea c2p_table,a1). this has a great advantage: your moveq.l #0,dn is no longer needed.
the template would look like this:

ofs     set     0
        rept    160/32
        movem.w (a0)+,d0-d7             ; fetch 8*4 pixels !
        lsl.l   #2,d0                   ; longword alignment
        lsl.l   #2,d1
        lsl.l   #2,d2
        lsl.l   #2,d3
        lsl.l   #2,d4
        lsl.l   #2,d5
        lsl.l   #2,d6
        lsl.l   #2,d7
        move.l  (a1,d0.l),d0
        move.l  (a1,d1.l),d1
        move.l  (a1,d2.l),d2
        move.l  (a1,d3.l),d3
        move.l  (a1,d4.l),d4
        move.l  (a1,d5.l),d5
        move.l  (a1,d6.l),d6
        move.l  (a1,d7.l),d7
        movep.l d0,ofs(a2)              ; put 8*4 doublepixels
        movep.l d1,ofs+1(a2)
        movep.l d2,ofs+8(a2)
        movep.l d3,ofs+9(a2)
        movep.l d4,ofs+16(a2)
        movep.l d5,ofs+17(a2)
        movep.l d6,ofs+24(a2)
        movep.l d7,ofs+25(a2)
ofs     set     ofs+32
        endr

        rept    4                       ; linedoubling
        movem.l (a2)+,d0-d6/a4-a6
        movem.l d0-d6/a4-a6,160-40(a2)  ; a2 has already advanced by 40, so the
        endr                            ; displacement stays constant (no ofs here)

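the remapping for the sign-extended offsets is probably easiest to see in c. the sketch
below assumes the same doubled-pixel planar layout as the byte version (plane 0 in the top
byte of each long) with pixel 0 sitting in the top nibble of the buffer word. the helper
names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t c2p_table[65536];       /* 256kb, one long per 4-pixel word */

/* planar long for the 4 nibble-pixels of one buffer word, pixels doubled */
static uint32_t planar4(uint16_t w)
{
    uint32_t out = 0;
    for (int plane = 0; plane < 4; plane++) {
        uint32_t byte = 0;
        for (int i = 0; i < 4; i++) {
            unsigned pixel = (w >> (12 - 4 * i)) & 15u;
            if ((pixel >> plane) & 1u)
                byte |= 0xC0u >> (2 * i);
        }
        out |= byte << (8 * (3 - plane));
    }
    return out;
}

/* what movem.w leaves in a 32-bit register */
static int32_t sign_extend16(uint16_t w)
{
    return (w & 0x8000u) ? (int32_t)w - 65536 : (int32_t)w;
}

/* the entry for word w is stored at its signed index, counted from the
   middle of the table, so words >= $8000 end up in the lower half */
static void build_remapped(void)
{
    for (uint32_t w = 0; w < 65536; w++)
        c2p_table[32768 + sign_extend16((uint16_t)w)] = planar4((uint16_t)w);
}

static uint32_t lookup(uint16_t w)
{
    const uint32_t *mid = c2p_table + 32768;    /* lea c2p_table+$20000,a1 */
    assert(mid + sign_extend16(w) >= c2p_table);
    return mid[sign_extend16(w)];   /* the array indexing does the lsl.l #2 */
}
```

pointing the base register at the middle is what lets the negative half of the offsets
land inside the table instead of below it.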
well things can be optimized even more in many cases, by simply not using a c2p.
i'm gonna cover this in the next tutorial.
- 2002 ray//.tscc. -