atari's 16- and 32-bit machines are typically equipped either with a motorola
68000 cpu (st, mega st, ste, mega ste) or with an improved 68030 cpu (tt, falcon).
i'll try to cover the second kind later on, but let's start off with the st
series and its mc68000 cpu.
there's something special about this processor, which appeared absolutely
ordinary at the time the st was released - it didn't have any cache.
a cache can be considered an array of memory the cpu can access much quicker
than the system's ram. there are multiple kinds of cache: a really quick on-chip
cache, called "level 1 cache", and an external but usually larger one commonly
referred to as "level 2 cache".
the falcon's and tt's 68030 cpu features a 256-byte level 1 cache for data and
for instructions each; the only atari machine that comes with a 2nd-level cache
is the mega ste, where it is sized 16kb.
well, time to focus on the purpose of the cache and on the topic of this document,
eh ? - yeah, right... i forgot to name the cache's purpose, but i guess you already
figured that it's meant to speed things up.
as cache can be accessed quicker than system memory, it is useful to keep
code-segments which are repeated (i.e. loops) in the cache instead of re-reading
them from ram as many times as the loop is repeated.
this is exactly what instruction-caching is all about, but managing the cache is
done automatically - all you can do is switch it on or off completely. so if you
own a falcon, a tt or a mega ste, test some programs with the cache switched
on/off and you'll most likely notice a major speed difference.
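just to illustrate that on/off switch: on the 68030 the caches are controlled via
the cacr register. a minimal sketch - the control value $0909 and running the code
via the xbios supexec() call are my assumptions, as movec is a privileged
instruction:

```m68k
; hedged sketch: toggling the 68030 caches via the cacr register.
; movec is privileged, so this must be executed in supervisor mode
; (e.g. passed to the xbios supexec() function).
cache_on
	move.l	#$00000909,d0		; clear + enable data- and instruction-cache
	movec	d0,cacr
	rts

cache_off
	moveq.l	#0,d0			; disable both caches
	movec	d0,cacr
	rts
```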
back to the topic - unrolling loops, wasn't it ? yes. facing the fact that neither
the st, nor the ste, nor the mega st have any cache, we'll try to optimize our loops
as much as we can, which is done by preventing the cpu from reading the
dbcc-instruction, mainly used as "dbra".
the 68000 dislikes this instruction because executing it means decrementing a
data-register and jumping conditionally according to the result of that
decrement. so look at this block of code:
	moveq.l	#64-1,d0		; repeat this 64 times
loop
	move.w	(a0)+,d1		; some weird example code
	move.b	(a1,d1.w),(a2)+
	dbra	d0,loop			; nasty dbra

our first step will be eliminating the dbra-instruction, which is quite easy
in a static-loop case like the one above. simply use your assembler's
repeat-directive to generate a codeblock instead, as you know your loop will
always execute 64 times (notice: i use hi soft's devpac):

	rept	64			; repeat this 64 times
	move.w	(a0)+,d1		; our sample-code again
	move.b	(a1,d1.w),(a2)+
	endr				; and we killed the dbra

this technique has one major disadvantage - in general it costs a lot of memory.
using unrolled loops will therefore always be a tradeoff between cycle-killing and
memory-saving. if your program is supposed to keep large parts of unrolled loops,
it is advisable to let your program generate the code-blocks into a static bss or
an allocated memory buffer (malloc, mxalloc on the falcon/tt - if present, use
fast-ram there !) and to call them later on. generating the code might be done
like this:
	section	text
generate_code
	lea.l	code,a1
	move.w	#repeat-1,d0
.block
	lea.l	template(pc),a0
	move.w	#code_length/2-1,d1
.one_block
	move.w	(a0)+,(a1)+
	dbra	d1,.one_block
	dbra	d0,.block
	move.w	#$4e75,(a1)		; don't forget the "rts"
	rts

template
	move.w	(a0)+,d0		; sample-code
	move.w	(a1,d0.w),(a2)+
code_length	=	*-template

	section	bss
code	ds.b	repeat*code_length+2	; space for code+"rts"

to call the generated code simply do:

	jsr	code			; the final rts brings you back

now, streamlining your loops opens the door to further optimizations:
looking at the sample-code above you might notice that this is a characteristic
situation when it comes to demo-coding.
assuming a0 points to a table with word offsets, a1 to a byte-texture and a2 to a
byte-screenbuffer, this block of code could be interpreted as a common
2-instruction mapper, which will be described in another doc, or as part of a
typical offset-effect like a static tunnel or something.
if you consider this a tunnel-plotter or a bumpmapper, you'd know exactly which
texture- or lightmap-offsets need to be used for every pixel on the screen, making
it easy to use a faster addressing-mode than d8(an,rn), at least on the 68000 -
i.e. the d16(an) addressing-mode.
instead of letting the cpu do an additional memory-access to fetch the
texture-offset, you directly patch those offsets into the instruction-code of the
move.b offset(a1),(a2)+ which will be used then:
	section	text
generate_code
	lea.l	code,a1
	lea.l	offsets,a2		; offset-table
	move.w	#repeat-1,d0
.block
	move.w	template(pc),(a1)+	; patch op-code
	move.w	(a2)+,(a1)+		; patch the offset
	dbra	d0,.block
	move.w	#$4e75,(a1)		; don't forget to close the door...
	rts

template
	move.w	-1(a1),(a2)+		; faster addressing-mode
code_length	=	*-template

	section	bss
code	ds.b	repeat*code_length+2	; space for code+"rts"

fine... but what about loops with a variable counter,
e.g. a 2-instruction mapper like this:
	move.w	counter,d0		; repeat this n times
	subq.w	#1,d0
loop
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	dbra	d0,loop			; dbra... that's what we want to get rid of

in this case you need to know the possible maximum of your counter value
in order to unroll the loop. let's assume we're using a 2-instruction mapper
with a maximum rowlength of 160 pixels.
to control the number of segments being executed, you jump into your unrolled
loop "backwards" according to the value of your counter. if you want to avoid
costly multiplications while calculating your jump-offset, try to create
loop-blocks with a codelength divisible by 2, 4, 8, 16... in order to use shifts
instead. otherwise you can replace your constant muls.w #-codelength,dn by some
shifts and adds like i did, or you just use a jumptable - let's visualize:
	move.w	counter,d0		; n times, again
	move.w	d0,d1			; faster than muls.w #-6,d0
	lsl.w	#3,d1			; n*8
	add.w	d0,d0			; n*2
	sub.w	d1,d0			; n*2-n*8 = -6*n
	lea.l	entry(pc),a3
	jmp	(a3,d0.w)		; jump, don't loop :)

	rept	160			; every block is 6 bytes long
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	endr				; voila - no dbra
entry					; jump in from here

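the jumptable variant isn't shown in the listing, so here's a hedged sketch of how
it might look - the table layout, the labels jumptab/cnt and the use of devpac's
set-directive are my assumptions, and which variant is faster depends on the
block-length and the extra memory-access:

```m68k
; hedged sketch: replacing the shift/add sequence by a jumptable
; holding the precalculated offsets -6*n for n = 0..160.
	lea.l	jumptab(pc),a3
	move.w	counter,d0		; n
	add.w	d0,d0			; n*2, as the table holds words
	move.w	(a3,d0.w),d0		; fetch -6*n from the table
	lea.l	entry(pc),a3
	jmp	(a3,d0.w)		; same backwards jump as before

cnt	set	0
jumptab	rept	161			; one entry per possible counter value
	dc.w	-6*cnt
cnt	set	cnt+1
	endr
```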
on uncached 68000s it's recommended to use a 1-instruction mapper instead,
avoiding even the move.w (a0)+,d1 by combining a codeblock that gets
modified depending on the texture-gradient with a jump according to the current
row of the triangle, in a linear texturemapper's case - but that's a topic for
another tutorial.
nonetheless it's quite interesting to unroll your loops when optimizing newschool
demo-stuff for the 68000: for example, a 160x100 fullscreen rotozoomer can be done
in about 3 vbls on a plain st using a combination of the stuff described here and
in the 'avoiding c2p' doc - 4 bitplanes, no overscan, of course.
to finish this tutorial i'd finally like to keep my promise and say a word about
cached machines as well:
on quick machines like the falcon or the tt, unrolling your loops doesn't pay off
very much. completely unrolling them will most likely slow things down, as the
loop-segment won't fit into the cache anymore - the purpose of the cache would be
defeated in the first place, since your loop would need to be read from memory
again and again.
in order to avoid this problem, size your loops as large as they can be, but
smaller than 256 bytes, in order to take advantage of the cache.
don't generate codetables just to save memory-fetches (e.g. for a tunnel-effect)
as long as you're coding for the 68030 - rather keep your whole program inside the
fast-ram (as long as it's available, and as long as you allocate your screenbuffer
in st-ram, of course) and use movem.w to fetch your offsets plus
move.b/w (an,rn.l),(ax)+ to write the pixels (notice: on the 030, (an,rn.l) is
faster than (an,rn.w)) within a loop <= 256 bytes. don't forget the 030 does
data-caching, too.
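such an 030-friendly inner loop might look like the following sketch - the
register assignment (a0 offset-table, a1 texture, a2 screenbuffer, d7 = number of
4-pixel groups minus 1) is my assumption, and note that movem.w sign-extends, so
the offsets must stay below 32768:

```m68k
; hedged sketch: fetching four word-offsets at once with movem.w
; and writing the pixels via the (an,rn.l) addressing-mode.
loop
	movem.w	(a0)+,d0-d3		; 4 offsets, sign-extended to .l
	move.b	(a1,d0.l),(a2)+		; (an,rn.l) beats (an,rn.w) on the 030
	move.b	(a1,d1.l),(a2)+
	move.b	(a1,d2.l),(a2)+
	move.b	(a1,d3.l),(a2)+
	dbra	d7,loop			; still way below 256 bytes
```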
concerning the mega ste, i have to admit that i actually never owned nor coded on
this machine. hence i can only suppose that fitting your loops to the 16kb l2
cache is the best thing you can do.
one last trick for cached machines is to partly unroll the loops to fit them into
the cache and save some dbra's at the same time, which works like this:
	move.w	counter,d0		; split the number into:
	move.w	d0,d2
	lsr.w	#5,d2			; a magnitude (n/32)
	andi.w	#31,d0			; a remainder (n mod 32)
	move.w	d0,d1			; for the jump
	lsl.w	#3,d1			; same as above
	add.w	d0,d0
	sub.w	d1,d0
	jmp	(entry,pc,d0.w)		; no problem on the 030
loop
	rept	32			; this still fits into the cache
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	endr
entry
	dbra	d2,loop			; not a shame here cause we got our cache

- 2002 ray//.tscc. -