unrolling loops on the 68000/68030
 
atari's 16 and 32bit machines are typically equipped either with a motorola 68000 (st, mega st, ste, mega ste) or with an improved 68030 cpu (tt, falcon). i'll try to cover the second kind later on, but let's start off with the st series and its mc68000 cpu.
there's something special about this processor, which appeared absolutely ordinary at the time the st was released - it doesn't have any cache.
cache can be considered an array of memory the cpu can access much quicker than the system's ram. there are multiple kinds of cache: a really quick on-chip cache called "level 1 cache", and an external but usually larger one known as "level 2 cache".
the falcon's and tt's 68030 cpu features a 256 byte level 1 cache for instructions and another one for data; the only atari machine that comes with a 2nd level cache is the mega ste, where it is 16kb in size.
well, time to focus on the purpose of cache and on the topic of this document, ehh ? - yeah, right...i forgot to name the cache's purpose, but i guess you already supposed that it's meant to speed things up.
as cache can be accessed quicker than system memory, it is useful to keep code segments which are executed repeatedly (i.e. loops) in the cache instead of re-reading them from ram on every iteration.
this is exactly what instruction-caching is all about, but managing the cache is done automatically - all you can do is switch it on or off completely. so if you own a falcon, a tt or a mega ste, test some programs with the cache switched on and off and you'll most likely notice a major speed difference.
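
just as an illustration (this isn't part of the original text): on the 68030 the caches are controlled through the cacr register, which can only be written in supervisor mode, so switching them on by hand could look roughly like the sketch below - the routine is run via the xbios supexec() call, and the cpu type has to be set to 68030 in your assembler (e.g. "opt p=68030" in devpac).

switch_caches
	pea	set_cacr(pc)		; routine to execute in supervisor mode
	move.w	#38,-(sp)		; xbios 38 = supexec()
	trap	#14
	addq.l	#6,sp
	rts

set_cacr
	move.l	#$00000909,d0		; CD+ED+CI+EI: clear and enable both caches
	movec	d0,cacr			; privileged instruction, 68020+ only
	rts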
back to the topic - unrolling loops, wasn't it ? yes, so facing the fact that neither the st, nor the ste, nor the mega st has any cache, we'll try to optimize our loops as much as we can, which mainly means preventing the cpu from having to execute the dbcc instruction, usually written as "dbra".
the 68000 dislikes this instruction because executing it means decrementing a data register and branching conditionally according to the result of that decrement - a taken dbra alone costs 10 cycles on a plain 68000. so look at this block of code:



	moveq.l	#64-1,d0		; repeat this 64 times
loop
	move.w	(a0)+,d1		; some weird example code
	move.b	(a1,d1.w),(a2)+
	dbra	d0,loop			; nasty dbra
our first step will be eliminating the dbra instruction, which is quite easy in a static case like the one above. simply use your assembler's repeat directive to generate a code block instead, as you know your loop will always execute 64 times (notice: i use hi soft's devpac):

	rept	64			; repeat this 64 times
	move.w	(a0)+,d1		; our sample-code again
	move.b	(a1,d1.w),(a2)+
	endr				; and we killed the dbra

this technique has one major disadvantage - it generally costs a lot of memory. so using unrolled loops will always be a tradeoff between saving cycles and saving memory. if your program is supposed to contain large unrolled loops, it is wise to let it generate the code blocks into a static bss buffer or an allocated memory buffer (malloc, or mxalloc on the falcon/tt - if present, use fastram there !) and call them later on. generating the code might be done like this ("repeat" holds the number of unrolled blocks):


	section	text

generate_code
	lea.l	code,a1
	move.w	#repeat-1,d0
.block
	lea.l	template(pc),a0
	move.w	#code_length/2-1,d1
.one_block
	move.w	(a0)+,(a1)+
	dbra	d1,.one_block

	dbra	d0,.block

	move.w	#$4e75,(a1)		; don't forget the "rts"
	rts

template
	move.w	(a0)+,d0		; sample-code
	move.w	(a1,d0.w),(a2)+
code_length	=	*-template

	section	bss
code	ds.b	repeat*code_length+2	; space for code+"rts"


to call the generated code simply do:

  jsr code

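in case you'd rather use an allocated buffer than the bss (as mentioned above), the mxalloc call on a falcon or tt could look roughly like this - just a sketch, the size expression is the one from the example and the labels are made up; mxalloc only exists on newer tos versions, on a plain st stick to malloc (gemdos $48):

get_buffer
	move.w	#3,-(sp)		; mode 3: prefer tt-/fast-ram, st-ram otherwise
	move.l	#repeat*code_length+2,-(sp)	; room for the code + "rts"
	move.w	#$44,-(sp)		; gemdos $44 = mxalloc()
	trap	#1
	addq.l	#8,sp
	tst.l	d0
	ble.s	.failed			; 0 or negative means no memory
	move.l	d0,a1			; a1 = buffer, generate the code into it
	rts
.failed
	rts				; handle this properly in real code
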
now, streamlining your loops extends to further optimizations:
looking at the sample code above you might notice that this is a characteristic situation when it comes to demo coding. assuming a0 points to a table with word offsets, a1 to a byte texture and a2 to a byte screenbuffer, this block of code could be interpreted as a common 2 instruction mapper, which will be described in another doc, or as part of a typical offset effect like a static tunnel or something.
if you consider this a tunnel plotter or a bumpmapper, you know exactly which texture or lightmap offsets need to be used for every pixel on the screen, making it easy to use a faster addressing mode than d8(an,rn) - at least on the 68000 - i.e. the d16(an) addressing mode.
instead of spending an additional memory access on fetching the texture offset, you let the cpu patch those offsets directly into the instruction code of the move.b offset(a1),(a2)+ that will be used instead:


	section	text

generate_code
	lea.l	code,a1
	lea.l	offsets,a2		; offset-table
	move.w	#repeat-1,d0
.block
	move.w	template(pc),(a1)+	; patch op-code
	move.w	(a2)+,(a1)+		; patch the offset
	dbra	d0,.block

	move.w	#$4e75,(a1)		; don't forget to close the door...
	rts

template
	move.w	-1(a1),(a2)+		; faster addressing-mode
code_length	=	*-template

	section	bss
code	ds.b	repeat*code_length+2	; space for code+"rts"


fine...but what about loops with a variable counter ?
like in a 2 instruction mapper effect:




	move.w	counter,d0		; repeat this n times
	subq.w	#1,d0			; dbra needs n-1
loop
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	dbra	d0,loop			; that's what we want to get rid of


in this case you need to know the possible maximum of your counter value in order to unroll the loop. let's assume we're using a 2 instruction mapper with a maximum row length of 160 pixels.
to control the number of segments being executed you jump into your unrolled loop "backwards" according to the value of your counter. if you want to avoid a costly multiplication while calculating your jump offset, try to create loop blocks with a code length divisible by 2,4,8,16... so you can use shifts instead; otherwise you can replace the constant muls.w #-codelength,dn by some shifts and adds like i did, or you just use a jumptable (sketched after the example below) - let's visualize:















	move.w	counter,d0		; n times, again

	move.w	d0,d1			; faster than muls.w #-6,d0
	lsl.w	#3,d1			; n*8
	add.w	d0,d0			; n*2
	sub.w	d1,d0			; n*2-n*8 = -6*n

	lea.l	entry(pc),a3
	jmp	(a3,d0.w)		; jump, don't loop :)

	rept	160			; every block is 6 bytes long
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	endr				; voila - no dbra
entry					; jump in from here


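and just to complete the picture, the jumptable variant mentioned above could look roughly like this - a sketch only, not from the original text: one longword per possible counter value (0..160), filled once at startup, so the entry address is fetched with a single indexed read instead of being calculated:

	move.w	counter,d0
	add.w	d0,d0
	add.w	d0,d0			; counter*4, one longword per entry
	lea.l	jumptable(pc),a3
	move.l	(a3,d0.w),a3		; fetch the precalculated address
	jmp	(a3)			; and jump right into the unrolled block

	rept	160
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	endr
entry

make_table				; call this once at startup
	lea.l	jumptable(pc),a0
	lea.l	entry(pc),a1
	moveq	#0,d0
.fill
	move.l	a1,(a0)+		; target address for counter value d0
	subq.l	#6,a1			; one 6 byte block further back
	addq.w	#1,d0
	cmpi.w	#161,d0
	bne.s	.fill
	rts

jumptable
	ds.l	161			; entry-6*n for n = 0..160
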
on uncached 68000s it's recommended to use a 1 instruction mapper, avoiding even the move.w (a0)+,d1, by combining a code block that gets modified depending on the texture gradient with a jump according to the current row of the triangle (in a linear texture mapper's case) - but that's a topic for another tutorial.
nonetheless it's quite interesting to unroll your loops when optimizing newschool demo-stuff for the 68000: e.g. a 160x100 fullscreen rotozoomer can be done in about 3 vbls on a plain st using a combination of the stuff described here and in the 'avoiding c2p' doc - 4 bitplanes, no overscan, of course.


to finish this tutorial i'd like to keep my promise and finally say a word about cached machines as well:
on quick machines like the falcon or the tt unrolling your loops doesn't pay off very much. completely unrolling them will most likely slow things down, as the loop segment won't fit into the cache anymore - the purpose of the cache is defeated, since your loop needs to be read from memory again and again.
to avoid this problem, make your loops as large as they can be, but smaller than 256 bytes, in order to take advantage of the cache.
don't generate code tables just to save memory fetches (e.g. for a tunnel effect) as long as you're coding for the 68030. rather keep your whole program inside fast-ram (as long as it's available, and as long as you allocate your screenbuffer in st-ram, of course) and use movem.w to fetch your offsets plus move.b/w (an,rn.l),(ax)+ to write the pixels (notice: on the 030 (an,rn.l) is faster than (an,rn.w)) within a loop <= 256 bytes - don't forget the 030 does data-caching, too.
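to give a rough idea, such an 030 friendly inner loop might look like the sketch below (my own illustration, under the usual assumptions: a0 = word offsets, a1 = texture, a2 = screenbuffer, "pixels" a multiple of 8; note that movem.w sign-extends the offsets, so they have to stay below 32768):

	move.w	#pixels/8-1,d7
loop
	movem.w	(a0)+,d0-d6/a3		; fetch 8 offsets in one go
	move.b	(a1,d0.l),(a2)+		; (an,rn.l) - faster than .w on the 030
	move.b	(a1,d1.l),(a2)+
	move.b	(a1,d2.l),(a2)+
	move.b	(a1,d3.l),(a2)+
	move.b	(a1,d4.l),(a2)+
	move.b	(a1,d5.l),(a2)+
	move.b	(a1,d6.l),(a2)+
	move.b	(a1,a3.l),(a2)+
	dbra	d7,loop			; the whole loop is about 40 bytes - it stays in the cache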
concerning the mega ste i have to admit that i never owned nor coded on this machine, actually. hence i can only suppose that sizing your loops according to its 16kb l2 cache is the best thing you can do.

one last trick for cached machines is to partly unroll the loops so they fit into the cache while saving some dbras at the same time, which works like this:
















	move.w	counter,d0		; split the counter into:
	move.w	d0,d2			; the magnitude (n/32)
	lsr.w	#5,d2
	andi.w	#31,d0			; and the remainder (n mod 32)
					; for the jump

	move.w	d0,d1			; same as above
	lsl.w	#3,d1
	add.w	d0,d0
	sub.w	d1,d0

	jmp	(entry,pc,d0.w)		; no problem on the 030

loop
	rept	32			; this still fits into the cache
	move.w	(a0)+,d1
	move.b	(a1,d1.w),(a2)+
	endr
entry
	dbra	d2,loop			; not a shame here cause we got our cache


- 2002 ray//.tscc. -