Useful ramblings from AA on z88dk

Hey all,

I’ve tried to tidy this up and will improve it, but for now this is what we have 🙂

Calling into ASM
The z88dk_fastcall convention is always the best. It passes a single parameter in DEHL (the 8/16/32-bit subset: L, HL or DEHL). z88dk_callee is often better than sdcc_call(1) for pure asm functions, but it depends on how many params you are passing. If it’s few params then sdcc_call(1) may be better than z88dk_callee. You can test it yourself. There is a cost in the parameter setup: z88dk_callee gathers one param at a time and stacks it, while sdcc_call(1) gathers params in registers one at a time, which could involve stack action if there are too many params. Too many on a z80 is not that many, because there aren’t many registers and the instruction set is not orthogonal. You have to try it in real code to see.

Also, if your function is written in asm and it preserves register values (the a,b,c,d,e,h,l lot), declaring in the prototype that certain registers are unchanged can improve the code around the function call a lot.

In the C library all the esxdos functions work fine. Disk streaming in C is done in playvid, although that’s a much more involved example because it maps the file on the SD card and uses that map to stream video files that are fragmented. I don’t think there’s any other program that can stream from fragmented files.

“...declaring in the prototype that certain registers are unchanged can improve the code around the function call a lot.” How do you declare that? Is it another decorator for the method signature at the top?

For your actual implementation of new_esx_dos_read above, the parameter push order for sdcc is right to left, and it wants to push single bytes if the parameter is char. You might as well write the entire function in asm too, rather than leave little straggler calls to C functions, and cut out the C compiler completely. That’s why I moved the entire thing to asm.

The entire C library is written in asm and it has two entry points: one from C and one from asm. So if you’re using zx_border() from C, there is a corresponding asm_zx_border function for calling from asm. The asm implementation of the library code is in z88dk/libsrc/_DEVELOPMENT for newlib. You need to poke around there a bit to locate the function you have in mind. The zx specific functions are in arch/zx, and asm_zx_border is found here: https://github.com/z88dk/z88dk/tree/master/libsrc/_DEVELOPMENT/arch/zx/misc/z80

The source code, with comments on which registers are set up on input and which registers are changed on output: https://github.com/z88dk/z88dk/blob/master/libsrc/_DEVELOPMENT/arch/zx/misc/z80/asm_zx_border.asm

This function exists in order to track the border colour and sound bit state sent to port fe, so that clicks don’t occur when changing border colour or starting audio.

Yes, you can see how z88dk defines these things in its headers. For newlib, these are located in z88dk/include/_DEVELOPMENT, with the sdcc subdirectory for sdcc. zx_border() is in arch/zxn.h: https://github.com/z88dk/z88dk/blob/master/include/_DEVELOPMENT/sdcc/arch/zxn.h#L1001

The preserves_regs qualifier is there and informs the C compiler that only A and L are changed. This really improves the code around that call. When you invoke zx_border() like this:

zx_border(6);

the defines there make sure you are actually doing this:

zx_border_fastcall(6);

So the parameter will be passed in the L register without any stack use, because this is a z88dk_fastcall function. However, if you use a function pointer:

myfunc = zx_border; (myfunc)(6);

then the defines make sure you are using standard C linkage, because invoking through a function pointer requires the normal push of the param onto the stack and pop of the param after the call. This is why there is so much define magic in the z88dk headers: it makes sure the calls are as efficient as possible and you can still use function pointers.
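
The define trick described above can be tried on any C compiler. The following is only a sketch of the pattern, not the actual z88dk header: my_border and my_border_fastcall are hypothetical names, the C bodies stand in for asm routines, and detecting z88dk's sdcc through __SDCC is an assumption of this sketch.

```c
/* The z88dk qualifier only exists in z88dk's compilers. On any other
   compiler it is defined away so the sketch still builds.
   (Testing __SDCC for this is an assumption of the sketch.) */
#ifndef __SDCC
#define __z88dk_fastcall
#endif

/* In a real header the routines would be asm and could also promise
   which registers survive the call, along the lines of:
     extern void my_border(unsigned char colour) __preserves_regs(b,c,d,e,h);
     extern void my_border_fastcall(unsigned char colour)
         __preserves_regs(b,c,d,e,h) __z88dk_fastcall;
   That promise is left out here because the stand-ins are plain C. */

static int plain_calls;     /* calls made through standard C linkage */
static int fastcall_calls;  /* calls made through the fastcall entry */

/* Hypothetical stand-ins for the two entry points of one routine. */
void my_border(unsigned char colour)
{
    (void)colour;
    plain_calls++;
}

void my_border_fastcall(unsigned char colour) __z88dk_fastcall
{
    (void)colour;
    fastcall_calls++;
}

/* Function-like macro: it only expands when the name is followed by '('. */
#define my_border(c) my_border_fastcall(c)

/* Returns 1 when the direct call went to the fastcall entry and the
   call through the pointer went to the standard entry. */
int demo_dispatch(void)
{
    void (*fp)(unsigned char) = my_border;  /* no '(' follows: not expanded */

    plain_calls = 0;
    fastcall_calls = 0;
    my_border(6);   /* expands to my_border_fastcall(6) */
    (fp)(6);        /* uses the standard entry through the pointer */
    return fastcall_calls == 1 && plain_calls == 1;
}
```

The macro must be defined after the function bodies here, otherwise the definition of my_border itself would be rewritten by the preprocessor.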

The standard esxdos calls for z88dk are declared in its header, #include <arch/zxn/esxdos.h> or #include <arch/zx/esxdos.h>, but the latter is for the standard zx and divmmc peripheral. It’s a subset of the Next’s, because the Next’s is a much larger api. But it’s possible to port compiled programs between the Next and standard zx by sticking to the subset of calls supported by esxdos in the zx header.

The esx_f_open() function is prototyped in the include file:

https://github.com/z88dk/z88dk/blob/master/include/_DEVELOPMENT/sdcc/arch/zxn/esxdos.h#L425

and there are comments just above that define constants you can use for the mode parameter.

The asm implementation is in the libsrc tree:

https://github.com/z88dk/z88dk/blob/master/libsrc/_DEVELOPMENT/arch/zxn/esxdos/z80/asm_esx_f_open.asm#L11

You could call that directly in your asm code and it will be the same code used by the c call. The final error jump is going to make sure errno is set properly and you can see the return value will be -1 for both asm and c on error.

Yes, you can use that pragma to move the stack. There is a “LD SP, NNNN” in the crt in cases where this number is >= 0. For many types of executables, you can’t move the stack immediately or after the user part of the program has started. An example is big dot commands (DOTN), where you can’t have the stack in main memory where basic has it while you are loading up extended memory pages. These kinds of problems also exist on other z80 platforms, so the best place for setting the program’s initial SP is in the crt, which can control when the stack is moved as it does whatever is necessary to initialize the target machine. Having sensible defaults for each output type also means programmers can just compile and the thing works.

The CRT is very simple. At its root, it’s a call to _main and then figuring out what to do when the program exits. Sometimes that’s an infinite loop, sometimes that’s a return to the system requiring some cleanup, and sometimes it’s a program restart, which involves re-initializing the data and bss sections. These are also things automatically handled by the CRT, with non-default behaviour specified by PRAGMAs. In general, the PRAGMAs control the behaviour of this small number of decisions.

The actual CRT in z88dk is structured like this, even if some of these sections are empty in different cases:

(1) initialization
(2) user initialization (the user can insert code here by assignment to a specific section)
(3) call _main (maybe with command line arguments if the target supports it)
(4) run exit stack (user functions registered with atexit() that must be run by the system before exit)
(5) run quick_exit stack (user functions registered with at_quick_exit() that must be run by the system before exit)
(6) exit (return to system, infinite loop, restart program)

C11 introduced the idea of a quick exit separate from normal exit, to distinguish the minimum essential exit functions from system cleanup functions that don’t have to be run in an emergency (?)

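The exit stack described above is filled through the standard atexit() call. A minimal sketch in standard C follows; the two cleanup routines are hypothetical names made up for the example, and how many handlers fit depends on how the target's exit stack is configured, so the return value is checked.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cleanup routines; the names are made up for this sketch. */
static void close_files(void)        { puts("closing files"); }
static void restore_interrupts(void) { puts("restoring interrupts"); }

/* Register both handlers. Returns 0 on success, or -1 if one of them
   could not be registered (atexit returns nonzero on failure). */
int register_exit_handlers(void)
{
    if (atexit(close_files) != 0)        return -1;
    if (atexit(restore_interrupts) != 0) return -1;
    return 0;
}
```

If register_exit_handlers() is called early in main(), a normal exit runs restore_interrupts first and close_files second, since standard C runs atexit handlers in reverse order of registration.
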
The library code also makes use of this structure. For example, if you use malloc() to allocate dynamic memory, the library inserts code to initialize the heap into the “user initialization” section above. This is done automatically by the linker and, like magic, your program compiles with malloc() and friends working without your having to do anything.

When I set up im2 interrupts, I prefer to do it from asm. So I normally insert a little code into the “user initialization” section to set im2 mode and load the I register. That’s done here: https://gitlab.com/thesmog358/tbblue/-/blob/master/src/c/DotCommands/playvid/interrupts-common.asm#L6 There is also some code inserted into the user exit section to restore the I register before exit. Then I manually enable interrupts in main() when the program is ready to start receiving them.

The library also has its optional driver instantiation in the CRT. This is mainly initializing the fd table (stdin, stdout, stderr) and initializing drivers attached to those streams. It does that by building the fd table (an array of function pointers) one entry at a time, with a reference to the driver’s entry point in the fd table. The linker takes care of putting the code and table in the correct section in memory. The driver instantiation is done with M4, which is a macro language that allows custom code to be generated at build time. The custom code sets the defaults of the driver and makes them available for the user to modify. So, eg, the default for stdout might be a driver that prints to the full 32×24 screen, but the user could change those M4 params to make the printing go to a smaller window. The magnetic scrolls interpreter for the Next is a public example of the user manually setting up a custom fd table and then customizing the driver itself in asm, as the drivers are written in an object oriented way. Linker magic makes the user code replace the “superclass” code.

But at which point is the SP specified in pragma actually changed from some other (initial) SP? And what are the ranges to which SP is safe to change with the pragma? I understood it’s different for different crt variants of z88dk?

I ask specifically because I see that the memory layout is given by the runtimes, but users also change it in their own code (and maybe through these custom pragma settings too?)

The CRT will change the SP at the right time and before main() is called. You are free to place it wherever you want, including into places that will cause your program to crash. Just like asm, there is total freedom to fail. Always remember the z88dk memory model. There’s a main binary blob that assembles at CRT_ORG_CODE, normally defaulting to 0x8000 on spectrum builds. This is also changeable via pragma. This is your main CODE/DATA/BSS put together and growing upward from 0x8000. Stuff you have assigned to memory banks is in those memory banks and not here. The default location of the stack is at 0, so it is growing downward from the top of memory toward the end of CODE/DATA/BSS. Like in any program, you need to make sure there is sufficient space there. But if you need to bank up there, it makes perfect sense to move the stack somewhere else, like underneath CODE/DATA/BSS. If that is at 0x8000, setting the stack to 0x8000 will make it grow downward underneath the main binary.
Always keep in mind when banking: where’s my stack? Where are the interrupts going? When the interrupts are running, what is my memory map?
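
As a sketch of the layout just described, the two settings could be written at the top of a C source file as below. This assumes the stack pragma being discussed is z88dk's REGISTER_SP (the text does not name it); CRT_ORG_CODE is the name used above. The values are only an illustration.

```c
/* Assemble the main binary (CODE/DATA/BSS) upward from 0x8000. */
#pragma output CRT_ORG_CODE = 0x8000

/* Assumed pragma name: start the stack at 0x8000 so it grows downward
   underneath the main binary instead of down from the top of memory. */
#pragma output REGISTER_SP = 0x8000
```

With this layout the top of memory is left free for banking, but whatever lives below 0x8000 must have room for the stack.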
The user can change all of this. The CODE/DATA/BSS need not be contiguous in a single blob. You can specify a different ORG for DATA and a different one again for BSS. That may not make sense in the normal zx environment but it does make sense for ROMable systems like an if2 cartridge. The CODE must lie in the rom, presumably at address 0. If the ORG is 0, the z88dk crt will optionally provide the normal z80 page 0 code that consists of stubs for the RST locations and NMI location. The DATA and BSS must lie in RAM at, say, 0x8000 if you want to avoid contended memory. DATA/BSS is normally kept together. The crt init code must zero the BSS section and decompress the DATA section stored in the rom to ram before main() starts. These are a lot of details that the z88dk crt will automatically take care of, just to give an example of the initialization load the crt may take on. All optional of course but this is a default that allows you to just click compile and get a working ROMable system out the other end.

You can also change the memory map to separate some things out of the main binary. Maybe you want all malloc code or all adt code or all math code somewhere else. You can change the memory map to move that around as the library is written to assign modules to very specific section names. section “code_math” normally sits inside the large section “CODE”. But I wouldn’t recommend doing this. It makes more sense to keep it all a binary blob and if you want to have self-contained code in banks, do it the other way that was discussed already. That way requires no specialized knowledge from the user.

Some systems have done the latter where they have a small operating system that is composed of a standalone binary with custom and library code. This acts like a command shell that can execute programs from disk. When you build programs for such a system, you link against the entry points in the command shell and that makes the generated user programs running in the system much smaller.

It’s really a corner case because the test program is pounding the malloc. But maybe not so corner, because the heap on a 64k system is not going to be big. It could probably be a flag thing, but one of the things I am leaning toward is using the linker to make those choices rather than library flags. Like right now, newlib has all these build time flags for fast integer math, small integer math, qsort uses quicksort, qsort uses insertion sort, and on and on. I think now that those things should be done by the linker, so you can use -l on the compile line to change the implementation.

Everything in the library gets unique names. Then there is a default set of defines to alias standard names to default unique names. If the user makes his own defines to make a different alias, this is preferentially used by the linker.

Anyway you learn these things incrementally. There were a lot of requirements to synthesize to come up with the current methods and it’s not perfect yet. Newlib’s stdio implementation is a 3rd iteration I think, one in which a wrong turn was taken as it prepared to incorporate task switching which has shown up in the standards docs. But that’s a mistake so it’s carrying some unnecessary code for that. classic is now using my 2nd iteration which doesn’t have the object oriented idea.

I’m definitely good on the object oriented and normal streams approach. Object oriented does act to reduce the code size because it also means different drivers share code. On systems with a certain min amount of memory, this is the right way to go. On small systems, you probably shouldn’t be using stdio and should have a very simple put() function written in low level asm and it probably shouldn’t be put() but rather system specific asm code. My thoughts are always to put the difficult into the lib and let the user do the simple because users will never take on the difficult but they’re very happy to do things simply on their own. You can see that in every retro community. Very little code is shared in the zx community but one example is the fzx printing that Einar wrote. He did one simple function – print a char. z88dk’s implementation adds things like colour / xor-or-and printing / printing into windows / measuring the width of strings / targeting multiple screen resolutions / adding edit functions so you can scanf with proportional fonts, etc. The larger zx community will never do this since it’s difficult.

I almost forgot why I was heading into malloc… when the memory system is small, you do have to wonder where the memory for the heap is. On a big system with 32-bit or 64-bit addresses, no one cares because the memory available for a program is essentially infinite. The heap usually occupies some number of pages and can grow automatically as needed very easily in that large address space. No one thinks about it. The default behaviour in z88dk is to take all the memory between the end of the main binary (after CODE/DATA/BSS) and the bottom of the stack (initial SP minus a default 512 bytes). Use of malloc then causes the heap initialization code to initialize that memory area. If that region is too small, the size calculation goes negative and the program should hang. It may even fail to compile, I don’t recall offhand. It can also be too small if the initial SP is below CODE/DATA/BSS. This default ensures standard C programs can just compile, and it’s also normally what is wanted. However, there are options to specify the size of the heap and where it is. The location can include a static char array declared in C.
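
The size calculation described above is simple arithmetic and can be sketched as follows. This is not z88dk's code: the function and parameter names are hypothetical, and it only models the rule "everything between the end of the main binary and the initial SP minus the stack reserve".

```c
/* Bytes available to the automatic heap, or a negative number when the
   region does not exist. All names here are hypothetical. */
long default_heap_bytes(unsigned long binary_end,
                        unsigned long initial_sp,
                        unsigned long stack_reserve)
{
    /* An initial SP of 0 means the stack starts at the top of the
       64K address space and grows downward from there. */
    unsigned long stack_top = (initial_sp == 0) ? 0x10000UL : initial_sp;

    return (long)stack_top - (long)stack_reserve - (long)binary_end;
}
```

For a binary ending at 0xC000 with the default stack at 0 and 512 bytes reserved, this gives 15872 bytes. With the stack moved to 0x8000, underneath a binary that ends at 0x9000, the result is negative, which is the "too small" case mentioned above.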

For NEX, I believe the default is to not have a heap at all. I’ll look that up in a bit, but I mention this because moving SP around could affect the default behaviour of the heap. Small systems can bank and have more stringent requirements for dynamic memory allocation, so in z88dk there are many options:

(1) The standard heap, which is a region of memory that can allocate random sized blocks. It’s a contiguous range of memory addresses, an array of char basically. As memory is freed, you can end up with holes in the heap, so further mallocs have to search the heap to find big enough holes to satisfy the request. Allocations are confined to the heap. z88dk actually supports multiple heaps. The underlying api is heap_alloc, heap_realloc, heap_free using named heaps. These heaps are independent, can vary in size and can be in multiple banks. You just have to make sure each heap is paged in before allocating memory from it. The standard functions malloc, realloc, free call these functions with an implied heap.

(2) The obstack. Obstacks come from the GNU C library on linux, and they too are a contiguous memory block like a char array. Allocations are confined to the obstack. Again, you can have multiple obstacks of varying sizes in different memory banks. Obstacks are very quick and simple. A pointer is maintained that points at the next available address. You make an allocation, that pointer is returned and the pointer is advanced by the size of the allocation. You must deallocate in reverse order and cannot deallocate randomly. In fact, you can deallocate everything at once by simply resetting the pointer to the beginning of the obstack by, eg, freeing the first bit of memory allocated. A lot of functions are added to this simple idea that let you build objects incrementally on the obstack. It’s very convenient, small and fast for this type of allocation.

(3) The block memory allocator. The idea is you maintain linked lists of fixed size memory blocks gathered from pockets of available memory around the memory map. An array is maintained with one entry for each list. You might put 10 byte blocks in one list, 20 bytes in another, 100 in another, etc. It’s up to you, because the allocator only deals with queue numbers. You can add memory from different banks, and it probably makes sense to have different lists for different banks. You add via a function that accepts an address range and then breaks that range into blocks added to the list. Allocation and freeing are quick, as it’s a simple singly linked list implementation. The advantage is speed and the ability to gather memory from small pockets all over the place.
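
To make the obstack and block allocator ideas concrete, here is a simplified model of both in portable C. It is not the z88dk implementation or its api: every demo_ name is hypothetical, there is no banking, and the block allocator takes the queue number on free to keep the sketch short.

```c
#include <stddef.h>

/* --- Obstack-style allocator: one pointer that only moves forward --- */
typedef struct {
    unsigned char *base;   /* start of the region */
    unsigned char *next;   /* next available address */
    unsigned char *end;    /* one past the end of the region */
} demo_obstack;

void demo_obstack_init(demo_obstack *ob, void *mem, size_t size)
{
    ob->base = ob->next = (unsigned char *)mem;
    ob->end = ob->base + size;
}

void *demo_obstack_alloc(demo_obstack *ob, size_t size)
{
    unsigned char *p = ob->next;

    if ((size_t)(ob->end - ob->next) < size)
        return NULL;               /* not enough room left */
    ob->next += size;              /* advance past the new object */
    return p;
}

/* Frees p and everything allocated after it. */
void demo_obstack_free(demo_obstack *ob, void *p)
{
    unsigned char *q = (unsigned char *)p;

    if (q >= ob->base && q <= ob->next)
        ob->next = q;
}

/* --- Block allocator: singly linked lists of fixed size blocks --- */
typedef struct demo_block { struct demo_block *next; } demo_block;

#define DEMO_QUEUES 4
static demo_block *demo_queue[DEMO_QUEUES];   /* one list head per queue */

/* Break [mem, mem + size) into blocks of block_size bytes and add them
   to queue q. Returns the number of blocks added. On a host machine mem
   and block_size must suit pointer alignment. */
size_t demo_balloc_addmem(unsigned q, void *mem, size_t size, size_t block_size)
{
    unsigned char *p = (unsigned char *)mem;
    size_t added = 0;

    if (q >= DEMO_QUEUES || block_size < sizeof(demo_block))
        return 0;
    while (size >= block_size) {
        demo_block *b = (demo_block *)p;
        b->next = demo_queue[q];   /* push the block on the list */
        demo_queue[q] = b;
        p += block_size;
        size -= block_size;
        added++;
    }
    return added;
}

void *demo_balloc_alloc(unsigned q)
{
    demo_block *b;

    if (q >= DEMO_QUEUES || demo_queue[q] == NULL)
        return NULL;
    b = demo_queue[q];             /* pop the first free block */
    demo_queue[q] = b->next;
    return b;
}

void demo_balloc_free(unsigned q, void *p)
{
    demo_block *b = (demo_block *)p;

    if (q >= DEMO_QUEUES || p == NULL)
        return;
    b->next = demo_queue[q];       /* push the block back */
    demo_queue[q] = b;
}
```

Both allocators do constant work per allocation, which is the speed advantage over searching a general heap for a hole of the right size.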

Copy a byte from one location to another:

*(unsigned char *)(destination address) = *(unsigned char *)(source address) ;
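
The one-liner above can be wrapped in a small function. This is a sketch: copy_byte is a hypothetical name, and the use of volatile is an addition here so that the compiler does not optimise away accesses to hardware-mapped locations.

```c
#include <stdint.h>

/* Copy one byte from the address src to the address dst.
   volatile keeps the compiler from removing or reordering the access. */
void copy_byte(uintptr_t dst, uintptr_t src)
{
    *(volatile unsigned char *)dst = *(volatile unsigned char *)src;
}
```

On the target you would pass literal addresses; on a host machine the addresses of ordinary variables, cast to uintptr_t, work for trying it out.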
