Category Archives: Programming

An article about C++

Async tasks in C++

The introduction of the concurrent memory model and the concurrency library in STL are major features brought by C++11. With prior versions, we had to resort to platform-specific extensions or library in order to write multithreaded applications, limiting the portability of the code.

One of the new facility introduced is std::async(), which allows to make use of concurrency at an higher level of abstraction. It relies on « tasks » instead of manually creating and managing threads. But it also comes with some serious caveats that every programmer should be aware of.

Introduction to tasks

Thread based parallelism

C++03 lacks any notion of thread management. Thus, in order to create multithreaded applications, programmers often relies on C APIs, such as POSIX or Win32, and have to be familiar with the concept of threads. But manipulating threads can quickly become a burden. You have to manually launch and join them, and be careful that you don’t exceed the capabilities of your hardware and OS, as threads are costly.

Such a traditional thread management is now provided by « modern »C++, around the class std::thread.

Task based parallelism

C++11 also introduced the notion of task based parallelism.
You define jobs (aka tasks) that can and may be run concurrently, and you rely on the system to spawn threads, depending on the available resources available. Unlike with manual thread management, creating tasks is supposed to be burden free for the programmer and always optimal. In order to fulfil these pledges, the underlying has to be quite smart. It usually relies on global thread pools. The tasks themselves may have to be migrated from one thread to another. Or, if a task is blocked waiting for a resource, it may be taken off its thread so another can replace it.

Tasks in C++


In order to work with tasks, C++11 introduced the function template std::async. Its purpose is to launch the provided task on a thread, which may be a new thread, a recycled thread or the current thread. The sophistication behind the scene can greatly vary from one implementation to the other.

std::async takes as parameters, a launch policy (more on that later) and a function object (along with its parameters) forming the « task »to be processed. std::async returns a std::future.


std::future hosts the result promised by a task. This result is accessed by the get() operation. If the task has already completed its job, it may already be available. Otherwise, get() launches the task in the current thread and blocks until the result becomes available.
If an exception is thrown by the task, it is propagated to the current context. Thus, error handling is greatly simplified!


Unfortunately, the way C++ standardized task based parallelism is far from perfect, and there are many limitations every programmer should be aware of before using tasks!

The first, and most important limitation is that, because of C++11 itself, it is extremely difficult for the STL implementers to provide a true task based parallelism!!!

A true task, should not be too closely tied to its underlying thread. It should be easy to reuse threads for multiple tasks, and to move them to another thread or unload them when being blocked.
But C++11 introduced new concepts such as thread_local. A thread_local variable is some kind of global variable that’s only visible from one thread and should be initialized and destroyed along with it. As you can imagine, difficulties arise if a task making use of thread_locals had to be moved to another thread…

Many other problematic scenarios could also appear due to locks and mutexes.

The second limitation is that the C++ standard does not define precisely enough what to expect when using std::async to create tasks. When you use std ::async, you can specify a launch policy.

std::launch::deferred will force the task to run in the same thread. std::launch::async tells the underlying that the task has to run concurrently. And if you don’t precise a launch policy, a default one is chosen:  std::launch::deferred |  std::launch::async. It means that the library will decide which policy is to be applied, according to its own criteria. That’s the one you would expect to use if you’d want to profit from task based parallelism as I presented it earlier. The problem is that the standard imposes nothing concerning those criteria! Thus, the behavior of a task created with the std::launch::async may differ greatly from one STL implementation to the other.

For instance:

  • GCC 4.8 :
    • std::launch::deferred | std::launch::async will always launch the task as deferred
    • std::launch::async will create one thread per task
  • Visual C++ 2017
    • std::launch::deferred | std::launch::async will make a decision according to the resources
    • std::launch::async is said to use a global thread pool

It is clear that this is problematic if you aim to write portable code! At the extreme, if you spawn too much tasks, it will behave accordingly to expectations if compiled with Visual, but crash if compiled with GCC4.8, due to resource exhaustion!

Note: I am aware that more recent version of GCC are closer in sophistication to Visual 2017. I just wanted to illustrate my point with two extremes.

For all those reasons, it is safer to consider that std::async do not provide a true task based parallelism!


Despite all the limitations discussed above, it is highly advised to make use of std::aync and futures whenever possible ! Indeed, threads can be sensitive and have a tendency to terminate() if not handled with care. Furthermore, having an easy interface providing futures is worth the limitations.

The only thing is to remember that std::async do not provide a true task based parallelism!
If you need to run your tasks concurrently, it is suggested to use std::async explicitly asking for a std::launch::async policy and consider that each task will run in its own thread.
Also, if you want to stay away from troubles, don’t use std::async tasks to  perform I/O operations, or manipulate mutexes. That last part is not from me, but from Bjarne Stroustrup himself!

If you follow these advices, you will benefit of all the advantages brought by std::async and std::futures over without any risk.

It is only if you need more control over your « tasks », that you will have to fall back to manipulate plain std::threads.

After all this is what C++ is all about: giving freedom of choice the programmers. It’s just that sometimes freedom can become tricky 😉

A Game of Life

My port of Conway’s Game of Life to the Apple II is now complete!

The whole code is available on GitHub.
Plus, there are four posts on this documenting my progression that could interest those who want to learn more about programming on the Apple II,

  1. Coding in C for an 8 bit 6502 CPU
  2. Coding in Assembly for an Apple II
  3. HIRES Graphics on the Apple II
  4. Making the Apple II sing

You’re welcome to try it in the emulator embedded below!

This Javascript emulator presents a few caveats though.

  • It’s very, very slow, and won’t run real time on all but the most powerful smartphones
  • In order to « play » it past the Title Screen, a keyboard is required anyway
  • The loading time can last more than 30 seconds :/
  • The display is way cleaner and crisper than it would be on an actual machine

Try it

You can also download this disk image and load it on your favorite emulator for a more faithful result.

A few screenshots if you are on a smartphone and cannot run the emulator:

Lessons learnt

Now that this small project is completed, I can say that I have a good grasp on how the Apple II actually works and how to make use of its main capabilities. Of course, no breakthrough here. But my « A Game of Life » makes use of:

  • Keyboard inputs
  • Access to the file system via the PRODOS‘ API
  • Color graphics in both HIRES and LORES
  • Sounds, and even music

Here I will sum up some lessons I learnt, how things could be improved in an eventual future project and what surprised me the most.

Coding for an Apple II in 2017

As I explained in my first article on the subject, it’s quite possible to code in plain C, thanks to the amazing work of the guys behind CC65. The binary produced is far from being optimized but usable enough to form a skeleton or hosts a state machine. CC65 also comes with an assembler if some routines require to be optimized.

On the plus side

  • CC65 has a good compliance with the ANSI standard
  • CC65 comes with a good support library with some standard functions such as printf and others inspired from conio.h
  • There is a good integration with the assembler and the calling convention is well documented,

On the minus side

  • Most of the zero page addresses are reserved for the « runtime library ». Meaning that, even in assembly, you will be restricted to using only 8 addresses!
  • A lot of symbols are loaded by default, reducing the usable memory.

Testing and debugging

Using CC65, it is rather easy to develop and to cross-compile for the Apple II.

Fortunately, using an adequate emulator, it is also rather easy to debug code. Being mainly on Windows, I chose AppleWin. It is faithful to the hardware and, more important in this occasion, provides a debugger. If its debugging capabilities seem rather limited compared to a modern debugger, they are more than enough. And way more advanced that any debugger that can be found on the actual machine.

I was quite satisfied by AppleWin’s debugger and emulation!

The Apple II hardware

I learnt much about the hardware of the machine. I can sum up my feelings with two words: clever and primitive.

Indeed, everything about the Apple II was very cleverly engineered. But as the grandfather of the personal computers, it’s also way more primitive that what I expected.

The MOS 6502

The 6502 is a great microprocessor! Its instruction set is small and very lean. Quite easy to learn. For its time, it was quit an engineering achievement: as powerful as its main competitors of the time (the Intel 8080 and the Motorola 6800) for a fifth of the price! It was powerful enough to remain competitive during the first half of the eighties against its nemesis, the Zilog Z80.

But there is a catch. It comes with nothing superfluous.

For instance, there is, of course a LSR instruction (Logical Shift Right). But you can only shift by one bit. If you want to shift more? Use a loop! Furthermore, there is no LSL instruction! You can straight that out with the  ROR instruction anyway. Same with subtraction: if you want to subtract by a number, add its negative!

And it is a true 8-bit CPU. Meaning that only 8-bit operations are permitted!

On other 8-bit CPU such as the Z80, you can manipulate 16 bit numbers and use in them in some arithmetic operations. Of course, it is suboptimal. As the ALU is not 16-bit wide, it will rely on some slow micro-coded operations to perform a 16-bit task.

But micro-codes cost transistors. Thus the 6502 does not provide any!

When programming the 6502, you are constantly reminded how limiting 8-bit arithmetic is. The result of your additions cannot exceed 256. Your loops are limited to 256 iterations. You cannot event jump to an address more than 256 byte away in memory!

The Graphics

If I had no trouble using the text mode and the low resolution graphical mode, using the “HIRES” high resolution mode was a lot more difficult that what I expected. Of course, as I knew, there is no dedicated hardware to assist you. But the most disturbing is the peculiar way the data is organized in memory. Saying nothing about the artifacts appearing if you’re not careful.

This color display was astonishing for the time. But it was also very difficult to program for. I now measure the troubles that programmers of the time had to overcome in order to offer us some of the best arcade games ever!

The Sound

Bare-bone is what comes to my mind. I knew that the speaker had to be driven by the software but I did not expect the lack of an hardware counter or clock that could cadence it.


Programming a Game of Life on the Apple II was worth every bit of it. Using modern tools, it was not that difficult and was mainly a matter of reading (and understanding!) the documentation.

Now I wonder what will be my next move.
I still have to complete the restoration of my Apple II by fixing its Disk drive.
I could capitalize on my new knowledge of the 6502 to program a small game for the PC Engine (also known as the TurboGrafx-16 by our American friends). An impressive arcade machine!
But I’m also tempted to program something for the Amiga.

Stay tuned…

Making the Apple II sing

Now that my port of Conway’s Game of Life on Apple II possess a proper HIRES Title Screen, some kind of music would be great!

In this post, I will present the sound capabilities of the Apple II then explain how I coded a (simple) music library.

This is the fourth post about coding the Game of Life on the Apple II. Here are the previous ones:

  1. Coding in C for an 8 bit 6502 CPU
  2. Coding in Assembly for an Apple II
  3. HIRES Graphics on Apple II

And as usual, the code is available on GitHub!

Where in the world is the dedicated sound hardware?

Do you know the musical abilities of the Apple II? Well… there is none!

OK I am exaggerating a tiny bit. The CPU has direct control of a speaker, called the Beeper!

Most of the 8 bit home computers from the early eighties provided some hardware dedicated to producing pleasant music. Atari 8bits featured 4 voices on which you could play sines and mix some noise. The Commodore C64 came along the legendary SID, which the chiptuners still compose for!

But the Apple II precedes those machines by several years. It was not designed primarily as a game machine (although games were among its very first batch of programs). And, as we know, Woz liked nothing more than using standard chips and cutting their number to the bare minimum. That is why the Beeper is directly driven by the CPU.

Even the IBM PC, which was often derided for its appalling sound capabilities, provided more. In the Apple II’s fashion, the PC’s speaker is directly controlled by the CPU. But the PC also came with a Programmable Interval Timer, which could be used to cadence the program driving the speaker. On the Apple machine, you don’t enjoy such a facility. In order to timely cadence the speaker, you have to count cycles. So, if you don’t rely on very cleaver programming tricks, you will hog 100% of the CPU.

And that’s precisely what I am doing in the library I will expose below! 😉

Producing notes

One caveat: I am not a musician. I apologize in advance if my explanations concerning musical notation, pitch and octaves are a little bit approximate.


The only interaction a program can have with the Beeper is through a Soft-Switch. Reading or writing to the address 0xC030 will invert the position of the speaker. And that’s it.

By periodically alternating the high and the low positions of the speaker, this switch make it easy to produce a square waveform. The frequency of this vibration will command the musical note produced. Thus, the single most important thing is to control the timing on which I will activate the speaker’s soft-switch.

Unfortunately, as I said, on the Apple II there’s is no other easy way to measure time other than counting cycles. So, the heart of my “note” routine is composed of two loops.

  1. The inner loop controls the pitch of the note to play. It is composed of a number of NOPs, forming the smallest quantum of time for my function. The number of iterations is adjusted so that iterating the loop takes the time of half a period for the desired note. When that amount of time is elapsed, the soft-switch is activated.
  2. The outer loop controls the duration of the note. I want the base duration to be the same for all the notes. Thus, of course, the number of iterations will be different for each one, as their period depends on their pitch.

In order to calculate the number of those iteration to perform, I need to know precisely the duration of each instruction. Inside the loops of course, but also the control instructions: comparisons, conditional branching, etc…

Fortunately, the 6502 behaves predictably. Counting the cycles is only a matter of reading the documentation.

Controlling the pitch

These loops written, the real trick is to calculate the required number of iteration, depending on the pitch of the desired note. The formula is quite simple. The half period duration is \frac{1}{2 \ast Frequency_{NOTE}}. The duration of the half-period loop is easy to calculate. It’s \frac{Number_{cycles}}{Frequency_{CPU}}.

Thus, to get the number of required iterations: Nb_{ITERATIONS} = \frac{Frequency_{CPU}}{2 \ast Frequency_{NOTE} * Number_{CYCLES}}.

In order to help me, I wrote a small spreadsheet to perform the calculus for all the playable notes. Given that the 6502 is an 8 bit CPU, the highest number of iteration is 255. Thus, the the lowest note reproducible by my library is D#3, with 251 iterations.

You can download this spreadsheet from GitHub.

Du to the various approximations and the duration of the inner loop, this method cannot be 100% accurate. Thus, I recorded the notes using Audacity and measured their precise duration, so I could plot their relative distance from the goal.

Relative error of the reproduced frequency

Not that bad 😉

At the highest frequencies, less iterations are required to last half a period. Thus, the imprecision raises. I decided that the highest note reproducible would be B4. After that, the imprecision in the reproduction would be so high that those notes would be impracticable to play.

Controlling the duration

Concerning the duration of the note, it’s far easier. A base duration will be one second. It will correspond to a WHOLE note. As the inner loop is in charge for lasting half a period of the note, the number of outer loop’s iterations required is 2 \ast Frequency_{NOTE}.

This base duration will be shifted in order to play HALF notes, QUARTER notes, and so on.


In order to differentiate the notes, silence is needed. So, I also wrote a “pause” function. Its principle is the same as the “note” function and its base duration is also one second. But of course, the speaker soft-switch is left untouched.

The result

With my library being able to produce musical notes, control their duration and insert an optional pause between notes, I can play a simple tune. I asked my friend Clint Slate to compose a short one, to be played during my Title Screen:


And the same tune played by emulated Apple IIe:


You can marvel to the characteristic beeps of the machine! And the tune is still quite recognizable! 🙂

More advanced techniques

While quite satisfied by the result, and also by the ease of use of my “Apple II music library“, I know it is possible to do a lot better!

First, my library use 100% of the CPU when playing something. It’s OK for my title screen, but it would be impossible to use it while achieving fluid animations. Good coders of the eighties could do it.

Well, maybe it is easier than playing music, as sound effects may not have to rely on strict timing requirements. But it impresses me nonetheless!

Another weakness of my library, is that it produces the crudest level of 1 bit music: square waves. While it was the quite common back in the time, some talented hackers could do a lot better and played music on more than one channel!

It’s even possible to play sampled sounds, using a technique called “Pulse-Width Modulation“.

Anyway, my music library now fulfills my needs!

HIRES Graphics on Apple II

The Apple II was innovative in many ways. One of its most remarkable feats was that it was the very first home computer supporting color graphics! Think about it! While all its competitor could display 40 columns of ASCII characters, the Apple II could displays graphics with up to 16 colors in LoRes mode and 6 colors in HiRes mode! This was huge!

Now that I know how to program my Apple II in C and ASM, I could not resist trying to display a high resolution image on my screen! So I added a Title Screen to my Game of Life 😉

This article is the third concerning my port of Conway’s Game of life on the Apple II. Here are the two previous ones:

  1. Coding in C for an 8 bit 6502 CPU
  2. Coding in Assembly for an Apple II

Graphics modes

Information can be display be the Apple II using different graphic modes.

  • TEXT mode allows to display 24 rows of 40 ASCII characters. With an expansion board, displaying 80 characters becomes possible.
  • LORES mode is 40 pixel wide, 48 pixel tall in 16 colors! While it seems quite low by today’s standards, it is  reasonably fast and straightforward to program for. This is the mode used to edit the “play-field”, then watch the cells evolve, in my port of the Game of Life.
  • HIRES is the mode used by most graphical programs and games of the time. It is 140 pixel wide, 192 pixel tall in 6 colors. BUT… relying on many oddities, it is not as easy to grasp nor program than the LORES mode. More precisions later.
  • MIXED mode allow to use the bottom of the screen of LORES and HIRES mode to display four lines of text.
  • Later Apple II models came with the so called DOUBLE LORES and the DOUBLE HIRES mode. The later was quite difficult to program for and heavily taxing memory. So it was mainly used for static screens.

From that point, I will be exclusively writing about HIRES.

HIRES Oddities

Color is achieved a very peculiar way on an Apple II. Furthermore, Woz made the choice to reduce the number of chips to the minimum. The price paid being that programming HIRES graphics on the Apple II is not as easy and fast that it could be on other machines.

Producing colors with black & white dots!

At its core, the Apple II is a black and white machine!
On the contrary to other 8 bit and 16 bit machines, you don’t select colors in a palette. Instead, you have to carefully place black and white dots. Indeed, the hardware was designed so that a rapid succession of such points introduces interferences into the color carrier of the signal sent to the NTSC TV.

Img1. Image seen on a black & white set


Img2. The same image on a color NTSC set

If you watch carefully the two images above you can notice:

  • That the text in the color version is not very crisp and is affected by some color fringes. Those are the “interferences” in action!
  • That the black & white version seems to be twice the resolution. And it is indeed! In order to achieve the 140 colored pixels of an HIRES line, you’ve got to draw 280 black and white dots!

Sending a succession of lit dots followed by unlit dots will draw color pixels, while two consecutive white dots will result in a white pixel and two consecutive black dots produce a black pixel. If you’re not already lost, wait, there is more!

The process I just describe only allows you to produce four colors: WHITE, BLACK, GREEN and PURPLE. So how can the system also produce BLUE and ORANGE?? By delaying the signal of half a dot!

Img3. Color generation on the Apple II

In memory


Two memory blocks can be read by the graphic circuitry of the machine in order to display a HIRES picture.

  • PAGE 1, from 0x2000 to 0x3FFF
  • PAGE 2, from 0x4000 to 0x5FFF


7 dot blocks

Each page allows the description of 192 lines of 140 pixels, or 280 dots. Those dots are grouped into lines of 40 blocks of 7 dots. Each of these dot is controlled by one bit: 0 produces an black dot, while 1 produces a white dot. Such a 7 dot block is described by one byte.
And the 8th bit? It controls the half-dot delay applied, thus the COLOR GROUP of the block.

  • 8th bit = 0 : GROUP 1, allowing to display GREEN, VIOLET, BLACK and WHITE
  • 8th bit = 1 : GROUP 2, allowing to display ORANGE, BLUE, BLACK and WHITE

This 8th bit is the higher bit of the byte.
The 7 other bits describe the dots in the following order: the lesser bit corresponds to the dot displayed on the left and the higher bit to the dot on the right.
Those bytes are grouped inside 40 block lines and addressed the following way: the byte of lowest address corresponds to the left-most block of one line, while the byte of highest address correspond to the block on the right.

Img4. Dots layout in a byte

If you followed carefully, you surely noticed that the layout of the bits inside one byte (7 dot block) is the reverse of the layout of the bytes inside a line! Legends tell that this insanity allowed Woz to save two chips…


If you want to produce color, you have to set the high bit accordingly to the colors you want to produce.
Then you have to lay the dots following these directives:

  • dots OFF ON gives GREEN (GROUP1) or ORANGE (GROUP2)
  • dots ON OFF gives VIOLET (GROUP1) or BLUE (GROUP2)
  • dots ON ON gives WHITE
  • dots OFF OFF gives BLACK

As 7 dots gives 3.5 colored pixels, you’ll often have to think about the layout of two consecutive blocks.

To sum those things up, here are the bytes required to obtain a line of 14 green pixels:

0x2A – 0x55 – 0x2A – 0x55

0 0 10 10 10 – 0 10 10 10 10 0 10 10 10 – 0 10 10 10 1

(the brown bits describe adjacent dots)


Now, two limitations become evident:

  1. You cannot display a color from GROUP 1 and a color from GROUP 2 in the same 7 dot block.
  2. Two different adjacent colors may lead to the appearance of unwanted artifacts! Indeed if you want to display GREEN (dots: OFF ON) followed by VIOLET (dots: ON OFF), there will be two adjacent ON dots, leading to the appearance of WHITE!

Those artifacts were called clashing back in the days. Quoting Wikipedia:

Img5. The horizontal borders between two colors generate “fringe effects” on the Apple II. In the lower left image, drawing a blue star on a green background causes the Apple II to add black, white and orange pixels at and near the horizontal boundaries between the green and blue.


One more thing…

The lines are not contiguous in memory! Meaning line number 2’s address does not follow line number 1’s… Instead, they are ordered according to a 64 line interleaving pattern.

Below lies a table from the Apple’s documentation of the time.
On the left are the address of so-called “boxes” of height lines. On the right, in the zoomed view, you have the offset to be applied for each line inside a box. For instance, line #10 is the second line of the second box. Therefore, its address is 0X2480.
On the top are the offset of each 7 dot blocks inside a line. Therefore, the block #3 of line #10 lies at the address 0x2483.

Img6. Starting address of the ordered lines


And a last, the PAGES are 8192 byte long, but 192*40 = 7680! 512 bytes are missing!

Those bytes are 8 byte long holes located at the end of every line which address ends with 0x50 or 0xD0. In the eighties, code wizards did not left these holes unused! For instance, this memory could be used to host the code of a self displaying image!


In order to display your image, you have to activate some soft-switches by reading or writing to the following addresses:

  • 0xC050 in order to switch to the graphical display
  • 0xC052 to be full-screen
  • 0xC057 to enter HIRES mode
  • 0xC054 to select PAGE 1 or 0xC055 to select PAGE 2

In order to display elaborate animations, you can work on a PAGE while the other one is displayed. Yes, the Apple II can perform double buffering!


Remember that I wanted to add a Title Screen to my Game of Life. But it is now apparent that producing a HIRES image is not a very straightforward task. Thus, I found it faster and easier to program a tool converting a modern RBG image (PNG, JPEG, etc) into a proper binary file that only need to be loaded in PAGE 1 or PAGE 2 in order to be displayed!

I called this program Rgb2Hires 😉

You can get it on Github.

Of course it’s not magical: you’ve got to respect the caveats exposed before.
In particular, your source image must be only composed of Black, White, Purple, Orange and Blue pixels.
And it has to be 140×192 pixels. Remember that HIRES pixels are anamorphic: they are not square, but twice wider than tall! For easy drawing, Photoshop or equivalent programs can be told to work on anamorphic pixels. Thus the image in the editor will appear with the same proportion that on the Apple II.

Img7. Game of Life’s Title Screen, PNG exported from Photoshop


Img8. Game of Life’s Title Screen, HIRES image in AppleWin

6502 Assembly language

Coding in Assembly for an Apple II

In my last post, I wrote how I tried to code a port of Conway’s Game of Life for my Apple II. This port was written in C and relied on a cross compiler suite (cc65), generating binaries for the Apple’s CPU : the MOS 6502. The conclusion was that the port was viable. But unfortunately, it was also way too slow. Indeed, the 6502 is a primitive 8 bit CPU that is not well suited to be targeted by a high level compiled language such as C.

It was time to speed things up and re-write the inner loops in assembly!

I will begin by presenting the results I obtained, before explaining the “workflow” I followed to produce binaries, upload then debug them.

The code can be cloned from my Github repository:

Results obtained

I concentrated my optimizing efforts on the main “updating” function. Indeed, it is responsible to determine which cell will die and where new ones will spawn, therefore is the most time-consuming part of the program.

  1. Rewriting the most inner function, “count_neighbours“, in assembly.
    • In my previous post, I noted that the code generated by the compiler was around 220 opcode long, which was huge.
      After my rewriting the function shortened to only 30 opcode long!
  2. Rewriting the C code of the outer loop in order to make it more 6502-friendly.
    • Using __fastcall__ calling convention, leading to the use of the accumulator and the X register to pass parameters instead of the “fake” stack.
    • Eliminating indirect addressing ( ie. [X][Y] ) which are specially expensive.
  3. Rewriting all the update function in assembly.

Incidentally, I also abandoned the ASCII display functions provided by the compiler suite, and wrote some assembly functions to replace them. I then abandoned the original text mode to draw the cells in color 😉

And here is the normalized execution time, for 30 iterations, of all these versions. Of course, the less, the better.

Normalized execution time of iterated versions

A 18 time speed-up! Writing assembly is not always easy, but it is definitely worth it!

The 6502 opcodes are few and “logical”, thus they are quite easy to learn and use. However, as I mentioned in my previous post, this CPU is primitive by today standards. Its resources are scarce and I have to admit that I was not used to 8 bit arithmetic: ceiling at 255 can be quite frustrating!
Especially when branching. The 6502 provides very efficient conditional branching instructions. The drawback is that you can only jump 128 bytes backwards and 127 bytes forward. As most instructions are 2 or 3 byte long, you have to code tight and carefully layout your instructions if you want to use them!

Remember that most of the code (all the logic, state machine, textual displays…) is still written in C. This remaining of code is less systematic and would be much more difficult to write in ASM. But it does not matter much, as the time spent in these parts is also much smaller.

To conclude, I would therefore say that nowadays it is not too difficult to write a program for Apple II computers. Thanks to cc65, you can write the whole program in C and, if you require more performance, rewrite the most time consuming function in ASM. And thanks to modern editors and emulators, the process of writing, testing and debugging is way more easy and pleasant than it was thirty years ago!

Quick workflow

When coding such a small project in C, I could live without a debugger. But such was not the case anymore when I started to rewrite some parts in assembly.

  1. I was not familiar with the 6502 architecture and opcodes, so it was unthinkable that I could produce bug free binaries at my first attempts
  2. I don’t know how to debug a low level program without a proper debugger

Thus I established my (quick and dirty) workflow.

The assembler

That’s the easy part, as the cc65 compiler suite provides a macro assembler simply called ca65. But I did not want to rewrite all my program, only the inner loops. Thus I had to learn the compiler’s calling convention, aka how to call a subroutine written in ASM from the C code. And return from it… Fortunately, this part is well documented.
The only caveat I could not overcome was to access the symbols declared in C. Thus, I introduced a function, init_asm, to pass their address to my “ASM world”.

Producing a disk image

The linker will produce a perfectly viable Apple II executable file, but  loading it on an emulator requires to embedded it on a disk image. For that purpose, I used AppleCommander. It is a utility (unfortunately written in java) which purpose is to manipulate Apple II’s disk images. And one bright spot is that it can be used from the command line, thus allowing to invoke it as a build step in the Makefile.

Command to add the executable to the disk image:

Command to remove the executable from the disk image:

The debugger

In order to test my program and debug it, I used an emulator: AppleWin. It is quite accurate, but most important, it also features a competent debugger! Of course, it is rougher to use than a modern one. But hey, if you’re coding in 6502 assembly that won’t stop you! 😉

At any moment, you can enter this debugger by pressing F7. It is then quite easy to place breakpoints, run until a specific address or inspect the memory. Unfortunately, after exiting the debugger the program often does not resume correctly. Thus I could not reliably place my breakpoints before launching my program :/ So I often had to place an infinite loop at the beginning of my code. When halted, I then entered the debugger to manually modify the program counter in order to exit this loop. Then I could ask the debugger to run until my area of interest.

AppleWin's integrated debugger

AppleWin’s integrated debugger