Building an Editor

In order to write assembly on Atom, I need to be able to write text on Atom. Technically I could write text externally and build a way to import it in, but I still want to avoid writing tools externally. The regex assembler has proven to be invaluable, but I would like to retire it and replace it with something better. At that point, I want to be back to all internal tools.

Additionally, as far as I can see the editor will be simpler than the assembler, so it is also a good ramp up into writing a larger program. Writing more assembly will give me a better idea if there are things I want to change before writing the assembler. It will be harder to change things there for a while, so it would be nice to get things correct initially.

This page describes the process of implementing the editor between #atom-io and #atom-editor in the git repository.

Designing the editor

To start with, I need to figure out what an editor needs to be able to do. The key difference between the editor and the simple I/O program I already made is the ability to edit text at any point; I need to be able to insert and delete content, not just append.

In order to come up with a good design for the interface of the editor, I do need to know a little bit about the technical limitations I am facing. Copying large chunks of data is going to be slow, and it is not going to be performant to just maintain a giant array of characters. Instead of one giant string, the text will be represented as a doubly linked list of strings. These could be free form, but I am going to have each one represent a single line. This simplifies the handling of lines, keeping the program smaller for now. Each of these represent a line of limited length. These lines can be arbitrarily modified, and it is easy to insert and delete lines from anywhere in the text. This means that it will be very natural to create commands that work on lines.

Editor Commands

The editor will always have a line selected, which determines how some of the commands work. I will also allow selecting the 0th line which will keep the total number of command small. It also is what is selected if there are no lines.

Insert: You need to be able to add lines to the string. A line will be inserted after the currently selected line and you will be prompted for its content. If you have the 0th line selected, the new line would be added to the beginning of the string. The new line will become the selected line, which means that repeated inserts insert one after another.

Print: You need to be able to print out the text you have written so far so that you can verify it. This command prints all of the text to the terminal, and includes the line numbers for you to select later. This command also indicates the selected line as it prints it.

Select a line: You need to be able to select a line, and will do so by its number given by print. The command itself will just be the number of the line, which updates the selected line.

Delete: You need to be able to delete lines. This command will delete the currently selected line. I am also going to not add a replace command for now because you can achieve it with delete and insert. To help with this, deleting a line causes the previous line to be selected. This means a newly inserted line is in the place of the deleted line.

These commands are sufficient for building a string in memory, but I definitely need to be able to save the strings to the storage peripheral. This is how I would actually move them between programs.

Write: Write the contents of the string to the storage peripheral.

Read: Read the contents of the storage peripheral into the line list.

This is the full set of commands that I will implement here. They should give a technically complete writing experience, though it will certainly have limitations.

Future additions

A few ideas for future additions that instantly jump out at me.

Local print: Because the print command always prints everything, long programs will be somewhat difficult to work with (and probably slow too). A natural addition that I would guess I will make later is a local print command that prints lines around the selected one.

Replace: As mentioned before, this behavior can be achieved. However, since this is the way you edit a line I will probably do it a lot, so having a dedicated command will likely be nice.

Search: Being able to search for a string will let you jump around the text much more quickly. For instance, you could search for a label to get to a procedure. I would want you to be able to find duplicates, and a simple way to do that would be to start from the line after the selected one. Selecting line 0 and searching will always find the first result, even if in the first line.

It would also be great to have some kind of copy and paste functionality, preferably over a range.

Implementation

Reading user input

Reading user input is exactly what the simple I/O program from the previous page did. However, that program directly wrote all data to storage. I want to keep program data in RAM because it is controlled entirely by the program and I don’t have to worry about anything important being there.

Data has to be written somewhere, so I need to reserve space in RAM that I can write the input into. I will write a procedure to get user input, and it will have two inputs: one is the address of a this buffer, and the second is the size of the buffer.

This READ_STR procedure handles reading data from the peripheral and showing it on screen so that you can see what you are entering as you enter it. This program is very similar to the one in the previous page, but using WRITE instead of P0WRITE.

For now I am null terminating all strings because it is convenient. I consider the length of the string as the number of non-null characters, which means that you actually need to have one more word reserved in memory.

New assembly syntax

I am now using ampersands to mean “get the address of a label.” This means different things to different instructions. Jumps end up with the immediate offset required to jump to the labeled instruction. The SETHI and SETLO instructions use the literal address, but SETHI takes the top 8 bits and SETLO takes the bottom 8 bits. This means that:

SETHI r0 &LABEL
SETLO r0 &LABEL

sets r0 to the address of the label LABEL.

I will also start to allow 4 digit hex numbers, which just represent that the data should be set to that literal value. Ultimately, assembly instructions are just a specific 4 digit hex number, and this allows you to write arbitrary data. You can also use an ampersand as this data, which means to write the address of the labeled instruction into memory, essentially as a pointer. This will be very useful throughout this program.

The input buffer

With these new tools, I can reserve space in memory that I can use as the actual input buffer. This would be something like this:

;INPUT_BUFFER
0000
0000

;...

SETHI r0 &INPUT_BUFFER
SETLO r0 &INPUT_BUFFER

This will set r0 to the address of those two zero words, which you could pass into the READ_STR procedure along with the number 0x0001 as you can only write one character here.

Because there is only one buffer, you could not use it in recursive functions. This is okay for this program, because the core loop that gets this user data will not recurse. All other strings will go into other space that I will allocate later.

This would be a good candidate for stack data, but right now I have no way to get an address there. I need to be able to grab the stack pointer, after which I could push zeros to reserve space. However, it is totally un-needed in this case so I have not invested time into that yet.

Printing strings

Another common operation that I need to implement is printing a string. The old I/O program always maintained the string in the output, but now I need to print a string at an address in memory. As a simple test I can print out the string I read from the user with READ_STR.

Printing the string is just a while loop that checks if the character is null. If it is not, I print it out and move on to the next character. This procedure is called PRINT_STR, and is actually very short:

;=PRINT_STR
; Print a null terminated string to the P1
; terminal
;
; Params:
; r0 - address of the string to write

SETHI r1 0x00
SETLO r1 0x01

;LOOP
READ r0 0x0 r2

CMP r2 r1
JL 0x03 ;&DONE

P1WRITE r2 r2
ADD r0 r1
JMP 0xFA ;&LOOP

;DONE
RET

Improving the user experience

Later it will be hard to tell the difference between input and output, so I wanted to add a prompt for string inputs. To do this, I decided to declare a string in memory, such that I can grab the pointer to pass to PRINT_STR.

;INPUT_PROMPT
003E ;'>'
0020 ;' '
0000 ;null

;...

SETHI r0 &INPUT_PROMPT
SETLO r0 &INPUT_PROMPT

CALL &PRINT_STR ; assumes that PRINT_STR is within immediate range

This is probably slightly overkill in this case because the prompt is just two characters, but representing strings is something I will need to figure out in general.

Register preservation in procedures

As this is the first big program I am writing, I need to officially establish more rules for the Application Binary Interface (ABI) regarding procedures. Procedures are going to be essential in managing complexity, and defining rules around register preservation will keep things simple enough to manage.

Right now there are no rules. A procedure could use any register, changing its value. All you know is a contract, defining how you need to setup registers before a call, and telling you about the state of some registers afterwards. If you as the caller need a value to be preserved, you must push it on the stack before the call and can pop it after.

The ABI will state that for all procedures, the contents of some register must be the same at execution start and execution end. This means that the caller does not have to push those values onto the stack. Of course, the work is shifted to the callee, but only if it uses the register. If it doesn’t, it can simply do nothing.

More specifically, the ABI states that register 8 through F must be preserved. It doesn’t actually stipulate anything about what registers should be used for arguments or returns, but if you need to be able to return a new value you would have to use use registers 0 through 7. Typically, I will use registers 0 through 7 in order for both parameters and returns because it helps me keep everything clear in my head, but I don’t actually see a reason to enforce that at this point.

Issuing commands

In order to select a command to run, I will use READ_STR with the input buffer. To keep things simple, I will just look at the first character. That character will be read into a register, say r3. Each command is responsible for checking that character and jumping to the next command if it doesn’t match, and jumping to the end (skipping all other commands) once it is complete:

;COMMAND
SETHI r4 0x00 ;'c'
SETLO r4 0x63
CMP r3 r4
JNE &NEXT_COMMAND

; Insert command itself goes here

JMP &END

;NEXT_COMMAND

Each of these checks can rely on r3 being the character because this chain of comparisons is will never changes its value. The command can do anything it wants with r3 because once it is over it must skip the other commands.

(i)nsert

Inserting data into the text is the first command I need to write, because without any lines none of the other commands actually do anything.

The line list

The first thing I need to do is actually design the doubly linked list data structure itself. As mentioned in the design, a doubly linked list will allow for lines to be inserted, deleted, and re-ordered.

A node is some data in memory M. Each node in the list will have:

The address of the next node at M
The address of the previous node at M+1
46 words of string data starting at M+2 and ending at M+47

The choice of 46 words of string data is fairly arbitrary. I wanted to balance line length and memory consumption, but it probably will need to be adjusted later. Also recall that 46 words of string data means 45 characters.

Allocation

This line list is going to dynamically grow, and that means that I need a way to mange where I put these nodes. This is essentially a simple allocator. Primarily I need to know where the memory I can use starts, and have a way to grab the next chunk of it.

I will start with representing where there is memory that I can allocate. To do this, I need to label the end of my program, and I also need to write that address to an address as well:

;HEAP_PTR
&HEAP
;HEAP
0000

That 0x0000 value is the first value in the heap. The address of that value is written into memory somewhere before it, which also has a label. This means that I will be able to retrieve it in the ALLOC procedure:

;=ALLOC
; Reserve space on the heap
;
; Params:
; r0 - number of words to be reserved
;
; Returns:
; r0 - address of allocated space

SETHI r1 &HEAP_PTR
SETLO r1 &HEAP_PTR

READ r1 0x0 r2

MOV r2 r3
ADD r3 r0
WRITE r1 0x0 r3

MOV r2 r0

RET

The HEAP_PTR is always the address of the start of the next free piece of memory. The ALLOC procedure returns that value, but before giving it to you adds the value you gave it to HEAP_PTR in memory. The result is that each call to ALLOC gets a unique range of words, starting from whatever address is returned in r0.

This allocator has no concept of “freeing” memory, which is part of what makes it so simple. Freeing would require keeping track of what parts of memory are used and free which would require more meta data.

Heads and tails

The doubly linked list needs a head and a tail, and these need to live somewhere in memory. All line nodes will be placed between these nodes.

In their initial state, the head and tail point at each other. Using the ampersand syntax I have already described, these nodes can be setup directly during assembly:

;LINE_LIST_HEAD
&LINE_LIST_TAIL
0000
;LINE_LIST_TAIL
0000
&LINE_LIST_HEAD

Inserting a line

Inserting a line is just a simple combination of allocation and updating pointers in the given node, its next node, and the newly allocated node.

;LINE_LIST_INSERT
; Insert a new entry into the line list after
; the given node
;
; The given node MUST HAVE a next node
;
; Params:
; r0 the address of the node to insert after
;
; Returns:
; r0 - the address of the newly inserted node

PUSH r0

SETHI r0 0x00
SETLO r0 0x30

SETHI r7 &ALLOC
SETLO r7 &ALLOC
CALL r7

POP r1
READ r1 0x0 r2

WRITE r0 0x0 r2
WRITE r0 0x1 r1
WRITE r1 0x0 r0
WRITE r2 0x1 r0

RET

Because a node only needs to have a next node, you can use this function with a pointer to LINE_LIST_HEAD to insert the first line.

The insert procedure returns the address of the inserted node. To get user input into that node, I can pass the address of the string data within the node to READ_STR with the correct length.

MOV r8 r0
SETHI r1 &LINE_LIST_INSERT
SETLO r1 &LINE_LIST_INSERT
CALL r1

MOV r0 r8

SETHI r1 0x00
SETLO r1 0x02
ADD r0 r1 ; data is offset by 2, so pass that

SETHI r1 0x00
SETLO r1 0x2D

SETHI r2 &READ_STR
SETLO r2 &READ_STR
CALL r2

The insert command also prints a newline after reading the string, otherwise the next command will be entered on the same line which is confusing.

SETHI r1 0x00
SETLO r1 0x0A
P1WRITE r1 r1

(p)rint

The core of the printing loop is fairly simple. I need to walk the line list from start to end with a cursor, doing something with each node. Because I am printing data, I need to be careful about not printing non-existent data from the head or tail. Doing this is pretty simple.

I never want to do anything with the head, so the first node I want to check is always the next node after the head. I can grab that by getting the &LINE_LIST_HEAD into a register and reading the next pointer out of the node.

SETHI r9 &LINE_LIST_HEAD
SETLO r9 &LINE_LIST_HEAD
READ r9 0x0 r9

If there are no nodes, the pointer that I just read will be the same as &LINE_LIST_TAIL. In order to be able to check that, I need to load that into a register as well.

SETHI rA &LINE_LIST_TAIL
SETLO rA &LINE_LIST_TAIL

I am using r9 and rA here because they will be preserved across calls to PRINTLN_STR that are used for each individual line. Before printing, I need to check the current node (in r9) is not the tail. If it is, printing has completed.

;PRINT_LOOP
CMP r9 rA
JEQ &MAIN_LOOP

If they are not equal, grab pointer to the start of the string, which is always at the node pointer plus two words. Pass that into PRINTLN_STR (a new variation of PRINT_STR that prints an newline afterwards).

MOV r9 r0
SETHI r1 0x00
SETLO r1 0x02
ADD r0 r1 ; data is offset by 2

SETHI r4 &PRINTLN_STR
SETLO r4 &PRINTLN_STR
CALL r4

To make this into a loop, I advance to the next node in the line list by reading the next pointer from the current node and then jump back to the condition.

READ r9 0x0 r9

JMP &PRINT_LOOP

Printing numbers

An important part of the print command is printing the line numbers with each line. That means I need to be able print a decimal number. The algorithm is pretty simple, I just need to be able to divide by multiples of 10.

Why not divide in hardware?

If you dig into the implementation, you will see that I did not implement a division instruction even though Digital has all the tools I would need. This is not because I don’t want a divide instruction, but because division needs to give me back to values: the quotient and the remainder (which Digital already gives). However, I don’t have a way to return to values from an instruction.

I have a longer term vision for returning two values. A special register could save the value, and a separate instruction could be used to retrieve that value and place it in a standard register. However, I don’t want to implement both of these things right now and I used a simple for loop. It isn’t efficient if the expected quotient is large, but if it is small (and in this case it will never exceed 9) it is good enough.

Because words are 16 bits, I know the maximum unsigned value is 65535 with 5 digits. That means I can start by dividing by 10000. The quotient will be a single digit from 0 to 9 that I can print. The remainder is the rest of the number. This does mean that I print out leading zeros, and I would have to do some extra work to remove them. However, I actually don’t want to exclude the zeros because I want each number to be the same width. Continue with 1000, 100, 10, and 1.

To print the numbers, I keep track of a line number as I iterate through the line list. Before printing each line, I print this number.

Indicating the current line

You need some way to know what line is currently selected because your next command may be dependent on that. At least to start, the print commands indicates that line with a *. Determining what line to do that on is really simple: as you walk the list, if the cursor is equal to the selected line, print the * character in place of the usual .

Selecting line (N)

To select a line, you just enter its number. This means that identifying the command will work (slightly) differently than the others because it has 10 trigger characters: each of the digits 0 through 9.

The other commands have made a single comparison and used a JNE command to skip to the next. Because the ASCII codes for the characters 0 through 9 are all contiguous, I can just check the two boundaries:

SETHI r4 0x00
SETLO r4 0x30 ;'0'
CMP r3 r4
JL &NEXT_COMMAND
SETLO r4 0x39 ;'9'
CMP r3 r4
JG &NEXT_COMMAND

Parsing numbers

Once determining that the command starts with a number, I need to actually parse it into a value in a register. This is basically the opposite of printing.

I need to walk through each digit in the string and “append” it to the end of the value. In this case, to “append” means to multiply the old value by 10, then add the digit’s value.

Multiplying in hardware

Like dividing, multiplication can be implemented in the hardware. It also can produce an extra value, because multiplying two N bit numbers can result in a number with 2N bits. The top N bits would need to be placed in the special register and loaded with the special instruction. Additionally, you need to know if the result uses those top N bits.

I decided to not implement division in hardware at this time, but I did decide to implement multiplication in hardware now. This is because I only need the normal value to parse valid numbers here, so I only need the core of the instruction. The implementation is technically incomplete, but well defined enough for my usage here.

The multiplication is unsigned, and the flags are weird. Only the carry flag has a real meaning, and I note that you cannot use any of the other flags. The carry flag tells you if any bit is set in the upper 16 bits (that you will not get back in the target register) is set. In the future, this would indicate that you need to handle a special case because there is more data.

The negative flag is meaningless. It just tells you if the highest bit in the lower 16 bits is high. But, the multiplication is unsigned so you shouldn’t ever need it anyways. In a smimilar way, the zero flag also only considered the lower 16 bits. This means it is well defined iff the carry flag is also zero. I just recommend not using it. The overflow flag is unchanged from subtraction, which means its value is also meaningless.

In all of these, I could define something useful. I don’t really need it here, and the real limiting factor is that I would need to rework the ALU fairly deeply, though I think that will be needed later anyways. Right now, the ALU determines the flags from the result (except for overflow), and I would need to modify it so that each operation defines its own flags and I switch between those. That would allow me to, for instance, examine all 32 bits of the multiplication output to check if the number is zero.

Walking the line list

After the number has been parsed, I need to update the selected line. The entries in the line list have no idea what number they are. This is an intentional decision because I don’t want to update every subsequent line number when a line is inserted or deleted. Because of that, I always have to walk the list.

I have already talked about walking the list in printing. This will be very similar, but with an important difference. Recall that you can select line 0. In the program, this means that the current line points at the head of the list, which is a node without any string content. Because walking the line list doesn’t actually interact with the string, I don’t have to worry about it here.

In the program itself, this means that I will not start at the node after the head, I will start with the head itself. In the procedure that selects the line, the line number can be considered the number of lines left to traverse. To select the correct line, I need to walk through the list decrementing that number until that number is 0. If given number was 0, the loop wont execute a single time, and so the head node will be selected. If it were 1, I walk forwards one time ending up on the first line.

The loop also needs to end if the cursor ever ends up on the list tail. This line is unselected able, and in this case the last line will be selected.

(d)elete

Deleting a line is a very simple operation in the doubly linked list I have created. Since I always have the address of a node, I just need to point the previous node at the next node, and the next node at the previous node. I do have to be careful to not attempt to do this process if the list head, as its previous value is not defined, and I want to keep it constant.

After deleting a node, its memory could be re-used. You would eventually exhaust the heap if you inserted and deleted repeatedly. This is an issue that will be addressed later after the problem gets worse.

With the completion of this command, you can now fully edit in memory!

Pre-computing jumps

As I have been adding these features, I have gone through the process of assembling the program many times. The regex assembler can convert literal instructions into hex, but what it cannot do is compute jump offsets or the addresses of procedures or data. That process is still manual.

One thing that I can do to reduce the work of assembling is to pre-compute jumps if they are not sensitive to changes in the overall program. This means that addresses cannot be pre-computed, but control flow can be in many cases. This means that I don’t have to redo this work every time I assemble the program.

Take the following example:

;LOOP
CMP r0 r1
JLE &DONE

ADD r0 r2

JMP &LOOP

;DONE
HALT

It can be pre-computed to the following:

;LOOP
CMP r0 r1
JLE 0x02 ;&DONE

ADD r0 r2

JMP 0xFC ;&LOOP

;DONE
HALT

Now I don’t have to figure out those values every time. I typically leave the label behind as a comment so that I can tell where it is supposed to go. This helps if you make a mistake, or forget to change it if you add or remove instructions between the jump and the destination.

(w)rite

In order to transfer the text you have written to another program like the assembler, you need to be able to write it to persistent storage. You will also be able to read the written file back into the editor, allowing you to work in multiple sessions or modify an existing program.

Writing itself is very simple. I have already implemented several commands that walk the line list. Because writing deals with the string content, I need to start with the node after the head just like print. However, instead of just calling PRINT_STR I also need to keep track of an address to write data to in storage. This address persists betweens lines.

The data always starts from 0x0000. Each line ends with a newline (0x000A). The whole document is ended with a null terminator (0x0000).

Cleaning up

As I get towards the end of writing the editor, it has become the largest program I have written throughout the project. There are some smaller mistakes that I have corrected over the course of writing it that make it easier to work with that I want to highlight.

Standardize names

Assembly is ultimately just a big list of instructions. The structure comes in the form of labels. Right now, labels are actually just comments. Comments are indicated by a ;. Labels are comments that start at the beginning of the line, and in which the second character is not a space. That is, ;LABEL is a label, but ; label is a comment. I have also been using upper case for labels, which helps distinguish them from textual comments.

Labels are used all over. They are part of control flow, but also part of procedures. These tend to be used differently, and I want to be able to identify procedures in particular. To that end, I have started labeling procedures with names starting with =, like ;=READ_STR. This makes it easy to search for.

Some of the earlier procedures like ALLOC were missing this equal. I went back and added it to keep everything consistent.

Make names searchable

Keeping things searchable is most about making them unique. The particular problem here was the write procedure. It was just called ;=WRITE, and the easy way to search for it was using WRITE. However, this conflicts with the WRITE instruction, so you get a lot of results. The same thing applies to READ later.

I updated the names of WRITE to WRITE_P0, which is searchable, and also more clear about where the data is going.

(r)ead

In order to read a file, I need to read through storage and copy out the lines that have been saved there. However, before I do that I need to clear out the line list by deleting everything.

Deleting everything is straightforward, I just walk the line list calling delete on every node. I do need to be careful about getting the address of the next node before the node is deleted. Once the node is removed form the list, its next and previous pointers shouldn’t be used.

Reusing line-list entries

Reading now results in mass deletion of lines. This make the memory exaustion issue I talked about before worse. I need to implement some way to re-use the lines. I could implement a FREE procedure, but it would require a rework of the allocation and layout of memory. This is because I need to keep meta data bout the structure of the heap in the heap. This will be worth it at some point, and I will go into more detail then. However, for now there is a much simpler option.

The simpler solution is to use a free list. This is a list of line list nodes, instead of being something the allocator needs to know about. I know that a line list node always has space for a next pointer, so the free list can be build directly using the nodes. The free list only needs to be singly linked, because I just need to be able to push and pop nodes.

I will need an additonal list head, LINE_LIST_FREE. This starts out as null. To inset a node into the list, set its next address to the value at LINE_LIST_FREE and write the nodes address to LINE_LIST_FREE. To take a node, read the value in LINE_LIST_FREE, and write the value of its next back into LINE_LIST_FREE.

When deleting a node, add it to the free list after removing it from the main line list. Then when adding a new node, first check if the value at LINE_LIST_FREE is not null. If it isn’t, use that instead of calling ALLOC.

After making this change, I encountered a strange bug. After re-using a node, the line indicator would change from * to an invalid character. It turned out that I was accidentally writing 0x0000 (a NOOP) directly into the print command, and it just happened to overrwite the SETLO r7 0x2A that setup the *. This resulted because I overwrote the value I read from LINE_LIST_FREE and wrote data back into the wrong address.

I also discovered a bug in the read command. The strings were not being null terinated. This wasn’t a problem when nodes were brand new, as they contained all zeros. However, if it already had data that isn’t true. If the string read from storage (e.g. HALT) was shorter than the old content (e.g. SETLO r0 0x00), you would only change the start of the string (e.g. HALTO r0 0x00).

Reflections

Improving the assembly experience

Because this is the largest program I have written, it gives the opportunity to reflect on the challenges of writing assembly and what the full assembler should be able to do. At the moment, this means improvements to the regex assembling process.

Assembly documentation

The regex assembler has been somewhat ad-hoc. This has resulted in a lack of clarity about individual instructions. When I forget an instruction, I have to work my way from the regex all the way down to the instructions. This isn’t so bad because I am still pretty close to the instructions, but the experience should be better.

In the regex assembler, this documentation looks like a comment that simply tells you what the instruction does in concrete terms. For example:

" WRITE r0 0xA r1 ; write r1 to memory at r0 + 0xA

The WRITE instruction is particularly difficult to remember because the order of the arguments in the assembly instruction doesn’t match the order in the machine instruction at all. The machine instruction itself uses the order that makes the most sense for the hardware implementation. The order is different because it makes more sense. When making the full assembler, documentation will be equally important.

Assembly argument ordering

In a similar vein, I should standardize the order of arguments to help with recollection. The MOV instruction is an example of an instruction that doesn’t follow the same pattern as similar instructions. For most register operations, the first argument is the register that will change value, like ADD r0 r1 where r1 is added to r0. Move, however, modifies the second argument. MOV r0 r1 will modify r1, updating it to r0.

Preparing assembly

While writing this program, I added a new tool called regex-prepare. This script collapses the program so that each line is an individual instruction. Whole line comments are removed, and labels are moved onto the line they label. Doing this makes it easy to figure out addresses and jump offsets because the line numbers map almost directly to instruction addresses (they are off by one).

This mirrors a process that I expect to do in the real assembler. In the same way, I will need to figure out the address of every instruction and label.

Indenting

The prepare script also allows me to indent lines, which can aid in readability. A loop or a procedure can be indented so that you can visually see where it ends. Side note, I am already starting to see that the character limit will become an issue pretty quickly. I will need to worry about conserving characters, and to that end I should use literal tab characters for indentation.

Common mistakes

The prepare and assemble scripts help a lot, but computing and filling out offsets and addresses are still leave room for mistakes. This is something that the assembler will help with.

Offset errors and typos

It is pretty easy to make mistakes when calculating offsets, and I will sometimes make typos both when converting and entering converted numbers. These error cause bugs that can be difficult to track down, with smaller errors sometimes being more difficult to find.

Similarly, it is easy to forget to update pre-computed offsets. This would be true in the assembler as well, but I should be able to do less pre-computation.

End of a doubly linked list

This is unrelated to the assembly, but I found myself making the same mistake around ignoring the end of the doubly linked list. I use a cursor, and would iterate until it was zero (null). However, this means that the loop executes for the tail. The solution is really simple, just iterate until the cursor is equal to the tail.

Including binaries

For most of this implementation, I only included the assembly code. Towards the end, I realized that I should be including the binary file as well. This is true for multiple reasons. For one, assembling is still manual, and I don’t want to repeat it if I go back to a previous commit. If you are following along, you also probably don’t want to have to assemble by hand either. This will get better with the assembler, but I should keep including the binary.

Another good reason to include the binary is for demonstrating bugs in assembly. Sometimes these bugs are a one off mistake, so even if I assemble again I won’t repeat the same thing. Some of these have interesting behaviors that I would like to talk about.

Next Steps

Now that I can write text, I have a good way to write assembly. I want to start working directly towards writing the assembler itself. The first step will be to figure out the broad design.