Building an Editor
In order to write assembly on Atom, I need to be able to write text on Atom. Technically I could write text externally and build a way to import it in, but I still want to avoid writing tools externally. The regex assembler has proven to be invaluable, but I would like to retire it and replace it with something better. At that point, I want to be back to all internal tools.
Additionally, as far as I can see the editor will be simpler than the assembler, so it is also a good ramp up into writing a larger program. Writing more assembly will give me a better idea if there are things I want to change before writing the assembler. It will be harder to change things there for a while, so it would be nice to get things correct initially.
This page describes the process of implementing the editor between #atom-io and #atom-editor in the git repository.
Designing the editor
To start with, I need to figure out what an editor needs to be able to do. The key difference between the editor and the simple I/O program I already made is the ability to edit text at any point; I need to be able to insert and delete content, not just append.
In order to come up with a good design for the interface of the editor, I do need to know a little bit about the technical limitations I am facing. Copying large chunks of data is going to be slow, and it is not going to be performant to just maintain a giant array of characters. Instead of one giant string, the text will be represented as a doubly linked list of strings. These could be free form, but I am going to have each one represent a single line. This simplifies the handling of lines, keeping the program smaller for now. Each of these represent a line of limited length. These lines can be arbitrarily modified, and it is easy to insert and delete lines from anywhere in the text. This means that it will be very natural to create commands that work on lines.
Editor Commands
The editor will always have a line selected, which determines how some of the commands work. I will also allow selecting the 0th line which will keep the total number of command small. It also is what is selected if there are no lines.
Insert: You need to be able to add lines to the string. A line will be inserted after the currently selected line and you will be prompted for its content. If you have the 0th line selected, the new line would be added to the beginning of the string. The new line will become the selected line, which means that repeated inserts insert one after another.
Print: You need to be able to print out the text you have written so far so that you can verify it. This command prints all of the text to the terminal, and includes the line numbers for you to select later. This command also indicates the selected line as it prints it.
Select a line: You need to be able to select a line, and will do so by its number given by print. The command itself will just be the number of the line, which updates the selected line.
Delete: You need to be able to delete lines. This command will delete the currently selected line. I am also going to not add a replace command for now because you can achieve it with delete and insert. To help with this, deleting a line causes the previous line to be selected. This means a newly inserted line is in the place of the deleted line.
These commands are sufficient for building a string in memory, but I definitely need to be able to save the strings to the storage peripheral. This is how I would actually move them between programs.
Write: Write the contents of the string to the storage peripheral.
Read: Read the contents of the storage peripheral into the line list.
This is the full set of commands that I will implement here. They should give a technically complete writing experience, though it will certainly have limitations.
Implementation
Reading user input
Reading user input is exactly what the simple I/O program from the previous page did. However, that program directly wrote all data to storage. I want to keep program data in RAM because it is controlled entirely by the program and I don’t have to worry about anything important being there.
Data has to be written somewhere, so I need to reserve space in RAM that I can write the input into. I will write a procedure to get user input, and it will have two inputs: one is the address of a this buffer, and the second is the size of the buffer.
This READ_STR procedure handles reading data from the peripheral and showing
it on screen so that you can see what you are entering as you enter it. This
program is very similar to the one in the previous page, but using WRITE
instead of P0WRITE.
For now I am null terminating all strings because it is convenient. I consider the length of the string as the number of non-null characters, which means that you actually need to have one more word reserved in memory.
New assembly syntax
I am now using ampersands to mean “get the address of a label.” This means
different things to different instructions. Jumps end up with the immediate
offset required to jump to the labeled instruction. The SETHI and SETLO
instructions use the literal address, but SETHI takes the top 8 bits and
SETLO takes the bottom 8 bits. This means that:
SETHI r0 &LABEL
SETLO r0 &LABEL sets r0 to the address of the label LABEL.
I will also start to allow 4 digit hex numbers, which just represent that the data should be set to that literal value. Ultimately, assembly instructions are just a specific 4 digit hex number, and this allows you to write arbitrary data. You can also use an ampersand as this data, which means to write the address of the labeled instruction into memory, essentially as a pointer. This will be very useful throughout this program.
The input buffer
With these new tools, I can reserve space in memory that I can use as the actual input buffer. This would be something like this:
;INPUT_BUFFER
0000
0000
;...
SETHI r0 &INPUT_BUFFER
SETLO r0 &INPUT_BUFFER This will set r0 to the address of those two zero words, which you could pass
into the READ_STR procedure along with the number 0x0001 as you can only
write one character here.
Because there is only one buffer, you could not use it in recursive functions. This is okay for this program, because the core loop that gets this user data will not recurse. All other strings will go into other space that I will allocate later.
This would be a good candidate for stack data, but right now I have no way to get an address there. I need to be able to grab the stack pointer, after which I could push zeros to reserve space. However, it is totally un-needed in this case so I have not invested time into that yet.
Printing strings
Another common operation that I need to implement is printing a string. The old
I/O program always maintained the string in the output, but now I need to print
a string at an address in memory. As a simple test I can print out the string I
read from the user with READ_STR.
Printing the string is just a while loop that checks if the character is null.
If it is not, I print it out and move on to the next character. This procedure
is called PRINT_STR, and is actually very short:
;=PRINT_STR
; Print a null terminated string to the P1
; terminal
;
; Params:
; r0 - address of the string to write
SETHI r1 0x00
SETLO r1 0x01
;LOOP
READ r0 0x0 r2
CMP r2 r1
JL 0x03 ;&DONE
P1WRITE r2 r2
ADD r0 r1
JMP 0xFA ;&LOOP
;DONE
RET Improving the user experience
Later it will be hard to tell the difference between input and output, so I
wanted to add a prompt for string inputs. To do this, I decided to declare a
string in memory, such that I can grab the pointer to pass to PRINT_STR.
;INPUT_PROMPT
003E ;'>'
0020 ;' '
0000 ;null
;...
SETHI r0 &INPUT_PROMPT
SETLO r0 &INPUT_PROMPT
CALL &PRINT_STR ; assumes that PRINT_STR is within immediate range This is probably slightly overkill in this case because the prompt is just two characters, but representing strings is something I will need to figure out in general.
Register preservation in procedures
As this is the first big program I am writing, I need to officially establish more rules for the Application Binary Interface (ABI) regarding procedures. Procedures are going to be essential in managing complexity, and defining rules around register preservation will keep things simple enough to manage.
Right now there are no rules. A procedure could use any register, changing its value. All you know is a contract, defining how you need to setup registers before a call, and telling you about the state of some registers afterwards. If you as the caller need a value to be preserved, you must push it on the stack before the call and can pop it after.
The ABI will state that for all procedures, the contents of some register must be the same at execution start and execution end. This means that the caller does not have to push those values onto the stack. Of course, the work is shifted to the callee, but only if it uses the register. If it doesn’t, it can simply do nothing.
More specifically, the ABI states that register 8 through F must be
preserved. It doesn’t actually stipulate anything about what registers should be
used for arguments or returns, but if you need to be able to return a new value
you would have to use use registers 0 through 7. Typically, I will use
registers 0 through 7 in order for both parameters and returns because it
helps me keep everything clear in my head, but I don’t actually see a reason to
enforce that at this point.
Issuing commands
In order to select a command to run, I will use READ_STR with the input
buffer. To keep things simple, I will just look at the first character. That
character will be read into a register, say r3. Each command is responsible
for checking that character and jumping to the next command if it doesn’t match,
and jumping to the end (skipping all other commands) once it is complete:
;COMMAND
SETHI r4 0x00 ;'c'
SETLO r4 0x63
CMP r3 r4
JNE &NEXT_COMMAND
; Insert command itself goes here
JMP &END
;NEXT_COMMAND Each of these checks can rely on r3 being the character because this chain of
comparisons is will never changes its value. The command can do anything it
wants with r3 because once it is over it must skip the other commands.
(i)nsert
Inserting data into the text is the first command I need to write, because without any lines none of the other commands actually do anything.
The line list
The first thing I need to do is actually design the doubly linked list data structure itself. As mentioned in the design, a doubly linked list will allow for lines to be inserted, deleted, and re-ordered.
A node is some data in memory M. Each node in the list will have:
- The address of the next node at
M - The address of the previous node at
M+1 - 46 words of string data starting at
M+2and ending atM+47
The choice of 46 words of string data is fairly arbitrary. I wanted to balance line length and memory consumption, but it probably will need to be adjusted later. Also recall that 46 words of string data means 45 characters.
Allocation
This line list is going to dynamically grow, and that means that I need a way to mange where I put these nodes. This is essentially a simple allocator. Primarily I need to know where the memory I can use starts, and have a way to grab the next chunk of it.
I will start with representing where there is memory that I can allocate. To do this, I need to label the end of my program, and I also need to write that address to an address as well:
;HEAP_PTR
&HEAP
;HEAP
0000 That 0x0000 value is the first value in the heap. The address of that value is
written into memory somewhere before it, which also has a label. This means that
I will be able to retrieve it in the ALLOC procedure:
;=ALLOC
; Reserve space on the heap
;
; Params:
; r0 - number of words to be reserved
;
; Returns:
; r0 - address of allocated space
SETHI r1 &HEAP_PTR
SETLO r1 &HEAP_PTR
READ r1 0x0 r2
MOV r2 r3
ADD r3 r0
WRITE r1 0x0 r3
MOV r2 r0
RET The HEAP_PTR is always the address of the start of the next free piece of
memory. The ALLOC procedure returns that value, but before giving it to you
adds the value you gave it to HEAP_PTR in memory. The result is that each call
to ALLOC gets a unique range of words, starting from whatever address is
returned in r0.
This allocator has no concept of “freeing” memory, which is part of what makes it so simple. Freeing would require keeping track of what parts of memory are used and free which would require more meta data.
Heads and tails
The doubly linked list needs a head and a tail, and these need to live somewhere in memory. All line nodes will be placed between these nodes.
In their initial state, the head and tail point at each other. Using the ampersand syntax I have already described, these nodes can be setup directly during assembly:
;LINE_LIST_HEAD
&LINE_LIST_TAIL
0000
;LINE_LIST_TAIL
0000
&LINE_LIST_HEAD Inserting a line
Inserting a line is just a simple combination of allocation and updating pointers in the given node, its next node, and the newly allocated node.
;LINE_LIST_INSERT
; Insert a new entry into the line list after
; the given node
;
; The given node MUST HAVE a next node
;
; Params:
; r0 the address of the node to insert after
;
; Returns:
; r0 - the address of the newly inserted node
PUSH r0
SETHI r0 0x00
SETLO r0 0x30
SETHI r7 &ALLOC
SETLO r7 &ALLOC
CALL r7
POP r1
READ r1 0x0 r2
WRITE r0 0x0 r2
WRITE r0 0x1 r1
WRITE r1 0x0 r0
WRITE r2 0x1 r0
RET Because a node only needs to have a next node, you can use this function with a
pointer to LINE_LIST_HEAD to insert the first line.
The insert procedure returns the address of the inserted node. To get user input
into that node, I can pass the address of the string data within the node to
READ_STR with the correct length.
MOV r8 r0
SETHI r1 &LINE_LIST_INSERT
SETLO r1 &LINE_LIST_INSERT
CALL r1
MOV r0 r8
SETHI r1 0x00
SETLO r1 0x02
ADD r0 r1 ; data is offset by 2, so pass that
SETHI r1 0x00
SETLO r1 0x2D
SETHI r2 &READ_STR
SETLO r2 &READ_STR
CALL r2 The insert command also prints a newline after reading the string, otherwise the next command will be entered on the same line which is confusing.
SETHI r1 0x00
SETLO r1 0x0A
P1WRITE r1 r1 (p)rint
The core of the printing loop is fairly simple. I need to walk the line list from start to end with a cursor, doing something with each node. Because I am printing data, I need to be careful about not printing non-existent data from the head or tail. Doing this is pretty simple.
I never want to do anything with the head, so the first node I want to check is
always the next node after the head. I can grab that by getting the
&LINE_LIST_HEAD into a register and reading the next pointer out of the node.
SETHI r9 &LINE_LIST_HEAD
SETLO r9 &LINE_LIST_HEAD
READ r9 0x0 r9 If there are no nodes, the pointer that I just read will be the same as
&LINE_LIST_TAIL. In order to be able to check that, I need to load that into a
register as well.
SETHI rA &LINE_LIST_TAIL
SETLO rA &LINE_LIST_TAIL I am using r9 and rA here because they will be preserved across calls to
PRINTLN_STR that are used for each individual line. Before printing, I need to
check the current node (in r9) is not the tail. If it is, printing has
completed.
;PRINT_LOOP
CMP r9 rA
JEQ &MAIN_LOOP If they are not equal, grab pointer to the start of the string, which is always
at the node pointer plus two words. Pass that into PRINTLN_STR (a new
variation of PRINT_STR that prints an newline afterwards).
MOV r9 r0
SETHI r1 0x00
SETLO r1 0x02
ADD r0 r1 ; data is offset by 2
SETHI r4 &PRINTLN_STR
SETLO r4 &PRINTLN_STR
CALL r4 To make this into a loop, I advance to the next node in the line list by reading the next pointer from the current node and then jump back to the condition.
READ r9 0x0 r9
JMP &PRINT_LOOP Printing numbers
An important part of the print command is printing the line numbers with each line. That means I need to be able print a decimal number. The algorithm is pretty simple, I just need to be able to divide by multiples of 10.
Because words are 16 bits, I know the maximum unsigned value is 65535 with 5 digits. That means I can start by dividing by 10000. The quotient will be a single digit from 0 to 9 that I can print. The remainder is the rest of the number. This does mean that I print out leading zeros, and I would have to do some extra work to remove them. However, I actually don’t want to exclude the zeros because I want each number to be the same width. Continue with 1000, 100, 10, and 1.
To print the numbers, I keep track of a line number as I iterate through the line list. Before printing each line, I print this number.
Indicating the current line
You need some way to know what line is currently selected because your next
command may be dependent on that. At least to start, the print commands
indicates that line with a *. Determining what line to do that on is really
simple: as you walk the list, if the cursor is equal to the selected line, print
the * character in place of the usual .
Selecting line (N)
To select a line, you just enter its number. This means that identifying the
command will work (slightly) differently than the others because it has 10
trigger characters: each of the digits 0 through 9.
The other commands have made a single comparison and used a JNE command to
skip to the next. Because the ASCII codes for the characters 0 through 9 are
all contiguous, I can just check the two boundaries:
SETHI r4 0x00
SETLO r4 0x30 ;'0'
CMP r3 r4
JL &NEXT_COMMAND
SETLO r4 0x39 ;'9'
CMP r3 r4
JG &NEXT_COMMAND Parsing numbers
Once determining that the command starts with a number, I need to actually parse it into a value in a register. This is basically the opposite of printing.
I need to walk through each digit in the string and “append” it to the end of
the value. In this case, to “append” means to multiply the old value by 10,
then add the digit’s value.
Walking the line list
After the number has been parsed, I need to update the selected line. The entries in the line list have no idea what number they are. This is an intentional decision because I don’t want to update every subsequent line number when a line is inserted or deleted. Because of that, I always have to walk the list.
I have already talked about walking the list in printing. This will be very
similar, but with an important difference. Recall that you can select line 0.
In the program, this means that the current line points at the head of the list,
which is a node without any string content. Because walking the line list
doesn’t actually interact with the string, I don’t have to worry about it here.
In the program itself, this means that I will not start at the node after the head, I will start with the head itself. In the procedure that selects the line, the line number can be considered the number of lines left to traverse. To select the correct line, I need to walk through the list decrementing that number until that number is 0. If given number was 0, the loop wont execute a single time, and so the head node will be selected. If it were 1, I walk forwards one time ending up on the first line.
The loop also needs to end if the cursor ever ends up on the list tail. This line is unselected able, and in this case the last line will be selected.
(d)elete
Deleting a line is a very simple operation in the doubly linked list I have created. Since I always have the address of a node, I just need to point the previous node at the next node, and the next node at the previous node. I do have to be careful to not attempt to do this process if the list head, as its previous value is not defined, and I want to keep it constant.
After deleting a node, its memory could be re-used. You would eventually exhaust the heap if you inserted and deleted repeatedly. This is an issue that will be addressed later after the problem gets worse.
With the completion of this command, you can now fully edit in memory!
Pre-computing jumps
As I have been adding these features, I have gone through the process of assembling the program many times. The regex assembler can convert literal instructions into hex, but what it cannot do is compute jump offsets or the addresses of procedures or data. That process is still manual.
One thing that I can do to reduce the work of assembling is to pre-compute jumps if they are not sensitive to changes in the overall program. This means that addresses cannot be pre-computed, but control flow can be in many cases. This means that I don’t have to redo this work every time I assemble the program.
Take the following example:
;LOOP
CMP r0 r1
JLE &DONE
ADD r0 r2
JMP &LOOP
;DONE
HALT It can be pre-computed to the following:
;LOOP
CMP r0 r1
JLE 0x02 ;&DONE
ADD r0 r2
JMP 0xFC ;&LOOP
;DONE
HALT Now I don’t have to figure out those values every time. I typically leave the label behind as a comment so that I can tell where it is supposed to go. This helps if you make a mistake, or forget to change it if you add or remove instructions between the jump and the destination.
(w)rite
In order to transfer the text you have written to another program like the assembler, you need to be able to write it to persistent storage. You will also be able to read the written file back into the editor, allowing you to work in multiple sessions or modify an existing program.
Writing itself is very simple. I have already implemented several commands that
walk the line list. Because writing deals with the string content, I need to
start with the node after the head just like print. However, instead of just
calling PRINT_STR I also need to keep track of an address to write data to in
storage. This address persists betweens lines.
The data always starts from 0x0000. Each line ends with a newline (0x000A).
The whole document is ended with a null terminator (0x0000).
Cleaning up
As I get towards the end of writing the editor, it has become the largest program I have written throughout the project. There are some smaller mistakes that I have corrected over the course of writing it that make it easier to work with that I want to highlight.
Standardize names
Assembly is ultimately just a big list of instructions. The structure comes in
the form of labels. Right now, labels are actually just comments. Comments are
indicated by a ;. Labels are comments that start at the beginning of the line,
and in which the second character is not a space. That is, ;LABEL is a
label, but ; label is a comment. I have also been using upper case for labels,
which helps distinguish them from textual comments.
Labels are used all over. They are part of control flow, but also part of
procedures. These tend to be used differently, and I want to be able to identify
procedures in particular. To that end, I have started labeling procedures with
names starting with =, like ;=READ_STR. This makes it easy to search for.
Some of the earlier procedures like ALLOC were missing this equal. I went back
and added it to keep everything consistent.
Make names searchable
Keeping things searchable is most about making them unique. The particular
problem here was the write procedure. It was just called ;=WRITE, and the easy
way to search for it was using WRITE. However, this conflicts with the WRITE
instruction, so you get a lot of results. The same thing applies to READ
later.
I updated the names of WRITE to WRITE_P0, which is searchable, and also more
clear about where the data is going.
(r)ead
In order to read a file, I need to read through storage and copy out the lines that have been saved there. However, before I do that I need to clear out the line list by deleting everything.
Deleting everything is straightforward, I just walk the line list calling delete on every node. I do need to be careful about getting the address of the next node before the node is deleted. Once the node is removed form the list, its next and previous pointers shouldn’t be used.
Reflections
Improving the assembly experience
Because this is the largest program I have written, it gives the opportunity to reflect on the challenges of writing assembly and what the full assembler should be able to do. At the moment, this means improvements to the regex assembling process.
Assembly documentation
The regex assembler has been somewhat ad-hoc. This has resulted in a lack of clarity about individual instructions. When I forget an instruction, I have to work my way from the regex all the way down to the instructions. This isn’t so bad because I am still pretty close to the instructions, but the experience should be better.
In the regex assembler, this documentation looks like a comment that simply tells you what the instruction does in concrete terms. For example:
" WRITE r0 0xA r1 ; write r1 to memory at r0 + 0xA The WRITE instruction is particularly difficult to remember because the order
of the arguments in the assembly instruction doesn’t match the order in the
machine instruction at all. The machine instruction itself uses the order that
makes the most sense for the hardware implementation. The order is different
because it makes more sense. When making the full assembler, documentation will
be equally important.
Assembly argument ordering
In a similar vein, I should standardize the order of arguments to help with
recollection. The MOV instruction is an example of an instruction that doesn’t
follow the same pattern as similar instructions. For most register operations,
the first argument is the register that will change value, like ADD r0 r1
where r1 is added to r0. Move, however, modifies the second argument. MOV r0 r1 will modify r1, updating it to r0.
Preparing assembly
While writing this program, I added a new tool called regex-prepare. This
script collapses the program so that each line is an individual instruction.
Whole line comments are removed, and labels are moved onto the line they label.
Doing this makes it easy to figure out addresses and jump offsets because the
line numbers map almost directly to instruction addresses (they are off by one).
This mirrors a process that I expect to do in the real assembler. In the same way, I will need to figure out the address of every instruction and label.
Indenting
The prepare script also allows me to indent lines, which can aid in readability. A loop or a procedure can be indented so that you can visually see where it ends. Side note, I am already starting to see that the character limit will become an issue pretty quickly. I will need to worry about conserving characters, and to that end I should use literal tab characters for indentation.
Common mistakes
The prepare and assemble scripts help a lot, but computing and filling out offsets and addresses are still leave room for mistakes. This is something that the assembler will help with.
Offset errors and typos
It is pretty easy to make mistakes when calculating offsets, and I will sometimes make typos both when converting and entering converted numbers. These error cause bugs that can be difficult to track down, with smaller errors sometimes being more difficult to find.
Similarly, it is easy to forget to update pre-computed offsets. This would be true in the assembler as well, but I should be able to do less pre-computation.
End of a doubly linked list
This is unrelated to the assembly, but I found myself making the same mistake around ignoring the end of the doubly linked list. I use a cursor, and would iterate until it was zero (null). However, this means that the loop executes for the tail. The solution is really simple, just iterate until the cursor is equal to the tail.
Including binaries
For most of this implementation, I only included the assembly code. Towards the end, I realized that I should be including the binary file as well. This is true for multiple reasons. For one, assembling is still manual, and I don’t want to repeat it if I go back to a previous commit. If you are following along, you also probably don’t want to have to assemble by hand either. This will get better with the assembler, but I should keep including the binary.
Another good reason to include the binary is for demonstrating bugs in assembly. Sometimes these bugs are a one off mistake, so even if I assemble again I won’t repeat the same thing. Some of these have interesting behaviors that I would like to talk about.
Next Steps
Now that I can write text, I have a good way to write assembly. I want to start working directly towards writing the assembler itself. The first step will be to figure out the broad design.