Sunday, February 17, 2013

Coping well with the lack of line numbers

One of my goals in the Qt version of ebe is to transparently support OS X assembly language as well as Linux assembly language.  There are several basic problems with using yasm under OS X.

  • OS X uses rip-relative addressing.
  • Global functions use an underscore prefix.
  • There is no debug information provided by yasm for use in gdb.
The first two problems are fairly easy to cope with.  First, rip-relative addressing only matters when you index an array or access a structure component in the data segment.  In those cases, if you use a load-effective-address (lea) instruction to get the address of the array or struct into a register, the same code works on both Linux and OS X.

The second problem can be solved using macros.  I have prepared a set of macros which I automatically prepend to each assembly file (using yasm's -P option).  The macros add "default rel" under OS X to establish rip-relative addressing and translate each of about 350 function names, including main, scanf, printf, ..., to have an underscore prefix.  So the source code can use main without worrying about the need for an underscore.  There is also a cname macro which can turn any name into a macro which will have an underscore prefix under OS X and not under Linux.  So the set of macros takes care of the first 2 issues.
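To make the idea concrete, here is a rough sketch of how such a prefix file could be generated.  This is not the actual ebe macro file, which is not shown in this post.  The helper name make_prefix and the use of single-line %define macros are assumptions for illustration only.

```python
def make_prefix(names, target_is_osx):
    """Return the text of a hypothetical yasm prefix file for C function names."""
    if not target_is_osx:
        # Linux: the names need no translation, so nothing is emitted.
        return ""
    lines = ["default rel"]  # establish rip-relative addressing under OS X
    for name in names:
        # Single-line macro: every use of `name` is replaced by `_name`.
        lines.append(f"%define {name} _{name}")
    return "\n".join(lines) + "\n"


# Example: the three names mentioned above, generated for OS X.
print(make_prefix(["main", "scanf", "printf"], True), end="")
```

The resulting text would be saved to a file and passed to yasm with -P so that it is read before the user's source.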

The lack of debugging support is not as total as it could be.  You can still find globals and addresses using nm and within gdb, but there is no way to set a breakpoint by line number, and when gdb stops after a next-instruction command it won't tell you which line of the function will execute next.

My original solution to this was to inspect the listing file, determine a relative address for each line, and then query either gdb or nm for an actual address such as &main, from which the absolute addresses follow.  This is a fair amount of code and more code means a higher probability of error.
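The arithmetic behind the listing-file approach can be sketched as follows.  This is only an illustration, not the ebe code.  It assumes each code-emitting listing row has the layout "line number, 8-digit hex offset, code bytes, source", and the address 0x4004f0 for main is a made-up value standing in for what nm or gdb would report.

```python
import re

# Assumed listing layout: line number, 8-digit hex offset, code bytes, source.
LISTING_ROW = re.compile(r"^\s*(\d+)\s+([0-9A-Fa-f]{8})\s+\S+")


def line_offsets(listing_text):
    """Map line number -> section-relative offset for rows that emit code."""
    offsets = {}
    for row in listing_text.splitlines():
        m = LISTING_ROW.match(row)
        if m:
            # Keep the first offset seen for a line (multi-row lines repeat it).
            offsets.setdefault(int(m.group(1)), int(m.group(2), 16))
    return offsets


def absolute_addresses(offsets, anchor_line, anchor_address):
    """Shift every relative offset so anchor_line lands on anchor_address."""
    delta = anchor_address - offsets[anchor_line]
    return {line: off + delta for line, off in offsets.items()}


listing = """\
     1                                 segment .text
     2                                 global main
     3                                 main:
     4 00000000 55                     push rbp
     5 00000001 4889E5                 mov rbp, rsp
     6 00000004 C9                     leave
     7 00000005 C3                     ret
"""

offsets = line_offsets(listing)
# Line 4 is the first instruction after the main label; 0x4004f0 is hypothetical.
addresses = absolute_addresses(offsets, 4, 0x4004F0)
for line in sorted(addresses):
    print(line, hex(addresses[line]))
```

A real implementation also has to keep offsets from different segments apart, which is part of why this approach takes more code.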

My next solution to the lack of line numbers is to make my own.  I am now generating a debug asm file from the original with each original line preceded by a generated label with a line number in the label.  To properly handle local labels, my generated labels need to be local labels, so I need to create a global label to stick at the start of the file.

So a file would start with

ebe_debug:           ; generated global label
.filename_line_1:    ; generated local label with a line number
...                  ; whatever was originally on line 1
.filename_line_2:
...                  ; whatever was originally on line 2
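Generating that layout is a small loop.  The sketch below is one way it could be written, under the assumption that the filename is passed in as a plain stem (here "hello") that is safe to embed in a label.  The function name make_debug_asm is made up for this example.

```python
def make_debug_asm(source_lines, filename):
    """Interleave generated local labels with the original source lines."""
    out = ["ebe_debug:"]  # generated global label at the top of the file
    for number, text in enumerate(source_lines, start=1):
        out.append(f".{filename}_line_{number}:")  # generated local label
        out.append(text)                           # original line, unchanged
    return out


original = ["        segment .text", "        global main", "main:", "        ret"]
for line in make_debug_asm(original, "hello"):
    print(line)
```

Each original line contributes two output lines, plus one for the global label at the top.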

So a file with 100 lines becomes a debug asm file with 201 lines, and the generated instructions are the same because labels emit no code.  Now it is possible to extract all the address information by running nm on the executable and to issue breakpoint commands like:

break main.filename_line_8
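A minimal sketch of the nm side, assuming the generated labels show up in nm output as "address type scope.filename_line_N".  The sample nm text and its addresses are invented for illustration, and the helper names are hypothetical.

```python
import re

# Assumed nm row: hex address, one-letter type, then scope.filename_line_N.
SYMBOL = re.compile(r"^([0-9A-Fa-f]+)\s+\S\s+(\S+)\.(\w+)_line_(\d+)$")


def line_table(nm_output):
    """Map (filename, line) -> (address, full symbol name)."""
    table = {}
    for row in nm_output.splitlines():
        m = SYMBOL.match(row.strip())
        if m:
            address, scope, name, line = m.groups()
            symbol = f"{scope}.{name}_line_{line}"
            table[(name, int(line))] = (int(address, 16), symbol)
    return table


def break_command(table, filename, line):
    """Build the gdb command for a breakpoint at a source line."""
    return "break " + table[(filename, line)][1]


nm_output = """\
0000000000400500 T main
0000000000400500 t main.hello_line_8
0000000000400504 t main.hello_line_9
"""

table = line_table(nm_output)
print(break_command(table, "hello", 8))
```

Rows such as the plain main symbol do not match the pattern and are skipped.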

By choosing a nice solution I managed to get a bonus: gdb reports the location when it finishes a command with something ending in "in main.filename_line_9", so I can readily determine the filename and the line number for the next line to execute and highlight it in ebe.
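Pulling the filename and line number back out of gdb's report is a one-regex job.  The sketch below assumes the generated labels follow the scope.filename_line_N pattern used for the breakpoints, and the sample gdb text is made up.  The exact wording gdb prints may differ between versions.

```python
import re

# Look for "in <scope>.<filename>_line_<N>" anywhere in gdb's output.
STOP = re.compile(r"\bin\s+\S+?\.(\w+)_line_(\d+)\b")


def stopped_at(gdb_output):
    """Return (filename, line) for the reported location, or None."""
    m = STOP.search(gdb_output)
    return (m.group(1), int(m.group(2))) if m else None


print(stopped_at("0x0000000000400504 in main.hello_line_9 ()"))
```

When gdb stops somewhere without a generated label, the function returns None and the editor has nothing to highlight.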

This solution was fairly good.  It was simple, but it interfered with macros, among other issues: inside a repeat macro the same generated label would occur multiple times.  After using this solution for a while, I reverted to the approach that analyzes the listing file to determine relative addresses, which are later translated to program addresses using the addresses of globals in the program.  This worked smoothly.

I think the total solution might require identifying labels reliably.  Unfortunately yasm allows labels with or without colons.  With a colon a label is obviously a label.  Without a colon a label looks about the same as an instruction.  I rounded up a relatively complete collection of x86_64 instructions and yasm pseudo-ops to store in a QSet<QString> so I can tell whether the first word on a line is an instruction or a label.  There are roughly 1600 names in the set.  It is a truly arcane instruction set, but fortunately you can learn a fairly small subset and do a fairly good job.
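The set-lookup test can be sketched in a few lines.  The real set has roughly 1600 names in a QSet<QString>.  The tiny Python set below is only a stand-in, and the function name leading_label is made up for this example.

```python
# A tiny stand-in for the full set of instructions and pseudo-ops.
MNEMONICS = {
    "mov", "lea", "add", "sub", "push", "pop", "call", "ret",
    "db", "dw", "dd", "dq", "segment", "section", "global", "extern",
}


def leading_label(line):
    """Return the label that starts the line, or None if there is none."""
    # Naive comment strip: a ';' inside a string literal would fool this.
    code = line.split(";", 1)[0]
    words = code.split()
    if not words:
        return None                 # blank line or comment only
    first = words[0]
    if first.endswith(":"):
        return first[:-1]           # a colon makes it obviously a label
    if first.startswith("%") or first.lower() in MNEMONICS:
        return None                 # preprocessor directive or instruction
    return first                    # unknown first word: treat as a label


for text in ["main:", "x dq 0", "    mov rax, 1 ; comment", "; only a comment"]:
    print(repr(text), "->", leading_label(text))
```

Any name missing from the set is misread as a label, which is why the set needs to be close to complete.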

What remains is handling data items with global and local labels.  I need to review this again to see if I really must identify the labels in the source.  So far I have identified the range of each global label, which determines which global label is appropriate for a variable identified by a local label.  I think this is necessary, but I hope to simplify this too.
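The range lookup can be sketched with a sorted list and a binary search.  This is an illustration under the assumption that the global labels have already been collected as (line number, name) pairs in line order.  The names and line numbers below are invented.

```python
import bisect


def owning_global(global_labels, line):
    """Return the last global label at or before the line, or None."""
    starts = [number for number, _ in global_labels]
    i = bisect.bisect_right(starts, line) - 1
    return global_labels[i][1] if i >= 0 else None


def qualified_name(global_labels, local_label, line):
    """Turn a local label such as '.count' into 'owner.count'."""
    owner = owning_global(global_labels, line)
    return owner + local_label if owner else local_label


# Hypothetical file: main starts on line 3, helper on line 20.
labels = [(3, "main"), (20, "helper")]
print(qualified_name(labels, ".count", 12))
print(qualified_name(labels, ".count", 25))
```

Each global label owns the lines from its own line up to the line before the next global label.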