
COPYRIGHT __________________________________________________________________________________________

SID v1.33
Copyright © 7th software, 2014
All rights reserved.
                                               ARM, Thumb and Jazelle are trademarks of ARM Limited.

CONTENTS ___________________________________________________________________________________________

  * COPYRIGHT
  * CONTENTS
  * INTRODUCTION
  * REQUIREMENTS
  * COMMAND-LINE OVERVIEW
  * FUNCTIONAL OVERVIEW
  * APCS AND REGISTERS
  * LABEL NAMING SCHEME
  * OFFSETS FILES
  * ORDERING AND OVERRIDING
  * DEAD CODE
  * DANGER, DANGER!
  * FEEDBACK


INTRODUCTION _______________________________________________________________________________________

SID is a general-purpose disassembler for ARM binaries up to and including the ARM version 5
architecture. It does not currently disassemble Thumb or Jazelle binaries. SID can be used from the
RISC OS command line or via the multitasking FrontEnd user interface.

As the binary is processed, SID can generate various warnings about potentially problematic
constructs, for example code which is not 32-bit safe, instructions with Undefined or Unpredictable
behaviour, or branches to locations outside the binary.
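
The last of those warnings can be illustrated with a short sketch. The Python here is not SID's own
code; it only shows how the target of an ARM B or BL instruction is computed and checked against the
bounds of the binary. The function names and the base/length parameters are assumptions made for the
example.

```python
def is_branch(instruction):
    """True for B/BL: bits 27-25 are 101 and the condition is not &F (BLX in ARMv5)."""
    return (instruction >> 25) & 0b111 == 0b101 and (instruction >> 28) != 0xF


def branch_target(address, instruction):
    """Target of a B/BL at 'address': address + 8 + (sign-extended 24-bit offset * 4)."""
    offset = instruction & 0x00FFFFFF
    if offset & 0x00800000:          # sign-extend the 24-bit field
        offset -= 0x01000000
    return (address + 8 + (offset << 2)) & 0xFFFFFFFF


def branch_leaves_binary(address, instruction, base, length):
    """True if the instruction is a branch whose target lies outside [base, base+length)."""
    if not is_branch(instruction):
        return False
    return not (base <= branch_target(address, instruction) < base + length)


# &EAFFFFFE branches to itself; &EA000000 branches two words ahead.
print(hex(branch_target(0x8000, 0xEAFFFFFE)))
print(branch_leaves_binary(0x8000, 0xEA000000, base=0x8000, length=8))
```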

SID makes an attempt at intelligently deciphering common structures, like C functions, branch
tables, module SWI tables, etc. To do all this, SID must employ various heuristics because, even
though assembling (or even compiling) source code into binary is a systematic process, conversion in
the opposite direction is not. See "Danger, Danger!" for some of the possible pitfalls.


REQUIREMENTS _______________________________________________________________________________________

SID requires RISC OS 3.5 or later. SID is 32-bit compatible. If SID refuses to run on your system,
you can tweak the !Run file and give it a spin. Just don't be surprised if it doesn't work exactly
as you expect!

SID uses the Debugger module to disassemble instructions. It is recommended that you have Debugger
version 1.74 or later running to get the best output from SID. Earlier versions produced ambiguous
output from SWI Debugger_Disassemble (e.g. SWI OS_Unknown can be many SWIs, Undefined Instruction
can correspond to many bit patterns).

If you want to disassemble a binary which has been compressed using either 'Squeeze' or 'modsqz',
you will require 'xpand' or 'unmodsqz' respectively. These can be found in the official C/C++ Tools
release.


COMMAND-LINE OVERVIEW ______________________________________________________________________________

For help on the SID command-line syntax, see the !Help file. It is most likely that you will use the
following form:

  SID <input> <output> -string -comment -cover -adrl -label -indir -objasm -svc -smart

This form will attempt to disassemble the <input> binary into the <output> text file, which is
suitable for assembly with 'objasm'. The input file in this case will usually have a type of
Absolute (&FF8), Module (&FFA) or Utility (&FFC).

Processing a large (e.g. 50 KB) binary file can take some time, even on a StrongARM-based machine.
If this is the case, the -hour switch can be used to display an hourglass showing a percentage
indication of process completeness.


FUNCTIONAL OVERVIEW ________________________________________________________________________________

SID will make assumptions about the source file, based upon its filetype. For example, an Absolute
is assumed to start execution at address &8000. A module is assumed to be position-independent
code, as is a utility, i.e. the base 'address' is zero.
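
As a compact summary, the defaults described so far can be written as a lookup table. This is only
a sketch in Python: the filetype numbers and base addresses are the ones given in this document,
while the table name, the function name and the choice of returning None for other filetypes are
assumptions made for the example.

```python
# Filetype numbers and assumed base addresses, as described in this document.
FILETYPE_DEFAULTS = {
    0xFF8: ("Absolute", 0x8000),
    0xFFA: ("Module", 0),
    0xFFC: ("Utility", 0),
}


def default_base(filetype):
    """Return the assumed base address, or None where this table has no default."""
    entry = FILETYPE_DEFAULTS.get(filetype)
    return None if entry is None else entry[1]
```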

You can override these assumptions or process files of other types by using the appropriate switches
on the CLI or via the Advanced options in the main window (click the Toggle Size icon to expose
these).

SID usually attempts to perform a flow analysis on the binary to determine what parts are code and
what parts are data. It attempts to represent the data in a meaningful manner (i.e. it tries to
detect strings and error blocks).

SID can insert labels, usually with fairly meaningful names, into the disassembled code, which makes
navigation easier. It also keeps a reference count for each label (given in a comment) so that you
can spot dead code and important subroutines.

One useful mode of operation is to generate output suitable for diffing (comparison) against the
output from some other binary. The most common case is comparing two versions of the same binary.
Simply running a binary diff utility on the binaries can produce output which is very hard to
understand; one extra word here and there can leave thousands of B, ADR and LDR instructions
different.

Because SID replaces the offsets in those instructions with a label, the output is far better suited
to comparison. The FrontEnd application provides an 'Output for diffing' button which tailors the
output for this purpose. Note: in this mode, the output will often be unsuitable for assembly back
into a working binary.
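
The effect of labels on a comparison can be shown with a small model. The Python sketch here is not
how SID works internally; it renders two versions of a toy program, once with absolute branch
targets and once with labels, and counts the lines a line-based diff would report. The program
representation, the label names and the helper functions are all assumptions made for the example.

```python
import difflib

BASE = 0x8000


def render(program, use_labels):
    """program is a list of (text, target_index); target_index is None for non-branches."""
    targets = sorted({t for _, t in program if t is not None})
    names = {t: f"label{n}" for n, t in enumerate(targets)}
    lines = []
    for index, (text, target) in enumerate(program):
        if use_labels and index in names:
            lines.append(names[index])
        if target is None:
            lines.append(f"        {text}")
        elif use_labels:
            lines.append(f"        {text} {names[target]}")
        else:
            lines.append(f"        {text} &{BASE + 4 * target:08X}")
    return lines


def changed_lines(old, new):
    """Count the lines that a line-based diff would report as added or removed."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return sum((i2 - i1) + (j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal")


v1 = [("MOV     R0,#0", None), ("ADD     R0,R0,#1", None),
      ("CMP     R0,#10", None), ("BNE    ", 1), ("MOV     PC,R14", None)]
# v2 has one extra instruction near the start, so the loop moves down by one word.
v2 = [("MOV     R0,#0", None), ("MOV     R1,#0", None), ("ADD     R0,R0,#1", None),
      ("CMP     R0,#10", None), ("BNE    ", 2), ("MOV     PC,R14", None)]

raw = changed_lines(render(v1, False), render(v2, False))
labelled = changed_lines(render(v1, True), render(v2, True))
print(raw, labelled)
```

With absolute targets, the branch line changes as well as the inserted instruction; with labels,
only the inserted instruction shows up.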


APCS AND REGISTERS _________________________________________________________________________________

Conventionally, the Debugger module uses the register names R0-R14,PC. This can be changed on newer
versions of the Debugger by changing the Disassemble$Options system variable. For example, if you
want the registers output as R0-R12,SP,LR,PC, you might set it with:

  *Set Disassemble$Options -arm -SP -LR

To use the register naming schemes of the various APCS calling standards, you might set it with:

  *Set Disassemble$Options -apcs -sb -sl -fp -sp -lr

At present, SID can do this for you, but only in a primitive way; you can switch the Debugger
between ARM and APCS modes by using the -arm and -apcs switches. If you have a binary which was
built with the C compiler, the chances are you'll want to do the above to set your register names
for APCS. For finer control over the Debugger output, you should set the system variable yourself.

Note: any changes to the Disassemble$Options system variable will affect the disassembly for other
users of the Debugger module, e.g. other SID tasks already running.


LABEL NAMING SCHEME ________________________________________________________________________________

There are a number of label names possible:

  entry_point     ... an entry point (derived or specified in Offsets file)
  skip            ... code only referenced from earlier in the binary
  loop            ... code only referenced from later in the binary
  code            ... the start of some code
  subroutine      ... a piece of code which looks like a subroutine (i.e. BL refers to it)
  data            ... some numeric (and string) data
  string          ... a string literal
  indirect        ... a data word referencing another location in the binary

and for modules:

  Mod_Start          ... start code
  Mod_Init           ... initialisation code
  Mod_Die            ... finalisation code
  Mod_Title          ... title string
  Mod_HelpStr        ... help string
  Mod_HC_Table       ... help/command keyword table
  Mod_SWIHandler     ... SWI handler code
  Mod_SWITable       ... SWI decoding table
  Mod_SWIDecode      ... SWI decoding code
  Mod_Messages       ... messages filename string
  Mod_Flags          ... flags word
  Mod_Service        ... service call handler code
  fast_svc_table     ... fast service call table
  Fast_Service_Entry ... fast service call entry point
  Svc_Table_Pos      ... word containing offset of fast service call table

and for compiled C code:

  lib_chunk_list  ... block passed in R0 to SWI SharedCLibrary_LibInitBlahBlah
  kernel_init_blk ... probably won't appear in the output
  lib_vector      ... probably won't appear in the output
  lib_static      ... probably won't appear in the output

and if things have gone wrong:

  undefined0      }
  undefined1       }
  undefined2        } These all mean that something didn't go right in SID's processing.
  undefined3        } I've never seen it happen so, if you do, report it...  ;)
  undefined4       }
  undefined5      }

Labels also get names based upon their context. Strings, for instance, have a processed version of
the string they mark appended onto the end, e.g.

string_ThisIsAString
        DCB     "This Is A String.", 0

Other labels are named appropriately if possible, for example vector handlers, environment
handlers, C functions where the name is embedded in the binary and module command code.

Once code flow analysis has been performed and all of the labels are in place, SID will scan them
all to find any duplicates (e.g. there may be many called 'loop'). Each duplicate has a unique
number appended.
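
A minimal sketch of that final pass, in Python, might look like this. The function name and the
"_N" suffix format are assumptions; SID's actual numbering style is not specified here, and the
sketch does not guard against a generated name colliding with a label that already carries a
number.

```python
from collections import Counter


def number_duplicates(labels):
    """Append a running number to every label whose name occurs more than once."""
    totals = Counter(labels)
    seen = Counter()
    result = []
    for name in labels:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


print(number_duplicates(["loop", "skip", "loop", "code", "loop"]))
```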


OFFSETS FILES ______________________________________________________________________________________

The -offsets switch implies the -label switch. The offsets in the file are added to the default
entry points and known data offsets for that filetype (if any). An overview of the format of an
Offsets file is:

  # This is a comment
  offset
  offset:label
  offset:label:type
  offset::type
  etc.

A more formal definition, in an EBNF-like syntax, might be:

  line       ::= [comment | offset_def | {}] newline
  comment    ::= '#' zero or more characters
  offset_def ::= basic_expr {':' {label_name}} {':' label_type}
  basic_expr ::= any valid BASIC expression (can include code_base%)
  label_name ::= any valid objasm label
  label_type ::= ['undef' | 'code' | 'isdata' | 'string' | 'nostring' | 'offset' | 'indir' | 'entry']

Each offset is separated by a newline and comments are introduced with the hash ('#') character.
Comments must be introduced at the start of a line, or they will most likely result in an error
being generated by the parser when SID loads the file.

Each offset may be given as a decimal, hexadecimal (using & prefix) or binary (using % prefix)
integer, or as any other valid BASIC expression. For example:

  # This is an example entry point file...
  # The first word of the binary
  0
  # The word at offset 124
  %1111100
  # The word at offset 200
  &C8
  # The word at offset 60
  (16*4) - 4
  # The word at the offset given by the word at offset 8 bytes
  code_base%!8

Thus, the entry points and data offsets files can inform SID that the code or data is pointed to by
words in the binary itself (code_base% is the pointer to the base of the binary).
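
To make the number formats concrete, here is a small Python sketch that evaluates a subset of such
expressions. It is not SID's evaluator: it handles only plain decimal, & and % literals and the
single form code_base%!N, whereas SID accepts any valid BASIC expression. The function name is an
assumption, and the binary is assumed to hold little-endian 32-bit words.

```python
import re
import struct


def evaluate_offset(expression, binary):
    """Evaluate a small subset of offset expressions against the loaded binary (bytes)."""
    expression = expression.strip()
    indirect = re.fullmatch(r"code_base%!(\d+)", expression)
    if indirect:
        # code_base%!N reads the 32-bit little-endian word N bytes into the binary.
        (word,) = struct.unpack_from("<I", binary, int(indirect.group(1)))
        return word
    if expression.startswith("&"):
        return int(expression[1:], 16)   # hexadecimal literal
    if expression.startswith("%"):
        return int(expression[1:], 2)    # binary literal
    return int(expression)               # decimal literal


# A toy four-word binary whose third word (offset 8) holds the value &1C.
example = struct.pack("<4I", 0, 0, 0x1C, 0)
print(evaluate_offset("code_base%!8", example))
```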

Each offset expression can be followed by a colon and a label name string. There is no need to
ensure that each label has a unique name because SID will append numbers to label names as
necessary to avoid duplication.

The label name string can optionally be followed by a colon and a type string. No spaces are
permitted around the label name string or type string. You do not have to specify a label name
string in order to specify the label type.
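
The line format described above can be sketched as a small parser. This Python is an illustration
only and not the parser SID uses: the function name is hypothetical, it assumes the expression
itself never contains a colon, and it treats an omitted type as 'undef'. The expression is returned
as text and is not evaluated.

```python
VALID_TYPES = ("undef", "code", "isdata", "string", "nostring", "offset", "indir", "entry")


def parse_offsets_line(line):
    """Return (expression, label_name or None, label_type), or None for blank/comment lines."""
    line = line.rstrip("\r\n")
    # Comments are only recognised at the very start of a line.
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split(":")
    if len(fields) > 3:
        raise ValueError(f"too many ':' separated fields: {line!r}")
    expression = fields[0].strip()          # spaces are allowed around the expression
    name = fields[1] if len(fields) > 1 and fields[1] else None
    label_type = fields[2] if len(fields) > 2 else "undef"
    for field in (name, label_type):
        if field is not None and " " in field:
            raise ValueError(f"spaces are not permitted in {field!r}")
    if label_type not in VALID_TYPES:
        raise ValueError(f"unknown label type {label_type!r}")
    return expression, name, label_type


print(parse_offsets_line("&1AC:messages_filename:string"))
```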

Valid type strings are (in ascending order of priority):

  undef    - the bytes are of unknown type (code or data) but require a label
  code     - the word(s) from this offset are code
  isdata   - the bytes after this offset are numeric data or strings
  string   - the bytes after this offset are a string (or string sequence)
  nostring - the following bytes are data which is known not to contain strings
  indir    - the word at this offset contains an address of another label
  offset   - the word(s) at this offset contain offsets to other labels
  entry    - this offset is a code entry point (highest priority label type)

Some examples are:

a) 0  :code_base

  Offset zero in the code has a label called code_base and is of undefined type. SID will attempt to
  deduce if the data at the start of the loaded binary is code, a RISC OS error block, a string or
  simple data.

  Note: spaces are valid in and around the expression. They are not permitted in other parts of the
  line.

b) &1AC:messages_filename:string

  Offset &1AC (428 decimal) is a string. It will have a label placed before it called
  'messages_filename'.

c) code_base%!4::isdata

  The bytes following the offset contained in the second word of the code (the word at offset 4)
  are simple numeric data or strings.

d) !(code_base%+!code_base%):indir_abort_code:entry

  The first word of the code contains an offset to a word which, in turn, contains an offset to some
  code, called 'indir_abort_code'.


ORDERING AND OVERRIDING ____________________________________________________________________________

When SID first loads the binary file (after it has been decompressed, if necessary), the offsets
file is processed (if any). Labels are added to the code in the order they appear in the file.

If more than one entry references the same offset (e.g. code_base%!4 and code_base%!8 both point
to &1C) then only the first reference is used; all other references are ignored.

After processing the offsets file, SID will then generate any default offsets for that file type
(e.g. for modules it processes the module header). Note: the -type parameter is useful for this.
Labels generated in this phase which coincide with those in the offsets file are ignored.

Labels are initially assigned a type. As documented earlier, this can be undefined, in which case
SID will attempt to auto-detect the type. If, during flow analysis, multiple references to a label
imply different types, types get overridden according to their priorities.

A label of a low priority type can be redefined to a higher priority type. Labels cannot be
redefined to a lower priority type.
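
The override rule can be sketched in a few lines of Python. The priority order is the one listed
under OFFSETS FILES; the list and function names are assumptions made for the example, and this is
not SID's own code.

```python
# Label types in ascending order of priority, as listed under OFFSETS FILES.
PRIORITY = ["undef", "code", "isdata", "string", "nostring", "indir", "offset", "entry"]


def resolve_type(current, proposed):
    """Return the type a label ends up with when a new reference proposes another type."""
    if PRIORITY.index(proposed) > PRIORITY.index(current):
        return proposed     # redefinition to a higher priority type is accepted
    return current          # equal or lower priority proposals are ignored


print(resolve_type("undef", "string"), resolve_type("entry", "code"))
```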


DEAD CODE __________________________________________________________________________________________

Sometimes, SID will disassemble blocks of code as a sequence of DCDs. These are usually quite
obvious, e.g.

        DCD     &E1A00000
        DCD     &E04EC00F
        DCD     &E08FC00C
        DCD     &E99C000F
        DCD     &E24CC010
        DCD     &E59C2030
        DCD     &E3120C01
        DCD     &159CC034
        DCD     &008CC000
        DCD     &E08CC001
        DCD     &E3A00000
        DCD     &E3530000
        DCD     &D1A0F00E
        DCD     &E48C0004
        DCD     &E2533004

(an abundance of &E in the top-most nibble of words is a clue that this is code). There are three
possible reasons for this:

  1) It is dead code, not referenced from anywhere in the binary

  2) It is code entered via some entry point about which SID is not aware

  3) It is simply code which SID has mis-identified as data. This does not often happen.

You will often see (1) in compiled code where some object was linked into the final binary and
contained functions which are not used. Usually, however, SID will go ahead and disassemble it
anyway (if it looks like code). In that case, there will be a comment on each line saying "Missed by
flow analysis".

(2) and (3) do not happen very often. When they do, you can correct the problem by adding an entry
point into an Offsets file for the start of the block of code. If there are many such blocks in the
disassembled output, it helps to identify correctly which of them are the highest-level entry
points. Re-processing can often bring large parts of the output to life by following code paths
missed in the original scan.
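
The "abundance of &E" clue mentioned above is easy to check mechanically when deciding which DCD
blocks deserve an entry in an Offsets file. The Python sketch here is a reader's aid and not one of
SID's heuristics; the function name is an assumption. The words are those of the DCD listing shown
earlier in this section.

```python
WORDS = [
    0xE1A00000, 0xE04EC00F, 0xE08FC00C, 0xE99C000F, 0xE24CC010,
    0xE59C2030, 0xE3120C01, 0x159CC034, 0x008CC000, 0xE08CC001,
    0xE3A00000, 0xE3530000, 0xD1A0F00E, 0xE48C0004, 0xE2533004,
]


def always_fraction(words):
    """Fraction of words whose top nibble is &E, the ARM 'always' condition code."""
    if not words:
        return 0.0
    return sum(1 for word in words if word >> 28 == 0xE) / len(words)


print(always_fraction(WORDS))
```

A high fraction suggests code, but it is only a hint: data can also happen to start with &E.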


DANGER, DANGER! ____________________________________________________________________________________

Of course, SID is not perfect! Even though every effort has been made to verify that SID
disassembles code as accurately and intelligently as reasonably possible, it can (and will) still
get things wrong. For example, the task of guessing (and it is often only a guess) if some binary
words are code, numeric data or a string is never going to be 100% accurate.

If SID found the word &20202020, that could simply be numeric data for a magic constant. It could be
a string containing four spaces. It could even be the instruction "EORCS R2,R0,R0,LSR #32", but that
is reasonably unlikely!
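
The three readings of that word can be reproduced with a short Python sketch. The field layout is
the standard ARM data-processing encoding with a register operand and an immediate shift; the
function names are assumptions, and the sketch decodes fields only, it does not produce a mnemonic.

```python
import struct

WORD = 0x20202020


def as_string(word):
    """The four bytes of a little-endian word, read as ASCII text."""
    return struct.pack("<I", word).decode("ascii")


def data_processing_fields(word):
    """Split a word into ARM data-processing fields (register operand, immediate shift)."""
    return {
        "cond": word >> 28,                  # 2 is CS
        "opcode": (word >> 21) & 0xF,        # 1 is EOR
        "s": (word >> 20) & 1,
        "rn": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,
        "shift_amount": (word >> 7) & 0x1F,  # 0 with LSR encodes LSR #32
        "shift_type": (word >> 5) & 3,       # 1 is LSR
        "rm": word & 0xF,
    }


print(WORD)                          # reading 1: a plain number
print(repr(as_string(WORD)))         # reading 2: a string
print(data_processing_fields(WORD))  # reading 3: instruction fields
```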

To make these guesses, SID employs a collection of heuristics. The algorithms behind those
heuristics are not discussed here, but the types of guess SID makes are:

  * Given the binary filetype, guess information about execution address, position independence and
    entry points. This can be supplemented/overridden from the command line and with an Offsets
    file;

  * When SID identifies a block of the binary (e.g. referenced by an ADR), it has to guess if it is
    code or data;

  * Given a block of data, SID will attempt to spot any string literals within;

  * Given a block of data, SID will try to spot error blocks;

  * If a string has terminating nulls up to the next word boundary, SID places an ALIGN directive
    after the (last) DCB. This may be spurious;

  * Given a word of data (e.g. loaded into a register with a PC-relative LDR), does this word
    reference some other significant part of the binary?

For these reasons, SID may mis-identify some of the binary (e.g. a word of data becomes a string).
This is often not much of a problem, because it will assemble back into the same binary. However,
if that word contained a reference to (i.e. the offset into the binary of) some significant part of
the binary, the reference may become incorrect if code is inserted or removed between the source and
destination words/bytes (or simply before the source) before the output file is assembled.

This is the problem most likely to be experienced when generating a 'source' file for a binary to be
assembled into a new version of the binary, after making some changes in the new source file. If you
make changes which will insert or remove words in the binary, you should be very careful about this
problem.
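
The problem can be demonstrated with a toy model. In the Python sketch here, a 'ref' item stands
for a reference SID recognised (so it is emitted as a label and recomputed on assembly), while a
plain 'word' holding the same value stands for a reference SID missed. The item format and the
assemble function are assumptions made purely for illustration; they do not model objasm.

```python
def assemble(source):
    """Assemble a list of ('label', name), ('word', value) or ('ref', name) items."""
    offsets = {}
    position = 0
    for kind, value in source:          # first pass: find the offset of each label
        if kind == "label":
            offsets[value] = position
        else:
            position += 4
    output = []
    for kind, value in source:          # second pass: emit words, resolving references
        if kind == "word":
            output.append(value)
        elif kind == "ref":
            output.append(offsets[value])
    return output


recognised = [("ref", "target"), ("word", 0xE1A00000),
              ("label", "target"), ("word", 0x12345678)]
missed = [("word", 8), ("word", 0xE1A00000),
          ("label", "target"), ("word", 0x12345678)]

# Unmodified, both sources assemble to the same binary.
same_before = assemble(recognised) == assemble(missed)

# Insert one extra word before the target in each source.
extra = ("word", 0xE1A00000)
recognised_after = assemble(recognised[:1] + [extra] + recognised[1:])
missed_after = assemble(missed[:1] + [extra] + missed[1:])
print(same_before, recognised_after[0], missed_after[0])
```

After the insertion the recognised reference follows the target to its new offset, while the
missed one still holds the old value.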

Some things to look out for which aren't a problem but do result in an output file which assembles
to a different binary to the source binary are:

* If the source binary was compressed with squeeze or modsqz, the output binary built with objasm
  from the SID output probably won't be compressed!

* When disassembling with the -macros switch ("Use standard macros" in the front end), you may find
  stack pushes and pulls of a single register which were encoded originally as LDM and STM
  instructions have been optimised into an LDR or STR instruction. This is not harmful to the output
  binary - it will actually take fewer cycles to execute on modern ARMs.

* The C compiler appears to encode ADR instructions in a different way to objasm. This results in
  the same ADR destination, but the instruction encoding differs. This is harmless to the output
  binary because the ADR still points to the correct location.


FEEDBACK ___________________________________________________________________________________________

If you have any feedback, suggestions or fault reports, please mail your comments to:

  enquiries@7thsoftware.co.uk

Please put the word SID in the body and subject line of your e-mail so I can ignore it! Fault
reports should include:

  * The version of SID you were using (look in the Info window)
  * The version of RISC OS it ran on (*Fx 0)
  * The version of the debugger you have (*Help Debugger)
  * The nature of the failure (i.e. it disassembled word xxx into yyy, which is wrong!)
  * A copy of the binary which caused the problem (if possible - or just an extract)

Remember: SID fundamentally uses the Debugger module to convert a word into an instruction. If it
still looks wrong in Zap (or the like) in Code mode, then the Debugger module is probably at fault.
Send me the fault report anyway and I'll get the Debugger module fixed.

As a last word, I have a list of things I might do to make SID better, but I'll keep them to myself
for now! If you have ideas, please send them to me. If you have Offsets files which you think others
may want to use, send them to me and I'll put them into the Demos directory of the next release.
