Sunday, June 16, 2013

Branch optimization in gcc


With -O2 option, gcc by default does some of branch optimizations. This one I recently discovered trying to understand __builtin_expect() gcc buitlin.

Consider following example:

#include <stdio.h>
#include <stdlib.h>

int main()
{
        char *c = malloc(10);
        if (c ==NULL) {
                printf("MALLOC FAILED\n");
                return 1;
        }
        free(c);
        printf("MALLOC SUCCEEDED\n");
}

If I compile as 'gcc -S test.c' and 'gcc -O2 -S test.c' and compare the assembly here is the output. Now what does that mean? In the optimized version, the malloc() is considered to be error free for most of cases (likely condition) by gcc and the part of code is moved near the text segment. The failure part is put in the sequel. How come gcc is doing such kind of optimization? Does this do based upon memory footprint of system? If so, how about cross compiler situation? I have no answer. Is this optimization based on exit point (since our code is small, gcc may be predicting things after malloc() are really less and hence nothing to worry on optimization since all code may be within cacheline size). My assumption was such kind of optimization is only left to user based on __builtin_expect() directive even if -O2 switch is used.

Look at assembly output difference

1) Optimized one

.section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "MALLOC FAILED"
.LC1:
        .string "MALLOC SUCCEEDED"
        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB34:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movl    $10, %edi
        call    malloc
        testq   %rax, %rax
        je      .L4
        movq    %rax, %rdi
        call    free
        movl    $.LC1, %edi
        addq    $8, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        jmp     puts

.L4:
        .cfi_restore_state
        movl    $.LC0, %edi
        call    puts
        movl    $1, %eax
        popq    %rdx
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE34:
        .size   main, .-main
        .ident  "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
        .section        .note.GNU-stack,"",@progbits


2) Non-Optimized

.section        .rodata
.LC0:
        .string "MALLOC FAILED"
.LC1:
        .string "MALLOC SUCCEEDED"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    $10, %edi
        call    malloc
        movq    %rax, -8(%rbp)
        cmpq    $0, -8(%rbp)
        jne     .L2
        movl    $.LC0, %edi
        call    puts
        movl    $1, %eax
        jmp     .L1

.L2:
        movq    -8(%rbp), %rax
        movq    %rax, %rdi
        call    free
        movl    $.LC1, %edi
        call    puts
.L1:
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
        .section        .note.GNU-stack,"",@progbits



If you look at optimized assembly code

testq    %rax, %rax
je    .L4

which is callng the failure one. However right after 'je .L4', we see success code

movq    %rax, %rdi
call    free
movl    $.LC1, %edi
addq $8, %rsp       .cfi_remember_state                                                                                                                                            
.cfi_def_cfa_offset8                                                                                                                                          
jmp    puts


That means gcc has optimized the branch prediction. So we do not have to use __builtin_expect() for such cases ;-). Not sure if this is applicable for all sort branching or only branches with less code.

The non-optimized part takes normal course.

subq    $16, %rsp
movl    $10, %edi
call    malloc
movq    %rax, -8(%rbp)

cmpq    $0, -8(%rbp)
jne    .L2   // call usual malloc
movl    $.LC0, %edi //failure
call    puts
movl    $1, %eax
jmp    .L1


Let me know if you have alternate thoughts. The code is compiled under Linux mint x86_64 architecture. Not sure if same behavior could be seen across all archs.