With the -O2 option, gcc performs some branch optimizations by default. I discovered this one recently while trying to understand the __builtin_expect() gcc builtin.
Consider the following example:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char *c = malloc(10);
    if (c == NULL) {
        printf("MALLOC FAILED\n");
        return 1;
    }
    free(c);
    printf("MALLOC SUCCEEDED\n");
}
If I compile with 'gcc -S test.c' and 'gcc -O2 -S test.c' and compare the assembly, here is the output. Now what does that mean? In the optimized version, gcc treats the malloc() call as succeeding in most cases (the likely condition) and moves that code to the front of the function; the failure path is placed afterwards. How is gcc able to do this kind of optimization? Does it do this based upon the memory footprint of the system? If so, how does that work in a cross-compilation situation? I have no answer. Or is this optimization based on the exit point? Since our code is small, gcc may be predicting that very little happens after malloc(), and hence there is little to worry about in the reordering since all the code may fit within a cache line. My assumption was that this kind of optimization was left to the user via the __builtin_expect() directive, even when the -O2 switch is used.
Look at the difference in assembly output:
1) Optimized one
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "MALLOC FAILED"
.LC1:
.string "MALLOC SUCCEEDED"
.section .text.startup,"ax",@progbits
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB34:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $10, %edi
call malloc
testq %rax, %rax
je .L4
movq %rax, %rdi
call free
movl $.LC1, %edi
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
jmp puts
.L4:
.cfi_restore_state
movl $.LC0, %edi
call puts
movl $1, %eax
popq %rdx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE34:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",@progbits
2) Non-Optimized
.section .rodata
.LC0:
.string "MALLOC FAILED"
.LC1:
.string "MALLOC SUCCEEDED"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $10, %edi
call malloc
movq %rax, -8(%rbp)
cmpq $0, -8(%rbp)
jne .L2
movl $.LC0, %edi
call puts
movl $1, %eax
jmp .L1
.L2:
movq -8(%rbp), %rax
movq %rax, %rdi
call free
movl $.LC1, %edi
call puts
.L1:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",@progbits
If you look at the optimized assembly code,
je .L4
is the branch to the failure path. However, right after 'je .L4', we see the success code:
movq %rax, %rdi
call free
movl $.LC1, %edi
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
jmp puts
That means gcc has optimized the branch layout based on its own prediction, so we do not have to use __builtin_expect() for such cases ;-). I am not sure whether this applies to all sorts of branching or only to branches with little code.
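For completeness, here is how __builtin_expect() is typically used when you do want to state the expectation yourself; the likely()/unlikely() macro names are a convention borrowed from the Linux kernel, not part of gcc, and demo() is just my wrapper around the article's example:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hint macros in the style of the Linux kernel's likely()/unlikely().
 * The !! normalizes any nonzero value to 1 before the comparison. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Same logic as the article's example, with the failure branch
 * explicitly marked as unlikely. Returns 1 on allocation failure. */
static int demo(void)
{
    char *c = malloc(10);
    if (unlikely(c == NULL)) {
        printf("MALLOC FAILED\n");
        return 1;
    }
    free(c);
    printf("MALLOC SUCCEEDED\n");
    return 0;
}
```

The hint only affects code layout and static prediction; it does not change what the program does, so with -O2 the generated assembly for this version should look much like the optimized listing above.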
The non-optimized version takes the normal course:
subq $16, %rsp
movl $10, %edi
call malloc
movq %rax, -8(%rbp)
cmpq $0, -8(%rbp)
jne .L2 // success: jump to the normal path
movl $.LC0, %edi // failure path
call puts
movl $1, %eax
jmp .L1
Let me know if you have alternate thoughts. The code was compiled under Linux Mint on the x86_64 architecture. I am not sure whether the same behavior would be seen across all archs.