Wednesday, March 19, 2014

Speculative stores

Recently I read this wonderful article on C11 atomic variables and their possible use in the Linux kernel. This is why I admire community coding: people throng into fruitful discussions and eventually the best comes out, with a lot of learning from all corners. In the article, Mr. Corbet mentions speculative stores, which are among the nastier consequences of compiler optimizations. I wrote about how intelligently compilers handle certain cases here and here. So let's look at the following example, which is also mentioned in the article.

int y=2;

int do_some_work()
{
    y = 2;

    if (y)
        ....
    else
        ....
}


In the code above, many programmers would expect the compiler to rip out the 'else' branch, since 'y' was just set to 2. That is the dangerous part if the code belongs to kernel space. Why? 'y' lives in the data segment and can be modified by any CPU in an SMP system; another CPU may even set it to zero between the assignment and the test. In that case the optimization silently changes what the code does. I cannot say how compilers treat such code in kernel space, since that takes a bit of time to experiment with.

How can we try it in user space? Very simple! Run more than one thread and let's see what the assembly looks like. Note that this code is not thread safe.

#include <pthread.h>
#include <stdio.h>
#include <assert.h>

int y = 0;

void *thread_routine(void *arg)
{
        y = 1;
        /* pthread_t is an unsigned long on Linux/glibc, hence the cast and %lu */
        if (y)
                printf("Y is Y in thread = %lu\n", (unsigned long)pthread_self());
        else
                printf("Y is !Y in thread = %lu\n", (unsigned long)pthread_self());
        return NULL;
}

void *thread_routine2(void *arg)
{
        y = 0;
        if (y)
                printf("Y is !Y in thread = %lu\n", (unsigned long)pthread_self());
        else
                printf("Y is Y in thread = %lu\n", (unsigned long)pthread_self());
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[2];
        int thread_rc = 0;

        thread_rc = pthread_create(&tid[0], NULL, thread_routine, NULL);
        assert(!thread_rc);
        thread_rc = pthread_create(&tid[1], NULL, thread_routine2, NULL);
        assert(!thread_rc);

        /* Wait for both threads; otherwise main may exit before they run */
        pthread_join(tid[0], NULL);
        pthread_join(tid[1], NULL);
        return 0;
}


I am stripping off the unnecessary sections and retaining only the assembly for thread_routine. The thread_routine2 function looks similar.

thread_routine:
.LFB2:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movl    $1, y(%rip)
        movl    y(%rip), %eax <-- I know you are tricking me :D
        testl   %eax, %eax
        je      .L2

        call    pthread_self
        movq    %rax, %rsi
        movl    $.LC0, %edi
        movl    $0, %eax
        call    printf
        jmp     .L4
.L2:
        call    pthread_self
        movq    %rax, %rsi
        movl    $.LC1, %edi
        movl    $0, %eax
        call    printf

.L4:
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc


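If you glance at the assembly, gcc emits code for both the if and the else part even though there is a straightforward assignment just before the branch. Also look at "movl y(%rip), %eax"! Instead of blindly copying the value '1' into the EAX register, the actual value of 'y' is reloaded from memory and tested :-). Do not read this as gcc being smart about threads, though :D. This listing was built without any -O switch, and at that default level (-O0) gcc does not optimize at all: every variable is stored to memory and loaded back on each use. 'y' is a plain int, neither volatile nor atomic, so once optimization is enabled the compiler is free to assume nobody else changes it between the store and the test, and it may well drop the reload and the else branch.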

Caveat: multithreading may not faithfully emulate an SMP scenario on Linux. The scheduler tends to keep a thread on the CPU it last ran on rather than bouncing it between CPUs. This is mainly to avoid the penalty of cache line invalidations, especially when a global variable is involved and can be modified. Nevertheless, a thread can be pre-empted in the middle of an operation (say, between the store to 'y' and the evaluation of the if{} condition), and unless a lock protects 'y', another thread can modify it in that window. Understanding SMP systems is quite intricate, but it opens up a wide variety of thoughts in the programming world. Two cores are not two brains, you know ;-). There are a lot of difficulties in handling such scenarios!

Finally, short assignments ;-):

1) Examine the assembly with the -O2 switch
2) Remove the threads and run a bare minimum program while retaining the data segment variable, and observe what gcc does!

Thursday, March 6, 2014

Handling of absurd code by gcc - The absurd sequel ;-)

I mentioned this in my previous blog here. Now let's slightly extend the condition to this:

int test(unsigned int k)
{
        if (k <= 0)
                printf("BUG IN COMPILER: THIS IS ABSURD BLOCK\n");
}


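We can expect a few changes from the compiler. The condition now reads as two checks: one for 'k' being equal to zero and one for 'k' being less than zero. Since 'k' is unsigned and has no sign bit, the second check can never be true. What does the compiler emit?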

.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    %edi, -4(%rbp)
        cmpl    $0, -4(%rbp) <-- Just compare with zero and kick programmer
        jne     .L3
        movl    $.LC0, %edi
        call    puts



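As expected, it emits only the comparison with zero :D. gcc is very smart in this case too! Everything else remains the same, except that we now have "call puts" to print the message when the condition is true; gcc replaces the printf with puts because the string has no format specifiers and ends with a newline.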

Wednesday, March 5, 2014

Handling of absurd code by gcc

We programmers are bound to make silly mistakes and tend to write absurd code :-). If these things are spotted during code reviews or internal testing, then you are spared. If the buggy stuff lands on the customer's runway, then we are in trouble :-). Today's blog closely examines gcc's behavior with such absurd code. This example is not exhaustive. I will try to come up with a few more examples in the future, to learn myself and to share observations. For now, here is the sample code.

#include<stdio.h>

int test(unsigned int k)
{
        if (k < 0)
                printf("BUG IN COMPILER: THIS IS ABSURD BLOCK\n");
}

int main()
{
        unsigned int k = 9;
        test(k);
        return 0;
}


As a programmer you know the absurdity of the code: an unsigned value can never be less than zero. So how does gcc behave? We can only tell by looking at the assembly output emitted by gcc. Here is the assembly dump of the program (with default optimization).

<snip>

test:
.LFB0:
        .cfi_startproc
        pushq   %rbp /* Save caller's base pointer */
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp /* New base pointer */
        .cfi_def_cfa_register 6
        movl    %edi, -4(%rbp) /* Copy Argument */
        popq    %rbp /* Restore base pointer */
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .globl  main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    $9, -4(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, %edi
        call    test /* Here is call to function */
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret


<snip>


As you can see, the entire 'if' block is discarded by gcc :-D. Compilers are smart nowadays ;-). They know how to get rid of weeds and keep the ELF fertile :-). Even though we inject irrelevant code, compilers (at least gcc) get rid of it in the final binary. Let me see what more weird stuff can be experimented with. Hope you enjoyed this small and simple post. Critiques and comments are always welcome!