Monday, September 15, 2014

A simple read-write lock implementation - Problem 2

This is the continuation of the analysis of the simple read-write lock implementation. This time the problem is starvation of the writers.

Let me modify the thread routines and remove the sleep() after unlock, which means that readers and writers are assumed to take no pause between iterations. Here are the modified thread routines.

void*
thread_routine1 (void *arg)
{
    for (;;) {
        my_rwlock_read_lock(&g_rw_lock);
        printf("Grabbed read-lock -- 1\n");
        sleep(10);
        my_rwlock_read_unlock(&g_rw_lock);
    }
}

void*
thread_routine2 (void *arg)
{
    for (;;) {
        my_rwlock_read_lock(&g_rw_lock);
        printf("Grabbed read-lock -- 2\n");
        sleep(10);
        my_rwlock_read_unlock(&g_rw_lock);
    }
}


void*
thread_routine3 (void *arg)
{
    for (;;) {
        my_rwlock_write_lock(&g_rw_lock);
        printf("Grabbed write-lock -- 1\n");
        sleep(5);
        my_rwlock_write_unlock(&g_rw_lock);
    }
}

void*
thread_routine4 (void *arg)
{
    for (;;) {
        my_rwlock_write_lock(&g_rw_lock);
        printf("Grabbed write-lock -- 2\n");
        sleep(5);
        my_rwlock_write_unlock(&g_rw_lock);
    }
}


What is the output?

Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed read-lock -- 2
Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed read-lock -- 1
Grabbed read-lock -- 1
Grabbed read-lock -- 2


:). As you can see, the writers are severely starved! As of now, I still have no concrete ideas on how to fix this issue. If you find any suitable approach, please let me know in the comment box.

Wednesday, September 3, 2014

Filesplitter source in github now!

Today, I am glad to announce the availability of the FileSplitter source code on github. It was lying on sourceforge for quite some time, and since I wish to enhance my github profile I have uploaded it there as well :). It was written way back in 2007 (IIRC), in Java. The UI design matches that of the chainsaw file splitter (check here) and is built using the Java Swing APIs. Unfortunately, it works only on Windows, i.e. the final output is a batch file and cannot be used on Linux :(. I have enhancement plans and am willing to improve it in future releases. There are lots of known bugs, and I wish someone could dig up more of them to make it robust. Also, the code is a little bit cluttered and I want to clean it up. For all of this, I need to find sufficient time to move ahead. The second piece of good news is that the code is available under the GNU GPL license, so you are free to enhance it and play with it.

Jump to the github page now!: https://github.com/nkumar85/filesplitter_java

Features
--------------
Can handle files above 4GB and split sizes above 4GB (need confirmation)
Checksum validation tool available to check whether the merged file is intact.

Planned enhancements
-------------------------------
Shell script support for Linux
Tool to merge the split files manually
Clean up the cluttered code

Bugs
---------
Long strings are not handled when creating batch files
Many more bugs that I have not documented [there are no show-stoppers though ;)]

Note
--------
The code is provided under the GNU GPL License and does not contain any proprietary code/materials. Please do not sell the tool for money; feel free to distribute it to anyone free of charge. Any modifications to the code have to be communicated to me (email details are provided in the About dialog of filesplitter).

Enjoy using the filesplitter :)

Tuesday, August 26, 2014

Set or Unset a particular bit in an integer - A short program

Many times we need a method to set or unset a bit. It is easy to do with an 'if' construct, but just for fun, is it possible without any conditional construct? Here is the program I wrote.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

uint32_t
set_unset_bit (uint32_t src, uint32_t pos, bool set)
{
    return ((src & ~((uint32_t)1 << (pos-1))) | (!!set << (pos-1)));
}

int main (int argc, char **argv)
{
    uint32_t test1 = 0;
    uint32_t test2 = 0xFFFFFFFF;

    printf("%x\n", set_unset_bit(test1, 10, 1));
    printf("%x\n", set_unset_bit(test2, 11, 0));
}


This works for uint32_t and behaves as expected. What is the approach? Basically, you first clear the bit of interest and then OR it with whatever the user requested (set or unset, i.e. 1 or 0). The first part "(src & ~((uint32_t)1 << (pos-1)))" forcibly sets the bit at the requested position 'pos' to zero, while the second part "| (!!set << (pos-1))" sets or clears it based on the requirement. Hope you enjoy the small code snippet.

Why do you think the '!!set' is required? Read my previous article here.

Output:

nandakumar@heramba ~/setbit $ ./a.out
200
fffffbff

Final note: this works only for uint32_t, but it can be extended to other types too! For uint64_t, make sure to cast the '1' to uint64_t, as shown in the sketch below.
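For instance, a 64-bit variant could look something like this (a minimal sketch along the same lines; set_unset_bit64 is just a name I made up):

#include <stdint.h>
#include <stdbool.h>

/* Same trick as above, but for uint64_t. Note the (uint64_t) cast on '1';
 * without it the shift would be done on a 32-bit int and break for pos > 32. */
uint64_t
set_unset_bit64 (uint64_t src, uint32_t pos, bool set)
{
    return ((src & ~((uint64_t)1 << (pos-1))) | ((uint64_t)!!set << (pos-1)));
}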

Sunday, August 24, 2014

A simple read-write lock implementation - Problem 1

Here I implemented a very basic read-write lock based on the pthread library. You may be wondering how the code below works.

void
my_rwlock_write_lock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_mutex_lock(&rw_lock->rw_mutex);
    if (rw_lock->num_readers) {
        pthread_cond_wait(&rw_lock->rw_cond, &rw_lock->rw_mutex);
    }
}

Hell, where is the unlock? Does it not cause permanent deadlock?

If you look at the manpage of pthread_cond_wait(), the function atomically unlocks the mutex and waits on the condition variable. So there is no need to worry!
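As an aside, the canonical wait pattern from the manpage looks like the sketch below (generic code with hypothetical names, independent of my rwlock implementation): the waiter re-checks its predicate in a loop, and pthread_cond_wait() re-acquires the mutex before returning.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static bool            ready = false;   /* the predicate being waited on */

void
wait_until_ready (void)
{
    pthread_mutex_lock(&mutex);
    while (!ready) {
        /* Atomically releases 'mutex' and sleeps; the mutex is held again
         * by the time this call returns. */
        pthread_cond_wait(&cond, &mutex);
    }
    /* ... use the shared state while still holding the mutex ... */
    pthread_mutex_unlock(&mutex);
}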

Now coming to the problem. As you are aware, the previous implementation ran in a continuous loop and hence readers and writers were grabbing their time slices regularly (even though not fairly). Now, let me remove the for-loops from all the thread routines and examine the output.

nandakumar@heramba ~/rwlock $ ./a.out
Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 1


Voila, only one writer gets woken up while the other waits forever. What is the reason? There is no one left to wake it up! pthread_cond_signal() wakes up only one thread, and hence only one of the writer threads gets woken up. How to get around this problem? Simple: wake up all writer threads waiting on the lock. In that case, the writer threads will be scheduled as per the scheduling policy, but none of them will remain blocked on the condition variable.

The modified part of read_unlock is below.

void
my_rwlock_read_unlock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_mutex_lock(&rw_lock->rw_mutex);
    if (rw_lock->num_readers) {
        rw_lock->num_readers--;
        if (!rw_lock->num_readers) {
            pthread_cond_broadcast(&rw_lock->rw_cond);
        }
    }
    pthread_mutex_unlock(&rw_lock->rw_mutex);
}


And here is output:

nandakumar@heramba ~/rwlock $ ./a.out
Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 1
Grabbed write-lock -- 2


Works as expected! Another way is to signal the threads waiting on the condition variable when the write lock is unlocked. Here is the modified code snippet and output. Note that the signal on the last reader's unlock is still required.

void
my_rwlock_write_unlock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_cond_broadcast(&rw_lock->rw_cond);
    pthread_mutex_unlock(&rw_lock->rw_mutex);
}


Output:

nandakumar@heramba ~/rwlock $ gcc my_rwlock.c -lpthread
nandakumar@heramba ~/rwlock $ ./a.out
Grabbed read-lock -- 2
Grabbed read-lock -- 1
Grabbed write-lock -- 2
Grabbed write-lock -- 1


I still feel #1 is the better solution, since it is the readers' responsibility to wake up the writer threads.

Finally, here is a comparison between the initial implementation and the one with pthread_cond_broadcast() in read_unlock().

Output with pthread_cond_signal:

Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 1
Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 2
Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 1
Grabbed read-lock -- 1
Grabbed read-lock -- 2


Output with pthread_cond_broadcast:

Grabbed read-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 1
Grabbed read-lock -- 2
Grabbed write-lock -- 2
Grabbed read-lock -- 1
Grabbed write-lock -- 2
Grabbed read-lock -- 2
Grabbed write-lock -- 1
Grabbed read-lock -- 1
Grabbed write-lock -- 2
Grabbed write-lock -- 1
Grabbed read-lock -- 2
Grabbed read-lock -- 1


As you can see, the initial implementation does not guarantee a fair chance, and it also has the inherent problem mentioned above. In the second one, even though not in order, all threads get a fair chance to access the shared resource. The second run also confirms that no regression is introduced by the change!

There may be far better solutions to circumvent the problem. The comment box is always at your service to help me :)

Hope you enjoyed the sequel. By the way, there are still more problems. Stay tuned for further analysis!

Thursday, August 21, 2014

A simple read-write lock implementation

Here is a simple read-write lock I implemented quickly, just to understand how pthread rwlocks might be implemented. I have not dug deep into the pthread code yet, but I believe it uses a similar mechanism. This implementation functions as per the definition of read-write locks. Note that this is a very basic implementation I coded to satisfy my inquisitiveness. There are problems with it, though, which I will document further.

How does it work?

i)  The structure consists of a mutex, a condition variable and a reader count.
ii) As soon as a reader grabs the lock, the reader count is atomically incremented.
iii) Note that the mutex is common to both readers and writers.
iv) If a thread wants to acquire the write lock, it first atomically checks whether the reader count is zero. If so, it can safely acquire the lock.
v) If there are readers pending, the write lock blocks on the condition variable.
vi) Once all readers are done reading, the condition variable is signalled, waking up the waiting writer thread.

Bugs:

If readers take the same amount of time as writers, then the writers are severely starved and may not even get a chance to hold the lock. I will simulate this in the next write-up.


For brevity, error checks are offloaded to readers and users!

Here is the code then. Execute, test and enjoy! You are open to provide comments on improvements.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>     /* for sleep() */
#include <pthread.h>

typedef struct my_rwlock {
    pthread_cond_t   rw_cond;
    pthread_mutex_t  rw_mutex;
    int              num_readers;
} my_rwlock_t;

#define MY_RWLOCK_INITIALIZER {PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER, 0}

my_rwlock_t g_rw_lock = MY_RWLOCK_INITIALIZER;

void
my_rwlock_init (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    memset(rw_lock, 0, sizeof(my_rwlock_t));
}

void
my_rwlock_read_lock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_mutex_lock(&rw_lock->rw_mutex);
    rw_lock->num_readers++;
    pthread_mutex_unlock(&rw_lock->rw_mutex);
}

void
my_rwlock_read_unlock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_mutex_lock(&rw_lock->rw_mutex);
    if (rw_lock->num_readers) {
        rw_lock->num_readers--;
        if (!rw_lock->num_readers) {
            pthread_cond_signal(&rw_lock->rw_cond);
        }
    }
    pthread_mutex_unlock(&rw_lock->rw_mutex);
}

void
my_rwlock_write_lock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_mutex_lock(&rw_lock->rw_mutex);
    if (rw_lock->num_readers) {
        pthread_cond_wait(&rw_lock->rw_cond, &rw_lock->rw_mutex);
    }
}

void
my_rwlock_write_unlock (my_rwlock_t *rw_lock)
{
    if (!rw_lock) {
        return;
    }

    pthread_mutex_unlock(&rw_lock->rw_mutex);
}

void*
thread_routine1 (void *arg)
{
    for (;;) {
        my_rwlock_read_lock(&g_rw_lock);
        printf("Grabbed read-lock -- 1\n");
        sleep(5);
        my_rwlock_read_unlock(&g_rw_lock);
        sleep(5);
    }
}

void*
thread_routine2 (void *arg)
{
    for (;;) {
        my_rwlock_read_lock(&g_rw_lock);
        printf("Grabbed read-lock -- 2\n");
        sleep(8);
        my_rwlock_read_unlock(&g_rw_lock);
        sleep(5);
    }
}

void*
thread_routine3 (void *arg)
{
    for (;;) {
        my_rwlock_write_lock(&g_rw_lock);
        printf("Grabbed write-lock -- 1\n");
        sleep(5);
        my_rwlock_write_unlock(&g_rw_lock);
        sleep(5);
    }
}

void*
thread_routine4 (void *arg)
{
    for (;;) {
        my_rwlock_write_lock(&g_rw_lock);
        printf("Grabbed write-lock -- 2\n");
        sleep(8);
        my_rwlock_write_unlock(&g_rw_lock);
        sleep(5);
    }
}

int
main (int argc, char *argv[])
{
    pthread_t tid[4];

    pthread_create(&tid[0], NULL, thread_routine1, NULL);
    pthread_create(&tid[1], NULL, thread_routine2, NULL);
    pthread_create(&tid[2], NULL, thread_routine3, NULL);
    pthread_create(&tid[3], NULL, thread_routine4, NULL);

    for (;;) {
        sleep(10);
    }
}

Monday, August 18, 2014

The msgbuf structure in SysV message queue

Most of you who are familiar with SysV message queues must be aware of having to format the message into the following weird structure. The weirdest part is none other than the char mtext[1]. What is this unusual thing? It simply plays the role of an expandable structure.

struct msgbuf {
    long mtype;       /* message type, must be > 0 */
    char mtext[1];    /* message data */
};


For example:

If you wish to send a message of 40 bytes, then you need to create a structure of the following size:

sizeof(long) + 40;

Of course, the first element should be of type 'long', which represents the message type. A sketch of how this could look is below.
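For example, a hypothetical 40-byte message could be declared and sent like this (a minimal sketch; 'my_msg' and 'send_hello' are made-up names, and msqid is assumed to come from an earlier msgget() call). Note that msgsnd() takes the size of mtext only, not the size of the whole structure.

#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MY_TEXT_LEN 40

struct my_msg {
    long mtype;                 /* message type, must be > 0 */
    char mtext[MY_TEXT_LEN];    /* message data */
};

int
send_hello (int msqid)
{
    struct my_msg msg;

    msg.mtype = 1;
    strncpy(msg.mtext, "hello", sizeof(msg.mtext));

    /* msgsz is the payload length, excluding the leading 'long' */
    return msgsnd(msqid, &msg, sizeof(msg.mtext), 0);
}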

Now, why this weird mtext[1]? Could it not have been mtext[0]? Simple: SysV IPCs were introduced a long time ago, and back then there was no C-standard way of declaring a zero-length (or flexible) array inside a structure. That is the sole reason behind mtext[1]. I used to scratch my head over this for a very long time and it has eventually been settled [hope even the louses are out]. If there is any other reason, please leave a comment in the box.
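As a side note, C99 and later do allow this through a flexible array member, so an equivalent declaration today could look like the sketch below (illustrative only; this is not how the SysV headers actually define it):

#include <stdlib.h>

struct my_flex_msgbuf {
    long mtype;     /* message type, must be > 0 */
    char mtext[];   /* message data, sized at allocation time */
};

/* Allocate room for a 40-byte payload after the header. */
struct my_flex_msgbuf *
alloc_msg (void)
{
    return malloc(sizeof(struct my_flex_msgbuf) + 40);
}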

Saturday, May 3, 2014

Glibc malloc in linux

How does glibc implement malloc? As many of you know, in traditional *nix systems it is implemented using the brk() and sbrk() system calls. Basically, these system calls adjust the heap pointer of the process (or rather the break pointer) to a new location, as per the desired size. However, on Linux, glibc employs a different methodology in its malloc implementation. For requests below a threshold "x", glibc allocates via the brk()/sbrk() system calls, while for anything equal to or above "x" the mmap() system call is used. Currently this threshold is 128k. Now, what is mmap()? In the case of mmap(), the break pointer of the process is not adjusted; rather, the kernel allocates pages and attaches the resulting virtual memory area to the process address space. Hence the heap segment is untouched and the stack can grow freely, teasing the heap :D.
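Incidentally, glibc exposes this threshold as a tunable via mallopt(). A minimal sketch (assuming glibc's M_MMAP_THRESHOLD parameter from <malloc.h>; newer glibc versions may also adjust the threshold dynamically):

#include <malloc.h>
#include <stdlib.h>

int main (void)
{
    /* Ask glibc to use mmap() for any request of 64k or more. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    void *p = malloc(100 * 1024);   /* now expected to be served via mmap() */
    free(p);
    return 0;
}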

Alright, can we have a plausible example? Here it is :-). The first one is with 127k and the latter with 128k. Follow the explanations after the example code. Both programs call sleep() so that the process stays around for some time; this lets me look at the process maps. To see how malloc() behaves, one needs to run the system-call trace tool (strace) to understand the flow for different sizes.

1) 127k

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>     /* for sleep() */

#define BUF_SIZE (127 * 1024)

int main()
{
    char *temp = NULL;

    temp = (char*)(malloc(BUF_SIZE));
    strcpy(temp, "LINUX-TUX :-)");
    sleep(100);
}


compile with gcc and run the resulting executable with strace
>>strace ./a.out

I have stripped out most of the output, especially the memory mappings of the shared libraries. The relevant snippet looks like this:

<<snip>>
brk(0)                                  = 0xb1d000
brk(0xb5d000)                           = 0xb5d000
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
<<snip>>


As you can see, the malloc() implementation uses brk() system calls to grow the data segment. Initially it passes 0 to find the current break location, and it then asks the OS to move the break to a new location large enough to cover the 127k request (and more, as we will see below).

How to verify? Most Unix systems list memory mappings in /proc/<pid>/maps. Let's do the same for our case. As I mentioned earlier, the sleep() gives us time to grab the memory map; otherwise the process exits and its proc entry disappears.

In one more terminal, obtain process-id of a.out

>>pgrep a.out
14565

>>cat /proc/14565/maps
00400000-00401000 r-xp 00000000 08:0a 1708078                            /home/nandakumar/a.out
00600000-00601000 r--p 00000000 08:0a 1708078                            /home/nandakumar/a.out
00601000-00602000 rw-p 00001000 08:0a 1708078                            /home/nandakumar/a.out
00b1d000-00b5d000 rw-p 00000000 00:00 0                                  [heap]
7f5fe1ca0000-7f5fe1e5d000 r-xp 00000000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7f5fe1e5d000-7f5fe205d000 ---p 001bd000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7f5fe205d000-7f5fe2061000 r--p 001bd000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7f5fe2061000-7f5fe2063000 rw-p 001c1000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7f5fe2063000-7f5fe2068000 rw-p 00000000 00:00 0
7f5fe2068000-7f5fe208b000 r-xp 00000000 08:0a 658164                     /lib/x86_64-linux-gnu/ld-2.17.so
7f5fe2268000-7f5fe226b000 rw-p 00000000 00:00 0
7f5fe2288000-7f5fe228a000 rw-p 00000000 00:00 0
7f5fe228a000-7f5fe228b000 r--p 00022000 08:0a 658164                     /lib/x86_64-linux-gnu/ld-2.17.so
7f5fe228b000-7f5fe228d000 rw-p 00023000 08:0a 658164                     /lib/x86_64-linux-gnu/ld-2.17.so
7ffffa977000-7ffffa998000 rw-p 00000000 00:00 0                          [stack]
7ffffa99d000-7ffffa99f000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]


Here we go :D. The new break pointer obtained from the system call matches the proc entry [and obviously should match ;-)]. The length, if you calculate it, is [00b5d000 - 00b1d000] = 256K.

Wait a minute! How come so much? That's how malloc works. It allocates more than what is required and serves later requests from that pool internally. This avoids frequent system-call overhead and also fragmentation to a certain extent.

If you are curious about the logic behind the calculation, please dissect the sysmalloc() function in glibc/malloc/malloc.c {homework :-)}.
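One hypothetical way to observe this pooling is to add a second, small allocation to the 127k example and run it under strace again: the expectation is that no additional brk() call shows up, because the request is served from the already-extended heap.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (127 * 1024)

int main()
{
    char *first  = malloc(BUF_SIZE);    /* triggers the brk() seen above    */
    char *second = malloc(1024);        /* expected: no further brk() call  */

    strcpy(first, "LINUX-TUX :-)");
    strcpy(second, "pooled");
    sleep(100);
}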

2) 128k (mmap). Here is next example!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>     /* for sleep() */

#define BUF_SIZE (128 * 1024)

int main()
{
    char *temp = NULL;

    temp = (char*)(malloc(BUF_SIZE));
    strcpy(temp, "LINUX-TUX :-)");
    sleep(100);
}

Snipping unnecessary parts:

<<snip>>
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd51e67000
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
<<snip>>


Now let's look at the maps:

>>pgrep a.out
15060

>>cat /proc/15060/maps
00400000-00401000 r-xp 00000000 08:0a 1708078                            /home/nandakumar/a.out
00600000-00601000 r--p 00000000 08:0a 1708078                            /home/nandakumar/a.out
00601000-00602000 rw-p 00001000 08:0a 1708078                            /home/nandakumar/a.out
7fbd518c0000-7fbd51a7d000 r-xp 00000000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7fbd51a7d000-7fbd51c7d000 ---p 001bd000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7fbd51c7d000-7fbd51c81000 r--p 001bd000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7fbd51c81000-7fbd51c83000 rw-p 001c1000 08:0a 658188                     /lib/x86_64-linux-gnu/libc-2.17.so
7fbd51c83000-7fbd51c88000 rw-p 00000000 00:00 0
7fbd51c88000-7fbd51cab000 r-xp 00000000 08:0a 658164                     /lib/x86_64-linux-gnu/ld-2.17.so
7fbd51e67000-7fbd51e8b000 rw-p 00000000 00:00 0
7fbd51ea8000-7fbd51eaa000 rw-p 00000000 00:00 0
7fbd51eaa000-7fbd51eab000 r--p 00022000 08:0a 658164                     /lib/x86_64-linux-gnu/ld-2.17.so
7fbd51eab000-7fbd51ead000 rw-p 00023000 08:0a 658164                     /lib/x86_64-linux-gnu/ld-2.17.so
7fff48a97000-7fff48ab8000 rw-p 00000000 00:00 0                          [stack]
7fff48bfe000-7fff48c00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]


If you glance at the process maps, the address returned by mmap(), 0x7fbd51e67000, is attached to the process address space as the region 7fbd51e67000-7fbd51e8b000, which is 144k. The kernel returns mmap'd memory aligned to the page size. If you look at the mmap system call, the size supplied by glibc is 128k+4k; as said earlier, the extra 4k is presumably for some housekeeping. The kernel then hands out whole pages, and anonymous memory obtained via mmap() is zero-filled by the kernel, so the bytes beyond what we actually use start out as zero. Also note that there is no heap segment now!

Memory allocation is a complex matter, and one needs to dig into the glibc code to know more. The mmap() system call also has a fairly complex implementation in the Linux kernel :-). If you wish to know why malloc follows this mechanism, please read Linux System Programming by Robert Love, which provides excellent insight into this topic.

As mentioned, memory topics are quite confusing and complicated. If you have feedback or corrections, please feel free to leave a comment. It also helps me to improve.

References:


1) Linux system programming by Robert Love

2) glibc malloc code (just peeped into)

3) If you wish, you can check the Linux kernel mmap code [mm/mmap.c -- SYSCALL_DEFINE6(mmap_pgoff, ...)]. As a heads-up, the code is quite complicated and requires good knowledge of the Linux kernel MM subsystem.

Wednesday, March 19, 2014

Speculative stores

Recently I read this wonderful article on C11 atomic variables and their possible usage in the Linux kernel. This is why I admire community coding: people throng into fruitful discussions and eventually the best comes out. The final result is a lot of learning from all corners. In the article, Mr. Corbet mentioned speculative stores, meaning the consequences of nasty compiler optimizations. I wrote about how intelligently compilers handle certain cases here and here. So let's look at the following example, which is also mentioned in the article.

int y=2;

int do_some_work()
{
    y = 2;

    if (y)
        ....
    else
        ....
}


In the above code, many programmers may expect the compiler to rip out the 'else' branch. That is the dangerous part if the code belongs to kernel space. Why? 'y' is a data-segment entity and can be manipulated by any CPU in an SMP system; it can even be set to zero by some other CPU between the store and the test. In that case, the compiler's optimization would produce untidy results. I cannot say how compilers treat such code in kernel space, since that requires a bit of time to experiment.

How can we try this in user space? Very simple! Run more than one thread and let's see what the assembly looks like. Note that this code is not thread-safe.

#include <pthread.h>
#include <stdio.h>
#include <assert.h>

int y=0;
void* thread_routine(void* arg);

void* thread_routine(void* arg)
{
        y=1;
        if(y)
                printf("Y is Y in thread = %lu\n", (unsigned long)pthread_self());
        else
                printf("Y is !Y in thread = %lu\n", (unsigned long)pthread_self());
        return NULL;
}

void* thread_routine2(void* arg)
{
        y=0;
        if(y)
                printf("Y is !Y in thread = %lu\n", (unsigned long)pthread_self());
        else
                printf("Y is Y in thread = %lu\n", (unsigned long)pthread_self());
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[2];

        int thread_rc = 0;

        thread_rc = pthread_create(&tid[0], NULL, thread_routine, NULL);
        assert(!thread_rc);
        thread_rc = pthread_create(&tid[1], NULL, thread_routine2, NULL);
        assert(!thread_rc);

        /* Wait for both threads, otherwise main may exit before they run. */
        pthread_join(tid[0], NULL);
        pthread_join(tid[1], NULL);
        return 0;
}


I am stripping off the unnecessary sections and retaining only the assembly for the thread routine. The thread_routine2 function looks similar.

thread_routine:
.LFB2:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movl    $1, y(%rip)
       movl    y(%rip), %eax <-- I know you are tricking me :D
       testl   %eax, %eax
       je      .L2

        call    pthread_self
        movq    %rax, %rsi
        movl    $.LC0, %edi
        movl    $0, %eax
        call    printf
        jmp     .L4
.L2:
        call    pthread_self
        movq    %rax, %rsi
        movl    $.LC1, %edi
        movl    $0, %eax
        call    printf

.L4:
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc


If you glance at the assembly, gcc is a smart man :D. It knows that it should not optimize in such cases: it emits code for both the if and the else part, even though there is a straightforward assignment right before the branch. Also look at "movl y(%rip), %eax"! Instead of blindly assuming the value '1' is in the EAX register, the actual value of 'y' is loaded and tested :-).

Caveat: multi-threading may not emulate an SMP scenario on Linux. Nowadays, operating systems tend to keep a thread on a particular CPU rather than spreading it across CPUs, mainly to avoid the penalty incurred by cache-line invalidations, especially when a shared global variable can be modified. Nevertheless, a thread can be pre-empted in the middle of an operation (say, right before the if{} branch is evaluated) unless a lock is held explicitly. Understanding SMP systems is quite intricate, but it opens up a wide variety of thoughts in the programming world. Two cores are not two brains, you know ;-). There are a lot of difficulties in handling such scenarios!

Finally short assignments ;-): 

1) Examine the assembly with the -O2 switch.
2) Remove the threads, run a bare-minimal program while retaining the data-segment variable, and observe what gcc does!

Thursday, March 6, 2014

Handling of absurd code by gcc - The absurd sequel ;-)

I mentioned this in my previous blog post here. Now let us slightly extend the condition to this:

int test(unsigned int k)
{
        if (k <= 0)
                printf("BUG IN COMPILER: THIS IS ABSURD BLOCK\n");
}


We can expect a few changes from the compiler. The condition now has two parts: a comparison against zero and a 'less than zero' check, which can never be true for an unsigned type. What does the compiler emit?

.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    %edi, -4(%rbp)
        cmpl    $0, -4(%rbp)    <-- Just compare with zero and kick the programmer
        jne     .L3
        movl    $.LC0, %edi
        call    puts



As expected, it emits only the comparison with zero :D. gcc is very smart with this case too! Everything remains the same, except that we now have a "call puts" to print when the condition is true (gcc replaces printf of a constant string ending in a newline with puts).

Wednesday, March 5, 2014

Handling of absurd code by gcc

We programmers are bound to make silly mistakes and tend to write absurd code :-). If these things are spotted during code reviews or internal testing, then we are spared. If the buggy stuff lands on the customer's runway, then we are in trouble :-). Today's blog closely examines gcc's behaviour towards such absurd code. This example is not exhaustive; I will try to come up with a few more examples in the future, to learn myself and to share observations. For now, here is the sample code.

#include<stdio.h>

int test(unsigned int k)
{
        if (k < 0)
                printf("BUG IN COMPILER: THIS IS ABSURD BLOCK\n");
}

int main()
{
        unsigned int k = 9;
        test(k);
        return 0;
}


As a programmer, you know the absurdity of the code (k is unsigned, so it can never be less than zero). So how does gcc behave? We can only tell by looking at the assembly output it emits. Here is the assembly dump of the program (with default optimization).

<snip>

test:
.LFB0:
        .cfi_startproc
        pushq   %rbp /* Save base pointer */
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp /* New base pointer */
        .cfi_def_cfa_register 6
       movl    %edi, -4(%rbp) /* Copy Argument */
       popq    %rbp /* Restore base pointer */
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .globl  main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    $9, -4(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, %edi
       call    test /* Here is call to function */
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret


<snip>


As you can see, the entire 'if' block is discarded by gcc :-D. Compilers are smart nowadays ;-). They know how to get rid of weeds and make the ELF fertile :-). Even though we inject irrelevant code, compilers (at least gcc) get rid of it in the final binary. Let me see what more weird stuff can be experimented with. Hope you enjoyed this small and simple post. Critiques and comments are always welcome!

Saturday, February 22, 2014

LD_PRELOAD environment variable - A short insight

I believe most Unix C programmers are aware of the uses of the LD_PRELOAD environment variable. Basically, it can be used to override any functionality of libc with an in-house implementation. For example, 'getaddrinfo' may need a different implementation in an organization, and the organization may not want to carry that change in the libc code itself (due to licensing restrictions) or to expose its internal implementation publicly, in order to protect its intellectual property. Can't they simply write and call their own function? That is the crux of the matter: they may be using a third-party application and want to modify certain functionality for their own use. So instead of modifying the glibc code, they replace the function with their own implementation. Aha, the irony :-). Anyway! The usual method is to implement the function with a prototype matching the libc one, build it into a dynamic shared library, and then set the LD_PRELOAD environment variable to point at this shared object. This makes the program loader preload the created .so before libc gets loaded. The loader resolves the symbols used by the executable based on the order in which the libraries were loaded, so in the above case 'getaddrinfo' will be bound to the created library rather than to libc. Even though I knew all this, I had never attempted to write such a program. Finally I was tempted to, and here it is. This guy overrides the malloc implementation.

The library for malloc -- Buggy!!!

my_lib.c

#include <stdio.h>
#include <stddef.h>     /* for size_t */

/* A bogus address that does not belong to the process address space. */
char *virus_data_segment = (char*)(0x1234567890);

void* malloc(size_t size)
{
        printf("Get lost! I do not have even a penny to give ;-)\n");
        return virus_data_segment;
}


Create the shared library: gcc -shared -o libmy_lib.so -fPIC my_lib.c

Now the actual fun starts :D

Set (and export) LD_PRELOAD so that it affects the commands executed afterwards:

>>export LD_PRELOAD=./libmy_lib.so

execute ls command

>> ls
Get lost! I do not have even a penny to give ;-)
Segmentation fault


LOL! malloc has been overridden and 'ls' cannot run (evidently 'ls' uses malloc). Why the segmentation fault? Because our malloc returned an address that does not belong to the process address space. Here is the output when we return NULL instead.

Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
Get lost! I do not have even a penny to give ;-)
ls: memory exhausted


Strange, is it not? Either it is retrying, or multiple places are trying to allocate.

Yes, LD_PRELOAD is well known :-), but I scribbled it down here in case anyone can provide more insight. I will try to gather some more information if possible and post it.
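For anyone who wants to interpose malloc without breaking the program, the usual trick is to look up the real malloc with dlsym(RTLD_NEXT, ...) and forward to it. Here is a minimal sketch (untested, assuming glibc; the file and symbol names are mine, and printf is avoided inside malloc because it may itself allocate and recurse):

/* wrap_malloc.c -- hypothetical interposer that forwards to the real malloc */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <unistd.h>

static void* (*real_malloc)(size_t) = NULL;

void* malloc(size_t size)
{
    if (!real_malloc) {
        /* Find the next 'malloc' in the library search order, i.e. libc's. */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    }

    /* write() instead of printf() to avoid recursing back into malloc(). */
    write(STDOUT_FILENO, "malloc intercepted\n", 19);

    return real_malloc(size);
}

Build it with: gcc -shared -o libwrap_malloc.so -fPIC wrap_malloc.c -ldl, then export LD_PRELOAD as above.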