Typically with multi-core processors, the first thought is to use multiple threads in a shared-memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written so that there are no race conditions between threads on shared data. Thread efficient is a term I use for the efficiency of the scheme used to avoid race conditions, as well as the code’s ability to use the processors, their caches, and the memory bandwidth well enough to keep the cores busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases or code bases that have been developed over a long period of time. In such code a top-level understanding of the call chains and architecture is often incomplete. Simple locking can make the code thread safe, but it often leads to long-duration locks that leave the code thread safe but not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short-term needs of the software’s users.
To scale serial programs through parallelization in the most thread efficient way typically involves a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core is usually defined as systems with 4, 8, or maybe 16 CPUs. Many-core is defined as systems having at least 16 processors, and usually 64 or 128. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind, and I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However, on multi-core systems it is possible to get a decent speedup with smaller changes to the code by using clever software engineering approaches.
In this posting I want to discuss one such approach: the use of the Copy on Write (COW) mechanism of fork() for Unix applications. In the initial implementations of fork() on older systems, fork() would copy the entire virtual address space of the parent process and start the copy as a separate process. This worked well for the use model where the entire program went forward in both processes, but it did not work well in the most common case, where the parent program executes fork() and then exec()s another program. For example, let’s assume a parent program needs a directory listing. To get one it could fork() and then exec() the “ls” command. However, from a RAM and virtual memory standpoint this was very wasteful. Imagine that the parent was a large program using significant memory, say 64 GB of RAM. When fork() was executed, the memory usage would double, since there would now be two copies of the 64 GB space. Only then would exec() run the “ls” command and retrieve the listing. The peak memory usage would briefly double to 128 GB just to get a directory listing. Even though this peak might be short lived, it would increase the virtual memory requirements of the system and could potentially cause a lot of thrashing during the execution of the small program started by exec(). This was inefficient and unnecessary.
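As a concrete illustration, here is a minimal C sketch of that fork()-then-exec() pattern; the “ls -l” arguments are just an example and error handling is kept to a bare minimum:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        /* Child: replace this process image with "ls".
           The parent's (possibly huge) address space is not needed here. */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");        /* reached only if exec fails */
        _exit(EXIT_FAILURE);
    }
    /* Parent: wait for the listing to finish. */
    waitpid(pid, NULL, 0);
    return 0;
}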
Over time fork() was enhanced with a clever optimization to get around this problem. Instead of copying the entire memory image, the child copies just the page tables of the parent process and points at all of the same physical pages. The pages are marked with a special Copy on Write bit. While this COW bit is set, reads by either the child or the parent simply go through to the shared page. However, if either process writes to a page, a page fault occurs, the page is duplicated, and the writing process gets its own private copy of the page. In this way the common case of fork() followed by exec() of a small program uses no additional memory beyond what is needed to hold the copied page tables, which is a small percentage of the total virtual memory in use.
[Figure: initial memory use right after fork(), with the parent and child sharing every page for read access.]
[Figure: memory after the parent or child writes to a location in a page; the parent and child now each have their own private copy of that page.]
Let’s take an example of some thread-unsafe code with a global static variable which needs to have a different value in each process. With copy on write fork(), the two processes initially share the same value and memory location of the global. When either process changes the global, a page fault occurs, a copy of the page containing the global is created, the page table is updated, and the processes continue with independent copies of the page. This mechanism is guaranteed by the OS to be free of race conditions and is very fast, since it leverages the OS and hardware support for paging.
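Here is a minimal C sketch of that behavior, using a made-up global named global_id; after the fork() each process writes its own value and, thanks to copy on write, sees only its own copy:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global that would be unsafe to share between threads, but is
   safe across fork(): the first write by either process triggers a
   page fault and that process gets a private copy of the page. */
static int global_id = 0;

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        global_id = 2;   /* child's write faults; child gets its own page */
        printf("child:  global_id = %d\n", global_id);
        _exit(EXIT_SUCCESS);
    }
    global_id = 1;       /* parent's write is equally private */
    printf("parent: global_id = %d\n", global_id);
    waitpid(pid, NULL, 0);
    return 0;
}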
The downside to this approach is that if the parent or one of the child processes changes memory significantly, then this approach will end up duplicating the memory. However, if carefully utilized, there are many applications that look at a lot of data and produce a little data which can leverage this technique and gain nearly all of the benefits of threading the code without the work of making the code thread safe. One has to be careful to build all or most of the data structures before the fork() occurs so that the memory can be shared.
When the children are done processing their work, they have useful output that eventually needs to get back to the parent. This can be done by sending the return data over a pipe, using a shared memory segment, or simply writing the data to a file for the parent to read.
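The sketch below pulls these pieces together under some simplifying assumptions: the large data set is built before the fork(), the work splits cleanly into independent slices, each worker’s answer is a single double returned over a pipe, and process_slice() is only a stand-in for the real analysis:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

/* Stand-in for the real analysis: reads the shared (copy on write)
   data for one slice and produces a small result. */
static double process_slice(const double *data, size_t n, int slice) {
    double sum = 0.0;
    for (size_t i = (size_t)slice; i < n; i += NWORKERS)
        sum += data[i];
    return sum;
}

int main(void) {
    /* Build the large, read-mostly data set BEFORE forking so that
       every worker shares it through copy on write. */
    size_t n = 10 * 1000 * 1000;
    double *data = malloc(n * sizeof *data);
    if (!data) { perror("malloc"); return 1; }
    for (size_t i = 0; i < n; i++)
        data[i] = (double)i;

    int fds[NWORKERS][2];
    for (int w = 0; w < NWORKERS; w++) {
        if (pipe(fds[w]) < 0) { perror("pipe"); return 1; }
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }
        if (pid == 0) {
            /* Child: compute on the shared data and send the small
               result back to the parent over the pipe. */
            close(fds[w][0]);
            double result = process_slice(data, n, w);
            if (write(fds[w][1], &result, sizeof result) != (ssize_t)sizeof result)
                _exit(1);
            _exit(0);
        }
        close(fds[w][1]);   /* parent keeps only the read end */
    }

    /* Parent: collect one small result per worker, then reap the children. */
    double total = 0.0;
    for (int w = 0; w < NWORKERS; w++) {
        double result = 0.0;
        if (read(fds[w][0], &result, sizeof result) == (ssize_t)sizeof result)
            total += result;
        close(fds[w][0]);
    }
    while (wait(NULL) > 0)
        ;
    printf("total = %f\n", total);
    free(data);
    return 0;
}

A pipe per worker keeps the return path simple; for larger results a shared memory segment or a temporary file per child, as mentioned above, works just as well.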
Copy on write forks can be a simple yet effective way for software engineers to leverage multiple processors on the same machine without having to redesign the entire program to be multi-threaded. I have used this technique on multiple programs and have found it to work well. The overhead of the page table copy is small, and for highly parallel problems the time taken by the page faults has been too small to measure. I have been careful to apply this technique mostly to analysis applications where many gigabytes of data are read and only a few megabytes of report data or results are produced.