Every software company with a complex program that needs to run faster wants to add parallel processing to its code. Processor clock rates are no longer increasing much; instead, chip makers are adding more cores. The problem of speeding up software is shifting from a hardware improvement problem to a software parallelization problem. My past blog post Why Parallel Processing? Why now? What about my legacy code? is an introduction to why parallel programming will become part of every major computational program in the future.
One of the major issues confronting the software industry is that writing parallel programs requires a new set of skills on top of those needed to write well-architected, fast, and efficient serial programs. The issue confronts the hardware industry as well: if there are no trained software developers capable of programming the increasingly multi-core hardware being produced, there will be no demand for that hardware and the computing industry will stagnate. Unfortunately, there are few places a software developer can go to get trained in parallel computing, and the training that is available often centers on software packages that assist in writing parallel programs rather than on the concepts of parallel computing themselves.
I started writing a few blogs on the Intel Software Network after meeting some of the people involved at a couple of conferences. I was impressed with their understanding of the need to train software developers to be parallel software developers, and with the commitment and funding Intel has put into educating the developer community through its websites, the creation of tools, and work with universities to fund and design courses.
I was asked to review two educational papers that are part of a larger forthcoming series titled “The Intel Guide for Developing Multithreaded Applications”. The two sections I reviewed focus on two of the essential skills needed to write parallel programs. The first, "Curing Thread Imbalance Using Intel® Parallel Amplifier", covers how to tune the work assigned to threads so that the overall program scales well with the number of processors, by ensuring the work is divided evenly among the worker threads. The second, "Detecting Memory Bandwidth Saturation in Threaded Applications", addresses the problem of detecting when the bottleneck in a program is the bandwidth between main memory and the processor caches becoming saturated, or maxed out.
I will give a summary and my thoughts on the papers, but first I wanted to put them in context. Training, guides, cookbooks, and tools for parallel programmers are essential to progress in the computing industry, on both the hardware and software sides of the problem.
The paper “Curing Thread Imbalance Using Intel Parallel Amplifier” is an introduction to a tool from Intel that can help find bottlenecks in parallel programs and show the developer where to make changes to improve load balancing, that is, how work is assigned to the available threads so that the overall program scales well with the number of processors. The paper starts with a parallel example program that does not scale well, shows how to use the tool to locate where the run time is being spent and where the hot spots are, and then shows how to fix the program to improve the overall wall-clock time of execution.
I have been writing programs since the 1970s, starting on a Radio Shack TRS-80 in Z80 assembly code, and today I program in C++ on Linux machines. One thing that has remained constant across the various programming languages and platforms is that the bottleneck in a program is rarely where you think it is. This means that at some point every programmer of computationally expensive code will need to profile it, once they have exhausted the changes they thought would speed up execution. For serial programs there are tools like gprof, Intel’s VTune, Sun’s Workshop, and others that have allowed programmers to profile and improve code. The tools automatically detect where the time is going, which gives the programmer a strong clue about where to look when designing improvements. I have found over my years of developing software that this skill of profiling and tuning, even in serial programs, is an important one.
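To make that concrete, here is a minimal, made-up example of the serial profiling workflow with gprof on a typical Linux toolchain. The file name hotspot.cpp and the functions slow_sum and fast_sum are my own inventions, not from any of the papers; the point is simply that the profiler, not intuition, tells you where the time goes.

```cpp
// hotspot.cpp -- a tiny, made-up example with an obvious hot spot for
// gprof to find.
#include <cmath>
#include <cstdio>

// Deliberately expensive function; the profiler should attribute most
// of the run time here.
double slow_sum(long n) {
    double s = 0.0;
    for (long i = 1; i <= n; ++i)
        s += std::sqrt(static_cast<double>(i)) * std::sin(i * 0.001);
    return s;
}

// Cheap function included so the flat profile has something to compare.
double fast_sum(long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i)
        s += i;
    return s;
}

int main() {
    std::printf("%f %f\n", slow_sum(50000000L), fast_sum(50000000L));
    return 0;
}
```

Building with g++ -O2 -pg hotspot.cpp -o hotspot, running the binary to produce gmon.out, and then running gprof hotspot gmon.out prints a flat profile that, in my experience, usually points somewhere other than where you expected.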
With parallel programs, finding the bottleneck can often be even less intuitive than with serial programs, and on top of the issues that occur in serial code there are new ones. Two of the important issues are covered by the papers I am reviewing. The first is load balancing and the need to schedule dynamically which tasks are assigned to which threads at a given time. Since the run time of a given piece of work is rarely known up front, it is essential to assign the work dynamically. The paper discusses this concept, walks the reader through modifying the example code from a static allocation to a dynamic allocation of work to threads, and demonstrates the improvement.
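The paper's own example code is not reproduced here, but the idea is easy to sketch. The following toy program is my own, using OpenMP since the paper discusses it: it runs the same unbalanced loop twice, once with a static schedule and once with a dynamic one. On a multi-core machine the dynamic version should finish noticeably faster, because threads keep grabbing new chunks of iterations instead of being stuck with one oversized block.

```cpp
// schedule_demo.cpp -- a minimal sketch (not the paper's example) of how
// dynamic scheduling helps when loop iterations have very uneven cost.
// Build with OpenMP support, e.g.: g++ -O2 -fopenmp schedule_demo.cpp
#include <cmath>
#include <cstdio>
#include <omp.h>

// Work whose cost grows with i, so an equal split of the iteration range
// gives the last thread far more work than the first.
double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i; ++k)
        s += std::sqrt(k + 1.0);
    return s;
}

int main() {
    const int n = 20000;
    double total = 0.0;

    double t0 = omp_get_wtime();
    // Static schedule: iterations are split into equal-sized contiguous
    // blocks up front, regardless of how long each iteration takes.
    #pragma omp parallel for schedule(static) reduction(+ : total)
    for (int i = 0; i < n; ++i)
        total += work(i);
    std::printf("static : %.3f s\n", omp_get_wtime() - t0);

    total = 0.0;
    t0 = omp_get_wtime();
    // Dynamic schedule: threads grab chunks of 64 iterations as they
    // finish, so the load stays balanced even though costs vary.
    #pragma omp parallel for schedule(dynamic, 64) reduction(+ : total)
    for (int i = 0; i < n; ++i)
        total += work(i);
    std::printf("dynamic: %.3f s\n", omp_get_wtime() - t0);

    return 0;
}
```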
What I like about the paper is that it walks the reader through the process of how to improve a parallel program. A specific tool is used, but the paper and its approach are not dependent on that tool. It gives appropriate background and shows the reader a basic skill that is essential and transportable. It shows off the Intel tool, but at the same time educates the reader on first principles.
The paper then moves on to discuss the trade-offs between explicit thread creation, as win32 threads and pthreads provide, and implicit thread creation, as exists in the TBB and OpenMP models. It does not try to sell the reader on any particular approach but discusses each along with its trade-offs. I am happy to see the paper properly promoting the tools from Intel while educating the reader on the bigger issues and trade-offs. It also provides useful links and pointers to issues that many may be unfamiliar with, like false sharing, which is perhaps one of the least intuitive performance issues in threaded applications since most programmers' mental model is that all memory is available at equal cost.
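Since false sharing is so unintuitive, a small illustration helps. The sketch below is my own, not from the paper; it assumes 64-byte cache lines and uses OpenMP. Each thread increments only its own counter, so logically nothing is shared, yet the packed layout is typically several times slower than the padded one because the counters sit on the same cache line and the line bounces between cores.

```cpp
// false_sharing.cpp -- illustrative only. Per-thread counters packed into
// one cache line slow each other down even though no data is logically
// shared. Assumes 64-byte cache lines. Build: g++ -O2 -fopenmp false_sharing.cpp
#include <cstdio>
#include <omp.h>

constexpr int  kThreads = 4;
constexpr long kIters   = 100000000L;

// Counters packed next to each other: they share a cache line, so every
// increment by one thread invalidates that line in the other cores'
// caches. volatile keeps the compiler from collapsing the loops.
volatile long packed[kThreads];

// Each counter padded out to its own 64-byte cache line.
struct alignas(64) PaddedCounter { volatile long value; };
PaddedCounter padded[kThreads];

int main() {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(kThreads)
    {
        const int id = omp_get_thread_num();
        for (long i = 0; i < kIters; ++i) packed[id]++;
    }
    std::printf("packed (false sharing): %.3f s\n", omp_get_wtime() - t0);

    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(kThreads)
    {
        const int id = omp_get_thread_num();
        for (long i = 0; i < kIters; ++i) padded[id].value++;
    }
    std::printf("padded (no sharing)   : %.3f s\n", omp_get_wtime() - t0);
    return 0;
}
```

Padding each counter onto its own cache line, as the alignas(64) struct does here, is the standard cure.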
A second critical problem, which exists in both parallel and serial programs but is exacerbated by multi-threaded programs on multi-core machines, is memory bandwidth saturation. The paper entitled “Detecting Memory Bandwidth Saturation in Threaded Applications” introduces the issue so that developers can be aware of it. The paper then discusses how to detect saturation using Intel’s VTune and Performance Tuning Utility. It also discusses event-based sampling, which is the only kind of profiling I have ever found to be effective on all but the most trivial programs. I still find it surprising that some programmers don’t know how to use it effectively, and I am glad to see it mentioned.
I am happy to see the issue of memory bandwidth saturation discussed in the paper. However, the paper jumped around a bit between a high level of abstraction and the details, and it didn’t follow the way I think about the problem. First of all, memory bandwidth saturation exists in both parallel and serial programs. Profilers like VTune or the free Oprofile can report clocks per instruction (CPI). When the CPI number is high, many clock cycles pass for each CPU instruction, which is a sign that memory access is a bottleneck. This can occur in serial programs with very poor data layout as well as in parallel programs with many threads accessing data. I think this is a critical concept that many programmers have never been exposed to.
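A simple way to see the CPI effect for yourself, independent of any particular profiler, is to compare a cache-friendly and a cache-hostile traversal of the same data. The sketch below is my own illustration, and the sizes and stride are arbitrary; sampling cycles and retired instructions over each loop with VTune or Oprofile should show a much higher CPI for the strided version, because the processor spends most of its time stalled waiting on memory.

```cpp
// cpi_demo.cpp -- illustrative sketch: roughly the same amount of
// arithmetic, but the strided traversal misses cache on almost every
// access, so a profiler reporting clocks per instruction (CPI) will show
// the second loop is memory bound.
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 25;          // 32M ints, far larger than cache
    const std::size_t stride = 4096 / sizeof(int);       // jump roughly a page per access
    std::vector<int> data(n, 1);

    long long sum = 0;

    // Cache-friendly: consecutive addresses, the hardware prefetcher
    // keeps up, CPI stays low.
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];

    // Cache-hostile: each access lands on a different cache line and
    // page, so most instructions stall on memory and CPI climbs.
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < n; i += stride)
            sum += data[i];

    std::printf("%lld\n", sum);
    return 0;
}
```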
I also feel there should have been more discussion of NUMA: how non-uniform memory access affects programs, how numactl works, how the first-touch policy works, and how memory is assigned to a processor in general. A lot of programmers expect the operating system to handle much more than it is capable of, and an understanding of these ideas sets the context for what the OS can do and what the programmer will have to deal with.
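For readers unfamiliar with first touch, the idea is that under the default Linux policy a physical page is placed on the NUMA node of the thread that first writes to it. The sketch below is my own illustration of the common idiom: initialize the data in parallel with the same loop schedule the compute loop will use, so each thread's pages end up on its local node. A serial initialization loop, or a container that zeroes its storage in its constructor, would first-touch every page from one thread and leave the other threads doing remote accesses.

```cpp
// first_touch.cpp -- sketch of the first-touch idea under Linux's default
// NUMA policy: a physical page lands on the node of the thread that first
// writes it. Build: g++ -O2 -fopenmp first_touch.cpp
#include <cstddef>
#include <cstdio>
#include <omp.h>

int main() {
    const std::size_t n = std::size_t(1) << 27;

    // Plain new[] (not std::vector) so nothing zeroes the memory on the
    // master thread -- that would already first-touch every page on one
    // node and defeat the point.
    double* a = new double[n];
    double* b = new double[n];

    // Parallel first touch: each thread writes the pages it will later
    // use, so with a static schedule those pages land on its local node.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 0.0;
        b[i] = 2.0 * i;
    }

    // Same static schedule, so each thread reads mostly node-local memory.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 0.5 * b[i];
        sum += a[i];
    }

    std::printf("%f\n", sum);
    delete[] a;
    delete[] b;
    return 0;
}
```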
The paper is, however, a good introduction to the memory bandwidth issue, and it introduces an important, often overlooked, concept to programmers who in many cases have been able to ignore the memory hierarchy in computer systems. I recommend reading it and doing further research on the areas I outlined.