Thursday, September 2, 2010

Yaesu VX8-G inexpensive data cable

http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=230359281914&ssPageName=STRK:\MEWNX:IT

plus

http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=380250785533&ssPageName=STRK:\MEWNX:IT

for less than $10.

These two cables connected together allow me to connect my USB port to the VX8-G.

The DB9-to-2.5mm cable was not made for the VX8, so I had to use an X-Acto knife to cut away some extra plastic molding so that the 2.5mm connector's tip would seat fully past the radio's case.

Now I am waiting for the programming software to be ready. I tried the VX8-DR software; it connects to the VX8-G but then tells me I have an unsupported radio. The serial connection does work, though: I could see the radio in clone mode send its radio ID over the connection while I was detecting the port number.

Friday, August 6, 2010

Yaesu VX8-GR Handheld VHF Transceiver with APRS

Just some background on me before I get into my recent purchase from the "Candy Store," which is what Amateur "Ham" Radio operators call the Ham Radio Outlet. I passed the General class FCC exam in middle school in 1979 while a member of the Drexel Hill Junior High Amateur Radio Club. I operated only HF then, for DX (long-distance communications). I have good memories of Sam Stern WB3KTP and David Tatum WB3KTQ, who were the leaders and instructors from whom I learned to enjoy Amateur Radio and the Morse code. I moved to California in 1988 and recently started operating again in 2009, after a long break. Things have changed a lot, and they haven't. I am currently involved with ARES (Amateur Radio Emergency Services) in Sunnyvale, CA. Currently I am operating only on 2m/70cm. I haven't done HF since middle school. I am interested in trying to work VHF satellites but haven't spent much time on it yet.

I have been operating again for 6 months now and have learned that for serious emergency work I needed a dual-band 2m/70cm handheld transceiver, and that it was best to have one with dual receive, which means I can listen to 2 channels or frequencies at the same time. I have 2 other Yaesu radios and decided to stick with Yaesu and buy one new so I could enjoy the latest technology. I have also been interested in being able to beacon my location via APRS, since that involves digital technology on the air, which I have not tried yet. In 1979 this didn't exist as far as I know. This radio has a built-in GPS receiver and an AX.25-based packet modem. The radio can beacon its current GPS location on one channel while the user talks on another. http://www.aprs.fi is a map showing the locations that other hams beacon.


This radio works very well. The audio is good, according to the people I have had QSOs (conversations) with, and the setup was very easy. There are context-sensitive menus and a lot of options, but it is pretty well thought out. I was beaconing after about 30 minutes and had my favorite repeaters programmed in after about 15 more.

This radio was the perfect trade-off for me compared to its more expensive sibling, the VX-8DR, since it is just dual band rather than quad band with bands I don't need, and the GPS is built in instead of being an add-on option.



EDA Workshop Debate Erupts Over Parallel Programming

The Electronic Design Processes (EDP) workshop may be a small, technical, IEEE-sponsored event, but that didn't stop a lively debate over the feasibility of parallel processing from erupting last week. The debate comes as EDA vendors, including Cadence, are working hard to port their software to multicore processor platforms. Presentations started with a somewhat skeptical presentation by Patrick Groeneveld of Magma Design Automation, and then moved on to a really, really skeptical presentation by Patrick Madden of the State University of New York (SUNY) Binghamton. These were followed by what I'd call a balanced, hands-on presentation from Tom Spyrou of Cadence. (He has been working to parallelize the Encounter Digital Implementation system, and I interviewed him last year about the challenges that entails).
If there was a silent presence in the room, it was Gene Amdahl, whose "Amdahl's Law" reminds us that the gains you get from parallelizing code are sharply limited by any sections that cannot be parallelized. As Patrick Groeneveld pointed out, if you can parallelize 50 percent of the code, you can only get a maximum 2X theoretical speedup. The reality is probably closer to 1.8X. Since all multi-tool EDA flows include some non-parallelizable code, no one should expect a 7X speedup on 8 processors.
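For reference, Amdahl's Law can be written (in its usual form, not as stated in any of the talks) as

    speedup(P, N) = 1 / ((1 - P) + P / N)

where P is the fraction of the runtime that can be parallelized and N is the number of processors. With P = 0.5 the speedup approaches 1 / 0.5 = 2 as N grows, which is the 2X ceiling above; on 8 processors the formula gives 1 / (0.5 + 0.5/8), or about 1.8X, before any parallelization overhead is even counted.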
Tom Spyrou (right) answers a question from Patrick Groeneveld (orange shirt) at the EDP Workshop April 8.

Patrick Groeneveld: The "Last Resort"
"Parallelism is the last thing I look at if I have a problem in my flow," said Patrick, who serves as Magma's chief technologist. "I will try everything else first because I know parallelism has some problems." Sometimes, he noted, parallelizing code can actually cause a slowdown because of the "hidden costs and extra overhead" of making code parallel.
Patrick identified six problems he's experienced in trying to parallelize EDA software:
  • Contention arbitration. Two processors are running and both want to use a shared resource. They write to memory at the same time, and - crash! This is typically fixed by putting locks in the code, which leads to the next problem.
  • Easy to lock up performance. Locking every read "trashes your multi-threaded performance." (A small illustration of this effect follows the list.)
  • Finding intelligent partitions. You want to chop up the chip and send each to a different processor, which works for some problems, but "partitioning is evil" for synthesis and optimization.
  • Repeatability is the silent killer. If you want repeatability (getting the same result from the same operation), you will lose processing efficiency in parallelism.
  • Load distribution. It is very hard to estimate workloads in advance, but it's necessary to schedule CPU resources.
  • The sheer complication of things. Not many programmers are capable of writing parallel code. Bottom line: "parallelism works but gains are limited. Don't be disappointed."
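As a generic illustration of the locking point (my own sketch, not code from any of the talks): in the C++ program below, four threads update a shared sum. Taking the mutex on every update serializes the threads on the lock, while accumulating thread-local partial sums and taking the lock once per thread does the same work with almost no contention.

    #include <chrono>
    #include <mutex>
    #include <thread>
    #include <vector>
    #include <cstdio>

    static long shared_sum = 0;
    static std::mutex m;

    // Run 'body' on 'threads' threads and report the elapsed wall time.
    template <typename F>
    static void run(const char* label, F body, int threads) {
        shared_sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t) pool.emplace_back(body);
        for (auto& th : pool) th.join();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        printf("%s: sum=%ld, %.3f s\n", label, shared_sum, dt.count());
    }

    int main() {
        const int threads = 4;
        const long iters = 2000000;

        // Lock taken on every update: threads mostly wait on each other.
        run("lock per update", [&] {
            for (long i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> g(m);
                shared_sum += i;
            }
        }, threads);

        // Thread-local accumulation, one lock per thread: scales far better.
        run("lock per thread", [&] {
            long local = 0;
            for (long i = 0; i < iters; ++i) local += i;
            std::lock_guard<std::mutex> g(m);
            shared_sum += local;
        }, threads);
        return 0;
    }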
Patrick Madden: Advocates are "Delusional"
Anyone who was a little depressed from Patrick Groeneveld's presentation was not helped by Patrick Madden, whose first words were that Groeneveld was "way too optimistic" about parallel programming. I was not surprised. In a panel (with the real Gene Amdahl) that I wrote about a couple of years ago, Madden said that multicore programming is "the end of the world as far as I'm concerned."
Madden, however, is working on a problem that should be important to parallel programmers - speeding up the "hard serial bottlenecks" that cannot be parallelized, and thus give rise to the limitations of Amdahl's Law. And where do we find such bottlenecks? In papers and books by parallel programming experts, he suggested. He told of one Design Automation Conference (DAC) paper that presented a 100-fold speedup with a parallelized algorithm. It later turned out that a time-worn serial algorithm actually did a faster job of solving the same problem, with no special hardware required.
"I would think that when I pick up a technical paper from a major university, things should be correct," Madden said. "Instead there are monumental flaws presenting a new technology that's orders of magnitude slower than the existing solution." With a slide including a photo of Bernie Madoff, he suggested that some of the published parallel processing research is fraudulent.
Audience members reacted, saying that Madden was "too skeptical," focusing on a small amount of bad research, and not giving new programming methods a chance to catch up. "If I do video, a lot of stuff is embarrassingly parallel and that's how it's done," one said. Madden agreed, but went on to suggest that some parallel programming advocates are "delusional" and are seeing what they wish to see after taking grant money to do parallel computing research.
Tom Spyrou: Yes, We Can Parallelize Legacy Code
It may have been a hard act to follow, but Tom Spyrou spoke next, in a calm presentation that showed how legacy EDA code can be parallelized within the context of Amdahl's Law. Spyrou focused on the "fine grained" problems that are most difficult to parallelize.
"For multicore, with 4 to 8 cores, there are shortcuts you can take to get legacy code to run in parallel," he said. "When it comes to manycore, 128 cores and all that, I think it will have to be written from the ground up. I think a lot of fine-grained problems aren't going to scale, just like [Madden] said."
Still, some clever things can be done today. One is to create a "parallel reducer server" for parasitic reduction. The idea is to take a net and reduce it to a math transfer function, using parallel calls to the server. There are no threads or thread safety issues, and the approach provides data locality, which will be important for manycore.
Another example is parallel noise analysis. It is hard to make legacy code thread-safe, but you can use a Copy on Write fork() construct as a short-term solution. It shares only "read" memory with its parent, but if memory is modified, it is automatically duplicated in the child.
While some easy-to-parallelize applications are more scalable, Spyrou said that the overall Encounter netlist-to-GDSII flow is now running about 2X faster on 4 cores, thanks to the work that's been done to parallelize legacy code.
My Perspective
I think there's some truth in each of these presentations. Legacy serial code is really hard to parallelize, and not everything should be parallelized. University research may well focus too much on parallel programming and not enough on serial bottlenecks. But let's not throw up our hands and say it can't be done. Let's get to work and do what we can to speed up EDA software for multicore platforms, just as Spyrou has been doing since 2006 at Cadence.

Richard Goering

Review of Two Performance-Related Chapters of “The Intel Guide for Developing Multithreaded Applications”

Adding parallel processing to code is a desire of every software company that has a program which is significant in complexity and which needs to run faster. Processor clock rates are not increasing much, and multiple cores are being added to chips instead. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. My past blog post "Why Parallel Processing? Why now? What about my legacy code?" contains an introduction to why parallel programming will become part of every major computational program in the future.

One of the major issues confronting the software industry is that writing parallel programs requires a new set of skills on top of those needed to write well-architected, fast, and efficient serial programs. This issue also confronts the hardware industry: if there are no trained software developers capable of programming the increasingly multi-core hardware being produced, then there will be no demand for that hardware and the computing industry will stagnate. Unfortunately there are few places a software developer can go to get trained in parallel computing, and the training that is available is often built around software packages that assist in writing parallel programs rather than around the concepts of parallel computing themselves.

I started writing a few blogs on the Intel Software Network after having met a few of the people involved at a couple of conferences. I was impressed with their understanding of the need to train software developers to become parallel software developers, and with the commitment and funding Intel has placed on educating the developer community through its websites, the creation of tools, and work with universities to fund and design courses.

I was asked to review a couple of educational papers which are part of a bigger, forthcoming series titled “The Intel Guide for Developing Multithreaded Applications”. The two sections I reviewed focus on two of the essential skills needed to write parallel programs. The first, "Curing Thread Imbalance Using Intel® Parallel Amplifier", is about tuning the work assigned to threads so that the overall program scales well with the number of processors by ensuring that the work is divided evenly among the worker threads. The second, "Detecting Memory Bandwidth Saturation in Threaded Applications", is about detecting when the bottleneck in a program is the bandwidth from main memory to the processor caches being saturated, or maxed out.

I will give a summary and my thoughts on the papers, but first I wanted to put them in context. Training, guides, cookbooks, and tools for parallel programmers are essential to progress in the computing industry, on both the hardware and software sides of the problem.

The paper “Curing Thread Imbalance Using Intel Parallel Amplifier” is an introduction to a tool from Intel that can help find bottlenecks in parallel programs and show the developer where to make changes to improve the load balancing, i.e. how the work is assigned to the available threads, so that the overall program scales well with the number of processors. The paper starts with a parallel example program that does not scale well, uses the tool to locate where the runtime is being spent and where the hot spots are, and then shows how to fix the program to improve the overall wall-clock time of execution.

I have been writing programs since the 1970s, starting on a Radio Shack TRS-80 in Z80 assembly code, up until today, where I program in C++ on Linux machines. One thing that has remained constant across the various programming languages and platforms is that the bottleneck in a program is rarely where you think it is. This means that at some point every programmer of computationally expensive code will need to profile the code, once they have exhausted the things they thought would speed up the execution. For serial programs there are tools like gprof, Intel's VTune, Sun's Workshop, and others which have allowed programmers to profile and improve code. The tools automatically detect where the time is going, and the programmer then has a strong clue on where to look to begin designing improvements. I have found over my years of developing software that this skill of profiling and tuning, even in serial programs, is an important one.

With parallel programs, finding the bottleneck can often be even less intuitive than with serial programs. On top of the issues that occur in serial programs, there are additional ones, and two of the important ones are covered by the papers I am reviewing. The first is load balancing and the need for dynamic scheduling of which tasks are assigned to which threads at a given time. Since the runtime of a given computation is rarely known up front, it is essential to assign the work dynamically. The paper discusses this concept, walks the reader through modifying the example code from a static allocation to a dynamic allocation of work to threads, and demonstrates the improvement.
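As a generic illustration of the static-versus-dynamic point (this is not the paper's example, just a minimal OpenMP sketch in C++), the loop below has iterations whose cost grows with the index, so a static split leaves the last threads with most of the work while a dynamic schedule keeps all threads busy:

    #include <omp.h>
    #include <cmath>
    #include <cstdio>

    // Simulated task whose cost grows with its index, so an even static split
    // of the iteration space produces very uneven work per thread.
    static double work(int i) {
        double s = 0.0;
        for (int k = 0; k < i * 50; ++k) s += std::sin(k);
        return s;
    }

    int main() {
        const int n = 2000;
        double total = 0.0;

        double t0 = omp_get_wtime();
        // Static: iterations are divided into equal-sized chunks up front.
        #pragma omp parallel for schedule(static) reduction(+:total)
        for (int i = 0; i < n; ++i) total += work(i);
        printf("static:  %.3f s\n", omp_get_wtime() - t0);

        t0 = omp_get_wtime();
        // Dynamic: threads grab small chunks as they finish, balancing the load.
        #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
        for (int i = 0; i < n; ++i) total += work(i);
        printf("dynamic: %.3f s  (total=%g)\n", omp_get_wtime() - t0, total);
        return 0;
    }

Built with an OpenMP-enabled compiler (for example g++ -fopenmp), the dynamic version typically finishes the loop noticeably faster on 4 cores; this is exactly the kind of imbalance a thread profiler is meant to expose.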

What I like about the paper is that it walks the reader through the process of improving a parallel program. The specific tool is used, but the paper and the approach are not dependent on that tool. It gives appropriate background and shows the reader a basic skill that is essential and transportable. It shows off the Intel tool but at the same time educates the reader on first principles.

The paper then moves on to discuss the trade-offs between explicit thread creation, as win32 threads and pthreads provide, and implicit thread creation, as exists in the TBB and OpenMP models. It does not try to sell the reader on any particular approach but discusses them along with their trade-offs. I am happy to see the paper properly promoting the tools from Intel while also educating the reader on the bigger issues and trade-offs. It also provides useful links and pointers for issues that many may be unfamiliar with, like false sharing, which is perhaps one of the least intuitive performance issues in threaded applications since most programmers' mental view is that all memory is available at equal cost.
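False sharing is easy to demonstrate. In the sketch below (my own example, not taken from the guide), two threads increment two different counters; when the counters sit in the same 64-byte cache line the threads slow each other down dramatically, and padding them onto separate lines removes the effect:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Two counters that share a cache line: each thread's writes invalidate the
    // other thread's cached copy even though no variable is actually shared.
    struct Packed { std::atomic<long> a{0}; std::atomic<long> b{0}; };

    // alignas(64) puts each counter on its own cache line.
    struct Padded { alignas(64) std::atomic<long> a{0}; alignas(64) std::atomic<long> b{0}; };

    template <typename T>
    static double bump(T& c) {
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1([&] { for (long i = 0; i < 50000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (long i = 0; i < 50000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        Packed packed;
        Padded padded;
        printf("same cache line:      %.3f s\n", bump(packed));
        printf("separate cache lines: %.3f s\n", bump(padded));
        return 0;
    }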

A second critical problem, which exists in both parallel and serial programs but is exacerbated by multi-threaded programs on multi-core machines, is memory bandwidth saturation. The paper entitled “Detecting Memory Bandwidth Saturation in Threaded Applications” introduces the issue so that developers can be aware of it. The paper then discusses how to detect saturation using Intel's VTune and Performance Tuning Utility. It also discusses event-based sampling, which is the only kind of profiling I have ever found to be effective on all but the most trivial programs. I still find it surprising that some programmers don't know how to use it effectively, and I am glad to see it mentioned.

I am happy to see the issue of memory bandwidth saturation discussed in the paper. However, the paper jumped around a bit from a high level of abstraction to details and didn't follow the way I think about the problem. First of all, memory bandwidth saturation exists in both parallel and serial programs. Profilers like VTune or the free Oprofile can report "clocks per instruction" (CPI). When the CPI number is high, many clock cycles pass for each CPU instruction, which is a sign that memory access is a bottleneck. This can occur in serial programs with very poor data layout as well as in parallel programs with a lot of access to data. This is a critical concept that many programmers have not been exposed to.
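A simple way to see this effect (a generic sketch, not taken from the paper) is to sum the same matrix twice, once walking memory sequentially and once with a large stride. The strided version executes the same instructions but stalls on memory, which shows up as a much higher CPI in a profiler:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 4096;                       // 4096 x 4096 doubles, ~128 MB
        std::vector<double> m(n * n, 1.0);

        auto time_sum = [&](bool row_major, const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            double s = 0.0;
            for (size_t i = 0; i < n; ++i)
                for (size_t j = 0; j < n; ++j)
                    s += row_major ? m[i * n + j]    // unit stride: cache friendly
                                   : m[j * n + i];   // stride of n: mostly cache misses
            double dt = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
            printf("%s sum=%.0f, %.3f s\n", label, s, dt);
        };

        time_sum(true,  "row-major:");
        time_sum(false, "col-major:");
        return 0;
    }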

I also feel that there should have been more discussion about NUMA: how non-uniform memory access affects programs, how numactl works, how the first-touch policy works, and how memory is assigned to a processor in general. A lot of programmers expect the operating system to handle much more than it is capable of, and an understanding of these ideas sets the context for what the OS can do and what the programmer will have to deal with.
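To give one concrete example of the kind of background I mean (again my own sketch, not material from the paper): on Linux the default first-touch policy places each page on the NUMA node of the thread that first writes it, so initializing data with the same parallel loop structure that later processes it keeps each thread's pages in its local memory:

    #include <omp.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const long n = 50000000;                     // ~400 MB of doubles
        double* a = static_cast<double*>(malloc(n * sizeof(double)));

        // malloc() only reserves address space; physical pages are placed when
        // they are first written. Touching them from the worker threads, with
        // the same static schedule as the compute loop, spreads pages across
        // the NUMA nodes instead of piling them all on the master's node.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) a[i] = 0.0;

        // Each thread now works mostly out of the memory it touched above.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) a[i] = a[i] * 2.0 + 1.0;

        printf("a[0]=%f a[n-1]=%f\n", a[0], a[n - 1]);
        free(a);
        return 0;
    }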

The paper is nevertheless a good introduction to the memory bandwidth issue and presents an important, often overlooked concept to programmers who have in many cases been able to ignore the memory hierarchy in computer systems. I recommend reading it and doing further research on the areas I outlined.

Parallelizing legacy Unix/Linux code using copy on write fork()

Adding parallel processing to legacy code is a desire of every software company that has an existing product which is significant in complexity and which needs to run faster. Processor clock rates are not increasing much, and multiple cores are being added to chips instead. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. This is a follow-on post to Why Parallel Processing? Why now? What about my legacy code?; please read that posting for more background.
Typically with multi-core processors, the first thought is to use multiple threads in a shared memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written in such a way that there are no race conditions between the threads for shared data. Thread efficient is a term I use to discuss the efficiency of the scheme used to avoid race conditions as well as the code’s ability to efficiently use the processor and its cache and memory bandwidth to keep the processors busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases and/or code bases that have been developed over a long period of time. In such code a top level understanding of the code call chains and architecture is often not complete. Simple locking can make the code thread safe but often leads to locks which have a long duration and make the code thread safe but not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short term needs of the software’s user.
Scaling serial programs through parallelization in the most thread-efficient way typically involves a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core is usually defined as systems with 4, 8, maybe 16 CPUs. Many-core is defined as systems having at least 16 processors and usually 64 or 128 processors. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind. I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However, it is possible on multi-core systems to get a decent speedup with smaller changes to the code using clever software engineering approaches.
In this posting I want to discuss one such approach: the use of the Copy on Write (COW) mechanism of fork() for Unix applications. In the initial implementations of fork() on older systems, fork() would copy the entire virtual address space of the parent process and start it as a separate process. This worked well for the use model where the entire program went forward in both processes, but it did not work well in the most common case, where the parent program would execute fork() and then exec() another program. For example, let's assume a parent program needs a directory listing. To get one it could fork() and then exec() the "ls" command. From a RAM and virtual memory standpoint, however, this was very wasteful. Imagine that the parent was a large program using significant memory, for example 64 GB of RAM. When fork() was executed, the memory usage would double, since there would now be 2 copies of the 64 GB space. Then exec() would run and the directory listing would be retrieved. The peak memory usage would have doubled just to get a directory listing. Even though this peak might be short lived, it would increase the virtual memory requirement of the system and could potentially cause a lot of thrashing during the execution of the smaller program started by exec(). This was inefficient and unnecessary.
Over time fork() was enhanced with a clever optimization to get around this problem. Instead of copying the entire memory image, the child copies just the page tables of the parent process and points to all of the same pages. The pages are marked with a special Copy on Write bit. While this COW bit is set, if the child or parent reads a page, the pointers are simply dereferenced. However, if either process writes to a page of memory, a page fault occurs, the page is duplicated, and the writing process gets its own copy of the page. In this way the simple case of fork() followed by exec() of a small program does not use any additional memory beyond that needed to hold the page tables, which is a small percentage of the total virtual memory used.

Figure: initial memory use right after fork(), with parent and child sharing memory for read access.

Figure: memory after the parent or child changes the value of a memory location in a page, with the parent and child now having their own private copy of that page.
Let's take an example of some thread-unsafe code with a global static variable which needs to have a different value in each process. With copy-on-write fork(), when the fork() happens the two processes share the same initial value and memory location of the global. When one process changes the global, a page fault occurs, a copy of the page containing the global is created, the page table is updated, and the processes continue with independent copies of the page. This mechanism is guaranteed by the OS not to have a race condition and is very fast, since it leverages the OS and hardware support for paging.
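Here is a minimal sketch of that scenario in C++ (an illustration of the mechanism, not production code): the global is shared after fork() until the child writes to it, at which point the kernel transparently gives the child its own copy of the page.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    static long counter = 42;   // global that must end up different in each process

    int main() {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {
            // Child: the first write triggers a page fault; the kernel copies the
            // page and the child continues with its own private 'counter'.
            counter = 1000;
            printf("child:  counter = %ld\n", counter);
            _exit(0);
        }

        // Parent: waits for the child, then observes its own value unchanged.
        waitpid(pid, nullptr, 0);
        printf("parent: counter = %ld\n", counter);   // still 42
        return 0;
    }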
The downside to this approach is that if the parent or one of the child processes changes memory significantly, the memory ends up being duplicated anyway. However, if carefully utilized, there are many applications which read a lot of data and produce a little data that can leverage this technique and gain nearly all of the benefits of threading the code without the work of making the code thread safe. One has to be careful to build all or most of the data structures before the fork() occurs so that the memory can be shared.
When the children are done processing their work, they have useful output that eventually needs to get back to the parent. This can be done by sending the return data over a pipe, using a shared memory segment, or simply writing the data to a file for the parent to read.
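For example, the pipe route can be as simple as the following sketch (the result string is just a placeholder for whatever the child actually computes):

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        int fd[2];
        if (pipe(fd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            close(fd[0]);                                    // child writes only
            const char* result = "child result: 12345\n";    // stands in for real output
            write(fd[1], result, strlen(result));
            close(fd[1]);
            _exit(0);
        }

        close(fd[1]);                                        // parent reads only
        char buf[256];
        ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; printf("parent received: %s", buf); }
        close(fd[0]);
        waitpid(pid, nullptr, 0);
        return 0;
    }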
Copy-on-write forks can be a simple yet effective way for software engineers to leverage multiple processors on the same machine without having to redesign the entire program to be multi-threaded. I have used this technique on multiple programs and have found it to work well. The overhead of the page table copy is small, and for highly parallel problems the time taken by the page faults has been too small to measure. I have been careful to use this technique mostly on analysis applications where many GB of data are read and a few MB of report data or results are produced.

Parallelizing Legacy code using Fine Grained Distributed Processing

Adding parallel processing to legacy code is a desire of every software company that has an existing product which is significant in complexity and which needs to run faster. Processor clock rates are not increasing much, and multiple cores are being added to chips instead. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. This is a follow-on post to Why Parallel Processing? Why now? What about my legacy code?; please read that posting for more background.
Typically with multi-core processors, the first thought is to use multiple threads in a shared memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written in such a way that there are no race conditions between the threads for shared data. Thread efficient is a term I use to discuss the efficiency of the scheme used to avoid race conditions as well as the code’s ability to efficiently use the processor and its cache and memory bandwidth to keep the processors busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases and/or code bases that have been developed over a long period of time. In such code a top level understanding of the code call chains and architecture is often not complete. Simple locking can make the code thread safe but often leads to locks which have a long duration and make the code thread safe but not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short term needs of the software’s user.
Scaling serial programs through parallelization in the most thread-efficient way typically involves a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core is usually defined as systems with 4, 8, maybe 16 CPUs. Many-core is defined as systems having at least 16 processors and usually 64 or 128 processors. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind. I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However, it is possible on multi-core systems to get a decent speedup with smaller changes to the code using clever software engineering approaches.
In this posting I want to discuss one such approach which is the use of Fine Grained Distributed Processing to speed up compute intensive pieces of serial programs.
Many programs which run for a long time have parts of the code which are a significant bottleneck or a large percentage of the runtime. These parts can be found by profiling the code and looking to see where the time is spent. Once the code that needs to be sped up is found, the first step, before attempting any parallel coding, is to speed up the single-CPU performance as much as possible. Optimizing the single-CPU algorithm, data structure layout, and access patterns is key to being able to parallelize effectively. If the algorithm is poor, then even when parallelized there will still be inefficient use of CPUs. If the data locality is poor, causing many cache misses, then the memory-to-CPU bandwidth will be the bottleneck in the single-CPU version of the code, and this will get even worse in the N-CPU version, since at the limit the parallel algorithm may require N times the data bandwidth. In my experience this optimization is generally a local one, requiring changes to the local algorithm and the specific data structures it depends on. Sometimes it turns out to be a more global problem, and when this is the case it is definitely better to find out about it before trying to parallelize the program, and to fix it in the serial version.
Once the code and data access are optimized, the next step is to find parallelism in the algorithm. Many times the algorithms or steps of a program which are time consuming involve loops or other constructs which can be decomposed into parallel pieces. If the underlying code and structures are, or can be made, thread safe, then shared memory threads can be a great way to go. Often this is not the case. When shared memory threads are not feasible, one possible approach is to start separate processes, either on the local machine, if there are CPUs available, or on remote machines that have good connectivity to the master process.
The remote processes can then be set up as servers which receive messages with work to do and reply with a result. The client-server paradigm fits well here. The paradigm is similar to having threads whose work is determined by populating a queue. The remote processes are servers, much like a web server, which take requests and return results. The main program is the client and, like an internet browser, orchestrates the work and is the main interface to the user.

The protocol between the master and slave processes should be set up so that the socket programming is the same whether the slave ends up running on the local machine or on a remote host. In this way the distributed programming can work on a single multi-core machine or across a farm of machines. On a farm the master program will not know which machine the slave will wake up on, so a small handshaking protocol can be set up to initialize the connection between the master and the slave. I have summarized one simple protocol below. The protocol is kept very simple to avoid any chance of deadlock and to make debugging simple. The slaves are set up to log incoming messages so that if there is a bug in a slave it can be debugged independently of the master and the other slaves.

The application which I am working on uses TCL as its command-line language. To keep the protocol even simpler, I found it helpful for the master and slave to be the same executable, and for the master to pass TCL commands whose parameters contain the work to be done and whose return values are the results of the work. In this way, adding new TCL commands to perform the specialized work re-uses code from the serial master and gives lower-level access to one of the sub-calculations that was part of the serial master routine being parallelized.
The master program was loaded with the full set of data structures and state of the original serial program. The slave invocations of the same executable were done in a stateless mode: no data was loaded, and the program was set up to listen on a socket for TCL commands to execute and to reply with results. These stateless servers have the advantage that they use very little memory and free any memory used between work commands. They tend to have great data locality, since the incoming message is converted to a small targeted set of data structures by the same processor which will perform the calculations. It also helps that a process tends to have affinity for a single processor, and a processor has affinity for the data it allocates.
I call this approach Fine Grained Distributed Processing since the TCL commands can be a very fine grain of work as long as the work is enough to offset the time spent to package up the message for the socket and return the result. Even when running on a single multi-core machine the programming paradigm is distributed.
The servers do not have to strictly be stateless, but the more state they have, the more complex the protocol and synchronization becomes between the processes.
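A minimal sketch of such a stateless worker loop is below, using plain POSIX sockets. The port number, the one-line-per-message framing, and the do_work() routine are stand-ins of my own; the real application dispatches TCL commands, and a robust version would frame messages explicitly rather than assume one read() per command.

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>
    #include <string>

    // Hypothetical work routine: parse the request, compute, return a reply.
    static std::string do_work(const std::string& request) {
        return "result-for:" + request;
    }

    int main() {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        if (listener < 0) { perror("socket"); return 1; }
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5555);                 // port chosen arbitrarily for the sketch
        if (bind(listener, (sockaddr*)&addr, sizeof(addr)) != 0) { perror("bind"); return 1; }
        listen(listener, 1);

        int conn = accept(listener, nullptr, nullptr);   // master connects once
        char buf[4096];
        ssize_t n;
        // Each request is one newline-terminated command; reply with one line.
        while ((n = read(conn, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            std::string request(buf);
            if (!request.empty() && request.back() == '\n') request.pop_back();
            fprintf(stderr, "slave: got %s\n", request.c_str());   // log for offline debugging
            std::string reply = do_work(request) + "\n";
            write(conn, reply.c_str(), reply.size());
        }
        close(conn);
        close(listener);
        return 0;
    }

The master side is symmetric: it connect()s to each slave, writes a command line, and blocks on (or polls for) the reply.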
Some observations I have made with this approach: on a modern local machine with 4 CPUs I could send about 30K messages of roughly 30 KB each per second when the master sent messages and the slaves did nothing but echo the message back. I also noticed in real use that if the task size became less than 20 ms, scalability really would suffer. These benchmarks need to be taken into account when designing the messages and analyzing the suitability of this approach for a given problem. On our Cadence farm, which has machines with gigabit or better connectivity, I saw the benchmark degrade by about 10% to 30% when the slaves were on remote machines; I could send the same number of shorter messages or reduce the number of messages. This number will surely vary a lot from farm to farm, and it also varied for me from run to run, which I assume was affected by network traffic at the time. In production usage we have used the local machine for fine-grained tasks and the remote version for larger tasks of multiple minutes in duration. In this way the message-passing overhead is kept minimal.
One example application I used this technique on was parallelizing a parasitic reducer for electronic circuits. On a chip, the wires connecting the devices would ideally be perfect conductors, but there are parasitic resistances and capacitances that need to be analyzed before the speed of the circuit can be calculated. This analysis is called reduction, since the output is a reduced mathematical model known as a transfer function. The problem is easily parallelizable, since the wires connecting the devices can be computed in any order. However, the code in question depends on large data structures and a math library that are not thread safe. Rather than make the major changes needed for thread safety, I tried the fine-grained distributed approach with sockets and TCL commands. I was able to get around a 3X speedup on 4 CPUs for this step. An example of a small circuit with around 20K nets connecting components is below.

The Fine Grained Distributed Processing approach can be an easy and effective way to get parallelism out of legacy code for certain problems where the messages are easy enough to compose and not too long compared to the work that needs to be done.
One thing to note is that TCL itself has a robust socket facility which is easy to use and works well across platforms. I recommend it for prototyping and quick work. http://www.tcl.tk/about/netserver.html has a small example.
This is the first time I have tried to describe this approach in pure written form; usually I do it in person. If you have any questions, post a comment and we can discuss.

Why Parallel Processing? Why now? What about my legacy code?

Many software companies have applications in use by their customers that have significant runtime and for which fast runtime is a necessity or a competitive advantage. There has always been pressure to make such applications go faster. Historically, as processors have increased their speed, the needed speedups could often be achieved by tuning the single-CPU performance of the program and by utilizing the latest and fastest hardware. In the Electronic Design Automation industry that I am a part of, it has always been the case that the newest machines had to be used to run the design tools which were being used to design the next generation of processors. The speed and memory capability of the newest machines has always been just enough to design the next generation of chips. Other types of CPU-intensive software have also ridden the hardware performance curve in this way.
We will no longer see significant increases in the clock speed of processors. The power consumed by the fastest possible processors generates too much heat to dissipate effectively with known technologies. Instead, processor manufacturers are adding multiple processor cores to each chip. Why does this help? Power Consumed = Capacitance * Voltage^2 * Frequency. If a given calculation is perfectly moved from a processor running at N gigahertz to 2 parallel processors running at N/2 gigahertz, where does the savings come from? It would seem that each processor runs at half the power, but now there are 2 processors, which would mean that the same total power is used. The savings comes from the fact that slower processors can run at a lower voltage. For example, a processor running at half the frequency can run at around 8/10 the voltage level; 0.8^2 is 0.64, which implies a 36% power savings. If you scale this up to 32 CPUs, it becomes possible to get a lot of compute power for much lower power consumption, and therefore much lower required heat dissipation. Eventually it seems that even cell phones and other embedded devices will move to multi-core processing for this reason: more compute capability, or longer battery life for the same capability. Both are compelling values.
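Working out the arithmetic for the two-core case:

    P_single = C * V^2 * f
    P_dual   = 2 * C * (0.8 V)^2 * (f / 2) = 0.64 * C * V^2 * f

so the pair of half-speed cores delivers the same nominal throughput at about 64% of the power, which is the 36% savings mentioned above.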
Part of the assumption behind this power savings is that the software implementation of the parallel program running on the 2 slower processors is perfectly efficient. Well, nothing in the real world is perfectly efficient. Even if the coding is not perfectly efficient, as long as it is reasonably efficient, there is a benefit. If the parallel coding is inefficient, it might be that the parallel program will use more power on the slower processors than the serial program running on the fast single processor. However, since faster processors that won't melt can no longer be made, we are kind of stuck with going parallel and need to do our best.
I say stuck because, from a software development perspective, a large new burden is being placed on software developers. That burden is to write programs that are as efficient as possible and which make use of N processors, hopefully where N is configurable by the user and can be increased as new processor chips with more cores become available. For most developers this is something really new and really complex. It also presents a huge discontinuity for software companies with large investments in legacy code.
I joined the company I work for in 2006 with the job of parallelizing a product with 6 million lines of code, developed over 10 years, and made up of dozens of very complex, intertwined algorithmic steps. I wanted to write this introductory blog because I haven't seen the path to where we are, and the need for parallel programming, described from the big picture. I plan to write a few blogs going forward on the problem and some solutions for parallelizing legacy code. The solutions will work in some cases and will parallelize programs to some degree, but not perfectly. In the end, software developers are engineers and need to make engineering trade-offs. Please check back on this blog for some tricks of the trade in parallelizing legacy code. I have been able to get around a 2X speedup on 4 CPUs for this 6-million-line code base. Good, but a lot more to do. I am hoping to help others by sharing, and I am also looking for new ideas from the smart and innovative people that I hope will join in the discussion.

First Blog Posting to Test

This is a test blog to see how it works.