What makes parallel programming hard?

Multi-cores are here, and they are here to stay. Industry trends show that each individual core is likely to become smaller and slower (see my post to understand the reason). Improving the performance of a single program on a multi-core requires that the program be split into threads that can run on multiple cores concurrently. In effect, this pushes the problem of finding parallelism in the code onto the programmers. I have noticed that many hardware designers do not understand the challenges of multi-threaded (MT) programming, since they have never written MT apps. This post is to show them the tip of this massive iceberg.
Update 5/26/2011: I have also written a case study for parallel programming, which may interest you.

Why is finding parallelism hard?

Some jobs are easy to parallelize: e.g., if it takes one guy 8 hours to paint a room, then two guys working in parallel can paint it in four hours. Similarly, two software threads can convert a picture from color to grayscale 2x faster by working on different halves of the picture concurrently (a sketch follows). Note: programs that fall in this category are already being parallelized, e.g., scientific computing workloads, graphics, Photoshop, and even open-source apps like ImageMagick.
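Here is a minimal pthreads sketch of the grayscale case; the image size, its layout in a global array, and the simple channel-average formula are illustrative assumptions, not code from any real application:

#include <pthread.h>

#define H 480
#define W 640

unsigned char rgb[H][W][3];  /* input image; layout is an assumption */
unsigned char gray[H][W];    /* output image */

typedef struct { int start, end; } range_t;  /* rows [start, end) */

static void *gray_worker(void *arg) {
    range_t *r = (range_t *)arg;
    for (int y = r->start; y < r->end; y++)
        for (int x = 0; x < W; x++)  /* simple channel average */
            gray[y][x] = (unsigned char)((rgb[y][x][0] + rgb[y][x][1] + rgb[y][x][2]) / 3);
    return NULL;
}

int main(void) {
    pthread_t t;
    range_t top = {0, H / 2}, bottom = {H / 2, H};
    pthread_create(&t, NULL, gray_worker, &top);  /* worker: top half */
    gray_worker(&bottom);                         /* main thread: bottom half */
    pthread_join(t, NULL);
    return 0;
}

The two halves touch disjoint data, so no synchronization is needed, which is exactly why such programs parallelize easily.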

There are also other programs that are sequential in nature: e.g., two guys cannot cook 2x faster than one guy because the task isn't fully parallelizable; there are inter-task dependencies, and the cooks end up waiting for each other at times. Unfortunately, a lot of programs have artificial inter-task dependencies because the programmers wrote them with a single-threaded (ST) mindset. For example, consider this code excerpt from the H.264 reference code (I have removed unnecessary details to highlight my point):

macroblock_output_data *mb; // Global variable

for (...) { // The OUTER loop
    decode_slice(...);
    if (mb->last_mb_in_slice)
        break;
}

void decode_slice(...) {
    ...
    mb = ...; // overwrites the global on every iteration
}

Notice how the variable mb is written every iteration, and no iteration uses the mb written by previous iterations. However, mb was declared as a global variable, probably to avoid repeated allocation and deallocation. This is a reasonable ST optimization. From an MT standpoint, however, the iterations of the OUTER loop now have a dependency among each other and cannot run in parallel. To parallelize this code, the programmer first has to identify that the dependency is artificial. He/she then has to inspect thousands of lines of code to ensure that this assumption isn't mistaken. Lastly, he/she has to change the code throughout to make mb a local, per-iteration variable (a sketch of this final step follows). All this is difficult to achieve (I parallelized H.264 for this paper).
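As a hedged illustration only, with the type and field names simplified into stand-ins for the real reference code, the end result could look like this: mb becomes a per-iteration local that is passed to decode_slice() explicitly, so the OUTER loop iterations no longer share state.

typedef struct {
    int last_mb_in_slice;
    /* ... other per-macroblock outputs ... */
} macroblock_output_data;

static void decode_slice(macroblock_output_data *mb) {
    /* ... decode one slice (stubbed for this sketch) ... */
    mb->last_mb_in_slice = 1;
}

void decode_all_slices(void) {
    for (;;) {                        /* the OUTER loop */
        macroblock_output_data mb;    /* local: no cross-iteration dependency */
        decode_slice(&mb);
        if (mb.last_mb_in_slice)
            break;
    }
}

With the artificial dependency gone, the iterations become candidates for parallel execution, subject to the real dependencies that remain.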

So here is the status: leaving the artificial dependencies in the code limits parallelism, mistakenly removing a real one breaks the program, and reaching the perfect balance requires prohibitive effort. Since it's hard to identify all dependencies correctly the first time, programmers make errors, and debugging begins.

Why is debugging difficult?

Debugging multi-threaded code is very hard because bugs show up randomly. Consider the following:

Say two threads T0 and T1 each need to increment the variable X. The C/C++/Java code for this is:

X = X + 1

Their assembly code will look as follows (instructions are labeled A-F; note that each thread has its own copy of register R0):

      T0                      T1
A: Load X, R0           D: Load X, R0
B: Increment R0         E: Increment R0
C: Store R0, X          F: Store R0, X

The programmer wants X to be incremented by 2 after both threads are done. However, when the threads run concurrently, their instructions can interleave in any order, and the final value of X depends on the interleaving (assume X was 0 before the two threads tried to increment it). For example:

ABCDEF: X = 2 (correct)

DEAFBC: X = 1 (incorrect)

ADBCEF: X = 1 (incorrect)

Basically, there is a dependency the programmer has missed: D must not execute before C (or, alternatively, A must not execute before F); in other words, one thread's load-increment-store sequence must complete before the other's begins. However, the code works fine half the time, making it very hard to track down the bug or to test a fix. Moreover, traditional debugging techniques like printf and gdb become useless because they perturb the system, thereby changing the code's behavior and often masking the bug.
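For completeness, here is a minimal sketch of one conventional fix, assuming pthreads: a mutex makes the load-increment-store atomic, so A-B-C and D-E-F can never interleave (a hardware atomic increment would also work).

#include <pthread.h>

int X = 0;
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x_lock);    /* only one thread can be in here */
    X = X + 1;                      /* load, increment, store: now atomic */
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, increment, NULL);
    pthread_create(&t1, NULL, increment, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* X is now reliably 2 */
    return 0;
}

Note that the lock also serializes the two increments; synchronization buys correctness at some cost to parallelism.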

Why is optimizing for performance so important and challenging?

The sole purpose of MT is performance. It is very common for the first working version of parallel code to be slower than the serial version. There are two common reasons:

Still too many dependencies (real or artificial): programmers often iteratively remove dependencies, and sometimes even rewrite the whole program to reduce them.

Contention for a hardware resource: threads can also get serialized if there is contention for a shared hardware resource, such as a shared cache (one common pattern, false sharing, is sketched below). Programmers have to identify and reduce this contention. Note that identifying these bottlenecks is especially challenging because hardware performance counters are not reliable.
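A hedged illustration of false sharing, assuming 64-byte cache lines: without the padding below, both counters would share one cache line, and the threads would serialize on it even though they never touch each other's data.

#include <pthread.h>

struct padded_counter {
    long count;
    char pad[64 - sizeof(long)];   /* assume 64-byte cache lines */
};

struct padded_counter counters[2];

static void *count_worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].count++;      /* each thread owns its own cache line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, count_worker, (void *)0);
    pthread_create(&t1, NULL, count_worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}

The point is that contention, not any logical dependency, is what serializes the threads here; on a machine with a different line size, only the padding constant would change.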

After several iterations, the code becomes good enough for the performance target.

The work does not end here…

Unlike ST code, which got faster with every process generation, MT code has complex, non-deterministic interactions that can make its performance swing widely when the hardware changes. For example, I had a branch-and-bound algorithm (a 16-Puzzle solver) that would slow down with more cores because the algorithm would end up on a different path when more threads were running. Even a simple kernel like histogram computation can behave very differently with different inputs or machine configurations (see this paper). Thus, parallel programmers are also burdened with the task of making their code robust to changes.

Conclusion

My goal here was not to teach parallel programming but merely to provide a flavor of what it takes to write a good parallel program. It is indeed a complex job, and I assert that it is not possible to appreciate the challenges without actually writing a parallel program. For those who have not written one yet, here is my call to action:

Write a parallel program to compute the dot-product of two arrays and get it to scale perfectly, e.g., a 4x speedup on 4 cores. It is simple, but you will learn more than you expect.

Search for the keywords pthreads or winthreads to learn the syntax on Linux or Windows, respectively. Share your experiences!
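To get you started, here is one possible skeleton, assuming pthreads, 4 threads, and a statically partitioned array; each thread accumulates a private partial sum, which avoids the shared-counter race shown earlier. Getting from this to a perfect 4x speedup is the real exercise.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

double a[N], b[N];
double partial[NTHREADS];   /* hint: padding these apart is part of the exercise */

static void *dot_worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double sum = 0.0;                 /* private partial sum: no race */
    for (long i = lo; i < hi; i++)
        sum += a[i] * b[i];
    partial[id] = sum;                /* one writer per slot: no lock needed */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, dot_worker, (void *)i);
    double total = 0.0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("dot = %f\n", total);      /* expect 2000000.000000 */
    return 0;
}

This compiles with gcc -pthread; balancing the partition and padding partial[] is where the scaling work begins.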

31 Responses to “What makes parallel programming hard?”

  1. Great article on MT programming. I'm not a programmer; I've done Logo, BASIC, QBasic, Pascal, C, and COBOL at school, and now I'm more of an OS enthusiast. I don't really want to start a debate here, but I'm curious whether you've checked what Apple has done to solve many of the challenges parallel programming poses.

    In my limited understanding, Apple's addition of the blocks extension to C solves the variable problem, since a block captures the variables of the function it runs in. And with libdispatch, you send blocks (jobs), serially or in parallel, to Grand Central Dispatch (GCD), which runs all the "wagons" coming from all apps on its own threads. This has the advantage of simplifying code: as the article says, loops and filters are very easy to transform into ^blocks and dispatch, serially or in parallel, for execution on external threads. Another advantage is that this is very scalable: GCD adjusts how many threads run concurrently given the hardware, and takes advantage of CPU cores that share a cache versus multiple distinct CPUs. GCD optimizes jobs dispatched on multi-threaded, multi-core, multi-CPU designs like the Mac Pro by dispatching all jobs of the same queue onto threads that can share the same cache on the same CPU. I'm not sure, but I think you can even dispatch OpenCL code to multiple GPUs this way.

    I’ve enjoy this articles:

    • Hey BigMac2, good to see you here:)

      It's a great comment, and all your points are indisputable. GCD does ease the task of actually writing parallel code. As you point out, the block construct forces programmers to think about dependencies and "tricks" them into doing the right thing. The serial and parallel queue constructs make it easier to enforce thread dependencies, and dynamic thread allocation relieves the programmer of choosing the right number of threads.

      However, GCD still does not solve some of the challenges I discuss above:

      First and foremost, the programmer still has to find parallel sub-tasks to insert into the queues. The programmer still has to identify all true dependencies to ensure that tasks are pushed in the correct order. Furthermore, they still need to think about thread synchronization to decide what goes in a serial queue and what goes in a parallel queue. All this implies that a big chunk of these challenges remains.

      I have not coded in GCD enough to comment on the debugging experience; maybe someone else can enlighten us.

      GCD as such does not help with robustness. However, indirectly it does help a lot. Robustness and scalability usually become a problem because programmers make wrong choices. By forcing programmers to do the right thing and taking control of execution, GCD does improve scalability.

      I will learn more about GCD and update my answer ASAP. It's a great topic to discuss.

      • I greatly enjoyed your explanation, and I appreciate the value of parallel programming for heavily computational programs; when working with a complex algorithm, it can be very difficult to split it into chunks that can be executed separately.

        But I think parallel programming can (and will) be used for more common purposes, as Apple is doing with GCD in all its software. Most programs, like Mail, iPhoto, or iTunes, are about managing and filtering content, executing repetitive tasks on multiple items. Mail filters are a great example of tasks that can be transformed into blocks for parallel asynchronous execution. The GCD solution is more than parallel execution; it is a way of running "packetized" code through a centralized dispatcher. I think the new generation of multi-CPU mobile devices needs more than parallel programming alone: some sort of OS-controlled threading that allows multiple multi-threaded apps to run at the same time.

        I like this description of GCD: Islands of Serialisation in a Sea of Concurrency.

  2. I find your point on artificial dependencies very relevant. This is mostly due to shortcomings in current programming languages.

    Let me give an example. You want to increment both x and y. In C, Java or similar languages, you will have to write either { x++; y++; } or { y++; x++; }. That is, you have to introduce an artificial dependency, although there was none, just because the language syntax does not allow otherwise. Then parallelizing tools will work hard attempting to guess whether such dependencies are real or artificial.

    In a language with parallel composition, such as Ateji PX for Java, the syntax makes it possible to write [ x++; || y++; ]. The parallel bar || stands for parallel composition; it runs both statements in no particular order, or even in parallel when running on parallel hardware. This example shows that, with an appropriate syntax, there is no need to introduce artificial dependencies any more.

    • Hi Patrick, thanks for telling me about Ateji PX. I have looked at the examples and they look pretty neat. I will try writing code in it one of these days and perhaps update my post.

      I do have a controversial question though:

      Why did you guys choose Java?
      - Any particular features of Java that made it a suitable choice?
      - Is Java better for MT in general?
      - If you were to do it again, would you still pick Java?

    • I think you're right that it's mostly a language issue: the semantics of the majority of programming languages have a logical "top to bottom, left to right" temporal ordering. It is so deeply ingrained that it is difficult for most programmers to think of it any other way: line 10 executes before line 11, and line 11 executes only after line 10. A very strong, very deeply ingrained mental model of temporal causality.

      There is only one family of languages that I know of where this isn't true: Hardware Description Languages, such as Verilog or VHDL.

      Hardware Description Languages have a completely different logical and mental model of temporal causality because they have to: these are the languages used to design CPUs, ASICs, etc. When you're dealing with hardware like that, it makes no sense to think "this transistor executes before that transistor"; it's nonsensical, because every single transistor is executing at the exact same time. HDLs reflect that fact: every single character of every single line is executing at the same time. There are, of course, ways to control causality, and at what point "something happens", but it is unlike anything you'll find in a typical programming language.

      For me, it was quite an eye-opener. It also made me realize that there's nothing inherently difficult about doing things in parallel or concurrently; it's actually pretty trivial… once you are forced to use a language where doing things in a sequential, step-by-step fashion is simply not possible.

      In Verilog, "a <= b; b <= a;" and "b <= a; a <= b;" mean exactly the same thing (I'm glossing over some important details, though). The short version of why this is so is that in Verilog you have to explicitly declare when things happen. This typically means you write something that says "at the start of the tick of the clock, do this…". So, at the tick of the clock, the "inputs" are the values on the right-hand side of each statement, and the assignments to the left-hand sides take effect before the next tick of the clock, but have yet to take place during this tick. There are even two different types of assignment, <= and =, where = means the assignment takes place and is completed by the end of the statement containing it.

  3. [...] Aater Suleman writes about why parallel programming is so hard in a multicore world. Unlike ST code which would get faster every process generation, MT code has [...]

  4. [...] Aater Suleman, Intel This post is a follow up on the previous post titled why parallel programming is hard. To demonstrate parallel programming, this article presents a case study of parallelizing a kernel [...]

  5. OpenMP for C/C++ uses hints, or pragmas, to alert the compiler that the code should be run multi-threaded.
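    For example, a single pragma parallelizes the loop below, and the reduction clause takes care of the shared-sum race; a minimal sketch (compile with a flag such as gcc's -fopenmp):

    double dot(const double *a, const double *b, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }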

  6. Hi, interesting read! I was wondering what your thoughts were on using OS processes vs. threads. I've been finding that relying on the OS lets me focus more on my real software problems (see link). Of course, communication between processes is going to be harder than between threads, but it seems like this might be a fair trade-off.

    Thoughts?

    • James,

      You are right that process-level parallelism is good for several things, but MT is required for cases where threads need to communicate. There are ways to get processes to talk to each other as well, e.g., D-Bus, IPC, or pipes, but I have written programs that way and the communication overhead is prohibitive. I would look at PostgreSQL as an example. By the way, for process-level parallelism, make is your poster child, and then there is GNU parallel as well.

      Having said that, I do think multi-threaded code is the right way to do it if you want to speed up a single problem.

  7. What makes parallel programming hard is the legacy of tools we have to work with: the unholy trinity of CPU, OS, and programming language.

    The x86 architecture and instruction set aren't parallel-friendly; it is the operating system that performs multi-threading and multi-tasking, and the operating system itself is just an application.

    The resulting Operating System ABIs to implement parallelism are hefty and expensive.

    Take the maximally extreme case of

    x++; y++;

    At the lowest levels, the CPU will probably pipeline that, but to get one code-stream to execute those operations in parallel, the cost is gigantic.

    Under the hood, multi-core is just a hacky version of SMP. As a result, no real progress is being made in parallel programs outside of academic and extreme domain-specific scopes.

    I would hazard a guess that the root of the problem is the old hardware-guy vs. software-guy issue: what portion of the pan-domain, pan-discipline, pan-language programming community is working in x86 machine code or assembler?

    I’d be surprised if it was more than a single digit percentile. If accurate, that means that 99% of all programmers are working at least one layer separated from actual machine instructions. When it comes to parallelism that’s a big deal, especially since parallelism itself is a pseudo-implementation.

    Of course, the OS devs don’t want applications tearing up threads/whatever that do their own thing because … well, botnet anyone?

    Languages like C/C++ carry on their legacy of single-threadedness because you have to find large jobs of work, or else you actually decrease your app's performance due to the sheer volume of x86 instructions required to start a thread or dispatch work across cores.

    It’s more than just the ABI/APIs, very few languages have adopted constructs like “is thread safe” decorators. So the compiler’s have their work cut out to do any kind of auto-optimization and, in my experience, most of them work well in test cases or if a programmer exactingly follows very careful coding sequences, but in general production environments they just don’t work out as well as they could.

    My thoughts on what we can actually do about this:

    http://kfsone.wordpress.com/2011/03/19/computer-operating-and-programming/

    • Hey Oliver, thanks for reading.

      I agree with everything you said until:

      “Under the hood, multi-core is just a hacky version of SMP. As a result, no real progress is being made in parallel programs outside of academic and extreme domain-specific scopes.”

      It's much different from SMP because the communication overhead is much lower and the trade-offs are very different. I measured that on a dual-chip system (a kind of SMP), core-to-core cache misses were going through memory and cost around 250 cycles. On a real CMP, like Nehalem, that cost is just tens of cycles. Hence my disagreement that it's the same as SMP.

      I also disagree that no real progress is being made. ImageMagick is a good example. Adobe has done a lot of parallel programming. Open-source efforts are firing up quickly. I guess it's subjective, but in my opinion the motivation is now high and the work is catching fire.

      I agree with the hardware-guy/software-guy issue. Fixing that issue is the theme of this blog. This post was meant for hardware guys to learn the troubles a software guy goes through, so they don't just stick it to the software (I am a hardware guy with software know-how).

      Legacy tools, languages, and hardware architectures indeed make it harder. Read my post here where I highlighted this very issue:

      Are computer architects building the wrong computer hardware?

    • Btw, thanks for the pointer to your blog. I will read it and comment ASAP.

  8. A very nice article. This article is very relevant to me since I am taking a concurrency/parallel programming course this quarter.

    While we parallelized a few well-known algorithms as part of the course assignments, we discussed and used an awesome tool to identify and debug hidden concurrency bugs. It's called "mchess", part of Alpaca, developed by Microsoft Research, Redmond. This tool is open source and can be used effectively to find really-hard-to-find concurrency bugs.

    More info on Chess can be found here and on CodePlex.

  9. What are your thoughts on Google's Go? http://golang.org/

    It seems to solve some of the issues you describe by enforcing variable declarations to be either inside or outside the parallelizable code. I'm not sure that it allows you to share resources as in your ABCDEF example, but maybe it's just safer that way!

    • Go is kinda neat. I haven't dug into it enough to make a strong statement, but I feel that all the innovations in this area, like Go, GCD, and TBB, are great steps. They all ease this tiresome process.

      I still use the low-level stuff in my posts because I like bottom-up learning… IMO, using high-level constructs without understanding what's under the hood just leads to surprises.

  10. [...] Aater Suleman writes about why parallel programming is difficult. … I was unaware … that a major challenge in multi-threaded programming lies [...]

  11. My vague & unthought-out ideas…
    Most algorithms fall into one of two categories: parallelizable or not.
    Why don't we have standard implementations of each algorithm in most of the popular languages, available online to reference? Why re-invent the wheel?
    Then all we need to do is minimize the critical path / execution time in the big picture.

    • Colin,

      Thanks for reading and taking the time to share your thoughts. You are right about having more code on the internet for reference. It's an absolute must. My only concern with standard implementations is that most interesting software tailors algorithms to its needs, which makes it hard to use standard code. I do, however, believe that reference examples can help a lot.

      On a side note, I do want to clarify that there are algorithms that are mid-way: e.g., the 16-puzzle problem is parallelizable, but only partially.

  12. One of the main reasons parallel programming is hard is that most of it is done using the POSIX threads model. Hoare's CSP offers an approach that is much easier to reason about and get correct. In some ways it mirrors electrical circuits in hardware: a motherboard is an immensely complicated parallel system when viewed as millions of transistors, yet we can now design them with decent results.

    The starting point for reading up on the computer languages influenced by CSP, including Google's Go mentioned above, is Russ Cox's "Bell Labs and CSP Threads".

  13. Personally, I am of the opinion that parallel computing is hard because of the very poor memory model of the C-based languages: mutable shared memory.

    If you try programming in a nearly side-effect-free language and use immutable objects, the very real pain of parallelism is dramatically reduced.
    If you add such tools as "share nothing", "use messages for communication between processes", and/or software transactional memory, the pain drops further.

    I cannot help but recommend looking at languages such as Clojure, Erlang, or Scala with Akka if you want to reduce your pain with parallel programming. Whatever you do, leave Java/C behind, as their memory model is (in my opinion) broken.
