The Wayback Machine - https://web.archive.org/web/20121216051010/http://www.futurechips.org:80/chip-design-for-all/a-multicore-save-energy.html
May 19, 2011
 

I have been asked this question in person and online before. I have seen “experts” argue about it at academic conferences and people debate it on forums. The answer is not a simple yes/no; it requires some analysis. My short answer is: multi-core does not save energy unless you simplify each individual core to make it more energy-efficient. I explain this with an example and provide insights to back my assumptions.


I will use two metrics for this study: speed and energy.

Assume a FAST core that runs at 10 Watts and completes a program in 10 seconds. Total energy consumed = 10 W x 10 s = 100 J.

Now suppose we put two FAST cores together. Power doubles to 20 Watts, execution time halves to 5 seconds (I am giving multi-core extra credit here), and energy stays at 100 J. Note: there are no energy savings from going multi-core alone. In fact, if multi-core did not halve the execution time (which it often fails to do), it would be less energy-efficient.

Now let's say we have a SLOW core. It consumes 5 W of power and finishes the program in 15 seconds. Energy = 5 W x 15 s = 75 J.

Let's build a dual SLOW core: 10 Watts of power, 7.5 seconds of execution time, 75 J of energy. This is the best balance between energy and speed, assuming the workload can leverage two cores efficiently.
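The four configurations above can be tabulated with a short script (a sketch using the example numbers from this post; the dual-core rows assume the same ideal 2x speedup granted above):

```python
# Energy = power x time, for the hypothetical cores in this post.
# Dual-core cases assume ideal 2x speedup: power doubles, time halves.
configs = {
    "FAST":      (10.0, 10.0),   # (power in watts, time in seconds)
    "dual FAST": (20.0,  5.0),
    "SLOW":      ( 5.0, 15.0),
    "dual SLOW": (10.0,  7.5),
}

for name, (power, time) in configs.items():
    print(f"{name:9s}: {power:4.1f} W x {time:4.1f} s = {power * time:5.1f} J")
```

Running it shows FAST and dual FAST both cost 100 J, while SLOW and dual SLOW both cost 75 J: doubling the cores changes speed, not energy.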

One can envision an even SLOWER core that would be more energy-efficient still. Thus, energy-efficiency is not a property of multi-core itself; multi-core seems energy-efficient only because each individual core is efficient. Multi-core just allows us to get decent speed at low energy. Don't forget that multi-core is no free lunch: leveraging it requires effort from programmers, who have to find parallelism in their programs, and that effort can offset the efficiency benefit.

By the way, the fundamental reason the math above works is that I assumed the SLOW core burned half the power of the FAST core but finished the program in less than twice the time, i.e., it took only 15 seconds, not 20. This is what makes the SLOW core more efficient. If you are curious, SLOW cores tend to be more efficient for the following three reasons:

1. Lower frequency. Since power is proportional to the cube of frequency**, halving the frequency reduces speed by 2x but power by 8x, making the core 4x more energy-efficient.

**Note: It is actually cube, with some reservations, which is why I originally wrote square as a first-order approximation. Thanks to Harrkev for calling me out and giving me an opportunity to explain in depth. See our conversation below.
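As a quick sanity check on the arithmetic, here is the halved-clock case worked through (a sketch assuming the idealized P ∝ f³ model, ignoring leakage and the caveats in the note above):

```python
# Idealized model: power scales with the cube of frequency,
# and execution time scales inversely with frequency.
def energy_ratio(freq_scale):
    power_scale = freq_scale ** 3    # P proportional to f^3
    time_scale = 1.0 / freq_scale    # half the clock -> twice the time
    return power_scale * time_scale  # energy = power x time

# Halving the frequency: 8x less power, 2x more time, 4x less energy.
print(energy_ratio(0.5))  # 0.25
```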

2. Less speculation. FAST, aggressive cores tend to be more speculative: rather than wait for data, they predict values and do the work in the hope that the prediction will be correct. This improves performance, but incorrect predictions lead to wasted work.

3. Lower flip-flop power. This one is a bit subtle. FAST cores tend to have deeper pipelines for high performance. The more pipeline stages, the higher the number of flip-flops in the core, which leads to more energy wasted in flops.

Summary: Multi-cores have the potential to be more energy-efficient. They can be a big win if architects can strike the right balance between the energy-efficiency benefit of multi-cores and the programmer effort required to write multi-threaded code. Unfortunately, this is not the case for today's multi-cores. Architects need to factor in the cost of programmer effort when designing multi-cores; after all, the mother of all metrics is the dollar.

  10 Responses to “Q & A: Do multicores save energy? Not really.”

  1. Good article Aater!

  2. RT @FutureChips Do multicores save energy? Not really. http://bit.ly/ir2j87

  3. Do multicores save energy? The title of this article is misleading. The text itself answers it with a “Yes”. Also, the programmer effort is not really more. It’s just the legacy of having educated people in sequential programming only…

    • Frank, I understand where you are coming from about the title being misleading. When I wrote the title, I deliberated over it too. After a lot of thinking, I still believe the title is apt. Let me know what you think of my thought process below:

      The point I want to make is that multi-core by itself does not save energy. If energy saving were the only metric you cared about, you could get more energy savings with a single core. Multi-core is a path to getting higher performance at low energy if and only if the hardware of each core is made more energy-efficient and the software is willing to lift the parallel programming burden. Thus, the energy comparison between multi-core and single-core is not apples to apples. An energy-saving technique, to me, is one that can reduce energy at the same performance without requiring programmer effort, e.g., better power management and clock gating. To present a concrete example, the core architecture of Intel's Nehalem was lower power and higher performance, and did not increase programmer effort compared to its predecessor, Core2. Saying that “multi-core saves energy” is more misleading, IMO. I notice that you have multi-core experience yourself, so I am very interested in hearing what you think.

      Now to your second point, that parallel programming is not difficult. I do dare to disagree with this one. It is not just a problem of people being educated only in sequential programming; good parallel programming is indeed harder (at least with our current tools; tomorrow may be better). Perhaps better education and instilling a parallel mindset could help programmers identify parallelism easily or avoid common pitfalls, but that is only a small part of parallel programming.

      Parallel programming introduces tons of new trade-offs that did not exist in sequential programming: thread waiting, bandwidth limitations, cache sharing, contention for data and resources, false sharing, etc. It suddenly requires programmers to think about hardware and software at a whole different level. If we were talking only about HPC workloads, I would half-heartedly agree that they are comparatively easy to parallelize. However, as soon as we talk about something complex, like even a parallel implementation of the traveling salesman problem (TSP), the required effort goes up exponentially. You may want to disagree, but I also argue that parallel code is more sensitive to hardware changes such as memory bandwidth and on-chip interconnect latency/bandwidth, which implies that programmers must make more platform-specific changes than with single-threaded code. This poses a major challenge when the programmer does not know the target platform and input set, which is increasingly the case with ubiquitous multi-core computing.

    • Frank, I have written this post today which demonstrates my point about parallel programming. It may be an interesting read for you, and your comments will be appreciated.

      http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

  4. Wrong about frequency vs. power, BTW.

    The way it ACTUALLY works is that power is linear with frequency and squared with voltage. 1/2 the speed, 1/2 the power; 1/2 the voltage, 1/4 the power.

    However, the voltage determines the maximum speed. As you reduce speed, you reduce power. Now, you do not need as much voltage, so you can lower that a little and save even more power.

    Note that all of this assumes a “perfect” CMOS chip where it is all dynamic power and no leakage. These days, leakage power (which is power consumed even without a clock) is getting larger and larger due to smaller geometries.
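The model Harrkev describes can be sketched numerically (a hypothetical dynamic-power-only model; the assumption that attainable voltage scales roughly linearly with frequency is what yields the cube relationship discussed in the article):

```python
# Dynamic CMOS power: P = C * V^2 * f (switched capacitance C folded
# into a constant). Leakage is ignored in this sketch.
def dynamic_power(voltage, freq, c=1.0):
    return c * voltage ** 2 * freq

base = dynamic_power(1.0, 1.0)
# Halving frequency alone: linear in f -> half the power.
print(dynamic_power(1.0, 0.5) / base)   # 0.5
# Halving voltage alone: squared in V -> a quarter of the power.
print(dynamic_power(0.5, 1.0) / base)   # 0.25
# Halving both (V tracking f): 0.25 * 0.5 -> an eighth, the f^3 behavior.
print(dynamic_power(0.5, 0.5) / base)   # 0.125
```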

    • Hey, thanks for reading. Your logic is 100% correct. I would add that voltage does depend on frequency (which you point out too), making power proportional to the cube of frequency. I did not write cube because of your second point: not all chips are perfect, and voltage isn't simply proportional to frequency. Square is a middle ground, but I did not want to get into those details because it would derail the article. I have now added a note about it referring to your insightful post.

      btw, very insightful comment. I would love to hear your comments about my future posts as well.

      Thanks.

  5. The CPU is not the only component of a battery-powered device. In a lot of systems, the backlit display and the WLAN or cellular radio take more power than the CPU.

    A single core runs at 10 watts plus the LCD+radio and completes a program in 10 seconds, using 100 J plus the energy the LCD+radio consumes over that period.

    A dual core runs at 20 watts plus the LCD+radio and completes the program in 5 seconds, using 100 J plus only half as much LCD+radio energy.
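This system-level point can be put into numbers (a sketch, assuming a hypothetical 5 W fixed draw for the display and radio on top of the core power from the article):

```python
# Total energy including a fixed platform cost (display + radio).
FIXED_W = 5.0  # hypothetical LCD + radio power, in watts

def total_energy(core_power, seconds):
    return (core_power + FIXED_W) * seconds

single = total_energy(10.0, 10.0)  # (10 + 5) W * 10 s = 150 J
dual   = total_energy(20.0,  5.0)  # (20 + 5) W *  5 s = 125 J
print(single, dual)  # finishing sooner halves the fixed-cost energy
```

Under this assumption, the dual core wins at the system level even though the core energy alone is a wash, because it races to the finish and cuts the time the fixed components stay on.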

