32-Core Processors: Intel Reaches For (The) Sun

The Project "Keifer" Awakening
Author: Patrick Schmid
Date: July 10, 2006 10:15
Source: Tom's Hardware

Intel is rolling out its Core 2 micro-architecture now. The Xeon 5100 server processor, aka Woodcrest, was released only weeks ago; Core 2 Duo for the desktop (Conroe) is expected on July 27th, and the mobile version, Merom, will follow only weeks later. The next milestone is quad-core processors, which the firm will produce by fitting two Woodcrest dual-core dies inside one physical processor package (Clovertown). You may have realized that there is a product development pattern behind recent and upcoming Intel multi-core processor releases. Amazingly enough, Intel has been studying Sun's UltraSPARC T1 (Niagara) to come up with a radical processor redesign for 2010 that could perform 16 times faster than Woodcrest. This is no marketing blurb, guys; this is technical intelligence from within the Borg collective.


Single/dual socket Core 2 micro-architecture versus Sun's Niagara in best/worst case scenarios. Intel believes it can beat the deadly dinosaur by 2010. The steep Intel slopes for 2008 and 2010 represent the upgrades to 45 and 32 nm as well as possible micro-architecture updates.

I have to say I can't remember performance gains anywhere near 16x in only four years. Comparing a 2002 Pentium 4 3.06 GHz with a Core 2 Extreme 2.93 GHz will give you a two- to five-fold increase at most. 16x more performance from 32 cores in 2010 versus today's two cores, should it come true, equals linear scaling, which means that performance would double every time the core count doubles. Many of you will say this is utterly impossible, because even sustaining the clock speed levels at doubled core count might be difficult - and I agree, unless you start to think outside the box.
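To make the "linear scaling" claim concrete, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration; the only inputs taken from the article are the two core counts (2 and 32) and the 16x target.

```python
import math

def core_ratio(cores_now, cores_future):
    """How many times more cores the future chip has."""
    return cores_future / cores_now

def doublings(cores_now, cores_future):
    """How many times the core count has to double to get there."""
    return math.log2(cores_future / cores_now)

ratio = core_ratio(2, 32)        # 32 cores versus today's 2
steps = doublings(2, 32)         # number of core-count doublings
linear_speedup = 2 ** steps      # performance if every doubling doubles throughput

print(f"core ratio: {ratio:.0f}x, doublings: {steps:.0f}, linear speedup: {linear_speedup:.0f}x")
```

Under perfectly linear scaling, the speedup simply equals the core ratio; any loss per doubling would leave the result below 16x.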

Santa Clara had some of its best brains compare the server processor roadmap with Sun's UltraSPARC T1 and its expected future offspring. The result is a project code-named Keifer. Although it was designed to come up with an architecture to beat the pants off Sun in the server market by 2010, Keifer may easily be the technical basis for future server and mainstream processors as well.

Intel Runs On Complex Cycles

Before we talk about Keifer we will look at Intel's processor development cycle to get a better understanding of how the CPU chess board is laid out. Many of the upcoming products can be anticipated by knowing their projected time frame and keeping track of Intel's manufacturing technologies. Building a processor factory (these are called fabs) tears a multi-billion dollar hole in a semiconductor company's budget, which is why the company needs to make sure that the intended products will be profitable. Over time, Intel has come up with a highly efficient cycle that it seems to be applying now:

1. Deploy a new manufacturing process in an odd year (e.g. 65 nm in 2005) and update or shrink the current products (Pentium 4, Pentium D) to maximize yield rates for the upcoming product generation as early as possible. Launching mobile processors first is also beneficial, as these do not require the highest clock speeds.
Example: The 65 nm process was introduced in 2005 and first powered the Pentium 4 6x1 and Pentium D 900 processors. The Pentium M has always adopted new manufacturing technologies first (90 nm Dothan, 65 nm Yonah).

2. Deploy a powerful new processor in the following, even year (e.g. Core 2 in 2006). This new CPU implements the latest micro-architecture and a balanced mix of an ideal core count, features and clock speed, so it reaches the goals for performance and performance per Watt based on current fabrication.
Example: Core 2 launches July 27.

3. Derive low-cost products from the current single-die to maximize yields and profit. A partly defective Core 2 dual core processor may still be sold as a cache reduced-version (Allendale) or a single-core model (Millville) - depending on where the defects are located.
Example: Core 2 Duo E6400, E6300, E4200 (2 MB L2 cache only); a single-core Core 2 will follow (possibly as Celeron).

4. This step only applies to the multi-core generation: Create a next-generation product by merging two current single-die processors into one processor package. Clock speed might have to be reduced to stay within the given thermal envelope, but this offers a cost-effective way of doubling the core count.
Example: Pentium D Presler is based on two Pentium 4 Cedar Mill dies, Core 2 Kentsfield will be based on two Core 2 Duo Conroe dies.

5. At this point, the existing product(s) should be ready to be adjusted to the next-generation manufacturing process (45 nm in 2007). Clock speed can be adjusted according to progress in manufacturing.
Example: 45 nm will be introduced in 2007.

Depending on how well this machine is oiled, the cycle may be somewhat longer or shorter than two years. With this principle in mind, it is logical that the first physical quad-core processor, Harpertown, will arrive in 2008. For the time being, refinements in the 65 nm process should allow for adjusted Conroe and Kentsfield clock speeds at unchanged energy consumption within the given thermal envelope specifications. Harpertown will also be the basis for the eight-core Gainstown that could follow after a few months.
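The cycle above can be sketched as a small Python function. This is only an illustration under a strict two-year cadence, which the article itself says may stretch or shrink. The function name and its parameters are made up, and the 2009 date for 32 nm is an extrapolation from the pattern, not something the article states.

```python
def roadmap(first_process_year, nodes_nm, cycle_years=2):
    """Return (process_year, node_nm, new_product_year) for each process node.

    Assumes a new process arrives every `cycle_years` years and the new
    product generation launches in the following year.
    """
    plan = []
    for i, node in enumerate(nodes_nm):
        process_year = first_process_year + i * cycle_years
        plan.append((process_year, node, process_year + 1))
    return plan

plan = roadmap(2005, [65, 45, 32])
for process_year, node, product_year in plan:
    print(f"{node} nm process: {process_year}, new product generation: {product_year}")
```

With 65 nm starting in 2005, the sketch puts new product generations in 2006, 2008 and 2010, which lines up with the Core 2 launch and the 2008 and 2010 steps in Intel's chart.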

Scaling Bottlenecks

While multiple processor cores on a single die communicate with each other directly, building multi-core processors by combining distinct processor dies forces the cores to communicate via the processor interface, which, in the case of Intel's mainstream desktop and server lineup, is the Front Side Bus. This has been criticized as a huge bottleneck for multi-core configurations ever since Intel released its first dual-core processor, the Pentium D 800 aka Smithfield. Whenever one core accesses data located in another processor's L1 or L2 cache, it must use the Front Side Bus, which eats away at the available bus bandwidth.

For this very reason, the Core 2 generation implements a large, unified L2 cache, which means it is shared by two cores. However, as soon as you pack two dual-core dies onto a physical processor to build quad cores, the FSB bottleneck issue is back again - and it is probably even worse, as there are more cores fighting over more data in larger L2 caches. Intel's countermeasure consists of a bus clock speed upgrade. The server platform already runs at 333 MHz (FSB1333 quad-pumped); the desktop platform will probably receive the upgrade by the time the first quad-core product hits the market.
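A rough Python sketch shows what the quad-pumped 333 MHz bus delivers and how thin that gets once four cores share it. The 8-byte (64-bit) data bus width is an assumption that does not come from the article, the function name is made up, and the even split between cores is a simplification that ignores snoop traffic.

```python
def fsb_bandwidth_gb_s(bus_clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    """Peak Front Side Bus bandwidth in GB/s.

    transfers_per_clock=4 models the quad-pumped bus; bus_width_bytes=8 is an
    assumed 64-bit data path (not stated in the article).
    """
    transfers_per_second = bus_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bytes / 1e9

total = fsb_bandwidth_gb_s(333)   # server platform, FSB1333
per_core_quad = total / 4         # naive even split across four cores

print(f"peak bus bandwidth: {total:.2f} GB/s, per core in a quad: {per_core_quad:.2f} GB/s")
```

The point of the sketch is the division: every core added to the package takes its share from the same fixed bus bandwidth.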


The second bottleneck is access to the system's main memory. The memory controller is not a part of the processor, but resides in the chipset northbridge on the motherboard. Again the Front Side Bus is used to connect the processor(s) to the motherboard core logic, which leaves two or more cores fighting over memory access. AMD integrated the memory controller into its processors as early as 2003, which shortens the memory access path and improves performance because the controller operates at full CPU core clock speed. The real advantage of on-die memory controllers becomes obvious in multiprocessor environments, where each CPU can access its own memory at maximum bandwidth.

There is the issue of memory coherency, but the Opteron, for example, is smart enough to deal with it at up to four processors. We believe there are two reasons why Intel has not integrated the memory controller. First, nobody embraces changes as long as the business does not require them. Second, there is a chipset business that Intel may want to defend. Moving the memory controller into the processor would eliminate platform selling points: compatibility, continuity and features that are exclusively available on Intel platforms (think of I/OAT).

What If The Memory Controller Were Integrated?

At some point the memory controller simply has to be relocated into the processor for the reasons described above. Adding bigger cache memories certainly helps, but if you have four or many more processor cores working on your applications, you need to make sure that they don't run out of data - who needs multi-lane freeways if there aren't sufficient entries and exits to access them?

In addition, 45 and 32 nm manufacturing processes will allow the RAM access logic to become a part of the processor die at very little additional cost. So, expect memory controllers to move into Intel processors in the future. I'm sure that some of you feel compelled to refer to AMD and its memory controller integration that happened as early as 2003. Well, I have to ask you to read on, as there is actually a concept behind this move; a concept that needs some more explanation.

Intel To Develop A Sun-Blocker?

The baseline for Intel's internal analysis is Sun's Niagara (UltraSPARC T1) processor. It is a 90 nm, single-die, eight-core 1.2 GHz server-type processor with four threads per core, four L2 cache banks (3 MB in total) that can all be accessed via a crossbar interface, four dual-channel DDR2-400 memory interfaces and a total of 279 million transistors at 379 mm² die size. All of this comes at a low 72 W peak power consumption, making such a product a serious threat.

We assume that a single Niagara processor is approximately twice as fast as the dual-processor, dual-core Woodcrest setup that Intel delivers today (eight 1.2 GHz cores vs. four 3 GHz cores). According to Intel's competition analysis, future 65 and 45 nm Niagara-type processors might double the thread count and L2 cache size with each generation, while upgrading to the latest memory technologies. Intel wants to be prepared and believes that a well-structured, multi-core approach with a smart L3 cache design can block Sun.
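Intel's projection for Sun can be written down as a short Python sketch. It simply doubles thread count and L2 size per process generation, starting from the Niagara figures quoted above. The resulting numbers are a projection of that assumption, not announced Sun products, and the function and dictionary names are made up.

```python
niagara_t1 = {"process_nm": 90, "cores": 8, "threads_per_core": 4, "l2_mb": 3}

def project_generations(base, nodes_nm):
    """Double threads and L2 cache for each later process node (Intel's assumption)."""
    threads = base["cores"] * base["threads_per_core"]
    l2_mb = base["l2_mb"]
    generations = [(base["process_nm"], threads, l2_mb)]
    for node in nodes_nm:
        threads *= 2
        l2_mb *= 2
        generations.append((node, threads, l2_mb))
    return generations

projection = project_generations(niagara_t1, [65, 45])
for node, threads, l2_mb in projection:
    print(f"{node} nm: {threads} threads, {l2_mb} MB L2")
```

Today's Niagara already runs 32 threads per chip (eight cores times four threads), so two doublings would take a 45 nm part to 128 threads under this assumption.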

Keifer Carries 32 Cores in 8 Nodes

 

Intel's most important weapon is probably its advanced manufacturing. It has been using the 65 nm process for almost a year, while AMD and Sun are still in transition. If you now assume that Intel will again reach 45 and 32 nm well ahead of them, it could deploy a larger core count and more cache than its competitors.

As the chart on the first page of this article shows, Intel expects the jump from eight to 16 cores to provide a 50% performance improvement. Project Keifer, which would be a complete redesign and directly go to 32 cores, may provide a whopping 100% performance jump when compared to a 16-core processor in 2010.
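Chaining the two quoted jumps gives a feel for where Keifer would land relative to an eight-core part. This Python sketch uses only the two percentages from the paragraph above; the baseline of 1.0 and the function name are arbitrary illustrative choices.

```python
def compound(baseline, gains):
    """Apply successive fractional gains (0.5 = +50%) to a baseline level."""
    level = baseline
    levels = [level]
    for gain in gains:
        level *= 1.0 + gain
        levels.append(level)
    return levels

# 8 cores -> 16 cores (+50%) -> 32-core Keifer (+100% over 16 cores)
levels = compound(1.0, [0.50, 1.00])
eight_core, sixteen_core, keifer = levels

print(f"16 cores: {sixteen_core:.1f}x, Keifer: {keifer:.1f}x the eight-core baseline")
```

Taken at face value, four times the cores would yield three times the eight-core performance, so most of the gain comes from the redesign step rather than from the eight-to-16 jump.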

The key to these wet dreams is a modular design approach that is based on eight processing nodes, each carrying a common 3 MB L3 cache (24 MB total) and four processor cores with 512 kB shared L2 cache. A ring interconnect, similar to what ATI deployed in its Radeon X1900 memory subsystem, will provide quick communication between the nodes.


Each Keifer node will carry four cores, each with 32 kB data and 32 kB instruction cache, as well as a 512 kB L2 cache. Limiting the node's L2 cache to this capacity seems to provide better performance than fewer cores with larger caches.
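The node layout adds up as follows. This is a minimal Python sketch of the described topology, not Intel's design data. The class and field names are invented, the L1 sizes are assumed to be per core, and the 3 MB per-node cache is treated as the L3 that the article refers to elsewhere.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class KeiferNode:
    cores: int = 4
    l1_data_kb: int = 32     # assumed to be per core
    l1_instr_kb: int = 32    # assumed to be per core
    l2_kb: int = 512         # shared by the four cores of the node
    l3_mb: int = 3           # assumed to be the node's L3 slice

@dataclass(frozen=True)
class KeiferChip:
    nodes: int = 8
    node: KeiferNode = field(default_factory=KeiferNode)

    @property
    def total_cores(self):
        return self.nodes * self.node.cores

    @property
    def total_l2_mb(self):
        return self.nodes * self.node.l2_kb / 1024

    @property
    def total_l3_mb(self):
        return self.nodes * self.node.l3_mb

chip = KeiferChip()
print(f"{chip.total_cores} cores, {chip.total_l2_mb:.0f} MB L2, {chip.total_l3_mb} MB L3")
```

Changing `nodes` or `cores` in the sketch also shows how cut-down models with fewer nodes or fewer cores per node would look.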

Every Node Gets Its Memory Controllers

 

Let's talk about the integrated memory controller. If you think of 32 cores per chip, you have to find a way to implement the memory logic without creating new bottlenecks. A single high-bandwidth, multi-channel DDR controller as a shared resource would not perform well. The other extreme would be a dedicated memory controller per core, which is technically not feasible at such a core count. But a memory controller per node certainly is, which is exactly what Intel is thinking of.

Eight nodes would provide eight 12.8 GB/s FBD2-1066 interfaces, resulting in a total bandwidth of 102.4 GB/s in Intel's current projections. Four cores sharing a memory unit sounds like a reasonable compromise, and the ring interconnect would provide an adequate inter-node communication pathway.
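The bandwidth figures are easy to check with a few lines of Python. The sketch uses only the numbers quoted above; the per-core share assumes that the four cores of a node split their controller evenly, which is a simplification, and the function name is made up.

```python
def memory_bandwidth(nodes=8, gb_s_per_controller=12.8, cores_per_node=4):
    """Return (total GB/s across all nodes, GB/s per core within a node)."""
    total = nodes * gb_s_per_controller
    per_core = gb_s_per_controller / cores_per_node
    return total, per_core

total_gb_s, per_core_gb_s = memory_bandwidth()
print(f"total: {total_gb_s:.1f} GB/s, per core: {per_core_gb_s:.1f} GB/s")
```

Eight controllers at 12.8 GB/s each give the 102.4 GB/s total, and each core's share stays constant as nodes are added, which is the main argument for one controller per node.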

This very modular approach is not only promising from the performance standpoint; it also makes a lot of sense from a business perspective. Processors with defective cores could be turned into models with a smaller node count, or a smaller core count per node. Silicon with defective L3 cache areas could be turned into models with less L3 cache, etc.

Whether all eight memory controllers will actually be used will be the customer's decision, but it is already very obvious that such a single-processor, 32-core server with only eight memory modules would be amazingly inexpensive and breathtakingly fast.

Final Words

We are not entirely sure how long the Keifer project has been evaluated inside Intel, but it must be at least half a year. At the same time, we've heard rumors that the project might already be dead. The documents we received from undisclosed sources are dated March to May 2006. However, the information is very interesting, as it shows which direction Intel may be going in the upcoming years when it's time to replace Core. It also shows that the decision on future processors is usually made analytically, several years in advance, and tightly coupled to available manufacturing technologies.
