Intel’s Second-Gen Core CPUs: The Sandy Bridge
Although the processing cores in Intel’s Sandy Bridge architecture are decidedly similar to Nehalem, the integration of on-die graphics and a ring bus improves performance for mainstream users. Intel’s Quick Sync is this design’s secret weapon, though.
Editor’s Note: Eager to show off what it has done with Intel’s Sandy Bridge architecture, system builder CyberPower PC is offering Tom’s Hardware's audience the opportunity to win a new system based on Intel’s Core i7-2600K processor. Read through our review, and then check out the last page for more information on the system, plus a link to enter our giveaway!
The high-end desktop processor market is a one-horse race, with Intel’s LGA 1366-based Core i7-900-series CPUs pretty much tromping along uncontested. If you have the money and are building a performance-oriented machine, it’s hard to beat an overclocked Core i7-950. Power users who really need the punch of a six-core chip can go Core i7-970—just be ready to pay out the ears for the privilege of owning one.
It’s the mainstream where we see more interesting battles being waged. Funny how healthy competition has a habit of forcing more aggressive prices, isn’t it? For example, the quad-core Core i5-760 is compelling at $200. But so is AMD’s six-core Phenom II X6 1075T. And while AMD’s Black Edition parts captured the hearts of overclocking enthusiasts long ago, Intel more recently shipped a couple of K-series SKUs that bucked the company’s habit of only unlocking the multipliers on thousand-dollar Extreme Edition parts.
And now we have a new architecture from Intel, called Sandy Bridge. The last time Intel launched a processor design, it started with high-end Core i7-900-series chips and let the technology trickle down to the mainstream and entry-level derivatives. This time is different. Sandy Bridge is going to have to swim its way upstream, surfacing on the flagship LGA 2011 interface in the second half of this year for the real workstation-oriented aficionados.
Intel's Sandy Bridge Architecture
Intel’s Here And Now
That’s a long way away, though. Between now and then, LGA 1366 is supposed to remain at the top of Intel’s stack, while LGA 1155-based processors centering on Sandy Bridge gobble up all of the volume as a result of what Intel claims is a ~30% performance improvement versus the Lynnfield- and Clarkdale-based processors.
Naturally, this means trouble for an AMD that continues to launch incrementally faster versions of its existing architecture—but nothing that’d give it the double-digit speed-up needed to fend off a new microarchitecture from its competition. The only way to strike back at this point is with lower prices, and that's probably not the route AMD wants to be taking. We expect Bulldozer, the company's own next-gen architecture, sometime in 2011; that launch can't come soon enough.
A large enough boost from Sandy Bridge would make Intel’s Core i7-900-series vulnerable too, though. Right now, these are, at minimum, $300 parts (that’s just to get in the door with a -950) that drop into generally more expensive motherboards requiring pricier triple-channel memory kits. I’ve been saying all along that the X58 platform would remain, definitively, Intel’s crown jewel on the desktop. But after running the numbers I’ve run on Sandy Bridge, I have to wonder if X58’s days are numbered a little sooner than the company planned.
Sandy Bridge has a couple of other surprises up its sleeve—not all of them destined to go down as smoothly as a 1996 Dom Perignon on New Year’s Eve. For one, overclocking on an Intel platform is drastically different, and the LN2-drinking crowd probably won’t like it very much. There’s also a big emphasis on integrated graphics, which we’ve seen prematurely praised as a potential alternative to entry-level discrete graphics. That doesn't turn out to be the case, at least on the desktop.
On the other hand, Sandy Bridge comes armed with a block of fixed function logic that specifically addresses video encoding. AMD and Nvidia have no answer to this, are a year behind Intel with a competitive solution, and get completely outperformed today in video workloads. We also have a couple of unlocked SKUs that really give this architecture, manufactured at 32 nm, room to stretch its legs.
Putting Sandy Bridge To The Test
Leading up to the Sandy Bridge architecture’s launch, Intel sent over four SKUs from its upcoming lineup: Core i7-2600K, Core i5-2500K, Core i5-2400, and Core i3-2100. We put all four processors through a brand new benchmark suite for 2011, along with Bloomfield-, Lynnfield-, Clarkdale-, and Yorkfield-based chips from Intel, plus Thuban- and Deneb-based CPUs from AMD.
Inside Of Sandy Bridge: Cores And Cache
From 10,000 feet, the Sandy Bridge die you saw on the previous page looks like a complete departure from its predecessor. After all, the mainstream Clarkdale-based CPUs consisted of two physical chips: a dual-core CPU manufactured at 32 nm and a graphics core/integrated memory controller/PCI Express controller etched at 45 nm. Now we’re looking at a single 32 nm part with all of those capabilities crammed onto one piece of silicon. Drill down, though, and there are a lot of similarities; the changes turn out to be more evolutionary in nature.
For each piece of Sandy Bridge that you look at, keep one word in mind: integration. Intel wanted to get the most out of each of the architecture’s nearly 1 billion transistors (the official count is 995 million).
There are actually three different versions of the Sandy Bridge die shipping at launch. The quad-core configuration—the one composed of 995 million transistors—measures 216 mm². Then, there’s a dual-core die with 12 execution units making up its graphics engine. That one features 624 million transistors on a 149 mm² die. Finally, the slimmest variation sports two cores and a graphics engine composed of six EUs. Though it’s flush with 504 million transistors, you’d hardly know it given the 131 mm² die size.
|Die|Die Size (mm²)|Transistors (million)|
|---|---|---|
|Sandy Bridge (4C)|216|995|
|Sandy Bridge (2C, HD Graphics 3000)|149|624|
|Sandy Bridge (2C, HD Graphics 2000)|131|504|
In comparison, the 45 nm Lynnfield design that served as the foundation for Intel’s Core i7-800- and Core i5-700-series chips measured a more portly 296 mm², despite the fact that it only consisted of 774 million transistors. Intel’s architects clearly owe much of what they were able to cram into Sandy Bridge to the engineers who brought the 32 nm node online for Westmere (tick) and then dialed it in for today’s launch (tock).
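To put the die comparison in perspective, here is a quick back-of-the-envelope sketch that turns the figures quoted above into transistor density. The die areas and transistor counts come straight from this page; the function names are purely illustrative, and density is a rough indicator at best, since logic, cache, and graphics blocks pack very differently.

```python
# Die figures quoted in the article: (area in mm^2, transistors in millions).
DIES = {
    "Sandy Bridge (4C)": (216, 995),
    "Sandy Bridge (2C, HD Graphics 3000)": (149, 624),
    "Sandy Bridge (2C, HD Graphics 2000)": (131, 504),
    "Lynnfield (45 nm)": (296, 774),
}


def density(area_mm2, transistors_million):
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_million / area_mm2


def density_table(dies=DIES):
    """Map each die name to its density, rounded to two decimals."""
    return {name: round(density(a, t), 2) for name, (a, t) in dies.items()}


if __name__ == "__main__":
    for name, d in density_table().items():
        print(f"{name}: {d} M transistors/mm^2")
```

By this measure, the quad-core Sandy Bridge die packs roughly 1.76 times as many transistors into each square millimeter as Lynnfield did.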
In their current state, Sandy Bridge-based processors are available with four cores (with and without Hyper-Threading) and two cores (dual-core models all have Hyper-Threading enabled). As you’ll see in the benchmarks, these cores are, clock-for-clock, more powerful than what we saw from Nehalem.
Still present are the 32 KB L1 instruction and data caches (along with 256 KB L2 cache per core), though Sandy Bridge now incorporates what Intel calls an L0 instruction cache that holds up to 1500 decoded micro-ops. This feature has the dual effect of saving power and improving instruction throughput. If the fetch hardware finds the instruction it needs in that cache, it can shut down the decoders until they’re needed again. Intel also rebuilt Sandy Bridge’s branch prediction unit, improving its accuracy.
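A toy model helps show why a decoded micro-op cache pays off on loops. The sketch below is not Intel's design: the only figure taken from the article is the 1500 micro-op capacity, while the class name, the least-recently-used eviction policy, and the one-micro-op-per-instruction default are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyUopCache:
    """Hypothetical LRU model of a decoded micro-op cache (not the real layout)."""

    def __init__(self, capacity_uops=1500):
        self.capacity = capacity_uops
        self.entries = OrderedDict()  # instruction address -> micro-op count
        self.used = 0
        self.hits = 0      # fetches served from the cache (decoders can idle)
        self.decodes = 0   # fetches that had to wake the decoders

    def fetch(self, address, uop_count=1):
        """Return True on a hit; on a miss, decode and cache the micro-ops."""
        if address in self.entries:
            self.entries.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return True
        self.decodes += 1
        if uop_count > self.capacity:
            return False  # too large to cache at all
        # Evict least recently used entries until the new one fits.
        while self.used + uop_count > self.capacity:
            _, evicted = self.entries.popitem(last=False)
            self.used -= evicted
        self.entries[address] = uop_count
        self.used += uop_count
        return False


def run_loop(cache, body_len, iterations):
    """Fetch a loop body of body_len instructions, iterations times over."""
    for _ in range(iterations):
        for address in range(body_len):
            cache.fetch(address)


if __name__ == "__main__":
    cache = ToyUopCache()
    run_loop(cache, body_len=100, iterations=10)
    print(cache.decodes, cache.hits)
```

In this model, a 100-instruction loop run ten times only wakes the decoders on the first pass. A loop body larger than the cache thrashes under LRU and decodes every time, which is why the benefit depends heavily on the code being run.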
I ran these two single-threaded tests as a synthetic comparison of performance, clock for clock. Both quad-core chips are set to the same frequency with Turbo Boost and EIST disabled. As you can see, just the architectural shift makes a significant impact on Sandy Bridge's performance versus the Nehalem-based Lynnfield design.
Sandy Bridge-based processors are the first to support Advanced Vector Extensions (AVX), a 256-bit instruction set extension to SSE (AMD will also support AVX in its upcoming Bulldozer processor architecture). The impetus behind AVX comes from the high-performance computing world, where floating-point-intensive applications demand more horsepower than ever. Given those origins, AVX’s impact on Sandy Bridge will very likely be limited. Intel does, however, expect that audio processing and video editing applications should eventually be optimized to take advantage of AVX (along with the financial services analysis and engineering/manufacturing software that AVX is really designed to target). Unfortunately, there aren't any real-world apps optimized for AVX that we can test as a gauge of the capability's potential.
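The appeal of wider vectors is easy to see with a little arithmetic. The sketch below counts how many vector operations an idealized loop needs at 128 bits (SSE) versus 256 bits (AVX). The helper names are hypothetical, and this is a best-case bound only: it ignores memory bandwidth, data alignment, and anything else that keeps real software from scaling perfectly.

```python
import math


def lanes(register_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


def vector_ops(n_elements, register_bits, element_bits=32):
    """Idealized count of vector operations needed to touch n_elements once."""
    return math.ceil(n_elements / lanes(register_bits, element_bits))


if __name__ == "__main__":
    n = 1000  # single-precision floats to process
    print("SSE (128-bit):", lanes(128, 32), "floats per op,", vector_ops(n, 128), "ops")
    print("AVX (256-bit):", lanes(256, 32), "floats per op,", vector_ops(n, 256), "ops")
```

Eight single-precision values per operation instead of four halves the operation count on paper, which is exactly the kind of gain floating-point-heavy HPC code is after.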
Naturally, a lot of implementation work went into enabling AVX, including a transition from a retirement register file to a physical register file. This allows operands to be stored in the register file, rather than traveling with micro-ops through the out-of-order engine. Intel used the power and die size savings enabled by the physical register file to also significantly increase buffer sizes, more efficiently feeding its beefier floating-point engine.
As a consequence of increased integration, Intel had to address the ways bits and pieces of its processor were accessing the last-level cache (in Sandy Bridge, it’s the L3).
Back in the days of Bloomfield, Lynnfield, and Clarkdale, a four-core (and even six-core, in Westmere) ceiling meant that each physical core could have its own connection to that shared cache. The Xeon 7500-series processors were designed to be more scalable, though, and currently-shipping models feature as many as eight cores per CPU. Built the same way, that’d be an exorbitant number of traces between each core and the last-level cache. So, Intel adopted a ring bus that, in those enterprise environments, allows the company to keep scaling core count without the logistics getting out of control.
Earlier this year, I had the chance to talk to Sailesh Kottapalli, a senior principal engineer at Intel, who explained that he’d seen sustained bandwidth close to 300 GB/s from the Xeon 7500-series’ LLC, enabled by the ring bus. Additionally, Intel confirmed at IDF that every one of its products currently in development employs the ring bus. Think we’re going to see a continued emphasis on adding cores and other platform components directly to the CPU die? I’d say that’s a fair assumption.
Of course, Intel wasn’t worried about higher core count on the mainstream desktop version of Sandy Bridge. Rather, it was the on-die graphics engine that compelled a similar shift to the ring bus architecture, which now connects the graphics, up to four processing cores, and the system agent (formerly referred to as uncore) with a stop at each domain. Latency is variable, since each component takes the shortest path on the bus; overall, though, it’s always going to be lower than on a Westmere-based processor.
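To illustrate why latency on a ring varies with who is talking to whom, here is a minimal sketch that counts hops along the shorter direction between two stops. The six stops match the domains named above, but their order on the ring and the idea of counting whole hops as a stand-in for latency are assumptions for illustration, not details Intel has published here.

```python
# Hypothetical stop order; the article names the domains but not their layout.
STOPS = ["graphics", "core0", "core1", "core2", "core3", "system_agent"]


def ring_hops(src, dst, stops=STOPS):
    """Hops between two stops, taking the shorter direction around the ring."""
    n = len(stops)
    distance = abs(stops.index(src) - stops.index(dst))
    return min(distance, n - distance)


def average_hops(src, stops=STOPS):
    """Mean hop count from src to every other stop on the ring."""
    others = [s for s in stops if s != src]
    return sum(ring_hops(src, s, stops) for s in others) / len(others)


if __name__ == "__main__":
    for dst in STOPS[1:]:
        print(f"graphics -> {dst}: {ring_hops('graphics', dst)} hop(s)")
    print("average from core0:", average_hops("core0"))
```

With six stops, no request in this model ever travels more than three hops, and neighbors are a single hop apart, which is the variability the shortest-path behavior implies.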
At the end of the day, the ring bus’ most significant contribution is going to be the performance it facilitates in graphics workloads.