Sunday, November 23, 2014

Apple A8X vs Tegra K1 vs Snapdragon 805 - Tablet SoC Comparison (2014 Edition)

In the last few years, ultra-mobile System-on-Chip processors have made unprecedented strides in performance and efficiency, quickly raising the standards for mobile performance. One form factor that particularly benefits from the rapid growth of SoC performance is the tablet, since its large screen allows the processor's abilities to be fully utilized. For the holiday season of 2014, we have the latest and greatest of mobile performance shipping inside high-end tablets. Apple has made a whole new SoC just for its iPad Air 2 tablet, which it calls the A8X. Nvidia's Tegra K1 processor, which borrows Nvidia's venerable Kepler GPU architecture, has also appeared in a number of new high-end tablets. Finally, we also have the Qualcomm Snapdragon 805 processor found in the Amazon Kindle Fire HDX 8.9" (2014). Unfortunately, most other tablets either use the aging Snapdragon 801 processor or, in the case of Samsung's latest high-end tablets, use an even older Snapdragon 800 processor or the similarly old Exynos 5420 processor, which debuted with the Note 3 phablet in late 2013. In any case, at the pinnacle of tablet performance, we have the Apple A8X, the Tegra K1 and the Snapdragon 805 battling for the top spot.

| | Apple A8X | Nvidia Tegra K1 | Snapdragon 805 |
|---|---|---|---|
| Process Node | 20nm | 28nm HPM | 28nm HPM |
| CPU | Tri-core "Enhanced Cyclone" (64-bit) @ 1.5GHz | 32-bit: Quad-core ARM Cortex-A15 @ 2.3GHz; 64-bit: Dual-core Denver @ 2.5GHz | Quad-core Krait 450 @ 2.5GHz |
| GPU | PowerVR GXA6850 @ 450MHz (230 GFLOPS) | 192-core Kepler GPU @ 852MHz (327 GFLOPS) | Adreno 420 @ 600MHz (172.8 GFLOPS) |
| Memory Interface | 64-bit Dual-channel LPDDR3-1600 (25.6GB/s) | 64-bit Dual-channel LPDDR3-1066 (17GB/s) | 64-bit Dual-channel LPDDR3-1600 (25.6GB/s) |
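The bandwidth figures in the Memory Interface row can be reproduced with a quick back-of-the-envelope calculation. The sketch below is only that, a sketch: the function name is made up for illustration, and it assumes that "LPDDR3-1600" means 1600 million transfers per second per pin and that 1GB is 10^9 bytes.

```python
def peak_bandwidth_gbs(channels: int, channel_width_bits: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (assuming 1 GB = 10**9 bytes)."""
    # Bytes moved per transfer across all channels combined.
    bytes_per_transfer = channels * channel_width_bits / 8
    # Transfers per second times bytes per transfer, scaled to GB/s.
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Dual-channel, 64 bits per channel, as listed in the spec table.
lpddr3_1600 = peak_bandwidth_gbs(2, 64, 1600)  # A8X and Snapdragon 805
lpddr3_1066 = peak_bandwidth_gbs(2, 64, 1066)  # Tegra K1

print(f"LPDDR3-1600: {lpddr3_1600:.1f} GB/s")
print(f"LPDDR3-1066: {lpddr3_1066:.1f} GB/s")
```

These are theoretical ceilings; the bandwidth an application actually sees is lower.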


The CPU

It can certainly be said that all of this year's high-end mobile processors have excellent CPU performance. However, each manufacturer took a different path to reach those high performance demands, and that is what we'll be looking at in this section.

Starting with the A8X's CPU, what we have in hand is Apple's first CPU with more than two cores. This time we have a tri-core CPU, based on an updated revision of the Apple-designed Cyclone core, which utilizes the ARMv8 ISA and is therefore a 64-bit architecture. Clock speeds remain conservative with Apple's latest CPU, going no further than 1.5GHz. So with three cores at 1.5GHz, how does Apple get performance competitive with quad-core, 2GHz+ offerings from competitors? The answer lies within the Cyclone core.

The Cyclone CPU, now in its second generation, is a very wide core. It can issue up to 6 instructions per clock, and each Cyclone core contains 4 ALUs, as opposed to 2 ALUs per core in Apple's previous CPU architecture, Swift. The reorder buffer has also been increased to 192 instructions, in order to avoid memory stalls and make fuller use of the 6 execution pipelines. In comparison, a Cortex-A15 core can co-issue up to 3 instructions per clock, half as many as Cyclone, and can hold up to 128 instructions in its reorder buffer, only two thirds of what Cyclone's reorder buffer can hold.

By building a very wide CPU architecture, and keeping its CPUs to low core counts and clock speeds, Apple has, in one move, achieved excellent single-threaded performance, far beyond what a Cortex-A15 or a Krait core can produce, while at least matching the quad-core competition in multi-threaded processing. I've always said that, because most code is serial by nature and hard to spread across many threads, CPUs should be way more efficient if they are built emphasizing single-threaded performance, and Apple continues to do the right thing with Cyclone.

The Snapdragon 805 is the last high-end SoC to utilize Qualcomm's own Krait CPU architecture, which was introduced WAY back with the Snapdragon S4. Needless to say, it's still a 32-bit core. The last revision of the Krait architecture is dubbed Krait 450. While Krait 450 carries many improvements compared to the original Krait core, the basic architecture is still the same. Like the Cortex-A15, Krait is a 3-wide machine, capable of co-issuing up to 3 instructions at once. Compared to Cyclone, it's a relatively small core, so it won't be as fast in terms of single-threaded performance. Krait 450's tweaked architecture allows it to run at a whopping 2.7GHz, or to be more exact, 2.65GHz. In the case of the Snapdragon 805, we have four of these Krait 450 cores. Qualcomm's signature architecture tweak, which involves putting each core on an individual voltage/frequency controller, allows each core to run at a different frequency. That reduces the power consumption of the SoC, and should translate into better battery life. With four cores, and at such a high frequency, the Snapdragon 805's CPU gets very good multi-threaded performance, although the relatively narrow Krait core hurts single-threaded performance quite a bit.

Finally, we have the Tegra K1 and its two different versions. The 32-bit version of the Tegra K1 employs a quad-core Cortex-A15 CPU clocked at up to 2.3GHz, and we've seen a CPU configuration like this in so many SoCs that by this point it's a very well known quantity. The interesting story here is the 64-bit Tegra K1, which uses a dual-core configuration of Nvidia's brand new custom CPU architecture, named Denver. If you don't care much to know about Denver's architecture, you'd better skip the next section, because there is A LOT to say about Nvidia's custom CPU.

Denver: The Oddest CPU in SoC history

Denver is Nvidia's first attempt at making a proprietary CPU architecture, and for a first attempt it's actually very good. Some of Nvidia's expertise as a GPU maker has translated into its CPU architecture. For instance, Denver works with VLIW (Very Long Instruction Word) instructions, an approach more commonly associated with GPUs and DSPs than with mobile CPUs. Basically, this means that several independent operations are packed into one long instruction "word", and only then are sent into the execution pipelines.

Denver's most peculiar characteristic might be this one: it's an in-order machine, while basically every other high-end mobile CPU has Out-of-Order Execution (OoOE) capabilities. Denver's lack of a dedicated engine that reorders instructions in order to reduce memory stalls and therefore increase the IPC (Instructions Per Clock) should be a huge performance bottleneck. However, Nvidia employs a very interesting (and in my opinion unnecessarily complicated) way of dealing with its in-order architecture.

By not having a hardware OoOE engine built into the CPU, Nvidia has to rely on software tricks to reorder instructions and enhance ILP (Instruction Level Parallelism). Denver is actually not meant to decode ARM instructions most of the time. Rather, Nvidia chose to build a decoder that would run native instructions, optimized for maximum ILP. For this optimization to occur, Nvidia has implemented a Dynamic Code Optimizer (DCO). Basically, the DCO's job is to recognize ARM code that is being sent to the CPU frequently, translate it into native instructions and optimize it by reordering instructions to reduce memory stalls and maximize ILP. For this to work, a small part of the device's main memory must be reserved to store the optimized code.

One implication of this system is that the CPU must be able to decode both native instructions and normal ARM instructions. For this purpose there are two decoders in the CPU block: one huge 7-wide decoder for native instructions generated by the DCO, and a secondary 2-wide decoder for ARM instructions. The difference in size between the two decoders shows that Nvidia expects the native instructions to be used most of the time. Of course, the first time a program is run, there are no optimized native instructions ready for the native decoder to use, so only the ARM decoder is used until the DCO starts recognizing recurring ARM code in the program and optimizes it. From that point onwards, that specific code always goes through the native decoder. If a program ran the same instructions multiple times (for example, a benchmark program), eventually all of the program's instructions would have a corresponding optimized native version stored, and then only the native decoder would be utilized. That would correspond to Denver's peak performance scenario.
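To make the flow a bit more concrete, here is a toy model of the idea described above: code that runs often enough gets translated once, stored in a reserved cache, and from then on takes the native path. Everything in it (the class name, the threshold of 3 runs, the placeholder "optimization") is a made-up illustration of the concept and not a description of how Nvidia's DCO is actually implemented.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: number of runs before a block counts as "hot"


class DcoSketch:
    """Toy model of a hot-code optimizer with an optimization cache."""

    def __init__(self, hot_threshold: int = HOT_THRESHOLD):
        self.hot_threshold = hot_threshold
        self.run_counts = Counter()      # how often each block went through the ARM path
        self.optimization_cache = {}     # block id -> stored "native" version

    def execute(self, block_id: str, arm_code: list) -> str:
        """Run one block and return which decoder path handled it."""
        if block_id in self.optimization_cache:
            return "native"              # optimized version already stored
        self.run_counts[block_id] += 1
        if self.run_counts[block_id] >= self.hot_threshold:
            # The block is hot: translate it once and keep the result.
            self.optimization_cache[block_id] = self._optimize(arm_code)
        return "arm"                     # this run still used the ARM decoder

    @staticmethod
    def _optimize(arm_code: list) -> tuple:
        # Placeholder only: a real optimizer would translate and reorder here.
        return tuple(arm_code)


dco = DcoSketch()
paths = [dco.execute("inner_loop", ["ldr", "add", "str"]) for _ in range(5)]
print(paths)
```

In this toy version the first three runs go through the ARM path and every run after that goes through the native path, which mirrors why a program that repeats the same code over and over (like a benchmark) is the best case for this design.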

While Nvidia's architecture might be a very interesting move, I ask myself if it wouldn't just be easier to build a regular Out-of-Order machine. But still, if it performs well in real life, it doesn't really matter how odd Nvidia's approach was. 

Now, going on to the execution portion of the Denver machine, we see why Denver is the widest mobile CPU in existence. That title was previously held by Cyclone, with its 6 execution pipelines. However, Nvidia went a step further and produced a 7-wide machine, capable of co-issuing up to seven instructions at once. That alone should give the Denver core excellent single-threaded performance.

The 64-bit version of the Tegra K1 employs two Denver cores clocked at up to 2.5GHz. That makes it the SoC with the lowest core count among the ones being compared here. While single-threaded performance will most certainly be great, I'm not sure that the dual-core Denver CPU can outrun its triple-core and quad-core opponents.
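As a rough way of putting the width, clock speed and core count figures side by side, the sketch below multiplies them into a theoretical peak issue rate. This is a deliberately crude model of my own (the class and field names are invented for illustration): it uses the issue widths and table clock speeds quoted in this article, ignores caches, memory stalls and reorder buffer sizes, and assumes Denver always runs on its 7-wide native path. Real workloads rarely get anywhere near these ceilings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuConfig:
    name: str
    cores: int
    issue_width: int   # max instructions issued per clock, per core
    clock_ghz: float

    def per_core_peak(self) -> float:
        """Theoretical peak, in billions of instructions per second per core."""
        return self.issue_width * self.clock_ghz

    def total_peak(self) -> float:
        """Theoretical peak across all cores."""
        return self.cores * self.per_core_peak()


# Figures as quoted in the article; peak numbers are upper bounds only.
CPUS = [
    CpuConfig("Apple A8X (Cyclone)", cores=3, issue_width=6, clock_ghz=1.5),
    CpuConfig("Tegra K1 32-bit (Cortex-A15)", cores=4, issue_width=3, clock_ghz=2.3),
    CpuConfig("Tegra K1 64-bit (Denver)", cores=2, issue_width=7, clock_ghz=2.5),
    CpuConfig("Snapdragon 805 (Krait 450)", cores=4, issue_width=3, clock_ghz=2.5),
]

for cpu in CPUS:
    print(f"{cpu.name:32s} per core: {cpu.per_core_peak():5.1f}  total: {cpu.total_peak():5.1f}")
```

The per-core column shows why the wide cores are expected to lead in single-threaded work despite their lower core counts. The total column is a ceiling rather than a prediction, so it should not be read as a ranking.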

In order to test that, let's start our synthetic benchmark evaluation of the CPUs with Geekbench 3.0, which evaluates the CPU in terms of both single-threaded and multi-threaded performance.

CPU Benchmarks

In single-threaded applications, Nvidia's custom Denver CPU core takes first place, followed closely by Apple's enhanced Cyclone core in the Apple A8X. Meanwhile, the older Cortex-A15 and Krait 450 CPU cores are far behind, with the 2.2GHz A15 core in the 32-bit Tegra K1 pulling slightly ahead of the 2.7GHz Krait 450 core in the Snapdragon 805.


In multi-threaded applications, where all of the CPU's cores can be used, the A8X, with its triple-core configuration, blows past the competition. The dual-core Denver version of the Tegra K1 gets about the same performance as the quad-core Cortex-A15 Tegra K1 variant, with the quad-core Krait 450 coming in last place, but by a very, very small margin.

Apple's addition of one extra core to the A8X's CPU, together with the fact that Cyclone is a very powerful core, makes it easily the fastest CPU on the market for multi-threaded applications. While Nvidia's 64-bit Denver CPU core has some impressive performance, thanks to its wide core architecture, its core count works against it in the multi-threaded benchmark. It is, in fact, the only dual-core CPU being compared here. Even if it's not as fast as the A8X's CPU, Nvidia's Denver CPU is a beast. Were it in a quad-core configuration, it would absolutely blow the competition out of the water.

The GPU

Moving away from CPU benchmarks, we shall now analyze graphics performance, which is probably even more important than CPU performance, given that it is practically a requirement for high-end tablets to act as a decent gaming machine. First we'll look at OpenGL ES 3.0 performance with GFXBench 3.0's Manhattan test, followed by the T-Rex test, which tests OpenGL ES 2.0 performance, followed by some of GFXBench 3.0's low level tests.

The Manhattan test puts the Apple A8X ahead of the competition, followed closely by both Tegra K1 variants, which have about the same performance, since they have the exact same GPU and clock speed. Unfortunately, the Adreno 420 in the Snapdragon 805 is no match for the A8X and the Tegra K1, something that points out the need for Qualcomm to up its GPU game.

The T-Rex test paints a similar picture, with the A8X slightly ahead of the Tegra K1, while both of the Tegra K1 variants get about the same score, and the Snapdragon 805 falls behind the other two processors by a pretty big margin.

The Fill rate test mostly stresses the processor's memory interface and the GPU's TMUs (Texture Mapping Units). Since both the Apple A8X and the Snapdragon 805 have the same dual-channel 64-bit LPDDR3 memory interface clocked at 800MHz, the performance advantage the Snapdragon 805 shows over the A8X can only be attributed to the Adreno 420 GPU having better texturing performance than the PowerVR GXA6850 in the Apple A8X. Meanwhile, the two variants of the Tegra K1 share the same memory interface, which is also a dual-channel 64-bit LPDDR3 interface, only with a lower 533MHz clock speed. As a result, the Tegra K1 offers significantly less texturing performance than the A8X and the Snapdragon 805, but is a very worthy performer nevertheless.

The ALU test is more about testing the GPU's sheer compute power. Since Nvidia's Tegra K1 has 192 CUDA cores on its GPU, it naturally takes the top spot here, and by a pretty significant margin.
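For reference, the GFLOPS figures in the spec table follow from simple arithmetic. The sketch below assumes the usual convention that each shader core can complete one fused multiply-add per clock, counted as 2 floating-point operations; that convention and the function names are my own assumptions for illustration. The second helper goes the other way and derives how many operations per clock each GPU's quoted GFLOPS figure implies.

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak GFLOPS, assuming one multiply-add (2 FLOPs) per core per clock."""
    return cores * flops_per_core_per_clock * clock_mhz / 1000.0


def implied_flops_per_clock(gflops: float, clock_mhz: float) -> float:
    """Floating-point operations per clock implied by a quoted GFLOPS figure."""
    return gflops * 1000.0 / clock_mhz


# Tegra K1: 192 CUDA cores at 852MHz, as listed in the spec table.
kepler = peak_gflops(192, 852)
print(f"Tegra K1 Kepler GPU: {kepler:.1f} GFLOPS")

# Work per clock implied by the table's GFLOPS and clock speed figures.
for name, gflops, mhz in [
    ("PowerVR GXA6850", 230.0, 450),
    ("Kepler (Tegra K1)", 327.0, 852),
    ("Adreno 420", 172.8, 600),
]:
    print(f"{name:18s} ~{implied_flops_per_clock(gflops, mhz):.0f} FLOPs per clock")
```

Seen this way, the Tegra K1's lead in raw compute comes mostly from its much higher GPU clock, while the A8X's GPU does more work per clock at a lower frequency.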

For some reason, all tests show the 32-bit Tegra K1 in the Nvidia Shield Tablet scoring a few more points than the 64-bit Tegra K1 in the Google Nexus 9. But given that the two processors have the exact same GPU, this difference in performance is probably due to software tweaks in the Shield Tablet's operating system, which would make sense, given that it is more than anything a tablet for gaming.

Thermal Efficiency and Power Consumption

In the ultra-mobile space, power consumption and thermals are the biggest limiting factors for performance. As the three processors being compared here are all performance beasts, several measures had to be taken so that they wouldn't drain a battery too fast or heat up too much.

In order to keep power consumption and die size in check, Apple has decided to shrink the manufacturing process from 28nm to 20nm, a first in the ultra-mobile processor market. That alone gives it a huge advantage over the competition, since Apple can fit more transistors in the same die area at the same power consumption. Since the A8X is, in general, the fastest SoC available, the smaller process node is important for keeping the iPad Air 2's battery life in good shape.

Nvidia's Tegra K1 should also do well in terms of power consumption and thermal efficiency in situations where the GPU isn't pushed too hard. The 28nm HPM process it's built upon is nothing particularly good, but it's still not old for a 2014 processor. While the Kepler architecture is very power efficient, straining a 192-core GPU to its maximum is still going to produce a lot of heat in any case. The Nexus 9 tablet reportedly can get very warm on the back while the tablet is running an intensive game.

Finally, the Snapdragon 805 should be the least power hungry processor because it is also a smartphone processor. Given that a 5" phone can carry this processor without heating up too much or draining the battery too fast, a tablet should certainly be able to do the same. To put things in perspective, if we put the Tegra K1 or the Apple A8X inside a smartphone, both would be too power hungry and would produce too much heat to make for a decent phone. In any case, the Snapdragon 805 is, like the Tegra K1, built on a 28nm HPM process. Given that it's not as much of a performance monster as the other two processors mentioned here, it should be the least power hungry of all three.

Conclusion

Objectively speaking, the comparisons made here make it pretty clear that once again Apple takes the crown for the best SoC in this generation of high-end tablet processors. Not that the competition is bad. On the contrary, Nvidia went, in just one generation, from being almost irrelevant in the SoC market (let's face it, the Tegra 4 was not an impressive processor) to being at the heels of the current king of this market (aka Apple). The Tegra K1 is an excellent SoC, and even if it can't quite match the Apple A8X, it's still quite close to it in most aspects.

Meanwhile, Qualcomm is seeing its dominance in the tablet market start to fail. Its latest SoC, the Snapdragon 805, which ships in several smartphones and phablets, is available in only one tablet, while most others carry the Snapdragon 801 or even the 800. This is disappointing, given that a tablet can put the processing power to better use than a smartphone or a phablet. Either way, the Snapdragon 805 is still a very good processor. It's just far from being the fastest. Perhaps Qualcomm should consider, like Nvidia and Apple, making a processor with extra oomph, but meant only to run inside tablets, because while the Snapdragon 805 is an excellent smartphone processor, it's not as competitive in the tablet market.