Saturday, November 16, 2013

Apple A7 vs NVIDIA Tegra 4 vs Snapdragon 800: SoC Wars


Mobile SoC performance has become one of the most competitive aspects of the mobile sector. Since 2010, when the iPad made it clear how important processing power is for mobile devices, performance has grown at a remarkable pace, and SoC vendors have competed ever more fiercely. In 2013, the main SoC manufacturers can be narrowed down to Qualcomm, Apple, NVIDIA and, to a lesser extent, Samsung. TI used to be a big player in the SoC market, but this year it practically disappeared from it. Now that these companies have their latest silicon shipping in commercially available products, in time for the holiday season, it's time to put their best offerings to the test and see who comes out on top.

Spec | Apple A7 | NVIDIA Tegra 4 | Snapdragon 800
Process Node | 28nm HKMG | 28nm HPL | 28nm HPM
Die Size | 102mm² | ~80mm² | 118.3mm²
Instruction Set | ARMv8 (64-bit) | ARMv7 (32-bit) | ARMv7 (32-bit)
CPU | Dual-core Cyclone @ 1.3/1.4GHz | Quad-core Cortex-A15 @ 1.9GHz + Low Power Cortex-A15 @ 825MHz | Quad-core Krait 400 @ 2.3GHz
GPU | PowerVR G6430 @ 450MHz | 72-core ULP GeForce @ 672MHz | Adreno 330 @ max 550MHz
RAM | 32-bit Dual-channel LPDDR3-1600 (12.8GB/s) | 32-bit Dual-channel LPDDR3/DDR3L-1866 (14.9GB/s) | 32-bit Dual-channel LPDDR3-1866 (14.9GB/s)



The CPU: Dual-core vs Quad-core

Apple's most impressive feat in mobile performance so far is that, in an age of quad-cores with insane clock speeds, it has never shipped a device with more than two CPU cores, has kept clock speeds relatively low, and has still managed to at least keep up with the latest competition. Let's see how Apple's latest CPU, the dual-core Cyclone with a max clock speed of 1.4GHz, stacks up against the Tegra 4's four Cortex-A15s @ 1.9GHz and the Snapdragon 800's four Krait 400 cores @ 2.3GHz.

Architecturally speaking, Apple's CPU is a step ahead of the Cortex-A15 and the Krait 400. That's because the A7 CPU implements the brand new 64-bit ARMv8 architecture. Besides 64-bit addressing, ARMv8 brings wider and more numerous general-purpose registers along with a cleaned-up instruction set, giving Cyclone a tangible performance gain in some cases over traditional 32-bit solutions. Not only that, but Apple has made the Cyclone core much wider than its predecessor, the Swift core. In fact, I think it's the widest mobile CPU so far. The wider architecture plus 64-bit give the Cyclone cores much better single-threaded performance than any of their competitors, and remember that in most use cases single-threaded performance is what matters most. Kudos to Apple for competing against monstrous quad-cores with only a dual-core.

The NVIDIA Tegra 4's CPU uses NVIDIA's Variable Symmetric Multi-Processing architecture, which was introduced with the Tegra 3. Like ARM's big.LITTLE architecture, the Tegra 4 consists of a main CPU cluster, composed of four high-performance Cortex-A15 cores running at a max 1.9GHz, and a shadow A15 core that can go up to 825MHz. When CPU demand is low, the quad-core A15 cluster is power-gated and all processing transfers to the shadow A15 core, where it stays for as long as CPU demand remains low enough. The advantage of this is, of course, reduced power consumption.
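The switching behaviour described above can be sketched as a small state machine. This is only an illustration: the load thresholds, the hysteresis band and the names `ClusterState`, `next_cluster` and `simulate` are assumptions made for the example, not details of NVIDIA's actual governor.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration. NVIDIA's real
# switching policy and numbers are not described in this article.
SWITCH_UP_LOAD = 0.60    # leave the shadow core above this normalized load
SWITCH_DOWN_LOAD = 0.25  # return to the shadow core below this normalized load


@dataclass
class ClusterState:
    # "shadow" = single A15 (up to 825MHz), "main" = four A15s (up to 1.9GHz)
    active: str = "shadow"


def next_cluster(state: ClusterState, load: float) -> ClusterState:
    """Pick the cluster for the next interval from normalized CPU demand (0.0-1.0)."""
    if state.active == "shadow" and load > SWITCH_UP_LOAD:
        return ClusterState("main")    # demand is high: wake the quad-core cluster
    if state.active == "main" and load < SWITCH_DOWN_LOAD:
        return ClusterState("shadow")  # demand is low: power-gate the quad-core cluster
    return state                       # in between: stay put to avoid flip-flopping


def simulate(loads):
    """Return which cluster is active after each load sample."""
    state = ClusterState()
    history = []
    for load in loads:
        state = next_cluster(state, load)
        history.append(state.active)
    return history


if __name__ == "__main__":
    print(simulate([0.1, 0.7, 0.4, 0.2, 0.1]))
```

The gap between the two thresholds is a common way to keep a governor from bouncing between clusters on a noisy load; whether the Tegra 4 does exactly this is an assumption.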

Qualcomm's Snapdragon 800 uses Qualcomm's own custom ARMv7 core, dubbed Krait 400. Since Qualcomm likes to keep its mouth shut about its CPU architectures, not much is known about the Krait 400. What we know is that the Krait 400 is mostly the Krait 300 core on a 28nm HPm process. However, the move from 28nm LP in the Krait 300 to 28nm HPm in the Krait 400 means that there's been some relayout. Other differences from Krait 300 include lower memory latency. Apart from that, we only know that, like the Cortex-A15, the Krait 400 is a 3-wide machine with OoO (Out-of-Order) processing capabilities. The move to HPm means the Krait 400 can achieve higher clocks than its predecessor, which accounts for the insane 2.3GHz max clock speed. Put four of those monster cores together and you potentially have the most powerful mobile CPU to date. Unfortunately, it still lags behind the Apple A7 in single-threaded performance, which is very important in mobile OSes.

Now let's put in some quantitative information to see how these CPUs compare in their actual performance: 

What I said before about single-threaded performance shows here. Apple's Cyclone cores can deliver at least 50% more performance on a single core than any of their competitors. Because the A7 has only two cores while all of its main competitors have four, it loses that advantage in multi-threaded situations, but it can still keep up with all of them. It's very impressive how Apple always manages to match quad-core performance with only two cores.
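To see why a faster dual-core can keep up with a slower quad-core, here is a rough Amdahl's-law sketch. The 1.5x per-core figure is taken from the "at least 50%" claim above; the serial fractions and the function name `relative_throughput` are assumptions for illustration, not measurements.

```python
def relative_throughput(per_core_perf: float, cores: int, serial_fraction: float) -> float:
    """Amdahl-style estimate: per-core speed multiplied by the parallel speedup.

    serial_fraction is the share of the work that cannot be spread across cores.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    parallel_fraction = 1.0 - serial_fraction
    speedup = 1.0 / (serial_fraction + parallel_fraction / cores)
    return per_core_perf * speedup


if __name__ == "__main__":
    # Assumed workloads: perfectly parallel, 20% serial, 50% serial.
    for serial in (0.0, 0.2, 0.5):
        dual = relative_throughput(1.5, 2, serial)  # two cores, each 1.5x as fast
        quad = relative_throughput(1.0, 4, serial)  # four baseline cores
        print(f"serial={serial:.0%}  dual-core={dual:.2f}  quad-core={quad:.2f}")
```

Under these assumptions the quad-core only wins when less than 20% of the work is serial; beyond that point the two faster cores come out ahead.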

The GPU and Memory

Apple has always put more emphasis on the GPU than on the CPU in its SoCs, and the A7 is no different. Apple continues to license GPUs from Imagination Technologies, as it has done since its first iPhone. This time around, Apple is using a PowerVR "Rogue" series GPU, which is based on ImgTech's latest technology and, of course, supports OpenGL ES 3.0. The exact model of the new PowerVR GPU in the A7 is the G6430 variant, which contains four GPU modules with 32 unified shader units on each module. That equates to a total of 128 shader units at a clock speed of 450MHz.

Ironically, the NVIDIA Tegra 4's GPU is the least fancy of the current high-end mobile GPUs. Designed by NVIDIA, the GPU in the Tegra 4 is based on the ancient NV40 architecture (the same used in the GeForce 6 series), hence it's the only modern GPU that uses discrete pixel and vertex shaders. In this case, there are a total of 72 shader units, 48 of which are pixel shaders and the remaining 24 are vertex shaders. The GPU runs at a max clock speed of 672MHz. The biggest limitation of the Tegra 4's GeForce GPU is that it only supports OpenGL ES 2.0. Right now, this isn't really a problem, as game developers haven't yet migrated to OpenGL ES 3.0 for their games, but it practically destroys the future-proofing of the Tegra 4.

Finally, we have the Snapdragon 800 with its Adreno 330 GPU. Like I said before, Qualcomm likes to reveal as little information as possible about its SoCs, and the Adreno line of GPUs is probably the biggest mystery I'm faced with now. All I can say is that it's a unified shader architecture compatible with the latest OpenGL ES 3.0 API. The Adreno 330, in its highest configuration, runs at 550MHz, but the vast majority of Snapdragon 800 devices have their GPUs clocked at 450MHz. By the way, the benchmark results I'll show later on reflect the Adreno 330's performance at 450MHz, since no devices have been released yet with the 550MHz bin of the Adreno 330.

Spec | Snapdragon 800 | Apple A7 | NVIDIA Tegra 4 | NVIDIA Tegra 4i
GPU Name | Adreno 330 | PowerVR G6430 | 72-core GeForce | 60-core GeForce
Shader Cores | ? | 4 | 4 Pixel; 6 Vertex | 2 Pixel; 3 Vertex
ALUs/Core | ? | 32 | 12 Pixel; 4 Vertex | 24 Pixel; 4 Vertex
Total ALUs | ? | 128 | 72 (48 Pixel; 24 Vertex) | 60 (48 Pixel; 12 Vertex)
Max Clock Speed | 550MHz | 450MHz | 672MHz | 660MHz
Peak GFLOPS | ? | 115.2 | 96.8 | 79.2
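The peak GFLOPS figures in the table can be reproduced from the ALU counts and clock speeds. The sketch below assumes each ALU retires one multiply-add (2 FLOPs) per clock, which is what makes the arithmetic match the table; the function name `peak_gflops` is just for this example.

```python
def peak_gflops(total_alus: int, clock_mhz: float, flops_per_alu_per_clock: int = 2) -> float:
    """Peak theoretical throughput in GFLOPS.

    Assumes every ALU issues one multiply-add (2 FLOPs) per clock cycle.
    """
    return total_alus * flops_per_alu_per_clock * clock_mhz / 1000.0


# (total ALUs, max clock in MHz), taken from the table above
GPUS = {
    "Apple A7 (PowerVR G6430)": (128, 450),
    "NVIDIA Tegra 4": (72, 672),
    "NVIDIA Tegra 4i": (60, 660),
}

if __name__ == "__main__":
    for name, (alus, clock) in GPUS.items():
        print(f"{name}: {peak_gflops(alus, clock):.1f} GFLOPS")
```

The Adreno 330 is left out because its ALU count is not public, so there is nothing to plug into the formula.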


Peak theoretical compute power puts the Tegra 4 behind the A7, but the Tegra 4 is still close enough to the A7 to call it competitive. However, be aware that, while the A7's unified shader architecture makes its peak 115.2 GFLOPS available in any situation (the same applies to the Adreno 330), the story is quite different with the Tegra 4. Its discrete pixel and vertex shader architecture means that the GPU's peak 96.8 GFLOPS can only be achieved when the mix of pixel and vertex shader work matches the ratio between pixel and vertex shader hardware (2:1), so most of the time the GPU achieves less than 96.8 GFLOPS.
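A simple model shows how the workload mix caps a discrete-shader GPU. It assumes pixel and vertex work run concurrently on their own units, that each ALU does 2 FLOPs per clock, and that whichever unit type finishes last sets the pace; `effective_gflops` and its parameters are names made up for this sketch.

```python
def effective_gflops(pixel_share: float,
                     pixel_alus: int = 48,
                     vertex_alus: int = 24,
                     clock_mhz: float = 672.0,
                     flops_per_alu_per_clock: int = 2) -> float:
    """Throughput achievable when pixel_share of the shader work is pixel work.

    Simplified model: pixel and vertex units run in parallel, and the unit
    type that takes longest to finish its share limits overall throughput.
    """
    if not 0.0 <= pixel_share <= 1.0:
        raise ValueError("pixel_share must be between 0 and 1")
    pixel_peak = pixel_alus * flops_per_alu_per_clock * clock_mhz / 1000.0
    vertex_peak = vertex_alus * flops_per_alu_per_clock * clock_mhz / 1000.0
    limits = []
    if pixel_share > 0.0:
        limits.append(pixel_peak / pixel_share)           # cap set by pixel units
    if pixel_share < 1.0:
        limits.append(vertex_peak / (1.0 - pixel_share))  # cap set by vertex units
    return min(limits)


if __name__ == "__main__":
    for share in (0.5, 2 / 3, 0.9, 1.0):
        print(f"pixel share {share:.2f}: {effective_gflops(share):.1f} GFLOPS")
```

In this model the full 96.8 GFLOPS is reached only at a pixel share of two thirds; a purely pixel-bound workload tops out at about 64.5 GFLOPS because the 24 vertex ALUs sit idle.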

There may not be a huge gap in theoretical compute between the A7's and Tegra 4's GPU, but the architectural difference is astounding. You can hardly put a unified shader architecture that supports OpenGL ES 3.0 in the same league as a discrete pixel and vertex shader architecture that is limited to OpenGL ES 2.0. While these differences may not affect real-world performance, the omission of OpenGL ES 3.0 is bad for future-proofing. 

Interestingly, every current high-end SoC uses pretty much the same memory interface. The Tegra 4, Apple A7 and Snapdragon 800 all have a dual-channel LPDDR3 solution (the Tegra 4 can also use DDR3L), except that the Tegra 4 and the Snapdragon 800 allow for a slightly higher clock speed (933MHz) than the A7 (800MHz), giving the A7 12.8 GB/s of peak theoretical memory bandwidth, versus 14.9 GB/s on the Tegra 4 and Snapdragon 800. While the A7 technically has less theoretical memory bandwidth than its competitors, it counteracts this with a very interesting solution. It turns out the A7 has 4 MB of SRAM on-die, acting as an L3 cache, which can absorb traffic that would otherwise go over the main memory interface and hence increase the effective bandwidth. You may recall that a similar solution is used in the Xbox One's SoC to increase memory bandwidth.

Considering the 4MB SRAM on the A7's die, it may turn out that the A7 can deliver significantly more memory bandwidth than the Tegra 4, but still, both have enough memory bandwidth to power ultra high-resolution (>1080p) tablets comfortably. 
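The peak bandwidth figures quoted above follow from the bus width and the memory clock. A minimal sketch, assuming double data rate (two transfers per clock) and two 32-bit channels as listed in the spec table; the function name `peak_bandwidth_gbs` is made up for the example.

```python
def peak_bandwidth_gbs(clock_mhz: float, channels: int = 2, channel_width_bits: int = 32) -> float:
    """Peak theoretical DRAM bandwidth in GB/s for a DDR-type interface."""
    transfers_per_second_millions = clock_mhz * 2          # DDR: two transfers per clock
    bytes_per_transfer = channels * channel_width_bits / 8  # 2 x 32 bits = 8 bytes
    return transfers_per_second_millions * bytes_per_transfer / 1000.0


if __name__ == "__main__":
    print(f"Apple A7 (800MHz): {peak_bandwidth_gbs(800):.1f} GB/s")
    print(f"Tegra 4 / Snapdragon 800 (933MHz): {peak_bandwidth_gbs(933):.1f} GB/s")
```

This only covers the external memory interface; the effect of the A7's on-die SRAM depends on hit rates that aren't public, so it is not modelled here.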

The T-Rex HD test shows the Tegra 4 significantly behind the Apple A7 and also puts it as the slowest of the high-end mobile GPUs. The Apple A7 is beaten only by the Snapdragon 800, and only by a very small margin.

The less intensive Egypt HD test also shows the Tegra 4 behind the A7 and other high-end mobile SoCs, but by a smaller margin. The A7 is the second slowest of these SoCs in this test, achieving slightly lower scores than the Mali-T628 in the Exynos 5420 and the Adreno 330 in the Snapdragon 800. Both tests show the Snapdragon 800 as the supreme mobile GPU.

ImgTech GPUs have always had industry-leading fill rate capabilities, and it shows in the A7. The PowerVR G6430 GPU has a much higher fill rate than any of its competitors. On the other end of the spectrum, we have the Tegra 4. Tegra GPUs have a tendency to be substandard in terms of fill rate, and it shows. The Tegra 4 manages a significantly lower fill rate score than every one of its competitors, especially the Apple A7. That's a problem, because the Tegra 4 currently powers some of the few tablets that boast 1600p displays, for example the ASUS Transformer Pad TF701T. On devices with 1080p screens or lower, however, even the Tegra 4 probably won't run into any bottlenecking due to the limited fill rate. The Snapdragon 800 also doesn't do very well, as it's outperformed by the Mali-T628 in the Exynos 5420 as well.



Here, the Tegra 4 and the Apple A7 are in the lead, with the Apple A7 pulling ahead slightly.



Adding per-vertex lighting for some reason causes the Apple A7 to lag behind all of its competitors, leaving the Tegra 4 in the lead.


When using per pixel lighting, the A7 once again falls behind everyone else, and this time the Tegra 4 also joins it with the second lowest score.

Even though in some cases the Apple A7 lags behind its competition severely, I highly doubt this is going to make performance suffer in any way, since most mobile games aren't very geometry bound. 

The Snapdragon 800, while not at the top spot in most of these tests, shows strong scores across the board, outperforming the whole competition by a significant margin in the fragment lit test. 

Power Consumption

All of the current high-end SoCs should have low enough power consumption, since they all use 28nm silicon. On the CPU side, the A7 enjoys a low core count as well as a low clock speed, so I don't expect the CPU to draw too much power. The Tegra 4, on the other hand, has four power-hungry Cortex-A15 cores with a much higher clock speed; however, the shadow A15 core has the potential to offset the extra power consumed when the main A15 cluster is active. The S800 doesn't have any extra low-power cores, and relies on the efficiency of the main Krait 400 cores to yield good battery life. But given Qualcomm's record of making CPUs with low idle power, this shouldn't be a problem.

One optimization that Qualcomm makes to reduce power consumption is letting each active core run at its own clock speed. The competitors' architectures only allow them to run every active core at the same clock speed, even when that's unnecessary. So, for example, if two cores are active, one of them fully loaded and the other running a much lighter task, the Krait 400 will run the first core at its max clock speed and the second at a much lower clock, while competing CPUs will run both cores at the max clock speed, even though the second core doesn't really need it. This is one of the many optimizations that make the Krait 400 core very power efficient.
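The two-core example above can be put into a toy model. The 2300MHz ceiling comes from the Krait 400's max clock; the 300MHz floor, the linear load-to-clock mapping and the function names are assumptions for illustration only, not Qualcomm's actual governor.

```python
MAX_CLOCK_MHZ = 2300  # Krait 400 max clock from the article
MIN_CLOCK_MHZ = 300   # hypothetical lowest operating point


def per_core_clocks(loads):
    """Asynchronous scaling: each core gets a clock proportional to its own load."""
    return [max(MIN_CLOCK_MHZ, round(load * MAX_CLOCK_MHZ)) for load in loads]


def shared_clock(loads):
    """Synchronous scaling: every active core runs at the clock the busiest core needs."""
    clock = max(per_core_clocks(loads))
    return [clock] * len(loads)


if __name__ == "__main__":
    loads = [1.0, 0.2]  # one fully loaded core, one running a light task
    print("per-core clocks:", per_core_clocks(loads))
    print("shared clock:   ", shared_clock(loads))
```

With these loads the per-core scheme leaves the second core at 460MHz while the shared scheme drags it up to 2300MHz. Since dynamic power generally falls as clock and voltage drop, that lightly loaded core is where the savings would come from.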

I can't really tell whether it's the 72-core GeForce GPU, the PowerVR G6430 or the Adreno 330 that consumes the least power, but given ImgTech's record of making the most power-efficient mobile GPUs, it's not a stretch to assume that the G6430 is the one that draws the least.



Conclusion

While the Tegra 4, the Apple A7 and the Snapdragon 800 have completely different architectures, I'd say that they're pretty close to each other, based on the performance they've shown in synthetic benchmarks. The differences between the CPUs are the most astounding. While Apple focused on keeping core count and clock speed low while driving up single-core performance, NVIDIA's (or rather, ARM's) and Qualcomm's solutions offset their relatively lower single-threaded performance by using more cores at a higher clock speed. While the former is probably better for overall system performance, as mobile OSes tend to rely much more on single-threaded performance, the latter is probably better for multi-tasking. In any case, it's evident that all current high-end SoCs are surprisingly close together when it comes to peak multi-threaded performance.

Comparing the Tegra 4, Apple A7 and the Snapdragon 800 as well as the rest of the high-end competition, it's clear that the only one that truly stands apart is the A7. The Tegra 4 and the Exynos 5420, for instance, both have four Cortex-A15 cores with similar clock speeds (1.9GHz vs 1.8GHz, respectively), and they also have a separate CPU cluster for handling light tasks at low power (the Tegra 4 has a single A15 core at its disposal, while the Exynos 5420 uses a quad-core Cortex-A7 cluster for the same purpose). The Snapdragon 800 uses a unique architecture, the Krait 400, in a quad-core configuration and even takes the clock speed beyond the norm with an insane 2.3GHz. Unlike two of its competitors, it doesn't need extra low-power cores, relying instead on other solutions to keep idle power consumption low.

In GFXBench's high-level GPU benchmarks, it seems that all four main high-end SoCs are more or less on the same level, with only the Snapdragon 800 pulling slightly ahead of the A7. In both high-level tests, however, we can see the Tegra 4 lagging behind all of its competition. How ironic.

GFXBench's low-level tests show a huge difference between the current high-end mobile GPUs, however. In the fill rate department we see the Apple A7 blowing all of its competitors out of the water, the Tegra 4 at the bottom of the chart, and the Snapdragon 800 slightly ahead of the Tegra 4 but still behind the Exynos 5420 and the Apple A7.

The verdict of this comparison is that, while pretty much all of the current flagship SoCs are close in terms of CPU power, the Tegra 4 falters slightly when the GPU is put to the test. The Apple A7 does very well on the GPU side, but it's just slightly outperformed by the Adreno 330 GPU in the Snapdragon 800. But really, they're all so close it's hard to pick one as a definite winner. You could call the Snapdragon 800 the overall winner, but I say it's too close to call.

11 comments:

  1. I really thought A7 was better than Tegra 4, but for some reason all of its specs seem to be wrong. I wasted money by buying the 5S, and most of my benchmarks for 3D scenarios are around 40 FPS while my friends' Tegra Note did no less than 65 FPS. For 2D, A7 does nice but well that's not impressive. Is there any way of explaining it? I want my 5S to do the same thing as a Tegra Note on max settings. Thought I'm probably going to buy the Tegra Note too for the pen pressure thing as I'm a digital painter.

    Replies:
    1. What are the benchmarks you used to achieve these results, specifically? As it is, the Tegra Note and Shield are much faster in benchmarks than other Tegra 4 devices, probably due to specific software optimizations in Nvidia's firmware for the Tegra Note and the Shield. That would explain why the Tegra Note is performing faster than the iPhone 5s.

  2. lol are you going to use benchmarks or real life applications? A7 is beating all in single threaded synthetic tests, and almost the equal in multithreaded. And that is only syntetic test, if you add iOS optimisation, Android phones are at the end far behind 5s.

    Replies:
    1. iOS optimisation? Don't you mean simplification? Apple's products are good,very good, but only being able to swipe left or right in a sort of appdrawer isn't OS optimisation. The 5s may be the smoothest, it's far behind android devices in being the most enjoyable.

  3. And yet nothing compares with the upcoming nvidia K1.

  4. Thanks for publishing this. im enjoy reading it

  5. I don't see energy consumption criteria in Your chart ?
    Let me put the question by this way - what is Watt per GFLOP ratio?
    I am sure You a7 will be light years ahead of dummy idiots that produce 32 bit junks

  6. To the poster above me, you do realize what a benchmark is, don't you? It's a sythetic test designed to determine the raw performance of something. Sure, a core i5 ulv processor uses much less power per gflop than a core i7 extreme, but the difference in anual running cost is a very insignificant amount (about what it would cost to take a family if four to the movie theatre.) You're point is so irrelevant to a performance test., especially considering the yearly cost to run the apple A7 vs. any of its android competitors is less than what it than the cost of buying a package of frozen chicken wings, and it still underperforms them in almost all serious power user (multi-threaded) applications. Even though I will concede the A7 has better performance per watt, its still less performance overall at nominal cost to run. "Lightyears ahead" no, a Haswell Intel Core i7 Extreme is lightyears ahead of an intel 486dx2...
