Showing posts with label tegra k1.

Sunday, November 23, 2014

Apple A8X vs Tegra K1 vs Snapdragon 805 - Tablet SoC Comparison (2014 Edition)

In the last few years, ultra-mobile System-on-Chip (SoC) processors have made unprecedented strides in performance and efficiency, rapidly raising the standard for mobile performance. One form factor that particularly benefits from this rapid growth in SoC performance is the tablet, since its large screen allows the processor's abilities to be fully utilized. For the 2014 holiday season, the latest and greatest in mobile performance is shipping inside high-end tablets. Apple has made a whole new SoC just for its iPad Air 2, which it calls the A8X. Nvidia's Tegra K1, which borrows Nvidia's venerable Kepler GPU architecture, has also appeared in a number of new high-end tablets. Finally, there is the Qualcomm Snapdragon 805, found in the Amazon Kindle Fire HDX 8.9" (2014). Unfortunately, most other tablets either use the aging Snapdragon 801 or, in the case of Samsung's latest high-end tablets, the even older Snapdragon 800 or the similarly dated Exynos 5420, which debuted with the Note 3 phablet in late 2013. In any case, at the pinnacle of tablet performance, the Apple A8X, the Tegra K1 and the Snapdragon 805 are battling for the top spot.

| | Apple A8X | Nvidia Tegra K1 | Snapdragon 805 |
|---|---|---|---|
| Process Node | 20nm | 28nm HPM | 28nm HPM |
| CPU | Tri-core "Enhanced Cyclone" (64-bit) @ 1.5GHz | 32-bit: Quad-core ARM Cortex-A15 @ 2.3GHz; 64-bit: Dual-core Denver @ 2.5GHz | Quad-core Krait 450 @ 2.5GHz |
| GPU | PowerVR GXA6850 @ 450MHz (230 GFLOPS) | 192-core Kepler GPU @ 852MHz (327 GFLOPS) | Adreno 420 @ 600MHz (172.8 GFLOPS) |
| Memory Interface | 64-bit Dual-channel LPDDR3-1600 (25.6GB/s) | 64-bit Dual-channel LPDDR3-1066 (17GB/s) | 64-bit Dual-channel LPDDR3-1600 (25.6GB/s) |
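
For readers wondering where the bandwidth figures in the table come from, here is a minimal Python sketch of the usual peak-bandwidth arithmetic: bytes moved per transfer across all channels, multiplied by the transfer rate. The function name `peak_bandwidth_gbs` is made up for this illustration, and the result is a theoretical ceiling, not a measured number.

```python
def peak_bandwidth_gbs(channel_width_bits, channels, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channel_width_bits * channels / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Dual-channel 64-bit LPDDR3-1600 (A8X and Snapdragon 805 rows of the table)
lpddr3_1600 = peak_bandwidth_gbs(64, 2, 1600)
# Dual-channel 64-bit LPDDR3-1066 (Tegra K1 row of the table)
lpddr3_1066 = peak_bandwidth_gbs(64, 2, 1066)

print(f"LPDDR3-1600: {lpddr3_1600:.1f} GB/s")
print(f"LPDDR3-1066: {lpddr3_1066:.1f} GB/s")
```

The first case works out to 25.6GB/s and the second to roughly 17GB/s, matching the table.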


The CPU

It can certainly be said that all of this year's high-end mobile processors have excellent CPU performance. However, each manufacturer took a different path to reach those high performance demands, and that is what we'll be looking at in this section.

Starting with the A8X's CPU, what we have in hand is Apple's first CPU with more than two cores. This time we have a tri-core CPU, based on an updated revision of the Apple-designed Cyclone core, which implements the ARMv8 ISA and is therefore a 64-bit architecture. Clock speeds remain conservative with Apple's latest CPU, going no further than 1.5GHz. So with three cores at 1.5GHz, how does Apple get performance competitive with quad-core, 2GHz+ offerings from competitors? The answer lies within the Cyclone core.
The Cyclone CPU, now in its second generation, is a very wide core: it can issue up to 6 instructions per clock. Each Cyclone core also contains 4 ALUs, as opposed to 2 ALUs per core in Apple's previous CPU architecture, Swift. In addition, the reorder buffer has been enlarged to 192 instructions, in order to hide memory stalls and make fuller use of the 6 execution pipelines. In comparison, a Cortex-A15 core can co-issue up to 3 instructions per clock, half as many as Cyclone, and can hold up to 128 instructions in its reorder buffer, only two thirds of what Cyclone's reorder buffer can hold.
By building a very wide CPU architecture and keeping its CPUs to low core counts and clock speeds, Apple has, in one move, achieved excellent single-threaded performance, far beyond what a Cortex-A15 or a Krait core can produce, while at least matching the quad-core competition in multi-threaded processing. I've always said that, because most CPU workloads are largely sequential in nature, CPUs are far more efficient when they are built to emphasize single-threaded performance, and Apple continues to do the right thing with Cyclone.

The Snapdragon 805 is the last high-end SoC to utilize Qualcomm's own Krait CPU architecture, which was introduced way back with the Snapdragon S4. Needless to say, it's still a 32-bit core. The last revision of the Krait architecture is dubbed Krait 450. While Krait 450 carries many improvements over the original Krait core, the basic architecture is still the same. Like the Cortex-A15, Krait is a 3-wide machine, capable of co-issuing up to 3 instructions per clock. Compared to Cyclone, it's a relatively small core, so it won't be as fast in single-threaded performance. Krait 450's tweaked architecture allows it to run at a whopping 2.7GHz, or to be more exact, 2.65GHz. In the case of the Snapdragon 805, we have four of these Krait 450 cores. Qualcomm's signature architecture tweak, which puts each core on its own voltage/frequency controller, allows each core to run at a different frequency. That reduces the power consumption of the SoC, and should translate into better battery life. With four cores at such a high frequency, the Snapdragon 805's CPU gets very good multi-threaded performance, although the relatively narrow Krait core hurts single-threaded performance considerably.

Finally, we have the Tegra K1 and its two different versions. The 32-bit version of the Tegra K1 employs a quad-core Cortex-A15 CPU clocked at up to 2.3GHz, and we've seen a CPU configuration like this in so many SoCs that by this point it's a very well-known quantity. The interesting story here is the 64-bit Tegra K1, which uses a dual-core configuration of Nvidia's brand new custom CPU architecture, named Denver. If you don't care much about Denver's architecture, you may want to skip the next section, because there is a lot to say about Nvidia's custom CPU.

Denver: The Oddest CPU in SoC history

Denver is Nvidia's first attempt at making a proprietary CPU architecture, and for a first attempt it's actually very good. Some of Nvidia's expertise as a GPU maker has translated into its CPU architecture. For instance, Denver works with VLIW (Very Long Instruction Word) instructions, an approach more commonly associated with GPUs and DSPs than with mobile CPUs. Basically, this means that several independent operations are packed into one long instruction "word", and only then sent into the execution pipelines together.

Denver's most peculiar characteristic might be this one: it's an in-order machine, while basically every other high-end mobile CPU has Out-of-Order Execution (OoOE) capabilities. Denver's lack of a dedicated engine that reorders instructions in order to reduce memory stalls and therefore increase the IPC (Instructions Per Clock) should be a huge performance bottleneck. However, Nvidia employs a very interesting (and in my opinion unnecessarily complicated) way of dealing with its in-order architecture.

By not having a hardware OoOE engine built into the CPU, Nvidia has to rely on software tricks to reorder instructions and enhance ILP (Instruction Level Parallelism). Denver is actually not meant to decode ARM instructions most of the time. Rather, Nvidia chose to build a decoder that runs native instructions, optimized for maximum ILP. For this optimization to occur, Nvidia has implemented a Dynamic Code Optimizer (DCO). Basically, the DCO's job is to recognize ARM instruction sequences that are sent to the CPU frequently, translate them into native instructions, and optimize the result by reordering operations to reduce memory stalls and maximize ILP. For this to work, a small part of the device's main memory must be reserved to store the optimized code.

One implication of this system is that the CPU must be able to decode both native instructions and normal ARM instructions. For this purpose there are two decoders in the CPU block: one huge 7-wide decoder for native instructions generated by the DCO, and a secondary 2-wide decoder for ARM instructions. The difference in size between the two decoders shows that Nvidia expects native instructions to be used most of the time. Of course, the first time a program is run, there are no optimized native instructions ready for the native decoder, so only the ARM decoder is used until the DCO starts recognizing recurring ARM instructions from the program and optimizes them, from which point onwards those instructions always go through the native decoder. If a program ran the same instructions multiple times (for example, a benchmark program), eventually all of the program's instructions would have a corresponding optimized native version stored, and then only the native decoder would be used. That would correspond to Denver's peak performance scenario.
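
To make the decoder hand-off a little more concrete, here is a toy Python sketch of the behaviour described above. It is only an illustration under loud assumptions: the class `ToyDCO`, the `hot_threshold` of 3 runs, the idea of tracking named code blocks, and the "move loads first" reordering are all invented for this example and say nothing about how Nvidia's actual optimizer is implemented.

```python
from collections import Counter


class ToyDCO:
    """Toy model of a dynamic code optimizer (not Nvidia's implementation)."""

    def __init__(self, hot_threshold=3):
        # hot_threshold is a made-up value: runs needed before a block is "hot".
        self.hot_threshold = hot_threshold
        self.run_counts = Counter()
        self.optimized_cache = {}  # block name -> reordered op list

    def execute(self, block_name, arm_ops):
        """Return which decoder would handle this run: 'native' or 'arm'."""
        if block_name in self.optimized_cache:
            return "native"
        self.run_counts[block_name] += 1
        if self.run_counts[block_name] >= self.hot_threshold:
            self.optimized_cache[block_name] = self._optimize(arm_ops)
        return "arm"

    @staticmethod
    def _optimize(arm_ops):
        # Placeholder "scheduling": move loads to the front, keep the rest in order.
        # A real optimizer must respect data dependencies; this one does not.
        return sorted(arm_ops, key=lambda op: not op.startswith("ldr"))


dco = ToyDCO()
trace = [dco.execute("loop_body", ["add", "ldr", "mul"]) for _ in range(5)]
print(trace)
```

In this sketch the first three runs of the block go through the ARM path, and every later run is served from the optimized cache, which mirrors the warm-up effect a repetitive benchmark would see.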

While Nvidia's architecture might be a very interesting move, I ask myself if it wouldn't just be easier to build a regular Out-of-Order machine. But still, if it performs well in real life, it doesn't really matter how odd Nvidia's approach was. 

Now, moving on to the execution portion of the Denver machine, we see why Denver is the widest mobile CPU in existence. That title was previously held by Cyclone, with its 6 execution pipelines. However, Nvidia went a step further and produced a 7-wide machine, capable of co-issuing up to seven instructions at once. That alone should give the Denver core excellent single-threaded performance.

The 64-bit version of the Tegra K1 employs two Denver cores clocked at up to 2.5GHz. That makes it the SoC with the lowest core count among the ones being compared here. While single-threaded performance will most certainly be great, I'm not sure that the dual-core Denver CPU can outrun its triple-core and quad-core opponents.

In order to test that, let's start our synthetic benchmark evaluation of the CPUs with Geekbench 3.0, which evaluates the CPU in terms of both single-threaded and multi-threaded performance.

CPU Benchmarks

In single-threaded applications, Nvidia's custom Denver CPU core takes first place, followed closely by Apple's enhanced Cyclone core in the Apple A8X. Meanwhile, the older Cortex-A15 and Krait 450 CPU cores are far behind, with the 2.2GHz A15 core in the 32-bit Tegra K1 pulling slightly ahead of the 2.7GHz Krait 450 core in the Snapdragon 805.


In multi-threaded applications, where all of the CPU's cores can be used, the A8X, with its triple-core configuration, blows past the competition. The dual-core Denver version of the Tegra K1 gets about the same performance as the quad-core Cortex-A15 Tegra K1 variant, with the quad-core Krait 450 coming in last place, but by a very, very small margin.

Apple's addition of one extra core to the A8X's CPU, together with the fact that Cyclone is a very powerful core, makes it easily the fastest CPU on the market for multi-threaded applications. While Nvidia's 64-bit Denver CPU core has some impressive performance, thanks to its wide core architecture, its core count works against it in the multi-threaded benchmark. It is, in fact, the only dual-core CPU being compared here. Even if it's not as fast as the A8X's CPU, Nvidia's Denver CPU is a beast. Were it in a quad-core configuration, it would absolutely blow the competition out of the water.

The GPU

Moving away from CPU benchmarks, we shall now analyze graphics performance, which is probably even more important than CPU performance, given that it is practically a requirement for high-end tablets to act as a decent gaming machine. First we'll look at OpenGL ES 3.0 performance with GFXBench 3.0's Manhattan test, followed by the T-Rex test, which tests OpenGL ES 2.0 performance, followed by some of GFXBench 3.0's low level tests.

The Manhattan test puts the Apple A8X ahead of the competition, followed closely by both Tegra K1 variants, which have about the same performance, since they have the exact same GPU and clock speed. Unfortunately, the Adreno 420 in the Snapdragon 805 is no match for the A8X and the Tegra K1, which points to the need for Qualcomm to up its GPU game.

The T-Rex test paints a similar picture, with the A8X slightly ahead of the Tegra K1, while both of the Tegra K1 variants get about the same score, and the Snapdragon 805 falls behind the other two processors by a pretty big margin.

The Fill rate test mostly stresses the processor's memory interface and the GPU's TMUs (Texture Mapping Units). Since the Apple A8X and the Snapdragon 805 have the same dual-channel 64-bit LPDDR3 memory interface clocked at 800MHz, the advantage the Snapdragon 805 shows over the A8X is most likely down to the Adreno 420 having better texturing performance than the PowerVR GXA6850 in the Apple A8X. Meanwhile, the two variants of the Tegra K1 share the same memory interface, which is also a dual-channel 64-bit LPDDR3 design, only with a lower 533MHz clock speed. As a result, the Tegra K1 offers significantly less texturing performance than the A8X and the Snapdragon 805, but it is a very worthy performer nevertheless.
The ALU test is more about testing the GPU's sheer compute power. Since Nvidia's Tegra K1 has 192 CUDA cores in its GPU, it naturally takes the top spot here, and by a pretty significant margin.

For some reason, all tests show the 32-bit Tegra K1 in the Nvidia Shield Tablet scoring a few more points than the 64-bit Tegra K1 in the Google Nexus 9. But given that the two processors have the exact same GPU, this difference in performance is probably due to software tweaks in the Shield Tablet's operating system, which would make sense, given that it is more than anything a tablet for gaming.

Thermal Efficiency and Power Consumption

In the ultra-mobile space, power consumption and thermals are the biggest limiting factors for performance. As the three processors being compared here are all performance beasts, several measures had to be taken so that they wouldn't drain a battery too fast or heat up too much.

In order to keep power consumption and die size in check, Apple has decided to shrink the manufacturing process from 28nm to 20nm, a first in the ultra-mobile processor market. That alone gives it a huge advantage over the competition, since Apple can fit more transistors into the same die area at the same power consumption. Since the A8X is, in general, the fastest SoC available, the smaller process node is important for keeping the iPad Air 2's battery life good.

Nvidia's Tegra K1 should also do well in terms of power consumption and thermal efficiency in situations where the GPU isn't pushed too hard. The 28nm HPM process it's built upon is nothing particularly good, but it's still not old for a 2014 processor. While the Kepler architecture is very power efficient, straining a 192-core GPU to its maximum is still going to produce a lot of heat in any case. The Nexus 9 tablet reportedly can get very warm on the back while the tablet is running an intensive game.

Finally, the Snapdragon 805 should be the least power-hungry processor, because it is also a smartphone processor. Given that a 5" phone can carry this processor without heating up too much or draining the battery too fast, a tablet should certainly be able to do the same. To put things in perspective, if we put the Tegra K1 or the Apple A8X inside a smartphone, both would be too power-hungry and would produce too much heat to make for a decent phone. In any case, the Snapdragon 805 is, like the Tegra K1, built on a 28nm HPM process. Given that it's not as much of a performance monster as the other two processors discussed here, it should be the least power-hungry of the three.

Conclusion

Objectively speaking, the comparisons made here make it pretty clear that once again Apple takes the crown for the best SoC of this generation of high-end tablet processors. Not that the competition is bad. On the contrary, Nvidia went, in just one generation, from being almost irrelevant in the SoC market (let's face it, the Tegra 4 was not an impressive processor) to being on the heels of the current king of this market (aka Apple). The Tegra K1 is an excellent SoC, and even if it can't quite match the Apple A8X, it's still quite close to it in most aspects.

Meanwhile, Qualcomm is seeing its dominance in the tablet market start to fade. Its latest SoC, the Snapdragon 805, which ships even in some smartphones and phablets, is available in only one tablet, while most others carry the Snapdragon 801 or even the 800. This is disappointing, given that a tablet can put the processing power to better use than a smartphone or a phablet. Either way, the Snapdragon 805 is still a very good processor. It's just far from being the fastest. Perhaps Qualcomm should consider, like Nvidia and Apple, making a processor with extra oomph meant only for tablets, because while the Snapdragon 805 is an excellent smartphone processor, it's not as competitive in the tablet market.

Sunday, May 18, 2014

Xiaomi Announces the Mi Pad, First Tegra K1 Tablet


Four years into its existence, Chinese smartphone manufacturer Xiaomi has achieved great success with its smartphone sales in the Chinese market, and is now even starting to expand to other countries. In fact, the company has just announced its first tablet, dubbed the Mi Pad. As is common with many Xiaomi products, the Mi Pad is designed very similarly to one of its main competitors' mobile products: in this case, it looks like an iPad mini with a back that resembles the plastic, colourful iPhone 5c. However, the Mi Pad does bring something new to the table: a remarkably low 1,499 yuan (US$240) starting price and what might be just about the best mobile processor designed to date, Nvidia's Tegra K1.

To say that the Xiaomi Mi Pad is based on Apple's iPad mini Retina would be an utter understatement. It has the same screen size and resolution: a 7.9" diagonal, a 4:3 aspect ratio and 2048 x 1536 pixels, which works out to 326 pixels per inch. "Copied" from Apple's tablet or not, it has to be said that this is a very good screen for such a low-priced tablet.
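
As a quick sanity check on the pixel density, here is a small Python sketch of the standard calculation: diagonal resolution in pixels divided by diagonal size in inches. The function name `pixels_per_inch` is just a label chosen for this example. With the nominal 7.9" figure the result comes out near 324; the commonly quoted 326 corresponds to a diagonal closer to 7.85", so the small gap is most likely rounding in the advertised screen size.

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Pixel density from resolution and diagonal size."""
    diagonal_px = math.hypot(width_px, height_px)
    return diagonal_px / diagonal_in


nominal_ppi = pixels_per_inch(2048, 1536, 7.9)
# Diagonal (in inches) that would give exactly 326 ppi at this resolution
implied_diagonal = math.hypot(2048, 1536) / 326

print(f"PPI at 7.9 inches: {nominal_ppi:.1f}")
print(f"Diagonal implied by 326 ppi: {implied_diagonal:.2f} inches")
```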

Since the screen size and aspect ratio are the same, the Mi Pad also has very similar dimensions to the iPad mini, measuring 202 x 135mm. It's thicker than the iPad mini though, measuring 8.5mm thick, but that's because it has a huge (for its size) 25Wh battery (vs 23.4Wh for the iPad mini) to power both the high-resolution screen and the beefy 192-core GPU in the Tegra K1 processor. The Mi Pad is also slightly heavier than its competitors, weighing 360g (vs the 338g iPad mini Retina), although it's not exactly heavy, and the large battery more than compensates for the added weight. 

The back of the tablet is made of glossy plastic, which will be available in a variety of different colors (yellow, pink, blue, green, white and black). Xiaomi itself says that it uses the same kind of plastic build as the iPhone 5c. 

The tablet has a decent (on paper at least) 8MP Sony camera with an F/2.0 aperture on the rear and an unusually high-resolution 5MP camera on the front, so it should be excellent for video conferences and selfies.


The main reason the Mi Pad stands out from other tablets on the market is that it's the first device to ship with Nvidia's Tegra K1 processor. While Nvidia's past Tegra processors have never been the fastest, the Tegra K1 marks a huge change in Nvidia's strategy in the mobile market. The Tegra K1 actually has two variants. One has a quad-core 32-bit Cortex-A15 CPU with a max clock speed of 2.3GHz (we've never seen a Cortex-A15 clocked this high before), plus Nvidia's signature fifth low-power A15 core for handling light workloads while consuming very little power. The other will sport a dual-core configuration of a custom CPU core designed by Nvidia, the 64-bit Denver, which will apparently run at up to 2.5GHz. The Denver K1 will only be released later this year, so it's definitely the Cortex-A15 variant powering the Mi Pad. This is no issue though, as the Cortex-A15 is still one of the fastest CPU cores available, especially at 2.3GHz.

But what really stands out in the Tegra K1 is Nvidia's abandonment of the old GPU architecture used in previous Tegras' ULP GeForce GPUs in favor of the most recent, most efficient Kepler architecture. This immediately implies that the Tegra K1 supports a wide range of APIs, including OpenGL ES 3.0 and DirectX 11. The Kepler GPU in the Tegra K1 is a full SMX unit, which means 192 unified shader cores, more than in any other mobile GPU. Not only that, but Kepler's cutting-edge power efficiency means that, even in a thermally and power-constrained form factor like a tablet, the GPU can reach very high clock speeds; Nvidia says it can go up to 950MHz, which is just insane for a mobile SoC. We still can't say whether the Mi Pad will heat up or drain its battery too fast due to the powerful processor inside, but given Kepler's efficiency, I believe that neither will be a problem.

Some benchmarks of the Mi Pad that surfaced with its announcement show truly impressive GPU performance. For instance, it scored 30fps on the GFXBench 3.0 Manhattan Offscreen test, which is unprecedented for a mobile device. For comparison, the second fastest device for this particular benchmark, the iPad Air, tied with the iPad mini Retina and the iPhone 5s, scores 13fps on the same test. In other words, virtually all mobile SoCs are dwarfed by the Tegra K1 when it comes to GPU performance.

As there's no way to directly ascertain whether the Mi Pad is running the Tegra K1's GPU at the full 950MHz, I decided to try to determine it indirectly, so I ran the GFXBench T-Rex Offscreen GPU benchmark (Manhattan isn't available on Windows yet) on my G750JW laptop, which has a GPU with the same Kepler architecture. My laptop's GPU contains four Kepler SMX units with a max clock speed of 910MHz; in other words, four times as many cores as the Tegra K1 at a very similar clock speed. And indeed, my laptop scored a bit less than four times what the Mi Pad scored in the same test (60fps, by the way), the difference being attributable to the 40MHz lower clock speed of my laptop's GPU.
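
The reasoning above can be written down as a rough Python sketch. It assumes that benchmark performance scales linearly with shader cores times clock speed, which ignores memory bandwidth, drivers and thermal throttling, so treat it as a back-of-the-envelope estimate only. The helper name `shader_throughput` is invented for this example.

```python
def shader_throughput(smx_count, clock_mhz, cores_per_smx=192):
    """Relative shader throughput: cores x clock (arbitrary units)."""
    return smx_count * cores_per_smx * clock_mhz


laptop = shader_throughput(smx_count=4, clock_mhz=910)  # four SMX units at 910MHz
tablet = shader_throughput(smx_count=1, clock_mhz=950)  # one SMX, if it runs at 950MHz

# Ratio we would expect between the two scores under linear scaling
expected_ratio = laptop / tablet
print(f"Expected laptop/tablet score ratio: {expected_ratio:.2f}")
```

Under these assumptions the expected ratio is about 3.83, i.e. "a bit less than four times", which is the pattern described in the text if the tablet's GPU really runs at 950MHz.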

Hence, I can confirm that the Mi Pad has its 192-core GPU running at the full stunning 950MHz. This gives it 364.8 GFLOPS of peak processing power, much more than what the Xbox 360 and PS3 could produce. So if the Mi Pad is rather uninspiring in its design and specs, at least its performance is downright amazing. The GPU is definitely capable of powering demanding mobile games, even at the Mi Pad's high 2048x1536 resolution. And with Nvidia bringing console games like Portal and Half-Life 2 to its Tegra devices, the Mi Pad has the potential to be just about the best gaming tablet on the market.

Conclusion

It's a shame that Xiaomi doesn't sell its devices in many countries, because they've been producing some very good stuff lately. Even though the Mi Pad isn't exactly original in its design and spec sheet, that doesn't mean it's a bad tablet. Quite the contrary. With a high-resolution display, an excellent duo of rear and front cameras, a large battery, and what is by far the fastest mobile processor available, the Mi Pad has all it needs to compete with top-end small tablets like the Nexus 7, the Samsung Galaxy Tab Pro 8.4 and the iPad mini Retina. Wrap that up with a 1,499 yuan/US$240 price tag, and you may just have not only the best portable tablet, but also the best tablet value on the market.

The tablet will go on sale in China some time around June, and will subsequently roll out to a few other emerging markets as part of Xiaomi's expansion plan (some of these countries are Brazil, Malaysia, Mexico, Russia, Indonesia and Thailand). In the US, it'll probably be available at some point through online importers, but that means that the price tag will be a bit above $240. If you can get your hands on the Mi Pad, and if its resemblance to some Apple products doesn't bother you that much, I would definitely recommend it.

Monday, January 6, 2014

CES 2014: NVIDIA Introduces Tegra K1 SoC: 64-bit Denver CPU and 192-core Kepler GPU


The Tegra line has always seemed like an afterthought for NVIDIA, lacking the innovation we've come to expect from the company. Well, this may be because they were busy working on something extraordinary, and it's finally here. NVIDIA's latest addition to the Tegra line, the Tegra K1, was announced today at its CES 2014 event, and it's pretty impressive. Tegra K1 brings NVIDIA's custom CPU core, named Denver, as well as a GPU built on the Kepler architecture, which offers DX11 compatibility and, according to NVIDIA, can even outperform the Xbox 360 and the PS3, all while keeping a 5W TDP. The PowerVR GPUs Apple uses in its SoCs have long been the pinnacle of mobile GPU performance, but if NVIDIA's performance claims about Tegra K1 pan out, Apple's GPUs will be utterly blown out of the water.

Tegra K1 is, just like the Tegra 4, built on a 28nm process, which is pretty much the standard for modern SoCs, save for Intel's latest Atoms, which have already moved to 22nm. Hopefully the efficient 28nm process will keep the TDP at 5W or below, despite that beefy Kepler GPU. 

NVIDIA's latest SoC will actually come out in two variants. One will have a quad-core Cortex-A15 CPU with a 2.3GHz clock speed and the other, which will only be available later this year, will feature a dual-core configuration of NVIDIA's own Denver CPU core. 2.3GHz is actually the highest clock speed we've ever seen a Cortex-A15 run at, so performance should be superb. The dual-core Denver-toting variant has an unknown clock speed, but what's really important is that Denver is a) NVIDIA's first custom ARM CPU and b) one of the first CPUs to use the ARMv8 architecture and therefore support 64-bit processing. I'm very excited to see how Denver performs when it comes out, and the quad-core Cortex-A15 @ 2.3GHz will be very impressive too. Also, NVIDIA says Tegra K1 will, like its predecessors, use the 4-PLUS-1 architecture, so there's going to be a single "shadow" CPU core for handling light tasks while using very little power. Whether it's going to be used with both the quad-core A15s and the dual-core Denvers, I don't know, but I suspect the dual-core Denver won't need the extra shadow core.

Perhaps the most interesting GPU we've ever seen on mobile is the Tegra K1's GPU. Considering how every previous Tegra GPU was based on a very old architecture and seldom topped benchmark charts, a jump to Kepler in one generation is quite satisfying. NVIDIA's Kepler GPU architecture was introduced in 2012 and brought high performance and much better power efficiency to notebook and desktop GPUs, and even supercomputers, and NVIDIA has now achieved the impressive feat of bringing this architecture to the ultra-mobile space. The Tegra K1's GPU uses one full Kepler SMX, which means 192 unified shader units (or, as NVIDIA calls them, cores). That's many more shader units than any mobile GPU has ever packed (for instance, the GPU in Apple's A7 had 128 shader units). NVIDIA claims that this GPU can even outperform both the Xbox 360 and the PS3. According to our calculations, it can, at 950MHz at least. At this clock speed, the GPU would deliver 365 GFLOPS, which is much more than the Xbox 360's 240 GFLOPS GPU and the PS3's 230 GFLOPS. I don't know whether NVIDIA's 5W TDP claim accounts for the GPU running at 950MHz, but if it does (and it might, given how power efficient Kepler is), I pity NVIDIA's SoC competitors. For the record, to match the Xbox 360's performance, the Tegra K1's GPU would have to be clocked at 625MHz, which is actually lower than the Tegra 4's GPU clock. At 500MHz, a rather standard clock speed for mobile GPUs, the Tegra K1 can achieve 192 GFLOPS of peak theoretical performance, which is more than any of its competitors has reached. Of course, these are theoretical calculations, and we'll have to wait for a device running the Tegra K1 to be released to test whether its performance (vs its power consumption) is as good as it sounds.
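
The GFLOPS figures quoted above can be reproduced with a short Python sketch. It assumes the usual convention of 2 floating-point operations per core per clock (one fused multiply-add), which is what the numbers in the text imply; the function names are made up for this example, and the results are theoretical peaks rather than measured performance.

```python
def peak_gflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak GFLOPS, assuming one multiply-add per core per clock."""
    return cores * flops_per_core_per_clock * clock_mhz / 1000


def clock_mhz_for(target_gflops, cores, flops_per_core_per_clock=2):
    """Clock speed (MHz) needed to reach a given theoretical peak."""
    return target_gflops * 1000 / (cores * flops_per_core_per_clock)


at_950 = peak_gflops(192, 950)             # full advertised clock
at_500 = peak_gflops(192, 500)             # "standard" mobile GPU clock
match_xbox_360 = clock_mhz_for(240, 192)   # clock needed for 240 GFLOPS

print(f"192 cores @ 950MHz: {at_950:.1f} GFLOPS")
print(f"192 cores @ 500MHz: {at_500:.1f} GFLOPS")
print(f"Clock to reach 240 GFLOPS: {match_xbox_360:.0f} MHz")
```

These work out to 364.8 GFLOPS, 192 GFLOPS and 625MHz respectively, in line with the figures in the paragraph above.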

The Tegra K1 GPU also touts DirectX 11 compatibility, and will probably also support OpenGL ES 3.0. NVIDIA showed us a demo of a Tegra K1 running a game simulation with the DX11-based Unreal Engine 4, and it just looked fantastic. Far ahead of anything we've ever seen on a mobile device. This is probably the first time when a mobile GPU's capability can really be called console-quality (almost every mobile GPU vendor makes that claim every year). While the GPUs on Tegras 2, 3 and 4 were a bit disappointing, Tegra K1 is exactly the innovation I was always expecting from NVIDIA in the ultra-mobile space. 

NVIDIA has, for the first time, come up with a mobile SoC that really pushes the boundaries of mobile processing. Its Denver cores will probably rival, if not outperform, the Apple A7's performance, and its 192-core Kepler GPU is downright amazing. Wrap that up with a 5W TDP, and you have just about the most impressive SoC to date. Now all NVIDIA has to do is get OEMs to release devices using the Tegra K1 before the competition catches up. Tegra 4's time-to-market wasn't bad, but its adoption wasn't very widespread, and Qualcomm's SoCs simply trumped the Tegra 4 in terms of OEM adoption. Hopefully NVIDIA will try to change that with the Tegra K1, maybe by releasing a Tegra K1 with an integrated Icera modem in an attempt to find its way into LTE-enabled smartphones. For the first time NVIDIA has industry-leading performance (previously Qualcomm held that title), so now it only needs to attract OEMs to use this fine silicon in their devices.