Before I begin, the report comes from DigiTimes and they cite anonymous sources for this story. As always, a grain of salt is required when dealing with this level of alleged leak.
That out of the way, rumor has it that Apple's A11 SoC has been taped out on TSMC's 10nm process node. This is still a little ways away from production, however. From here, TSMC should provide samples of the now-finalized chip in Q1 2017 and start production a few months later, with the chip landing in iOS devices somewhere in Q3/Q4. Knowing Apple, that will probably align with their usual release schedule, around September.
DigiTimes also reports that Apple will likely make split production a recurring habit. Currently, the A9 processor is fabricated at both TSMC and Samsung on two different process nodes (16nm at TSMC and 14nm at Samsung). They claim that two-thirds of A11 chips will come from TSMC.
Can we get more information on the Apple A10, and more information than just the process-node size? Ditto on the A11: more than just the process node, or more info than just the new apps that any new Apple phone may be getting.
How about any of the usual CPU core specifications, like the number of instruction decoders, the I$ and D$ sizes, and the L1, L2, and L3/L4 cache sizes (if any are used)? The execution unit info too: numbers of FP units, SIMD units, integer units, branch units (including reorder buffer size), memory load/store units, and the other units on the Apple A-series CPU cores.
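For what little is officially exposed: here is a minimal sketch in plain C, assuming macOS/iOS, that queries the handful of cache parameters Apple does publish through the sysctlbyname() interface. It is nowhere near the decoder/execution-unit detail being asked for, which rather proves the point; keys absent on a given chip simply return an error.

```c
/* Minimal sketch: querying the few cache parameters Apple does expose,
 * via sysctlbyname() on macOS/iOS. Keys missing on a given chip
 * (e.g. no L3) just fail and are reported as such. */
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void) {
    const char *keys[] = {
        "hw.cachelinesize", "hw.l1icachesize", "hw.l1dcachesize",
        "hw.l2cachesize",   "hw.l3cachesize",
    };
    for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
        int64_t val = 0;
        size_t len = sizeof(val);
        if (sysctlbyname(keys[i], &val, &len, NULL, 0) == 0)
            printf("%-18s %lld bytes\n", keys[i], (long long)val);
        else
            printf("%-18s (not reported on this chip)\n", keys[i]);
    }
    return 0;
}
```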
It’s pretty easy to find the info on the PowerVR graphics cores that Apple licenses from Imagination Technologies! But since the Apple A7’s introduction, NOT MUCH real CPU core info has come out on the A8/A8X, A9/A9X, or even the upcoming A10 core from Apple! And more info than just the number of A-series cores on any phone or tablet SKU, please!
Apple seems to keep most of the internal information a secret. There isn’t much we, or PCPer, can do about that.
Then all of you computer websites should go out there and find another Anand Lal Shimpi; at least he was smart enough to get the right benchmarking software and other software tools, plus the assembly-language and compiler writers’ optimization manuals for the ARMv8-A ISA, with which to write some test code to suss out the underlying hardware on Apple’s A7 and earlier Apple SKUs’ CPU cores! Hell, there are a few pay-walled publications that all the websites who leeched off of Anand’s work in the past could pool their resources on and get a subscription to, and keep their readers informed about the latest custom-designed ARMv8-A cores’ micro-architectures to a much better degree than they are doing now!
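For anyone curious what that kind of test code looks like, here is a minimal sketch (my own illustrative C, not Anand’s actual tooling): a pointer-chasing loop whose average load latency jumps as the working set spills out of each cache level, which is roughly how cache sizes get sussed out on undocumented cores. A serious version would randomize the chain to defeat hardware prefetchers.

```c
/* Minimal pointer-chasing sketch for estimating cache boundaries on an
 * undocumented core. Illustrative only: a real benchmark randomizes the
 * chain to defeat prefetchers. Assumes POSIX clock_gettime().
 * Build: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64  /* assumed cache-line size in bytes */

static double ns_per_load(size_t bytes, size_t iters) {
    size_t slots = bytes / sizeof(void *);
    size_t step  = LINE / sizeof(void *);
    void **buf = malloc(slots * sizeof(void *));
    /* Link every cache-line-sized slot into one closed loop. */
    for (size_t i = 0; i < slots; i += step)
        buf[i] = &buf[(i + step) % slots];
    void **p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;  /* each load depends on the previous result */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile void *sink = p;  /* keep the chase loop from being optimized away */
    (void)sink;
    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

int main(void) {
    /* Latency steps up as the working set exceeds each cache capacity. */
    for (size_t kb = 16; kb <= 16384; kb *= 2)
        printf("%6zu KiB: %6.2f ns/load\n", kb, ns_per_load(kb * 1024, 10000000));
    return 0;
}
```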
Apple is run by and for the Abject Retailers, and the retailer does not care, for even as long as a New York nanosecond, about technology or any technological discourse!
As an enthusiast with some computer architecture and design knowledge, I am interested in knowing those internals, but they are mostly worthless as far as deciding which to buy. They can give some idea of design efficiency: Intel gets better performance out of smaller caches, and they have always seemed to have better cache design than AMD. When you are buying a CPU these days, you are almost buying more of a memory chip than anything else, with the large multi-level caches and the integrated memory/system controller.
There are also a lot of features that make a difference which are not really part of the CPU core, like load and store reordering and prefetching schemes. Cache design is far from simple; size is only one metric to describe a cache system, and it doesn’t tell you as much as it used to due to all of the new features in modern CPUs. Performance is highly application-dependent, and there are a large number of design considerations that can introduce bottlenecks depending on the workload. Due to the complexity, it is always best to just run the applications you are interested in and see how they perform rather than trying to estimate something from cache metrics. There just isn’t much reason to know those internals anymore.
I only buy the OEM products that come with proper data sheets for their components! I need all of the CPU core micro-architectural features listed, with no dumbing down; just provide a proper glossary of terms and such, and I am off. This obsessive secrecy, kept up simply for reasons of cachet and style by some with a retailer’s mindset (Apple), does not go over well with me!
OEM PCs/laptops used to come with some very nice data sheets, but that practice has gone downhill to the point that it is impossible to comparison-shop for OEM PC/laptop products.
It’s most definitely not as simple as what you are saying; as far as the need to know goes, less openness is always a very bad thing! I need to know what software and drivers are on my systems, who is responsible for updating the graphics drivers (OEM or ODM!), and other facts! No proper, cogent manuals means no sale; ditto for any lack of CPU/GPU/APU data-sheet information or links!
The custom ARM CPU core market needs its own specialized reporting, with micro-architectural coverage at the level of a Hot Chips Symposium presentation/white paper or better! I need to know all the facts in order to compare and make an educated decision!
You lost me at “worthless”!
What about this ARMv8.2-A ISA feature set that Charlie D. over at S/A is babbling about?
From Wikipedia:
“ARMv8.2-A
Half-precision floating-point data processing
Memory model enhancements
Introduction of RAS (Reliability, Availability and Serviceability) features
Introduction of statistical profiling”
https://en.wikipedia.org/wiki/ARM_architecture#ARMv8.2-A
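To make the half-precision item concrete: earlier ARMv8 cores could load and store FP16 values but had to widen them to single precision for any arithmetic, while ARMv8.2-A adds native FP16 data processing. A minimal sketch using the NEON FP16 intrinsics, assuming a compiler and core supporting armv8.2-a+fp16:

```c
/* Minimal sketch of ARMv8.2-A half-precision arithmetic via NEON
 * intrinsics. Build with: gcc -march=armv8.2-a+fp16 -O2 fp16.c
 * On pre-v8.2 cores, FP16 was storage-only: values had to be widened
 * to float before any arithmetic. */
#include <stdio.h>
#include <arm_neon.h>

int main(void) {
    float16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float16_t b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float16_t c[8];

    float16x8_t va = vld1q_f16(a);       /* load 8 half-precision lanes */
    float16x8_t vb = vld1q_f16(b);
    float16x8_t vc = vaddq_f16(va, vb);  /* native FP16 add, new in v8.2-A */
    vst1q_f16(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%g ", (double)(float)c[i]);  /* widen only for printing */
    printf("\n");
    return 0;
}
```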
The SemiAccurate article looks like it is subscriber-only. I see it tagged with 512-bit SIMD, though. I don’t actually see that much point in adding vector units that wide to the CPU; almost any software that could take advantage of them could easily run on the GPU instead.
The access time on the GPU isn’t the same as on the CPU, and using the GPU without HSA on shared memory implies you have to move data between system memory and the dedicated video memory.
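For illustration, here is a minimal host-side OpenCL sketch in plain C of that round trip (the `square` kernel is just a stand-in, and error handling is omitted): one explicit copy into video memory before the kernel runs and one back afterwards, which is exactly the traffic HSA-style shared virtual memory is meant to eliminate.

```c
/* Minimal host-side OpenCL sketch of the explicit copies described
 * above: without shared virtual memory, input data is staged into
 * device memory and results are copied back. Error handling omitted.
 * Build: gcc gpu_copy.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void square(__global float *d) {"
    "    size_t i = get_global_id(0);"
    "    d[i] = d[i] * d[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    float host_buf[N];
    for (int i = 0; i < N; i++) host_buf[i] = (float)i;

    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context       ctx  = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q    = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program       prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel        k    = clCreateKernel(prog, "square", NULL);

    /* Copy #1: system memory -> dedicated video memory. */
    cl_mem dbuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(host_buf), NULL, NULL);
    clEnqueueWriteBuffer(q, dbuf, CL_TRUE, 0, sizeof(host_buf), host_buf, 0, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(dbuf), &dbuf);
    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);

    /* Copy #2: video memory -> system memory. This round trip is the
     * overhead that shared-memory/HSA designs avoid. */
    clEnqueueReadBuffer(q, dbuf, CL_TRUE, 0, sizeof(host_buf), host_buf, 0, NULL, NULL);
    printf("host_buf[2] = %g\n", (double)host_buf[2]);

    clReleaseMemObject(dbuf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```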
Keep up this discussion; there will be other non-pay-walled sites talking about this as the information leaks out. Keep on looking and posting! There is probably a lot more post-Cyclone (Apple A7) A-series CPU core information out there for the A8/A9 cores among the developer community, but they have to be careful with revealing any manuals, as manuals can be specially crafted to identify the leaker!
Apple hired Anand to cut off the flow of information that AnandTech was providing!
P.S. As per the YouTube discussions/interviews with Jim Keller hinting that there was much sharing of design ideas between the Keller-run Zen (x86) and custom K12 (ARMv8-A ISA) design teams: does anyone have any information about possible SMT capabilities engineered into the micro-architecture of AMD’s custom ARMv8-A-running K12 CPU core? SMT capabilities would mean a better IPC improvement for any custom ARMv8-A micro-architecture out there!
I would say that it is better to adopt HSA rather than trying to shoehorn wide vector units into the CPU. Designing the system to supply the level of streaming bandwidth such wide vector units would need at the CPU core is a waste of resources: large vector units take a lot of die area and power, and the interconnect for the necessary bandwidth takes a lot of both as well. GPU front ends are already optimized for streaming; most of that data does not need to be cached. I tend to think the CPU core should mostly implement scalar resources, with things like FMA and three-operand FPU instructions, optimized for low latency. Perhaps we will not get HSA with this generation of Apple chips, though. Eventually, we may have tight enough integration that the CPU vector units and the GPU units are the same hardware internally.
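As a concrete illustration of those scalar, low-latency resources, here is a minimal C sketch (nothing Apple-specific assumed): fma() computes a three-operand fused multiply-add with a single rounding step, and on ARMv8 it compiles to a single FMADD instruction.

```c
/* Minimal sketch of a fused multiply-add, the three-operand scalar FPU
 * instruction mentioned above. Fusing means one rounding step instead
 * of two, which the example below makes visible.
 * Build: gcc -O2 fma_demo.c -lm */
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 1.0 + ldexp(1.0, -29);   /* exactly 1 + 2^-29 */
    double unfused = x * x - 1.0;        /* product rounded before the subtract */
    double fused   = fma(x, x, -1.0);    /* single rounding keeps the 2^-58 term */
    printf("unfused = %.17g\n", unfused);
    printf("fused   = %.17g\n", fused);
    return 0;
}
```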
I think you missed something in the grand scheme of processor design.
HSA is a transient or cheap solution to merge the GPU and the CPU; however, you are wasting clock cycles asking the CPU to make the GPU work (via the northbridge), and wasting the programmer’s time on learning how to access the GPU through a proprietary API (Direct3D, OpenGL, etc.).
One day or another, the processor should have built-in vector units that manipulate a vector as one unique object (e.g. a 128-bit word) in system memory (already 128 bits wide thanks to dual channel), and compilers should have a specific instruction set with which to emit binaries that use vectors.
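For what it’s worth, compilers already expose something close to this model. A minimal sketch using the GCC/Clang vector extensions (the `v4f` name is my own): the 128-bit vector is handled as a single object, and the compiler lowers the arithmetic to NEON instructions on ARM or SSE on x86.

```c
/* Minimal sketch of treating a 128-bit vector as one object via the
 * GCC/Clang vector extensions; the compiler emits NEON (ARM) or SSE
 * (x86) instructions for the arithmetic. Build: gcc -O2 vec_demo.c */
#include <stdio.h>

typedef float v4f __attribute__((vector_size(16)));  /* 4 x 32-bit = 128 bits */

int main(void) {
    v4f a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4f b = {10.0f, 20.0f, 30.0f, 40.0f};
    v4f c = a * b + a;  /* one expression, four lanes at once */
    for (int i = 0; i < 4; i++)
        printf("%g ", (double)c[i]);
    printf("\n");
    return 0;
}
```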
NO, HSA is not “a transient or cheap solution”! The entire mobile market is very supportive of the development of, and the engineering/scientific precepts behind, HSA design, both in the hardware and in the software/OS/driver ecosystem for HSA-enabled systems. Even the VR game makers like the HSA types of GPU acceleration for the VR gaming market, where the CPU-to-GPU latency issue can be lessened simply by doing more non-graphics gaming compute on the GPU alongside the graphics compute already done there. That means more of the game’s code running on the GPU and getting around the latency induced by having to encode/decode data/code bound to and from the GPU over the PCIe protocol!
HSA will allow mobile devices to take the GPU and get even more processing done on the device’s SoC! The same goes for PC/laptop devices and VR/other games, as well as other software using the GPU for acceleration: spreadsheets, even spelling/grammar checking, accelerated on the GPU’s many parallel cores!
All that so-called wasted time will be abstracted away in newer HSA-aware hardware, hidden from application software in a manner similar to the way virtual memory/paging is abstracted away by the CPU’s hardware and the ring-0 software of the OS kernel (page-table memory management)!
GPUs are getting more and more of the once-traditional CPU types of functionality added to their ACE/other units, with some GPUs able to handle their own virtual memory page tables and even partition themselves into virtual GPU slices, in the GPU’s hardware/driver/APIs, to allow many users/OS/application instances to make use of a single GPU.
Even Nvidia has hired away(1) AMD’s HSA guru (Phil Rogers) to get in on the advantages of more compute done on the GPU! Nvidia did not hire Rogers for any tiddlywinks gaming knowledge that Rogers may or may not have!
(1)
http://www.anandtech.com/show/9717/amd-corporate-fellow-phil-rogers-leaves-company-joins-nvidia
Do you realize the complete nonsense in your talk?
The best workaround for the HyperTransport <=> northbridge <=> PCI-E latency issue is a unified architecture which ties the vector units to the CPU. Multiplying decoders and cache units is a waste of transistors!
Actually, HSA is transient and should be replaced by a unified architecture, in the same way the once-separate FPU coprocessor was absorbed into the CPU.
Any talk of interposer modules with narrow, higher-clocked, power-hungry interconnects will not fly, on power savings alone, compared to an APU/SoC built on an interposer module with a large number of CPU-to-GPU connection traces and the CPU’s cache subsystem tied directly into the GPU’s cache subsystem. The caches on both the CPU core complex and the GPU core complex could then be sufficiently merged that they behave LIKE a single unified cache memory system, one able to transfer entire cache lines between the CPU’s physical cache and the GPU’s physical cache over a wide enough number of etched traces in the silicon interposer. All this data transfer would happen in the background under control of the cooperating cache units in each respective processor (CPU, GPU), with none of that external PCIe encoding/decoding necessary.
The interposer then becomes, in effect, a monolithic die of sorts, with the interposer’s silicon able to be etched with tens of thousands of individual traces. That interposer technology is what is going to allow AMD, and any others, to create modular CPU, GPU, and other processor complexes from smaller, more manageable, higher-wafer-yield dies and join them together via the interposer’s silicon with the same number of traces as if they were made as one larger monolithic die. The technology is already in use for GPU-to-HBM connections, but it will be expanded further into APU/SoC-on-an-interposer usage, and AMD has the lead in using this interposer technology!
The Navi GPU will be modular, a scalability-based design that allows smaller GPU die units to be wired up in wide parallel fashion on the interposer, so GPU/APU interposer systems can be scaled efficiently and affordably from low-power to high-power configurations. There will even be bridge interposers to tie multiple GPU/APU/SoC interposer systems together and create interposer module complexes for exascale computing systems. The fewer encoding and decoding steps involved the better; having a cooperative cache system wired up directly via the interposer, tying all the various processor cores/complexes together, allows maximum latency reduction and latency hiding, with the directly wired cache systems managing any cache-to-cache and cache-to-HBM transfers for the entire interposer-hosted complex (APU, GPU, SoC, other) of processor modules.
It’s much better and faster to transfer cache line to cache line between processors, avoiding even some memory-associated latency for the highest-priority CPU-to-GPU transfers, and to let the cache units on each respective processor module/die manage the processor-to-processor data/code transfers and the cache-to-unified-memory management while the processor runs uninterrupted. With unified memory addressing (UMA) there should never be a wasteful need to move data memory to memory.
“Multiplying decoders and cache units is a waste of transistors!”
Each processor/complex is still going to need its own cache controller because of that module’s potential use as a stand-alone unit. So, for total flexibility in an APU-on-an-interposer design, each processor/die complex of CPU, GPU, or other stand-alone/modular clusters will have to have its own cache controller, so it can be used stand-alone or in modular cooperative form. And remember, each ASIC is going to have its own way of dealing with things internally on the die, with the cache unit of each respective processor type only able to transfer cache lines back and forth, though perhaps processing them differently. GPUs have much larger and more numerous caches.
Maybe once the ACE/equivalent units on other makers’ GPUs acquire most, if not all, of the functionality of a CPU completely merged with a GPU/vector processor, there can be talk of a single cache controller, but not for modular units that may have to be used stand-alone as well as in modular fashion. There could be an active-interposer design with a single cache controller complex etched into the interposer’s silicon, doing that work for all the GPU (vector unit) and CPU core complexes and the other processor dies hosted on the interposer, but then each unit/die would give up the stand-alone capability it could have kept by retaining its own cache system, a cache system still able to be wired up directly via a cache-line-to-cache-line interposer fabric and cooperate with the other cache controllers.