You’ve put your finger on the sore spot. In themselves, x86, ARM, and even other RISC CPUs don’t differ that much from each other. x86 as an instruction set also no longer really exists as such: it is increasingly a “CISC on RISC” translation inside the CPU, with extra ‘handy’ things like branch prediction, smart caches, and duplication of certain cheap units (for asynchronous fetches, for example) that make up for painful performance losses (mainly due to cache misses).
Performance per watt is a tricky metric in this context because it doesn’t scale linearly: that is also why undervolting is so popular with many people. You could take a high-end single-CCX Ryzen, tune it to draw 5 W at most (excluding things like I/O, memory, etc.), and get as much or more performance than the most high-end ARM chip in general-purpose compute. However, a CPU is so much more than a CPU nowadays. It’s an SoC.
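To put some rough numbers on that non-linearity, here’s a small toy model. It assumes the textbook approximation that dynamic power scales with C·V²·f, and (as a simplification) that performance scales linearly with clock speed. The function names and operating points are made up for illustration; they are not measurements of any real Ryzen or ARM chip.

```python
# Toy model of why performance per watt is not linear.
# Assumption: dynamic power ~ C * V^2 * f (textbook approximation).
# Assumption: performance is proportional to clock frequency.
# All voltages/frequencies below are hypothetical, not measured values.

def dynamic_power(voltage, freq_ghz, capacitance=1.0):
    """Relative dynamic power under the P ~ C * V^2 * f approximation."""
    return capacitance * voltage ** 2 * freq_ghz


def perf_per_watt(voltage, freq_ghz):
    """Relative performance per watt, taking performance == frequency."""
    return freq_ghz / dynamic_power(voltage, freq_ghz)


# Hypothetical operating points: (label, volts, GHz)
operating_points = [
    ("stock boost", 1.40, 5.0),
    ("mild undervolt", 1.10, 4.0),
    ("low-power tune", 0.80, 2.5),
]

for label, volts, ghz in operating_points:
    power = dynamic_power(volts, ghz)
    efficiency = perf_per_watt(volts, ghz)
    print(f"{label:15s} perf={ghz:.1f}  power={power:.2f}  perf/W={efficiency:.2f}")
```

In this toy model, halving the clock while dropping the voltage from 1.40 V to 0.80 V cuts power to roughly a sixth, so efficiency goes up sharply even though raw performance goes down. The cores themselves can be tuned to be very frugal; what remains is everything around them in the SoC.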
And Apple is very smart about that. With very limited exceptions, a Ryzen always comes with energy-guzzling I/O in the form of PCIe and memory. Memory in particular is very hard to time tightly enough once it sits outside the SoC itself, say as chips soldered onto the motherboard, or even in DIMM slots. Low-voltage DDR on the SoC has less distance to cover, and is therefore almost by definition more efficient.
The same goes for the GPU, although both AMD and Intel are making huge strides there (AMD more than Intel…): memory shared with the CPU, shared cache. More important, however, is the question of which tasks you offload to the GPU and which you leave to the CPU. Desktop compositing has long been one of those things a GPU does better than a CPU, but of course there is much more.
And that brings me to the step where I think we can save the most energy: tasks that are so specific and run so frequently that generic compute, or even a programmable microcontroller, isn’t the ideal solution. GPU compute, for that matter, is strong not so much in general-purpose compute as in small, simple tasks with a small compute footprint that can be performed thousands at a time in parallel: think of tile rendering or video decode, small sets that ‘flow’, as opposed to something like a database with large sets that can’t be divided. However, there is one more step beyond the GPU in optimizing for narrow tasks, and that is the ASIC. You can see certain neural-net-like workloads already heading there, but also things that, strangely enough, started out on an ASIC, moved to the CPU, and are now going back to an ASIC: networking, audio, or even something like a webcam. Many ASICs can even do video encode/decode better than a GPU. That’s what Apple is strong in now (and Microsoft, by the way: the HoloLens was brilliant in the number of ASICs it used for positioning). There is a lot of energy and performance to be gained here as well.
I think that further integration of chips through tiling/chiplets (more cache, memory, graphics), plus the addition of special-purpose compute, will be a significant step. I don’t think we’re going back to the era of this kind of add-in board (pricewatch: BFG Ageia PhysX), but I do think similar functionality will move into other dedicated modules to further offload the GPU and CPU, and thereby further increase performance per watt.