This article fails to mention the single biggest differentiator between x86 and ARM: their memory models. Considering the sheer amount of everyday software that is going multithreaded, this is a huge issue, and the reason why ARM drastically outperforms x86 running software like modern web browsers.
On the x86 architecture, RAM is used by the CPU and the GPU has a huge penalty when accessing main RAM. It therefore has onboard graphics memory.
On ARM this is unified so GPU and CPU can both access the same memory, at the same penalty. This means a huge class of embarrassingly parallel problems can be solved quicker on this architecture.
That’s actually not what I was referring to, although the unified memory architecture is certainly more power efficient for mixed-intensive workloads. The cost of transferring to/from dedicated GPU memory is (unsurprisingly) quite large.
Here is a great article on the topic. Basically, x86 spends a comparatively enormous amount of energy ensuring that its strong memory guarantees are not violated, even in cases where such violations would not affect program behavior. As it turns out, the majority of modern multithreaded programs only occasionally rely on these guarantees, and including special (expensive) instructions to provide these guarantees when necessary is still beneficial for performance/efficiency in the long run.
This article fails to mention the single biggest differentiator between x86 and ARM: their memory models. Considering the sheer amount of everyday software that is going multithreaded, this is a huge issue, and the reason why ARM drastically outperforms x86 running software like modern web browsers.
Do you mind elaborating what is it about the difference on their memory models that makes a difference?
On the x86 architecture, RAM is used by the CPU and the GPU has a huge penalty when accessing main RAM. It therefore has onboard graphics memory.
On ARM this is unified so GPU and CPU can both access the same memory, at the same penalty. This means a huge class of embarrassingly parallel problems can be solved quicker on this architecture.
That’s actually not what I was referring to, although the unified memory architecture is certainly more power efficient for mixed-intensive workloads. The cost of transferring to/from dedicated GPU memory is (unsurprisingly) quite large.
Here is a great article on the topic. Basically, x86 spends a comparatively enormous amount of energy ensuring that its strong memory guarantees are not violated, even in cases where such violations would not affect program behavior. As it turns out, the majority of modern multithreaded programs only occasionally rely on these guarantees, and including special (expensive) instructions to provide these guarantees when necessary is still beneficial for performance/efficiency in the long run.
For additional context, the special sauce behind Apple’s Rosetta 2 is that the M family of SoCs actually implement an x86 memory model mode that is selectively enabled when executing dynamically translated multithreaded x86 programs.