Created attachment 23722 [details]
Dumped LLVM module producing the example (same code as in godbolt)

Not-that-minimal example: https://godbolt.org/z/x8zMq4

This code contains nested loops that sum into 8 accumulators: zmm0, zmm1, zmm2, zmm5, zmm6, zmm7, zmm10, and zmm11. These accumulators are ultimately reduced into a scalar:

        vaddpd zmm1, zmm6, zmm1
        vaddpd zmm3, zmm7, zmm11
        vaddpd zmm1, zmm3, zmm1
        vaddpd zmm3, zmm5, zmm10
        vaddpd zmm0, zmm0, zmm2
        vaddpd zmm0, zmm0, zmm3
        vaddpd zmm0, zmm0, zmm1
        vextractf64x4 ymm1, zmm0, 1
        vaddpd zmm0, zmm0, zmm1
        vextractf128 xmm1, ymm0, 1
        vaddpd xmm0, xmm0, xmm1
        vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
        vaddsd xmm0, xmm0, xmm1

These nested loops are organized like

    for i in I1
        for j in J1
            for k in K
                ; vfmadd to accumulate
            end
        end
        for j in J2
            for k in K
                ; vfmadd to accumulate
            end
        end
        for j in J3
            ...
        end
    end
    for i in I2
        ...

In this example, the nesting depth is 3; I did not observe the problem when the depth was 2.

The problem is that for each of these sets of innermost loops, a different set of registers is chosen as the accumulation registers. Given that there are many such loops, this ends up requiring a huge number of registers and a lot of stack space, as well as a large number of move instructions to enforce the correspondence between the different choices.
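For concreteness, here is a rough C sketch of the kind of loop nest involved. Everything in it is hypothetical (the function name `kernel`, the arrays a/b/c/out, and the trip counts I1/I2/J/K); the actual reproducer is the attached module / godbolt link. The essential property is that the same eight vector accumulators stay live across many distinct innermost k-loops:

    /* Rough sketch of the problematic shape, with made-up names and trip
     * counts.  The same eight accumulators are live across every innermost
     * loop in the nest. */
    #include <immintrin.h>

    void kernel(const double *a, const double *b, const double *c,
                double *out, long I1, long I2, long J, long K)
    {
        __m512d acc[8];                          /* the 8 accumulators */
        for (int n = 0; n < 8; ++n)
            acc[n] = _mm512_setzero_pd();

        for (long i = 0; i < I1; ++i) {
            for (long j = 0; j < J; ++j)         /* first block of j-loops */
                for (long k = 0; k < K; ++k) {   /* innermost: fmadds only */
                    __m512d x = _mm512_set1_pd(a[i * K + k]);
                    for (int n = 0; n < 8; ++n)
                        acc[n] = _mm512_fmadd_pd(x, _mm512_loadu_pd(c + 8 * n), acc[n]);
                }
            for (long j = 0; j < J; ++j)         /* second block, same accumulators */
                for (long k = 0; k < K; ++k) {
                    __m512d y = _mm512_set1_pd(b[i * K + k]);
                    for (int n = 0; n < 8; ++n)
                        acc[n] = _mm512_fmadd_pd(y, _mm512_loadu_pd(c + 8 * n), acc[n]);
                }
        }
        for (long i = 0; i < I2; ++i) {
            /* ... further blocks of j/k loops feeding the same accumulators ... */
        }

        /* final horizontal reduction of the eight accumulators into a scalar */
        __m512d s = acc[0];
        for (int n = 1; n < 8; ++n)
            s = _mm512_add_pd(s, acc[n]);
        *out = _mm512_reduce_add_pd(s);
    }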
Here is an example innermost loop:

.LBB0_36: # %L2043
        vbroadcastsd zmm0, qword ptr [rax + 8*rbp + 8]
        vbroadcastsd zmm1, qword ptr [r13 + 8*rbp + 8]
        vfmadd231pd zmm14, zmm0, zmmword ptr [rsp + 1536] # 64-byte Folded Reload
        vfmadd231pd zmm15, zmm1, zmmword ptr [rsp + 1472] # 64-byte Folded Reload
        vfmadd231pd zmm12, zmm0, zmmword ptr [rsp + 1856] # 64-byte Folded Reload
        vfmadd231pd zmm13, zmm1, zmmword ptr [rsp + 1792] # 64-byte Folded Reload
        vfmadd231pd zmm9, zmm0, zmmword ptr [rsp + 1408] # 64-byte Folded Reload
        vfmadd231pd zmm8, zmm1, zmmword ptr [rsp + 3008] # 64-byte Folded Reload
        vfmadd231pd zmm4, zmm0, zmmword ptr [rsp + 2944] # 64-byte Folded Reload
        vfmadd231pd zmm3, zmm1, zmmword ptr [rsp + 1728] # 64-byte Folded Reload
        vfmadd231pd zmm14, zmm0, zmmword ptr [rsp + 1664] # 64-byte Folded Reload
        vfmadd231pd zmm15, zmm1, zmmword ptr [rsp + 3136] # 64-byte Folded Reload
        vfmadd231pd zmm12, zmm0, zmmword ptr [rsp + 3072] # 64-byte Folded Reload
        vfmadd231pd zmm13, zmm1, zmmword ptr [rsp + 1600] # 64-byte Folded Reload
        vfmadd231pd zmm9, zmm0, zmmword ptr [rsp + 3264] # 64-byte Folded Reload
        vfmadd231pd zmm8, zmm1, zmmword ptr [rsp + 3200] # 64-byte Folded Reload
        vfmadd231pd zmm4 {k1}, zmm0, zmmword ptr [rsp + 3392] # 64-byte Folded Reload
        vfmadd231pd zmm3 {k1}, zmm1, zmmword ptr [rsp + 3328] # 64-byte Folded Reload
        inc rbp
        vmovapd zmm19, zmm14
        vmovapd zmm18, zmm14
        vmovapd zmm17, zmm14
        vmovupd zmmword ptr [rsp + 704], zmm14 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 576], zmm14 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 448], zmm14 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 320], zmm14 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 192], zmm14 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 128], zmm14 # 64-byte Spill
        vmovapd zmm0, zmm14
        vmovapd zmm29, zmm15
        vmovapd zmm27, zmm15
        vmovapd zmm26, zmm15
        vmovupd zmmword ptr [rsp + 640], zmm15 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 512], zmm15 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 384], zmm15 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 256], zmm15 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 64], zmm15 # 64-byte Spill
        vmovupd zmmword ptr [rsp], zmm15 # 64-byte Spill
        vmovapd zmm2, zmm15
        vmovapd zmm30, zmm12
        vmovapd zmm28, zmm12
        vmovapd zmm31, zmm12
        vmovupd zmmword ptr [rsp + 1344], zmm12 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 1216], zmm12 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 1088], zmm12 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 960], zmm12 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 896], zmm12 # 64-byte Spill
        vmovapd zmm5, zmm12
        vmovapd zmm22, zmm13
        vmovapd zmm24, zmm13
        vmovapd zmm20, zmm13
        vmovupd zmmword ptr [rsp + 1280], zmm13 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 1152], zmm13 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 1024], zmm13 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 832], zmm13 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 768], zmm13 # 64-byte Spill
        vmovapd zmm10, zmm13
        vmovapd zmm23, zmm9
        vmovapd zmm16, zmm9
        vmovupd zmmword ptr [rsp + 2496], zmm9 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2368], zmm9 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2240], zmm9 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2112], zmm9 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2048], zmm9 # 64-byte Spill
        vmovapd zmm7, zmm9
        vmovapd zmm21, zmm8
        vmovapd zmm25, zmm8
        vmovupd zmmword ptr [rsp + 2432], zmm8 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2304], zmm8 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2176], zmm8 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 1984], zmm8 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 1920], zmm8 # 64-byte Spill
        vmovapd zmm11, zmm8
        vmovupd zmmword ptr [rsp + 2880], zmm4 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2816], zmm4 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2752], zmm4 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2688], zmm4 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2624], zmm4 # 64-byte Spill
        vmovupd zmmword ptr [rsp + 2560], zmm4 # 64-byte Spill
        vmovapd zmm6, zmm4
        vmovapd zmm1, zmm3
        cmp rbp, rbx
        jl .LBB0_36

Every single vmovupd and vmovapd above is unnecessary and should not exist, and the `vfmadd231pd`s should not be loading from the stack. The loop should simply be accumulating into `zmm`s 0, 1, 2, 5, 6, 7, 10, and 11. Instead, it reloads from the stack, accumulates into `zmm`s 14, 3, 15, 12, 4, 9, 13, and 8, and then `vmov(a/u)pd`s the results to a huge number of aliasing stack slots and registers, including of course the `zmm`s they should have been in all along: 0, 1, 2, 5, 6, 7, 10, and 11.

The inner loop should consist of nothing but the `vbroadcastsd`s, the `vfmadd231pd`s, and the `inc`, `cmp`, `jl`.

This generated code is over 6 times slower than a version that creates (zero-initializes) new accumulation vectors for each inner loop and then adds them to the final accumulation vectors. That version should be strictly slower, since it does extra work, but it works around this performance-killing bug.
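For reference, here is a minimal C sketch of that workaround, in the same hypothetical terms as the sketch earlier (the function name `j_block` and the arrays/trip counts are made up, not from the attached code): each innermost loop accumulates into freshly zeroed temporaries, which are folded into the long-lived accumulators only after the loop exits.

    #include <immintrin.h>

    /* One j/k block of the workaround: accumulate into fresh, zero-initialized
     * vectors, and only after the innermost loop fold them into the long-lived
     * accumulators.  This does strictly more work per block, but sidesteps the
     * spills and register shuffling shown above. */
    static void j_block(__m512d acc[8], const double *a, const double *c, long K)
    {
        __m512d tmp[8];
        for (int n = 0; n < 8; ++n)
            tmp[n] = _mm512_setzero_pd();        /* fresh accumulators per block */

        for (long k = 0; k < K; ++k) {           /* innermost loop touches only tmp[] */
            __m512d x = _mm512_set1_pd(a[k]);
            for (int n = 0; n < 8; ++n)
                tmp[n] = _mm512_fmadd_pd(x, _mm512_loadu_pd(c + 8 * n), tmp[n]);
        }

        for (int n = 0; n < 8; ++n)              /* extra adds per block */
            acc[n] = _mm512_add_pd(acc[n], tmp[n]);
    }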