When using VSX (any ppc64le, or ppc64 targeting pwr8), fptosi from f64 to i32 appears to round through an f32 intermediate, which loses significant bits because of f32's smaller mantissa. The following IR is produced by Rust:

; simd_cast::cast
; Function Attrs: uwtable
define internal void @_ZN9simd_cast4cast17h5261798b19538724E(<4 x i32>* noalias nocapture sret dereferenceable(16), <4 x double>* noalias nocapture dereferenceable(32) %v) unnamed_addr #0 {
start:
  %1 = load <4 x double>, <4 x double>* %v, align 32
  %2 = fptosi <4 x double> %1 to <4 x i32>
  store <4 x i32> %2, <4 x i32>* %0, align 16
  br label %bb1

bb1:                                              ; preds = %start
  ret void
}

That results in this asm:

	.section	.text._ZN9simd_cast4cast17h5261798b19538724E,"ax",@progbits
	.p2align	4
	.type	_ZN9simd_cast4cast17h5261798b19538724E,@function
_ZN9simd_cast4cast17h5261798b19538724E:
.Lfunc_begin14:
	.cfi_startproc
	li 5, 16
	lxvd2x 0, 4, 5
	xxswapd 0, 0
	lxvd2x 1, 0, 4
	xxswapd 1, 1
	xxmrgld 2, 0, 1
	xvcvdpsp 34, 2
	xxmrghd 0, 0, 1
	xvcvdpsp 35, 0
	vmrgew 2, 3, 2
	xvcvspsxws 34, 34
	stvx 2, 0, 3
	blr
	.long	0
	.quad	0
.Lfunc_end14:
	.size	_ZN9simd_cast4cast17h5261798b19538724E, .Lfunc_end14-.Lfunc_begin14
	.cfi_endproc

The xvcvdpsp rounds to f32, then xvcvspsxws converts to i32. That rounding step is explicit in PPCTargetLowering::combineElementTruncationToVectorTruncation, but it is an unsound optimization here since f32 has fewer significant bits in the mantissa. Using xvcvdpsxws instead for f64->i32 would probably be better.
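To illustrate the precision loss, here is a small standalone Rust sketch (not from the original report; the values are chosen for illustration). An f64 can hold 2^24 + 1 exactly, but an f32 cannot, so a cast routed through an f32 intermediate produces a different integer than a direct f64 -> i32 cast:

```rust
fn main() {
    // 2^24 + 1 = 16777217 is exactly representable in f64's 53-bit
    // mantissa, but not in f32's 24-bit mantissa.
    let x: f64 = 16_777_217.0;

    // Direct f64 -> i32 truncation keeps all significant bits.
    let direct = x as i32;

    // Going through an f32 intermediate (what the xvcvdpsp +
    // xvcvspsxws lowering effectively did) rounds to the nearest
    // f32 first, dropping the low bit.
    let via_f32 = (x as f32) as i32;

    assert_eq!(direct, 16_777_217);
    assert_eq!(via_f32, 16_777_216);
    println!("direct = {direct}, via f32 = {via_f32}");
}
```

This is the scalar analogue of what the vectorized lowering was doing per lane.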
Created attachment 20611 [details]
Temporary patch to disable the optimization

Here is a temporary patch to disable the optimization. In addition to disabling the f64->i32 vectorization, it also regresses some of the f32->i32 vectorizations, because in a few places the backend will extend f32 values to f64 before generating FCTIWZ nodes. If we want to use this patch, I could use some help fixing up the testcase (build-vector-tests.ll) this patch regresses, but I'm also open to alternative fixes. It would be good to get this fixed before the 7.0 release.
Thanks, that patch passes all of Rust's SIMD tests. Here's the new asm for the IR snippet:

	.section	.text._ZN9simd_cast4cast17h5261798b19538724E,"ax",@progbits
	.p2align	4
	.type	_ZN9simd_cast4cast17h5261798b19538724E,@function
_ZN9simd_cast4cast17h5261798b19538724E:
.Lfunc_begin14:
	.cfi_startproc
	li 5, 16
	lxvd2x 0, 4, 5
	xxswapd 0, 0
	lxvd2x 1, 0, 4
	xxswapd 1, 1
	fmr 2, 0
	xscvdpsxws 2, 2
	mfvsrwz 6, 2
	mtvsrwz 2, 6
	fmr 3, 2
	fmr 2, 1
	xscvdpsxws 2, 2
	mfvsrwz 6, 2
	mtvsrwz 2, 6
	fmr 4, 2
	xxmrghd 34, 3, 4
	xxswapd 0, 0
	fmr 2, 0
	xscvdpsxws 2, 2
	mfvsrwz 6, 2
	mtvsrwz 2, 6
	fmr 0, 2
	xxswapd 1, 1
	fmr 2, 1
	xscvdpsxws 2, 2
	mfvsrwz 6, 2
	mtvsrwz 2, 6
	fmr 1, 2
	xxmrghd 35, 0, 1
	vmrgow 2, 2, 3
	stvx 2, 0, 3
	blr
	.long	0
	.quad	0
.Lfunc_end14:
	.size	_ZN9simd_cast4cast17h5261798b19538724E, .Lfunc_end14-.Lfunc_begin14
	.cfi_endproc

So it is using xscvdpsxws now, but it looks like it converts just one element at a time.
Thanks, Tom. We definitely need to remove the code that adds the rounding. I'll put up a patch that does this and corrects for the regression (scalarization of this code).
Fix in https://reviews.llvm.org/D50121. Please verify that it fixes the issue in Rust without building the vector one element at a time.
The fix is approved on Phabricator. I'm just waiting to get confirmation that it fixes the Rust issues; if/when I receive such confirmation, I'll commit the fix.
With D50121, the originally reported IR generates tidy vectorized code:

	li 5, 16
	lxvd2x 0, 4, 5
	xxswapd 0, 0
	lxvd2x 1, 0, 4
	xxswapd 1, 1
	xxmrgld 2, 0, 1
	xvcvdpsxws 34, 2
	xxmrghd 0, 0, 1
	xvcvdpsxws 35, 0
	vmrgew 2, 3, 2
	stvx 2, 0, 3
	blr

It produces the desired result, and Rust's full SIMD tests pass too. Thanks!
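As a quick per-lane sanity check, here is a hypothetical Rust snippet (not part of the original test suite; the values are my own picks) exercising inputs that the old f32-intermediate lowering would have miscast. It spells out the element-wise f64 -> i32 semantics that the vectorized xvcvdpsxws lowering must match:

```rust
fn main() {
    // Lanes chosen so that an f32 intermediate would lose bits:
    // 2^24 + 1 (and its negation) exceed f32's 24-bit mantissa,
    // 2^31 - 65 sits between representable f32 values, and 0.5
    // checks truncation toward zero.
    let v: [f64; 4] = [16_777_217.0, -16_777_217.0, 2_147_483_583.0, 0.5];

    // Element-wise f64 -> i32, the reference semantics for the
    // vector fptosi in the reported IR.
    let casted: [i32; 4] = [v[0] as i32, v[1] as i32, v[2] as i32, v[3] as i32];

    assert_eq!(casted, [16_777_217, -16_777_217, 2_147_483_583, 0]);
    println!("{casted:?}");
}
```

All four lanes are in i32 range, so the scalar casts here are exact truncations with no saturation involved.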
Merged to 7.0.0 in r338678