Console View

Rob N ★
chapoly: functional tests

Signed-off-by: Rob N ★ <robn@despairlabs.com>

Pull-request: #14249 part 5/5
Rob N ★
chapoly: hook up all the crypto

Signed-off-by: Rob N ★ <robn@despairlabs.com>

Pull-request: #14249 part 4/5
Rob N ★
chapoly: stub KCF provider (no-op)

Is this the smallest possible "encryption" provider? Maybe!

Signed-off-by: Rob N ★ <robn@despairlabs.com>

Pull-request: #14249 part 3/5
Rob N ★
chapoly: initial plumbing

Signed-off-by: Rob N ★ <robn@despairlabs.com>

Pull-request: #14249 part 2/5
Rob N ★
monocypher: import 3.1.3

This removes everything except the very nice compare and clear
functions, the bits of Poly1305 and ChaCha we need, and their supporting
code, plus a couple of very minor adjustments to make it build cleanly.

Signed-off-by: Rob N ★ <robn@despairlabs.com>

Pull-request: #14249 part 1/5
Richard Yao
Micro-optimize fletcher4 calculations

When processing abds, we execute one `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance relative to what the benchmark claimed. We correct this by
moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to
evaluate the loop condition on the first loop iteration, which allows us
to micro-optimize the checksum calculations by switching to do-while
loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against zero-sized buffers being passed.
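
A minimal sketch of the loop change, using illustrative names rather
than the actual fletcher4 sources:

    #include <stdint.h>

    extern void accumulate(const uint64_t *ip);

    /*
     * Before: the loop condition is evaluated before the first
     * iteration, even though callers never pass a zero-sized buffer.
     */
    static void
    checksum_while(const uint64_t *ip, const uint64_t *ipend)
    {
            while (ip < ipend)
                    accumulate(ip++);
    }

    /*
     * After: the caller already checked the length against 0, so
     * ip < ipend holds on entry and a do-while loop skips the
     * initial test, saving a branch.
     */
    static void
    checksum_dowhile(const uint64_t *ip, const uint64_t *ipend)
    {
            do {
                    accumulate(ip++);
            } while (ip < ipend);
    }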

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14247 part 1/1
Richard Yao
Micro-optimize fletcher4 calculations

When processing abds, we execute one `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance relative to what the benchmark claimed. We correct this by
moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
checksum functions. This means that we do not need to evaluate the loop
condition on the first loop iteration, which allows us to micro-optimize
the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against zero-sized buffers being passed.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14247 part 1/1
Richard Yao
Micro-optimize fletcher4 assembly routines

When processing abds, we execute one `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance relative to what the benchmark claimed. We correct this by
moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
checksum functions. This means that we do not need to evaluate the loop
condition on the first loop iteration, which allows us to micro-optimize
the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against zero-sized buffers being passed.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14247 part 1/1
Richard Yao
Linux PPC: Fix build failures on kernels built without CONFIG_SPE

Closes #14233
Reported-by: Rich Ercolani <Rincebrain@gmail.com>
Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14244 part 1/1
Richard Yao
PowerPC: Fix build failures on kernels built without CONFIG_SPE

Closes #14233
Reported-by: Rich Ercolani <Rincebrain@gmail.com>
Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14244 part 1/1
Richard Yao
PowerPC: Fix build failures on kernels built without CONFIG_SPE

Closes #14233
Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14244 part 1/1
  • Ubuntu 18.04 i386 (BUILD): cloning zfs -  stdio
George Wilson
nopwrites on dmu_sync-ed blocks can result in a panic

After a device has been removed, any nopwrites for blocks on that
indirect vdev should be ignored and a new block should be allocated. The
original code attempted to handle this but used the wrong block pointer
when checking for indirect vdevs and failed to check all DVAs.

This change corrects both of these issues and modifies the test case
to ensure that it properly tests nopwrites with device removal.
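
A hedged sketch of the corrected check follows; the wrapper function is
illustrative, while the macros and symbols are standard ZFS
identifiers:

    #include <sys/spa.h>
    #include <sys/vdev_impl.h>

    /*
     * Sketch only, not the actual patch: scan every DVA of the
     * original block pointer and report whether any of them lands on
     * an indirect (removed) vdev, in which case the nopwrite must be
     * ignored and a new block allocated.
     */
    static boolean_t
    bp_touches_indirect_vdev(spa_t *spa, const blkptr_t *bp)
    {
            for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
                    uint64_t vdevid = DVA_GET_VDEV(&bp->blk_dva[d]);
                    vdev_t *vd = vdev_lookup_top(spa, vdevid);

                    if (vd != NULL && vd->vdev_ops == &vdev_indirect_ops)
                            return (B_TRUE);
            }
            return (B_FALSE);
    }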

Signed-off-by: George Wilson <gwilson@delphix.com>

Pull-request: #14235 part 1/1
Richard Yao
Add generic fletcher4 vector implementation

This is written using GNU C's generic vector extensions to the C
standard and is based on the AVX512 implementation. Unlike the AVX512
implementation, it compiles on all supported architectures, although
initially it only generates vector instructions in the kernel on
x86/x86_64, aarch64 and power64.
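
As a rough sketch of the technique, assuming hypothetical type and
function names (the real sources differ):

    #include <stdint.h>

    typedef uint64_t v8u64 __attribute__((vector_size (64)));

    /*
     * Sketch: carry the four fletcher4 running sums in 8-lane
     * vectors; element-wise += on GNU C vector types lowers to SIMD
     * adds wherever the target supports them.
     */
    static void
    fletcher4_vec_chunk(v8u64 *a, v8u64 *b, v8u64 *c, v8u64 *d,
        const uint32_t *ip, const uint32_t *ipend)
    {
            do {
                    v8u64 t;

                    for (int i = 0; i < 8; i++)
                            t[i] = ip[i];   /* zero-extend 32 -> 64 */
                    *a += t;
                    *b += *a;
                    *c += *b;
                    *d += *c;
            } while ((ip += 8) < ipend);
    }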

Other architectures receive superscalar8. On x86 and x86_64, builds with
GCC versions older than GCC 4.9.0 will also receive superscalar8. On
x86, x86_64 and ppc64, Clang versions older than 5.0.0 will also receive
superscalar8. On ppc64le, Clang versions older than 3.8.0 will receive
superscalar8. Note that on older Clang versions, superscalar will be
mislabeled as vector code. The information here about Clang versions is
mostly for reference: I do not expect anyone to build the code with a
version of LLVM/Clang that old, and I am not sure we even support it.

In userspace, we are more likely to see vector instructions
automatically generated for other architectures, but this depends on
what the compiler is willing to use by default, since we make no attempt
to force vector instruction generation on other architectures.
Incidentally, the compilers do not give us a generic way of doing that
across all architectures.

Supporting vector instructions on an architecture requires using
compiler pragma directives. At present, there are directives for both
GCC and Clang. ICC was tested and was found to need its own directives,
which were not added, so anyone compiling ZFS with ICC will not have
vector instructions emitted inside the kernel.
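
A sketch of what such directives can look like for an AVX2 target; the
exact pragmas used in the patch may differ:

    #if defined(__GNUC__) && !defined(__clang__)
    #pragma GCC push_options
    #pragma GCC target("avx2")
    #elif defined(__clang__)
    #pragma clang attribute push(__attribute__((target("avx2"))), \
        apply_to = function)
    #endif

    /* ... vectorized fletcher4 functions ... */

    #if defined(__GNUC__) && !defined(__clang__)
    #pragma GCC pop_options
    #elif defined(__clang__)
    #pragma clang attribute pop
    #endif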

The code had to work around the following GCC bug to get good assembly
generation from GCC:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

This was done by manually reimplementing the 512-bit generic vector
operations in terms of smaller 128-bit or 256-bit vector operations
depending on the machine's native SIMD width. For x86/x86_64, that means
256-bit vector operations while for POWER, that means 128-bit vector
operations. When Clang is the compiler, we do not need this workaround,
so the preprocessor will give it the full 512-bit generic vector
version.
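
Roughly, the workaround takes this shape (hypothetical names, shown for
a 256-bit native width):

    #include <stdint.h>

    typedef uint64_t v4u64 __attribute__((vector_size (32)));

    typedef struct {
            v4u64 lo;
            v4u64 hi;
    } v8u64_pair;

    /*
     * Sketch: one 512-bit generic vector add expressed as two
     * native-width 256-bit adds, sidestepping GCC bug 107916.
     */
    static inline v8u64_pair
    v8_add(v8u64_pair x, v8u64_pair y)
    {
            x.lo += y.lo;
            x.hi += y.hi;
            return (x);
    }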

For x86 and x86_64, AVX2 will be generated. GCC is actually unable to
generate good assembly on its own, so a single handwritten vpmovzxdq
instruction was added to fix that. With this improvement, the code runs
faster than the Intel-written AVX2 code in a KVM guest on Zen 3:

0 0 0x01 -1 0 770448447 7874698323
implementation  native         byteswap
scalar          9674935874     9342370616
superscalar     12389694710    12378469202
superscalar4    13417221678    12187713211
generic-avx2    47470237051    39416631033
sse2            23714943741    11366615899
ssse3           23692572459    20166280422
avx2            41473313254    37962243248
fastest         generic-avx2   generic-avx2

That is possible because the total theoretical bandwidth is
51.2 GB/sec and, according to Agner Fog, Zen 3 is capable of performing
four 256-bit vpaddq operations per cycle; i.e., superscalar SIMD is
possible.

It is thought that the byteswap case fails to see much improvement
because of a hardware limitation that restricts shuffle instructions to
only two execution ports, down from the four available to additions. The
CPU pipeline starts the next iteration before the current iteration has
finished to try to occupy all four execution ports. In the native case,
this works since we only do additions. In the byteswap case, we must do
a shuffle, for which we only have two execution ports. This causes us to
hit a pipeline hazard that the CPU handles by inserting bubbles, which
reduces execution port occupancy. Thus we do slightly better than we did
with the older avx2 byteswap case, since we keep the execution ports
occupied slightly more, but not by much due to the pipeline hazard.

For POWER, VSX instructions will be generated for POWER8 and later.
Either GCC 12 or later or Clang is needed to generate good assembly
output.

For aarch64, NEON instructions will be generated.

Support for ARM was considered, but supporting aarch64 was deemed
enough, so I did not implement it.

Support for SPARC was also considered, but its SIMD instructions have a
maximum width of 64 bits. We need greater widths to perform multiple
64-bit additions simultaneously, so it was not possible to use its SIMD
instructions. The result is that the best we can do on SPARC is the
superscalar code, which this implementation resembles when built without
vector instructions.

Support for MIPS was also considered, but MIPS Release 5 is required for
that, almost no hardware implements MIPS Release 5, and it was not
apparent how to get the compilers to generate SIMD instructions for it.
It would not surprise me if the compilers do not support MIPS Release
5's SIMD instructions at all.

Support for Loongson was also considered, but a lack of documentation
(despite it being MIPS-based) prevented it from being done.

Support for S390 was also considered, but it required compiling for the
rather new Z13, and the compilers did not appear to emit vector
instructions for it. Like for MIPS, it is quite possible that the
compilers do not know how to emit them.

Support for RISC-V was also considered. While the RISC-V Vector Extension
was ratified in 2021, it does not appear to be supported by compilers at
this time.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14234 part 1/1
Richard Yao
Add generic fletcher4 vector implementation

This is written using GNU C's generic vector extensions to the C
standard and is based on the AVX512 implementation. Unlike the AVX512
implementation, it compiles on all supported architectures, although
initially it only generates vector instructions in the kernel on
x86/x86_64, aarch64 and power64.

Other architectures receive superscalar8. On x86 and x86_64, builds with
GCC versions older than GCC 4.9.0 will also receive superscalar8. On
x86, x86_64 and ppc64, Clang versions older than 5.0.0 will also receive
superscalar8. On ppc64le, Clang versions older than 3.8.0 will receive
superscalar8. Note that on older Clang versions, superscalar will be
mislabeled as vector code. The information here about Clang versions is
mostly for reference: I do not expect anyone to build the code with a
version of LLVM/Clang that old, and I am not sure we even support it.

In userspace, we are more likely to see vector instructions
automatically generated for other architectures, but this depends on
what the compiler is willing to use by default, since we make no attempt
to force vector instruction generation on other architectures.
Incidentally, the compilers do not give us a generic way of doing that
across all architectures.

Supporting vector instructions on an architecture requires using
compiler pragma directives. At present, there are directives for both
GCC and Clang. ICC was tested and was found to need its own directives,
which were not added, so anyone compiling ZFS with ICC will not have
vector instructions emitted inside the kernel.

The code had to work around the following GCC bug to get good assembly
generation from GCC:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

This was done by manually reimplementing the 512-bit generic vector
operations in terms of smaller 128-bit or 256-bit vector operations
depending on the machine's native SIMD width. For x86/x86_64, that means
256-bit vector operations while for POWER, that means 128-bit vector
operations. When Clang is the compiler, we do not need this workaround,
so the preprocessor will give it the full 512-bit generic vector
version.

For x86 and x86_64, AVX2 will be generated. GCC is actually unable to
generate good assembly on its own, so a single handwritten vpmovzxdq
instruction was added to fix that. With this improvement, the code runs
faster than the Intel-written AVX2 code in a KVM guest on Zen 3:

0 0 0x01 -1 0 770448447 7874698323
implementation  native         byteswap
scalar          9674935874     9342370616
superscalar     12389694710    12378469202
superscalar4    13417221678    12187713211
generic-avx2    47470237051    39416631033
sse2            23714943741    11366615899
ssse3           23692572459    20166280422
avx2            41473313254    37962243248
fastest         generic-avx2   generic-avx2

That is possible because the total theoretical bandwidth is
51.2 GB/sec and, according to Agner Fog, Zen 3 is capable of performing
four 256-bit vpaddq operations per cycle; i.e., superscalar SIMD is
possible.

It is thought that the byteswap case fails to see much improvement
because of a hardware limitation that restricts shuffle instructions to
only two execution ports, down from the four available to additions. The
CPU pipeline starts the next iteration before the current iteration has
finished to try to occupy all four execution ports. In the native case,
this works since we only do additions. In the byteswap case, we must do
a shuffle, for which we only have two execution ports. This causes us to
hit a pipeline hazard that the CPU handles by inserting bubbles, which
reduces execution port occupancy. Thus we do slightly better than we did
with the older avx2 byteswap case, since we keep the execution ports
occupied slightly more, but not by much due to the pipeline hazard.

For POWER, VSX instructions will be generated for POWER8 and later.
Either GCC 12 or later or Clang is needed to generate good assembly
output.

For aarch64, NEON instructions will be generated.

Support for ARM was considered, but supporting aarch64 was deemed
enough, so I did not implement it.

Support for SPARC was also considered, but its SIMD instructions have a
maximum width of 64 bits. We need greater widths to perform multiple
64-bit additions simultaneously, so it was not possible to use its SIMD
instructions. The result is that the best we can do on SPARC is the
superscalar code, which this implementation resembles when built without
vector instructions.

Support for MIPS was also considered, but MIPS Release 5 is required for
that, almost no hardware implements MIPS Release 5, and it was not
apparent how to get the compilers to generate SIMD instructions for it.
It would not surprise me if the compilers do not support MIPS Release
5's SIMD instructions at all.

Support for Loongson was also considered, but a lack of documentation
(despite it being MIPS-based) prevented it from being done.

Support for S390 was also considered, but it required compiling for the
rather new Z13, and the compilers did not appear to emit vector
instructions for it. Like for MIPS, it is quite possible that the
compilers do not know how to emit them.

Support for RISC-V was also considered. While the RISC-V Vector Extension
was ratified in 2021, it does not appear to be supported by compilers at
this time.

Lastly, to make things easier for the Windows port, we preemptively
disable this implementation by checking for _MSC_VER, since it is known
that MSVC is not able to compile GNU C's vector extensions.
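
A minimal sketch of that guard; the macro name is hypothetical:

    /*
     * MSVC cannot compile GNU C's vector extensions, so compile the
     * generic vector implementation out preemptively on Windows.
     */
    #if defined(_MSC_VER)
    #define HAVE_GENERIC_FLETCHER_VECTOR 0
    #else
    #define HAVE_GENERIC_FLETCHER_VECTOR 1
    #endif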

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14234 part 1/1
Richard Yao
Add generic fletcher4 vector implementation

This is written using GNU C's generic vector extensions to the C
standard and is based on the AVX512 implementation. Unlike the AVX512
implementation, it compiles on all supported architectures, although
initially it only generates vector instructions in the kernel on x86,
x86_64 and power64, when either AVX2 or VSX is available.

Other architectures receive superscalar8. On x86 and x86_64, builds with
GCC versions older than GCC 4.9.0 will also receive superscalar8. On
x86, x86_64 and ppc64, Clang versions older than 5.0.0 will also receive
superscalar8. On ppc64le, Clang versions older than 3.8.0 will receive
superscalar8. Note that on older Clang versions, superscalar will be
mislabeled as vector code. The information here about Clang versions is
mostly for reference: I do not expect anyone to build the code with a
version of LLVM/Clang that old, and I am not sure we even support it.

In userspace, we are more likely to see vector instructions
automatically generated for other architectures, but this depends on
what the compiler is willing to use by default, since we make no attempt
to force vector instruction generation on other architectures.
Incidentally, the compilers do not give us a generic way of doing that
across all architectures.

Supporting vector instructions on an architecture requires using
compiler pragma directives. At present, there are directives for both
GCC and Clang. ICC was tested and was found to need its own directives,
which were not added, so anyone compiling ZFS with ICC will not have
vector instructions emitted inside the kernel.

The code had to work around the following GCC bug to get good assembly
generation from GCC:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

This was done by manually reimplementing the 512-bit generic vector
operations in terms of smaller 128-bit or 256-bit vector operations
depending on the machine's native SIMD width. For x86/x86_64, that means
256-bit vector operations while for POWER, that means 128-bit vector
operations. When Clang is the compiler, we do not need this workaround,
so the preprocessor will give it the full 512-bit generic vector
version.

For x86 and x86_64, AVX2 will be generated. GCC is actually unable to
generate good assembly on its own, so a single handwritten vpmovzxdq
instruction was added to fix that. With this improvement, the code runs
faster than the Intel-written AVX2 code in a KVM guest on Zen 3:

0 0 0x01 -1 0 770448447 7874698323
implementation  native         byteswap
scalar          9674935874     9342370616
superscalar     12389694710    12378469202
superscalar4    13417221678    12187713211
generic-avx2    47470237051    39416631033
sse2            23714943741    11366615899
ssse3           23692572459    20166280422
avx2            41473313254    37962243248
fastest         generic-avx2   generic-avx2

That is possible because the total theoretical bandwidth is
51.2 GB/sec and, according to Agner Fog, Zen 3 is capable of performing
four 256-bit vpaddq operations per cycle; i.e., superscalar SIMD is
possible.

It is thought that the byteswap case fails to see much improvement
because of a hardware limitation that restricts shuffle instructions to
only two execution ports, down from the four available to additions. The
CPU pipeline starts the next iteration before the current iteration has
finished to try to occupy all four execution ports. In the native case,
this works since we only do additions. In the byteswap case, we must do
a shuffle, for which we only have two execution ports. This causes us to
hit a pipeline hazard that the CPU handles by inserting bubbles, which
reduces execution port occupancy. Thus we do slightly better than we did
with the older avx2 byteswap case, since we keep the execution ports
occupied slightly more, but not by much due to the pipeline hazard.

For POWER, VSX instructions will be generated for POWER8 and later.

Support for AARCH64/ARM was considered, but the build system passes
-mgeneral-regs-only, and upon seeing this flag, both GCC and Clang
refuse to emit NEON instructions. This is incompatible with the idea of
selectively enabling them on key functions, so support for ARM was not
implemented.
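
A minimal sketch of the conflict, assuming -mgeneral-regs-only is in
the arm64 kernel CFLAGS; the per-function override attempt shown is
hypothetical:

    /*
     * With -mgeneral-regs-only in effect, both GCC and Clang reject
     * code that needs SIMD registers, even when a per-function
     * opt-in is attempted.
     */
    typedef unsigned long long v2u64 __attribute__((vector_size (16)));

    __attribute__((target("+simd")))    /* per-function opt-in attempt */
    static v2u64
    add2(v2u64 a, v2u64 b)
    {
            return (a + b); /* still refused under -mgeneral-regs-only */
    }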

Support for SPARC was also considered, but its SIMD instructions have a
maximum width of 64 bits. We need greater widths to perform multiple
64-bit additions simultaneously, so it was not possible to use its SIMD
instructions. The result is that the best we can do on SPARC is the
superscalar code, which this implementation resembles when built without
vector instructions.

Support for MIPS was also considered, but MIPS Release 5 is required for
that, almost no hardware implements MIPS Release 5, and it was not
apparent how to get the compilers to generate SIMD instructions for it.
It would not surprise me if the compilers do not support MIPS Release
5's SIMD instructions at all.

Support for Loongson was also considered, but a lack of documentation
(despite it being MIPS-based) prevented it from being done.

Support for S390 was also considered, but it required compiling for the
rather new Z13, and the compilers did not appear to emit vector
instructions for it. Like for MIPS, it is quite possible that the
compilers do not know how to emit them.

Support for RISC-V was also considered. While the RISC-V Vector Extension
was ratified in 2021, it does not appear to be supported by compilers at
this time.

Lastly, to make things easier for the Windows port, we preemptively
disable this implementation by checking for _MSC_VER, since it is known
that MSVC is not able to compile GNU C's vector extensions.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14234 part 1/1
  • Debian 8 ppc64 (BUILD): cloning zfs -  stdio
Richard Yao
Add generic fletcher4 vector implementation

This is written using GNU C's generic vector extensions to the C
standard and is based on the AVX512 implementation. However, unlike the
AVX512 implementation, this compiles on all supported architectures.
However, initially, it only generates vector instructions in the kernel
on x86, x86_64 and power64 when either AVX2 is avaliable or VSX is
avaliable.

Other architectures receive superscalar8. On x86 and x86_64, builds with
GCC versions older than GCC 4.9.0 will also receive superscalar8. On
x86, x86_64 and ppc64, Clang versions older than 5.0.0 will also receive
superscalar8. On ppc64le, Clang versions older than 3.8.0 will receive
superscalar8. Note that on older Clang versions, superscalar will be
mislabeled as vector code. The information here about Clang versions is
mostly informative. I do not expect anyone to try to build the code with
a version of LLVM/Clang that old and I am not sure if we even support
it.

In userspace, we are more likely to see vector instructions
automatically generated for other architectures, but this depends on
what the compiler is willing to use by default, since we make no attempt
to force vector instruction generation on other architectures.
Coincidentally, the compilers do not give us a generic way of doing that
across all architectures.

Supporting vector instructions on an architecture requires using
compiler pragma directives. At present, there are directives for both
GCC and Clang. ICC was tested and was found to need its own directives,
which were not added, so anyone compiling ZFS with ICC will not have
vector instructions emitted inside the kernel.

The code had to workaround the following GCC bug to get good assembly
generation from GCC:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

This was done by manually reimplementing the 512-bit generic vector
operations in terms of smaller 128-bit or 256-bit vector operations
depending on the machine's native SIMD width. For x86/x86_64, that means
256-bit vector operations while for POWER, that means 128-bit vector
operations. When Clang is the compiler, we do not need this workaround,
so the preprocessor will give it the full 512-bit generic vector
version.

For x86 and x86_64, AVX2 will be generated. GCC is actually unable to
generate good assembly on its own, so a single handwritten vpmovzxdq
instruction was added to fix that. With this improvement, the code runs
faster than the Intel written avx2 code in a KVM guest on Zen 3:

0 0 0x01 -1 0 770448447 7874698323
implementation  native        byteswap
scalar          9674935874    9342370616
superscalar      12389694710    12378469202
superscalar4    13417221678    12187713211
generic-avx2    47470237051    39416631033
sse2            23714943741    11366615899
ssse3            23692572459    20166280422
avx2            41473313254    37962243248
fastest          generic-avx2  generic-avx2

That is possible because the total theoretical bandwidth is
51.2GB/sec and Zen 3 is capable of performing four 256-bit vpaddq
operations per cycle according to Agner Fog. i.e. superscalar SIMD is
possible.

It is thought that the byteswap case fails to see much improvement
because of a hardware limitation that restricts shuffle instructions to
only two execution ports down from the four avaliable to additions. The
CPU pipeline starts the next iteration before the current iteration has
finished to to try occupy all 4 execution ports. In the native case,
this works since we only do additions. In the byteswap case, we must do
a shuffle for which we only have 2 execution ports. This causes us to
hit a pipeline hazard that the CPU handles by inserting bubbles, which
reduces execution port occupancy. Thus we do slightly better than we did
with the older avx2 byteswap case since we keep the execution ports
occupied slightly more, but not by much due to the pipeline hazard.

For POWER, VSX instructions will be generated for POWER8 and later.

Support for AARCH64/ARM was considered, but the build system passes
-mgeneral-regs-only and upon seeing this, both GCC and Clang refuse to
emit NEON instructions. This is incompatible with the idea of
selectively enabling them on key functions, so support for ARM was not
implemented.

Support for SPARC was also considered, but its SIMD instructions have a
maximum 64-bit width. We need greater widths to be able to do 64-bit
additions to be done simultaneously, so it was not possible to use its
SIMD instructions. The result is that the best possible on SPARC is the
superscalar code, which this resembles without vector instructions.

Support for MIPS was also considered, but MIPS Release 5 is required for
that, almost no hardware implements MIPS Release 5 and it was not
apparent how to get the compilers to generate SIMD instructions for it.
It would not surprise me if the compilers do not support MIPS Release
5's SIMD instructions at all.

Support for Loongson was also considered, but a lack of documentation
(depsite it being MIPS-based) prevented it from being done.

Support for S390 was also considered, but it required compiling for the
rather new Z13, and the compilers did not appear to emit vector
instructions for it. Like for MIPS, it is quite possible that the
compilers do not know how to emit them.

Support for RISC-V was also considered. While the RISC-V Vector Extension
was ratified in 2021, it does not appear to be supported by compilers at
this time.

Lastly, to make things easier for the Windows port, we preemptively
disable this implementation by checking for _MSC_VER, since it is known
that MSVC is not able to compile GNU C's vector extensions.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>

Pull-request: #14234 part 1/1
  • Debian 8 ppc64 (BUILD): cloning zfs -  stdio
  • Ubuntu 18.04 i386 (BUILD): cloning zfs -  stdio
Jorgen Lundman
Unify Assembler files between Linux and Windows

Add a new macro, ASMABI, used by Windows to change the
calling convention to "sysv_abi".
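
Roughly, the idea looks like the sketch below; the actual definition in
the tree may differ, and the prototype is a made-up example:

    /* On Windows x86_64 builds, pin assembler-facing functions to
     * the System V AMD64 calling convention so the same .S files
     * work as they do on Linux. */
    #if defined(_WIN32)
    #define ASMABI __attribute__((sysv_abi))
    #else
    #define ASMABI
    #endif

    void ASMABI example_transform_asm(void *ctx, const void *data,
        int blks);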

Signed-off-by: Jorgen Lundman <lundman@lundman.net>

Pull-request: #14228 part 1/1
Rob Wing
ZTS: test reported checksum errors for ZED

Test checksum error reporting to ZED via the call paths
vdev_raidz_io_done_unrecoverable() and zio_checksum_verify().

Sponsored-by: Seagate Technology LLC
Submitted-by: Klara, Inc.
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>

Pull-request: #14190 part 2/2
Rob Wing
Bump checksum error counter before reporting to ZED

The checksum error counter is incremented after reporting to ZED. This
leads to ZED receiving a checksum error report with 0 checksum errors.

To avoid this, bump the checksum error counter before reporting to ZED.
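
The ordering change, as a minimal sketch with hypothetical names (these
are not the actual ZFS functions):

    #include <stdio.h>

    struct vdev { unsigned long checksum_errors; };

    static void
    report_to_zed(const struct vdev *vd)
    {
            printf("zed: checksum errors = %lu\n",
                vd->checksum_errors);
    }

    int
    main(void)
    {
            struct vdev vd = { 0 };

            /* Bump first, then report, so the event ZED receives
             * never carries a count of zero. */
            vd.checksum_errors++;
            report_to_zed(&vd);
            return (0);
    }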

Sponsored-by: Seagate Technology LLC
Submitted-by: Klara, Inc.
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>

Pull-request: #14190 part 1/2
Tony Hutter
Tag zfs-2.1.7

META file and changelog updated.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>

Pull-request: #14162 part 83/83
Serapheim Dimitropoulos
Bypass metaslab throttle for removal allocations

Context:
We recently had a scenario where a customer with 2x10TB disks at 95+%
fragmentation and capacity wanted to migrate their disks to a 2x20TB
setup. So they added the 2 new disks and submitted the removal of the
first 10TB disk. The removal took a lot longer than expected (on the
order of one to two weeks vs a couple of days) and once it was done it
had generated a huge indirect mapping table in RAM (~16GB vs the
expected ~1GB).

Root-Cause:
The removal code calls `metaslab_alloc_dva()` to allocate a new block
for each evacuating block on the removing device, and it tries to batch
them into 16MB segments. If it can't find such a segment, it retries
with 8MB, then 4MB, all the way down to 512 bytes.
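
The fallback behaves roughly like this sketch (the stand-in allocator
below is hypothetical; the real code goes through
`metaslab_alloc_dva()`):

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in that "succeeds" only at 1MB or less, purely for
     * illustration. */
    static bool
    try_alloc_segment(uint64_t size)
    {
            return (size <= (1ULL << 20));
    }

    /* Start at 16MB and halve down to 512 bytes until one succeeds;
     * the smaller the winning size, the more mapping entries. */
    static uint64_t
    alloc_removal_segment(void)
    {
            uint64_t size;

            for (size = 16ULL << 20; size >= 512; size >>= 1) {
                    if (try_alloc_segment(size))
                            return (size);
            }
            return (0);
    }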

In our scenario, what would happen is that `metaslab_alloc_dva()` in
the removal thread would initially pick the new devices but wouldn't
allocate from them because of throttling on their metaslab allocation
queue depth (see `metaslab_group_allocatable()`), as these devices are
new and favored for most types of allocations because of their free
space. The removal thread would then look at the old fragmented disk
for allocations, fail to find any contiguous space, and retry with
smaller and smaller allocation sizes until it got down to the low-KB
range. This caused a lot of small mappings to be generated, blowing up
the size of the indirect table. It also wasted a lot of CPU while the
removal was active, making everything slow.

This patch:
Make all allocations coming from the device removal thread bypass the
throttle checks. These allocations are not even counted in the metaslab
allocation queues anyway, so why check them?

Side-Fix:
Allocations with METASLAB_DONT_THROTTLE in their flags would not be
accounted in the throttle queues, but they'd still abide by the
throttling rules, which seems wrong. This patch fixes that by checking
for the flag in `metaslab_group_allocatable()`. I did a quick check to
see where else this flag is used, and it doesn't seem like this change
would cause issues.
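
The side-fix amounts to an early exit along these lines; this is a
simplified sketch, not the actual function body, and the flag value and
queue-depth field names are made up:

    #include <stdbool.h>

    #define METASLAB_DONT_THROTTLE (1 << 0)  /* value illustrative */

    typedef struct metaslab_group {
            int mg_queue_depth;
            int mg_max_queue_depth;
    } metaslab_group_t;

    static bool
    metaslab_group_allocatable(const metaslab_group_t *mg, int flags)
    {
            /* Unthrottled allocations are never accounted in the
             * queues, so they should not be gated by them either. */
            if (flags & METASLAB_DONT_THROTTLE)
                    return (true);
            return (mg->mg_queue_depth < mg->mg_max_queue_depth);
    }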

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

Pull-request: #14159 part 1/1
Tino Reichardt
Update BLAKE3 to use the new impl handling

This commit changes the BLAKE3 implementation handling and
also the calls to it from the ztest command.

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>

Pull-request: #13741 part 7/7