Wuffs’ PNG image decoder


Summary: Wuffs’ PNG image decoder is memory-safe but can also clock between
1.22x and 2.75x faster than libpng, the widely used open source C
implementation. It’s also faster than the libspng, lodepng and stb_image
C libraries as well as the most popular Go and Rust PNG libraries. High
performance is achieved by SIMD-acceleration, 8-byte wide input and output
copies when bit-twiddling and zlib-decompressing the entire image all-at-once
(into one large intermediate buffer) instead of one row at a time (into
smaller, re-usable buffers). All-at-once requires more intermediate memory but
allows substantially more of the image to be decoded in the zlib-decompressor’s
fastest code paths.

Introduction

Portable Network Graphics (PNG) is a ubiquitous, lossless image file format,
based on the zlib compression format. It was invented in the 1990s, when
16-bit computers and 64 KiB memory limits were still an active concern. Newer
image formats (like WebP) and newer compression formats (like Zstandard) can
produce smaller files at comparable decode speeds, but there’s still a lot of
inertia in the zillions of existing PNG images. By one metric, PNG is still
the most commonly used image format on the web. Mozilla telemetry
IMAGE_DECODE_SPEED_XXX sample counts from 2021-04-03 (Firefox Desktop nightly
89) puts PNG second, after JPEG:

  • JPEG: 63.15M
  • PNG: 49.03M
  • WEBP: 19.23M
  • GIF: 3.79M

libpng is a widely used open source implementation of the PNG image format,
building on zlib (the library), a widely used open source implementation of
zlib (the format).

Wuffs is a 21st century programming language
with a standard library written in that language. On a mid-range x86_64
laptop, albeit on an admittedly small sample set, Wuffs can decode PNG images
between 1.50x and 2.75x faster than libpng (which we define as the 1.00x
baseline speed):

libpng_decode_19k_8bpp                            58.0MB/s ± 0%  1.00x
libpng_decode_40k_24bpp                           73.1MB/s ± 0%  1.00x
libpng_decode_77k_8bpp                             177MB/s ± 0%  1.00x
libpng_decode_552k_32bpp_ignore_checksum           146MB/s ± 0%  (†)
libpng_decode_552k_32bpp_verify_checksum           146MB/s ± 0%  1.00x
libpng_decode_4002k_24bpp                          104MB/s ± 0%  1.00x

libpng                                                  1.00x to 1.00x

----

wuffs_decode_19k_8bpp/clang9                       131MB/s ± 0%  2.26x
wuffs_decode_40k_24bpp/clang9                      153MB/s ± 0%  2.09x
wuffs_decode_77k_8bpp/clang9                       472MB/s ± 0%  2.67x
wuffs_decode_552k_32bpp_ignore_checksum/clang9     370MB/s ± 0%  2.53x
wuffs_decode_552k_32bpp_verify_checksum/clang9     357MB/s ± 0%  2.45x
wuffs_decode_4002k_24bpp/clang9                    156MB/s ± 0%  1.50x

wuffs_decode_19k_8bpp/gcc10                        136MB/s ± 1%  2.34x
wuffs_decode_40k_24bpp/gcc10                       162MB/s ± 0%  2.22x
wuffs_decode_77k_8bpp/gcc10                        486MB/s ± 0%  2.75x
wuffs_decode_552k_32bpp_ignore_checksum/gcc10      388MB/s ± 0%  2.66x
wuffs_decode_552k_32bpp_verify_checksum/gcc10      373MB/s ± 0%  2.55x
wuffs_decode_4002k_24bpp/gcc10                     164MB/s ± 0%  1.58x

wuffs                                                   1.50x to 2.75x

(†): libpng’s “simplified API” doesn’t provide a way to ignore the checksum. We
copy the verify_checksum numbers for a 1.00x baseline.

For example, the 77k_8bpp source image is 160 pixels wide, 120 pixels high
and its color model is 8 bits (1 byte; a palette index) per pixel. Decoding
that to 32bpp BGRA produces 160 × 120 × 4 = 76800 bytes, abbreviated as 77k.
The test photos:

  • 19k_8bpp
    dst: Indexed, src: Indexed
  • 40k_24bpp
    dst: BGRA, src: RGB
  • 77k_8bpp
    dst: BGRA, src: Indexed
  • 552k_32bpp
    dst: BGRA, src: RGBA
  • 4002k_24bpp
    dst: BGRA, src: RGB

Producing 4002k bytes at 104MB/s or 164MB/s means that it takes about
38ms or 24ms for libpng or Wuffs respectively to decode that 1165 × 859 image.

Other PNG implementations (libspng,
lodepng,
stb_image, Go’s
image/png
and Rust’s
png
) are measured in the Appendix (Benchmark
Numbers)
.

Wuffs Code

The command line examples further below refer to a wuffs directory. Get it by
cloning this repository:

$ git clone https://github.com/google/wuffs.git

Some of the command line output, here and below, has been omitted or otherwise
edited for brevity.

PNG File Layout

The PNG image format builds on:

  1. Two checksum algorithms, CRC-32 and Adler-32. Both produce 32-bit hashes but
    they are different algorithms.
  2. The DEFLATE compression format.
  3. 2-dimensional filtering. For a row of pixels, it’s often better (smaller
    output) to compress the residuals (the difference between pixel values and a
    weighted sum of their neighbors above and left) than the raw values.

Each of these steps can be optimized.

Checksums

CRC-32

Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction

by Gopal, Ozturk, Guilford, Wolrich, Feghali, Dixon and Karakoyunlu is a 2009
white paper on implementing CRC-32 using x86_64 SIMD instructions. The actual
code looks like
this
.
The ARM SIMD code is even
simpler
,
as there are dedicated CRC-32 related intrinsics.
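For contrast, here is a bit-at-a-time scalar CRC-32 — the slow baseline that those SIMD implementations accelerate, sketched for illustration rather than taken from Wuffs’ or zlib’s actual code:

```c
#include <stdint.h>
#include <stddef.h>

// Bit-at-a-time CRC-32 (IEEE, reflected polynomial 0xEDB88320).
// The SIMD versions discussed above compute the same function,
// just many bytes at a time instead of one bit at a time.
static uint32_t crc32_naive(const uint8_t* p, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; i++) {
    crc ^= p[i];
    for (int b = 0; b < 8; b++) {
      // (0u - (crc & 1u)) is all-ones when the low bit is set.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this algorithm is CRC-32("123456789") = 0xCBF43926.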

As for performance, Wuffs’
example/crc32
program is roughly equivalent to Debian’s /bin/crc32, other than being 7.3x
faster (0.056s vs 0.410s) on this 178 MiB
file
.

$ ls -lh linux-5.11.3.tar.gz | awk '{print $5 " " $9}'
178M linux-5.11.3.tar.gz
$ g++ -O3 wuffs/instance/crc32/crc32.cc -o wcrc32
$ time ./wcrc32   /dev/stdin < linux-5.11.3.tar.gz

SMHasher

SMHasher is a test and benchmark suite
for a variety of hash function implementations. It can provide data for claims
like “our new Foo hash function is faster than the widely used Bar, Baz and Qux
hash functions”. However, when comparing Foo to CRC-32, be aware that a
SIMD-accelerated CRC-32 implementation can be 47x
faster

than SMHasher’s simple CRC-32 implementation.

Adler-32

There isn’t a white paper about it, but the Adler-32 checksum can also be
SIMD-accelerated. Here’s the ARM
code

and the x86_64
code
.
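For reference, the simple scalar form of Adler-32 (again, a sketch, not Wuffs’ SIMD code) is just two running sums modulo 65521, the largest prime below 2^16:

```c
#include <stdint.h>
#include <stddef.h>

// Scalar Adler-32: a is 1 plus the sum of the bytes, b is the sum of
// the successive a values, both modulo 65521. The result packs b into
// the high 16 bits and a into the low 16 bits.
static uint32_t adler32_naive(const uint8_t* p, size_t n) {
  uint32_t a = 1;
  uint32_t b = 0;
  for (size_t i = 0; i < n; i++) {
    a = (a + p[i]) % 65521;
    b = (b + a) % 65521;
  }
  return (b << 16) | a;
}
```

SIMD versions batch many bytes between modulo reductions, which is why they can run several times faster than this loop.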

Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s)
than the one from zlib-the-library (called the ‘mimic library’ here), as
summarized by the
benchstat program:

$ cd wuffs
$ # ¿ is just an unusual character that is easy to search for. By
$ # convention, in Wuffs' source, it marks build-related information.
$ grep ¿ test/c/std/adler32.c
// ¿ wuffs mimic cflags: -DWUFFS_MIMIC -lz
$ gcc -O3 test/c/std/adler32.c -DWUFFS_MIMIC -lz
$ # Run the benchmarks.
$ ./a.out -bench | benchstat /dev/stdin
name                      speed
wuffs_adler32_10k/gcc10   11.3GB/s ± 0%
wuffs_adler32_100k/gcc10  11.6GB/s ± 0%
mimic_adler32_10k/gcc10   1.76GB/s ± 0%
mimic_adler32_100k/gcc10  1.72GB/s ± 0%

Ignoring Checksums

Taken to an extreme, the fastest checksum implementation is simply not doing
the checksum calculations at all (and skipping over the 4-byte expected
checksum values in the PNG file).

The ignore_checksum versus verify_checksum benchmark numbers at the top of
this post suggest a 1.04x performance difference. For Wuffs, this is a
one-line
change
.
Even if you don’t use Wuffs’ decoder, turning off PNG checksum verification
should still speed up your decodes, possibly by more than 1.04x if your PNG
decoder doesn’t use a SIMD-accelerated checksum implementation.

If doing so, be aware that turning off checksum verification is a trade-off:
being less able to detect data corruption and deviating from a strict reading
of the relevant file format specifications.

DEFLATE Compression

The bulk of DEFLATE compressed data consists of a sequence of codes, either
literal codes or copy codes. There are 256 possible literal codes, one for
each possible decompressed byte. Each copy code consists of a length (how many
bytes to copy, between 3 and 258 inclusive) and a distance (how far back in
the ‘history’ or previously-decompressed output to copy from, between 1 and
32768 inclusive).

For example, “banana” could be compressed as this sequence:

  • Literal ‘b’.
  • Literal ‘a’.
  • Literal ‘n’.
  • Copy 3 bytes starting from 2 bytes ago: “ana”. Yes, the final ‘a’ of the copy
    input is also the first ‘a’ of the copy output and wasn’t known until the
    copy started.
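Ignoring the Huffman bit-packing for a moment, the literal/copy semantics can be sketched as a tiny interpreter (a hypothetical illustration, not Wuffs’ or zlib’s actual code). A byte-at-a-time copy is what makes the overlapping “ana” case work:

```c
#include <stddef.h>

// One decompressed "op": either a literal byte, or a (length, distance)
// copy from earlier in the output.
typedef struct {
  int is_copy;
  char literal;     // used when is_copy == 0
  size_t length;    // used when is_copy != 0
  size_t distance;  // used when is_copy != 0
} op;

// Runs the ops, returning the number of bytes written to dst. The
// byte-at-a-time copy correctly handles the overlapping case, where
// distance < length and the copy reads bytes it has just written.
static size_t run_ops(const op* ops, size_t n_ops, char* dst) {
  size_t w = 0;  // write position
  for (size_t i = 0; i < n_ops; i++) {
    if (!ops[i].is_copy) {
      dst[w++] = ops[i].literal;
    } else {
      for (size_t j = 0; j < ops[i].length; j++) {
        dst[w] = dst[w - ops[i].distance];
        w++;
      }
    }
  }
  return w;
}
```

Running the four-op “banana” sequence above (three literals, then copy length 3 from distance 2) produces the six bytes “banana”.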

Codes are Huffman encoded, which means that they take a variable (but integral)
number of bits (between 1 and 48 inclusive) and do not necessarily start or end
on byte boundaries.

Literal codes emit a single byte. Copy codes emit up to 258 bytes. The maximum
number of output bytes from any one code is therefore 258. We’ll come back to
this number later.

Wuffs version 0.2 had an implementation similar to zlib-the-library’s, and
performed
similarly
,
at least on x86_64. Wuffs version 0.3 adds two significant optimizations for
modern CPUs (with 64-bit unaligned loads and stores): 8-byte-chunk input and
8-byte-chunk output.

8-Byte-Chunk Input

As noted above, DEFLATE codes take between 1 and 48 bits.
zlib-the-library’s “decode 1 DEFLATE code” implementation reads input bits at
multiple places in the loop. There are 7 instances of hold += (unsigned
long)(*in++) in
inffast.c,
loading the input bits 1 byte (8 bits) at a time.

We can instead issue a single 64-bit load once per loop. Some of those loaded
bits will be dropped on the floor, if there already are unprocessed bits in the
bit buffer, but that’s OK. Consuming those bits will shift in zeroes, bit-wise
OR with zeroes is a no-op and bit-wise OR with input bits is idempotent. Fabian
“ryg” Giesen’s 2018 blog post discusses reading bits in much more
detail
.

For Wuffs, reading 64 bits once per inner loop sped up its DEFLATE
micro-benchmarks by up to
1.30x
.
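A sketch of such a refill, in the branchless shape that Giesen’s post discusses (assumptions: a little-endian CPU with cheap unaligned loads, and at least 8 readable bytes at r->in — real decoders need a separate path near the end of input):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
  const uint8_t* in;  // next unread input byte
  uint64_t bits;      // bit buffer, least-significant-bit first
  uint32_t n_bits;    // number of valid bits in `bits`, 0..63
} bit_reader;

// Refill with one unaligned 64-bit load. Bits that don't fit above the
// n_bits already buffered fall on the floor; re-loading bytes whose bits
// are already buffered is harmless because OR-ing in identical bits is
// idempotent.
static void refill64(bit_reader* r) {
  uint64_t v;
  memcpy(&v, r->in, 8);            // single 8-byte load (little-endian)
  r->bits |= v << r->n_bits;       // excess high bits are dropped
  r->in += (63 - r->n_bits) >> 3;  // whole bytes actually consumed
  r->n_bits |= 56;                 // now holds at least 56 valid bits
}
```

After one refill, a decoder can consume several variable-length codes (up to 56 bits’ worth) before refilling again, instead of topping up a byte at a time.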

8-Byte-Chunk Output

Consider a DEFLATE code sequence for compressing TO BE OR NOT TO BE. THAT IS
ETC
. The second TO BE could be represented by a copy code of length 5 and
distance 13. A simple implementation of a 5 byte copy is a loop. If your CPU
allows unaligned loads and stores, a five instruction sequence (4-byte load;
4-byte store; 1-byte load; 1-byte store; out_ptr += 5) may or may not be
faster, but is still correct (given a sufficiently large distance). Even better
(in that it’s fewer instructions) is to copy too much (8-byte load; 8-byte
store; out_ptr += 5).

: TO_BE_OR_NOT_??????????????????????
: ^            ^
: out_ptr-13   out_ptr
:
:
:              [1234567) copy 8 bytes
:              v       v
: TO_BE_OR_NOT_TO_BE_OR??????????????
:              ^    ^
:                   out_ptr +=5
:
:
:                   [) write 1 byte
:                   vv
: TO_BE_OR_NOT_TO_BE.OR??????????????
:                   ^^
:                    out_ptr +=1

The output of subsequent codes (e.g. a literal '.' byte) will overwrite and
fix the excess. Or, if there are no subsequent codes, have the decompression
API post-condition be that any bytes in the output buffer may be modified,
even past the “number of decompressed bytes” returned.

Note that zlib-the-library’s API doesn’t allow this optimization
unconditionally. Its inflateBack function uses a single buffer for both
history and output
,
so that 8-byte overwrites could incorrectly modify the history (what the
library calls the sliding window) and hence corrupt future output.

For Wuffs, rounding up the copy length to a multiple of 8 sped up its DEFLATE
micro-benchmarks by up to
1.48x
.
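As a sketch (a hypothetical helper under stated assumptions, not Wuffs’ actual code), the round-up-to-8 copy looks like this. It requires a distance of at least 8 (shorter distances need a different path) and at least 7 bytes of writable slack past the true output end:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Copy `len` bytes from `dist` bytes back, 8 bytes at a time, possibly
// writing up to 7 excess bytes past dst + len. Valid when dist >= 8 and
// the buffer has at least 7 spare writable bytes after the true end.
static uint8_t* copy_fast(uint8_t* dst, size_t dist, size_t len) {
  uint8_t* src = dst - dist;
  uint8_t* end = dst + len;
  while (dst < end) {      // effectively rounds len up to a multiple of 8
    memcpy(dst, src, 8);   // compiles to an unaligned 8-byte load + store
    dst += 8;
    src += 8;
  }
  return end;              // subsequent output overwrites the excess
}
```

For the TO BE example: copying 5 bytes from 13 bytes back does one 8-byte copy (writing “TO_BE_OR”), but only the first 5 bytes count; the next code’s output fixes up the other 3.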

gzip

The gzip file format is, roughly speaking, DEFLATE compression combined with
a CRC-32 checksum. Like example/crc32, Wuffs’
example/zcat program
is roughly equivalent to Debian’s /bin/zcat, other than being 3.1x faster
(2.680s vs 8.389s) on the same 178 MiB file and also running in a self-imposed
SECCOMP_MODE_STRICT
sandbox
.

$ gcc -O3 wuffs/example/zcat/zcat.c -o wzcat
$ time ./wzcat    < linux-5.11.3.tar.gz > /dev/null
real    0m2.680s
$ time /bin/zcat  < linux-5.11.3.tar.gz > /dev/null
real    0m8.389s

As a consistency check, the checksum of both programs’ output should be the
same (and that 0x750d1011 checksum value should be in the final bytes of the
.gz file). Note that we are now checksumming the decompressed contents. The
earlier example/crc32 output checksummed the compressed file.

$ ./wzcat < linux-5.11.3.tar.gz | ./wcrc32 /dev/stdin
750d1011

Running Off a Cliff

Racing from point A to point B on a flat track is simple: run as fast as you
can. Now suppose that point B is on the edge of a cliff so that overstepping is
fatal (if not from the fall, then from the sharks). Racing now involves an
initial section (let’s color it blue) where you run as fast as you can and a
final section (let’s color it red) where you go slower but with more control.

cliff illustration #0

Decompressing DEFLATE involves writing to a destination buffer and writing past
the buffer bounds (the classic ‘buffer overflow’ security flaw) is analogous to
running off a cliff. To avoid this, zlib-the-library has two decompression
implementations: a fast ‘blue’ one (when 258 or more bytes away from the
buffer end
,
amongst some other conditions) and a slow ‘red’ one
(otherwise).

Separately, libpng allocates two buffers (for the current and previous row of
pixels) and calls into zlib-the-library H times, where H is the image’s
height in pixels. Each time, the destination buffer is exactly the size of one
row (the width in pixels times the bytes per pixel, plus a filter configuration
byte, roughly speaking) without any slack, which means that zlib-the-library
spends the last 258 or more bytes of each row in the slow ‘red’ zone. For
example, this can be about one quarter of the pixels of a 300 × 200 RGB (3
bytes per pixel) image, and a higher proportion in terms of CPU time.

cliff illustration #1
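A quick back-of-the-envelope check of that one-quarter figure:

```c
// For a one-row destination buffer, zlib's fast path needs at least 258
// bytes of room before the buffer end, so at least this fraction of
// every row is decoded by the slow 'red' implementation.
static double red_zone_fraction(int width, int bytes_per_pixel) {
  int row_bytes = 1 + width * bytes_per_pixel;  // +1 filter-config byte
  return 258.0 / row_bytes;
}
```

For a 300 × 200 RGB image, row_bytes is 1 + 300 × 3 = 901 and the red-zone fraction is 258 / 901 ≈ 0.286, roughly the “about one quarter” stated above.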

Wuffs’ zlib-the-format decompressor also uses this blue/red dual
implementation technique, but Wuffs’ PNG decoder decompresses into a single
buffer all-at-once instead of one-row-at-a-time. Almost all (e.g. more than 99%
of the pixels of that 300 × 200 RGB image) of the zlib-the-format
decompression is now in the ‘blue’ zone. This is faster than the ‘red’ zone by
itself but it also avoids any instruction cache or branch prediction slow-downs
when alternating between blue code and red code.

Memory Cost

All-at-once obviously requires O(width × height) intermediate memory (what
Wuffs calls a “work buffer”) instead of O(width) memory, but if you’re
decoding the whole image into RAM anyway, that already requires O(width ×
height)
memory.

Also, Wuffs’ image decoding API does give the caller some choice on memory use.
Wuffs doesn’t say, “I need M bytes of memory to decode this image”, it’s “I
need between M0 and M1 (inclusive). The more you give me, the faster I’ll
be”.

Wuffs’ PNG decoder currently sets M0 equal to M1 (there’s no choice;
all-at-once is mandatory) but a future version could give a one-row-at-a-time
option by offering a lower M0. The extra O(width × height) memory cost
could be avoided (at a performance cost) for those callers that care.

PNG Filtering

Both Wuffs-the-library and libpng (but not all of the other PNG decoders measured
here) have SIMD implementations of PNG’s 2-dimensional filters. For example,
here’s Wuffs’ x86
filters
.
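The filters themselves are defined by the PNG specification. For example, the Paeth filter predicts each byte from its left, up and up-left neighbors; the scalar form is shown here (the SIMD versions compute the same function, many bytes at a time):

```c
#include <stdint.h>
#include <stdlib.h>

// The Paeth predictor from the PNG specification: estimate p = a + b - c,
// then pick whichever of left (a), up (b) or up-left (c) is closest to p,
// breaking ties in the order a, b, c.
static uint8_t paeth_predict(uint8_t a, uint8_t b, uint8_t c) {
  int p = (int)a + (int)b - (int)c;
  int pa = abs(p - (int)a);
  int pb = abs(p - (int)b);
  int pc = abs(p - (int)c);
  if ((pa <= pb) && (pa <= pc)) { return a; }
  if (pb <= pc) { return b; }
  return c;
}

// Un-filtering adds the prediction back to the stored residual, mod 256.
static uint8_t unfilter_paeth(uint8_t residual, uint8_t a, uint8_t b,
                              uint8_t c) {
  return (uint8_t)(residual + paeth_predict(a, b, c));
}
```

The other filters (None, Sub, Up, Average) are simpler still, but all of them make each row depend on the previous row, which is why they are applied after zlib-decompression.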

libpng can actually be a little faster at this step, since it can ensure that
any self-allocated pixel-row buffers are aligned to the SIMD-friendliest
boundaries. Alignment can impact SIMD instruction selection and performance.
ARM and x86_64 are generally more and less fussy about this respectively.

Wuffs-the-library makes fewer promises about buffer alignment, partly because
Wuffs-the-language doesn’t have the capability to allocate
memory
,
but mainly because zlib-decompressing all-at-once requires giving up being
able to e.g. 4-byte-align the start of each row. Even if RGBA pixels at 8 bits
per channel are 4 bytes per pixel, the PNG file format prepends one byte (for
filter configuration) to each row, so the zlib-decompression layer sees an odd
number of bytes per row.

Nonetheless, profiling suggests that more time is spent in zlib-decompression
than in PNG filtering, so that the benefits of all-at-once zlib-decompression
outweigh the costs of unaligned PNG filtering. Wuffs’ Raspberry Pi 4 (32-bit
armv7l) compared-to-libpng benchmark ratios aren’t quite as impressive as its
x86_64 ratios (see Hardware below), but Wuffs still comes out
ahead.

Tangentially, that one filter-configuration byte per row, interleaved between
the rows of filtered pixel data, also makes it impossible to zlib-decompress
all-at-once directly into the destination pixel buffer. Instead, we have to
decompress to an intermediate work buffer (which has a memory cost) and then
memcpy (and filter) 99% of that to the destination buffer. In hindsight, a
different file format design wouldn’t need a separate work buffer, but it’s far
too late to change PNG now.

Upstream Patches

The optimization techniques described above were applied to new code:
Wuffs-the-library written in Wuffs-the-language. They could also apply to
existing code too, but there are reasons to prefer new code.

Patching libpng

libpng is written in C, whose lack of memory safety is well documented.
Furthermore, its error-handling API is built around setjmp and longjmp.
Non-local gotos make static or formal analysis more complicated.

Despite the file format being largely unchanged since 1999 (version 1.2 was
formalized in 2003; APNG is an
unofficial extension), the libpng C implementation has collected 74 CVE
records from 2002 through to
2021
, 9 of those since
2018.

Its source code has a one-line comment that literally says “TODO: WARNING:
TRUNCATION ERROR: DANGER WILL
ROBINSON”

but doesn’t say anything else. The comment was added in
2013

and is still there in 2021, but the code itself is older.

libpng is also just complicated. As a very rough metric, running wc -l *.c
arm/*.c intel/*.c
in libpng’s repository counts 35182 lines of code (excluding
*.h header files). Running wc -l std/png/*.wuffs in Wuffs’ repository
counts 2110 lines. The former library admittedly implements an encoder, not
just a decoder, but even after halving the first number, it’s still an 8x
ratio.

Patching zlib

I tried patching zlib-the-library a few years
ago
but it’s trickier than I first
thought, because of the inflateBack API issue mentioned above.

In any case, other people have already done this. Both
zlib-ng/zlib-ng and
cloudflare/zlib are zlib-the-library
forks with performance patches. Those patches (as well as those in Chromium’s
copy of
zlib-the-library
)
include optimization ideas similar to those presented here, as well as other
techniques for the encoder side.

Building zlib-ng from source is straightforward:

$ git clone https://github.com/zlib-ng/zlib-ng.git
$ mkdir zlib-ng/build
$ cd zlib-ng/build
$ cmake -DCMAKE_BUILD_TYPE=Release -DZLIB_COMPAT=On ..
$ make

With the test/c/std/png.c program (see Reproduction below),
running LD_LIBRARY_PATH=/the/path/to/zlib-ng/build ./a.out -bench shows that
libpng with zlib-ng (the second set of numbers below) is a little faster
but not a lot faster than with vanilla zlib (the first set of numbers below).

libpng_decode_19k_8bpp                            58.0MB/s ± 0%  1.00x
libpng_decode_40k_24bpp                           73.1MB/s ± 0%  1.00x
libpng_decode_77k_8bpp                             177MB/s ± 0%  1.00x
libpng_decode_552k_32bpp_ignore_checksum           146MB/s ± 0%  (†)
libpng_decode_552k_32bpp_verify_checksum           146MB/s ± 0%  1.00x
libpng_decode_4002k_24bpp                          104MB/s ± 0%  1.00x

libpng                                                  1.00x to 1.00x

----

zlibng_decode_19k_8bpp/gcc10                      63.8MB/s ± 0%  1.10x
zlibng_decode_40k_24bpp/gcc10                     74.1MB/s ± 0%  1.01x
zlibng_decode_77k_8bpp/gcc10                       189MB/s ± 0%  1.07x
zlibng_decode_552k_32bpp_ignore_checksum/gcc10                 skipped
zlibng_decode_552k_32bpp_verify_checksum/gcc10     177MB/s ± 0%  1.21x
zlibng_decode_4002k_24bpp/gcc10                    113MB/s ± 0%  1.09x

zlibng                                                  1.01x to 1.21x

cloudflare/zlib was forked from zlib-the-library version 1.2.8. Pointing
LD_LIBRARY_PATH to its libz.so.1 makes ./a.out fail with version
'ZLIB_1.2.9' not found (required by /lib/x86_64-linux-gnu/libpng16.so.16)
.

Patching Go or Rust

Both Go and Rust are successful, modern and memory-safe programming languages
with significant adoption. However, for existing C/C++ projects, it is easier
to incorporate Wuffs-the-library, which is transpiled to C (and its C form is
checked into the repository). It’d be like using any other third-party C/C++
library, it’s just not hand-written C/C++. In comparison, integrating Go or
Rust code into a C/C++ project involves, at a minimum, setting up additional
compilers and other build tools.

Still, there may very well be some worthwhile follow-up performance work for Go
or Rust’s PNG implementations, based on techniques discussed in this post. For
example, neither Go’s nor Rust’s Adler-32 implementations are SIMD-accelerated. It
may also be worth trying the 8-Byte-Chunk Input and 8-Byte-Chunk Output
techniques. Go’s DEFLATE implementation reads only one byte at a
time
.
Rust’s miniz_oxide reads four bytes at a
time

and four is bigger than one, but eight is bigger still. As far as I can
tell, neither Go’s nor Rust’s PNG decoders zlib-decompress all-at-once.

Memory Safety

Also, unlike Go or Rust, Wuffs’ memory
safety

is enforced at compile time, not by inserting runtime checks that e.g. the i
in a[i] is within bounds or that (x + y) doesn’t overflow a u32. Go and
Rust compilers can elide some of these checks, especially when iterating with a
uniform access pattern, but e.g. decoding DEFLATE codes consumes a variable
number of bytes per iteration.

Runtime safety checks can affect performance. I like Zig’s “Performance and
Safety: Choose
Two”

motto but, unlike Zig, Wuffs doesn’t have separate “Release Fast” and “Release
Safe” build modes.
There’s just one Wuffs “Release” configuration (pass -O3 to your C compiler)
and it’s both fast and safe at the same time.

Even so, when handling untrusted (third party) PNG images, sandboxing and a
multi-process architecture can provide defence in depth. Wuffs’
example/convert-to-nia program converts from image formats like PNG to an
easily-parsed Naïve Image
Format

and, on Linux, runs in a self-imposed SECCOMP_MODE_STRICT
sandbox
.

Conclusion

Wuffs version
0.3.0-beta.1
has
just been cut and it contains the fastest, safest PNG decoder in the world. See
the Wuffs example
programs
for how
to use it. Its PNG decoder doesn’t support color spaces or gamma correction
yet (watch Wuffs issue 39 if you
care), but some of you may still find it useful at this early stage.


Appendix (Benchmark Numbers)

libpng means the /usr/lib/x86_64-linux-gnu/libpng16.so that comes with
my Debian Bullseye system.

libpng_decode_19k_8bpp                            58.0MB/s ± 0%  1.00x
libpng_decode_40k_24bpp                           73.1MB/s ± 0%  1.00x
libpng_decode_77k_8bpp                             177MB/s ± 0%  1.00x
libpng_decode_552k_32bpp_ignore_checksum           146MB/s ± 0%  (†)
libpng_decode_552k_32bpp_verify_checksum           146MB/s ± 0%  1.00x
libpng_decode_4002k_24bpp                          104MB/s ± 0%  1.00x

libpng                                                  1.00x to 1.00x

----

wuffs_decode_19k_8bpp/clang9                       131MB/s ± 0%  2.26x
wuffs_decode_40k_24bpp/clang9                      153MB/s ± 0%  2.09x
wuffs_decode_77k_8bpp/clang9                       472MB/s ± 0%  2.67x
wuffs_decode_552k_32bpp_ignore_checksum/clang9     370MB/s ± 0%  2.53x
wuffs_decode_552k_32bpp_verify_checksum/clang9     357MB/s ± 0%  2.45x
wuffs_decode_4002k_24bpp/clang9                    156MB/s ± 0%  1.50x

wuffs_decode_19k_8bpp/gcc10                        136MB/s ± 1%  2.34x
wuffs_decode_40k_24bpp/gcc10                       162MB/s ± 0%  2.22x
wuffs_decode_77k_8bpp/gcc10                        486MB/s ± 0%  2.75x
wuffs_decode_552k_32bpp_ignore_checksum/gcc10      388MB/s ± 0%  2.66x
wuffs_decode_552k_32bpp_verify_checksum/gcc10      373MB/s ± 0%  2.55x
wuffs_decode_4002k_24bpp/gcc10                     164MB/s ± 0%  1.58x

wuffs                                                   1.50x to 2.75x

----

libspng_decode_19k_8bpp/clang9                    59.3MB/s ± 0%  1.02x
libspng_decode_40k_24bpp/clang9                   78.4MB/s ± 0%  1.07x
libspng_decode_77k_8bpp/clang9                     189MB/s ± 0%  1.07x
libspng_decode_552k_32bpp_ignore_checksum/clang9   236MB/s ± 0%  1.62x
libspng_decode_552k_32bpp_verify_checksum/clang9   203MB/s ± 0%  1.39x
libspng_decode_4002k_24bpp/clang9                  110MB/s ± 0%  1.06x

libspng_decode_19k_8bpp/gcc10                     59.6MB/s ± 0%  1.03x
libspng_decode_40k_24bpp/gcc10                    77.5MB/s ± 0%  1.06x
libspng_decode_77k_8bpp/gcc10                      189MB/s ± 0%  1.07x
libspng_decode_552k_32bpp_ignore_checksum/gcc10    223MB/s ± 0%  1.53x
libspng_decode_552k_32bpp_verify_checksum/gcc10    194MB/s ± 0%  1.33x
libspng_decode_4002k_24bpp/gcc10                   109MB/s ± 0%  1.05x

libspng                                                 1.02x to 1.62x

----

lodepng_decode_19k_8bpp/clang9                    65.1MB/s ± 0%  1.12x
lodepng_decode_40k_24bpp/clang9                   72.1MB/s ± 0%  0.99x
lodepng_decode_77k_8bpp/clang9                     222MB/s ± 0%  1.25x
lodepng_decode_552k_32bpp_ignore_checksum/clang9               skipped
lodepng_decode_552k_32bpp_verify_checksum/clang9   162MB/s ± 0%  1.11x
lodepng_decode_4002k_24bpp/clang9                 70.5MB/s ± 0%  0.68x

lodepng_decode_19k_8bpp/gcc10                     61.1MB/s ± 0%  1.05x
lodepng_decode_40k_24bpp/gcc10                    62.5MB/s ± 1%  0.85x
lodepng_decode_77k_8bpp/gcc10                      176MB/s ± 0%  0.99x
lodepng_decode_552k_32bpp_ignore_checksum/gcc10                skipped
lodepng_decode_552k_32bpp_verify_checksum/gcc10    139MB/s ± 0%  0.95x
lodepng_decode_4002k_24bpp/gcc10                  62.3MB/s ± 0%  0.60x

lodepng                                                 0.60x to 1.25x

----

stbimage_decode_19k_8bpp/clang9                   75.1MB/s ± 1%  1.29x
stbimage_decode_40k_24bpp/clang9                  84.6MB/s ± 0%  1.16x
stbimage_decode_77k_8bpp/clang9                    234MB/s ± 0%  1.32x
stbimage_decode_552k_32bpp_ignore_checksum/clang9  162MB/s ± 0%  1.11x
stbimage_decode_552k_32bpp_verify_checksum/clang9              skipped
stbimage_decode_4002k_24bpp/clang9                80.7MB/s ± 0%  0.78x

stbimage_decode_19k_8bpp/gcc10                    73.3MB/s ± 0%  1.26x
stbimage_decode_40k_24bpp/gcc10                   81.8MB/s ± 0%  1.12x
stbimage_decode_77k_8bpp/gcc10                     214MB/s ± 0%  1.21x
stbimage_decode_552k_32bpp_ignore_checksum/gcc10   145MB/s ± 0%  0.99x
stbimage_decode_552k_32bpp_verify_checksum/gcc10               skipped
stbimage_decode_4002k_24bpp/gcc10                 79.7MB/s ± 0%  0.77x

stbimage                                                0.77x to 1.32x

----

go_decode_19k_8bpp/go1.16                         39.4MB/s ± 1%  0.68x
go_decode_40k_24bpp/go1.16                        46.7MB/s ± 1%  0.64x
go_decode_77k_8bpp/go1.16                         78.3MB/s ± 0%  0.44x
go_decode_552k_32bpp_ignore_checksum/go1.16                    skipped
go_decode_552k_32bpp_verify_checksum/go1.16        118MB/s ± 0%  0.81x
go_decode_4002k_24bpp/go1.16                      50.5MB/s ± 0%  0.49x

go                                                      0.44x to 0.81x

----

rust_decode_19k_8bpp/rust1.48                     89.8MB/s ± 0%  1.55x
rust_decode_40k_24bpp/rust1.48                     122MB/s ± 0%  1.67x
rust_decode_77k_8bpp/rust1.48                      158MB/s ± 0%  0.89x
rust_decode_552k_32bpp_ignore_checksum/rust1.48                skipped
rust_decode_552k_32bpp_verify_checksum/rust1.48    136MB/s ± 0%  0.93x
rust_decode_4002k_24bpp/rust1.48                   122MB/s ± 0%  1.17x

rust                                                    0.89x to 1.67x

Reproduction

Wuffs compiles (transpiles) to C code and that C code (a “single file C
library” that doesn’t need configure, make install or equivalent incantations)
is also checked into the Wuffs repository. If you’re just using
Wuffs-the-library (rather than modifying it), you don’t need to install any
Wuffs tools. You just need a C compiler.

$ cd wuffs
$ # ¿ is just an unusual character that is easy to search for. By
$ # convention, in Wuffs' source, it marks build-related information.
$ grep ¿ test/c/std/png.c
// ¿ wuffs mimic cflags: -DWUFFS_MIMIC -lm -lpng -lz
$ gcc -O3 test/c/std/png.c -DWUFFS_MIMIC -lm -lpng -lz
$ # Run the tests.
$ ./a.out
$ # Run the benchmarks.
$ ./a.out -bench

The tests check that Wuffs ‘mimics’ (has the same output as) another C library.
For the PNG file format, the ‘mimic library’ is libpng, although editing
Wuffs’
test/c/mimiclib/png.c
file can configure other mimic libraries like libspng, lodepng and
stb_image.

Note that gcc10 produces slightly faster code than clang9 in Wuffs’ benchmark
numbers at the top of this post (as well as in an earlier non-SIMD
Adler-32
implementation),
although clang9 often performs better for the other C mimic libraries.
This is something I would never have discovered if Wuffs’ tools generated
object code directly (e.g. via LLVM).

Go
PNG

and Rust
PNG

benchmarks are separate programs.

Hardware

The numbers above were measur
