Fast base64 encoding and decoding (lemire.me)
210 points by joeyespo on Jan 17, 2018 | 77 comments



Last fall I tried to figure out why Rust is slower than D in a base64 benchmark (https://github.com/kostya/benchmarks). It turned out that decoding in Rust is really fast but encoding is a bit slow. I looked at the assembly output of both programs: I could more or less follow the steps in the Rust version, but the D version consisted of only about 5 SIMD instructions (without a single bit shift). I also checked the source code; the Rust version is fairly long, because part of the hot loop is manually unrolled, but the D version looked like the simple, naive implementation in C. It might be a result of a newer LLVM version (LDC was already using LLVM 5 and rustc is still on LLVM 4), or rustc might still have issues with optimizer settings (both binaries were compiled for Intel Haswell).
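For reference, here is a minimal C sketch of what such a simple, naive implementation typically looks like (full 3-byte groups only; the tail and '=' padding are omitted). Whether a compiler turns a loop like this into a handful of SIMD instructions depends heavily on the backend version and target flags:

    #include <stddef.h>
    #include <stdint.h>

    static const char B64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode all full 3-byte groups of src into dst; returns bytes written.
       The caller handles the final 1- or 2-byte tail and '=' padding. */
    size_t b64_encode_full_groups(const uint8_t *src, size_t len, char *dst) {
        size_t o = 0;
        for (size_t i = 0; i + 3 <= len; i += 3) {
            uint32_t n = (uint32_t)src[i] << 16 | (uint32_t)src[i + 1] << 8 | src[i + 2];
            dst[o++] = B64[(n >> 18) & 0x3F];
            dst[o++] = B64[(n >> 12) & 0x3F];
            dst[o++] = B64[(n >> 6)  & 0x3F];
            dst[o++] = B64[n & 0x3F];
        }
        return o;
    }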


Do you have LLVM IR output? There are definitely places where rust's hinting isn't optimal, and that might not be too difficult to fix.


I didn't look at the LLVM IR (since I'm not really familiar with it), but I might investigate it a bit further in the near future.


I would encourage it. LLVM IR is fairly easy to understand. One of my first forays into rustc was looking at the IR for iterators to see why one version was slow (turned out to be that the bounds on the index variable couldn't be proven tight enough to eliminate the bounds check, iirc).

Try writing the same simple code as D and compare the two IR representations.

And no matter how much I slag on rust being difficult, the community is really good about helping and teaching.


I made Go bindings for that library https://github.com/myfreeweb/go-base64-simd

…to use in a mail indexer project…

…only to discover that half of my mail was not being decoded, because that decoder does not skip whitespace.


Did you find any bugs in the library when implementing your binding?

Daniel Lemire has done brilliant work. His https://github.com/lemire/despacer looks great.

I wonder though if their Base64 SIMD algorithms could be tweaked to do the decode in a single pass?

Like you, I've also written an email-related native Base64 addon for Node, in C, optimized for decoding MIME-wrapped Base64 in a single pass (https://github.com/ronomon/base64). It's not yet using SIMD instructions but it's 1.5-2x as fast as Node's routines, with top speeds of 1398.18 MB/s decoding and 1048.58 MB/s encoding on a single core (Intel Xeon E5-1620 v3 @ 3.50GHz). Here is an explanation of the difference: https://github.com/nodejs/node/issues/12114
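To illustrate the general single-pass idea (a simplified scalar sketch, not the actual addon code; the real thing uses a full 256-entry lookup table and stricter error handling):

    #include <stddef.h>
    #include <stdint.h>

    enum { WS = 254, BAD = 255 };

    /* Classify one input byte: its 6-bit value, WS for MIME whitespace,
       or BAD for padding/junk. A real decoder uses a 256-entry table. */
    static uint8_t classify(uint8_t c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n') return WS;
        return BAD;
    }

    /* Decode MIME-wrapped base64, skipping whitespace inline (single pass).
       Returns the number of bytes written to dst. */
    size_t b64_decode_mime(const uint8_t *src, size_t len, uint8_t *dst) {
        uint32_t acc = 0;
        int bits = 0;
        size_t out = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t v = classify(src[i]);
            if (v == WS) continue;           /* skip folding whitespace */
            if (v == BAD) break;             /* '=' padding or junk: stop */
            acc = (acc << 6) | v;
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                dst[out++] = (uint8_t)(acc >> bits);
            }
        }
        return out;
    }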

The Ronomon Base64 addon also ships with a fuzz test (https://github.com/ronomon/base64/blob/master/test.js) which you are welcome to try out on your Go binding. It picked up an interesting bug in Node's decode routine where some white space was actually decoded as data (https://github.com/nodejs/node/issues/11987).


To be fair, the original paper does mention the problem of whitespace (see Appendix B on page 21). It seems that the recommended solution is to use this fast space remover: https://github.com/lemire/despacer
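In scalar terms, that two-pass approach amounts to something like the sketch below; the despacer library provides SIMD-accelerated versions of the compaction step, so the plain-C helper here is only a stand-in, not its API:

    #include <stddef.h>

    /* Compact the buffer in place, dropping the whitespace bytes that MIME
       inserts, before handing the result to the SIMD base64 decoder.
       Returns the new length. */
    size_t strip_whitespace(char *buf, size_t len) {
        size_t out = 0;
        for (size_t i = 0; i < len; i++) {
            char c = buf[i];
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n')
                buf[out++] = c;
        }
        return out;
    }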


That sounds like it would be 90% of base64-encoded mails failing to decode, not half.


Base64 is commonly used in cryptography to exchange keys.

Note that if you're encoding or decoding secret data, such as keys, you most likely want a constant-time encoder (example: https://boringssl.googlesource.com/boringssl/+/master/crypto...), not the fastest one.


You seem to be confusing encoding (base64 for example), and encryption (AES for example).

Where a side channel attack studying the decryption speed/power use can reveal information, the encoding of said encrypted data cannot, as far as I've ever heard. (It will still be encrypted after base64 decoding has taken place.)

Please correct me if I'm wrong.

EDIT: I see you updated your original message with different wording, please ignore this rant.


Side-channel attacks against base64 encoding have yet to be proven, but constant-time implementations are available just in case.

My contribution is here: https://github.com/paragonie/constant_time_encoding


I'll add to that a high-assurance implementation from Galois, which is probably not constant time. Their blog and GitHub have quite a few useful tools.

https://galois.com/blog/2013/09/high-assurance-base64/

Also, anyone wanting a constant-time implementation might just run a verified implementation through something like FaCT or Jasmin:

https://cseweb.ucsd.edu/~dstefan/pubs/cauligi:2017:fact.pdf

https://acmccs.github.io/papers/p1807-almeidaA.pdf


You seem to be confusing encoding (base64 for example), and encryption (AES for example).

I don't.

Where a side channel attack studying the decryption speed/power use can reveal information, the encoding of said encrypted data can not as far as I ever heard about. (It will still be encrypted after base64 decoding has taken place).

Encoding/decoding a value to/from base64 depends on the value, and many implementations use a lookup table or branches. While the table is small, so it fits in CPU cache and timing differences are harder to detect, it's still important for code dealing with secret values to avoid any timing differences by using simple branchless instructions that we know are constant-time on CPUs.
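For illustration, here is a branchless mapping from a 6-bit group to its base64 character in C, in the spirit of the constant-time implementations mentioned above (a sketch, not taken from any of them):

    #include <stdint.h>

    /* All-ones mask if a < b, else 0; valid for a, b < 2^31. */
    static uint32_t ct_lt(uint32_t a, uint32_t b) {
        return (uint32_t)0 - ((a - b) >> 31);
    }

    /* All-ones mask if a == b, else 0. */
    static uint32_t ct_eq(uint32_t a, uint32_t b) {
        return ~(ct_lt(a, b) | ct_lt(b, a));
    }

    /* Map x in [0, 63] to its base64 alphabet character without any
       secret-dependent branches or table lookups. */
    static unsigned char b64_char(uint32_t x) {
        uint32_t c =
            (ct_lt(x, 26)                 & (x + 'A'))      |
            (ct_lt(x, 52) & ~ct_lt(x, 26) & (x + 'a' - 26)) |
            (ct_lt(x, 62) & ~ct_lt(x, 52) & (x + '0' - 52)) |
            (ct_eq(x, 62)                 & '+')            |
            (ct_eq(x, 63)                 & '/');
        return (unsigned char)c;
    }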


Unless you’re doing something out of the ordinary, secrets and plaintext aren’t base64 encoded. A variably-timed encoder/decoder will leak bits of the ciphertext, which the attacker presumably already has.


Most web servers on the Internet store unwrapped base64-encoded secret keys.

Most SSH hosts store their secret keys encoded in base64 (/etc/ssh/).

Then, there's JWK which supports unwrapped secret keys.

Basically, you can't guarantee that the whole world is only encoding encrypted keys. That's why it's good software engineering practice to prevent possible attacks by using a constant-time encoder/decoder. As you can see from the link I posted, BoringSSL implements one.

Finally, the context of this thread is encoding and decoding of secret data, so I'm not sure why you had to post your opinion stating that it never happens. 1) It happens in real world. 2) I mentioned a solution for the case when it happens.


Re “trusting trust” - do you know if it is possible to build a constant time encoder on top of a maliciously compromised system/toolchain? I assume it is not possible in general. Maybe this is a worthless observation because it applies to almost everything and is not exploited often in practice.


You're right, it applies to everything.

There's also a problem with trust in CPU, and specifically, knowing which operations are constant time. See: https://bearssl.org/constanttime.html


When doing so programmatically, yes, but I actually assumed what you quoted was referring to manual key exchange. I.e. Certificates, PEM format, etc.


https://arxiv.org/abs/1704.00605

In short, they spend roughly 1/5th of a cycle per byte through the use of AVX2 instructions to encode / decode base64. So one cycle every 5 bytes.

This includes robust error checking (allegedly)

---------

Because SIMD-based execution was so successful in this example, I do wonder if a GPGPU implementation would be worthwhile. If some very big data were Base64-encoded / decoded (like MBs worth), would it be worthwhile to spend a costly PCIe transaction to transfer the data to the GPU and back? Email attachments, for example, are Base64 encoded.

I'd expect most web uses to involve only a few kilobytes (at most) of base64 data, however. A few hundred bytes for Cookies and the like. So the typical web case of base64 doesn't seem "big" enough to warrant the use of a GPU, so the AVX2 methodology would be ideal.


I suspect you would bottleneck on the PCIe bus before you saw improvement. GPUs usually aren't great at problems where you have a huge stream of data to work with and not a huge amount of processing needed per byte.

If your data is already in GPU memory for some reason then yeah, it's going to blast through the data at an insane rate, but getting it there in the first place is the problem.


> I suspect you would bottleneck on the PCIe bus before you saw improvement.

Heh, you're right. It's not even close. PCIe x16 is slightly less than 16GB/s. And that's a one-way transfer; the second transfer back effectively halves the speed.

At 4GHz, this Base64 encoding / decoding scheme is doing 20GB/s of encoding (round numbers for simplicity). So literally, it is slower to transfer the data to the GPU than to use those AVX2 instructions.

Heck, it's slower on the one-way trip to the GPU, before it even comes back (and before the GPU even touches the data!)


It's a shame nobody is really using yEnc instead of base64. yEnc only has an overhead cost of about 2%, compared to 33% for base64, and would be perfect for use on webpages.


yEnc has a fair share of problems related to reliable encoding and decoding for some character sets (UTF-8 for example). Normally this is solved with the MIME types, but yEnc is not officially registered. Nor does it have an official RFC standard. base64 on the other hand "just works".


yEnc is very much designed specifically for Usenet though. It gets its low overhead largely from using almost all characters which aren't special in NNTP (and also adapts allowed characters depending on NNTP context), which isn't appropriate for many cases where base64 is used.

There are encodings that may be more appropriate, which use more than 64 codepoints, such as base85 (https://en.wikipedia.org/wiki/Ascii85), assuming that these are allowed.

As an aside, I've implemented a fast SIMD accelerated yEnc encoder/decoder here (https://github.com/animetosho/node-yencode), for anyone interested.


To embed binary data into web pages one needs an encoding that is friendly to the page's text encoding and uses only characters that the source syntax allows. Either of these rules out yEnc. Besides, web pages that embed binary data should be transmitted compressed, and gzip brings the overhead of base64 down to a few percent.


base64 is supported in browsers with atob() and btoa(), no need for an external lib


Usenet uses yEnc


Note that this is considerably faster than the amount of memory bandwidth available per core.


No it's not, at least if we're talking L2 cache or faster.

Intel processors support two loads + one write per core. That can be 2x256-bit reads + 1x256-bit write. Each clock, every clock. At least to L1 and L2 (the only levels of cache which have that level of bandwidth).

AMD Ryzen supports 2x128-bit reads OR 1x128-bit write (1-read+1write is fine).

----------

The kicker is that you must use AVX2 registers to achieve this bandwidth. If you're loading / storing 64-bit registers like RAX, you can't get there.

A raw "copy" between locations (assuming they're entirely in L2 memory or faster) can be done by Skylake at 32 bytes per cycle. The speed of this Base64 encoder however is "just" at 5 bytes per cycle.

I'm curious how their code would perform on AMD's Ryzen platform, because of Ryzen's lower bandwidth. Ryzen's AVX is also split into 2x128-bit ports, instead of Intel's 2x256-bit ports. I'd expect AMD's Ryzen to do worse, but maybe there are other bottlenecks in the code?


> Intel processors support two loads + one write per core. That can be 2x256-bit reads + 1x256-bit write. Each clock, every clock. At least to L1 and L2...

You can approach 2 reads + 1 write in L1, but you won't get close for data in L2, especially when writes are involved. Writes dirty lines in the L1, which later need to be evicted, which eat up both L1 read-port and L2 write-port bandwidth.

You'll find that with pure writes to a data set that fits in L2 but not L1, you get only 1 cache line every 3 cycles, so nowhere close to 1 per cycle. If you add reads, things will slow down a bit more.

Even Intel documents their L2 bandwidth at only 32 bytes/cycle max, and 25 bytes per cycle sustained (on Broadwell - slightly better for Skylake client), and that's "one way" i.e., read-only loads. When you have writes and have to share with evictions it gets worse. So again nowhere close to the 96 bytes/cycle if you could do 2 32-byte reads and 1 32-byte store a cycle.

Some details are discussed here (although it's for Haswell, it mostly applies to everything up to but not including Skylake-X):

https://software.intel.com/en-us/forums/software-tuning-perf...


I appreciate your thoughts on this.

I agree with your argument. But... I feel like I should defend my thoughts a bit. My statements were based on this Intel Article: https://software.intel.com/en-us/articles/why-efficient-use-...

Which suggests L1 and L2 cache bandwidth are the same. But when I look at the sources you point out, it seems like your statements are correct. I think I'll trust your sources, since they're benchmark-based, more than a single graph (probably based on theoretical analysis). Thanks for the info.


To be fair, unless I missed it, that article is mostly just saying "The L1 and L2 are much faster than L3 and beyond". That's correct - but it doesn't imply that the L2 is somehow as fast as the L1 (if that were true, why have the distinction between L1 and L2 at all?).

Once you go to the L3, you suffer a large jump in latency (something like 12 cycles for L2 to ~40 cycles for L3 - and on some multi-socket configurations this gets worse) and you are now sharing the cache with all the other cores on the chip, so the advice to "Keep in L1 and L2" make a lot of sense.

Intel used to claim their L2 had 64 bytes per cycle bandwidth - and indeed there is probably a bus between the L1 and L2 capable of transferring 64 bytes per cycle in some direction: but you could never achieve that since unlike the core to L1 interface, which has a "simple" performance model, the L1<->L2 interface is complicated by the fact that you need to (a) accept evictions from L1, and (b) that the L1 itself doesn't have an unlimited number of ports to simultaneously accept incoming 64-byte lines from L2 and (c) everything is cache-line based.

The upshot is that even ignoring (c) you never get more than 32 bytes per cycle from L2 and usually less. Intel recently changed all the mentions of "64 bytes per cycle for L2" into "32 bytes max, ~2x bytes sustained" in their optimization manual in recognition of this.

Once you consider (c) you might get even way less bandwidth from L2: in L1, on modern Intel, your requests can be scattered around more or less randomly and you'll get the expected performance (there is some small penalty for splitting a cache line: it counts "double" against your read/write limits), but with L2 you will get worse performance with scattered reads and writes since every access is essentially amplified to a full cache line.

It turns out that L2 writes are especially bad:

https://stackoverflow.com/q/47851120/149138


I said memory bandwidth, not cache bandwidth.


In which case, you may be surprised how fast RAM can be today.

A stick of 3200MHz DDR4 RAM has roughly 25.6 GB/s of bandwidth. You'll need to slightly overclock the memory controller to get there, but it wouldn't be insane. There are 4600MHz sticks of RAM btw (https://www.tweaktown.com/reviews/8432/corsair-vengeance-lpx...), so 3200MHz+ is kind of "typical" now, despite requiring the overclock.

The cheapest systems are dual channel, giving 51.2 GB/s of bandwidth to main-memory. Some higher-end workstations are quad-channel for 102.4 GB/s of bandwidth.

Your total RAM bandwidth per, say, 4GHz clock is ~12.5 bytes/clock on dual-channel and ~25 bytes/clock on quad-channel.

Divide that further across 4 cores (in the dual-channel case) or 8 cores (on quad-channel processors) and we're looking at ~3 bytes of main-memory bandwidth per clock per core.

A lot slower than L2 cache for sure, but not nearly as bad as it seems. So you're right, ~3 bytes per clock is still slower than the base64 implementation (which is 5-bytes per clock), but the bottleneck isn't nearly that bad. Especially if "other cores" aren't hitting the memory controller so hard.


You can't get close to 50 GB/s of bandwidth to a single core, even if your memory configuration and controller support that, because the per-core bandwidth is limited by the latency per cache line and the maximum number of concurrent requests. There is a maximum of 10 outstanding requests that missed in L1 per core, so if you take a (very good) latency to DRAM of 50 ns, you can reach only 64 bytes/line * 10 lines outstanding / 50 ns = 12.8 GB/s.
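As a quick check of that arithmetic (the 10 line-fill buffers and 50 ns latency are the figures quoted above, not measurements):

    #include <stdio.h>

    int main(void) {
        const double line_bytes = 64.0;   /* cache line size */
        const double outstanding = 10.0;  /* L1 line-fill buffers */
        const double latency_ns = 50.0;   /* DRAM latency */
        /* bytes per nanosecond is numerically equal to GB/s */
        printf("per-core limit ~= %.1f GB/s\n", line_bytes * outstanding / latency_ns);
        return 0;
    }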

In fact, you'll observe pretty much exactly that limit if you disable hardware prefetching. Of course, almost no one does that, and it turns out in particular that one can do better than this limit due to hardware L2 prefetching (but not software prefetching, since it is restricted by the same 10 LFBs as above): the concurrency from L2 to the uncore and beyond is higher than 10 (perhaps 16-20) and the latency is lower, so you get higher concurrency, but you are still concurrency limited to around 30 GB/s on modern Intel client cores.

On server cores, the latency is much higher, closer to 100ns, and so the per-core bandwidth is much lower, and the memory subsystem usually has a higher throughput, so it takes many more active server cores to saturate the memory bandwidth than on the client (where one core just about does it on dual-channel systems).


But isn't the situation of base64 encoding a "streaming" situation? I agree with you that the 10-outstanding requests + latency metric is useful for a random-access situation. But base64 encoding (and decoding) starts at the beginning of something in memory, and keeps going till you reach the end.

It's about as ideal a streaming case as you can get. No hash tables or anything involved. And WITH hardware prefetchers enabled, DDR4 becomes so much faster.

http://www.sisoftware.eu/wp-content/uploads/2017/08/amd-tr-m...

The measured "streaming" latency numbers seen in that picture are likely due to the hardware prefetcher. "Streaming" data from DDR4 is incredibly efficient! An order of magnitude more efficient than random-walks.

Intel's memory controller also seems to be optimized for DDR4 page hits (terminology that means the DDR4 stick only needs a CAS command to fetch memory; normally, DDR4 requires RAS+CAS or even PRE+RAS+CAS, which increases latency significantly). So if you can keep your data walk inside a DDR4 page, you can drop the latency from ~80ns to ~20ns on Skylake (but not Ryzen).

--------

With that being said, it seems like my previous post is a bit optimistic and doesn't take into account CAS commands and the like (which would eat up the theoretical bandwidth of DDR4 RAM). DDR4 3200MT/s RAM practically has a bandwidth of ~31GB/s on a dual-channel setup.

Still, I stand by the theory. If you "stream" beginning-to-end some set of data from DDR4 (WITH hardware prefetchers, that's the important part! Don't disable those), then you can achieve outstanding DDR4 bandwidth numbers.


Yes, it's a streaming situation - and that's my point: contrary to conventional wisdom, even the streaming situation - for a single core - is generally limited by concurrency and latency, and not by DRAM bandwidth, and there is just no way around it.

Now a streaming load, measured in cache line throughput, ends up about twice as fast as a random access load: because the latter is mostly limited by concurrency in the "superqueue" between L2 and DRAM, and the superqueue is larger (about 16-20 entries) and has a lower total latency than the full path from L1, while the former scenario is limited by the concurrency of the 10 LFBs between L1 and L2.

Prefetch isn't magic: it largely has to play by the same rules as "real" memory requests. In particular, it has to use the same limited queue entries that regular requests do: the only real "trick" is that when talking about L2 prefetchers (i.e., the ones that observe accesses to L2 and issue prefetches from there), they get to start their journey from the L2, which means they avoid the LFB bottleneck for demand requests (and software prefetches, and L1 hardware prefetches).

You don't have to take my word for it though: it's easy to test. First you can turn off prefetchers and observe that bandwidth numbers are almost exactly as predicted by the 10 LFBs. Then you can turn on prefetchers (one by one, if you want - so you see the L2 streamer is the only one that helps), and you'll see that the bandwidth is almost exactly as predicted based on 16-20 superqueue entries and a latency about 10 ns less than the full latency.

This is most obvious when you compare server and client parts. My laptop with a wimpy i7-6700HQ has a per-core bandwidth of about 30 GB/s, but a $10,000 Haswell/Broadwell/Skylake-X E5 server part, with 4 channels and faster DRAM will have a per-core bandwidth about half that, because it has a latency roughly 2x the client parts and it is _latency limited_ even for streaming loads! You find people confused about this all the time.

Of course, if you get enough cores streaming (usually about 4) you can still use all the bandwidth on the server parts and you'll become bandwidth limited.

There are other ways to see this in action, such as via NT stores, which get around the LFB limit because they don't occupy a fill buffer for the entire transaction, but instead only long enough to hand off to the superqueue.

This is covered in more detail in this SO answer:

https://stackoverflow.com/a/43574756/149138


This is a fabulous exchange. If you (or dragontamer) are interested in collaborating with Daniel (the blog author) on future academic papers where this level of architectural detail is relevant, I'm sure he'd appreciate hearing from you by email. Or contact me (email in profile) and I'll forward.


I appreciate the elaboration. I wasn't aware of this issue.


The benchmarks show the 'generic' aka SWAR version, with or without OpenMP, is the best for almost all purposes. The only thing I would change is the function signature, to size_t base64_enc(const char * src, char * dst, size_t len), so it returns the written size.


Another fast implementation (plain C99, without using SIMD), supporting buffer aliasing (i.e. allowing in-place encoding/decoding, check senc_b64() and sdec_b64() functions): https://github.com/faragon/libsrt/blob/master/src/saux/senc....


In my benchmarks only the AVX2 version is faster than the portable scalar "turbobase64": https://github.com/powturbo/TurboBase64

SIMD does not always mean faster.


I hate base64. If you have to embed binary data within human-readable text, then you’re either

A: doing something horribly wrong

B: or forced to use a monkey-patched legacy standard

Otherwise just use hexadecimal digits. At least it will be somewhat readable.

I know, I know... Base64 encodes 3 bytes on 4 characters, while hex encodes 2 bytes on 4. Big fucking deal. It will almost surely be gzipped down the line, and if you’re storing amounts of data where this would count, then see point “A” above.

I hope one day we will laugh at the mere idea of base64 encoding like how we do with EBCDIC or Windows-12XX codepages.


Base64 has been around forever, and hex has significantly lower information density: base64 output is consistently more than 30% smaller.

As for doing something horribly wrong, I dunno, ensuring literally billions of moms and pops can exchange baby pictures by e-mail, for the first time in human history, and doing so since the time when we were using 33.6kbit modems, that seems like a resounding success.†

† Yes, if e-mail had not paid such close attention to compatibility, and just ripped up every old standard on every chance it got (i.e. pretty much like any technology invented since the commoditization of the web), interoperability would never have happened. Legacy crap is fundamentally a part of that success, and Base64 is fundamentally a part of making those hacks practical for the technology we had at the time††

†† To head off an obvious counterargument: yes, there are contemporary examples of why this strategy is still important (it's, so far, the only one that has ever worked in the decentralized environment of the Internet), and is very likely to come in useful in future, time and time again


Hex has lower information density, however it is more compressible, particularly if repeated patterns in your data are byte aligned. For example, the string "aaaaaa" is b64 encoded as "YWFhYWFh" but in hex is "616161616161". On a project I'm on, we switched from base64 to hex for some binary data embedded in JSON, and saw a ~20% size reduction, since it's always compressed by the web server.


Today I learned!

    >>> import random,zlib,codecs
    >>> sample = random.getrandbits(1024).to_bytes(1024, 'big') * 1024
    >>> len(zlib.compress(codecs.encode(sample, 'base64')))
    22083
    >>> len(zlib.compress(codecs.encode(sample, 'hex')))
    4326
    >>>
edit: (25 minutes of digging around the Internet, and I find this):

    >>> sample = os.urandom(1024) * 1024

    >>> len(sample.encode('base64'))
    1416501
    >>> len(sample.encode('hex'))
    2097152
    >>> len(sample.encode('ascii85'))
    1310720

    >>> len(zlib.compress(sample.encode('base64')))
    82169
    >>> len(zlib.compress(sample.encode('hex')))
    12537
    >>> len(zlib.compress(sample.encode('ascii85')))
    8212
Had to use Python 2.x to make use of the 'hackercodecs' package that implements ascii85, but looks like it's the best of both worlds, assuming its character set suits whatever medium you are transferring over, and decoding it on the other end doesn't require some slow code.

final edit: I'm guessing it was an accidental trick of the repetitive input data lining up well. On real data hex still wins out (which should have been obvious in hindsight):

    >>> sample = ''.join(sorted(open('/usr/share/dict/words').readlines()))

    >>> len(zlib.compress(sample.encode('ascii85')))
    1320809
    >>> len(zlib.compress(sample.encode('base64')))
    1116678
    >>> len(zlib.compress(sample.encode('hex')))
    880651


Both hex and ascii85 will align at 1024-byte boundaries, while base64 will only align at 1024*3 bytes, so you'll end up with a symbol stream something like this:

symbol_1, symbol_2, symbol_3, ptr_to_1, ptr_to_2,...

ascii85 performs better than hex simply because its dictionary is shorter.

If you compress something with a bit more structure, like "Hamlet", you'll get this:

    >>> import urllib.request,zlib,codecs,base64
    >>> hamlet = urllib.request.urlopen("http://www.gutenberg.org/cache/epub/2265/pg2265.txt").read()
    >>> len(codecs.encode(hamlet, 'base64'))
    248577
    >>> len(codecs.encode(hamlet, 'hex'))
    368018
    >>> len(base64.a85encode(hamlet))
    230012
    >>> len(zlib.compress(codecs.encode(hamlet, 'base64')))
    102788
    >>> len(zlib.compress(codecs.encode(hamlet, 'hex')))
    88827
    >>> len(zlib.compress(base64.a85encode(hamlet)))
    121364


FWIW, Python (≥ 3.4) has an ascii85 implementation in the standard library:

    >>> import base64
    >>> base64.a85encode(b'spam')
    b'F)YQ)'
https://docs.python.org/3/library/base64.html#base64.a85enco...


Huh, my results are drastically different:

    $ dd if=/dev/urandom bs=512 count=2048 | base64 | gzip | wc
    ...
    4088   31928 1059085

    $ dd if=/dev/urandom bs=512 count=2048 | xxd -p | gzip | wc
    ...
    5019   33798 1231268
1 megabyte of random data consistently results in ~1 megabyte of compressed base64 text, ~1.2 megabytes of compressed hex.


Repeating the exercise using a photograph (https://imgur.com/B4tqkrZ):

    $ pv lock-your-screen.png | base64 | gzip | wc
     2.1MiB 0:00:00 [  13MiB/s] [================================>] 100%
       8641   48904 2264151


    $ pv lock-your-screen.png | xxd -p | gzip | wc
     2.1MiB 0:00:00 [4.41MiB/s] [================================>] 100%
      10109   49956 2573293

See the same against the jpg version (I've no idea why I have kept both a jpg and png of the same image around, especially given jpg is the much better suited format):

    $ pv lock-your-screen.jpg | base64 | gzip | wc
     377KiB 0:00:00 [19.2MiB/s] [================================>] 100%
       1420    8373  392796


    $ pv lock-your-screen.jpg | xxd -p | gzip | wc
     377KiB 0:00:00 [9.36MiB/s] [================================>] 100%
       1487    8935  441077
In both cases base64 is both faster and compresses smaller.


This test isn't very informative because both .png and .jpg are already compressed formats, with "better than gzip" strength so gzip/deflate isn't going to be able to compress the underlying data.

You only see some compression because gzip is just backing out some of the redundancy added by the hex or base64 encoding, and the way the huffman coding works favors base64 slightly.

Try with uncompressed data and you'll get a different result.

Your speed comparison seems disingenuous: you are benchmarking "xxd", a generalized hex dump tool, against base64, a dedicated base-64 library. I wouldn't expect their speeds to have any interesting relationship with best possible speed of a tuned algorithm.

There is little doubt that base-16 encoding is going to be very fast, and trivially vectorizable (in a much simpler way than base-64).


> both .png and .jpg are already compressed formats, with "better than gzip" strength

FWIW, PNG and gzip both use the DEFLATE algorithm, so I wouldn't call PNG's compression "better than gzip".

Source: https://en.wikipedia.org/wiki/DEFLATE

> This has led to its widespread use, for example in gzip compressed files, PNG image files and the ZIP file format for which Katz originally designed it.


Like any domain-specific algorithm, PNG uses deflate as a final step after using image-specific filters which take advantage of typical 2D features. So in general png will do much better than gzip on image data, but it will essentially always do at least as well (perhaps I should have said that originally). In particular, the worst-case .png compression (e.g., if you pass it random data, or text or something) is to use the "no op" filter followed by the usual deflate compression, which will end up right around plain gzip.

Now at least as good is enough for my point: by compressing a .png file with gzip you aren't going to see additional compression in general. When compressing a base-64 or hex encoding .png file, the additional compression you see is largely only a result of removing the redundancy of the encoding, not any compression of the underlying image.


Ooops, that should read "Like many domain-specific algorithms" not "Like any ..."


My data wasn't quite random! It repeats every 1 KB, much smaller than zlib's window size (which is 32 KB)


A good rule-of-thumb might be that if your results show that you able to consistently compress supposedly random data to less than the size required for just the random binary bits, you should either recheck your numbers, verify your random number generator, or quickly file for a patent!


You should add googling “Shannon” “entropy” and “information theory” to that list ;-)


Try with (much) more data, theoretically they should even out at some point.


If you’re compressing it anyway why turn it into base64 or ascii hex? The compressed data will be binary anyway, so just compress the input data directly.


One example is JSON/XML/HTML responses over HTTP: you need the 8-bit ("binary") data in an ASCII format to fit into JSON/XML/HTML, while HTTP provides gzip (or DEFLATE, Brotli, etc.) compression over the whole response if both the client (by including the "Accept-Encoding" header in the request) and the server implementations support it.


Correct, we're sending data points as typed arrays inside a much larger JSON payload.


Most of the modern Web is unusable without a highly optimized JIT-compiling JS interpreter. Are you sure about the importance of interoperability & compatibility?


I'm not sure why you believe that is a counterexample - it is anything but. That interpreter can still run code written against the original DOM APIs in the original syntax, as defined 22 years ago - around the time the <noscript> tag appeared in HTML!

Of course if mom and pop were using UUCP and an ADM-3A to fetch their mail, they couldn't view the images -- but all the infrastructure that allowed them to read mail at all could be repurposed without change, possibly including their modem (thanks, RS-232, introduced 1960), if they bought a 80386 desktop with a VGA card

These are all interoperability success stories! Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability


> Meanwhile, I need 19 different IM apps on my phone to talk to 19 different people. That's what happens when you ignore interoperability

Interop was not ignored, it was willfully cast aside. That's the world of walled gardens we live in today... federation makes it possible for users to switch outside your walled garden.


Yes, HTTP and all of the layers below are quite fundamental to the success of the WWW. Interoperability breeds competition and real innovation because nobody can lean on a network effect the way modern walled gardens do.


Not everything goes over the web and is gzipped.

I've been working on a project lately with 1200bps radio links that are making REST calls on the other end and 2 vs 3 bytes makes a massive difference.


And why don’t you just send it in binary? REST doesn’t mean you have to use JSON for the HTTP body.


Yeah, that would be ideal.

Some of the protocols we talk over (APRS) are ASCII only (which is a bit of point B from above).

Transmissions are also signed at the radio so that they can be authenticated down the line which means that you can't easily change the payload envelope once it leaves the radio.

Our main goal is adoption, and talking base64 means that, for lots of APIs, an intermediate server to do the translation isn't needed. This lowers the barrier to use and means we have a larger surface area of use cases.


It sounds like her use case is badly suited for raw binary transmission due to the high possibility of data loss, much like the internet used to be when base64 was originally introduced.


Base64 on the web is such an unbelievably terrible solution to such a trivial problem that it beggars belief. The idea that parsing CSS somehow blocks on decompressing string delimited, massively inefficient, nonlinearly encoded blobs of potentially incompressible data is insane.

We've taken the most obvious latency path a normal user sees and somehow decided that mess was better than sending an ar file.

(Not that this wasn't inevitable as soon as someone decided CSS should be a text format... sigh)


Why should parsing CSS block on decompressing base64, unless the CSS itself is a base64-encoded data: URI?

If you're talking about data: background images in CSS, then all the CSS parser has to do is find the end of the url(). It doesn't have to do any base64 decoding or anything like that until the rule actually matches.


You have to gz-decompress it, which is a harder job, and the only reason you're doing that (at least for images) is to undo the inefficiency you just added.


Ah, gz-decompress the stylesheet itself, ok.

For what it's worth, in my profiles of pageloads in at least Firefox I haven't seen gz-decompression of stylesheets show up in any noticeable way, but I can believe that it could be a problem if you have a lot of data: images in the sheet...


I once came across a site that was base64 encoding GET params for an incredibly insecure HTTP route.

E.g., the request was something like http://example.com/bad/?v=YT0xMjMmYj00NTY= which they would then decode on the server to a=123&b=456



I'm not sure what your point is here?

I mean, we could all go and dig up links to base64 encoders and decoders, but the article you're commenting on is specifically about how fast they've been able to get using vector instructions on modern processors.


They provide an optimized non-SIMD generic implementation. The version I posted is similar, speed-wise, and it supports in-place base64 encoding/decoding (i.e. allowing aliasing, supporting the case of using one buffer for both input and output).



