
Binary encodings

December 6, 2018.

Binary encodings are essential whenever binary data has to be stored in, or passed through, a medium that is designed for text.

Let’s have a look at possible issues if no encoding is used and raw binary data is handled instead.

Encoding schemes

There are several encoding schemes. Each scheme has a radix (or base), which is the number of distinct characters available to represent binary data. Base16 has 16 characters to encode with, while Base64 has 64.

Scheme       Alphabet         Alignment
Base16       [0-9A-F]         Not needed
Base32       [A-Z2-7]         Zero padding, 40 bits
Base64       [A-Za-z0-9+/]    Zero padding, 24 bits
Base64-URL   [A-Za-z0-9-_]    Zero padding, 24 bits
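The alignment column follows from the radix: a symbol carries log2(radix) bits, and the alignment block is the smallest number of bits that is a whole number of bytes and a whole number of symbols. A minimal sketch of that arithmetic, where `block_layout` is a hypothetical helper name chosen for this illustration:

```python
from math import gcd

# Radix of each scheme from the table above.
SCHEMES = {"Base16": 16, "Base32": 32, "Base64": 64}

def block_layout(radix):
    """Return (bits per symbol, alignment block in bits, symbols per block)."""
    bits = radix.bit_length() - 1        # log2(radix) for a power of two
    block = 8 * bits // gcd(8, bits)     # least common multiple of 8 and bits
    return bits, block, block // bits

for name, radix in SCHEMES.items():
    bits, block, symbols = block_layout(radix)
    print(f"{name}: {bits} bits/symbol, {block}-bit block, {symbols} symbols per block")
```

For Base16 the block is a single byte, which is why no padding is ever needed there.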

Encoding is about grouping the bits and mapping each group to a symbol from the alphabet. How many symbols you have depends on the base of the encoding scheme. In Base16, you have 16 symbols, from 0 to 9 and from A to F, and each symbol represents 4 bits. Below is the private exponent of a 512-bit RSA private key. It is encoded in Base16, and the bytes are separated by colons for better readability.
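The same colon-separated Base16 formatting can be sketched in Python. The input here is a handful of arbitrary bytes, not the key itself, and `to_hex_colons` is a hypothetical helper name:

```python
def to_hex_colons(data: bytes) -> str:
    """Encode bytes in Base16 (upper case) with a colon between bytes."""
    # Each byte becomes exactly two symbols, so no padding is ever needed.
    return ":".join(f"{byte:02X}" for byte in data)

sample = bytes([0x00, 0xDE, 0xAD, 0xBE, 0xEF])  # arbitrary bytes, not key material
print(to_hex_colons(sample))
```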


Let’s have a look at Base64 in more detail. As explained, there are 64 symbols in the alphabet, so each symbol represents exactly 6 bits of data. Now we have a problem: 6-bit groups don’t line up with 8-bit bytes. The two only meet every 24 bits, the least common multiple of 6 and 8, which is 3 bytes or 4 symbols. This is the alignment between the encoding scheme and the byte representation, and it puts a requirement on the length of the encoded string: the output must cover a multiple of 24 bits, with padding characters added where necessary to maintain alignment.
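How many padding characters are needed depends only on the input length modulo 3. A small sketch, checked against Python's standard `base64` module, with `padding_chars` as a hypothetical helper name:

```python
import base64

def padding_chars(n_bytes: int) -> int:
    """Number of '=' characters Base64 appends for an input of n_bytes."""
    leftover = n_bytes % 3          # bytes that do not fill a 24-bit block
    return (3 - leftover) % 3       # leftover 0 -> 0, 1 -> 2, 2 -> 1

for n in range(1, 7):
    encoded = base64.b64encode(bytes(n)).decode("ascii")
    print(n, encoded, padding_chars(n))
```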

Let’s demonstrate this with an example and encode 0xDEADBEEF. First, do it on the command line with echo. The -n flag omits the trailing newline from the input, and -e enables backslash escapes so that each \xHH is read as a hex-encoded byte.

$ echo -n -e "\xDE\xAD\xBE\xEF" | base64
3q2+7w==

Now step by step:

1. Hex       D    E    A    D    B    E    E    F
2. Binary    1101 1110 1010 1101 1011 1110 1110 1111
3. Align     110111  101010  110110  111110  111011  110000  000000  000000
4. Encoding  enc(55) enc(42) enc(54) enc(62) enc(59) enc(48) pad     pad
5. Encoded   3       q       2       +       7       w       =       =

See how we used zero padding in the sixth chunk: only two data bits were left, so four zero bits fill the chunk up to 6 bits. These filler bits carry no data and are discarded on decoding. Also notice how we appended two additional chunks to maintain the 24-bit alignment. They result in two padding characters (=), which take up extra space.
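The steps above can be sketched in a few lines of Python. This is an illustration of the procedure rather than a production encoder, and `b64_by_hand` is a hypothetical name. The result is compared with the standard `base64` module:

```python
import base64

ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_by_hand(data: bytes) -> str:
    # Step 2: write the input as one long bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Step 3: zero-pad the last chunk so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Step 4: map every 6-bit chunk to its symbol in the alphabet.
    encoded = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Step 5: append '=' until the output is a multiple of 4 symbols (24 bits).
    return encoded + "=" * (-len(encoded) % 4)

data = bytes.fromhex("DEADBEEF")
print(b64_by_hand(data))
print(base64.b64encode(data).decode("ascii"))
```

Both lines should print the same string as the echo command above.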

Space constraints

Let’s have a look at how many characters each scheme needs to encode 1K (1024) bytes of data.

Scheme   Characters   Relative to Base64
Base16   2048         ~ 150%
Base32   1640         ~ 120%
Base64   1368         100%

Base64 is the most efficient in terms of space compared to the others.
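These character counts can be reproduced with Python's standard `base64` module. A minimal sketch, assuming a buffer of 1024 zero bytes as the input, since only its length matters here:

```python
import base64

data = bytes(1024)  # 1K bytes of zeros; only the length matters

sizes = {
    "Base16": len(base64.b16encode(data)),
    "Base32": len(base64.b32encode(data)),
    "Base64": len(base64.b64encode(data)),
}

for name, size in sizes.items():
    ratio = size / sizes["Base64"]
    print(f"{name}: {size} characters, {ratio:.0%} of the Base64 size")
```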

You can’t encode strings

Lastly, I want to point out a common fallacy. We can’t binary-encode a string directly, for the simple reason that a string is not binary data. First we need to know the character encoding of the string, be it ASCII, ISO-8859-1, UTF-8 or something else. Once we have it, we can convert the string into raw bytes, and only then can we encode the resulting binary data. The same rationale applies to decoding: binary decoding gives us raw bytes, and to build the string from them we need to know the character encoding.
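A short Python sketch makes the point concrete. The sample word and the two charsets are arbitrary choices for illustration: the same string yields different bytes, and therefore different Base64 output, depending on the character encoding.

```python
import base64

text = "na\u00efve"  # "naive" with a diaeresis on the i

utf8_bytes = text.encode("utf-8")          # the accented letter takes two bytes
latin1_bytes = text.encode("iso-8859-1")   # the accented letter takes one byte

print(base64.b64encode(utf8_bytes))
print(base64.b64encode(latin1_bytes))

# Decoding: Base64 only gives the bytes back; the charset turns them into text.
raw = base64.b64decode(base64.b64encode(utf8_bytes))
print(raw.decode("utf-8") == text)
```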

Send feedback to me @karakays_.
