December 6, 2018.
Binary encodings are essential when handling binary data. Let's have a look at the issues that can arise if no encoding is used and raw binary data is handled instead.
How should a platform interpret binary data? Is it in little- or big-endian format? What other assumptions does the data carry, and how does the platform become aware of them?
How do we sort out incompatibilities between platforms? For instance, think about metacharacters or control characters like NUL, CR, LF, EOT etc. Such inputs might have different meanings on different architectures.
Persisting binary data is problematic: a file written on one platform may be misinterpreted on another.
Transmitting binary data is dangerous. A binary payload can interfere with the transfer protocol and cause misinterpretation.
What if the transmission channel does not allow binary data at all? SMTP, originally limited to 7-bit ASCII, is the classic example.
By nature, humans are comfortable with readable, printable characters. Raw binary data feels strange to interact with.
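The endianness question above is easy to demonstrate. A minimal Python sketch with the standard `struct` module shows that the same 32-bit value has two different byte layouts:

```python
import struct

value = 0xDEADBEEF

# Pack the same 32-bit unsigned integer in both byte orders.
little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(little)  # b'\xef\xbe\xad\xde'
print(big)     # b'\xde\xad\xbe\xef'
```

A consumer that assumes the wrong byte order reads a completely different number from the same four bytes.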
There are different encoding schemes out there. Each scheme has a radix (or base), which is the number of characters available to represent binary data. While Base16 offers 16 characters, Base64 gives you 64 to encode with.
| Scheme     | Alphabet        | Padding                |
|------------|-----------------|------------------------|
| Base16     | [0-9A-F]        | Not required           |
| Base32     | [A-Z2-7]        | Zero padding, 40 bits  |
| Base64     | [A-Za-z0-9+/]   | Zero padding, 24 bits  |
| Base64-URL | [A-Za-z0-9-_]   | Zero padding, 24 bits  |
Encoding is about grouping the bits and mapping each group to a symbol from the alphabet. How many symbols you have depends on the base of the encoding scheme: Base16 gives you 16 symbols, 0 to 9 and A to F. Below is the private exponent of a 512-bit RSA private key. It is encoded in Base16, with each byte separated by a colon for better readability.
14:37:1A:1E:DE:88:40:75:42:AE:46:3F:71:A9:FB:93:10:CF:DB:13:B3:52:26:AE:E2:2D:34:83:7B:01:34:49:F3:15:FB:13:24:B6:94:47:65:CD:6B:8E:DD:D5:FE:F9:8F:4B:ED:02:E2:1C:1E:8C:2C:BF:B7:70:AF:D7:93:09
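A colon-separated Base16 dump like the one above can be produced with Python's built-in `bytes.hex` (the separator argument exists since Python 3.8). The four bytes here are a short stand-in, not the real key:

```python
# Base16 (hex) encoding with a colon separator, as in the dump above.
data = b"\xDE\xAD\xBE\xEF"  # stand-in for the actual key bytes
print(data.hex(":").upper())  # DE:AD:BE:EF
```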
Let's have a look at Base64 in more detail. As explained, its alphabet has 64 symbols, so each symbol represents exactly 6 bits of data. That creates a problem: 6-bit symbols can't be mapped onto 8-bit bytes directly. The smallest unit both divide evenly is 24 bits (the least common multiple of 6 and 8), so the encoder works on groups of 24 bits and, where necessary, uses padding characters to maintain that alignment. This alignment also constrains the length of the encoded string.
Let's demonstrate this with an example and encode 0xDEADBEEF. First, do this on the command line with `echo`. The `-n` flag omits the trailing newline and `-e` makes `echo` interpret backslash escapes such as `\xDE`.

```
$ echo -n -e "\xDE\xAD\xBE\xEF" | base64
3q2+7w==
```
Now, step by step: see how we used zero padding in the sixth chunk; it didn't change the encoding symbol. Also notice how we appended two padding characters = to maintain the 24-bit alignment, at the cost of a little extra space.
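The steps above can be reproduced by hand. Here is a minimal Python sketch that walks the bits exactly as described: build one bit string, zero-pad it to 6-bit chunks, map each chunk to a symbol, then append `=` characters for the 24-bit alignment:

```python
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def base64_encode(data: bytes) -> str:
    # Concatenate all input bytes into one bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Zero-pad the bit string up to a multiple of 6.
    if len(bits) % 6:
        bits += "0" * (6 - len(bits) % 6)
    # Map each 6-bit chunk to a symbol from the alphabet.
    symbols = "".join(
        BASE64_ALPHABET[int(bits[i:i + 6], 2)]
        for i in range(0, len(bits), 6)
    )
    # Append '=' until the length is a multiple of 4,
    # i.e. until the output represents full 24-bit groups.
    symbols += "=" * (-len(symbols) % 4)
    return symbols

print(base64_encode(b"\xDE\xAD\xBE\xEF"))  # 3q2+7w==
```

This produces the same output as the `base64` command above; a real implementation would of course work on bytes rather than character bit strings.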
Let's have a look at how many characters we need to encode 1K bytes of data in each scheme.
Base64 is the most space-efficient of the three: every 3 bytes become 4 characters (33% overhead), compared to 60% overhead for Base32 and 100% for Base16.
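These counts are easy to verify with Python's standard `base64` module; the content of the input doesn't matter, only its length:

```python
import base64

data = bytes(1024)  # 1K bytes of arbitrary binary data

print(len(base64.b16encode(data)))  # 2048 characters
print(len(base64.b32encode(data)))  # 1640 characters
print(len(base64.b64encode(data)))  # 1368 characters
```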
You can’t encode strings
Lastly, I want to point out a common fallacy. We can't encode strings, for the simple reason that strings are not binary data. We first need to know the character encoding of the string, be it ASCII, ISO-8859-1, UTF-8 etc. Once we have it, we can turn the string into raw bytes and only then encode the resulting binary data. The same rationale applies to decoding: Base64 decoding gives us raw bytes, and to rebuild the string from them we need to know the character encoding.
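In Python the two steps are explicit: `str.encode` applies the character encoding, and only then does Base64 come into play, with the reverse order on the way back:

```python
import base64

text = "naïve"                       # a string, not binary data
raw = text.encode("utf-8")           # character encoding: str -> bytes
encoded = base64.b64encode(raw)      # binary encoding: bytes -> Base64
print(encoded)                       # b'bmHDr3Zl'

decoded = base64.b64decode(encoded)  # Base64 -> raw bytes
print(decoded.decode("utf-8"))       # back to the string: naïve
```

Note that decoding the same Base64 output with the wrong character encoding (say, Latin-1 instead of UTF-8) would yield a different, garbled string.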