Sunday, September 29, 2019

Base64 Binary-to-Text Encoding

Base64

Base-64 encoding is a way of turning binary data into text. It is a convenient way to store or transmit binary data over media designed for textual data.
It encodes arbitrary binary data as ASCII text, using 4 characters for every 3 bytes of data, plus possibly some padding at the end.


It's a textual encoding of binary data where the resultant text has nothing but letters, numbers and the symbols "+", "/" and "=".
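
As a quick illustration of the round trip, here is a minimal sketch. It assumes Java 8 or later and uses the JDK's java.util.Base64 class, which is not the library used in the listings further down; the class name Base64RoundTrip is only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64RoundTrip {
    // Encode the UTF-8 bytes of the text as a Base64 string.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a Base64 string back to the original text.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("Sridhar");
        System.out.println(encoded);         // U3JpZGhhcg==
        System.out.println(decode(encoded)); // Sridhar
    }
}
```

The output contains only letters, digits and "=", so it can be pasted into any text field or text protocol.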

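| Base    | Values                                                                                 | States |
|---------|----------------------------------------------------------------------------------------|--------|
| Base 2  | {0,1}                                                                                  | 2      |
| Base 10 | {0,1,..,9}                                                                             | 10     |
| Base 16 | {0,1,..,9,A,B,..,F}                                                                    | 16     |
| Base 64 | {0,1,..,63}, where 0-25 = {A,..,Z}, 26-51 = {a,..,z}, 52-61 = {0,..,9}, 62-63 = {+,/}  | 64     |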
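
The Base 64 row of the table can be written down as a lookup string. The sketch below is illustrative only; the class and method names (Base64Alphabet, digitFor, valueOf) are made up for this example.

```java
class Base64Alphabet {
    // Index 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, 62 = '+', 63 = '/'.
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "abcdefghijklmnopqrstuvwxyz"
            + "0123456789"
            + "+/";

    // Map a 6-bit value (0..63) to its Base64 character.
    static char digitFor(int value) {
        if (value < 0 || value > 63) {
            throw new IllegalArgumentException("Not a 6-bit value: " + value);
        }
        return ALPHABET.charAt(value);
    }

    // Map a Base64 character back to its 6-bit value.
    static int valueOf(char digit) {
        int value = ALPHABET.indexOf(digit);
        if (value < 0) {
            throw new IllegalArgumentException("Not a Base64 character: " + digit);
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 25, 26, 51, 52, 61, 62, 63}) {
            System.out.println(v + " -> " + digitFor(v));
        }
    }
}
```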

In computer science, Base64 is a group of binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding. Each Base64 digit represents exactly 6 bits of data. Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit Base64 digits.
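
To make the 24-bit regrouping concrete, the following sketch packs three bytes into one integer and cuts it into four 6-bit values. The class name SextetSplit and the method toSextets are hypothetical names chosen for this example.

```java
import java.util.Arrays;

class SextetSplit {
    // Pack three 8-bit bytes into 24 bits, then cut the 24 bits into four 6-bit values.
    static int[] toSextets(byte b0, byte b1, byte b2) {
        int bits = ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
        return new int[] {
            (bits >>> 18) & 0x3F,
            (bits >>> 12) & 0x3F,
            (bits >>> 6) & 0x3F,
            bits & 0x3F
        };
    }

    public static void main(String[] args) {
        // 'M' = 77, 'a' = 97, 'n' = 110
        int[] sextets = toSextets((byte) 'M', (byte) 'a', (byte) 'n');
        System.out.println(Arrays.toString(sextets)); // [19, 22, 5, 46]
    }
}
```

The values 19, 22, 5 and 46 correspond to the Base64 digits T, W, F and u, which is why text starting with "Man" encodes to "TWFu".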

Why Base-64, not Decimal or Hexadecimal?

The two alternatives for converting binary data into text that immediately spring to mind are:
  • Decimal: store the decimal value of each byte as three digits: 045 112 101 037 etc., where each byte is represented by 3 characters. The data bloats three-fold.
  • Hexadecimal: store the bytes as hex pairs: AC 47 0D 1A etc., where each byte is represented by 2 characters. The data bloats two-fold.
Base-64 maps 3 bytes (8 x 3 = 24 bits) to 4 characters that each carry 6 bits (6 x 4 = 24 bits). The result looks something like "TWFuIGlzIGRpc3Rpb...". The bloating is therefore only 4/3 ≈ 1.33 times the original.
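
The three growth factors can be checked with a small experiment. This is a sketch under stated assumptions: it uses java.util.Base64 (Java 8+), writes the decimal and hexadecimal forms without separators, and the class name EncodingSizes is made up for the example.

```java
import java.util.Base64;

class EncodingSizes {
    // Three decimal digits per byte, e.g. 45 -> "045".
    static String toDecimal(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%03d", b & 0xFF));
        }
        return sb.toString();
    }

    // Two hexadecimal digits per byte, e.g. 172 -> "AC".
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Four Base64 characters per three bytes.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[300];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        System.out.println("Decimal: " + toDecimal(data).length()); // 900
        System.out.println("Hex    : " + toHex(data).length());     // 600
        System.out.println("Base64 : " + toBase64(data).length());  // 400
    }
}
```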

Why not store/transmit binary data directly?

  • Data Integrity: When you want to ship binary data across a network, you generally don't stream the raw bits and bytes over the wire. Why? Because some media are made for streaming text. Some protocols may interpret parts of your binary data as control characters (like a modem), or the data could be corrupted because the underlying protocol treats certain byte sequences as special (like how FTP translates line endings).
  • Hashes: Hashes are one-way functions that transform a block of bytes into another block of bytes of a fixed size, such as 128 bits (MD5) or 256 bits (SHA-256). Converting the resulting bytes to Base64 makes the hash much easier to display, especially when you are comparing a checksum for integrity. Hashes are so often seen in Base64 that many people mistake Base64 itself for a hash.
  • Cryptography: Encryption keys and encrypted data are raw bytes, not text. When they have to be stored in a text file or a database, Base64 comes in handy.
  • Certificates: X.509 certificates in PEM format are Base64 encoded.
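
For the hash case in the list above, the sketch below prints the same SHA-256 digest as hexadecimal and as Base64. It assumes Java 8+ (java.security.MessageDigest and java.util.Base64); the class name HashAsText is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class HashAsText {
    // Compute the 32-byte SHA-256 digest of the UTF-8 bytes of the text.
    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Lower-case hexadecimal form: 2 characters per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] digest = sha256("Sridhar");
        System.out.println("Hex    (64 chars): " + toHex(digest));
        System.out.println("Base64 (44 chars): " + Base64.getEncoder().encodeToString(digest));
    }
}
```

Note that Base64 here only changes how the digest is displayed. The hashing is done by SHA-256, and the Base64 text can be decoded back to the same 32 bytes.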

Encoding (Output padding)

 

Base64 converts 3 input bytes at a time (3 x 8 = 24 bits) into 4 output characters (4 x 6 = 24 bits). When the input length is not a multiple of 3, the last group is short and the output is padded: a final == indicates that the last group contained only one byte, and a final = indicates that it contained two bytes.

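| Input Length | Input Text | Output Length | Output       | Padding | Last group contained |
|--------------|------------|---------------|--------------|---------|----------------------|
| 1            | a          | 4             | YQ==         | == (2)  | 1 Byte  (a)          |
| 2            | ab         | 4             | YWI=         | =  (1)  | 2 Bytes (ab)         |
| 3            | abc        | 4             | YWJj         |    (0)  | 3 Bytes (abc)        |
| 7            | Sridhar    | 12            | U3JpZGhhcg== | == (2)  | 1 Byte  (r)          |
| 6            | Sridha     | 8             | U3JpZGhh     |    (0)  | 3 Bytes (dha)        |
| 5            | Sridh      | 8             | U3JpZGg=     | =  (1)  | 2 Bytes (dh)         |
| 4            | Srid       | 8             | U3JpZA==     | == (2)  | 1 Byte  (d)          |
| 3            | Sri        | 4             | U3Jp         |    (0)  | 3 Bytes (Sri)        |
| 2            | Sr         | 4             | U3I=         | =  (1)  | 2 Bytes (Sr)         |
| 1            | S          | 4             | Uw==         | == (2)  | 1 Byte  (S)          |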
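
The Padding column follows directly from the input length modulo 3. The sketch below computes it and compares it with what an encoder actually emits. It assumes java.util.Base64 (Java 8+), and the names PaddingCount, paddingFor and countPadding are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class PaddingCount {
    // Number of '=' characters for an input of the given length in bytes:
    // length % 3 == 0 -> 0, == 1 -> 2, == 2 -> 1.
    static int paddingFor(int inputLength) {
        return (3 - inputLength % 3) % 3;
    }

    // Count the '=' characters at the end of an encoded string.
    static int countPadding(String encoded) {
        int count = 0;
        for (int i = encoded.length() - 1; i >= 0 && encoded.charAt(i) == '='; i--) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Sridhar";
        // Walk through the prefixes "S", "Sr", ..., "Sridhar".
        for (int n = 1; n <= text.length(); n++) {
            String input = text.substring(0, n);
            String encoded = Base64.getEncoder()
                    .encodeToString(input.getBytes(StandardCharsets.US_ASCII));
            System.out.println(input + " -> " + encoded + " (padding " + paddingFor(n) + ")");
        }
    }
}
```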


The ratio of output bytes to input bytes is 4:3 (33% overhead). Specifically, given an input of n bytes, the output will be 4⌈n/3⌉ bytes long, including padding characters. The padding character is technically not needed for decoding, since the number of missing bytes can be calculated from the number of Base64 digits. In some implementations, the padding character is mandatory, while for others it is not used. One exception, where padding characters are technically required, is when multiple Base64 encoded files have been concatenated.
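
The length formula can be written with integer arithmetic only. The sketch below also gives the length when padding is left out, and the usage note compares both against java.util.Base64 (Java 8+). The class name EncodedLength and its method names are hypothetical.

```java
import java.util.Base64;

class EncodedLength {
    // 4 * ceil(n / 3): every started group of 3 bytes produces 4 characters.
    static int paddedLength(int n) {
        return 4 * ((n + 2) / 3);
    }

    // ceil(4n / 3): the same output with the '=' characters left out.
    static int unpaddedLength(int n) {
        return (4 * n + 2) / 3;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 7; n++) {
            byte[] data = new byte[n];
            int padded = Base64.getEncoder().encode(data).length;
            int unpadded = Base64.getEncoder().withoutPadding().encode(data).length;
            System.out.println(n + " bytes -> " + padded + " padded ("
                    + paddedLength(n) + " predicted), " + unpadded + " unpadded ("
                    + unpaddedLength(n) + " predicted)");
        }
    }
}
```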

Decoding with Padding

 

When decoding Base64 text, four characters are typically converted back to three bytes. The only exceptions are when padding characters exist. A single = indicates that the four characters will decode to only two bytes, while == indicates that the four characters will decode to only a single byte.

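| Encoded  | Padding | Bytes in last group | Decoded |
|----------|---------|---------------------|---------|
| U3JpZA== | == (2)  | 1                   | Srid    |
| U3JpZGg= | =  (1)  | 2                   | Sridh   |
| U3JpZGhh |    (0)  | 3                   | Sridha  |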
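
The decoded size can be predicted from the padded text alone: three bytes per four characters, minus one byte per "=". A minimal sketch, assuming padded input whose length is a multiple of 4; the class name DecodedLength is made up for this example, and java.util.Base64 (Java 8+) is used only for comparison in main.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DecodedLength {
    // Predict the number of decoded bytes for padded Base64 text.
    static int decodedLength(String encoded) {
        if (encoded.length() % 4 != 0) {
            throw new IllegalArgumentException("Padded Base64 length must be a multiple of 4");
        }
        int padding = 0;
        if (encoded.endsWith("==")) {
            padding = 2;
        } else if (encoded.endsWith("=")) {
            padding = 1;
        }
        return encoded.length() / 4 * 3 - padding;
    }

    public static void main(String[] args) {
        for (String encoded : new String[] {"U3JpZA==", "U3JpZGg=", "U3JpZGhh"}) {
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(encoded + " -> "
                    + new String(decoded, StandardCharsets.US_ASCII)
                    + " (" + decodedLength(encoded) + " bytes predicted)");
        }
    }
}
```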

Decoding without Padding

 

Without padding, after normal decoding of four characters to three bytes over and over again, fewer than four encoded characters may remain. In this situation only two or three characters can remain. A single remaining encoded character is not possible, because a single Base64 character only contains 6 bits and 8 bits are required to create a byte. A minimum of 2 Base64 characters is therefore required: the first character contributes 6 bits, and the second character contributes its first 2 bits.

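| Characters in last group | Encoded  | Bytes in last group | Decoded |
|--------------------------|----------|---------------------|---------|
| 2                        | U3JpZA   | 1                   | Srid    |
| 3                        | U3JpZGg  | 2                   | Sridh   |
| 4                        | U3JpZGhh | 3                   | Sridha  |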
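
One way to handle unpadded input is to restore the missing "=" characters before decoding. The sketch below does that and rejects the impossible case of a single leftover character. The class and method names (UnpaddedDecode, restorePadding) are hypothetical. It assumes java.util.Base64 (Java 8+), whose basic decoder is documented to accept input without padding as well, so the helper mainly matters for stricter decoders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class UnpaddedDecode {
    // Append '=' until the length is a multiple of 4.
    // A remainder of 1 cannot occur in valid Base64 text.
    static String restorePadding(String unpadded) {
        int remainder = unpadded.length() % 4;
        if (remainder == 1) {
            throw new IllegalArgumentException("A single leftover character holds only 6 bits");
        }
        StringBuilder sb = new StringBuilder(unpadded);
        for (int i = remainder; i % 4 != 0; i++) {
            sb.append('=');
        }
        return sb.toString();
    }

    static String decode(String unpadded) {
        byte[] bytes = Base64.getDecoder().decode(restorePadding(unpadded));
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        for (String encoded : new String[] {"U3JpZA", "U3JpZGg", "U3JpZGhh"}) {
            System.out.println(encoded + " -> " + restorePadding(encoded)
                    + " -> " + decode(encoded));
        }
    }
}
```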

Example step by step

 
Source (Text): Sridhar

String to be encoded: “Sridhar”. Length = 7 bytes, which is not a multiple of 3.
The last group therefore holds only 1 byte instead of 3, so the encoded output ends with 2 padding characters. The padding character is the “=” sign.

Note: In this scheme each “=” corresponds to two zero bits (00) appended to the data, so two padding characters correspond to four zero bits (0000).

Step 1 : Convert each character to decimal.
S=83,r=114,i=105,d=100,h=104,a=97,r=114
[83,114,105,100,104,97,114]

Step 2 : Change each decimal to 8-bit binary representation. (128 64 32 16 8 4 2 1)
83=01010011,114=01110010,105=01101001,100=01100100,104=01101000,97=01100001,114=01110010
[01010011 01110010 01101001 01100100 01101000 01100001 01110010]

Step 3 : Split the bits into groups of 6.
[010100 110111 001001 101001 011001 000110 100001 100001 011100 10]
The last group has only 2 bits, so we append four zero bits “0000” to complete it.
[010100 110111 001001 101001 011001 000110 100001 100001 011100 100000]
Now every group has 6 bits. The two equals signs at the end show that four zero bits were added (this helps in decoding).
[010100 110111 001001 101001 011001 000110 100001 100001 011100 100000  ==]

Step 4 : Convert each 6-bit group to decimal.
010100=20,110111=55,001001=9,101001=41,011001=25,000110=6,100001=33,100001=33,011100=28,100000=32, ==
20 55 9 41 25 6 33 33 28 32

Step 5 : Convert each decimal value to its Base64 character using the Base64 chart.
U3JpZGhhcg==
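
The five steps can be followed literally in code. The sketch below mirrors the walk-through by building an explicit string of bits, which is easy to read but slow. It is meant as an illustration, not as a replacement for a library encoder. The class name ManualBase64 is made up for this example.

```java
import java.nio.charset.StandardCharsets;

class ManualBase64 {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static String encode(byte[] data) {
        // Steps 1 and 2: write every byte as 8 binary digits.
        StringBuilder bits = new StringBuilder();
        for (byte b : data) {
            String binary = Integer.toBinaryString(b & 0xFF);
            for (int i = binary.length(); i < 8; i++) {
                bits.append('0');
            }
            bits.append(binary);
        }
        // Step 3: append zero bits until the last group has 6 bits.
        while (bits.length() % 6 != 0) {
            bits.append('0');
        }
        // Steps 4 and 5: turn each 6-bit group into a value, then into a character.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bits.length(); i += 6) {
            int value = Integer.parseInt(bits.substring(i, i + 6), 2);
            out.append(ALPHABET.charAt(value));
        }
        // Pad with '=' until the output length is a multiple of 4.
        while (out.length() % 4 != 0) {
            out.append('=');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] input = "Sridhar".getBytes(StandardCharsets.US_ASCII);
        System.out.println(encode(input)); // U3JpZGhhcg==
    }
}
```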

Base64 Encoding and Decoding using Sun Java Library

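package com.java2depth;

import java.nio.charset.StandardCharsets;

// Note: sun.misc.BASE64Encoder and sun.misc.BASE64Decoder are internal JDK classes.
// They are available up to Java 8 and were removed in Java 9.
import sun.misc.BASE64Decoder;
import sun.misc.BASE64Encoder;

public class Base64SunTest {
    public static void main(String[] args) throws Exception {
        String input = "Man is distinguished, not only by his reason, but by this singular passion from other animals,"
                + "which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable"
                + "generation of knowledge, exceeds the short vehemence of any carnal pleasure.";
        // Encode the UTF-8 bytes of the input; this encoder wraps its output after 76 characters.
        String outputEncoded = new BASE64Encoder().encode(input.getBytes(StandardCharsets.UTF_8));
        System.out.println("Encoded Data :\n\t" + outputEncoded);
        // decodeBuffer throws IOException, which is why main declares "throws Exception".
        String outputDecoded = new String(new BASE64Decoder().decodeBuffer(outputEncoded), StandardCharsets.UTF_8);
        System.out.println("Decoded Data :\n\t" + outputDecoded);
    }
}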
Encoded Data :

TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLHdoaWNoIGlzIGEgbHVzdCBvZiB0
aGUgbWluZCwgdGhhdCBieSBhIHBlcnNldmVyYW5jZSBvZiBkZWxpZ2h0IGluIHRoZSBjb250aW51
ZWQgYW5kIGluZGVmYXRpZ2FibGVnZW5lcmF0aW9uIG9mIGtub3dsZWRnZSwgZXhjZWVkcyB0aGUg
c2hvcnQgdmVoZW1lbmNlIG9mIGFueSBjYXJuYWwgcGxlYXN1cmUu

Decoded Data :

    Man is distinguished, not only by his reason, but by this singular passion from other animals,which is a lust of the mind, that by a perseverance of delight in the continued and indefatigablegeneration of knowledge, exceeds the short vehemence of any carnal pleasure.
 

Why “\n” after 76 characters?
  • Breaking a Base64 encoded string into multiple lines was necessary for many old programs that couldn't handle long lines. Programs written in Java can usually handle long lines, since they don't need to do the memory management themselves, so in practice even very long single-line output is not a problem.
  • For this reason, encoders insert a newline character after a fixed number of characters. This is called line wrapping. A commonly given rationale for the number 76 is the practice of keeping lines within 80 characters while leaving a small margin. The limit itself comes from RFC 2045, which requires encoded lines to be no longer than 76 characters, and it is also the default wrap column of the Linux base64 command.
We don't need "\n" after 76 characters
  • Base64 encoders usually impose some maximum line (chunk) length and add newlines when necessary. We can normally configure that, but it depends on the particular encoder implementation.
  • Apache Commons Codec has a line length setting; setting it to zero (or a negative value) disables the line separation.
  • We are using the built-in Sun API (sun.misc.BASE64Encoder) to encode and decode the data, hence the encoded data is split after every 76 characters.
  • We can use the Apache Commons Codec API (org.apache.commons.codec.binary.Base64) to get the encoded data on a single line.

Base64 Encoding and Decoding using Apache Commons Codec Library
package com.java2depth;

import java.nio.charset.StandardCharsets;

import org.apache.commons.codec.binary.Base64;

public class Base64ApacheTest {
    public static void main(String[] args) throws Exception {
        String input = "Man is distinguished, not only by his reason, but by this singular passion from other animals,"
                + "which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable"
                + "generation of knowledge, exceeds the short vehemence of any carnal pleasure.";
        // new Base64() uses a line length of 0, so the encoded output is not split into lines.
        String outputEncoded = new String(new Base64().encode(input.getBytes(StandardCharsets.UTF_8)), StandardCharsets.US_ASCII);
        System.out.println("Encoded Data :\n\t" + outputEncoded);
        // Decode the Base64 text and interpret the resulting bytes as UTF-8.
        String outputDecoded = new String(new Base64().decode(outputEncoded), StandardCharsets.UTF_8);
        System.out.println("Decoded Data :\n\t" + outputDecoded);
    }
}
Encoded Data :
 TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlzIHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLHdoaWNoIGlzIGEgbHVzdCBvZiB0aGUgbWluZCwgdGhhdCBieSBhIHBlcnNldmVyYW5jZSBvZiBkZWxpZ2h0IGluIHRoZSBjb250aW51ZWQgYW5kIGluZGVmYXRpZ2FibGVnZW5lcmF0aW9uIG9mIGtub3dsZWRnZSwgZXhjZWVkcyB0aGUgc2hvcnQgdmVoZW1lbmNlIG9mIGFueSBjYXJuYWwgcGxlYXN1cmUu
Decoded Data :
 Man is distinguished, not only by his reason, but by this singular passion from other animals,which is a lust of the mind, that by a perseverance of delight in the continued and indefatigablegeneration of knowledge, exceeds the short vehemence of any carnal pleasure.
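
The same single-line versus wrapped behaviour can be sketched with the JDK's own java.util.Base64 (Java 8+), which is not one of the two libraries used in the listings of this post. Its basic encoder never inserts line breaks, while its MIME encoder wraps at 76 characters using "\r\n" as the line separator. The class name LineWrapDemo is hypothetical.

```java
import java.util.Base64;

class LineWrapDemo {
    // Basic encoder: one long line, no line separators.
    static String singleLine(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // MIME encoder: lines of at most 76 characters, separated by "\r\n".
    static String wrapped(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        System.out.println("Single line:\n" + singleLine(data));
        System.out.println("Wrapped:\n" + wrapped(data));
    }
}
```

Both forms carry the same data. Removing the line separators from the wrapped output gives back the single-line output.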

Base64 Encoding and Decoding using Linux Shell

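To encode text to Base64, pipe it to the base64 command. Lines are wrapped while encoding at the column given by the -w option (the default is 76, and -w 0 disables wrapping). To decode, use base64 -d.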


Server certificate in Base64 encoding format :

java2depth@java2depth-PC ~ $ openssl s_client -connect google.co.in:443

Server certificate
-----BEGIN CERTIFICATE-----
MIIEsDCCA5igAwIBAgIQPmb2e1RfA2vD38qb8d47HjANBgkqhkiG9w0BAQsFADBU
MQswCQYDVQQGEwJVUzEeMBwGA1UEChMVR29vZ2xlIFRydXN0IFNlcnZpY2VzMSUw
IwYDVQQDExxHb29nbGUgSW50ZXJuZXQgQXV0aG9yaXR5IEczMB4XDTE5MDUyMTIw
NDcyOFoXDTE5MDgxMzIwMzEwMFowaDELMAkGA1UEBhMCVVMxEzARBgNVBAgMCkNh
bGlmb3JuaWExFjAUBgNVBAcMDU1vdW50YWluIFZpZXcxEzARBgNVBAoMCkdvb2ds
ZSBMTEMxFzAVBgNVBAMMDiouZ29vZ2xlLmNvLmluMIIBIjANBgkqhkiG9w0BAQEF
AAOCAQ8AMIIBCgKCAQEAwDByURedzywtEjCudpFVR02WpTnNk4wAl6YVU62arFG5
/2xCue1tlyKwUUWfSDvd3TZF4j86XVMe8BN0VT/NlXGm358ojY+86fuN324cn7cB
04R1SBeYZa69+T2SfHOeeoe0jhNE1xzQZaCUKPX63mTICV6IJb6x6Z4ezarhsdAA
8FkVOZpyKDO1xRoIaUrqnw7yngjc+8sCRZq6hy8lzXDXiXjgMSogL8uqkqrIRS4f
1S8RVcdNEavitR661x2sE4PwGC8A9R1jq5QAC1RZFJSOy9NmROjUXguZEgbwXckw
UrDTKLQy1+PF3hXXhEkh3YllQyPSgISZbveODJWQLQIDAQABo4IBaDCCAWQwEwYD
VR0lBAwwCgYIKwYBBQUHAwEwPwYDVR0RBDgwNoIOKi5nb29nbGUuY28uaW6CCyou
Z29vZ2xlLmluggxnb29nbGUuY28uaW6CCWdvb2dsZS5pbjBoBggrBgEFBQcBAQRc
MFowLQYIKwYBBQUHMAKGIWh0dHA6Ly9wa2kuZ29vZy9nc3IyL0dUU0dJQUczLmNy
dDApBggrBgEFBQcwAYYdaHR0cDovL29jc3AucGtpLmdvb2cvR1RTR0lBRzMwHQYD
VR0OBBYEFEfPgxEGPaSuxL1XENJo3keqk+D9MAwGA1UdEwEB/wQCMAAwHwYDVR0j
BBgwFoAUd8K4UJpndnaxLcKG0IOgfqZ+ukswIQYDVR0gBBowGDAMBgorBgEEAdZ5
AgUDMAgGBmeBDAECAjAxBgNVHR8EKjAoMCagJKAihiBodHRwOi8vY3JsLnBraS5n
b29nL0dUU0dJQUczLmNybDANBgkqhkiG9w0BAQsFAAOCAQEAwyI5Y0EaL5Nv/8e7
Mv8CXofOsasPILPWYMq7CSzZ03kncNoezJatQzlglCiczSJTi1vSagEwpm6j4BCb
Wl5+7sBDx6phpKmeliUzv9vneOfIUiVfPVkeamYQaMXnOSpJYdt5xKhZk+3/WS8J
K9RufOLN/yYNIBUX1ZH6GReGLkRqNeJRQa0wgyPRXh41/TL22sqpZkUSDkn2vIo2
0EZ7U5GvMHHruqTJ6GFBz7+qc/ZWVM3yZU9z0E/sf7Y7DgbVX/hStmYUXzQUz90Z
6o+qNEdi9MlbxRpQDF+7PR42xDhi5Z2dnFLuoGgdm4KZIVLJlREz3/F09mpuMR2f
L1Zlkg==
-----END CERTIFICATE-----
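
The text between the BEGIN and END lines is plain Base64 of the binary (DER) certificate. The sketch below strips the two marker lines and the line breaks and decodes the rest. It assumes java.util.Base64 (Java 8+). The class name PemBody is hypothetical, and the sample in main uses only the first line of the certificate above, so it yields a truncated certificate that is good only for looking at the leading bytes.

```java
import java.util.Base64;

class PemBody {
    // Remove the "-----BEGIN ...-----" / "-----END ...-----" lines and all
    // line breaks, then decode the remaining Base64 text to raw bytes.
    static byte[] decode(String pem) {
        StringBuilder body = new StringBuilder();
        for (String line : pem.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("-----")) {
                continue;
            }
            body.append(trimmed);
        }
        return Base64.getDecoder().decode(body.toString());
    }

    public static void main(String[] args) {
        // Only the first line of the certificate shown above (truncated on purpose).
        String pem = "-----BEGIN CERTIFICATE-----\n"
                + "MIIEsDCCA5igAwIBAgIQPmb2e1RfA2vD38qb8d47HjANBgkqhkiG9w0BAQsFADBU\n"
                + "-----END CERTIFICATE-----\n";
        byte[] der = decode(pem);
        // A DER certificate starts with 0x30, the tag of an ASN.1 SEQUENCE.
        System.out.printf("%d bytes, first byte 0x%02X%n", der.length, der[0] & 0xFF);
    }
}
```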