[TIF] UNDERSTANDING CHARSET AND ENCODING

What are charset and encoding?

We read plain text as a set of characters. A computer, however, stores everything in memory as bytes, that is, sequences of bits.

Charsets and encodings were invented to bridge this gap.

A charset (character set) is a set of symbols that can be encoded, such as the English alphabet, digits or phonetic symbols. People defined a mapping table (aka “code page” or “table of code points”) that assigns each character a number, so that characters can be converted to bytes and back.

[Image: USASCII_code_chart.png, the US-ASCII code chart]

“Encoding” is the act of converting characters to bytes using such a table; “decoding” is the reverse conversion.
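As a rough sketch of what this conversion looks like in practice, here is a small Python example. The choice of Python and the variable names are assumptions made for illustration; they are not part of the original article. `ord` and `chr` look a character up in the table, while `str.encode` and `bytes.decode` convert whole strings to bytes and back.

```python
# Table lookup: character -> number, and number -> character.
code_point = ord("H")        # position of "H" in the table
character = chr(code_point)  # back from the number to the character

# Encoding turns text into bytes; decoding turns bytes back into text.
text = "Hi"
encoded = text.encode("ascii")     # a bytes object
decoded = encoded.decode("ascii")  # a str again

print(code_point, character)
print(list(encoded))  # the numeric value of each byte
print(decoded)
```

The round trip only works because the same table ("ascii") is used in both directions.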

First character encoding: ASCII

In the beginning, people only cared about the English alphabet, so they invented ASCII. ASCII uses 7 bits to define 128 characters (0-127). It defines not only digits, letters and punctuation symbols but also control characters. For example, the character “A” is assigned the code 65, represented in binary as “1000001”, while code 7 (BEL) makes the computer beep.
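The “A” and code 7 examples can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up, and whether printing the bell character actually produces a sound depends on the terminal.

```python
letter = "A"
code = ord(letter)          # ASCII code of "A"
bits = format(code, "07b")  # the same code written as 7 binary digits

bell = chr(7)  # the BEL control character, also written "\a" in Python

# Every character of a pure-ASCII string has a code below 128.
is_ascii = all(ord(ch) < 128 for ch in "Hello, world!")

print(code, bits)
print(repr(bell))
print(is_ascii)
```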

33 control characters (codes 0-31 and 127) + 1 space + 10 digits + 52 letters (upper and lower case) + 32 punctuation marks and other symbols = 128
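One way to double-check a tally like this is to classify all 128 codes programmatically. The sketch below uses Python's standard `string` module; the grouping into control, space, digits, letters and punctuation is an assumption chosen for illustration, not an official ASCII classification.

```python
import string

codes = [chr(i) for i in range(128)]  # every 7-bit ASCII code

counts = {
    # Codes 0-31 plus DEL (127) are non-printing control characters.
    "control": sum(1 for c in codes if ord(c) < 32 or ord(c) == 127),
    "space": sum(1 for c in codes if c == " "),
    "digits": sum(1 for c in codes if c in string.digits),
    "letters": sum(1 for c in codes if c in string.ascii_letters),
    "punctuation": sum(1 for c in codes if c in string.punctuation),
}
total = sum(counts.values())

for name, count in counts.items():
    print(name, count)
print("total", total)
```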

So far so good.

Evolution of character encoding

But ASCII could not meet the needs of non-English text, such as accented characters. Unicode (with its 8-, 16- and 32-bit encoding forms UTF-8, UTF-16 and UTF-32) and ISO/IEC 10646 came later and define a far larger repertoire of characters, so they went on to supplant ASCII. The important thing is that they are backward compatible with 7-bit ASCII: they assign the same codes to the first 128 characters, and UTF-8 even stores those characters as the same single bytes.
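The backward compatibility can be observed directly. The following Python sketch (sample strings and variable names are assumptions for illustration) compares ASCII and UTF-8 output for plain English text and shows what happens with an accented character.

```python
ascii_text = "Hello"
accented = "caf\u00e9"  # "café", written with an escape for the accented e

# Plain English text produces identical bytes in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# The Unicode code points of these characters equal their ASCII codes.
same_codes = [ord(c) for c in ascii_text] == list(ascii_text.encode("ascii"))

# ASCII has no code for the accented letter, so encoding fails.
try:
    accented.encode("ascii")
    ascii_handles_accent = True
except UnicodeEncodeError:
    ascii_handles_accent = False

# UTF-8 can encode it, using two bytes for the accented letter.
utf8_length = len(accented.encode("utf-8"))

print(same_bytes, same_codes, ascii_handles_accent, utf8_length)
```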

Any character can be encoded as many different bit sequences, and any particular bit sequence can represent many different characters. It all depends on which encoding is used to write and read them.

Character   Encoding     Bits
ö           UTF-8        11000011 10110110
ö           ISO 8859-1   11110110
ö           Mac Roman    10011010
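The rows of this table can be reproduced with Python's built-in codecs. This is a sketch, not the method used to build the original table; `to_bits` is a hypothetical helper introduced here, and `"iso-8859-1"` and `"mac_roman"` are the codec names Python uses for these encodings.

```python
char = "\u00f6"  # the character "ö"

def to_bits(data: bytes) -> str:
    """Format each byte as 8 binary digits, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in data)

# One character, three encodings, three different bit sequences.
table = {enc: to_bits(char.encode(enc))
         for enc in ("utf-8", "iso-8859-1", "mac_roman")}

# The reverse direction: one bit sequence, two different characters.
raw = bytes([0b11110110])
as_latin1 = raw.decode("iso-8859-1")
as_mac = raw.decode("mac_roman")

for enc, bits in table.items():
    print(enc, bits)
print(repr(as_latin1), repr(as_mac))
```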

A two-way street: encoding and decoding

Decoding does not change the stored bit sequence at all. What changes is the text you read. You can see this in the example below: no matter how many times I encode and decode, or whether a character is stored in 8 bits or 16 bits, in the end I can read the right text by using the right decoding.

[Image: 2-way-street]
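A small Python sketch of this round trip, with assumed sample data and variable names: the same stored bytes are read with a wrong and a right encoding, the wrong reading is undone, and the character is also stored in 16 bits.

```python
original = "\u00f6"  # the character "ö"

stored = original.encode("utf-8")    # the bytes that end up on disk
wrong = stored.decode("iso-8859-1")  # same bytes, wrong table: garbled text
right = stored.decode("utf-8")       # same bytes, right table

# Undo the wrong reading: re-encode with the wrong table to get the
# untouched bytes back, then decode them with the right one.
recovered_bytes = wrong.encode("iso-8859-1")
recovered = recovered_bytes.decode("utf-8")

# Storing the same character in 16 bits works just as well.
utf16 = original.encode("utf-16-le")
back = utf16.decode("utf-16-le")

print(repr(wrong), repr(right), repr(recovered), repr(back))
```

The garbled two-character reading appears only on screen; the bytes themselves are identical before and after.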

If your document is unreadable because it was opened with a different encoding, the only thing you need to do to rescue it is to KNOW which encoding it was written in.
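To illustrate why knowing the encoding matters, the sketch below tries several candidate encodings on the same bytes. `decode_with_candidates` is a hypothetical helper written for this example, and the sample word and candidate list are assumptions.

```python
def decode_with_candidates(data: bytes, candidates):
    """Return {encoding: text} for each candidate that decodes without error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results

# Sample bytes: the German word "Größe" saved as UTF-8.
data = "Gr\u00f6\u00dfe".encode("utf-8")

results = decode_with_candidates(data, ["ascii", "utf-8", "iso-8859-1"])
for encoding, text in results.items():
    print(encoding, repr(text))
```

ASCII rejects the bytes outright, but ISO 8859-1 accepts any byte sequence and simply produces the wrong text. A decode that succeeds is therefore no proof that the encoding is right, which is why guessing is a poor substitute for knowing.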

Reference articles:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses

http://kunststube.net/encoding/
