What are charsets and encodings?
We read plain text as a set of characters, but a computer stores everything in memory as bytes.
Charsets and encodings were invented to solve this problem.
A charset is a set of symbols that can be encoded, such as the English alphabet, digits, or phonetic symbols. People defined mapping tables (also known as “code pages” or “tables of code points”) to convert characters to bytes and vice versa.
“Encoding” is the act of performing that conversion.
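As a minimal sketch of that mapping (Python is used here only as a convenient illustration), the built-in ord() and chr() functions expose the character-to-number table, and str.encode() produces the actual bytes:

```python
# A character maps to a code point (a number), and an encoding
# turns that number into concrete bytes.
print(ord("a"))              # 97  -> the code point assigned to "a"
print(chr(97))               # 'a' -> the character assigned to code point 97
print("a".encode("ascii"))   # b'a' -> the single byte 0x61 that ASCII uses for "a"
```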
The first character encoding: ASCII
In the beginning, people cared only about the English alphabet, so they invented ASCII. ASCII uses 7 bits to define 128 characters (0–127). It defines not only digits, letters, and punctuation symbols but also control characters. For example, the character “A” is assigned code 65 and represented as “1000001”, while code 7 (BEL) makes the computer beep.
32 control characters + 1 space + 32 punctuation and symbols + 10 digits + 52 letters (upper and lower case) + 1 DEL = 128
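You can verify these numbers yourself; here is a small Python check (the character choices are just examples):

```python
# "A" has ASCII code 65, which is 1000001 in binary (7 bits are enough).
print(ord("A"))                  # 65
print(format(ord("A"), "b"))     # 1000001

# Code 7 is the BEL control character; printing it may make the terminal beep.
print(chr(7))                    # '\x07' (BEL)
```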
So far so good.
Evolution of character encoding
But ASCII could not meet the needs of non-English text, such as accented characters. Unicode (with 8-, 16-, and 32-bit encoding forms) and ISO/IEC 10646 came after it; they define a much wider range of characters, so they quickly supplanted ASCII. The important thing is that they are backward compatible with 7-bit ASCII: they use the same codes for the first 128 ASCII symbols.
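This backward compatibility is easy to see in practice: for plain English text, ASCII and UTF-8 produce exactly the same bytes (a Python sketch, with an arbitrary sample string):

```python
text = "Hello"
print(text.encode("ascii"))   # b'Hello'
print(text.encode("utf-8"))   # b'Hello' - identical, because the first 128 code points match ASCII
print(text.encode("ascii") == text.encode("utf-8"))  # True
```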
A given character can be encoded into many different byte sequences, and a given byte sequence can represent many different characters; it all depends on which encoding is used to read and write the data.
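For example (a Python sketch, with the character “é” chosen arbitrarily), one character turns into different bytes under different encodings, and one byte decodes to different characters under different charsets:

```python
# One character, several byte sequences, depending on the encoding.
print("é".encode("utf-8"))    # b'\xc3\xa9'           (2 bytes)
print("é".encode("latin-1"))  # b'\xe9'               (1 byte)
print("é".encode("utf-16"))   # b'\xff\xfe\xe9\x00'   (byte order mark + 2 bytes)

# One byte, several characters, depending on the decoding.
print(b"\xe9".decode("latin-1"))  # 'é'
print(b"\xe9".decode("cp1251"))   # 'й' - same byte, different charset, different character
```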
Two-way street: encoding and decoding
The byte sequence itself does not change while it is stored; what changes is the text you read from it, depending on how it is decoded. You can see this in the example below: no matter how many times the text is encoded and decoded, or whether a character is stored in 8 bits or 16 bits, decoding with the right encoding always gives back the right text.
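The original post’s example is not reproduced here, so the following Python sketch reconstructs the idea: UTF-8 stores characters in 8-bit code units and UTF-16 in 16-bit code units, yet decoding each with its matching encoding recovers the same text:

```python
text = "Xin chào"   # arbitrary sample text containing a non-ASCII character

# The two encodings produce different byte sequences...
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")
print(utf8_bytes)    # b'Xin ch\xc3\xa0o'
print(utf16_bytes)   # b'\xff\xfeX\x00i\x00n\x00 \x00c\x00h\x00\xe0\x00o\x00'

# ...but decoding each one with the matching encoding restores the original text.
print(utf8_bytes.decode("utf-8") == text)    # True
print(utf16_bytes.decode("utf-16") == text)  # True
```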
If your document is unreadable because it was decoded with a different encoding, the only thing you have to do to save it is to KNOW the right encoding.
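Here is a small Python sketch of that situation (the sample text is made up): decoding with the wrong encoding gives garbled text, and knowing the right one fixes it.

```python
data = "café".encode("utf-8")   # bytes produced by some other program

# Decoding with the wrong encoding gives mojibake...
print(data.decode("latin-1"))   # 'cafÃ©'

# ...but once you know the right encoding, the text comes back intact.
print(data.decode("utf-8"))     # 'café'
```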