If you have ever saved a text file containing non-English characters and wondered how the editor knows how to interpret it correctly, then this is for you.
The Unicode standard defines a set of rules that govern how text is encoded and stored as bytes, so that it can be read back without losing any of the information that was encoded. In this tutorial, we’ll look at UTF-8 in particular.
In Unicode, every character is given a number, known as a code point. UTF-8 stores each code point in one to four bytes, and the code points are grouped into ranges that we will call code blocks for simplicity. The code block a code point falls in determines how many bytes will be used to store it.
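To make the grouping concrete, here is a minimal Python sketch. The helper name `expected_utf8_length` and the sample characters are my own choices for illustration; the range boundaries are the standard UTF-8 ones. It compares the length predicted from the code point with the length Python's built-in encoder actually produces.

```python
def expected_utf8_length(code_point: int) -> int:
    """Predict how many bytes UTF-8 needs for a code point (hypothetical helper)."""
    if code_point <= 0x7F:        # U+0000..U+007F: ASCII
        return 1
    if code_point <= 0x7FF:       # U+0080..U+07FF
        return 2
    if code_point <= 0xFFFF:      # U+0800..U+FFFF
        return 3
    return 4                      # U+10000..U+10FFFF


# Sample characters, one from each range (chosen for illustration).
samples = ["A", "é", "€", "😀"]

for ch in samples:
    code_point = ord(ch)               # the character's Unicode number
    encoded = ch.encode("utf-8")       # the bytes that would be stored on disk
    print(f"U+{code_point:04X}: predicted {expected_utf8_length(code_point)}, "
          f"actual {len(encoded)} byte(s)")
```

For each of the four samples, the predicted and actual byte counts should agree.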
Now, let’s see what this means in practice. If we open a text editor such as Notepad, type in the word “hello”, save it as a text file (“sample.txt”), and check the file size, we will see that it is 5 bytes. That makes sense, as there are only 5 characters and each of them takes a single byte in UTF-8.
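The same experiment can be reproduced without an editor. This is a small sketch that assumes Python 3 and write access to the current directory; the file name “sample.txt” comes from the text above, while the variable names are my own.

```python
from pathlib import Path

# Write the word to disk as UTF-8, much as an editor would when saving.
path = Path("sample.txt")
path.write_text("hello", encoding="utf-8")

# Ask the file system how many bytes the file occupies.
size_in_bytes = path.stat().st_size
print(size_in_bytes)  # 5

# Each byte is simply the ASCII value of the corresponding letter.
print(list(path.read_bytes()))  # [104, 101, 108, 108, 111]
```

Replacing “hello” with a word that contains accented or non-Latin characters should give a file size larger than the number of characters.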