Character encoding

If you could see a conversation between computers, it would probably look something like this: "010110111011101011010010110...". This language, called binary, is the one computers use to communicate with each other and to transmit data between their components. But why would anyone design a computer around a binary system instead of a normal alphabet? The reason is that the binary system maps perfectly onto the physical structure of circuits, where only two different signals are available: an electrical pulse (1) or no signal at all (0).

These codes, however, are very hard for people to read. Character encodings exist precisely to convert the binary language of computers into something more comprehensible for humans.

What an n-bit character encoding does is split these bits (zeros and ones) into groups of n bits each and assign a symbol to each resulting sequence. Put simply, a character encoding (or character set) can be thought of as a translation table, where each group of bits corresponds to a single character. For example, an 8-bit character encoding could represent the sequence "10010101" as the letter "a", the sequence "01101100" as the symbol "&", and so on.
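As an illustration, here is a minimal Python sketch of such a translation table. It uses the two 8-bit sequences from the example above; the mapping itself is made up purely for demonstration and is not a real encoding.

# A toy "translation table" for a hypothetical 8-bit encoding,
# built from the two example sequences mentioned above.
toy_encoding = {
    "10010101": "a",
    "01101100": "&",
}

def decode(bits, table, width=8):
    # Split the bit stream into groups of `width` bits and look each one up.
    groups = [bits[i:i + width] for i in range(0, len(bits), width)]
    return "".join(table[group] for group in groups)

print(decode("1001010101101100", toy_encoding))  # prints "a&"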

To shed some light on the concepts covered in the previous paragraph, we'll take one particular character encoding that happens to be one of the most famous in western computer systems: ASCII. This encoding is very primitive and, therefore, simple. The following table shows the ASCII values for all uppercase letters (from A to Z):

Symbol  Binary code      Symbol  Binary code
A       1000001          N       1001110
B       1000010          O       1001111
C       1000011          P       1010000
D       1000100          Q       1010001
E       1000101          R       1010010
F       1000110          S       1010011
G       1000111          T       1010100
H       1001000          U       1010101
I       1001001          V       1010110
J       1001010          W       1010111
K       1001011          X       1011000
L       1001100          Y       1011001
M       1001101          Z       1011010
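If you want to check these values yourself, a short Python snippet can reproduce the table, since the ASCII code of a character is simply its numeric value written out in 7 binary digits:

# Reproduce the table above: the 7-bit ASCII code of each uppercase letter.
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    print(letter, format(ord(letter), "07b"))
# A 1000001
# B 1000010
# ...
# Z 1011010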

If you take a close look, you'll see that no code is repeated. This is done on purpose to avoid ambiguities and is common to all character encodings.

This way, if we intercept a line of binary data that reads "1001000101010010011011001100" and we know it was produced with the ASCII character encoding, we can translate it into the western alphabet. To do so, we first separate the line into groups of 7 digits each (because ASCII is a 7-bit character encoding), which results in "1001000 1010100 1001101 1001100". Each of these chunks is now a character that we can decode by looking up its correspondence in the ASCII table. If you match them against the table above, you'll see the word is "HTML".
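The same decoding steps can be written as a small Python sketch: split the bit stream into 7-bit chunks, turn each chunk into its numeric value, and map that value back to a character.

# Decode the intercepted bit stream from the example above as 7-bit ASCII.
bits = "1001000101010010011011001100"

# Split into 7-bit chunks, convert each to its numeric value, then to a character.
chunks = [bits[i:i + 7] for i in range(0, len(bits), 7)]
word = "".join(chr(int(chunk, 2)) for chunk in chunks)

print(chunks)  # ['1001000', '1010100', '1001101', '1001100']
print(word)    # HTML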

Character encodings are widely used (even when we don't realize it) in operating systems, text documents, HTML documents, e-mail messages, etc. Every piece of text must have a character encoding because, in the end, all digital information is nothing but a sequence of bits.

This is why character encodings matter to HTML. Since an HTML document is a piece of text that must be read and interpreted (by browsers and by people), it must have a character encoding.

One of the ways to specify the character encoding used by a document is through the meta element and its charset attribute, which has to be declared in the head section of the document. In the following example, we declare ASCII (under its preferred name, US-ASCII) as the character encoding of the page.

<meta charset="US-ASCII">

Other very popular and useful character encodings are ISO-8859-1 (usually called "Latin 1") and UTF-8.

The UTF-8 unification

Character encodings were born with computers as a mechanism to translate their binary code into human readable text. Very soon, computer vendors began making their own character sets according to the needs of their markets (language and use), which resulted in a massive growth in the number of available character encodings.

The Unicode industry standard is intended to replace and unify the existing character encodings, reducing their number and improving effectiveness (while also solving problems found in other character encodings). This standard encodes characters using different schemes called "Unicode Transformation Formats" (UTF).

UTF-8 is an 8-bit variable length character encoding for Unicode that's becoming very popular on the Internet due to its ability to represent every Unicode character while remaining ASCII compatible. UTF-8 uses from 1 to 4 bytes (8 to 32 bits) to encode a single character, depending on the Unicode symbol.
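A quick Python experiment shows this variable length in action: an ASCII letter keeps its single ASCII byte, while characters further away in the Unicode range need two, three or four bytes.

# UTF-8 spends more bytes on characters further from the ASCII range.
for char in ["A", "é", "€", "𝄞"]:
    encoded = char.encode("utf-8")
    print(char, len(encoded), encoded.hex(" "))
# A 1 41            (same single byte as ASCII)
# é 2 c3 a9
# € 3 e2 82 ac
# 𝄞 4 f0 9d 84 9e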

An HTML document using the UTF-8 character encoding should contain the following declaration in its header:

<meta charset="UTF-8">

Serving documents with the UTF-8 encoding allows authors to insert any character without having to resort to character entity references (except for the markup symbols reserved by HTML). But be aware that UTF-8 documents must be created and edited with a text editor that supports this encoding and is properly configured to use it for the document being edited. Otherwise, characters will be misinterpreted.
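The following Python sketch shows what such a misinterpretation looks like: the same UTF-8 bytes read back with the wrong encoding (ISO-8859-1) turn "é" into two garbled characters.

# What "misinterpretation" looks like: text saved as UTF-8 but read as Latin 1.
text = "café"
utf8_bytes = text.encode("utf-8")

print(utf8_bytes.decode("utf-8"))       # café   (correct)
print(utf8_bytes.decode("iso-8859-1"))  # cafÃ©  (the two UTF-8 bytes of "é" read as two Latin 1 characters)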