Unicode and Character Sets


This post is based on an article by Joel Spolsky, with supporting material from Wikipedia. Joel Spolsky is the co-founder of Trello and Fog Creek Software, and CEO of Stack Overflow. The article is called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The article approaches the topic chronologically. First of all he says: “EBCDIC is not relevant to your life. We don’t have to go that far back in time.”

ASCII

He says: “The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare.” The codes below 32 were not printable and were called control characters. This worked fine for English.
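
To make this concrete, here is a quick Python sketch (my own illustration, not from Joel's article) showing those ASCII values:

print(ord(' '))             # 32 -- space
print(ord('A'))             # 65 -- the letter A
print(chr(90))              # 'Z' -- going from code back to character
print('A'.encode('ascii'))  # b'A' -- one character, one byte, with the top bit unused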

What about the 8th bit? Since bytes have room for up to eight bits, lots of people got to thinking, “we can use the codes 128-255 for our own purposes.” The values 128 to 255 were not standardized yet. The IBM-PC had something that came to be known as the OEM character set. Wikipedia says: “Code page 437 is the character set of the original IBM PC (personal computer), or DOS. It is also known as CP437, OEM-US, OEM 437, PC-8, or DOS Latin US. The set includes ASCII codes 32–126, extended codes for accented letters (diacritics), some Greek letters, icons, and line-drawing symbols. It is sometimes referred to as the “OEM font” or “high ASCII”, or as “extended ASCII” (one of many mutually incompatible ASCII extensions).”
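
To see what those upper codes looked like in practice, here is a small Python sketch (Python happens to ship a cp437 codec; the byte values are just examples I picked) that decodes a few bytes above 127 the way the original IBM PC would have displayed them:

oem = bytes([0x82, 0xE0, 0xC9, 0xCD, 0xBB])
print(oem.decode('cp437'))                   # éα╔═╗ -- an accented letter, a Greek letter, line-drawing symbols
print(bytes(range(65, 70)).decode('cp437'))  # ABCDE -- below 128 it is still plain ASCII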

Alt Codes

As an aside, we’ll mention alt codes. Wikipedia says: “On IBM compatible personal computers, many characters not directly associated with a key can be entered using the Alt Numpad input method or Alt code: pressing and holding the Alt key while typing the number identifying the character with the keyboard’s numeric keypad.” Alt+130 will give you é. However, as Joel says: “as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג)”. He adds: “when Americans would send their résumés to Israel they would arrive as rגsumגs”. At this point in time, there was no standard.
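
Joel’s résumé example is easy to reproduce. In this Python sketch (again, my own illustration), the same byte value 130 decodes differently under the US OEM code page and the Hebrew one:

b = bytes([130])                                 # the byte behind Alt+130
print(b.decode('cp437'))                         # é -- US OEM character set (code page 437)
print(b.decode('cp862'))                         # ג -- Hebrew code page used on PCs sold in Israel
print('résumé'.encode('cp437').decode('cp862'))  # rגsumג -- Joel's mangled résumé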

ANSI Standard and Code Pages

As Joel says: “Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.” He also says: “So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided”.
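
The “same below 128, different from 128 up” point can be checked directly with the Israeli and Greek code pages Joel mentions (a Python sketch; the byte values are arbitrary examples):

print(bytes([0x41]).decode('cp862'), bytes([0x41]).decode('cp737'))  # A A -- identical below 128
print(bytes([0x80]).decode('cp862'))                                 # א -- Hebrew aleph on code page 862
print(bytes([0x80]).decode('cp737'))                                 # Α -- Greek capital alpha on code page 737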

Unicode

Unicode is a single character set that aims to include every reasonable writing system on the planet. Joel says: “Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode.”

Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense. Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory. In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole different matter.

Every platonic letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639. The English letter A would be U+0041. The U+ means “Unicode” and the numbers are hexadecimal. There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65536 so not every Unicode letter can really be squeezed into two bytes.
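
In Python, ord() gives you the code point directly, which makes the U+ notation easy to check (a sketch; the particular characters are just examples):

print(hex(ord('A')))   # 0x41    -- U+0041
print(hex(ord('ع')))   # 0x639   -- U+0639, the Arabic letter Ain used as the example above
print(hex(ord('𝄞')))   # 0x1d11e -- U+1D11E, a code point well beyond 65536
print(chr(0x0041))     # 'A' -- and back again, from code point to character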

Endianness

So the letter A becomes 00 41. Or does it become 41 00? It depends. What is endianness? Wikipedia says the following about Endianness: “Endianness refers to the sequential order in which bytes are arranged into larger numerical values when stored in memory or when transmitted over digital links. Endianness is of interest in computer science because two conflicting and incompatible formats are in common use: words may be represented in big-endian or little-endian format, depending on whether bits or bytes or other components are ordered from the big end (most significant bit) or the little end (least significant bit).”
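
A short Python sketch (my own, for illustration) shows both byte orders for the 16-bit value 0x0041:

value = 0x0041                            # the code point for 'A'
print(value.to_bytes(2, 'big').hex())     # 0041 -- big-endian: most significant byte first
print(value.to_bytes(2, 'little').hex())  # 4100 -- little-endian: least significant byte first
print('A'.encode('utf-16-be').hex())      # 0041 -- the same idea expressed as an explicit encoding
print('A'.encode('utf-16-le').hex())      # 4100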

Encodings

Back to Unicode. We have these code points, which are really just numbers that represent letters. We have not yet said anything about how to store these numbers in a computer as bits and bytes. The earliest idea was to simply store each code point in two bytes. But what about endianness? Joel says: “So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.” He adds: “For a while it seemed like that might be good enough.”
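
Python’s generic utf-16 codec writes exactly that byte order mark, and its decoder uses it to pick the right byte order (a sketch; the hex shown below is what you would see on a typical little-endian machine):

encoded = 'Hello'.encode('utf-16')  # the generic codec prepends a BOM
print(encoded[:2].hex())            # fffe -- the swapped FF FE form on a little-endian machine
print(encoded.hex())                # fffe480065006c006c006f00
print(encoded.decode('utf-16'))     # Hello -- the decoder reads the BOM and swaps bytes if needed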

UTF-8

UTF-8 was another system for storing your string of Unicode code points, those U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or more bytes (the original design allowed up to 6; the current standard limits a sequence to 4). So it is variable-length. How do we determine how many bytes are used for a character? Have a look at this video on YouTube called Characters, Symbols and the Unicode Miracle.
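
The variable length is easy to observe in Python (an illustrative sketch; the characters are arbitrary examples):

for ch in ('A', 'é', '€', '𝄞'):
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded.hex())
# A 1 41        -- ASCII range: one byte
# é 2 c3a9      -- U+00E9: two bytes
# € 3 e282ac    -- U+20AC: three bytes
# 𝄞 4 f09d849e  -- U+1D11E: four bytes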

Joel says: “This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice.”
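
Again, this is easy to verify (a Python sketch):

print('Hello'.encode('utf-8').hex())  # 48656c6c6f -- identical to the ASCII bytes Joel lists
print('Hello'.encode('ascii').hex())  # 48656c6c6f -- same bytes
print('Héllo'.encode('utf-8').hex())  # 48c3a96c6c6f -- the accented letter now takes two bytes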

Regarding encodings, it does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
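
Decoding bytes with the wrong assumed encoding is exactly how garbled text happens; a small Python sketch makes the point:

data = 'é'.encode('utf-8')     # b'\xc3\xa9'
print(data.decode('utf-8'))    # é  -- correct, because we know the encoding
print(data.decode('latin-1'))  # Ã© -- same bytes, wrong assumption, garbage on screen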

A Couple of Other Points

Note: Versions prior to SQL Server 2016 do not support code page 65001 (UTF-8 encoding).

UTF-8 is the default encoding for XML and, since around 2010, has been the dominant character encoding on the Web.

In an XML prolog, the encoding is typically specified as an attribute:

<?xml version="1.0" encoding="UTF-8" ?>

Note that character set specifications are case insensitive, so utf-8 is just as valid as UTF-8.

SQL Server Data Types

In SQL Server, you can configure a character column with a Unicode data type (nchar, nvarchar, or ntext) or non-Unicode data type (char, varchar, or text). For Unicode types, the character bit patterns conform to international standards that define a double-byte encoding scheme for mapping most of the world’s written languages, ensuring that the same bit pattern is always associated with the same character, regardless of the underlying environment.
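
As a rough illustration of the storage difference (a Python sketch, not SQL Server itself; cp1252 and UTF-16LE stand in here for a non-Unicode and a Unicode column):

text = 'Héllo'
print(len(text.encode('cp1252')))     # 5  -- one byte per character under a single code page (varchar-like)
print(len(text.encode('utf-16-le')))  # 10 -- two bytes per character in the double-byte scheme (nvarchar-like)
print('ג'.encode('utf-16-le').hex())  # d205 -- the same bit pattern for this character on any system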