Emojis paved the way for UTF-8 everywhere

When I was growing up using computers I was clueless about character sets. Mostly there were just US characters, and everyone converted the umlauts (Ä and Ö) and the A with the overring (Å) used in Finnish to their A, O and A equivalents. I guess you could call that transliteration, and we all did it because we had to. Technology evolved and we got more and more Äs, Ös and Ås in our digital lives. Everyone I knew was using Windows. Life was good. Then came the web and trouble started brewing.

I started my amateur webmaster career sometime in 1996. I still didn't know (or care) much about character sets, and the most common mistakes I made while learning were with images: I accidentally referred to them from my local C: drive, or they were so large they would take a fortnight to load with my somewhat outdated 9.6k GVC modem.

It wasn't until my first "real job" in early 2000 that I got a glimpse of the horrors of character sets. At work, Art Directors were using colorful Macintosh G3 tower computers with Mac OS 9 that didn't seem to run straight for more than a day. Even worse, sometimes they couldn't even display contemporary words like lärvilautanen or names like Rytsölä right. I snickered and crouched over the keyboard of my rock-solid Windows NT workstation to craft some fine artisan HTML with notepad.exe.

The early 2000s were an intense learning experience for me. I switched over from Microsoft Notepad to Macromedia Dreamweaver to Allaire HomeSite. My career peaked when I worked on the homepage for Bomfunk MCs. I felt invincible. I ran into trouble with characters occasionally, but the "real developers" working with ASP or Java were always willing to help the front end developer greenhorn navigate the seas of broken Verdana on a manager's screen. Every so often I had issues with line breaks too, but the bearded guy always helped me out. I think he used Solaris.

I didn't really start my hate-hate relationship with characters until around 2003. At twenty-three I was now a Senior Developer, and had stretched my wings to the server side as well, with mad LAMP skillz and setting up vhosts on Apache. I was what you would now call a Senior Fullstack Ninja. Or that's how I saw myself. I was the person people came to for advice, knee-deep in JavaScript, having innovated an amazing client-side bug tracker that would change how we work. Little did I know.

At this point I was also working on some projects for a well-known (at the time) Finnish mobile phone manufacturer in foreign languages like Hebrew and Hungarian. This is when it all started falling apart. In the past I had been able to solve issues with weird characters by poking them with a stick or the magical charset=iso-8859-1 meta tag. But things were gradually getting worse. I was increasingly unable to use HomeSite because the characters were garbled. After too many Jallu shots, I opened up to a guru about this and he mumbled something about no support for UTF-8.

Now I was a man on a mission. I would find out all about this UTF-8 and conquer this land of SGML-like markup once again. I armed myself with all the information on the interwebs and even bought a hard copy of the O'Reilly mini reference on UTF-8. Then I kind of got it, but things were complicated in the real world. Some things were UTF-8, some ISO-8859-1 and occasionally there was some random Macintosh Latin from a unicorn who could do design and HTML. I read deeper and started seeing nightmares about monsters like the Byte Order Mark and Endianness. All in all, I spent weeks figuring out the character sets of files and parsing out something not féd.
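For the curious, the kind of guesswork I kept doing by hand back then can be sketched in a few lines of Python. This is only a minimal illustration, not what I actually ran at the time: the function name decode_legacy and the choice of ISO-8859-1 as the default fallback are my own assumptions. It strips a UTF-8 Byte Order Mark if there is one, tries strict UTF-8, and only then falls back to a legacy single-byte encoding.

```python
import codecs


def decode_legacy(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode bytes of unknown origin (hypothetical helper, for illustration).

    Order of attempts: UTF-8 with BOM, strict UTF-8, then the given
    single-byte fallback encoding.
    """
    # A UTF-8 Byte Order Mark is the three bytes EF BB BF at the start.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        # Strict decoding raises on anything that is not valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is the legacy encoding the caller names.
        return data.decode(fallback)


if __name__ == "__main__":
    print(decode_legacy("Rytsölä".encode("utf-8")))
    print(decode_legacy("Rytsölä".encode("iso-8859-1")))
    print(decode_legacy("lärvilautanen".encode("mac_roman"), fallback="mac_roman"))
```

The catch is that a single-byte fallback never fails, so it will happily turn Macintosh Latin into the wrong letters if you guess ISO-8859-1. That is pretty much what my weeks of figuring out went into.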

I was still quite confused about the prospect of UTF-16 and what it would mean for me. Thankfully that turned out to be a whole lot of nothing, and with a steady diet of Linux, OS X and Windows on the desktop it seemed the worst was over. There were some cases where I needed to figure out strange issues with broken characters (usually because of using non-multibyte-safe functions on the server), but this was much more my own fault than the result of random wonky encodings from input data.
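To show what I mean by non-multibyte-safe, here is a small Python sketch. My actual bugs were in server-side code on the LAMP stack, so treat this as an analogy with made-up function names rather than a reproduction. Cutting a UTF-8 string at a byte offset can slice a character like ö in half, while cutting by characters, or dropping the incomplete tail, cannot.

```python
def truncate_bytes_naive(text: str, limit: int) -> str:
    """Cut at a byte offset, the way a byte-oriented string function would."""
    # errors="replace" makes the damage visible as U+FFFD instead of raising.
    return text.encode("utf-8")[:limit].decode("utf-8", errors="replace")


def truncate_bytes_safe(text: str, limit: int) -> str:
    """Cut at a byte offset, then drop any incomplete trailing character."""
    # errors="ignore" discards the dangling lead byte left by the cut.
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    name = "Rytsölä"  # 7 characters, 9 bytes in UTF-8
    print(len(name), len(name.encode("utf-8")))
    print(truncate_bytes_naive(name, 5))  # ends in a replacement character
    print(truncate_bytes_safe(name, 5))   # ends cleanly after "Ryts"
    print(name[:5])                       # character-based slicing keeps the ö
```

The first four letters are one byte each, so a five-byte cut lands in the middle of the two-byte ö. That is exactly the kind of thing that used to end up as a question mark on a manager's screen.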

Still, there were some systems not using UTF-8 by default. There was little incentive for that as it wasn't like there were new letters being added to the English alphabet or anything. This changed gradually with the mainstream adoption of smartphones and emojis. Those cute little 🖼️s are essentially new letters to the alphabet. To stay relevant in the age of social media you had to support emoji or you were 💀 in the 💧. Fast forward to today and I can't remember the last time I wrestled with broken characters.

There are still enterprise systems that can suffer a death by a thousand candy-bar emojis and UTF-8 quirks in MySQL, but overall I am now in a much happier place. I still don't really know how all this character set black magic works deep down, but I don't need to know. For the most part it's now a non-issue in my life. And it's amusing to see Apple using new emojis as a carrot to get people to install the latest security patches. This was a fun nostalgia trip, but I certainly wouldn't want to go back in time to work on multilingual web development. Or implement designs with rounded corners.
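As far as I understand it, the best-known MySQL quirk is that its legacy utf8 character set (utf8mb3) stores at most three bytes per character, while most emojis need four, which is why utf8mb4 exists. A rough Python sketch of a check for that, with a helper name I made up for this post:

```python
def needs_four_byte_utf8(text: str) -> bool:
    """Return True if any character in text takes four bytes in UTF-8.

    Hypothetical helper: code points above U+FFFF are the ones that do not
    fit in a three-byte-per-character column.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ["Å", "Rytsölä", "🍑", "You saved my 🍑!"]:
        longest = max(len(ch.encode("utf-8")) for ch in sample)
        print(sample, needs_four_byte_utf8(sample), longest)
```

The Finnish letters top out at two bytes each, so they never tripped this particular wire. It took emojis to do that.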

I'd like to extend a big thank you to everyone for using emoji! You saved my 🍑!

Photo by Christiann Koepke on Unsplash
