You Have Unicode, Why Not Use It?

The debate on whether to include Unicode glyphs in programming languages is far from new, but no major programming language culture has yet made heavy use of Unicode (though Perl 6 is almost there). One could suppose that the main barriers to using non-ASCII characters have been the lack of a standard way to input them, and the fact that some of them can look very similar to one another. But the core of the matter is that a well-designed programming language does not really benefit enough from using Unicode characters as opposed to equivalent ASCII sequences. However, there are programming systems where using Unicode can help a lot: template systems. For illustration, here is a chunk from an Embedded Ruby (ERB) template like you might find in any Rails app.

<ul class="links <%= @category %>">
% for l in @links
  <li><a href="<%= l.href %>"><%= l.title %></a></li>
% end
</ul>

The bits surrounded by <% and %> are Ruby code, instead of HTML. These are the standard delimiters in most templating systems targeting HTML (though ERB also recognizes lines beginning with % and requires an = to interpolate instead of executing). The selection of the delimiters with angle brackets is no doubt intended to make the code bits look similar to HTML tags, and the use of angle brackets has the advantage that text somewhere else in the document will not accidentally contain the delimiters, since literal angle brackets have to be escaped in HTML anyway. However, this similarity can easily lead to visual confusion. Note that two of the Ruby pseudo-tags are inside an attribute of an HTML tag—is that a tag within a tag? Weird. Nobody said templates had to be pretty, but surely we can do better.

<ul class="links ◀= @category ▶">
◆ for l in @links
  <li><a href="◀= l.href ▶">◀= l.title ▶</a></li>
◆ end
</ul>

How about this? The Ruby bits pop right out of the HTML bits, because the delimiter characters are solid, unlike the surrounding ASCII. These are the delimiters I picked for the templates that form this website. I type them in vim with <ctrl-K>PL for ◀, <ctrl-K>PR for ▶, and <ctrl-K>Db for ◆. There is still a possibility of running into a literal one of these characters, but it is far less likely than for any ASCII character, and it will be easy to catch (unless you're autogenerating template files or something silly like that). These characters can be escaped with the HTML decimal entities &#9664;, &#9654;, and &#9670;.
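For anyone who wants to try this, here is a minimal sketch of one way to wire such delimiters up. It is not necessarily how this site does it: it simply rewrites the Unicode delimiters into stock ERB syntax and hands the result to Ruby's standard ERB class with the "%" trim mode. The helper names (unicode_to_erb, render_unicode_template, escape_delimiters) are made up for illustration, and the template uses the local variables category and links in place of the instance variables above.

```ruby
require "erb"

# Hypothetical helper: rewrite the Unicode delimiters into stock ERB syntax.
# A leading ◆ becomes a "%" in the first column, which ERB treats as a
# line of Ruby code when the "%" trim mode is enabled.
def unicode_to_erb(source)
  source
    .gsub(/^[ \t]*◆/, "%")
    .gsub("◀", "<%")
    .gsub("▶", "%>")
end

# Hypothetical helper: render a Unicode-delimited template, exposing the
# entries of the vars hash as local variables inside the template.
def render_unicode_template(source, vars)
  ERB.new(unicode_to_erb(source), trim_mode: "%").result_with_hash(vars)
end

# Hypothetical helper: replace literal delimiter characters in ordinary
# text with HTML decimal entities so they are not mistaken for delimiters.
def escape_delimiters(text)
  text.gsub(/[◀▶◆]/) { |ch| format("&#%d;", ch.ord) }
end

link = Struct.new(:href, :title)

template = <<~TEMPLATE
  <ul class="links ◀= category ▶">
  ◆ for l in links
    <li><a href="◀= l.href ▶">◀= l.title ▶</a></li>
  ◆ end
  </ul>
TEMPLATE

puts render_unicode_template(
  template,
  category: "blogroll",
  links: [link.new("/about", "About"), link.new("/archive", "Archive")]
)

# Text that should contain a literal delimiter gets escaped before it
# goes into a template.
puts escape_delimiters("Press ▶ to play")
```

The translation is plain text substitution, so it inherits the caveat above: a literal ◀, ▶, or ◆ in the template has to be written as an entity, or it will be read as a delimiter.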

Unfortunately, there is still no standardized way of inputting Unicode characters across operating systems and editors. This ought to be standardized as soon as possible. In the meantime, if you have a way to input Unicode and you don't expect to share coding responsibilities with a more Unicodally-challenged individual, you may benefit from using Unicode characters in situations like these.