Can’t seem to remove the formatting from a string of text?

I had a fella email me a line of text almost just like this:

š‚ššš„š„šž šš„ššš§šœšØš¬, š‚šØš¬š­šš š‘š¢šœšš

He said he could not remove that formatting no matter what he did. It looks kinda bold, doesn’t it? And set into a serif font. You’d think you could select it in the text editor you’re in and remove that formatting. He said he tried copy/pasting it into places where no text formatting is even allowed, like in VS Code or the URL bar of a browser. Voodoo, he said.

Here’s the thing: that text isn’t formatted.

That first “C” you see above isn’t a regular uppercase character C, our typical friend U+0043 : LATIN CAPITAL LETTER C, it’s “𝐂”, that is, U+1D402 : MATHEMATICAL BOLD CAPITAL C. It’s literally a different character in Unicode. There are… a lot of Unicode characters:

As of Unicode version 16.0, there are 155,063 characters with code points, covering 168 modern and historical scripts, as well as multiple symbol sets.

List of Unicode characters (Wikipedia)
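
If you want to check this for yourself, you can ask for each character's code point and official name. Here is a minimal Python sketch using only the standard library's `unicodedata` module. The `describe` helper and the sample strings are made up for illustration:

```python
import unicodedata

def describe(text):
    """Hypothetical helper: (code point, Unicode name) for each non-space character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isspace()
    ]

# Escapes are used so the sample survives any editor or font:
# U+1D402 is the bold-looking C from the email, "C" is the ordinary one.
styled_c = "\U0001D402"
plain_c = "C"

for code_point, name in describe(styled_c + plain_c):
    print(code_point, name)
# U+1D402 MATHEMATICAL BOLD CAPITAL C
# U+0043 LATIN CAPITAL LETTER C
```

Two characters that look like the same letter, two entirely different entries in Unicode.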

It could be written like 𝕮𝖆𝖑𝖑𝖊 𝕭𝖑𝖆𝖓𝖈𝖔𝖘, 𝕮𝖔𝖘𝖙𝖆 𝕽𝖎𝖈𝖆 instead, or 𝗖𝗮𝗹𝗹𝗲 𝗕𝗹𝗮𝗻𝗰𝗼𝘀, 𝗖𝗼𝘀𝘁𝗮 𝗥𝗶𝗰𝗮.
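
So how do you actually "remove the formatting"? One option, sketched here in Python as an assumption on my part rather than anything the original emailer tried, is Unicode compatibility normalization (NFKC), which folds these mathematical letters back to their plain counterparts. The `strip_fake_formatting` name is hypothetical:

```python
import unicodedata

def strip_fake_formatting(text):
    """Hypothetical helper: fold compatibility characters to their plain forms via NFKC."""
    return unicodedata.normalize("NFKC", text)

# "Calle" in three styled alphabets, written as escapes so the sample is unambiguous.
bold = "\U0001D402\U0001D41A\U0001D425\U0001D425\U0001D41E"     # mathematical bold
fraktur = "\U0001D56E\U0001D586\U0001D591\U0001D591\U0001D58A"  # bold Fraktur
sans = "\U0001D5D6\U0001D5EE\U0001D5F9\U0001D5F9\U0001D5F2"     # sans-serif bold

for styled in (bold, fraktur, sans):
    print(strip_fake_formatting(styled))  # prints "Calle" each time
```

NFKC also folds other compatibility characters (ligatures, full-width forms, superscript digits), so it can change more than you intended. The styled variants survive copy/paste precisely because they are ordinary characters, which is why people use them to fake bold or blackletter text where no real formatting is available.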

Should you do this to get super sweet effects in places you otherwise couldn’t? Probably not. The accessibility is rough. Listen to the audio output in this blog post. If you’re going to do it on the web where you have HTML control, do something like:

<!-- Don't do this! Leaving for posterity. -->
<span aria-label="Calle Blancos, Costa Rica">
  <span aria-hidden="true">𝕮𝖆𝖑𝖑𝖊 𝕭𝖑𝖆𝖓𝖈𝖔𝖘, 𝕮𝖔𝖘𝖙𝖆 𝕽𝖎𝖈𝖆</span>
</span>

UPDATE: See Ben’s comment below on why not to do the above. Instead, pair a visually hidden version that a screen reader will still read with an aria-hidden one that is shown visually. (Keep in mind the copy/paste concerns that started this whole article.)

<span class="visually-hidden">Calle Blancos, Costa Rica</span>
<span aria-hidden="true">𝕮𝖆𝖑𝖑𝖊 𝕭𝖑𝖆𝖓𝖈𝖔𝖘, 𝕮𝖔𝖘𝖙𝖆 𝕽𝖎𝖈𝖆</span>


3 responses to “Can’t seem to remove the formatting from a string of text?”

  1. Ben Myers says:

    Howdy! Great callout on not using alternate Unicode characters in place of the true characters for these letters. Unfortunately, placing an aria-label on a roleless span (or any generic element) is not a valid use of aria-label, and so you won’t get the results this article would expect in most screenreader/browser combinations. VoiceOver for macOS will do this substitution, which is what leads to developers’ expectations in this case, but this is nonstandard behavior that shouldn’t be relied upon.

    In this case, the safest thing to do would probably be to combine a .visually-hidden/.sr-only span with the safe characters, with an aria-hidden span of the alternate Unicode characters.

  2. James Moberg says:

    I focus more on backend (versus frontend) using ColdFusion. CF runs on top of Java and I use the java.text.Normalizer class and JUnidecode library to normalize Unicode strings and reduce them to ASCII 7. (I started doing this because comment form spammers started using Unicode to bypass spam filters.)
    https://github.com/gcardone/junidecode

    Related to this, I added this function to a REST API and wrote a Windows AutoHotKey shortcut to take my clipboard, pass the contents to the API and return ASCII7 content free of any Unicode formatting.
