matatu blog

Brian writes about computing

Whitespace Characters and Coding

Nov 24, 2018

On the presence or absence of whitespace in code.

Part 1: Types of whitespace

$ raku -e 'say " ".uniname'
SPACE
$ raku -e 'say " ".ord'
32
$ raku -e 'say " ".ord.fmt("%x")'
20

ASCII 32, hex 0x20, is the space character.

There is also tab, of course.

$ raku -e 'say "\t".ord.fmt("%x")'
9
$ raku -e 'say "\t".ord'
9
$ raku -e 'say "\t".uniname'
<control-0009>

Okay, I guess it's also called control-0009.

Moving on, we have the ubiquitous newline, also known as line feed.

$ raku -e 'say "\n".uniname'
<control-000A>

which is different on half the computers in the world, where a line ends with a carriage return followed by a line feed

$ raku -e 'say "\r\n".uninames'
(<control-000D> <control-000A>)

which I think is a neat example of how to count characters. Put these two together and you still have only one "character", because Raku counts graphemes rather than code points.

$ raku -e 'say "\r\n".chars'
1
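
To see both counts side by side, here is a minimal sketch using the built-in .chars, .codes, and .ords methods. The variable name is only for illustration.

```raku
my $crlf = "\r\n";

say $crlf.chars;   # graphemes: 1
say $crlf.codes;   # code points: 2
say $crlf.ords;    # (13 10)

# The reverse order does not combine into a single grapheme.
say "\n\r".chars;  # 2
```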

What do these all have in common?

$ raku -e 'say " ".uniprops("White_Space")'
(True)
$ raku -e 'say "\t".uniprops("White_Space")'
(True)
$ raku -e 'say "\r\n".uniprops("White_Space")'
(True True)

They all have the "White_Space" Unicode property.

So, what else has that property?

say (.fmt("%5d"),.fmt("0x%04x"),.uniname)
  for (0..0x10ffff).grep: *.uniprop("White_Space")

(    9 0x0009 <control-0009>)
(   10 0x000a <control-000A>)
(   11 0x000b <control-000B>)
(   12 0x000c <control-000C>)
(   13 0x000d <control-000D>)
(   32 0x0020 SPACE)
(  133 0x0085 <control-0085>)
(  160 0x00a0 NO-BREAK SPACE)
( 5760 0x1680 OGHAM SPACE MARK)
( 8192 0x2000 EN QUAD)
( 8193 0x2001 EM QUAD)
( 8194 0x2002 EN SPACE)
( 8195 0x2003 EM SPACE)
( 8196 0x2004 THREE-PER-EM SPACE)
( 8197 0x2005 FOUR-PER-EM SPACE)
( 8198 0x2006 SIX-PER-EM SPACE)
( 8199 0x2007 FIGURE SPACE)
( 8200 0x2008 PUNCTUATION SPACE)
( 8201 0x2009 THIN SPACE)
( 8202 0x200a HAIR SPACE)
( 8232 0x2028 LINE SEPARATOR)
( 8233 0x2029 PARAGRAPH SEPARATOR)
( 8239 0x202f NARROW NO-BREAK SPACE)
( 8287 0x205f MEDIUM MATHEMATICAL SPACE)
(12288 0x3000 IDEOGRAPHIC SPACE)
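
The same filter can be wrapped in a small helper for reuse. This is only a sketch: the name whitespace-codepoints is my own invention, not a built-in, and it assumes the full code point range 0..0x10FFFF.

```raku
# Hypothetical helper: every code point that carries the White_Space property.
sub whitespace-codepoints() {
    (0..0x10FFFF).grep(*.uniprop("White_Space")).List
}

my @ws = whitespace-codepoints();

say @ws.elems;                                # 25, matching the list above
say @ws.grep(* < 128).map(*.fmt("0x%02x"));   # (0x09 0x0a 0x0b 0x0c 0x0d 0x20)
```

Scanning the whole range takes a moment, so it is worth storing the result rather than recomputing it.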

One practical character among these is the "NO-BREAK SPACE": typed directly, it is a pithy alternative to &nbsp; or <nobr>.
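
Because it looks identical to an ordinary space, it is easy to lose track of which one a string contains. A small sketch, with made-up example strings, showing that the two are distinct characters even though they render alike:

```raku
my $plain  = "10 km";
my $sticky = "10\c[NO-BREAK SPACE]km";

say $plain eq $sticky;              # False
say $sticky.ords;                   # (49 48 160 107 109)
say $sticky.substr(2, 1).uniname;   # NO-BREAK SPACE
```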

Putting some positive space next to the negative space makes it easier to distinguish them.

say ("\c[FULL BLOCK]{.chr}\c[FULL BLOCK]",.uniname)
  for (0..0x10ffff).grep: *.uniprop("White_Space")

(█      █ <control-0009>)
(█
█ <control-000A>)
(█
  █ <control-000B>)
(██ <control-000C>)
█ <control-000D>)
(█ █ SPACE)
(█
█ <control-0085>)
(█ █ NO-BREAK SPACE)
(█ █ OGHAM SPACE MARK)
(█ █ EN QUAD)
(█ █ EM QUAD)
(█ █ EN SPACE)
(█ █ EM SPACE)
(█ █ THREE-PER-EM SPACE)
(█ █ FOUR-PER-EM SPACE)
(█ █ SIX-PER-EM SPACE)
(█ █ FIGURE SPACE)
(█ █ PUNCTUATION SPACE)
(█ █ THIN SPACE)
(█ █ HAIR SPACE)
(█
█ LINE SEPARATOR)
(█
█ PARAGRAPH SEPARATOR)
(█ █ NARROW NO-BREAK SPACE)
(█ █ MEDIUM MATHEMATICAL SPACE)
(█ █ IDEOGRAPHIC SPACE)    

We can also remove space with the zero width joiner.

The zero width joiner is not whitespace.

$ raku -e 'say "\c[ZERO WIDTH JOINER]".uniprops("White_Space")'
(False)

This is not to be confused with the zero width space, which is also not whitespace. The zero width space and the no-break space give you fine-grained control over where your browser may break a line when word-wrapping: the first permits a break, the second forbids one. But let's look at the joiner first:

[Image: Frank Stella, Harran II]

say "\c[MAN]\c[ARTIST PALETTE]"
# 👨🎨

say "\c[MAN]​\c[ZERO WIDTH JOINER]​\c[ARTIST PALETTE]"
# 👨‍🎨    

Do those look the same or different?

It may depend on your browser.

By the way, wouldn't this code be less readable if ARTIST PALETTE or ZERO WIDTH JOINER were split across lines? How did I prevent that? ¹
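
If your browser or font leaves you unsure whether the two emoji strings differ, the strings themselves can be inspected. A minimal sketch, using .uninames as above plus .codes (the code point count) and .uniprop with no argument, which reports the General_Category. The variable names are only for illustration.

```raku
my $apart  = "\c[MAN]\c[ARTIST PALETTE]";
my $joined = "\c[MAN]\c[ZERO WIDTH JOINER]\c[ARTIST PALETTE]";

say $apart.codes;      # 2
say $joined.codes;     # 3
say $joined.uninames;  # (MAN ZERO WIDTH JOINER ARTIST PALETTE)

# The grapheme count of the joined form depends on the Unicode
# segmentation rules your Raku build implements, so no output is claimed.
say $joined.chars;

# Both zero width characters are format characters, not space separators.
say "\c[ZERO WIDTH JOINER]".uniprop;  # Cf
say "\c[ZERO WIDTH SPACE]".uniprop;   # Cf
say "\c[NO-BREAK SPACE]".uniprop;     # Zs
```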


Conclusions

  • There are many representations of whitespace.
  • Whitespace has two opposites: presence of non-white space, and lack of space.
  • Deliberate use of space affects both presentation and content.
  • In part 2, we look at the use of whitespace, non-white space, and unspacing in several programming languages.


    1. "\\c[MAN]\c[ZERO WIDTH SPACE]\\c[ZERO\c[NO-BREAK SPACE]WIDTH\c[NO-BREAK SPACE]JOINER]\c[ZERO WIDTH SPACE]\\c[ARTIST\c[NO-BREAK SPACE]PALETTE]"