©2025, technoventure, inc.

Making Use of Unicode

In August of 1988, I began development of a text editor that I now call ϕEdit. The system defined a 32-bit character that had an 11-bit symbol and 21 bits of text properties. I also developed a byte-stream encoding method (called ϕTSE) for storing text to and retrieving it from files. Unbeknownst to me at the time, the Unicode effort began that very same month, but our goals were different. Unicode's purpose was to represent the world's written text for use on the World Wide Web and in electronic publishing. My purpose was to define a character set for programming computers. Eventually, I became aware of Unicode and decided to adopt it for my own use. The transition went remarkably smoothly. The new system still uses a flat 32-bit character, but it now accommodates a 21-bit symbol, with the text properties reduced to 11 bits. I call this system ϕText.
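A 21-bit symbol alongside 11 bits of properties fits exactly in a 32-bit word. As a sketch only (the field order shown here is my assumption, not ϕText's documented layout), the packing might look like this in Python:

```python
# Assumed layout: low 21 bits = Unicode scalar value, high 11 bits = text
# properties. This is illustrative; ϕText's actual bit assignment may differ.

SYMBOL_BITS = 21
SYMBOL_MASK = (1 << SYMBOL_BITS) - 1   # 0x1FFFFF

def pack_char(symbol: int, props: int) -> int:
    """Pack a 21-bit Unicode scalar and 11 bits of properties into 32 bits."""
    assert 0 <= symbol <= 0x10FFFF      # every Unicode scalar fits in 21 bits
    assert 0 <= props < (1 << 11)
    return (props << SYMBOL_BITS) | symbol

def unpack_char(c: int) -> tuple[int, int]:
    """Recover (symbol, properties) from a packed 32-bit character."""
    return c & SYMBOL_MASK, c >> SYMBOL_BITS

packed = pack_char(ord('×'), 0b101)
symbol, props = unpack_char(packed)
```

Note that 21 bits is no accident: Unicode's code space runs from U+0000 to U+10FFFF, and 0x10FFFF requires exactly 21 bits.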

I am in accord with the philosophy of Unicode. Highest priority for what a character means and how it looks is given to how it was used in print long before the days of computer programming; relatively little consideration is given to how programmers may have used it more recently. For the most part, I am on board with this philosophy, even though it (re)interprets some ASCII characters to the likely dismay of many modern-day programmers. For example, the asterisk ‘*’ is not the multiply operator, nor does it sit near the text's baseline like the ‘+’ does. Instead, it is used to indicate footnotes and is raised above the baseline like a superscript should be. The proper multiply operator (which most mathematicians unfortunately omit from their expressions) is the one we all learned in elementary school: the times operator ‘×’, which looks something like the Roman letter ‘x’. My first programming experience was with the Radio Shack TRS-80, and I was immediately puzzled why it used the asterisk for multiply instead of the correct symbol. It took me years to realize why: the team that created ASCII simply left ‘×’ out, and programmers recruited the asterisk for the purpose because it was what they had. A similar thing can be said about the slash ‘/’ and the divide symbol ‘÷’. Many of my readers may not realize that Unicode defines the ASCII ‘-’ as a hyphen (U+002D HYPHEN-MINUS) rather than the true subtraction operator ‘−’ (U+2212 MINUS SIGN), which is not in ASCII either. Yet the hyphen is what virtually all modern programming languages use today as the subtraction operator.
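These distinctions are not my invention; they are spelled out in Unicode's own character names, which can be inspected with Python's standard unicodedata module:

```python
import unicodedata

# The look-alike pairs discussed above: the ASCII stand-in that
# programmers adopted, followed by the typographically correct operator.
pairs = [('*', '×'), ('/', '÷'), ('-', '−')]

for ascii_ch, proper in pairs:
    for ch in (ascii_ch, proper):
        # unicodedata.name() returns the official Unicode character name.
        print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')
    print()
```

Running this shows, for instance, that ‘-’ is named HYPHEN-MINUS while ‘−’ is MINUS SIGN, and that ‘×’ is MULTIPLICATION SIGN rather than any kind of letter.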

My intent is to make extensive use of ASCII characters in places where they make sense, with a strong preference for what is familiar. But changes had to be made for practices that I thought were bad, including some of the ones discussed above. I didn't want to get carried away with using new symbols willy-nilly; I only wanted to define a new character when I thought it would be genuinely beneficial. And a new symbol must look distinctly different from all others used in the system so that the two are not confused. It is preferable to use a symbol that Unicode has already defined rather than to create a new one. There is a procedure for proposing new Unicode symbols, but requests are likely to be rejected or deferred if the symbol has not already been in use in one of the hundreds of existing written languages. Unicode also provides a “Private Use Area” for defining new symbols for anyone's private use, but such characters will not display properly when the viewer's computer does not have a font installed that covers them. Many web pages store text in forms that do not support user-defined fonts. So it is greatly preferable to use Unicode symbols that already exist, so that any reader will see something that bears some resemblance to the glyph I intend for them to see. An example pair of characters that I have adopted are the Record Braces ‘’ and ‘’. The glyphs commonly substituted for these two characters are typically close enough to know what I am referring to when displayed with the default fonts on third-party web pages, but they look precisely as intended when displayed using the PhiBASIC font, which can be downloaded from the home page of this website.
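For reference, the Private Use Areas occupy three fixed ranges of the code space, so testing whether a code point falls in one is a simple range check:

```python
def in_private_use_area(cp: int) -> bool:
    """True if a code point lies in one of Unicode's Private Use Areas."""
    return (0xE000 <= cp <= 0xF8FF          # Basic Multilingual Plane PUA
            or 0xF0000 <= cp <= 0xFFFFD     # Plane 15 (Supplementary PUA-A)
            or 0x100000 <= cp <= 0x10FFFD)  # Plane 16 (Supplementary PUA-B)

# A reader whose system lacks a font covering these ranges will typically
# see a fallback box or replacement glyph instead of the intended symbol.
```

This is exactly why characters placed there cannot be relied upon on third-party web pages: their appearance is entirely at the mercy of the viewer's installed fonts.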

I avoid using some features of Unicode because of their ill side effects. Some code points are not characters in themselves but instead modify the ones that come after them. These are called “Combining Characters,” and they are never used in ϕText source-code syntax. They can be enclosed within string literals, but their behavior is not guaranteed to be what you would expect. Even though they are not displayable characters in themselves, they still require storage space, so they reduce the number of characters that can be stored on a line. ϕEdit limits line length to 255 characters, and the number of characters shown on a line will be reduced by one for every combining character it contains. Worse, ϕEdit, in its current form, will fail to place the cursor where it belongs on a line that contains them, so they should simply be avoided in source code. Another problem feature is scripts that are written right to left instead of left to right, such as Hebrew and Arabic. Even Windows' GDI seems to get confused when printing strings that mix Hebrew with Roman, for example. One day the technology may become more dependable, and my understanding may become good enough, that the two can be intermixed. But until that day comes, I will simply avoid them in my source code.