
According to Wikipedia, in August of 1988, Joe Becker (an employee of Xerox) published a draft proposal for a character encoding system to be called “Unicode”. The main purpose of that effort was to digitize the world's written languages for electronic publishing. It is now the standard for representing text on the World Wide Web. Unaware of that development at the time, during that same month (amazing!) I began work on a text system for an advanced programming language. Its purpose was to provide a rich text environment for creating source code documents with a larger character set than ASCII provides. Source code files would have more of the look and feel of word processor documents than of the plain ASCII text that is the general rule today. I now call it ϕText1. As a C programmer (since 1985), I had found it obvious that Dennis Ritchie had run out of characters when defining the C programming language. ϕText1 defined a 32-bit “Flat Character” that would serve as the basis for representing a single character. The low-order 11 bits encoded 2048 displayable symbols while the upper 21 bits were used to encode text properties.
ϕText1 established a scheme in which a series of flat characters was encoded into a byte stream. For ASCII characters, the high bit was clear and no further encoding was needed. All other characters (along with their text properties) were encoded using a sequence of two or more bytes beginning with a byte with its high bit set. The other bits were used to identify what was being encoded. This system solved many problems, including the byte-order (endianness) dilemma. The algorithm was inherently self-synchronizing, which limited data loss caused by corruption in the data stream. I now call it ϕTSE1 for “ϕText1 Stream Encoding”. I wrote the first text editor (called ϕEdit) that implemented the encoding scheme. Programmable characters were displayed on a Hercules Graphics Card Plus running on MS-DOS in an IBM-compatible PC (Model XT clone). Editing was done using the Flat Character format, but the characters were stored using ϕTSE1.
Over time I became aware of Unicode. In 2009 I decided that ϕText needed to become Unicode compatible. Using the same encoding algorithms I had developed for ϕText1, I redefined the system so that the 32-bit flat character would accommodate the much larger Unicode Code Point. The new Flat Character has a 21-bit Symbol Code, which leaves only 11 bits for Text Properties. Gone were the Background Colors, among other things. But it turned out to work pretty well. I now call this new system ϕText2. Since I now consider ϕText1 to be obsolete, I just call the new system ϕText.
Since 1988, ϕEdit has gone through six generations. Gen 1 used the Hercules Graphics Card Plus running on MS-DOS. Gen 2 used the Hercules InColor Card. It was the first to show colored text. Gen 3 used an IBM VGA card and still ran on MS-DOS. Gen 4 was the first version to run on Windows NT 4.0. It still used bit-mapped fonts, so the printout didn't look very good. Gen 5 was the first version to use TrueType and OpenType fonts. It was buggy and its data structures needed rework. It didn't support the different character sizes. Gen 6 is the version I use now, and it supports all of the text sizes, colors, styles and attributes as well as paragraph formatting.

The ϕText Flat Character is the form in which individual characters are processed. It is a 32-bit value composed of a number of fields. The low-order 21-bit Symbol field holds the Unicode Code Point. Next is the Style field, which determines the typeface of the character. The Size field stores one of four sizes. Next are the three Attribute bits, which can be individually set and cleared. These are for Italic, Underscore and Bold. Finally there is the 3-bit Color field. This gives your text one of eight foreground colors, which are similar to the resistor color codes. Background color is always assumed to be white, as printed on paper. Each field value and its meaning are tabulated below:
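The field layout described above can be sketched in C as a set of shifts and masks. This is only an illustration under stated assumptions, not the actual ϕText source: the 3-bit Style width is inferred from what is left of the 32 bits (32 - 21 - 2 - 3 - 3), the order of the three Attribute bits is a guess, and all names (`flat_char`, `flat_pack`, `ATTR_BOLD` and so on) are hypothetical.

```c
#include <stdint.h>

typedef uint32_t flat_char;

/* Field widths. Symbol (21), Size (four sizes = 2 bits), Attributes (3)
   and Color (3) follow the text; the 3-bit Style width is an assumption. */
enum {
    SYMBOL_BITS = 21, STYLE_BITS = 3, SIZE_BITS = 2, ATTR_BITS = 3, COLOR_BITS = 3,
    SYMBOL_SHIFT = 0,
    STYLE_SHIFT  = SYMBOL_SHIFT + SYMBOL_BITS,  /* 21 */
    SIZE_SHIFT   = STYLE_SHIFT + STYLE_BITS,    /* 24 */
    ATTR_SHIFT   = SIZE_SHIFT + SIZE_BITS,      /* 26 */
    COLOR_SHIFT  = ATTR_SHIFT + ATTR_BITS       /* 29 */
};

/* Hypothetical assignment of the three Attribute bits. */
#define ATTR_ITALIC     0x1u
#define ATTR_UNDERSCORE 0x2u
#define ATTR_BOLD       0x4u

/* Mask with the low `bits` bits set. */
static uint32_t field_mask(unsigned bits) { return (1u << bits) - 1u; }

/* Combine the five fields into one 32-bit flat character. */
static flat_char flat_pack(uint32_t symbol, uint32_t style, uint32_t size,
                           uint32_t attrs, uint32_t color)
{
    return ((symbol & field_mask(SYMBOL_BITS)) << SYMBOL_SHIFT)
         | ((style  & field_mask(STYLE_BITS))  << STYLE_SHIFT)
         | ((size   & field_mask(SIZE_BITS))   << SIZE_SHIFT)
         | ((attrs  & field_mask(ATTR_BITS))   << ATTR_SHIFT)
         | ((color  & field_mask(COLOR_BITS))  << COLOR_SHIFT);
}

/* Extract the individual fields again. */
static uint32_t flat_symbol(flat_char fc) { return (fc >> SYMBOL_SHIFT) & field_mask(SYMBOL_BITS); }
static uint32_t flat_style(flat_char fc)  { return (fc >> STYLE_SHIFT)  & field_mask(STYLE_BITS); }
static uint32_t flat_size(flat_char fc)   { return (fc >> SIZE_SHIFT)   & field_mask(SIZE_BITS); }
static uint32_t flat_attrs(flat_char fc)  { return (fc >> ATTR_SHIFT)   & field_mask(ATTR_BITS); }
static uint32_t flat_color(flat_char fc)  { return (fc >> COLOR_SHIFT)  & field_mask(COLOR_BITS); }
```

With this layout, `flat_pack(0x3D5, 0, 1, ATTR_BOLD, 2)` would describe a bold ϕ (U+03D5) in size 1 and color 2, and a plain ASCII character with all properties zero is numerically equal to its code point.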
ϕText supports the four kinds of paragraph
formatting found in most word processors. These
are:
1. Left Justified
2. Right Justified
3. Centered
4. Fully Justified
ϕText allows for lines to be up to 255
characters long. Lines longer than this will be wrapped
using a Line-Continuation terminator. This permits
paragraphs with indefinite length while keeping text on
lines within a reasonable horizontal viewing range.
ϕEdit has a parameter that controls where a line should
wrap. It provides a command that normalizes the number of
words on each line using word wrap. This automatically
inserts and deletes line continuation codes so as to make
the text fill the available width. When importing UTF-8 files,
Newline characters become End of Paragraph Left
Justified. When encoding a ϕText file to UTF-8, line
continuation codes are removed and all end of paragraph
codes are converted to Newline characters. Paragraph
formatting is lost. This is one of many reasons why source
code is generally kept in ϕText format instead of UTF-8.
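The export rule described above can be sketched in C as follows. This is a simplified illustration, not ϕEdit's code: the `text_line` record, the `line_end` enumeration and the function `export_plain` are all hypothetical, and each line's text is assumed to be plain ASCII that already carries its own spacing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical line terminators: one continuation code plus one
   end-of-paragraph code per paragraph format. */
typedef enum {
    END_LINE_CONTINUATION,
    END_PARA_LEFT,
    END_PARA_RIGHT,
    END_PARA_CENTERED,
    END_PARA_FULL
} line_end;

/* Hypothetical in-memory line: its text and how the line ends. */
typedef struct {
    const char *text;
    line_end    end;
} text_line;

/* Write the lines to `out` as plain text. Continuation codes are dropped,
   so wrapped lines join up; every end of paragraph becomes '\n', whatever
   its format. Returns the number of bytes written (not counting the
   terminating zero), or (size_t)-1 if `cap` is too small. */
static size_t export_plain(const text_line *lines, size_t count,
                           char *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return (size_t)-1;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(lines[i].text);
        if (n + len + 2 > cap)          /* room for text, '\n' and '\0' */
            return (size_t)-1;
        memcpy(out + n, lines[i].text, len);
        n += len;
        if (lines[i].end != END_LINE_CONTINUATION)
            out[n++] = '\n';            /* paragraph format is lost here */
    }
    out[n] = '\0';
    return n;
}
```

In this sketch a centered paragraph and a left-justified paragraph both end in the same Newline character, which is why a later import can only produce End of Paragraph Left Justified.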
A sequence of flat characters can be fed
into a ϕTSE (or UTF-8) encoder for storage in a file or even
memory. The output is a zero-terminated sequence of bytes.
The same sequence of flat characters can be recovered by
feeding the same byte stream into its paired decoder. Both
ϕText and UTF-8 encoded byte streams are self-synchronizing.
That keeps the encoder and decoder synchronized so that, if
there is data corruption in the encoded byte stream, the
decoder will quickly lock onto the beginning of the next
character. The basic encoding methods for ϕTSE and UTF-8
have the same efficiency when compression in ϕTSE is
disabled. But when compression is enabled, byte streams
generated by ϕTSE can be much smaller.
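Self-synchronization can be illustrated with UTF-8, whose byte layout is public: continuation bytes always have the form 10xxxxxx, so a decoder that lands in the middle of a damaged character only has to skip them to find the start of the next one. The C sketch below shows that one step. The function name `utf8_resync` is made up for this example, and the ϕTSE byte layout is not given here, so its equivalent is not shown.

```c
#include <stddef.h>

/* Return the index of the first byte at or after `pos` that is not a
   UTF-8 continuation byte (10xxxxxx), i.e. the first position where a
   character can begin. Returns `len` if no such byte remains. */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0u) == 0x80u)
        pos++;
    return pos;
}
```

At most three bytes are skipped in well-formed UTF-8, so the damage from a corrupted or dropped byte stays limited to the character it belongs to.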
1. Repeated Character. When text is all ASCII, the only compression that is realized is Repeated Character. This includes space characters. When an end of paragraph is seen, the last character seen is assumed to be a space for the upcoming line. Character indents are done with repeated spaces. Any number of repeated characters on a line is encoded in a 2-byte sequence. Three or more non-space characters in a row are also compressed using the same method.
2. Recently Used Symbol
3. Recently Used Symbol Page
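The Repeated Character method (item 1) can be sketched in C for a single all-ASCII line. The real ϕTSE byte values are not given here, so this sketch makes up its own: a hypothetical marker byte `REPEAT_MARK` followed by a count byte that means "repeat the previous character this many times". As described in item 1, the previous character is taken to be a space at the start of a line, so an indent costs two bytes.

```c
#include <stddef.h>

#define REPEAT_MARK 0x80u  /* hypothetical marker, not the real ϕTSE code */
#define MIN_RUN     3      /* non-space runs shorter than this stay literal */

/* Encode one ASCII line of up to 255 characters. `out` must hold at
   least `len` bytes. Returns the number of bytes written. */
static size_t rle_encode_line(const char *line, size_t len, unsigned char *out)
{
    unsigned char last = ' ';   /* assumed previous character at line start */
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char c = (unsigned char)line[i];
        size_t run = 1;
        while (i + run < len && run < 255 && (unsigned char)line[i + run] == c)
            run++;
        size_t literals = run;              /* copies written as plain bytes */
        if (c == last && run >= 2)
            literals = 0;                   /* whole run repeats `last` */
        else if (run >= MIN_RUN)
            literals = 1;                   /* one literal, then a repeat code */
        if (literals == run) {
            for (size_t k = 0; k < run; k++)
                out[n++] = c;
        } else {
            if (literals)
                out[n++] = c;
            out[n++] = REPEAT_MARK;
            out[n++] = (unsigned char)(run - literals);
        }
        last = c;
        i += run;
    }
    return n;
}

/* Decode bytes produced by rle_encode_line. `out` must be large enough
   for the decoded line. Returns the number of characters written. */
static size_t rle_decode_line(const unsigned char *in, size_t n, char *out)
{
    unsigned char last = ' ';
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == REPEAT_MARK && i + 1 < n) {
            for (unsigned k = 0; k < in[i + 1]; k++)
                out[len++] = (char)last;
            i++;                            /* skip the count byte */
        } else {
            last = in[i];
            out[len++] = (char)last;
        }
    }
    return len;
}
```

For a source line indented by eight spaces, the indent shrinks from eight bytes to two, which is where most of the savings come from in ASCII source code.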