
UTF-8 and ϕTSE (ϕText Stream Encoding) both
support the full 21 bits of a Unicode Flat Character. ϕTSE
includes an additional 11 bits of text properties and
paragraph formatting. If a series of flat characters is fed
to an Encoder, a stream of bytes is produced as
output. If that same byte stream is fed into the
corresponding Decoder, the original series of Unicode
Flat Characters is recovered. The Encoder and Decoder
must stay synchronized to work properly: synchronization
means that the Encoder and Decoder both
understand where each character begins in the byte stream. A
missing, extra, or changed byte can throw the Decoder
off so that it is no longer in the state the Encoder was in
when the byte was produced. This is called loss of
synchronization. If the Decoder does not resynchronize
quickly, erroneous text can be produced or omitted well
beyond the bytes directly affected by the corruption.
Some widely used text encoding methods do not fare
well at recovering synchronization. Of
particular note are Shift JIS (SJIS),
EUC-JP, and ISO-2022-JP,
which are used for encoding Japanese and are notorious for
excessive data loss after corruption because they are not
self-synchronizing.
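The failure mode is easy to reproduce. The snippet below is a minimal demonstration (the sample text is arbitrary): it drops one byte from a Shift JIS stream and shows the damage spreading well past the corrupted character.

```python
# A single dropped byte in a Shift JIS stream corrupts text far beyond
# the point of damage: the decoder cannot tell lead bytes from trail
# bytes, so it cannot find the next character boundary on its own.
text = "日本語のテキストです"            # arbitrary Japanese sample
good = text.encode("shift_jis")          # two bytes per character here

bad = good[:1] + good[2:]                # drop the second byte
print(bad.decode("shift_jis", errors="replace"))
# The byte pairs that follow are all misaligned, so most of the
# remaining text decodes to wrong characters or replacement marks.
```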
Fortunately, both UTF-8 and ϕTSE are
self-synchronizing. Neither method can recover a character
that depended on the extra, missing, or changed bytes,
because neither has any means of error recovery. But both
resynchronize on known boundaries, which limits the effect of
the corruption. With explicit character encoding, both UTF-8 and
ϕTSE resynchronize at the beginning of
every character. If compression is used in ϕTSE, some
information may remain corrupted up to the end of the current
paragraph. After that, the Decoder fully resynchronizes and
the data will be correct until another error occurs in the
byte stream.
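A minimal sketch of the resynchronization rule on the UTF-8 side (the boundary test is the standard UTF-8 one; ϕTSE's own lead-byte patterns are not shown here):

```python
# UTF-8 continuation bytes all match the bit pattern 10xxxxxx, so a
# decoder that loses its place can skip forward to the next byte that
# is NOT a continuation byte and resume decoding there.
def next_boundary(stream: bytes, pos: int) -> int:
    """Index of the next character boundary at or after pos."""
    while pos < len(stream) and (stream[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

text = "héllo wörld".encode("utf-8")
damaged = text[:1] + text[2:]            # drop the lead byte of 'é'
# Python's decoder applies the same rule: only the damaged character
# is lost, and everything after the next boundary is recovered.
print(damaged.decode("utf-8", errors="replace"))   # h�llo wörld
```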
Unicode greatly expands the number of symbols
that you can put into your text documents beyond those
supported in familiar plain ASCII text editors, but nothing
more. ϕText does that and also adds text properties and paragraph
formatting, which makes its documents look more like those
produced with a word processor. Encoding ϕText via ϕTSE
includes three forms of compression, while UTF-8 has
none. The first compresses a run of 2 to 255 consecutive
copies of the same symbol into a two-byte code. This is particularly
efficient for source code that uses extensive indentation or
the repeating sequences of symbols often used in comment boxes. The
second form substitutes a single-byte code for
any of the last 32 non-ASCII symbols seen in the byte stream,
eliminating up to three bytes per character. The third
form substitutes a two-byte code for any symbol
in one of the 32 most recently used 128-symbol "pages" of
symbols that require 3- or 4-byte
encodings, eliminating 1 or 2 bytes per character. When
compression is turned off, plain text with no text
properties produces files of the same size in
UTF-8 and ϕTSE, because the two algorithms produce the same
number of bytes for any one Unicode symbol.
ϕTSE simply uses the available code space more
efficiently: the encoding space that is wasted in UTF-8 is
used to record the additional information handled by ϕTSE.
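The first two forms can be sketched as follows. This is an illustration only: the token layout and the move-to-front cache policy are assumptions, since the actual ϕTSE opcode assignments are not given here.

```python
# Conceptual sketch of ϕTSE compression forms 1 and 2.  Tokens stand in
# for the real byte codes: a "run" token would serialize to two bytes,
# a "recent" token to one byte (both layouts are assumptions).
def compress(symbols: list[int]) -> list[tuple]:
    out: list[tuple] = []
    recent: list[int] = []               # last 32 non-ASCII symbols seen
    i = 0
    while i < len(symbols):
        sym, run = symbols[i], 1
        while i + run < len(symbols) and symbols[i + run] == sym and run < 255:
            run += 1
        if run >= 2:
            out.append(("run", sym, run))              # form 1: run of 2..255
            i += run
        elif sym > 0x7F and sym in recent:
            out.append(("recent", recent.index(sym)))  # form 2: cache hit
            i += 1
        else:
            out.append(("literal", sym))               # ordinary full encoding
            i += 1
        if sym > 0x7F:                   # remember recently seen symbols
            if sym in recent:
                recent.remove(sym)
            recent.insert(0, sym)
            del recent[32:]              # keep only the last 32
    return out

# Heavy indentation compresses well under form 1:
line = [ord(c) for c in "        x = 1"]
print(compress(line))    # [('run', 32, 8), ('literal', 120), ...]
```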
Byte streams encoded in UTF-8 retain
the same sorting order as their un-encoded sequences of
characters; those in ϕTSE do not. However, this property is
of little value, since upper- and lower-case alphabetic
characters do not order correctly in either one.
The ϕName convention solves this problem for
identifiers as they are defined in ϕ, but it is only useful
for identifiers composed of those characters of the
Roman, Greek, and Cyrillic alphabets that are included in the
character subset (in addition to underscore, the ten numerals,
and a separator).
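A quick illustration of why byte-order sorting is of limited use (plain Python; no ϕTSE involved):

```python
# UTF-8 byte order matches code-point order, but code-point order is
# not alphabetical across cases: 'B' (U+0042) sorts before 'a' (U+0061).
words = ["apple", "Banana", "cherry"]
print(sorted(words))                             # ['Banana', 'apple', 'cherry']
print(sorted(w.encode("utf-8") for w in words))  # the same (byte) order
```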
What a single character represents in the
Roman, Greek, and Cyrillic character sets is different from
Japanese, which is itself different from Chinese. European
characters represent sounds, and one or more of these are
combined to make a syllable. A single Japanese kana character
represents a syllable. By contrast, a single Chinese
character generally represents a whole word. Because of
this, Chinese gobbled up far more of the Unicode code space
than any other character set. It should not have been placed
into the Basic Multilingual Plane. But because it was, the same
text in Chinese usually requires less file space than most
other non-Roman/Greek/Cyrillic text, including Japanese.
ASCII includes the most
commonly used Roman characters, so files with
text in English, Spanish, French, German, and similar
languages rarely need other characters, which keeps their
files small. By contrast, files in Cyrillic or Greek are
considerably larger. English benefits the
most because it rarely needs anything that is outside of
ASCII.
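The per-script costs of UTF-8 are easy to verify, and they explain the pattern in the table that follows (the sample strings are arbitrary):

```python
# UTF-8 spends 1 byte per ASCII character, 2 per Greek or Cyrillic
# character, and 3 per typical CJK character.
for sample in ("hello", "γεια", "привет", "你好"):
    data = sample.encode("utf-8")
    print(f"{sample!r}: {len(sample)} chars -> {len(data)} bytes")
# 'hello': 5 chars -> 5 bytes
# 'γεια': 4 chars -> 8 bytes
# 'привет': 6 chars -> 12 bytes
# '你好': 2 chars -> 6 bytes
```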
| Language | UTF-8 (bytes) | ϕTSE (bytes) | Change (%) |
|----------|---------------|--------------|------------|
| English  | 1521          | 1521         | 0%         |
| Spanish  | 1638          | 1635         | −0.18%     |
| German   | 1724          | 1725         | +0.06%     |
| French   | 1764          | 1748         | −0.91%     |
| Greek    | 3024          | 1816         | −39.94%    |
| Russian  | 2676          | 1617         | −39.57%    |
| Japanese | 2080          | 1257         | −39.57%    |
| Chinese  | 538           | 390          | −27.51%    |