
Language Compression Comparison

UTF-8 and ϕTSE (ϕText Stream Encoding) both support all 21 bits of the Unicode Flat Character. ϕTSE includes an additional 11 bits of text properties and paragraph formatting. When a series of Flat Characters is fed to an Encoder, a stream of bytes is produced as output. When that same Byte Stream is fed into the corresponding Decoder, the original series of Unicode Flat Characters is recovered. For this to work, the Encoder and Decoder must stay synchronized: both must agree on where each character begins in the byte stream. A missing byte, an extra byte or a changed byte can throw the Decoder into a state different from the one the Encoder was in when the byte was produced. This is called loss of synchronization. If the Decoder does not soon resynchronize, erroneous text can be produced or omitted well beyond the bytes directly affected by the corruption. Some text encoding methods that have been in widespread use fare poorly at recovering synchronization. Of particular note are Shift JIS (SJIS), EUC-JP and ISO-2022-JP, all used for encoding Japanese; they are notorious for excessive loss of data after corruption because they are not self-synchronizing.

Fortunately, both UTF-8 and ϕTSE are self-synchronizing. Neither method can recover a character that depended on the extra, missing or changed bytes, because neither has any means of error recovery. But both resynchronize on known boundaries, which limits the effect of the corruption. With explicit character encoding, both UTF-8 and ϕTSE resynchronize at the beginning of every character. If compression is used in ϕTSE, some of the information may be corrupted up to the end of the current paragraph. After that, the Decoder fully resynchronizes and the data will be correct until another error occurs in the byte stream.
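
To make the self-synchronizing property concrete, here is a minimal Python sketch (mine, not part of either specification) of a decoder that recovers after corruption in a UTF-8 stream. It relies on the fact that UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so a decoder that loses its place can simply skip bytes until it reaches the next lead byte:

    def resync_decode(data: bytes) -> str:
        """Decode UTF-8, skipping bytes until the next character
        boundary whenever corruption is found (a sketch, not a
        production decoder)."""
        out = []
        i = 0
        while i < len(data):
            # Continuation bytes match 10xxxxxx; lead bytes never do.
            if data[i] & 0xC0 == 0x80:
                i += 1                      # skip stray continuation byte
                continue
            # Try to decode one character starting at this lead byte.
            for width in (1, 2, 3, 4):
                try:
                    out.append(data[i:i + width].decode("utf-8"))
                    i += width
                    break
                except UnicodeDecodeError:
                    continue
            else:
                i += 1                      # undecodable lead byte: drop it
        return "".join(out)

    # One corrupted byte loses one character, then decoding recovers.
    good = "héllo".encode("utf-8")
    bad = good[:1] + bytes([0x80]) + good[2:]   # clobber part of 'é'
    print(resync_decode(bad))                    # 'hllo'

Only the character that overlapped the corrupted byte is lost; everything after it decodes correctly.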

An encoded Byte Stream is generally considered an intermediate form of text storage that is often, but not always, smaller than the series of Unicode characters from which it was produced. But encoding can have other benefits. For one, a Byte Stream is endian independent while a series of Unicode Flat Characters is not. For another, the method can apply one or more compression algorithms to make the data even smaller than simple encoding would. Byte Stream Encoded text is the form most often used for text literals and for storing text files. But the Flat Character Format is still very important because it is the form that software can most easily manipulate directly.
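
The endian independence is easy to see with the standard Python codecs: the UTF-32 code unit of a character comes out in different byte orders depending on the machine convention, while its UTF-8 byte sequence is always the same:

    ch = "é"                             # U+00E9
    print(ch.encode("utf-32-le").hex())  # 'e9000000' — little-endian code unit
    print(ch.encode("utf-32-be").hex())  # '000000e9' — big-endian code unit
    print(ch.encode("utf-8").hex())      # 'c3a9' — same bytes on any machine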

Unicode greatly expands the number of symbols that you can put into your text documents beyond those supported in familiar plain ASCII text editors, but nothing more. ϕText does that plus adds text properties and paragraph formatting, which makes its documents look more like those produced with a word processor. Encoding of ϕText via ϕTSE includes three forms of compression, while UTF-8 has none. In the first, a run of 2–255 consecutive repetitions of a symbol is compressed into a two-byte sequence. This is particularly efficient for source code with extensive indentation or the repeating symbol sequences often used in comment boxes. The second form substitutes a single byte code for any of the last 32 non-ASCII symbols seen in the byte stream, eliminating up to three bytes per character. The third form substitutes a 2-byte code for any symbol in the 32 most recently seen 128-symbol "pages" whose symbols require 3- or 4-byte encodings, eliminating 1 or 2 bytes per character. When compression is turned off for plain text with no text properties, UTF-8 and ϕTSE produce files of the same size, because the number of bytes produced in the byte stream for any one Unicode symbol is the same for the two algorithms. ϕTSE simply makes more efficient use of the available code space: the encoding space that is wasted in UTF-8 is used to record the additional information handled by ϕTSE.
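
The exact ϕTSE byte layout is not given in this article, but the flavor of the first two compression forms can be sketched in a few lines of Python. The token shapes below are invented purely for illustration; in the real encoder a run costs two bytes and a recent-symbol reference one byte:

    from collections import deque

    # Illustrative tokens only; the real ϕTSE byte layout differs.
    OP_RUN = "RUN"        # (OP_RUN, count, symbol): form 1, run-length
    OP_RECENT = "RECENT"  # (OP_RECENT, index): form 2, recent-symbol cache

    def compress(text: str) -> list:
        """Sketch of the first two ϕTSE compression forms."""
        recent = deque(maxlen=32)        # the last 32 non-ASCII symbols seen
        out = []
        i = 0
        while i < len(text):
            sym = text[i]
            # Form 1: collapse a run of 2-255 repeats into one token.
            run = 1
            while i + run < len(text) and run < 255 and text[i + run] == sym:
                run += 1
            if run >= 2:
                out.append((OP_RUN, run, sym))
                i += run
                continue
            # Form 2: reference a recently seen non-ASCII symbol by index.
            if ord(sym) > 0x7F:
                if sym in recent:
                    out.append((OP_RECENT, list(recent).index(sym)))
                    recent.remove(sym)   # refresh its cache position
                else:
                    out.append(sym)      # first sighting: emit literally
                recent.append(sym)
            else:
                out.append(sym)          # ASCII passes through unchanged
            i += 1
        return out

    # Indentation collapses to one token; repeated Greek letters become
    # one-element cache references after their first appearance.
    print(compress("    αβγ αβγ"))

The third form is omitted here, but it applies the same recency idea one level up, caching 128-symbol pages rather than individual symbols.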

Byte Streams encoded in UTF-8 retain the same sorting order as the un-encoded sequences of characters; those in ϕTSE do not. However, this property is of limited value, since upper- and lower-case alphabetic characters do not order correctly in either one. The ϕName convention solves this problem for identifiers as they are defined in ϕ, but only for identifiers composed of those characters of the Roman, Greek and Cyrillic alphabets that are included in the character subset (in addition to underscore, the ten numerals and a separator).
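
The order-preserving property is easy to verify in Python, and the same check illustrates the case problem: "Zoo" sorts ahead of "cafe" whether you compare code points or UTF-8 bytes:

    words = ["café", "cafe", "Zoo", "zoo", "αβ"]
    by_codepoints = sorted(words)       # Python compares code points
    by_utf8 = [b.decode() for b in sorted(w.encode("utf-8") for w in words)]
    print(by_codepoints == by_utf8)     # True: byte order matches code points
    print(by_codepoints)                # ['Zoo', 'cafe', 'café', 'zoo', 'αβ']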

Character Set Bias

What a single character represents in the Roman, Greek and Cyrillic character sets is different from Japanese, which is itself different from Chinese. European characters represent sounds; one or more of these are combined to make a syllable. A single Japanese character represents a syllable. By contrast, a single Chinese character generally represents a whole word. Because of this, Chinese gobbled up far more of the Unicode code space than any other character set. It should not have been placed into the Basic Multilingual Plane. But because it was, the same text in Chinese usually requires less file space than most other non-Roman/Greek/Cyrillic text, including Japanese.

File Sizes for European Character Sets

ASCII covers the most commonly used Roman characters. This means that text in English, Spanish, French, German and other Roman-alphabet languages rarely needs characters outside it, which keeps file sizes to a minimum. By contrast, text in Cyrillic or Greek produces considerably larger files. English benefits the most because it almost never needs anything outside of ASCII.

UTF-8 versus ϕTSE

Because English is so ASCII-centric, it benefits little from the second and third forms of compression. Other languages based on the Roman character set benefit a little more because of the non-ASCII symbols sprinkled through their text. Languages based on other alphabets benefit far more. Without compression, languages based on Cyrillic suffer more than English because most of their characters require encodings of 2 or more bytes. Japanese benefits from a syllable-level character set that reduces the number of characters needed, but those characters still require more than one byte each to encode. Chinese benefits even more with its word-level character set.
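
The per-script costs driving these differences are easy to tabulate for UTF-8 (the sample characters below are arbitrary but representative of their scripts):

    for ch in ("e", "é", "λ", "ю", "あ", "中"):
        print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
    # ASCII: 1 byte; accented Latin, Greek and Cyrillic: 2 bytes;
    # Japanese kana and Chinese hanzi: 3 bytes each.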

To get an idea of the relative sizes of files that contain the same information in different languages, I used Google and other resources to translate the Gettysburg Address into Spanish, German, French, Greek, Russian, Japanese and Chinese. Then I stored them using both UTF-8 and ϕTSE. Below are my results:

Language    UTF-8 (bytes)    ϕTSE (bytes)    Change (%)
English          1521            1521           0.00%
Spanish          1638            1635          −0.18%
German           1724            1725          +0.06%
French           1764            1748          −0.91%
Greek            3024            1816         −39.94%
Russian          2676            1617         −39.57%
Japanese         2080            1257         −39.57%
Chinese           538             390         −27.51%
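
The UTF-8 column can be reproduced with nothing but the standard library; the filenames below are placeholders for the translated texts, and the ϕTSE column would of course require a ϕTSE encoder:

    from pathlib import Path

    # Placeholder filenames: one translation of the Gettysburg Address per file.
    for name in ("english.txt", "greek.txt", "chinese.txt"):
        text = Path(name).read_text(encoding="utf-8")
        print(name, len(text), "characters,", len(text.encode("utf-8")), "UTF-8 bytes")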

I found it interesting that the Greek UTF-8 file was the largest of all. That shouldn't be surprising, since ASCII is heavily weighted toward the Roman character set. Since Japanese characters represent syllables instead of fundamental sounds, the Japanese text had fewer characters. Chinese required even fewer characters, since most words can be represented by only one or a few characters. Even though those characters require more storage than Roman characters, the file ended up considerably smaller anyway. Another interesting observation is that Greek, Russian and Japanese all benefited the most from compression, and all by about the same amount, with just under a 40% reduction in size.

Storing the English in ϕTSE didn't change the file size at all, because compression only kicks in for characters that are not ASCII. Japanese began larger than its English equivalent but ended up smaller, largely because it was composed of fewer characters to begin with.