Oliver Bonten
2004-05-17 19:49:48 UTC
Hello,
Does anybody have statistical information about average word lengths, word
counts and byte consumption in Thai and Vietnamese text compared to
English (or other western European languages)? Specifically, I'm
interested in
- average word count
- average length of words in characters
- average length of words in UTF-8 bytes (that's easy for Thai, since most
characters take 3 bytes, but I have no idea about Vietnamese, which seems to
use characters with 1-, 2- and 3-byte encodings and also to make liberal
use of diacritical marks).
- average number of *different* words
compared between English text and a Thai/Vietnamese translation of the
same text.
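To illustrate the byte-length point, here is a minimal Python sketch that checks how many UTF-8 bytes a few sample characters take. The sample characters and the helper name utf8_widths are my own choices for illustration. It also shows that the same Vietnamese letter can cost more bytes if the text stores diacritics as separate combining marks (decomposed form) rather than as one precomposed character, so any byte statistics would depend on which form the documents use.

```python
import unicodedata


def utf8_widths(text):
    """Map each distinct character in text to its UTF-8 length in bytes."""
    return {ch: len(ch.encode("utf-8")) for ch in text}


# Illustrative samples: a Thai consonant (U+0E01), an ASCII letter,
# a Latin-1 letter with circumflex (U+00E2), and a Vietnamese letter
# with two diacritics from Latin Extended Additional (U+1EC7).
samples = "\u0e01a\u00e2\u1ec7"
widths = utf8_widths(samples)

# The same Vietnamese letter in precomposed (NFC) and decomposed (NFD) form.
precomposed = unicodedata.normalize("NFC", "\u1ec7")
decomposed = unicodedata.normalize("NFD", "\u1ec7")
precomposed_bytes = len(precomposed.encode("utf-8"))
decomposed_bytes = len(decomposed.encode("utf-8"))

for ch, width in widths.items():
    print(f"U+{ord(ch):04X}: {width} byte(s)")
print("precomposed:", precomposed_bytes, "decomposed:", decomposed_bytes)
```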
If no one knows these numbers, it would also be helpful to find a set of
English documents plus Thai and Vietnamese translations (either UCS-2 or
UTF-8) somewhere for download, so I can run my own statistics. Preferably
technical documents, but any document will do.
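For reference, the statistics I have in mind could be sketched roughly as below. This is only a sketch under a big assumption: it treats whitespace-separated tokens as words. That is reasonable for English, but Thai is written without spaces between words, and in Vietnamese the spaces separate syllables rather than words, so a real comparison would need a proper word segmenter for those languages. The function name text_stats is just a placeholder.

```python
def text_stats(text):
    """Compute simple per-document statistics.

    Assumption: words are whitespace-separated tokens. This does not hold
    for Thai, and only approximately for Vietnamese.
    """
    words = text.split()
    count = len(words)
    if count == 0:
        return {
            "word_count": 0,
            "distinct_words": 0,
            "avg_chars_per_word": 0.0,
            "avg_utf8_bytes_per_word": 0.0,
        }
    total_chars = sum(len(w) for w in words)
    total_bytes = sum(len(w.encode("utf-8")) for w in words)
    return {
        "word_count": count,
        "distinct_words": len(set(words)),
        "avg_chars_per_word": total_chars / count,
        "avg_utf8_bytes_per_word": total_bytes / count,
    }


english = text_stats("the cat the dog")
vietnamese = text_stats("Vi\u1ec7t Nam")  # precomposed characters
print(english)
print(vietnamese)
```

Running this over an English document and its translation, and comparing the two result dictionaries, would give the four figures listed above.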
Regards,
Oliver Bonten