Discussion:
Word count statistics on Thai/Vietnamese text
Oliver Bonten
2004-05-17 19:49:48 UTC
Hello,

Does anybody have statistical information about average word lengths, word
counts and byte consumption in Thai and Vietnamese text compared to
English (or other western European languages)? Specifically, I'm
interested in
- average word count
- average length of words in characters
- average length of words in UTF-8 bytes (that's easy for Thai, since most
characters take 3 bytes, but I have no idea about Vietnamese, which seems to
use characters with 1-, 2- and 3-byte encodings and also to make liberal
use of diacritical marks)
- average number of *different* words
compared between English text and a Thai/Vietnamese translation of the
same text.

If no one knows these numbers, it would also help to find a collection of
English documents plus Thai and Vietnamese translations (either UCS-2 or
UTF-8) somewhere for download, so I can run my own statistics. Preferably
technical documents, but any document will do.
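
For anyone who wants to run such statistics on their own documents, here is a minimal sketch in Python. It assumes the text is already separated by whitespace, which holds for English and (at the syllable level) for Vietnamese, but not for Thai. The function name `text_stats` and the returned keys are made up for illustration.

```python
def text_stats(text):
    """Whitespace-based word statistics for one document.

    Assumption: tokens are separated by whitespace. Thai text would
    first need a real word segmenter.
    """
    words = text.split()
    n = len(words)
    if n == 0:
        return {"words": 0, "distinct": 0, "avg_chars": 0.0, "avg_utf8_bytes": 0.0}
    return {
        "words": n,                                    # total word count
        "distinct": len(set(words)),                   # number of different words
        "avg_chars": sum(len(w) for w in words) / n,   # code points per word
        "avg_utf8_bytes": sum(len(w.encode("utf-8")) for w in words) / n,
    }


if __name__ == "__main__":
    print(text_stats("the cat the dog"))
    # "Vi\u1ec7t" uses the precomposed letter U+1EC7 (3 bytes in UTF-8)
    print(text_stats("Vi\u1ec7t Nam"))
```

Running it on an English file and on its translation gives the four figures side by side. Note that `len(w)` counts Unicode code points, so the character averages depend on whether the text uses precomposed or combining diacritics.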

Regards,

Oliver Bonten
My Hobby
2004-05-18 15:14:50 UTC
Interesting. I don't have the info you are looking for, just some musings. I
thought Thai was single byte. After all, it has only 44 consonants and 22
vowels plus 4 tone marks and a few other special symbols. Getting stats like
average word length in characters is not easy with Thai, since there are no
spaces between words, so automatic tools that split on white space don't work.
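
To illustrate the segmentation problem, here is a rough Python sketch. It shows that splitting on white space returns a whole Thai phrase as one "word", and then applies a naive longest-match segmenter against a word list. The four-word dictionary is a toy made up for this example, and `longest_match_segment` is a hypothetical helper, not a real library function. Real segmenters use large dictionaries or statistical models.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match segmentation (a naive sketch).

    Characters not covered by the dictionary are emitted one at a time.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (assumption): "I", "love", "language", "Thai"
toy_words = ["ฉัน", "รัก", "ภาษา", "ไทย"]
phrase = "".join(toy_words)  # written without spaces, as Thai normally is

if __name__ == "__main__":
    print(len(phrase.split()))  # whitespace splitting sees a single token
    print(longest_match_segment(phrase, set(toy_words)))
```

Greedy matching picks wrong boundaries whenever a longer dictionary word happens to overlap the start of the next one, so the word counts it produces are only approximate.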
Oliver Bonten
2004-05-18 19:52:08 UTC
Post by My Hobby
Interesting. I don't have the info you are looking for, just some musings. I
thought Thai was single byte. After all, it has only 44 consonants and 22
vowels plus 4 tone marks and a few other special symbols. Getting stats like
average word length in characters is not easy with Thai, since there are no
spaces between words, so automatic tools that split on white space don't work.
You can define a single-byte character encoding for Thai, and I'm sure
someone has done it. But I'm interested in UTF-8, which is a Unicode
encoding. In Unicode, the Thai block is U+0E00 to U+0E7F, which lies in
the range U+0800 to U+FFFF, and that means that in UTF-8 each Thai
character requires 3 bytes.

Oliver
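
The byte counts can be checked with a few lines of Python. This is only a sketch. The helper name `utf8_len` and the sample characters are chosen for illustration.

```python
import unicodedata


def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


# One sample per UTF-8 length class that shows up in Vietnamese and Thai text.
samples = {
    "a": "\u0061",                   # ASCII letter
    "d with stroke": "\u0111",       # Vietnamese, below U+0800
    "e circumflex dot": "\u1ec7",    # Vietnamese, precomposed, above U+0800
    "Thai ko kai": "\u0e01",         # Thai consonant
}

if __name__ == "__main__":
    for name, ch in samples.items():
        print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s)")

    # The same Vietnamese letter written with combining marks (NFD form):
    # base letter plus two combining diacritics.
    decomposed = unicodedata.normalize("NFD", "\u1ec7")
    print(len(decomposed), "code points,", utf8_len(decomposed), "bytes")
```

For Vietnamese the result therefore depends on the normalization form of the text: a precomposed letter with two diacritics takes 3 bytes, while the decomposed sequence takes more.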
