Automatic word segmentation is the process of finding the most likely sequence of words in a sequence of characters written without spaces. The term may also refer to morphological analysis that segments a word itself, for example splitting the word unreadable into the morphemes un, read, and able. A human usually performs word segmentation easily, drawing on accumulated experience. For a machine, however, automatic word segmentation is not an easy task, because the machine has to deal with the ambiguities of language and the complexity of the segmentation process.
Automatic word segmentation can be applied in several domains. For example, it is a vital step in many natural language processing tasks, in particular speech recognition. Word segmentation is also used to recognize words in languages that are written without spaces (e.g., Chinese, Japanese, and Thai), where words are not delimited by white space but must be inferred from the underlying character sequence.
Generally, two types of methods are applied to word segmentation: dictionary-based methods and statistical-based methods. The main advantage of dictionary-based methods is their low complexity; however, they may have much lower accuracy for some languages due to ambiguity. Statistical-based methods, on the other hand, are more accurate but have a higher complexity.
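To make the contrast between the two families concrete, the following minimal sketch segments the same string with a dictionary-based greedy longest-match rule and with a statistical unigram model decoded by dynamic programming. It is an illustration only and not the method proposed in this paper: the toy lexicon COUNTS, its counts, the unknown-word penalty, and the function names segment_dictionary and segment_statistical are all assumptions made for the example.

```python
import math

# Hypothetical toy lexicon with made-up counts (an assumption for this sketch);
# a real system would estimate these from a corpus or a language model.
COUNTS = {"the": 500, "theme": 5, "meeting": 20, "me": 100, "is": 300, "here": 50}
TOTAL = sum(COUNTS.values())
MAX_LEN = max(len(w) for w in COUNTS)


def segment_dictionary(text):
    """Dictionary-based: greedily take the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if piece in COUNTS or size == 1:
                words.append(piece)
                i += size
                break
    return words


def log_prob(word):
    """Unigram log-probability; unknown words get a length-based penalty (assumed)."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return -math.log(TOTAL) - len(word) * math.log(10)


def segment_statistical(text):
    """Statistical: choose the split that maximizes the sum of unigram log-probabilities."""
    # best[end] = (best score of text[:end], start index of the last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment_dictionary("themeeting"))   # ['theme', 'e', 't', 'i', 'n', 'g']
print(segment_statistical("themeeting"))  # ['the', 'meeting']
```

On this input the greedy rule commits to the longer entry "theme" and cannot recover, whereas the statistical decoder compares whole segmentations and prefers "the meeting". The greedy rule takes a bounded number of lookups per position, while the decoder scores every candidate split, which reflects the accuracy and complexity trade-off described above.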
The main contribution of this paper is a novel hybrid method for word segmentation that combines statistical-based and dictionary-based methods in the same framework. The proposed method uses Google language models and the BNC, and it is evaluated using the Brown corpus. The results show that the proposed method gives better accuracy than the maximum length descending frequency and entropy rate methods.
The proposed method is implemented and compared with the maximum length descending frequency and entropy rate methods. Additionally, we perform experiments on several corpora to verify its performance. The results show that the proposed method gives better accuracy.