Today I'd like to highlight the challenges of implementing search functionality in Chinese. In this article, we will go through the main difficulties of implementing full-text search for CJK languages and how to overcome them with the help of Manticore Search.
Chinese search difficulties
The Chinese language belongs to the so-called CJK group of languages (Chinese, Japanese, and Korean). These are probably the most complicated languages to implement full-text search for: in them, word meanings heavily depend on the variations and sequences of characters, and the text is not split up into words.
Chinese language specifics:
- Chinese characters do not have an upper or lower case. Each character has a single written form, regardless of context.
- There are no additional decorations for the letters, as there are, for example, in Arabic.
- There are no spaces between words in sentences.
So what's the point? To find an exact match in a full-text search, we have to face the challenge of tokenization, whose main task is to break the text down into low-level units of meaning that can be searched by the user.
Tokenization/Segmentation of Chinese
To be more specific, tokenization is the process of turning a meaningful piece of data, a word for example, into a unique identifier called a token, which represents that piece of data in the system. In full-text search engines, tokens serve as a reference to the original data but cannot be used to reconstruct the actual values.
In most languages, we use spaces or special characters to divide the text into fragments. In Chinese and the other CJK languages, however, this is not possible due to their morphological properties. Still, we need it to be done, and the process that does it is called segmentation. In other words, for Chinese, segmentation is a prerequisite for tokenization.
Here is an example showing the difference between English and Chinese tokenization. Take the sentence we will use later in this article: the English phrase 'Buy a new Apple computer' consists of five words separated by spaces, while its Chinese equivalent '买新的Apple电脑' is roughly half as long and has no spaces or commas at all; it is not even split into words, as each word is represented by one or a few Chinese characters.
Another challenge is that Chinese characters may have different meanings depending on their sequences and combinations. For example, the combination of the two characters '简单' means 'simplicity', but if we take each of the characters separately, they have their own meanings: '简' (simple) and '单' (single).
In some cases, depending on where you put the boundary between the words, the meaning may differ. A classic example is the phrase '南京市长江大桥', which can be segmented as 南京市 / 长江大桥 ('Nanjing City / Yangtze River Bridge') or as 南京市长 / 江大桥 ('the mayor of Nanjing / Jiang Daqiao'). The problem, as you can see, is that the same group of characters can be segmented differently, resulting in different meanings.
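If you are curious how a real dictionary-based segmenter resolves such an ambiguous phrase, you can feed it to the CALL KEYWORDS statement (introduced below) against an ICU-enabled table, such as the testrt table we configure later in this article; the resulting tokens show where the boundaries were placed:
mysql> CALL KEYWORDS('南京市长江大桥', 'testrt');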
Let's look at possible ways of solving the Chinese tokenization/segmentation problems.
Implementations
There are a few approaches to segmenting Chinese texts, but the main two are:
- N-grams: considers the overlapping groups of 'N' neighboring Chinese characters as tokens, where 'N' can be 1 (uni-grams), 2 (bi-grams), 3 (tri-grams), and so on.
- Dictionary-based: performs word segmentation based on a dictionary.
The easiest way of segmenting Chinese text is to use N-grams. The algorithm is straightforward, but it is known to lack quality and to create considerable overhead that grows with the length of the text being processed: since each N-gram is a separate token, the token dictionary becomes bigger and processing a search query becomes much more complicated. Historically, this was the common way of indexing CJK texts in Manticore Search.
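For comparison, here is a minimal sketch of that historical N-gram approach; the table name testrt_ngram is illustrative, while ngram_chars and ngram_len are the Manticore options that enable N-gram indexing of CJK characters (with ngram_len = '1', every Chinese character becomes a separate token):
mysql> CREATE TABLE testrt_ngram (title TEXT, content TEXT, gid INT) ngram_chars = 'cjk' ngram_len = '1';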
Manticore Search version 3.1.0 introduced a new way of segmenting Chinese texts based on the ICU text segmentation algorithm, which follows the second approach: dictionary-based segmentation.
Benefits of using ICU
ICU is a set of open-source libraries providing Unicode and globalization support for software applications. Along with many other features, it solves the task of determining text boundaries: ICU algorithms locate the positions of words, sentences, and paragraphs within a range of text, or identify locations suitable for line wrapping when displaying the text.
The algorithm of how the ICU segmentation works in Manticore can be briefly described as follows:
- The original text is regarded as an array of symbols.
- Then Manticore traverses the array, and if it finds a sequence of Chinese symbols, it passes that sequence to the ICU library for processing.
- The segmented parts of the Chinese text replace the original, unsegmented parts.
- Other natural language processing algorithms (charset_table, wordforms, etc.) are applied to the modified text as in the common tokenization workflow.
To enable the ICU-Chinese segmentation, the following index configuration options must be set:
- morphology = icu_chinese
- charset_table = cjk / chinese
An example of configuring ICU-Chinese text tokenization
Index configuration
To set up an index for proper Chinese segmentation, there are two index settings we need to touch: 'charset_table' and 'morphology'.
For the charset table, there is a built-in alias 'chinese' which contains all Chinese characters.
First, let's connect to Manticore:
# mysql -P 9306 -h0
And create our table:
mysql> CREATE TABLE testrt (title TEXT, content TEXT, gid INT) charset_table = 'chinese' morphology = 'icu_chinese';
Basic usage
Let's insert into the index a simple mixed Chinese-English sentence '买新的Apple电脑' ('Buy a new Apple computer'):
mysql> INSERT INTO testrt (title, content, gid) VALUES ( 'first record', '买新的Apple电脑', 1 );
To check if our Chinese content is tokenized as expected we can run CALL KEYWORDS against our index:
mysql> call keywords('买新的Apple电脑', 'testrt');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
|    1 | 买        | 买         |
|    2 | 新的      | 新的       |
|    3 | apple     | apple      |
|    4 | 电脑      | 电脑       |
+------+-----------+------------+
We can see that the Chinese text got broken into 4 words, as expected.
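For contrast, if you created the N-gram table sketched earlier, you could run the same check against it; with ngram_len = '1' each Chinese character would come back as its own token (买, 新, 的, 电, 脑) rather than as dictionary words:
mysql> call keywords('买新的Apple电脑', 'testrt_ngram');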
And now let's make a search against our content field with the query phrase 'Apple电脑' ('Apple computer'):
mysql> SELECT * FROM testrt WHERE MATCH ('@content Apple电脑');
+---------------------+------+--------------+----------------------+
| id                  | gid  | title        | content              |
+---------------------+------+--------------+----------------------+
| 5477080993472970753 | 1    | first record | 买新的Apple电脑      |
+---------------------+------+--------------+----------------------+
As we can see, the search was executed successfully and we got the expected result, even though neither the original sentence nor the query phrase had any separators between the words.
We can use the 'SHOW META' command to see extended information about the query and make sure that our sentence was properly segmented into separate words:
mysql> SHOW META;
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| total         | 1      |
| total_found   | 1      |
| time          | 0.199  |
| keyword[0]    | apple  |
| docs[0]       | 1      |
| hits[0]       | 1      |
| keyword[1]    | 电脑   |
| docs[1]       | 1      |
| hits[1]       | 1      |
+---------------+--------+
We see that our search phrase was indeed split up into 'apple' and '电脑' ('computer'), just as we expected.
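As a final check, we can also ask Manticore to highlight the matched words with CALL SNIPPETS; since highlighting uses the same tokenization settings, '电脑' should be highlighted as a whole word (the exact output formatting may vary):
mysql> CALL SNIPPETS('买新的Apple电脑', 'testrt', 'Apple电脑');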
Conclusion
Implementing Chinese full-text search is not an easy task, but Manticore Search copes with it out of the box without any trouble. So if you need a simple and reliable Chinese search for your service or application, you know what to use. You can also find this article in the form of an interactive course here.
Previously published at https://manticoresearch.com/2019/07/17/new-way-of-tokenization-of-chinese/