Techno Blender
Digitally Yours.

Building a Khmer Spelling Checker | by Socret Lee | May, 2022

0 84


Learning about the BK-tree and the Edit Distance (a.k.a Levenshtein Distance)

Image by Author.

Recently, I had the opportunity to work on a Khmer keyboard app. Like any other keyboard app, spelling checking and spelling suggestions are core features of the user experience. While trying to implement these features, I came across a popular method that relies on two concepts to do spell checking: Edit Distance (a.k.a Levenshtein Distance) and the BK-tree. Hence, this post will discuss these concepts and their shortcomings and share a Python implementation of a Khmer Spell Checker that relies on them.

The Edit Distance is the minimum number of steps/operations needed to transform one string into another. There are multiple techniques to calculate the Edit distance. However, the most popular one is the Levenshtein Distance, proposed by a Soviet mathematician, Vladimir Levenshtein. His approach allows for three types of string operations to be factored into the edit distance: Insertion, Deletion, and Substitution/Replacement. In the context of spell checking, the Edit Distance allows us to measure “how different” two words are from one another.

Algorithm

We can efficiently calculate the Edit Distance through dynamic programming. It uses a matrix represented by a 2D array to store the temporary distance between substrings to save computation costs. Let’s look at how we can calculate the Edit Distance between “កាល” and “ក្បាល”.

Step 1: Initialize an empty 2D array of size (len(word1) + 1, len(word2) +1)

Figure 1: Step 1-Initialize the 2D array. Image by Author.

Step 2: Fill in the cells in the first rows with their column index.

Figure 2: Step 2-Fill in the first row. Image by Author.

Step 3: Fill in the cells in the first cols with their row index.

Figure 3: Step 3-Fill in the first column. Image by Author.

Step 4: At position (1, 1), we look at the characters of both words at that position: word1 (ក) and word2 (ក). Since they are the same, we set the current cell value as the value of its top-left neighbor. Therefore, the value of cell (1, 1) becomes 0. Then, we move to the next column in row 1.

Figure 4: Step 4-When the characters of word1 and word2 are the same. Image by Author.

Step 5: At the position (1, 2), we again compare the characters of both words at the position: word1 (ក) and word2 (្). This time the two characters are different. So, the current cell’s value is the min([left, top, top-left]) + 1. Then, we repeat steps 4 and 5 for the remaining cells.

Figure 5: Step 5-When the characters of word1 and word2 are different. Image by Author.

Step 6: After we fill all the cells, the minimum Edit distance is the bottom-right cells of the 2D matrix. Hence, the minimum Edit distance between the word “កាល” and “ក្បាល” is 2.

Figure 6: Step 6-The resulting 2D matrix and the minimum Edit distance. Image by Author.

In addition, to know which combination of edit operations leads to the minimum Edit Distance, we need to perform the shortest path search from the top-left to the lower-right cell. Of course, there can be multiple combinations since there can be multiple shortest paths that can lead to the lower-right cell.

A naive approach to building a spell checker is to find the Edit distance between the word provided by the user to all other words in the dictionary and return those with the smallest Edit distance to that word. However, that approach will take time and calculations if the dictionary is extensive. The BK-tree allows us to perform the search faster by limiting the number of comparisons needed using a tree data structure that utilizes Edit distance. Each node in the tree is a word. Each edge connecting different nodes is the Edit Distance between that node to its parent. To better understand the BK-tree, we can look at an example of how we can build the tree for a dictionary of 4 words: {“ស្គម”, “ស្អាត”, “កាល”, “ក្បាល”}. Then, we will go further by looking at how we can search the tree for matching nodes.

BK-tree Construction Algorithm

Step 1: Add the first word “ស្គម” as the tree’s root node. Then move on to the next word, “ស្អាត”.

Figure 7: Step 1-Add the first word as the root node. Image by Author.

Step 2: Set the root node as the current node. Then calculate the edit distance between “ស្អាត” to the current node. So dist(ស្អាត, ស្គម) = 3. After that, we look at the children of the current node. Since the current node has no children whose distance is equal to dist(ស្អាត, ស្គម), we insert “ស្អាត” as a child of the current node. Then we move to the next word, “កាល”.

Figure 8: Step 2-Add the second word. Image by Author.

Step 3: Set the root node as the current node. Then calculate the edit distance between “កាល” to the current node. So we get dist(កាល, ស្គម) = 4. Similar to step 2, the current node has no children whose distance is equal to dist(កាល, ស្គម). Hence, we insert “កាល” as a child of the current node. Then we move to the next word, “ក្បាល”.

Figure 8: Step 3-Add the third word. Image by Author

Step 4: Set the root node as the current node. Then calculate the edit distance between “ក្បាល” to the current node, which gives us dist(ក្បាល, ស្គម) = 4. This time, the current node has a child whose distance is 4, which is the dist(កាល, ស្គម). In this scenario, we push the insertion process down to that child node by setting it as the current node. After that, we repeat the insertion process with the current node as “កាល”, giving us dist(ក្បាល, កាល) = 2. At this stage, there is no child of “កាល” whose distance is equal to dist(ក្បាល, កាល). Hence, we insert “ក្បាល” as a child of “កាល”.

Figure 9: Step 4-Add the fourth word (When a child has the same Edit Distance). Image by Author

After building the tree, we can search it to find matching nodes as word corrections.

BK-tree Search

We can apply any tree searching algorithm to search the BK-tree. In our case, I opted for the depth-first search. However, one special feature that the BK-tree gives us is the ability to prune/skip a large portion of the nodes in the tree. It achieves this by allowing us to only look at nodes whose Edit distance is within a certain tolerance threshold. More specially, we only look at children of the current node whose edit distance is between [dist(word, current_node)-N, dist(word, current_node) + N], where N is the tolerance level.

Let’s look at what will get returned if a user types in the word “កាក” and our tolerance level N = 2.

Step 1: Initialize an empty stack and an empty matches list.

Figure 10: Step 1-Initialize empty stack and matches. Image by Author.

Step 2: Add the root node (ស្គម) to the stack.

Figure 11: Step 2-Add the first root node to the stack. Image by Author.

Step 3: Get the node at the top of the stack (ស្គម). After that, we find that dist(ស្គម, កាក) = 4 > N. Hence, we do not add it to the matches list. Then look at its children “ស្អាត” and “កាល”. For each child, if dist(ស្គម, កាក) – 2 ≤ dist(child, ស្គម) ≤ dist(ស្គម, កាក) + 2, then we add it to the stack. In this case, both “ស្អាត” and “កាល” are added to the stack since 2 ≤ dist(ស្អាត, ស្គម) = 3 ≤ 6 and 2 ≤ dist(កាល, ស្គម) = 4 ≤ 6.

Figure 12: Step 3-Traversing the tree 1. Image by Author.

Step 4: Get the node at the top of the stack (កាល). Then, dist(កាល, កាក) = 1 ≤ N. So we add it to the matches list. Again we find the children of “កាល” that can be added to the stack. This time only “ក្បាល” can be added since -1 ≤ dist(ក្បាល, កាល) = 2 ≤ 3.

Figure 13: Step 3-Traversing the tree 2. Image by Author.

Step 5: Get the node at the top of the stack (ក្បាល). Then, dist(ក្បាល, កាក) = 3 > N. Hence, we do not add it to the matches list. Since “ក្បាល” does not have any children, then we do not add any more nodes to the stack.

Figure 14: Step 3-Traversing the tree 3. Image by Author.

Step 5: Get the node at the top of the stack (ស្អាត). Then, dist(ស្អាត, កាក) = 4 > N. Hence, we do not add it to the matches list. Since “ស្អាត” does not have any children, then we do not add any more nodes to the stack.

Figure 15: Step 4-Traversing the tree 3. Image by Author.

Step 7: Since there are no more nodes in the stack, the algorithm ends.

Figure 16: Result of the search. Image by Author.

In our example above, the search traversed through all the nodes. However, when the number of words in the dictionary is large, this method will allow us to skip many unnecessary nodes.

This method for spelling checking and suggestion while providing acceptable results has some drawbacks.

  1. The tree construction process can take quite a long time to complete and therefore need to be pre-calculated and saved into a file.
  2. The quality of suggestions depends heavily on the words in the dictionary. If the word is not in the dictionary, it could not be suggested.
  3. It does not have any previous context words, and therefore the suggestions might not fit the overall sentence structure.

After learning about these concepts, I got a chance to write a python package that utilizes them. The first thing that I did was prepare a dictionary of Khmer words. To do that, I downloaded an open-source (MIT License) Khmer words dataset from SBBIC and did some preprocessing to end up with a list of around 40k unique words. Then, I used the below code to construct the BK-tree, save the tree to an XML file, and use it to make spelling suggestions. Furthermore, I also packaged it into a python library that you can install through pip.

Edit Distance

BK-tree Construction

BK-tree Search / Spell Checking and Suggestions

Learning about the Edit distance and the BK-tree allowed me to implement a simple but effective spelling checker and suggestion features. I hope this post provides you with enough understanding to implement these concepts in other programming languages and maybe extend them to more than just a spell checker. Happy learning!

Back To Back SWE. (2019). Edit Distance Between 2 Strings — The Levenshtein Distance (“Edit Distance” on LeetCode) [YouTube channel]. YouTube. Retrieved from https://www.youtube.com/watch?v=MiqoA-yF-0M&t=5s

EpicFactFind. (2019). Spell checking with Levenshtein distance and Bk-Trees [YouTube channel]. YouTube. Retrieved from https://www.youtube.com/watch?v=oIsPB2pqq_8

Jurafsky, D. (n.d.). Minimum Edit Distance [PowerPoint Slides]. Standford University. Retrieved from https://web.stanford.edu/class/cs124/lec/med.pdf

Kumar, N. (2022). BK-Tree | Introduction & Implementation. GeeksforGeeks. Retrieved from https://www.geeksforgeeks.org/bk-tree-introduction-implementation/

Mishra, B, H. (n.d.). Burkhard Keller Tree (BK Tree). OpenGenus. Retrieved from https://iq.opengenus.org/burkhard-keller-tree/

SBBIC. (2017). SBBIC Khmer Word List [Data set]. SBBIC. Retrieved from https://sbbic.org/2010/07/29/sbbic-khmer-word-list/

Singh, A. (2020). BK Trees: Unexplored Data Structure. Towards Data Science. Retrieved from https://medium.com/future-vision/bk-trees-unexplored-data-structure-ec234f39052d

Tushar Roy — Coding Made Simple. (2015). Minimum Edit Distance Dynamic Programming [YouTube channel]. YouTube. Retrieved from https://www.youtube.com/watch?v=We3YDTzNXEk


Learning about the BK-tree and the Edit Distance (a.k.a Levenshtein Distance)

Image by Author.

Recently, I had the opportunity to work on a Khmer keyboard app. Like any other keyboard app, spelling checking and spelling suggestions are core features of the user experience. While trying to implement these features, I came across a popular method that relies on two concepts to do spell checking: Edit Distance (a.k.a Levenshtein Distance) and the BK-tree. Hence, this post will discuss these concepts and their shortcomings and share a Python implementation of a Khmer Spell Checker that relies on them.

The Edit Distance is the minimum number of steps/operations needed to transform one string into another. There are multiple techniques to calculate the Edit distance. However, the most popular one is the Levenshtein Distance, proposed by a Soviet mathematician, Vladimir Levenshtein. His approach allows for three types of string operations to be factored into the edit distance: Insertion, Deletion, and Substitution/Replacement. In the context of spell checking, the Edit Distance allows us to measure “how different” two words are from one another.

Algorithm

We can efficiently calculate the Edit Distance through dynamic programming. It uses a matrix represented by a 2D array to store the temporary distance between substrings to save computation costs. Let’s look at how we can calculate the Edit Distance between “កាល” and “ក្បាល”.

Step 1: Initialize an empty 2D array of size (len(word1) + 1, len(word2) +1)

Figure 1: Step 1-Initialize the 2D array. Image by Author.

Step 2: Fill in the cells in the first rows with their column index.

Figure 2: Step 2-Fill in the first row. Image by Author.

Step 3: Fill in the cells in the first cols with their row index.

Figure 3: Step 3-Fill in the first column. Image by Author.

Step 4: At position (1, 1), we look at the characters of both words at that position: word1 (ក) and word2 (ក). Since they are the same, we set the current cell value as the value of its top-left neighbor. Therefore, the value of cell (1, 1) becomes 0. Then, we move to the next column in row 1.

Figure 4: Step 4-When the characters of word1 and word2 are the same. Image by Author.

Step 5: At the position (1, 2), we again compare the characters of both words at the position: word1 (ក) and word2 (្). This time the two characters are different. So, the current cell’s value is the min([left, top, top-left]) + 1. Then, we repeat steps 4 and 5 for the remaining cells.

Figure 5: Step 5-When the characters of word1 and word2 are different. Image by Author.

Step 6: After we fill all the cells, the minimum Edit distance is the bottom-right cells of the 2D matrix. Hence, the minimum Edit distance between the word “កាល” and “ក្បាល” is 2.

Figure 6: Step 6-The resulting 2D matrix and the minimum Edit distance. Image by Author.

In addition, to know which combination of edit operations leads to the minimum Edit Distance, we need to perform the shortest path search from the top-left to the lower-right cell. Of course, there can be multiple combinations since there can be multiple shortest paths that can lead to the lower-right cell.

A naive approach to building a spell checker is to find the Edit distance between the word provided by the user to all other words in the dictionary and return those with the smallest Edit distance to that word. However, that approach will take time and calculations if the dictionary is extensive. The BK-tree allows us to perform the search faster by limiting the number of comparisons needed using a tree data structure that utilizes Edit distance. Each node in the tree is a word. Each edge connecting different nodes is the Edit Distance between that node to its parent. To better understand the BK-tree, we can look at an example of how we can build the tree for a dictionary of 4 words: {“ស្គម”, “ស្អាត”, “កាល”, “ក្បាល”}. Then, we will go further by looking at how we can search the tree for matching nodes.

BK-tree Construction Algorithm

Step 1: Add the first word “ស្គម” as the tree’s root node. Then move on to the next word, “ស្អាត”.

Figure 7: Step 1-Add the first word as the root node. Image by Author.

Step 2: Set the root node as the current node. Then calculate the edit distance between “ស្អាត” to the current node. So dist(ស្អាត, ស្គម) = 3. After that, we look at the children of the current node. Since the current node has no children whose distance is equal to dist(ស្អាត, ស្គម), we insert “ស្អាត” as a child of the current node. Then we move to the next word, “កាល”.

Figure 8: Step 2-Add the second word. Image by Author.

Step 3: Set the root node as the current node. Then calculate the edit distance between “កាល” to the current node. So we get dist(កាល, ស្គម) = 4. Similar to step 2, the current node has no children whose distance is equal to dist(កាល, ស្គម). Hence, we insert “កាល” as a child of the current node. Then we move to the next word, “ក្បាល”.

Figure 8: Step 3-Add the third word. Image by Author

Step 4: Set the root node as the current node. Then calculate the edit distance between “ក្បាល” to the current node, which gives us dist(ក្បាល, ស្គម) = 4. This time, the current node has a child whose distance is 4, which is the dist(កាល, ស្គម). In this scenario, we push the insertion process down to that child node by setting it as the current node. After that, we repeat the insertion process with the current node as “កាល”, giving us dist(ក្បាល, កាល) = 2. At this stage, there is no child of “កាល” whose distance is equal to dist(ក្បាល, កាល). Hence, we insert “ក្បាល” as a child of “កាល”.

Figure 9: Step 4-Add the fourth word (When a child has the same Edit Distance). Image by Author

After building the tree, we can search it to find matching nodes as word corrections.

BK-tree Search

We can apply any tree searching algorithm to search the BK-tree. In our case, I opted for the depth-first search. However, one special feature that the BK-tree gives us is the ability to prune/skip a large portion of the nodes in the tree. It achieves this by allowing us to only look at nodes whose Edit distance is within a certain tolerance threshold. More specially, we only look at children of the current node whose edit distance is between [dist(word, current_node)-N, dist(word, current_node) + N], where N is the tolerance level.

Let’s look at what will get returned if a user types in the word “កាក” and our tolerance level N = 2.

Step 1: Initialize an empty stack and an empty matches list.

Figure 10: Step 1-Initialize empty stack and matches. Image by Author.

Step 2: Add the root node (ស្គម) to the stack.

Figure 11: Step 2-Add the first root node to the stack. Image by Author.

Step 3: Get the node at the top of the stack (ស្គម). After that, we find that dist(ស្គម, កាក) = 4 > N. Hence, we do not add it to the matches list. Then look at its children “ស្អាត” and “កាល”. For each child, if dist(ស្គម, កាក) – 2 ≤ dist(child, ស្គម) ≤ dist(ស្គម, កាក) + 2, then we add it to the stack. In this case, both “ស្អាត” and “កាល” are added to the stack since 2 ≤ dist(ស្អាត, ស្គម) = 3 ≤ 6 and 2 ≤ dist(កាល, ស្គម) = 4 ≤ 6.

Figure 12: Step 3-Traversing the tree 1. Image by Author.

Step 4: Get the node at the top of the stack (កាល). Then, dist(កាល, កាក) = 1 ≤ N. So we add it to the matches list. Again we find the children of “កាល” that can be added to the stack. This time only “ក្បាល” can be added since -1 ≤ dist(ក្បាល, កាល) = 2 ≤ 3.

Figure 13: Step 3-Traversing the tree 2. Image by Author.

Step 5: Get the node at the top of the stack (ក្បាល). Then, dist(ក្បាល, កាក) = 3 > N. Hence, we do not add it to the matches list. Since “ក្បាល” does not have any children, then we do not add any more nodes to the stack.

Figure 14: Step 3-Traversing the tree 3. Image by Author.

Step 5: Get the node at the top of the stack (ស្អាត). Then, dist(ស្អាត, កាក) = 4 > N. Hence, we do not add it to the matches list. Since “ស្អាត” does not have any children, then we do not add any more nodes to the stack.

Figure 15: Step 4-Traversing the tree 3. Image by Author.

Step 7: Since there are no more nodes in the stack, the algorithm ends.

Figure 16: Result of the search. Image by Author.

In our example above, the search traversed through all the nodes. However, when the number of words in the dictionary is large, this method will allow us to skip many unnecessary nodes.

This method for spelling checking and suggestion while providing acceptable results has some drawbacks.

  1. The tree construction process can take quite a long time to complete and therefore need to be pre-calculated and saved into a file.
  2. The quality of suggestions depends heavily on the words in the dictionary. If the word is not in the dictionary, it could not be suggested.
  3. It does not have any previous context words, and therefore the suggestions might not fit the overall sentence structure.

After learning about these concepts, I got a chance to write a python package that utilizes them. The first thing that I did was prepare a dictionary of Khmer words. To do that, I downloaded an open-source (MIT License) Khmer words dataset from SBBIC and did some preprocessing to end up with a list of around 40k unique words. Then, I used the below code to construct the BK-tree, save the tree to an XML file, and use it to make spelling suggestions. Furthermore, I also packaged it into a python library that you can install through pip.

Edit Distance

BK-tree Construction

BK-tree Search / Spell Checking and Suggestions

Learning about the Edit distance and the BK-tree allowed me to implement a simple but effective spelling checker and suggestion features. I hope this post provides you with enough understanding to implement these concepts in other programming languages and maybe extend them to more than just a spell checker. Happy learning!

Back To Back SWE. (2019). Edit Distance Between 2 Strings — The Levenshtein Distance (“Edit Distance” on LeetCode) [YouTube channel]. YouTube. Retrieved from https://www.youtube.com/watch?v=MiqoA-yF-0M&t=5s

EpicFactFind. (2019). Spell checking with Levenshtein distance and Bk-Trees [YouTube channel]. YouTube. Retrieved from https://www.youtube.com/watch?v=oIsPB2pqq_8

Jurafsky, D. (n.d.). Minimum Edit Distance [PowerPoint Slides]. Standford University. Retrieved from https://web.stanford.edu/class/cs124/lec/med.pdf

Kumar, N. (2022). BK-Tree | Introduction & Implementation. GeeksforGeeks. Retrieved from https://www.geeksforgeeks.org/bk-tree-introduction-implementation/

Mishra, B, H. (n.d.). Burkhard Keller Tree (BK Tree). OpenGenus. Retrieved from https://iq.opengenus.org/burkhard-keller-tree/

SBBIC. (2017). SBBIC Khmer Word List [Data set]. SBBIC. Retrieved from https://sbbic.org/2010/07/29/sbbic-khmer-word-list/

Singh, A. (2020). BK Trees: Unexplored Data Structure. Towards Data Science. Retrieved from https://medium.com/future-vision/bk-trees-unexplored-data-structure-ec234f39052d

Tushar Roy — Coding Made Simple. (2015). Minimum Edit Distance Dynamic Programming [YouTube channel]. YouTube. Retrieved from https://www.youtube.com/watch?v=We3YDTzNXEk

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.

Leave a comment