Burrows Wheeler in Python. The magical algorithm to index and… | by Cody Glickman, PhD | Sep, 2022

By Jessie Hobb On Sep 9, 2022

The magical algorithm to index and compress large string data then find substrings quickly

The Burrows Wheeler Transform (BWT) was developed in 1994 by Michael Burrows and David Wheeler. In simple terms, BWT is a string transformation that acts as a preprocessing step for lossless compression. BWT has implementations that exhibit both a linear O(n) performance and space complexity. Originally designed to prepare data for compression with techniques like bzip2, BWT has found prominence in bioinformatics allowing the fast mapping of short reads paving the way for high throughput genetic sequencing.

In this article, we will implement a simple BWT in python then show how to find small subsegments with mismatches using a streamlined suffix array BWT.

BWT Algorithm:

Rotate letters: apple becomes [‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’, ‘apple’]
Sort rotated words alphabetically: [‘apple’, ‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’]
Take last column: elppa

BWT of apple becomes elppa

Simple Implementation of BWT

Implementation of BWT with Suffix Array for Fuzzy String Searching

In this python implementation, we will be generating a suffix array and use that to perform the BWT.

BWT from Suffix Array Algorithm:

Generate suffix array: apple becomes [5, 0, 4, 3, 2, 1]
Implementation of suffix array BWT: apple + [5, 0, 4,3,2,1] becomes: e$lppa
Find match locations: app in apple returns position [0]
Find imperfect matches: apl in apple with mismatches =1 returns positions [0, 1]

The following code chunk defines functions that will be used to find imperfect matches. The terminal function in the chunk below generate_all(“apple”) returns a tuple of unique letters, the BWT output, the LF Map, the index of unique letters, and the suffix array.

generate_all("apple")

Returns:

({‘a’, ‘e’, ‘l’, ‘p’},
‘e$lppa’,
{‘e’: [1, 1, 1, 1, 1, 1, 1, 0],
‘p’: [0, 0, 0, 1, 2, 2, 2, 0],
‘l’: [0, 0, 1, 1, 1, 1, 1, 0],
‘$’: [0, 1, 1, 1, 1, 1, 1, 0],
‘a’: [0, 0, 0, 0, 0, 1, 1, 0]},
{‘a’: 0, ‘e’: 1, ‘l’: 2, ‘p’: 3},
[5, 0, 4, 3, 2, 1])

Identifying (Fuzzy Matching) Substrings within BWT Data

Here we identify matches in a BWT string by the following algorithm. All searches occur in reverse for example finding app in apple starts by finding “p”, then “pp”, then “app”.

Note: Last First (LF) property is the ith occurrence of a letter [X] in the last column corresponds to the ith occurrence of [X] in the first column.

Find the range of last letter in search string in the first column of the BWT
Look at the same range of last column of the BWT
Find the next character of the search in the LF mapping. Set “next character” column [NC] equal to the mapping entry for the observed range in the LF Mapping matrix. Set the next column [NC+1] in the LF mapping matrix equal to the “next character” value in the last + 1 row of the current range.
Find the range for the “next character” in the first row and use NC & NC+1 to find the right subrange within the “next character” range.

find("app", 'apple', mismatches=0)

Returns: [0]

We also implement a way to identify mismatches in the subsequence. The characters must be the same though.

find("apZ", 'apple', mismatches=1)

Returns: [] “Empty return because Z is not in reference string”

find("ape", 'apple', mismatches=1)

Returns: [0]

The full notebook to perform the suffix array BWT can be found on GitHub.

We the following chunks will demonstrate the usefulness and speed of BWT. We will download a copy of Alice in Wonderland from Project Gutenberg and munge the text to start from Chapter 8. This leaves us with a text document with 61235 characters comprising of 27432 words in 3762 lines (Still plenty large for our purposes). We will then look for the number of times the phrase “Off with her head” is mentioned. We will compare our BWT search with a default python string search.

It takes about 3 seconds to create the suffix array and perform the BWT. Here is where the magic happens after creating the BWT, we can perform searches orders of magnitude faster than a standard string search. Even a fuzzy search is almost 100x times faster than a standard string search. Wow!

Speed performance comparison of string searching. Image by author.

The code for this article can be found on my personal GitHub. The speed to map small genetic sequences to a giant genome string is why BWT searching has become popular with bioinformatics tools like Bowtie and BWA. My name is Cody Glickman and I can be found on LinkedIn. Be sure to check out some of my other articles!

The magical algorithm to index and compress large string data then find substrings quickly

In this article, we will implement a simple BWT in python then show how to find small subsegments with mismatches using a streamlined suffix array BWT.

BWT Algorithm:

Rotate letters: apple becomes [‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’, ‘apple’]
Sort rotated words alphabetically: [‘apple’, ‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’]
Take last column: elppa

BWT of apple becomes elppa

Simple Implementation of BWT

Implementation of BWT with Suffix Array for Fuzzy String Searching

In this python implementation, we will be generating a suffix array and use that to perform the BWT.

BWT from Suffix Array Algorithm:

Generate suffix array: apple becomes [5, 0, 4, 3, 2, 1]
Implementation of suffix array BWT: apple + [5, 0, 4,3,2,1] becomes: e$lppa
Find match locations: app in apple returns position [0]
Find imperfect matches: apl in apple with mismatches =1 returns positions [0, 1]

generate_all("apple")

Returns:

Identifying (Fuzzy Matching) Substrings within BWT Data

Here we identify matches in a BWT string by the following algorithm. All searches occur in reverse for example finding app in apple starts by finding “p”, then “pp”, then “app”.

Note: Last First (LF) property is the ith occurrence of a letter [X] in the last column corresponds to the ith occurrence of [X] in the first column.

Find the range of last letter in search string in the first column of the BWT
Look at the same range of last column of the BWT
Find the next character of the search in the LF mapping. Set “next character” column [NC] equal to the mapping entry for the observed range in the LF Mapping matrix. Set the next column [NC+1] in the LF mapping matrix equal to the “next character” value in the last + 1 row of the current range.
Find the range for the “next character” in the first row and use NC & NC+1 to find the right subrange within the “next character” range.

find("app", 'apple', mismatches=0)

Returns: [0]

We also implement a way to identify mismatches in the subsequence. The characters must be the same though.

find("apZ", 'apple', mismatches=1)

Returns: [] “Empty return because Z is not in reference string”

find("ape", 'apple', mismatches=1)

Returns: [0]

The full notebook to perform the suffix array BWT can be found on GitHub.

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.