
Predicting Drug Resistance in Mycobacterium Tuberculosis Using a Convolutional Network — Paper Review (by Uri Almog, March 2023)



In this post I'm going to review a recent paper at the interface between medical research and machine learning. The paper, Green, A.G., Yoon, C.H., Chen, M.L. et al. A convolutional neural network highlights mutations relevant to antimicrobial resistance in Mycobacterium tuberculosis. Nat Commun 13, 3817 (2022). https://doi.org/10.1038/s41467-022-31236-0, describes two approaches to training neural network models that predict the resistance of a given M. tuberculosis (MTB) strain to 13 antibiotics, based on its genome. This modeling technique has the advantage of producing a saliency map that highlights the features with the greatest effect on the prediction, thereby addressing some of the concerns regarding model explainability.

Tuberculosis (TB) is a leading cause of death worldwide from an infectious pathogen. Its causative agent, M. tuberculosis (MTB), is gradually developing resistance to antibiotics — a process that poses a threat to public health. While empirically testing the resistance of an MTB isolate against a series of antibiotics for every patient may be the most accurate method, it can take weeks to complete and does not enable timely treatment. Molecular diagnosis of the isolate takes only hours or days, but focuses only on specific loci in the genome sequence. Therefore, machine learning models that learn the dependence of the phenotype (drug resistance) on the pathogen's genotype (the structure of the diagnosed loci) may provide the required solution.

Photo by Julia Koblitz on Unsplash

The authors describe two modeling methods: the first, named SD-CNN (Single Drug CNN), trains 13 different CNNs, each predicting resistance to a different drug. The second, named MD-CNN (Multi Drug CNN), predicts resistance to all 13 drugs simultaneously. The insight behind this modeling technique is the pioneering work on multitask learning (Caruana, R. Multitask learning. Mach. Learn. 28, 41–75 (1997)), which showed that, somewhat contrary to intuition, training a CNN to perform different tasks simultaneously may actually improve its performance on each individual task, provided that the tasks are related. The explanation for this result is that the features generated for one task are beneficial to the performance of the other tasks (e.g. training an autonomous car steering model with an auxiliary task of road-sign detection). The advantages of multitask learning in genetics research were demonstrated by Dobrescu, A., Giuffrida, M. V. & Tsaftaris, S. A. Doing more with less: a multitask deep learning approach in plant phenotyping. Front. Plant Sci. 11, 141 (2020).

The data used for training consisted of 10,201 M. tuberculosis isolates that were tested for resistance to 13 antibiotics. The input to the MD-CNN is a 5×18×10,291 array, where 5 is the depth of the one-hot encoding (the four nucleotides adenine, thymine, guanine, cytosine, plus a gap character), 18 is the locus index (the authors use 18 loci with known relevance to drug resistance), and 10,291 is the length of the longest locus. A locus (plural: loci) is a specific, fixed position in a chromosome where a particular gene or genetic sequence is located. A locus is defined by its start index and end index, counting the nucleotides from an agreed starting point. Different loci have different lengths.
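To make the encoding concrete, here is a minimal NumPy sketch of how such an input tensor could be built. This is my own illustration, not the paper's code; the `-` gap symbol and the pad-to-longest-locus behavior are assumptions:

```python
import numpy as np

# One one-hot row per symbol: A, T, G, C, and '-' for a gap/padding position (assumed).
SYMBOLS = {"a": 0, "t": 1, "g": 2, "c": 3, "-": 4}

def encode_loci(loci, max_len=10291):
    """Encode a list of locus sequences into a 5 x n_loci x max_len one-hot array."""
    x = np.zeros((5, len(loci), max_len), dtype=np.float32)
    for j, seq in enumerate(loci):
        for k, base in enumerate(seq.lower()):
            x[SYMBOLS[base], j, k] = 1.0
        x[SYMBOLS["-"], j, len(seq):] = 1.0  # pad short loci with the gap character
    return x

# Toy example: two short "loci" of different lengths (the real input has 18 loci).
toy = encode_loci(["atgc", "ggat"], max_len=8)
print(toy.shape)  # (5, 2, 8)
```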

The input to each of the 13 SD-CNN models consists of the subset of the 18 loci with a known effect on resistance to that drug.

The MD-CNN output is a 13-element vector (indexed by the anti-TB drugs), each element containing a sigmoid value — the confidence that the strain is resistant to the corresponding drug. Each SD-CNN model returns a single sigmoid value corresponding to the confidence of resistance to its drug.

The model is a CNN consisting of two 1-D convolution-and-maxpooling blocks, followed by three fully connected layers. A description is given in Fig. 1.

Fig. 1 — MD-CNN architecture. Conv 1a and 1b kernel dimensions are 5×12 and 1×12. Conv 2a and 2b kernel dimensions are 1×3. Maxpool layer shapes are 1×3. All strides are 1×1. All activation functions are ReLU except in the output layer, where it is a sigmoid. The output dimensions for each layer are given below the graphic representation. The SD-CNN models differ from this image in that their locus dimension isn’t 18 and their output dimension is 1. Image by the author.
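Based on the caption above, here is a minimal PyTorch sketch of one possible reading of the MD-CNN. Treating the 18 loci as input channels, the feature width (a single feature map, per the caption discussion below), and the fully connected sizes are my assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class MDCNN(nn.Module):
    """Sketch of the MD-CNN in Fig. 1 (my reading; width and FC sizes assumed)."""
    def __init__(self, n_loci=18, n_drugs=13, width=1):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: the 5x12 kernel consumes the one-hot axis, then a 1x12 kernel.
            nn.Conv2d(n_loci, width, kernel_size=(5, 12)), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=(1, 12)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),
            # Block 2: two 1x3 kernels (Conv 2a and 2b), then another 1x3 maxpool.
            nn.Conv2d(width, width, kernel_size=(1, 3)), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=(1, 3)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),  # lazy layer infers the flattened size
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_drugs), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, 18 loci, 5 one-hot rows, 10291 positions) -- the paper's
        # 5x18x10291 array with the locus axis moved to the channel slot.
        return self.classifier(self.features(x))

model = MDCNN()
out = model(torch.zeros(2, 18, 5, 10291))
print(out.shape)  # torch.Size([2, 13]) -- one resistance confidence per drug
```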

The SD-CNN and MD-CNN models were benchmarked against each other and against two previous models: Reg+L2 and the SOTA model WDNN (Chen, M. L. et al. Beyond multidrug resistance: leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction. EBioMedicine 43, 356–369 (2019)). Benchmarking was performed using 5-fold cross-validation on the training set.
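For readers unfamiliar with the protocol, a generic 5-fold cross-validation loop looks roughly like the sketch below. This is not the authors' code; `model_factory`, `train`, and `predict` are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cross_validate(X, y, model_factory, train, predict, n_splits=5):
    """Return the mean AUC over n_splits folds of the training set."""
    aucs = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                    random_state=0).split(X):
        model = model_factory()                   # fresh model per fold
        train(model, X[train_idx], y[train_idx])  # fit on 4/5 of the data
        scores = predict(model, X[val_idx])       # score the held-out 1/5
        aucs.append(roc_auc_score(y[val_idx], scores))
    return float(np.mean(aucs))
```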

The tests show that MD-CNN performs on par with WDNN (the current SOTA model, which takes as input a Boolean encoding of known mutations in the genome sequence and is designed as a combination of multilayer perceptrons, i.e. it does not use convolutions). The mean AUC of MD-CNN was 0.948 (compared to 0.960 for WDNN) on 1st-line drugs, and 0.912 (compared to 0.924 for WDNN) on 2nd-line drugs. SD-CNN was slightly less accurate, with 0.888 for both drug groups. MD-CNN and SD-CNN showed an ability to generalize to new data, achieving approximately the same AUC on a separately collected test set of 12,848 isolates. For a graphical comparison of the models, see the original paper.

The authors note that the MD-CNN model achieved higher sensitivity than the SD-CNN models (i.e. a smaller miss rate of drug resistance), while the SD-CNN models achieved higher specificity (i.e. a smaller rate of wrongly classifying an isolate as resistant to a given drug). In other words — the MD-CNN is less conservative, tending to classify more cases as ‘resistant’.
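As a reminder, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). A small sketch of this trade-off at a fixed threshold (my illustration, not the paper's evaluation code):

```python
import numpy as np

def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Compute sensitivity and specificity of binary predictions at a threshold."""
    y_pred = np.asarray(y_score) >= threshold
    y_true = np.asarray(y_true).astype(bool)
    tp = np.sum(y_pred & y_true)    # resistant, predicted resistant
    fn = np.sum(~y_pred & y_true)   # resistant, missed
    tn = np.sum(~y_pred & ~y_true)  # sensitive, predicted sensitive
    fp = np.sum(y_pred & ~y_true)   # sensitive, wrongly flagged resistant
    return tp / (tp + fn), tn / (tn + fp)

# A less conservative model (lower effective threshold) gains sensitivity and
# loses specificity -- the MD-CNN vs. SD-CNN behavior described above.
sens, spec = sensitivity_specificity([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(sens, spec)  # 0.5 0.5
```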

Analysing the SD-CNN performance, the authors looked into the false negative cases. Upon examining the data, they observed that isolates with identical model inputs were in some cases resistant and in other cases sensitive to the same drug (i.e. their ground-truth classifications differed). This led the authors to hypothesize that mutations in loci not incorporated in the SD-CNN model are responsible for the resistance.

The authors use DeepLIFT (Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning — Volume 70 (ICML'17). JMLR.org, 3145–3153.), a method for calculating the contribution of input features to the output, to explain the model's predictions. By varying the genotype input in silico (i.e. simulating an input) and comparing the result to a 'reference result', the authors find variants that were previously not known to affect MTB drug resistance.
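The Captum library ships one implementation of DeepLIFT; the sketch below shows how a single drug's prediction could be attributed back to input positions, reusing the MDCNN sketch from above. Captum is my choice here, not necessarily the authors' tooling, and the all-gap baseline is an assumption:

```python
import torch
from captum.attr import DeepLift

model = MDCNN()  # the hypothetical sketch defined earlier
model.eval()

x = torch.zeros(1, 18, 5, 10291)
x[:, :, 0, :] = 1.0         # toy isolate: every position reads 'A'
baseline = torch.zeros_like(x)
baseline[:, :, 4, :] = 1.0  # reference input: all gaps (the baseline choice matters)

_ = model(x)                # materialize the lazy layer before attribution
dl = DeepLift(model)
# Attribution of drug #0's sigmoid output to every input position.
attr = dl.attribute(x, baselines=baseline, target=0)
print(attr.shape)  # same shape as the input: (1, 18, 5, 10291)
```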

As a machine learning engineer and researcher focused mostly on computer vision, I learned a lot from reading this paper and the relevant background material. It is evident that neural networks have great potential to improve modeling techniques in the realm of medicine and biology. Comparing the techniques used in this model with my own experience, I thought of several things I would be interested in trying if I were working on a second phase of this research:

  1. Gap encoding — The four nucleotides are encoded in a one-hot encoding, plus an extra element representing a gap. I am curious to see whether the results improve if the gap representation is changed to simply [0, 0, 0, 0].
  2. Feature depth — The architecture presented here uses a single feature throughout the model. My intuition from computer vision makes me curious about the possibilities of feature diversification. Just as in computer vision the training process may converge so that a single position in the image is described by multiple features (e.g. 'roundness', 'metallicity', 'smoothness'), I suspect the same may hold for genomic sequences.
  3. Padding type — The authors use 'valid' padding in their convolution layers, as opposed to the 'same' padding commonly used in computer vision. This gradually shortens the sequence as it passes between layers. 'Same' padding preserves the spatial size of the sequence, allowing structures near the sequence edge to retain some effect even in later stages of the model. It also enables operations such as concatenating outputs from layers at different stages of the model (see the short illustration after this list).
  4. Attention mechanism — (Vaswani et al., Attention Is All You Need, 2017, NIPS) — Attention blocks are useful for finding subtle relationships between remote tokens in a sequence (e.g. different parts of a sentence in NLP), and they are especially relevant when the value of one token may have a significant effect on the interpretation of another token's value. It would be interesting to see whether adding an attention block improves results and, if so, to use it to retrace the hidden relationships between areas in the loci (a sketch follows this list).
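To illustrate the padding point (3), here is a quick, generic comparison of 'valid' vs. 'same' padding and their effect on sequence length — my illustration, not the authors' configuration:

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 1, 100)                                # a toy length-100 sequence
valid = nn.Conv2d(1, 1, kernel_size=(1, 12))                 # no padding: length shrinks
same = nn.Conv2d(1, 1, kernel_size=(1, 12), padding="same")  # zero-pad: length preserved
print(valid(x).shape[-1], same(x).shape[-1])  # 89 100
```

Because 'same' padding keeps every layer's output aligned with the input positions, outputs from different depths can be concatenated channel-wise, U-Net style, which 'valid' padding makes awkward.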
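And for the attention point (4), here is a minimal sketch of a self-attention block that could hypothetically sit between the convolutional and fully connected stages, treating each surviving sequence position as a token. The token dimension, head count, and placement are all my assumptions:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Self-attention over sequence positions, with a residual connection."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):               # tokens: (batch, positions, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)  # residual connection + layer norm

# Each position surviving the conv/pool stages becomes a token; the channel
# depth is projected up to 16 here so multi-head attention has room to work.
tokens = torch.randn(2, 1139, 16)
block = SelfAttentionBlock(dim=16)
print(block(tokens).shape)  # torch.Size([2, 1139, 16])
```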

