Conversion to Audio May Improve Results: Using Siamese Networks for Nickname Classification

by David Harar, November 2022


Using siamese networks to learn the similarity between names and nicknames. Converting text to speech improves results.

Photo by Waldemar Brandt on Unsplash

In this article I use siamese networks to learn a similarity measure between names and nicknames. The trained model tells us how plausible a nickname is for a given name (assuming the name came first). I explore different architectures, and the experiments show that feeding the network spectrograms of text-to-speech (TTS) renderings of the name-nickname pairs improves results while requiring a much smaller network.

This project is an interesting case study because names are "context-free" texts, which lets us test the effect of converting text to audio on performance in isolation from other factors. In sentiment analysis, for example, different intonations can alter the whole perception of a given text, so converting sentences to speech might introduce artifacts.

This article isn’t meant to serve as an introduction to siamese networks or contrastive loss. Nevertheless, I added links to such introductions, and the curious reader is kindly referred to any of the links below.

For more details about the experiments below, please see the project’s GitHub page here.

Introduction

Learning the similarity between two elements is important in many applications. Every time you use face recognition on your phone, your image is compared with stored images. Images of the same person may differ depending on lighting, angle, facial hair, and more, all of which make this comparison a non-trivial task.

Similarity can be measured with a simple distance metric, as in KNN, but that only takes us part of the way when facing sophisticated data. This is where the siamese network architecture comes to our aid. In a siamese network, two (or three) inputs are passed through the same weights; instead of computing a distance between the inputs themselves, we measure the distance between their embeddings and feed it into an appropriate loss. This lets us push the embeddings of inputs from different classes further apart while pulling the embeddings of inputs from the same class toward each other.
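A minimal PyTorch sketch of this idea (my own illustration, not the project's code): a single encoder embeds both inputs with shared weights, and the loss acts on the distance between the embeddings. The toy encoder and input shapes below are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, encoder: nn.Module):
        super().__init__()
        # A single encoder object: both inputs share the same weights.
        self.encoder = encoder

    def forward(self, x1, x2):
        e1 = self.encoder(x1)
        e2 = self.encoder(x2)
        # Euclidean distance between the two embeddings, fed into the loss.
        return F.pairwise_distance(e1, e2)

# Hypothetical toy encoder: flatten the input and project it to a 64-d embedding.
encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(64))
model = SiameseNet(encoder)
dist = model(torch.randn(8, 1, 28), torch.randn(8, 1, 28))  # shape: (8,)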

Siamese networks have some interesting use cases. They have been used to detect scene changes (Baraldi, Grana and Cucchiara (2015) [1]). In this excellent blog post, Lei uses a siamese network for clustering MNIST digits. By learning similarity (and thus dissimilarity) between different kinds of inputs, siamese networks are also used for zero/few-shot learning, where we try to say whether a new input is similar to inputs the network has already seen. For a more extensive overview of that subject, visit this blog post.

While most online examples use siamese networks for computer vision, they also have some cool applications in the NLP domain. Huertas-García et al. (2021) [2] use siamese networks for fact-checking, building a semantic-aware model that can assess the level of similarity between two texts, one of which is a verified fact. Gleize et al. (2019) [3] use siamese networks to find more convincing arguments. Their data contains both same-stance pairs of sentences and cross-stance pairs (i.e., one supporting and the other contesting the topic), and their network's task is choosing which side of the debate is more convincing. Neculoiu, Versteegh and Rotaru (2016) [4] present a deep architecture for learning a similarity metric on variable-length character sequences. Their model is applied to learning the similarity between job titles ("java programmer" should be closer to "java developer" but further away from "HR specialist"). Their architecture consists of four BLSTM layers followed by a linear layer, and they use a contrastive loss on the cosine similarity between the embeddings. This was my initial architecture, but it didn't perform well, probably due to the small amount of data.

On Nicknames

A nickname is a substitute for the proper name of a familiar person, place or thing. Commonly used to express affection, a form of endearment, and sometimes amusement, it can also be used to express defamation of character. (Source: wikipedia, here)

A nickname may include the name or be included in it. Sometimes the connection between the name and the nickname isn't immediately apparent: the medieval English loved rhyming, so "Robert" became "Rob," and then, since it rhymes with "Bob," that nickname stuck as well. See Oscar Tay's fascinating answer to this Quora question.

“Bill” Shakespeare, Wikipedia, here

In a substantial number of cases, one can see substrings of the name in the nickname and vice versa. It is also common to change vowels while keeping the sound of the nickname similar ("i" to "ea", for example). Another standard change is replacing the end of the name with "ie" ("Samantha" to "Sammie," for example).

The task of creating a similarity measure between names and nicknames is, first and foremost, a way to have fun while learning. Still, such a model has possible use cases: trust assessment, for example how likely it is that the nickname or username a person uses in an online account matches the name associated with a transaction; and password strength assessment, since using one's own name may lead to a weak password, and using one's own nickname may do so as well.

Data

The data for this project was created using a few sources:

  1. Male/Female diminutives from here
  2. Secure Open Enterprise Master Patient Index (SOEMPI), here
  3. common_nickname_csv, here

The sample is small, which strongly limits the network's size and its ability to learn and generalize. The following table describes it.

Sample sizes

The most naive approach for checking whether a nickname is plausible for a given name would be substring matching. In some cases the name is fully contained in the nickname, and in others the nickname is fully contained in the name. Surprisingly, the second is much more common (2.2% vs. 26.5%!).
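A rough sketch of this baseline (my own illustration of the substring check, not the project's exact code):

def substring_match(name: str, nickname: str) -> bool:
    """Naive baseline: the pair is plausible if either string contains the other."""
    name, nickname = name.lower(), nickname.lower()
    return name in nickname or nickname in name

print(substring_match("Samantha", "Sam"))   # True  (nickname contained in name)
print(substring_match("Ann", "Annie"))      # True  (name contained in nickname)
print(substring_match("Robert", "Bob"))     # False (no substring relation)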

Experiments

Experiment 1: First, I implemented the model from [4]. The model consists of four bidirectional LSTM (BLSTM) layers with 64-dimensional hidden states. The hidden states of the last BLSTM layer (one forward and one backward) are averaged over time ("temporal average"), and the resulting 64-dimensional vector is passed through a dense layer. In the original paper the two inputs are job titles; in our case they are names and nicknames.

The BLSTM model inspired by [4]
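A rough PyTorch sketch of this encoder, assuming character-level integer-encoded inputs padded to a fixed length; the layer sizes follow the description above, while the rest is my own approximation rather than the project's exact code.

import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden, padding_idx=0)
        # Four stacked bidirectional LSTM layers with 64-d hidden states.
        self.blstm = nn.LSTM(hidden, hidden, num_layers=4,
                             bidirectional=True, batch_first=True)
        self.dense = nn.Linear(hidden, hidden)

    def forward(self, chars):                   # chars: (batch, seq_len) int ids
        out, _ = self.blstm(self.embed(chars))  # (batch, seq_len, 2 * hidden)
        # Average the forward and backward directions, then average over time.
        fwd, bwd = out.chunk(2, dim=-1)
        temporal_avg = ((fwd + bwd) / 2).mean(dim=1)   # (batch, hidden)
        return self.dense(temporal_avg)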

Experiment 2: Making an LSTM network work is challenging, and it may be impossible with the amount of data we have. I therefore used a 1D-CNN siamese network on the texts. One limitation of using a CNN on sequential data (such as matching two sequences of letters) is that, by itself, it doesn't preserve information about where each element sits in the sequence. To account for that, I use both a letter embedding and a positional embedding (Gehring et al. (2017) [5]).

Model for the second experiment. In the first three blocks, ReLU was used as the activation function; in the fourth and last block, a sigmoid was used.
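A minimal sketch of the letter-plus-positional embedding idea (a learned positional table added to the character embedding, in the spirit of [5]); the dimensions here are illustrative, not the project's actual settings.

import torch
import torch.nn as nn

class CharPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size: int, max_len: int, dim: int = 32):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.pos_embed = nn.Embedding(max_len, dim)

    def forward(self, chars):                    # chars: (batch, seq_len) int ids
        positions = torch.arange(chars.size(1), device=chars.device)
        # Sum of "which letter this is" and "where it sits in the name".
        return self.char_embed(chars) + self.pos_embed(positions)  # (batch, seq_len, dim)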

Experiment 3: Since nicknames usually arise through social exchange, some name-nickname pairs sound more similar than they look in writing. For example, "Relief" and "Leafa" are a pair in the training data. While the "lief" part of the name is written differently from the "Leaf" part of the nickname, they sound very similar. To account for that, I convert all the names and nicknames to speech with Google's gTTS library and then convert the resulting .mp3 files to spectrograms using the librosa package.
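A rough sketch of this conversion step, assuming gTTS and librosa are installed (and ffmpeg/audioread for mp3 decoding); the filename and spectrogram parameters are my own choices, not necessarily the project's.

import numpy as np
import librosa
from gtts import gTTS

def name_to_spectrogram(name: str, path: str = "tmp.mp3") -> np.ndarray:
    # Render the name as speech and save it as an mp3 file.
    gTTS(text=name, lang="en").save(path)
    # Load the audio and build a log-mel spectrogram.
    y, sr = librosa.load(path, sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(mel, ref=np.max)   # (64, time) image for the 2D-CNN

spec = name_to_spectrogram("Robert")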

The model for the third experiment. D(A,B) is the Euclidean distance between the two elements.

In both experiments 2 and 3, the hidden 1D/2D-CNN layers are constructed so that they return tensors of the same dimensions as their input. The input and the output are then added together in a ResNet fashion (He et al. (2016) [6]).
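A minimal sketch of such a same-shape residual block for the 1D case (kernel size and channel count are illustrative):

import torch.nn as nn

class ResidualConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # "Same" padding keeps the output length equal to the input length,
        # so the skip connection can simply add input and output.
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.act = nn.ReLU()

    def forward(self, x):                  # x: (batch, channels, seq_len)
        return self.act(x + self.conv(x))  # ResNet-style addition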

Experiment 4: Lastly, as a benchmark, I compare the above models with non-learning methods such as Jaro, Jaro-Winkler and Levenshtein distance (see Cohen, Ravikumar and Fienberg (2003) [7] for more information).
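One way to compute these string measures is the jellyfish package (function names as in recent jellyfish releases; this is an illustration, not necessarily the implementation used in the project):

import jellyfish

name, nickname = "Margaret", "Meggie"
print(jellyfish.jaro_similarity(name, nickname))          # Jaro similarity in [0, 1]
print(jellyfish.jaro_winkler_similarity(name, nickname))  # rewards matching prefixes
print(jellyfish.levenshtein_distance(name, nickname))     # number of single-character edits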

Results Comparison

The unfortunate result is that the non-learning algorithms performed better than any of the networks (but where is the fun in that!). The BLSTM couldn't learn anything, even after attempts to shrink it (removing some of the BLSTM layers and outputting a much shorter hidden state). The 1D-CNN got some decent results, but the 2D-CNN got better results with a much smaller number of parameters.

Eventually, Jaro-Winkler got the best results.

One possible explanation for these results is that the data was simply too small to train such models in the first place.

Nevertheless, both networks were able to learn, and more than just the simple cases where the nickname is included in the name or vice versa.

Results Analysis

The results in this part are from the 1D-CNN model. It is interesting to see which cases were classified correctly, but a careful analysis of the incorrect classifications can tell us more about the similarity that was actually learned.

Correct cases: The table below presents the pairs that were classified correctly. Unsurprisingly, most of the true-positive cases are easy ones, such as where the name is included in the nickname or vice versa. Similarly, the lowest-scored true-negative pairs are very different both in spelling and in how they sound, with the single exception of "Kimberly" and "Becki" (I could have believed that "Becki" is a nickname for "Kimberly", but maybe it is just me).

Incorrect cases: The more interesting group. Most of the false-positive cases are reasonable pairs. One could believe that "Mattie" (unlike "Maty", which may sound different) is an actual nickname for "Martha". The same goes for "Mary" and "Margy". "Allie" and "Margaret" are particularly interesting cases: the data at hand includes the pair "Margaret" and "Meggie", while "Allie" as a name has only "Ali" as a nickname, and "Allie" as a nickname belongs to "Alan", "Alice" and more. These examples may imply that the network was able to learn some of the underlying logic connecting names and nicknames. On the other hand, looking at the 10 lowest-scored false-negative cases is saddening, as many of the names in this group include their nicknames or vice versa.

For more examples around the decision boundary, see the appendix.

Appendix: Loss Function, Training and Refining

Loss function: During my experiments I used several loss functions. First I used the loss function described in [4], and then went on to explore other options. I used contrastive loss, inspired by this implementation, and also a BCE loss, inspired by [8], who used a weighted average of contrastive loss and BCE loss for epileptic seizure prediction. Overall, the contrastive loss achieved slightly worse results in terms of ROC AUC than the BCE loss, but its goal is a bit different: contrastive loss aims to discriminate between the features of the input vectors, pushing the scores of negative pairs towards zero as alpha in the following equation decreases.

Contrastive loss, notation was taken from here.
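For reference, a sketch of the standard pairwise contrastive loss on the embedding distance d with margin alpha, written in PyTorch; the project's exact variant and notation may differ.

import torch

def contrastive_loss(distance: torch.Tensor, label: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """label = 1 for a true (name, nickname) pair, 0 for a negative pair."""
    positive = label * distance.pow(2)
    # Negative pairs are only penalized while they are closer than the margin alpha.
    negative = (1 - label) * torch.clamp(alpha - distance, min=0).pow(2)
    return (positive + negative).mean()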

I did not explore triplet loss in this project; a follow-up project could extend it in that direction. See this post for a more detailed comparison between BCE loss, contrastive loss, and triplet loss.

The figure below shows how well the contrastive loss can discriminate between the classes.

Scores distribution by loss type

Training: In a siamese network, the two inputs go through the same network (in other words, we use shared weights on them). This is helpful when we don't know which input will arrive first. In our case, by construction, the pairs always come in the same order, (name, nickname). Therefore, using different weights to encode each of them slightly differently, while still keeping the Euclidean distance between matching encodings small, might improve results. And it did: using different weights for names and nicknames improved results, but also doubled the number of parameters of the network. Future work could assess the benefit of non-shared weights by comparing such an unrestricted network with a deeper shared-weight network that has a similar number of parameters.
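A sketch of that variant, reusing the shared-encoder idea from above but with two separate encoders, one per input role; this is my illustration of the design choice, not the project's exact code.

import copy
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricSiamese(nn.Module):
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.name_encoder = encoder
        # Same architecture, but separate weights for the nickname branch.
        self.nickname_encoder = copy.deepcopy(encoder)

    def forward(self, name, nickname):
        return F.pairwise_distance(self.name_encoder(name),
                                   self.nickname_encoder(nickname))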

Refine: Following Prof. Laura Leal-Taixé's suggestion in this excellent lecture on siamese networks, the training of a siamese network can be refined with these steps:

  1. Train the network for a few epochs
  2. Denote by d(A,P) the distance between the anchor A and a positive example P (same class), and by d(A,N) the distance between the anchor and a negative example N (different class). In the second step, keep only the hard cases, i.e. pairs where these two distances are similar (see the sketch after this list).
Similar distance between the anchor A (given name), a positive example P (true nickname), and a negative example N (incorrect nickname). Notation from Prof. Laura Leal-Taixé's lecture here

3. Train only on the hard cases.
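A rough sketch of that mining step. The margin value and the (anchor, positive, negative) layout are hypothetical, so treat this as a schematic rather than the project's actual refinement code.

from typing import Callable, List, Tuple

def mine_hard_cases(triplets: List[Tuple[str, str, str]],
                    dist: Callable[[str, str], float],
                    margin: float = 0.2) -> List[Tuple[str, str, str]]:
    """Keep (anchor, positive, negative) triplets where the negative is almost
    as close to the anchor as the positive, i.e. d(A,N) < d(A,P) + margin."""
    return [t for t in triplets
            if dist(t[0], t[2]) < dist(t[0], t[1]) + margin]

# Example with a toy stand-in distance: keep only the confusing triplets.
toy_dist = lambda a, b: abs(len(a) - len(b))   # placeholder distance function
hard = mine_hard_cases([("Margaret", "Meggie", "Bob")], toy_dist)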

Refining the network achieved slightly better results, but it did not change the picture dramatically.

Decision boundary: The plot below presents the score distributions for each class. The distributions were normalized separately so we can look at the data as if it were balanced. I chose a decision boundary of 0.4.

Below are 10 samples from below the threshold, followed by the 10 examples closest to the threshold, and then another 10 examples from the area above it.

We see that the examples closest to 0.3 are mostly invalid pairs, four of the 10 pairs closest to the decision threshold are valid, and seven of the 10 pairs closest to 0.5 are valid. Moreover, in most of the positive examples the names don't include the nicknames or vice versa.

Bibliography

[1] Baraldi, Lorenzo, Costantino Grana, and Rita Cucchiara. “A deep siamese network for scene detection in broadcast videos.” Proceedings of the 23rd ACM international conference on Multimedia. 2015.

[2] Huertas-García, Álvaro, et al. “Countering misinformation through semantic-aware multilingual models.” International conference on intelligent data engineering and automated learning. Springer, Cham, 2021.

[3] Gleize, Martin, et al. “Are you convinced? choosing the more convincing evidence with a Siamese network.” arXiv preprint arXiv:1907.08971 (2019).

[4] Neculoiu, Paul, Maarten Versteegh, and Mihai Rotaru. “Learning text similarity with siamese recurrent networks.” Proceedings of the 1st Workshop on Representation Learning for NLP. 2016.

[5] Gehring, Jonas, et al. “Convolutional sequence to sequence learning.” International conference on machine learning. PMLR, 2017.

[6] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[7] Cohen, William W., Pradeep Ravikumar, and Stephen E. Fienberg. “A Comparison of String Distance Metrics for Name-Matching Tasks.” IIWeb. Vol. 3. 2003.

[8] Dissanayake, Theekshana, et al. “Patient-independent epileptic seizure prediction using deep learning models.” arXiv preprint arXiv:2011.09581 (2020).

