
SourceCodeAI — how to handle Train-Inference mismatch | by Ori Abramovsky | May, 2022

Photo by Alex Dumitru from Pexels

Source code AI has many characteristics that differentiate it from more general NLP applications (such as the common practice of heavily processing the input before feeding it to the model). One of its main challenges is that while generating source code training datasets seems easy ('just crawl Github'), the reality hides many pitfalls. Chief among them is the fact that the highly available sources (Github, Bitbucket or Stackoverflow) commonly differ from the inference data seen in production, which can hurt the model's real-world performance. More on this phenomenon, and how to avoid it, ahead.

The first ingredient of any machine learning application is a working dataset. For source code applications such datasets can easily be built from public source code hosting services like Github. Assume, for example, that we want to develop an app that finds PII left unintentionally in source code snippets. To generate a training dataset we crawl Github, actively searching for relevant terms (like 'credit card number =', to make sure we have enough positive examples). We then train a model on that dataset, only to find a dramatic performance drop on our internal repositories. What went wrong? A deeper look reveals that the repositories we used for train-test-validation differ greatly from our internal ones. Github repositories, for example, can be owned by an organisation or by a single user, and can have public or private scope (visible only to the account members). Looking at these populations, there is an inherent tension between private and public repos: public code is the easily available resource, but the real target of the AI applications we train is usually internal code, and it is fair to assume such code looks different. Take your organisation's internal code snippets and compare them to random Github snippets; consider features like the level of documentation and the rate of best practices in use. How likely are you to meet a monolith, environment variables, or a well-encapsulated API? How likely are you to find AWS integrations in database related snippets in random Github repositories? It becomes even more complicated given Github's many possible sub populations. While we would expect a public, company owned repository to be an SDK, an open source project or a general example (otherwise, why make it public?), and a public, user owned repository to be a POC or a side project (otherwise, why not place it in the company's repositories?), the reality is that Github has it all; a quick Github search reveals user owned repositories that look company owned, and vice versa. The main issue with this phenomenon is the difference in distributions: how likely is it to find PII (for our example) in a public Github repository, and how likely is it in your internal company repositories? To better understand how such a gap can affect model performance, let's look at a systematic analysis that estimated the importance of this difference for another source code AI use case: code autocomplete.
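To make the data collection step concrete, here is a minimal sketch of harvesting candidate positive examples through Github's REST code search endpoint. The token, query terms and post-processing are placeholders rather than a production recipe.

```python
# A minimal sketch of collecting candidate positive examples via Github's
# code search API. The token, query string and what we do with the results
# are illustrative assumptions.
import requests

GITHUB_TOKEN = "<your-token>"          # code search requires authentication
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {GITHUB_TOKEN}",
}

def search_code(query: str, per_page: int = 30, page: int = 1) -> list[dict]:
    """Return raw code-search items for a query like '"credit card number" language:python'."""
    resp = requests.get(
        "https://api.github.com/search/code",
        headers=HEADERS,
        params={"q": query, "per_page": per_page, "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

# Collect candidate snippets that are likely to contain our target terms.
items = search_code('"credit card number" in:file language:python')
for item in items:
    # Each item points at a file; fetching its raw content is a separate call.
    print(item["repository"]["full_name"], item["path"])
```

Note that everything this returns is, by construction, public code, which is exactly the bias the rest of the article is about.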

Code autocomplete is a trendy topic in the source code AI world. While it has traditionally been a vital part of almost every IDE, more recently solutions like Github Copilot and Tabnine have started to offer competing AI based alternatives. Hellendoorn et al (2019) highlight an inherent issue with these developments: autocomplete models are commonly trained on synthetically made datasets, being asked to fill in randomly removed tokens. There are two main issues with this approach. First, committed code commonly differs from how code looks while it is being written (a partial, possibly not even working, context versus a ready-to-commit one). Second, in reality not all tokens are equally likely to be the ones users ask the app to autocomplete. From a model perspective, short, repetitive, easy to predict tokens are more beneficial to learn than longer and rarer ones; but in practice it is the longer, rarer tokens that are more relevant to autocomplete. Not surprisingly, the research found a dramatic performance drop in real usage versus what was reported in the relevant papers. Interestingly, many of the inference time completions were within-project APIs, invisible by definition to models trained on general populations (which can be addressed with a 'tuning' step per population, similar to the ULMFiT practice for transfer learning). Such findings make us question how well offline numbers predict these models' accuracy in real world usage, leaving us in a scenario where we can't really trust the offline metrics.
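To illustrate the gap, here is a toy version of the synthetic training setup the paper criticizes: taking already-committed code and hiding a randomly chosen token. The whitespace tokenization and the <HOLE> marker are simplifications for illustration only.

```python
# A toy version of the synthetic autocomplete setup: take committed code and
# blank out one randomly chosen token, asking the model to restore it.
import random

def make_synthetic_example(snippet: str, rng: random.Random) -> tuple[str, str]:
    """Blank out one random token and return (context_with_hole, target_token)."""
    tokens = snippet.split()
    idx = rng.randrange(len(tokens))
    target = tokens[idx]
    context = " ".join(tokens[:idx] + ["<HOLE>"] + tokens[idx + 1:])
    return context, target

rng = random.Random(0)
committed_line = "connection = psycopg2.connect(host=db_host, user=db_user)"
context, target = make_synthetic_example(committed_line, rng)
print(context, "->", target)

# In real usage the 'hole' is almost always at the cursor, the left context is
# unfinished, and the requested tokens skew towards long, rare identifiers;
# a very different distribution from uniformly sampled committed tokens.
```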

Now that we have learned that train-inference mismatch will probably hurt autocomplete applications, is it possible to get good inference results for such use cases at all? Following Hellendoorn's findings, Facebook (Gareth et al, 2020) looked into that question and found that training such models on real autocomplete examples significantly improved their results. Interestingly, they found not only a performance drop (for models trained on synthetic versus real examples) but also a usage drop; an internal A/B test revealed that models trained on real data ended up gaining higher usage. Moreover, the drop appeared even in state of the art models like GPT-2 (interestingly, BPE tokenization had the smallest performance drop, probably because it had the lowest rate of out-of-vocabulary tokens, which they note fits Karampatsis et al's finding that 'subtoken encoding can be especially helpful when developer activity datasets are unavailable for model training'). Not surprisingly, the analysis revealed that accepted autocomplete tokens were on average longer (harder to remember) than typical committed tokens. The conclusion from these papers is that we should make sure our training dataset highly resembles the inference one (or is at least close enough to how the inference data looks from the model's perspective).
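As a small aside on the tokenization point, here is a minimal sketch (using the Hugging Face tokenizers library, with a made-up three-line corpus) of why subtoken encodings like BPE keep out-of-vocabulary rates low: an identifier never seen at training time still decomposes into familiar pieces instead of a single unknown token.

```python
# A minimal sketch of why subtoken (BPE) encoding keeps out-of-vocabulary
# rates low. The tiny corpus is made up for illustration.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [
    "def get_user_name(user_id):",
    "def get_account_id(account):",
    "user_name = fetch_user(user_id)",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
tokenizer.train_from_iterator(corpus, trainer)

# A never-seen identifier still maps to known subtokens rather than [UNK].
print(tokenizer.encode("get_account_user_name").tokens)
```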

While we would like our training dataset to be as similar as possible to the inference data, for many this is not a feasible requirement. Consider the use case we started with: where would we get real companies' repositories with PII inside them? (assuming such snippets are highly sensitive). While big companies can theoretically leverage their internal code base for that need (at the risk of overfitting to their internal code practices and styling), small and medium companies don't have such a luxury. A more realistic assessment is that most companies must rely on public sources like Github to generate their datasets. But since many of Github's sub populations are probably not relevant (personal POCs, for example, probably look different from big companies' internal code), we should find a way to target the more relevant sub populations within it. That requires a better analysis of the problem domain: what are we trying to solve? What are the key data factors the model will pay attention to? How do we expect those factors to look in internal repositories? Once we have answered those questions, we can validate our assumptions by comparing our internal repositories to general Github ones, to better nail down the population types we need. Then we can approach Github in a more directed way, actively targeting the sub populations we need.
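As a rough illustration of that validation step, the sketch below compares a few cheap, observable features between two snippet corpora (for example, internal versus public samples). The feature set is an assumption chosen for illustration, not a recommended checklist.

```python
# A rough sketch of the validation step: profile a few simple features per
# corpus and compare internal snippets against public Github snippets.
# The features below are illustrative assumptions.
from statistics import mean

def snippet_features(snippet: str) -> dict[str, float]:
    lines = [l for l in snippet.splitlines() if l.strip()]
    return {
        "comment_rate": sum(l.lstrip().startswith("#") for l in lines) / max(len(lines), 1),
        "avg_line_len": mean(len(l) for l in lines) if lines else 0.0,
        "env_var_rate": sum("os.environ" in l for l in lines) / max(len(lines), 1),
    }

def corpus_profile(snippets: list[str]) -> dict[str, float]:
    feats = [snippet_features(s) for s in snippets]
    return {k: mean(f[k] for f in feats) for k in feats[0]}

# internal_snippets / public_snippets would be loaded from your own sources:
# print(corpus_profile(internal_snippets))
# print(corpus_profile(public_snippets))
```

Large gaps in such profiles are an early warning that a model trained on the public sample will not transfer well.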

Now that we have a better view of the inference phase data characteristics, we can start searching Github for relevant snippets. It's important to remember that a realistic dataset will include different sub populations with varying degrees of label relevance; some may have a high false positive rate while others may be almost entirely true positives. For our example (finding PII) it makes sense to divide the world into three main parts:

  • High standards; repos with almost zero likelihood of including PII. Relevant examples could be documentation repositories or big companies' (like the S&P top 50) repositories. Looking at Facebook's (public) repositories, for example, we can assume a very low likelihood of finding PII.
  • Medium standards; repos with some likelihood of including PII. Relevant examples could be small open source projects or small organisations' repositories (where mistakes are more likely to happen). Both can be targeted by looking at Github account meta fields like the number of committers or the overall repository count.
  • Low standards; repos with a high likelihood of including PII. Relevant examples could be (public) repositories of individual users that include relevant terms (like 'credit card ='), which can be assumed to have a higher likelihood of containing PII.

The good news is that all the examples above can be targeted using Github's search APIs, enabling us to actively (synthetically) assemble a population that better resembles our target (inference, private) populations, and with it, better performance at the inference phase.
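To make that concrete, the sketch below maps the three buckets to example Github search queries. The specific qualifiers and thresholds are illustrative assumptions, and the same authenticated request pattern from the earlier snippet is reused.

```python
# A sketch of approximating the three buckets with Github's search APIs.
# The concrete qualifiers and thresholds are illustrative assumptions.
import requests

HEADERS = {"Accept": "application/vnd.github+json",
           "Authorization": "Bearer <your-token>"}   # placeholder token

def search(endpoint: str, query: str) -> list[dict]:
    resp = requests.get(f"https://api.github.com/search/{endpoint}",
                        headers=HEADERS,
                        params={"q": query, "per_page": 50},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

# High standards: repositories of a large, well-known organisation.
high = search("repositories", "org:facebook stars:>500")

# Medium standards: small open source projects / small organisations.
medium = search("repositories", "language:python stars:10..100 size:<5000")

# Low standards: individual users' public code that mentions the target terms.
low = search("code", '"credit card" in:file language:python')
```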

Now that we have managed to generate an inference-like population, it's important to verify that it truly answers our needs. The main challenge is deciding the rate of each sub population in our synthetically made dataset. It commonly takes a few iterations to generate a dataset whose characteristics are similar enough to the projected inference ones. In the meantime, keep monitoring performance per sub population (both per generated dataset sub population and on our internal repositories), and make sure the predicted behaviour truly holds (in our example, the high standards sub population should include almost no PII; otherwise there is an issue with the targeting mechanism). This is also why it's important to collect feedback from your users, to verify that important sub populations weren't missed, and to keep re-training with new false positive examples, towards a fully tuned dataset that can beat any model relying on sheer tons of data.
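A possible shape for that monitoring loop is sketched below; the column names, bucket labels and thresholds are assumptions for illustration.

```python
# A sketch of per-subpopulation monitoring: track precision/recall and the
# predicted-positive rate for each bucket, and flag buckets that drift from
# their expected behaviour (e.g. 'high standards' starting to yield positives).
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def per_subpopulation_report(df: pd.DataFrame) -> pd.DataFrame:
    """df needs columns: subpopulation, y_true, y_pred (0/1 labels)."""
    rows = []
    for name, group in df.groupby("subpopulation"):
        rows.append({
            "subpopulation": name,
            "support": len(group),
            "positive_rate": group["y_pred"].mean(),
            "precision": precision_score(group["y_true"], group["y_pred"], zero_division=0),
            "recall": recall_score(group["y_true"], group["y_pred"], zero_division=0),
        })
    return pd.DataFrame(rows)

# report = per_subpopulation_report(predictions_df)
# Expected behaviour check, e.g. almost no positives in the cleanest bucket:
# assert report.loc[report.subpopulation == "high_standards", "positive_rate"].item() < 0.01
```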

