SourceCodeAI — how to handle Train-Inference mismatch | by Ori Abramovsky | May, 2022
Photo by Alex Dumitru from Pexels

Source code AI has many unique features that differentiate it from more general NLP applications (such as the common practice of heavily preprocessing the input before feeding it to the model). One of its main challenges is that while generating source code training datasets seems quite easy (‘just crawl GitHub’), the reality includes many hidden pitfalls. Among them is the fact that such highly available sources (like GitHub, Bitbucket or Stack Overflow) commonly differ from the…