Improving Code Search Using Learning-to-Rank and Query Reformulation Techniques
During software development, developers often encounter unfamiliar programming tasks. Online Q&A forums, such as StackOverflow, are one resource where developers can seek answers to their programming questions. Automatically recommending a working code example can help answer such questions. However, existing code search engines mainly support keyword-based queries and do not accommodate natural-language code search queries well. Specifically, natural-language queries contain fewer technical keywords, such as class or method names, which reduces the effectiveness of existing code search engines. In addition, a code search engine requires a ranking schema to place relevant code examples at the top of the result list. However, existing ranking schemas are hand-crafted heuristics whose configurations are hard to determine, which makes them difficult to apply to new languages or frameworks. In this paper, we propose an approach that uses query reformulation techniques to improve the search effectiveness of existing code search engines for natural-language queries. The approach automatically reformulates natural-language queries using semantically related class names. We also propose an approach that automatically trains a ranking schema for code example search using the learning-to-rank technique. We evaluate the proposed approaches using a large-scale corpus of code examples. The evaluation results show that our approaches can effectively recommend semantically related class names to reformulate natural-language queries, and that the improvement in search effectiveness over existing query reformulation approaches is statistically significant.
The automatically trained ranking schema can effectively rank code examples, and outperforms the existing ranking schemas by 35.65% and 48.42% in terms of normalized discounted cumulative gain (NDCG) and expected reciprocal rank (ERR), respectively.
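
The abstract reports results in NDCG and ERR without defining them. As an illustration only, the sketch below computes both metrics for a single ranked result list using their commonly used formulations (exponential gain for NDCG, the cascade model for ERR). The graded relevance labels, the grade scale, and the function names are assumptions made for this example; it is not the evaluation code used in this work.

```python
import math


def dcg(grades):
    """Discounted cumulative gain with exponential gain (2^g - 1)."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(grades, start=1)
    )


def ndcg(grades):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


def err(grades, max_grade):
    """Expected reciprocal rank under the cascade user model.

    Each result satisfies the user with probability (2^g - 1) / 2^max_grade;
    the user stops at the first satisfying result.
    """
    score = 0.0
    p_continue = 1.0  # probability the user reaches the current rank
    for rank, g in enumerate(grades, start=1):
        p_satisfied = (2 ** g - 1) / (2 ** max_grade)
        score += p_continue * p_satisfied / rank
        p_continue *= 1 - p_satisfied
    return score


# Hypothetical relevance grades (0 = irrelevant, 3 = highly relevant)
# for the top four code examples returned for one query.
ranking = [1, 3, 0, 2]
print(f"NDCG = {ndcg(ranking):.4f}, ERR = {err(ranking, max_grade=3):.4f}")
```

Both metrics reward placing highly relevant code examples near the top of the list; ERR additionally discounts results that appear after an already satisfying one.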