Automated Generation of Language Use Vector Extractors from TXL Grammars
Software development continues to demand advanced features that current programming languages do not offer. Language developers constantly evolve their languages to simplify common tasks, make the language more natural to use, and extend its features in response to the demands of software development. Yet language features are often not used efficiently in software development due to various obstacles. Existing source code provides vital information on programming language use and on how frequently developers use each language feature, which has encouraged many researchers to examine the use of language features. TXL is a domain-specific language used in software analysis research, and it has proven to be an excellent option for constructing language parsers with which researchers can efficiently extract information on language use. Language use vectors, which encode language use data, can be a crucial feature in language use studies and software analysis because they represent language use statistics precisely and in compact mathematical form. They can be derived directly from the grammar of the target programming language, which gives them an edge over other metrics employed in language use studies. In all prior studies, researchers have manually constructed TXL feature analyzers for each target language, a process that significantly delays analysis projects. This thesis explores the possibility of automating the construction of TXL language use extractors from a given language grammar, so that language feature analyzers for new programming languages can be built rapidly and accurately with little or no manual effort. We present an automated process for generating TXL-based extractors directly from programming language grammars, and demonstrate its use in analyzing language use in large corpora of three programming languages: Java, Ruby, and C.