Tuning large language models to generate diverse and non-flaky software test cases
Abstract
Large-scale software engineering relies on continuous practices to integrate and deploy software changes quickly and smoothly. A vital part of these practices is automated testing for quality assurance of the software. If this part fails, software changes queue up and are severely delayed. A common impediment to continuous practices is so-called flaky tests, that is, tests whose verdict changes between pass and fail nondeterministically even though no changes have been made to the software under test. In previous research we used machine learning to detect flaky tests. In the proposed project we want to adapt a large language model to generate tests that i) have a low probability of becoming flaky, ii) have low similarity to existing tests, and iii) have a high probability of covering the parts of the software that were recently changed. The data and models will come from larger open-source projects, primarily from the Hugging Face community, where we can replay the testing history and evaluate the outcome. We expect no problems with ownership or sensitive data.
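As an illustration of the definition of a flaky test given above, the following minimal Python sketch reruns a test against unchanged code and flags it as flaky when both pass and fail verdicts occur. It is not the detection method from our previous research; the function names (collect_verdicts, is_flaky, flip_rate), the number of reruns, and the simulated test are assumptions made only for this example.

```python
import random
from typing import Callable, List


def collect_verdicts(test: Callable[[], bool], runs: int = 20) -> List[bool]:
    """Run the same test repeatedly, with no change to the code under test."""
    return [test() for _ in range(runs)]


def is_flaky(verdicts: List[bool]) -> bool:
    """Flag a test as flaky if both pass (True) and fail (False) were observed."""
    return len(set(verdicts)) > 1


def flip_rate(verdicts: List[bool]) -> float:
    """Fraction of consecutive run pairs in which the verdict changed."""
    if len(verdicts) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(verdicts, verdicts[1:]))
    return flips / (len(verdicts) - 1)


if __name__ == "__main__":
    # Hypothetical stand-in for a test with a hidden source of nondeterminism
    # (for example timing or ordering); it passes roughly 70% of the time.
    rng = random.Random(0)

    def simulated_test() -> bool:
        return rng.random() < 0.7

    verdicts = collect_verdicts(simulated_test)
    print("flaky:", is_flaky(verdicts), "flip rate:", flip_rate(verdicts))
```

Rerunning only shows that a test is flaky once verdicts differ; a test that passes in every rerun may still be flaky with a low failure probability, which is one reason to estimate a probability of flakiness rather than a binary label.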