Master's thesis project developed at the University of Illinois at Chicago. TIGER is a system that generates and iteratively improves Python code with LLMs, combining test-driven development with automated feedback cycles.
TIGER uses documentation, docstrings, and tests from a codebase to prompt an LLM for function implementations, without ever showing the model the existing code. The system mimics a real-world development loop: generating code, running the tests, analyzing failures, and re-prompting the model until the tests pass or a maximum retry limit is reached.
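A minimal sketch of that loop, with hypothetical `build_prompt`, `call_llm`, and `write_candidate` helpers standing in for TIGER's actual components:

```python
from typing import Callable, Optional
import subprocess

def generate_until_tests_pass(
    build_prompt: Callable[[str], str],      # hypothetical: turns failure feedback into a full prompt
    call_llm: Callable[[str], str],          # hypothetical: sends the prompt, returns generated code
    write_candidate: Callable[[str], None],  # hypothetical: writes the candidate into the target file
    test_path: str,
    max_retries: int = 5,                    # assumed retry budget
) -> Optional[str]:
    feedback = ""
    for _ in range(max_retries):
        code = call_llm(build_prompt(feedback))
        write_candidate(code)
        # Run the repository's test suite; -x stops at the first failure.
        result = subprocess.run(
            ["pytest", test_path, "-x", "--tb=short"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return code  # all tests pass: accept this implementation
        # Feed a truncated failure log back into the next prompt.
        feedback = "\nPrevious attempt failed:\n" + result.stdout[-2000:]
    return None  # tests never passed within the retry budget
```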
The process includes preprocessing the repo, extracting function signatures and tests, generating code from structured prompts, executing the test suite, and feeding LLM-generated feedback back into the prompt when tests fail. The loop repeats until the tests pass or the retry budget is exhausted.
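For illustration, the signature-and-docstring extraction step could be done with Python's built-in `ast` module along these lines (a sketch, not necessarily TIGER's exact preprocessing code):

```python
import ast

def extract_specs(source: str) -> list[dict]:
    """Return each function's name, signature, and docstring, never its body."""
    specs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            specs.append({
                "name": node.name,
                "signature": f"def {node.name}({ast.unparse(node.args)}):",
                "docstring": ast.get_docstring(node) or "",
            })
    return specs
```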
TIGER aims to test whether LLMs can write correct, maintainable code using only test specifications. Experiments across 10 open-source Python repositories showed the iterative method outperforms single-pass generation, achieving up to 30% better success rates.
Python, PyTest, LLMs (Gemini, QwenCoder), AST parsing, call graph analysis, GitHub Actions, OpenRouter API.
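As an example of the third-party API usage, a call to an OpenRouter-hosted model through its OpenAI-compatible chat completions endpoint might look like this (the model ID and prompt handling are illustrative assumptions, not TIGER's exact client code):

```python
import os
import requests

def query_model(prompt: str, model: str = "qwen/qwen-2.5-coder-32b-instruct") -> str:
    # POST to OpenRouter's OpenAI-compatible chat completions endpoint.
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # example model ID; substitute the model under evaluation
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```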
TIGER simulates a realistic developer workflow, making LLMs more effective in practical software engineering tasks by integrating test-based feedback. It combines automation, code quality evaluation, and prompt engineering in one system. Notably, it accesses LLMs only through public APIs as a third-party user: no custom training or specialized hardware is required, which keeps the system cost-efficient and accessible. The strength lies in the methodology, not the model itself.
| Metric | Value |
|---|---|
| Repositories Tested | 10 Python projects |
| Functions Processed | ~230 (e.g., in boltons) |
| Average Iterations to Success | 1.9 |
| Success Rate (Iterative vs. One-Shot) | +30% improvement |
| Model Type | OpenRouter LLM APIs (no fine-tuning) |