TIGER: Testing and Improving Generated Code with LLMs

Master's thesis project developed at the University of Illinois at Chicago. TIGER is a system that generates and improves Python code using LLMs by leveraging test-driven development and iterative feedback cycles.

Overview

TIGER uses documentation, docstrings, and tests from a codebase to generate function implementations using LLMs, without ever showing the model the actual code. The system mimics a real-world development loop, generating code, testing it, analyzing failures, and prompting the model again until the tests pass or a max retry limit is reached.
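
The core of this loop can be sketched as below. This is a minimal illustration, not TIGER's exact implementation: the helper names (write_candidate, llm_generate, llm_feedback) and the retry budget are assumptions.

```python
# Minimal sketch of the generate-test-repair loop. The callables llm_generate
# and llm_feedback stand in for whatever LLM client the pipeline uses; names
# and the retry budget are illustrative, not the project's exact code.
import subprocess
from typing import Callable, Optional

MAX_RETRIES = 5

def write_candidate(code: str, target_path: str) -> None:
    """Overwrite the target module with the candidate implementation."""
    with open(target_path, "w") as f:
        f.write(code)

def run_tests(test_path: str) -> subprocess.CompletedProcess:
    """Run the relevant tests with pytest and capture their output."""
    return subprocess.run(
        ["pytest", test_path, "--tb=short"],
        capture_output=True,
        text=True,
    )

def generate_until_green(
    prompt: str,
    target_path: str,
    test_path: str,
    llm_generate: Callable[[str], str],
    llm_feedback: Callable[[str], str],
) -> Optional[str]:
    """Ask the LLM for code, test it, and fold failures into the next prompt."""
    for _ in range(MAX_RETRIES):
        code = llm_generate(prompt)
        write_candidate(code, target_path)
        result = run_tests(test_path)
        if result.returncode == 0:       # all tests passed
            return code
        # Tests failed: append the raw pytest output and an LLM-written
        # analysis of the failure, then try again with the enriched prompt.
        prompt += "\n\nTest output:\n" + result.stdout
        prompt += "\n\nFailure analysis:\n" + llm_feedback(result.stdout)
    return None                           # retry budget exhausted
```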

Workflow

The pipeline preprocesses the repository, extracts function signatures and their associated tests, generates candidate code with structured prompts, executes the test suite, and, when any test fails, feeds LLM-generated failure analysis back into the next prompt. The loop repeats until the tests pass or the retry limit is reached.
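
The extraction step can be approximated with Python's built-in ast module, which recovers each function's signature and docstring while hiding the body. This is a simplified sketch; TIGER's actual preprocessing also gathers additional context such as call graphs.

```python
# Sketch of signature/docstring extraction with Python's ast module.
# The real preprocessing step collects more metadata than this.
import ast

def extract_stubs(source: str) -> list[str]:
    """Return signature-plus-docstring stubs with function bodies hidden."""
    stubs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node) or ""
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            stubs.append(
                f"def {node.name}({ast.unparse(node.args)}){returns}:\n"
                f'    """{doc}"""\n'
                f"    ..."
            )
    return stubs

if __name__ == "__main__":
    sample = (
        "def slugify(text: str) -> str:\n"
        '    """Lowercase the text and replace spaces with hyphens."""\n'
        '    return text.lower().replace(" ", "-")\n'
    )
    print("\n\n".join(extract_stubs(sample)))
```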

Objectives & Results

TIGER evaluates whether LLMs can write correct, maintainable code from test specifications alone. Experiments across 10 open-source Python repositories showed that the iterative approach outperforms single-pass generation, improving success rates by up to 30%.

Technologies

Python, PyTest, LLMs (Gemini, QwenCoder), AST parsing, call graph analysis, GitHub Actions, OpenRouter API.
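
Model calls go through OpenRouter's OpenAI-compatible chat-completions endpoint. The sketch below shows one way to issue such a request with the requests library; the model slug, prompt wording, and environment-variable name are assumptions for illustration.

```python
# Sketch of a code-generation request via the OpenRouter chat-completions
# endpoint. The model slug, prompt wording, and env-var name are illustrative.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def generate_implementation(prompt: str,
                            model: str = "qwen/qwen-2.5-coder-32b-instruct") -> str:
    """Send the stub-and-tests prompt to an OpenRouter-hosted model."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [
                {"role": "system",
                 "content": "Write a Python implementation that makes the given tests pass."},
                {"role": "user", "content": prompt},
            ],
        },
        timeout=120,
    )
    response.raise_for_status()
    # OpenRouter mirrors the OpenAI response schema.
    return response.json()["choices"][0]["message"]["content"]
```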

Why it matters

TIGER simulates a realistic developer workflow, making LLMs more effective in practical software engineering tasks by integrating test-based feedback. It showcases automation, code quality evaluation, and prompt engineering in one system. Notably, it uses public LLM APIs as a third-party user — no custom training or powerful hardware is needed, making the system cost-efficient and accessible. The strength lies in the methodology, not the model itself.

Key Results

Repositories Tested: 10 Python projects
Functions Processed: ~230 (e.g., in boltons)
Average Iterations to Success: 1.9
Success Rate (Iterative vs. One-Shot): +30% improvement
Model Type: OpenRouter LLM APIs (no fine-tuning)