Test-Abylity - Solo‑built GenAI test‑automation MVP.

TEST-ABYLITY

GenAI-Powered Testing Platform

✨ MVP Experiment & Prototype

An MVP experiment and prototype built on weekends and late nights over several months, exploring GenAI, Large Language Models (LLMs), and the future of test automation. I designed, engineered, and product-managed the AI test-automation stack solo.

This prototype explores how far a single engineer, powered by Generative AI, can increase testing throughput.

LLM Function Calling · Prompt Engineering · Java Spring + Spring AI · Multi-LLM Architecture

Five practical use cases were built, each to answer the same question: "Could an LLM take this grunt work off my plate?"

5
GenAI Tools MVP
2+
Dual‑vendor LLM Stack
9
AI Models Evaluated
32+
Learning notes

Single-Click API Test Generator

Drafts tests in seconds

Upload any OpenAPI/Swagger file (JSON/YAML) and receive a fully parameterised, comprehensive RestAssured test suite in under 10 seconds, complete with boundary-condition and field-level coverage.

OpenAPI Specs · RestAssured · Positive/Negative & Boundary Tests
// Generated RestAssured test suite
@Test
public void testUsers() {
  given().when().get("/api/users")
      .then().statusCode(200);
}
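
The snippet above shows the happy path; the generator also targets negative and boundary cases. A minimal sketch of that kind of generated test, with a hypothetical endpoint and payload, might look like:

// Illustrative negative/boundary test (endpoint and payload are hypothetical)
@Test
public void testCreateUserRejectsEmptyName() {
  given().contentType("application/json")
    .body("{\"name\": \"\"}")   // boundary: empty required field
  .when().post("/api/users")
  .then().statusCode(400);      // expect a validation error
}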

Instant Page-Object Builder

Cuts down boilerplate code

Paste a raw HTML DOM and get production-ready Selenium Page Object Model classes, with locators and page methods generated (a sample of the generated output is sketched after the steps below).

Selenium · Page Object Model (POM) classes · DOM Analysis
Paste HTML or URL
AI analyzes DOM
Generate page objects
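
For illustration, a generated class might look roughly like the sketch below; the page name and locators are hypothetical, not actual tool output.

// Hypothetical generated Page Object (Selenium + PageFactory)
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.openqa.selenium.support.PageFactory;

public class LoginPage {
  @FindBy(id = "username") private WebElement usernameField;
  @FindBy(id = "password") private WebElement passwordField;
  @FindBy(css = "button[type='submit']") private WebElement loginButton;

  public LoginPage(WebDriver driver) {
    PageFactory.initElements(driver, this);   // wires the @FindBy locators
  }

  public void login(String user, String pass) {
    usernameField.sendKeys(user);
    passwordField.sendKeys(pass);
    loginButton.click();
  }
}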

Selenium-to-Playwright Migrator

Single shot conversion

Drop in legacy Java Selenium tests and convert them to TypeScript Playwright scripts with minimal manual edits required. (The MVP had challenges with hallucinations on larger codebases.)

Java → TS · Minimal Edits
// Selenium Java
driver.findElement(By.id("login")).click();
// Playwright TS
await page.locator("#login").click();

Requirements Analyzer

Highlights ambiguity

Upload user stories or requirements documents and receive AI-powered analysis highlighting ambiguities, missing acceptance criteria, and suggestions to improve testability.

Ambiguity Detection · Testability Score · Auto-Suggestions
Upload requirements
AI analyzes quality
Get improvement suggestions
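
On the backend, the analyzer's structured JSON response could be modelled with a small DTO; the field names below are assumptions for illustration, not the actual schema.

// Hypothetical DTO for the analyzer's structured response
import java.util.List;

public record RequirementAnalysis(
    String requirementId,
    List<String> ambiguities,               // vague or conflicting phrases found
    List<String> missingAcceptanceCriteria, // gaps the AI flagged
    List<String> suggestions,               // concrete rewording / testability tips
    int testabilityScore                    // e.g. 0-100
) {}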

AI Test-Assistant Chat

Answers in <2s

Open the floating modal from any page and ask "How do I find the evolution stage of a Pokemon?" or "Show me a list of Pokemon that can fly". The chat queries API and product specifications, with answers streamed via Anthropic Claude.

Anthropic Claude · Real-time
Ask question
AI & RAG analysis
Instant results
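
A rough sketch of how such streamed answers could be pushed to the modal from the Spring side, using server-sent events; chatService here is a hypothetical stand-in for the actual Claude/RAG pipeline.

// Sketch: streaming chat endpoint via SSE (chatService is a hypothetical stand-in)
@PostMapping("/chat-v1")
public SseEmitter chat(@RequestBody String question) {
  SseEmitter emitter = new SseEmitter(30_000L);              // 30s timeout
  CompletableFuture.runAsync(() -> {
    try {
      chatService.streamAnswer(question, token -> {          // called per streamed chunk
        try {
          emitter.send(SseEmitter.event().data(token));
        } catch (IOException io) {
          emitter.completeWithError(io);
        }
      });
      emitter.complete();
    } catch (Exception e) {
      emitter.completeWithError(e);
    }
  });
  return emitter;
}
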
🙋Why I Built This
🎬See Test‑Abylity in Action

📱Product Showcase

Screens, code-diff views, and the live chat overlay - all from the same local run. Every shot is un-mocked; what you see is what the MVP prototype does today.

TestAbylity Login Page

Login interface for the GenAI-powered TestAbylity platform with multiple sign-in options. Sign in with Google, GitHub, or plain email - no extra setup. OAuth keeps credentials off the server while you dive straight into the Portal.

Primary Core Sign-In
AI Model Integration Interface

RestAssured Test Generation

Drop any OpenAPI/Swagger file and receive a fully parameterised RestAssured test suite in under a minute. Generates positive, negative, and boundary tests.

Core Tool API Testing

AI Test Assistant Chat

Chat with your test requirements: ask for details about API specifications, and discover API functionality and endpoints. The LLM layer streams answers grounded in real run data.

Core Tool Assistant
🏗️ Architecture at a Glance

Layer | Technology/Component | Description / Decision Note
AI Model & Generation Layer | Anthropic Claude (Sonnet/Opus), OpenAI GPT (fallback) | After experimenting with both OpenAI models and Anthropic's Claude models (Sonnet), I stuck with Anthropic for its code quality and structured outputs. The fallback to OpenAI ensured a resilient pipeline even when limits kicked in.
Prompt Engineering Layer | Custom Prompts (based on test domain understanding) | I iterated through dozens of prompt versions before landing on one that gave consistently clean JSON responses. Having domain knowledge in testing helped craft the prompts and fine-tune the outputs. It taught me how sensitive LLMs are to tone and phrasing.
Authentication Layer | Firebase Auth + Google OAuth 2 | I didn't want to reinvent login or handle authentication myself - Firebase worked in minutes and let me protect routes effortlessly.
Presentation Layer | ReactJS | Clean UI was a selling point. Some AI assistance helped (though code-based LLMs weren't that useful at the time). Most components worked well, even for code viewers and charts. As always, styling was the hardest part.
Client Logic Layer | React Router, LocalStorage, Form Inputs | I stored results locally so users could revisit without re-processing. This minimized backend load and improved UX.
API Layer | Spring Boot REST Controllers | I exposed endpoints for sending front-end data, uploading Swagger, and for the chat service. These became the primary bridge between UI and LLM engine.
Business Logic Layer | TestGenService, POMGenService (Spring Components) | This layer gave structure to AI calls - especially when multiple endpoints needed similar orchestration. Separation helped with reusability.
Chat Interaction Layer | React modal + Spring `/chat-v1` endpoint | Streaming responses from Anthropic made the chatbot feel alive. Typing indicators and input disabling during load improved UX significantly.
File Upload Handling Layer | Spring MultipartConfig + Drag & Drop UI | Simple but essential. I needed to validate files and give instant feedback - many Swagger files failed silently without good checks.
Metrics & Visualization Layer | Charts, Tables, Syntax Highlighting | Charts made AI-generated test summaries more tangible. Seeing test counts and coverage visually made it more engaging.
RAG Orchestrator Layer | Spring AI `RetrievalAugmentedGenerationModel` | Wiring this up was a turning point - it abstracted the full flow from chunk to Claude with minimal glue code. Switching vector stores or models became trivial.
Embedding Model Layer | HuggingFace `bge-small-en` / `bge-base-en-v1.5` | Initially tried OpenAI embeddings, but HuggingFace models gave near-equal performance without needing an API key or GPU - great for quick iteration and local dev. Still learning more.
Retriever Layer | Spring AI `VectorStoreRetriever` | Plugging this in gave a quick win - suddenly the chatbot felt aware of Swagger details.
Vector Store Layer | Chroma DB (file-based) | I replaced pgvector with Chroma after realizing I didn't need a persistent Postgres instance. Its `.parquet` structure made debugging embeddings easier. This involved more trial and error than expected.
Chunking + Metadata Layer | Custom Java logic + DTO Enricher | Chunking by endpoint seemed obvious, but getting metadata tags right (e.g. `auth`, `statusCodes`) was key for retrieval precision.
Document Loader Layer | Java (Jackson / SnakeYAML) | Swagger YAMLs were messier than expected. I had to normalize quirks to prevent prompt hallucinations - custom POJOs helped.
Analytics & Monitoring Layer | Spring Logging | Enabled logs and dashboards for metrics visibility and debugging. Not flashy, just needed logs when things failed.
Build & DevOps Layer | Maven, Spring Boot Starters | Setup was intentionally simple. Maven plugins gave me repeatable builds and helped during local debugging and Dockerization.

TestAbylity Architecture

I wanted architectural choices that could survive a real team setting, not just a weekend spike. The diagram below shows how each layer stays swappable and low-ops.

🎨 Frontend Layer

React

Dashboard
Charts & Analytics
Result Visualization
Session History
Upload Forms
Swagger/OpenAPI
Requirements & HTML Input
Selenium Code Import
Code Viewer
Syntax Highlighting
Download Features
Diff View
AI Chat Modal
Streaming Responses
Interactive Assistant
Floating UI
Page Navigation:
• API Test Generator Page
• Page Object Builder
• Selenium-to-Playwright Converter
• Requirements Analyzer
• Settings & Profile
Sidebar: Persistent navigation with collapsible sections
Routing: React Router v5 with protected routes
State: Context API + LocalStorage Persistence

⚙️ Backend Layer

Spring Boot 3.x + Maven

REST API Endpoints (a controller skeleton is sketched below):

POST /generate-api-tests
POST /pom-generator-v1
POST /ui-scripts-migrator-v1
POST /req-analyzer-v1
POST /chat-v1
GET /health
File Processing
Swagger/OpenAPI Parser
HTML DOM Analysis
Requirements Parser
Code Generation
RestAssured Tests
POM Classes
Playwright Migration
Spring AI Integration
ChatClient Interface
RetrievalAugmentedGenerationModel
Structured Prompting Engine
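
A skeleton of how those endpoints could map onto a Spring controller; method signatures and bodies are illustrative, not the project's actual code.

// Skeleton controller for the endpoints listed above (bodies are placeholders)
@RestController
public class TestAbylityController {

  @PostMapping("/generate-api-tests")
  public ResponseEntity<String> generateApiTests(@RequestParam("file") MultipartFile swagger) {
    return ResponseEntity.ok("...generated RestAssured suite...");
  }

  @PostMapping("/pom-generator-v1")
  public ResponseEntity<String> generatePom(@RequestBody String html) {
    return ResponseEntity.ok("...generated Page Object classes...");
  }

  // /ui-scripts-migrator-v1, /req-analyzer-v1 and /chat-v1 follow the same pattern

  @GetMapping("/health")
  public ResponseEntity<String> health() {
    return ResponseEntity.ok("UP");
  }
}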

🔐 Authentication

Firebase + Google SSO

Firebase Auth
Google Sign-In
JWT Token Management
Frontend Integration
AuthContext Provider
Protected Routes
Session Management

🤖 AI Services

Multi-Model LLM Architecture

Anthropic Claude
Primary AI Engine (Sonnet/Opus)
Test Case Generation
POM Class Creation
Requirements Analysis
OpenAI GPT
Fallback Option (GPT-3.5)
Chat Responses
Code Analysis
Resilient Pipeline
Vector Embeddings
HuggingFace bge-small-en
bge-base-en-v1.5
Chroma DB Storage
Prompt Engineering
Structured System Prompts
JSON Schema Validation
Function Calling
Chain-of-Thought Processing

Core Feature Data Flows

🔄 API Test Generation

Upload Swagger/OpenAPI File
Parse Schema and Endpoints
AI Analysis for Test Scenarios
Generate RestAssured Code
Display Results with Charts

🏗️ POM Generation

Input HTML Snippet
DOM Element Analysis
AI Generate POM Classes
Java/TypeScript Output
Download Generated Code

📝 Requirements Analyzer

Upload Requirements Document
Ambiguity Detection
Testability Analysis
Generate Improvement Suggestions
Provide Testability Score

🔄 Selenium Converter

Input Selenium Code
Syntax Analysis
AI Transform Logic
Playwright Code Output
Interactive Stepper UI

💬 AI Chat Assistant

User Query Input
RAG Context Retrieval
AI Response Generation
Stream Response to UI
Interactive Chat Modal
🛠️ Tech Stack & Tools

LLM-agnostic architecture with swappable components

AI Generation & Prompting

Dual-vendor LLMs with domain-tuned prompts for robust test generation.

Anthropic Claude (Sonnet / Opus) · Primary
OpenAI GPT (fallback) · Fallback
Domain-tuned custom prompts · Core

RAG Orchestrator

Advanced retrieval system orchestrating context-aware AI responses.

Spring AI RetrievalAugmentedGenerationModel · Core
Spring AI VectorStoreRetriever · Core
Custom chunker & metadata enricher · Tool

Embeddings & Vector Store

High-performance vector storage and embedding models for semantic search.

HuggingFace bge-small-en / bge-base-en-v1.5 · AI
Chroma DB (file-based .parquet) · Data
Document loaders (Jackson / SnakeYAML) · Tool

Frontend UI

Modern React-based interface with rich data visualization.

React 18 · Core
React Router v5 · Tool
Charts, tables & syntax highlighting · UI

Backend Services

Spring Boot services powering core logic and API endpoints.

Spring Boot REST Controllers · Core
TestGenService (Spring) · Core
POMGenService (Spring) · Core

Chat Interaction

Interactive chat interface with real-time streaming and UX enhancements.

React modal chat widget · UI
Streaming /chat-v1 endpoint · Core
Typing & loading UX hints · UI

Client State & Auth

Secure authentication and client-side state management.

Firebase Auth · Core
Google OAuth 2 · OAuth
LocalStorage caching · Tool

Build & Ops

Streamlined build process and operational monitoring.

Maven (Spring Boot starters) · Tool
Dockerized worker images · DevOps
Spring logging & dashboards · Monitoring

🚀 Build & Learning Journey

Breadcrumbs of code spikes, bug hunts & aha-moments.

32
Learning Points
40+
Technologies
9
AI Models
100%
Hands-On

🛠️ Spring Boot Refresher

I kicked off by rebuilding a simple REST service just to shake off the rust - Spring Boot had changed more than I remembered. Reacquainting myself with annotations and tricks helped me get back on track. Watching live reloads work again felt oddly satisfying.

Spring Boot 3.x REST Service Testing Knowledge Refresh
Phase 1: Foundation & Core Technologies / Foundation ✅ Service + Tests

🌐 REST API Foundation

Getting the backend to talk to the frontend wasn't as smooth as I expected - CORS issues, multipart limits, and MIME headers all needed tuning. I used Postman and browser dev tools in loops until errors stopped. Fixing this gave me a clearer picture of what real-world API bridging involves.

Spring Boot REST CORS Multipart Config LLM-UI Bridge Setup
Phase 1: Foundation & Core Technologies / Service Layer 🔗 API Bridge

🧱 Service Architecture

When wiring up AI routes, I quickly saw things getting messy. Breaking them out into TestGenService and POMGenService gave clarity. I didn't expect how reusable these would become. Debugging was easier after separating the logic.

TestGenService POMGenService Spring Components Clean Orchestration Reusability
Phase 1: Foundation & Core Technologies / Logic Layer 🏗️ Service Design

⚡ OpenAI Experiments

I ran dozens of prompts through OpenAI's models - some evenings were just latency tests and JSON parse failures. I created a local sheet comparing response length, price, and formatting reliability. It made the tradeoffs really concrete.

text-davinci-003 GPT-3.5-turbo GPT-4 Cost vs Quality
Phase 2: AI Model Exploration & Selection / Benchmarking 📊 Performance Analysis

🤖 Spring AI Deep‑Dive

I cloned the Spring AI samples and traced every piece - function calls, prompt formatting, fallback logic. That's when I stumbled on Spring AI, which later became critical. It felt like finding a hidden lever.

Spring AI Function Calls JSON Output Game Changer
Phase 2: AI Model Exploration & Selection / Discovery Phase 🎯 Key Discovery

🎯 Anthropic Claude Integration

I dropped Claude 3 Sonnet into the same prompts I used with GPT-4. To my surprise, Claude's output was shorter, cheaper, and just as clean. I ran diff comparisons between them to confirm it wasn't just luck.

Claude 3 Sonnet Java Tests 15% Shorter Code 40% Cheaper
Phase 2: AI Model Exploration & Selection / Model Selection 🏆 Winner Selected

🏁 Multi‑Model Evaluations

This was the fun part - prompt packs, retries, OpenAI as fallback. I intentionally pushed rate limits to simulate worst-case. Watching the pipeline hold up taught me the real value of fallbacks and model orchestration.

Claude 3 Opus GPT-4o OpenAI Fallback Quality‑Resilient Pipeline
Phase 2: AI Model Exploration & Selection / Comprehensive Test 🔬 Model Showdown

🧪 AI Model & Generation Layer

Final setup had Claude (Sonnet/Opus) for day-to-day, and OpenAI as a safety net. When Claude hiccupped, OpenAI took over. I tested fallback by toggling API keys mid-session. It worked - and gave me peace of mind.

Claude OpenAI Model Switching Quality‑Resilient Pipeline
Phase 2: AI Model Exploration & Selection / GenAI Core Layer 🔄 Model Orchestration
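
The fallback behaviour described above can be captured in a few lines; below is a simplified sketch with a hypothetical LlmClient abstraction rather than the actual Spring AI wiring.

// Simplified model-fallback sketch; LlmClient is a hypothetical abstraction
interface LlmClient {
  String complete(String prompt) throws Exception;
}

class ResilientLlm {
  private final LlmClient primary;   // e.g. Anthropic Claude
  private final LlmClient fallback;  // e.g. OpenAI GPT

  ResilientLlm(LlmClient primary, LlmClient fallback) {
    this.primary = primary;
    this.fallback = fallback;
  }

  String complete(String prompt) {
    try {
      return primary.complete(prompt);      // normal path
    } catch (Exception primaryFailure) {    // rate limit, outage, bad key...
      try {
        return fallback.complete(prompt);   // safety net
      } catch (Exception e) {
        throw new IllegalStateException("Both LLM providers failed", e);
      }
    }
  }
}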

📚 Prompt‑Engineering Basics

I tried zero-shot first - results were all over the place. Few-shot helped, but chain-of-thought unlocked tricky cases. I kept a prompt history log just to track improvements. Coverage jumped and errors dropped.

Zero-shot Few-shot Chain-of-Thought Increased prompt accuracy
Phase 3: Prompt Engineering & Quality / Pattern Discovery 📈 Documented Gains

📊 Prompt Benchmark Script

Got tired of eyeballing performance. I wrote a CLI script to dump latency, token count, and JSON parse status into CSV. Every test run now leaves a footprint - makes A/B testing trivial.

Python CLI Latency Tracking Token Stats JSON Validation Unified Metrics
Phase 3: Prompt Engineering & Quality / Tooling Phase 🔧 Benchmarking Tool

🎯 Prompt Fine‑Tuning

At one point, parse errors hit 28% and most of my prompts included the constant reminder "please respond only in JSON, please". I added a strict JSON schema to the system prompt and re-ran the same test set - errors dropped to 3%. I felt like a backend QA engineer at that moment.

JSON Schema System Prompts 28% → 3% Error Rate Parse Reliability
Phase 3: Prompt Engineering & Quality / Quality Improvement 📉 Major Error Reduction
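
For illustration, the change amounts to embedding a strict schema in the system prompt rather than a polite reminder; the schema below is a made-up example, not the project's actual prompt.

// Illustrative schema-constrained system prompt (Java text block)
String systemPrompt = """
    You are a test-generation assistant.
    Respond ONLY with JSON matching this schema - no prose, no markdown:
    {
      "tests": [
        { "name": "string",
          "method": "GET|POST|PUT|DELETE",
          "path": "string",
          "expectedStatus": 200 }
      ]
    }
    """;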

🧠 Prompt Engineering Layer

This phase was gritty. I went through 20+ prompt iterations, hand-checking each one. Once I started using testing-specific tone and consistent formatting, JSON stopped breaking. Domain knowledge really helped here.

Custom Prompts Testing Domain Knowledge Tuning Mastery Domain Specific
Phase 3: Prompt Engineering & Quality / Prompt Optimization 🎯 Domain Expertise

🛡️ Guardrails & Schema Validation

Silent failures haunted me early on. Wrapping outputs in Pydantic and adding re-ask loops gave peace of mind. I'd let broken output through one too many times - this finally stopped that.

Pydantic Schema Validation Re-ask Loops Zero Silent Errors Robust Validation
Phase 3: Prompt Engineering & Quality / Quality Assurance 🛡️ Error Prevention
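
The project paired Pydantic-style schema validation with re-ask loops; the sketch below shows the same idea in Java, with llm and schemaValidator as hypothetical stand-ins.

// Sketch of a validate-and-re-ask loop (llm and schemaValidator are hypothetical)
String askWithRetries(String prompt, int maxAttempts) {
  String feedback = "";
  for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    String raw = llm.complete(prompt + feedback);
    if (schemaValidator.isValid(raw)) {   // e.g. JSON-schema check
      return raw;                         // clean, parseable output
    }
    feedback = "\nYour previous reply was invalid. Return ONLY JSON matching the schema.";
  }
  throw new IllegalStateException("LLM output never passed validation");
}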

🎨 Frontend Build & Troubleshooting

The front-end template and the libraries it used didn't play nice at first. Some styles overrode others, and lazy loading failed silently. I had to inspect bundles and isolate imports - it reminded me how difficult UI development can be.

ReactJS Style Conflicts Fixed Lazy Loading
Phase 4: Frontend Development / Frontend Phase 🖼️ UI Issues Resolved

🖼️ Frontend Dashboard Build

Building the dashboard was a UI masterclass - drag-and-drop uploads, syntax highlighting, animated modals. I ran dozens of tests just to make one button feel responsive. The polish came from pain.

ReactJS Charts Drag‑and‑Drop Visual Clarity UX Lessons
Phase 4: Frontend Development / UI Design Phase 🎨 UI Masterclass

🔐 Authentication Layer

I plugged in Firebase + Google Auth and was stunned it just worked. Securing routes in under 60 minutes made me wonder why I ever tried doing this manually before.

Firebase OAuth 2 Quick Auth Setup
Phase 4: Frontend Development / User Access Control 🔒 Security Layer

🔄 Client State Management

Using localStorage + React Router let users pick up where they left off. I tested persistence across tabs and refreshes. This small addition made the whole UX feel premium.

React Router LocalStorage State Persistence State Handling Session Continuity
Phase 4: Frontend Development / UX Enhancement 💾 State Management

🔌 Backend ↔ Frontend Wiring

Cleared CORS gremlins, raised the 413 limit and fixed MIME headers - backend and UI finally shook hands. I tested with curl, browser uploads, and large Swagger files until it clicked. Felt like debugging plumbing leaks.

Axios CORS Multipart MIME Headers Integration Complete
Phase 5: Full-Stack Integration / Integration Phase 🔗 Full Stack Connected
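
Most of those fixes reduce to a little Spring configuration; the sketch below shows the general shape, with the dev origin and size limits as assumed example values.

// Sketch: CORS config for the React dev origin (origin and methods are assumptions)
@Configuration
public class CorsConfig implements WebMvcConfigurer {
  @Override
  public void addCorsMappings(CorsRegistry registry) {
    registry.addMapping("/**")
        .allowedOrigins("http://localhost:3000")
        .allowedMethods("GET", "POST", "OPTIONS");
  }
}
// The 413s came from upload limits, e.g. in application.properties:
// spring.servlet.multipart.max-file-size=10MB
// spring.servlet.multipart.max-request-size=10MB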

📂 File Upload Pipeline

MultipartConfig helped trap malformed Swagger files before they silently broke everything. I added toast notifications after misclicks left me confused. It was hands-on validation tuning with lots of trial and error.

Multipart Upload File Validation Toast Notifications UX Validation Error Prevention
Phase 5: Full-Stack Integration / Input Handling 📁 File Processing

💬 Chat Interface Integration

Streaming Claude replies into a floating modal looked simple - but debugging typing indicators and reset flow was more work than expected. It made me think like a bot interaction designer.

React Modal Streaming Responses Spring Chat Endpoint Typing Indicators Real-time Conversational UX
Phase 5: Full-Stack Integration / Chat Feature Build 💬 Chat Experience

📈 Results Dashboard

I split POM and API results into separate dashboards using localStorage as a poor man's state store. Chart rendering broke twice on malformed data - fixing that gave me a new respect for UI error boundaries.

Charts Syntax Highlighting Code Display Metrics Dashboard Data Visualization Code Presentation
Phase 5: Full-Stack Integration / Results Visualization 📊 Data Display

📥 Document Loader Layer

Swagger YAMLs were far messier than I expected. Parsing failed on nested examples, inline enums, and missing fields. I cleaned them with Jackson and custom POJOs - felt like prepping data for a finicky chef.

Jackson SnakeYAML Cleaner Input Pipeline
Phase 6: RAG & Vector Search Implementation / Loader Setup 📄 Document Processing

🧩 Chunking + Metadata Layer

I wrote a Java chunker that tagged endpoints with verbs, auth status, and status codes. Debugged it using real Swagger samples and compared what was captured vs missed. Metadata precision became my obsession.

Java Chunker Metadata Enricher Precision Targeting
Phase 6: RAG & Vector Search Implementation / Preprocessing Layer 🔍 Content Analysis
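
A rough shape of such an endpoint chunk and its metadata tags; the field names are assumptions based on the tags mentioned above.

// Hypothetical shape of one endpoint chunk plus its retrieval metadata
import java.util.Map;

public record EndpointChunk(String text, Map<String, String> metadata) {

  static EndpointChunk of(String text, String path, String verb,
                          boolean requiresAuth, String statusCodes) {
    return new EndpointChunk(text, Map.of(
        "path", path,
        "verb", verb,                            // GET, POST, ...
        "auth", String.valueOf(requiresAuth),    // used to filter retrieval
        "statusCodes", statusCodes));            // e.g. "200,401,404"
  }
}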

🧬 Embedding Model Layer

Swapped OpenAI embeddings for HuggingFace's bge-small-en. I ran tests side-by-side on the same chunks and measured cosine similarity in local notebooks - no API keys, no stress.

HuggingFace bge-small-en Local Dev Cost‑free Iteration
Phase 6: RAG & Vector Search Implementation / Vectorization 🧠 Local Embeddings
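
The side-by-side check boils down to cosine similarity between the two models' vectors for the same chunk; a minimal version of that calculation:

// Cosine similarity between two embedding vectors
static double cosineSimilarity(float[] a, float[] b) {
  double dot = 0, normA = 0, normB = 0;
  for (int i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}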

🗄️ Vector Storage Optimization

Compared PGVector, Chroma and ElasticSearch. File-based Chroma won - it was easier to inspect raw vectors and debug weird retrieval issues. Rebuilding the store was 2× faster too.

PGVector Chroma DB ElasticSearch File-based Lightweight 2× Faster
Phase 6: RAG & Vector Search Implementation / Database Evaluation 🚀 Performance Winner

🔍 Retriever Layer

I wired up Spring AI's VectorStoreRetriever and immediately tested Swagger Q&A queries. Seeing the bot return endpoint names and auth rules from memory was a turning point - it felt aware.

Spring AI Retriever Retrieval Accuracy
Phase 6: RAG & Vector Search Implementation / Retriever Config 🔍 Knowledge Retrieval

🔄 RAG Orchestrator Layer

I was stunned how clean Spring AI's RAG model API was - just passed chunks and prompts, and it handled routing. I toggled Claude ↔ OpenAI in 3 lines. This changed how I viewed multi-model wiring.

Spring AI RAG Claude RAG Simplified
Phase 6: RAG & Vector Search Implementation / Retrieval Pipeline 🔄 RAG Orchestration

🔄 Code‑Conversion Spike

I prototyped a tool to migrate Selenium to Playwright tests. Ran it on my old UI suite - some locators broke, some DOM assumptions didn't translate. Every failure gave ideas for smarter prompt scaffolding.

Selenium Playwright Code Migration Edge Cases Documented POC Complete
Phase 7: Advanced Research & Experiments / Research Spike 🧪 Migration Strategy

🤖 Agent Pattern Prototype

I chained together tool-calls in Spring AI to go from spec → test → cleanup. Half of it broke, but when it didn't - it was magical. It felt like watching agents reason. Definitely unfinished, but full of promise.

Spring AI Tool-call Chaining Agent Pattern Promising Results Future Architecture
Phase 7: Advanced Research & Experiments / Architecture Research ⚡ Agent Prototype

📚 Knowledge Base Creation

I rolled all the regex extractors and prompt patterns into a shared cookbook. The latency vs context vs cost matrix saved me mid-debug more than once. Onboarding now takes 50% less time - and fewer Slack pings.

Regex Extractors System Templates Model Comparison Latency Metrics Cost Analysis Context Limits 50% Faster Onboarding Standardized Patterns
Phase 8: Documentation & Operations / Documentation 📚 Knowledge Base

📈 Analytics & Monitoring Layer

Added just enough logging in Spring Boot to flag when things silently failed. Watching request cycles and payload dumps during dev made debugging much less guesswork.

Spring Boot Logging Visibility with Simplicity
Phase 8: Documentation & Operations / Debug Phase 📊 Monitoring Setup

🛠️ Build & DevOps Layer

I kept the build stack minimal - Maven, Spring Starters, and a Dockerfile. Ran dry builds repeatedly until it stopped throwing surprises. Simple, predictable, ready to ship (but I paused here :) )

Maven Spring Starters Repeatable Build Flow
Phase 8: Documentation & Operations / Infra Layer 🏗️ Build System
Closing Thoughts & Next Steps

Since I built this, the tooling landscape has shifted - many developer tools and IDEs now offer built-in code-generation features that simply didn't exist back then.

I'm writing this now because GenAI and LLM technology have rapidly evolved. This project was a way to explore the space hands-on - and now that it's built, I'm treating this as a thoughtful wrap-up, not a SaaS launch. Test‑Abylity served its purpose: it let me experiment, learn, and contribute something useful to the conversation around GenAI in testing.

From here, I'm excited to move on and explore new, innovative solutions that take QA even further.

I'm exploring agentic and multi-agent systems, model context protocols, and more; maybe this project needs a complete revamp through a different lens now 😊

Aby Thannikal Joseph
Got thoughts or seen a similar challenge? Drop me a note on LinkedIn - always keen to chat testing & AI.
Message me on LinkedIn
Chat with Me · Watch the Demo