Ruthwik Arepelly

Open to AI PM and product-adjacent roles at early-stage AI startups (pre-seed to Series D, teams under 30) building LLMs, RAG, or eval tooling. LinkedIn · Email

I build evaluation-first AI systems — and I can tell you exactly why each one works, where it breaks, and what the numbers say.

7+ years building 0→1 products. Co-founded Photon (EdTech fintech, 75+ schools, $100K ARR). Now building at the intersection of LLMs, RAG, and AI evaluation.

Start here → Mechanic Trust case study — the clearest example of how I design evaluation-first AI products.

What Each Project Proves

Project	Problem Solved	What It Demonstrates	Live
Self-Improving Prompt Agent	How do you improve a prompt without guessing?	Built an eval loop that ran 10 rounds — score went 0.10 → 0.80. Key insight: better prompts come from better evals, not more attempts	—
finrag-eval	Financial RAG hallucinates confidently — and you can't tell	Found 2/3 hallucinations were honest refusals, 1/3 were confidently wrong. Filed a metric-level bug in DeepEval that the team is now fixing	—
GitScope	Evaluating a GitHub repo takes hours of manual reading	Built an MCP-powered agent that gives PMs structured repo analysis in seconds — PM-first output, not raw code	—
Mechanic Trust	Auto repair shops exploit trust gaps with opaque pricing	Case study: designed the trust, explainability, and pricing transparency layer for a high-friction AI product	—
ReceiptIQ	Accountants manually copy-paste receipt data for hours	GPT-4o Vision pipeline with confidence scoring — forces the AI to be honest about what it's uncertain about	Demo
Warmlist	PMs lose track of warm contacts who could open doors	GPT-4o-mini CRM that surfaces who to reach out to and why — using LLMs for PM work, not just AI products	—
SugarShield	AI classifiers over-warn or miss hidden sugar, and you can't tell which failure mode you're in	Built eval infrastructure into the product: 0 false negatives by design, conservative bias as an explicit product decision, 87% trigger match rate. Strict vs. Lenient mode comparison built in	Demo · Eval

Case Studies

How I think through AI product decisions — not just what I built, but why, what failed, and what the system gets wrong:

Published

Mechanic Trust — Trust-critical design in consumer AI: explainability, pricing transparency, failure mode planning

Case Study Pipeline — detailed write-ups in progress, expected June 2026:

finrag-eval — Evaluation infrastructure for financial RAG: where metrics lie, where hallucinations hide
Self-Improving Prompt Agent — Recursive eval loops: what happens when the optimizer is only as good as its evaluator

Open Source Signal

DeepEval Issue #2594 — Filed a root-cause bug report on ContextualPrecisionMetric over-penalizing overlapping chunks in financial RAG. Drove technical consensus on the group_by API fix. The Confident AI team is shipping it in the next release. This is what evaluation obsession looks like in practice: I wasn't just using the tool — I found where the metric itself was wrong.

Stack I Work In

Evaluation: DeepEval, Claude as evaluator, LLM-as-judge patterns, custom eval harnesses, ground-truth scoring
RAG: pgvector, Supabase, LangChain, OpenAI embeddings, section-aware chunking
Agents: MCP, Claude agents, tool-use patterns, agentic loops, prompt optimization
Shipping: Python, TypeScript, Next.js, Vercel, SQL, Docker
Models: GPT-4o Vision, Claude Opus, Claude Sonnet, DeepEval for benchmarking

Writing

I write about product thinking, AI systems, and what I learn from building:

Product Learning: How Gifting Became a Growth Engine, Not a Feature — Feature → growth lever
How I Turn User Complaints Into Feature Ideas (Simple 7-Step Method) — Product thinking framework
From Venue to Platform: The Bernabéu as a Product — How physical spaces evolve into platforms
How I Built SugarShield: From a Grocery Aisle Problem to a Working AI Product — Full build case study
Tap & Pray Is Not a Payment Strategy — Fintech product lessons
Product Experiment: IntentTabs — Adding Friction to Fight Impulse — Behavioral design in product

View all on Medium →

Background

Photon (Co-founder): Built a B2B SaaS payments platform for schools (75+ schools in India, $100K ARR, 8-person team)
Digital Connect: AI product — built and shipped features for university admin workflows
BS Computer Science · MSc Business Analytics, Trine University

What Sets Me Apart

Most AI PMs talk about outputs. I focus on whether the system is trustworthy.

That means evaluating the evaluator (DeepEval Issue #2594), designing products around failure modes before launch (SugarShield: 0 false negatives by design), and measuring improvement through behavior change, not vanity metrics (Self-Improving Prompt Agent: 0.10 → 0.80).

I don't just use AI tools. I find where they break, why they break, and what to ship next because of it.

LinkedIn · Email · Medium
