Ruthwik-Data/README.md

Ruthwik Arepelly

Open to AI PM and product-adjacent roles at early-stage AI startups (pre-seed to Series D, teams under 30) building LLMs, RAG, or eval tooling. LinkedIn · Email

I build evaluation-first AI systems — and I can tell you exactly why each one works, where it breaks, and what the numbers say.

7+ years building 0→1 products. Co-founded Photon (EdTech fintech, 75+ schools, $100K ARR). Now building at the intersection of LLMs, RAG, and AI evaluation.

Start here → Mechanic Trust case study — the clearest example of how I design evaluation-first AI products.


What Each Project Proves

| Project | Problem Solved | What It Demonstrates | Live |
| --- | --- | --- | --- |
| Self-Improving Prompt Agent | How do you improve a prompt without guessing? | Built an eval loop that ran 10 rounds; score went 0.10 → 0.80. Key insight: better prompts come from better evals, not more attempts | |
| finrag-eval | Financial RAG hallucinates confidently, and you can't tell | Found 2/3 hallucinations were honest refusals, 1/3 were confidently wrong. Filed a metric-level bug in DeepEval that the team is now fixing | |
| GitScope | Evaluating a GitHub repo takes hours of manual reading | Built an MCP-powered agent that gives PMs structured repo analysis in seconds: PM-first output, not raw code | |
| Mechanic Trust | Auto repair shops exploit trust gaps with opaque pricing | Case study: designed the trust, explainability, and pricing transparency layer for a high-friction AI product | |
| ReceiptIQ | Accountants manually copy-paste receipt data for hours | GPT-4o Vision pipeline with confidence scoring that forces the AI to be honest about what it's uncertain about | Demo |
| Warmlist | PMs lose track of warm contacts who could open doors | GPT-4o-mini CRM that surfaces who to reach out to and why: using LLMs for PM work, not just AI products | |
| SugarShield | AI classifiers over-warn or miss hidden sugar, and you can't tell which failure mode you're in | Built eval infrastructure into the product: 0 false negatives by design, conservative bias as explicit product decision, 87% trigger match rate. Strict vs. Lenient mode comparison built-in | Demo · Eval |
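
The SugarShield entry describes a conservative bias: extra warnings are accepted in exchange for zero missed cases. Below is a minimal sketch of how a Strict vs. Lenient comparison could be scored, assuming each mode is simply a different decision threshold over a sugar-likelihood score. The labeled examples, thresholds, and function names are hypothetical illustrations and are not taken from the SugarShield codebase.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one mode."""
    tp: int = 0  # flagged, and hidden sugar was present
    fp: int = 0  # flagged, but no hidden sugar (over-warning)
    fn: int = 0  # not flagged, but hidden sugar was present (a miss)
    tn: int = 0  # not flagged, and no hidden sugar


def evaluate(examples, threshold):
    """Score (score, has_hidden_sugar) pairs at a given flagging threshold."""
    counts = Counts()
    for score, has_hidden_sugar in examples:
        flagged = score >= threshold
        if flagged and has_hidden_sugar:
            counts.tp += 1
        elif flagged and not has_hidden_sugar:
            counts.fp += 1
        elif not flagged and has_hidden_sugar:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


# Hypothetical labeled set: (model score, ground truth).
EXAMPLES = [
    (0.95, True), (0.70, True), (0.35, True),
    (0.60, False), (0.30, False), (0.10, False),
]

# Assumed thresholds: Strict flags more aggressively than Lenient.
STRICT_THRESHOLD = 0.30
LENIENT_THRESHOLD = 0.65

strict = evaluate(EXAMPLES, STRICT_THRESHOLD)
lenient = evaluate(EXAMPLES, LENIENT_THRESHOLD)

print("strict :", strict)
print("lenient:", lenient)
```

On this toy data the strict threshold misses nothing but over-warns twice, while the lenient one never over-warns but misses one case. Reporting both sets of counts side by side makes the trade-off an explicit product decision rather than an accident of the threshold.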

Case Studies

How I think through AI product decisions — not just what I built, but why, what failed, and what the system gets wrong:

Published

  • Mechanic Trust — Trust-critical design in consumer AI: explainability, pricing transparency, failure mode planning

Case Study Pipeline — detailed write-ups in progress, expected June 2026:

  • finrag-eval — Evaluation infrastructure for financial RAG: where metrics lie, where hallucinations hide
  • Self-Improving Prompt Agent — Recursive eval loops: what happens when the optimizer is only as good as its evaluator
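
The recursive eval loop idea can be sketched as a mutate, score, keep-the-best cycle. This is a toy illustration under stated assumptions, not the agent's actual code: `optimize`, `toy_judge`, and `toy_mutate` are hypothetical names, and the judge is a deterministic keyword check standing in for an LLM-as-judge call so the example runs without any API.

```python
from typing import Callable, List, Tuple

# Hypothetical rubric: instructions the toy judge rewards.
REQUIRED = ("cite sources", "answer in json", "say 'unknown' if unsure")


def toy_judge(prompt: str) -> float:
    """Stand-in evaluator: fraction of rubric phrases present in the prompt."""
    text = prompt.lower()
    return sum(1 for phrase in REQUIRED if phrase in text) / len(REQUIRED)


def toy_mutate(prompt: str, round_index: int) -> str:
    """Stand-in mutation: append one rubric phrase per round if it is missing."""
    addition = REQUIRED[round_index % len(REQUIRED)]
    if addition in prompt.lower():
        return prompt
    return f"{prompt} {addition.capitalize()}."


def optimize(
    seed: str,
    mutate: Callable[[str, int], str],
    judge: Callable[[str], float],
    rounds: int = 10,
) -> Tuple[str, float, List[float]]:
    """Keep a candidate only when the judge scores it strictly higher."""
    best, best_score = seed, judge(seed)
    history = [best_score]
    for i in range(rounds):
        candidate = mutate(best, i)
        score = judge(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


best_prompt, best_score, history = optimize(
    "Summarize the filing.", toy_mutate, toy_judge
)
print(best_prompt)
print(history)
```

The loop climbs to whatever the judge rewards and then stalls. Here the judge only checks for keywords, so the "optimal" prompt is merely one that contains them, which is the sense in which the optimizer is only as good as its evaluator.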

Open Source Signal

DeepEval Issue #2594 — Filed a root-cause bug report on ContextualPrecisionMetric over-penalizing overlapping chunks in financial RAG. Drove technical consensus on the group_by API fix. The Confident AI team is shipping it in the next release. This is what evaluation obsession looks like in practice: I wasn't just using the tool — I found where the metric itself was wrong.
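
For readers unfamiliar with the metric, here is a rough sketch of a rank-weighted contextual precision score, written from the formula in DeepEval's public documentation as best understood here. It is not DeepEval's implementation. The relevance verdicts are invented, and the scenario shown (an overlapping chunk judged irrelevant because it repeats an earlier one) is only one assumed way over-penalization could arise, not a reproduction of Issue #2594.

```python
from typing import Sequence


def contextual_precision(verdicts: Sequence[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    Each relevant chunk at rank k contributes (relevant chunks up to k) / k;
    the sum is divided by the total number of relevant chunks.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    relevant_so_far = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / k
    return weighted_sum / relevant_total


# Hypothetical verdicts for three retrieved chunks.
# Case A: every chunk, including the overlapping one, counts as relevant.
all_relevant = contextual_precision([True, True, True])

# Case B: the second chunk overlaps the first and is judged irrelevant.
overlap_rejected = contextual_precision([True, False, True])

print(all_relevant, overlap_rejected)
```

Under this formula a single rejected chunk in the middle of the ranking lowers the score even though both remaining chunks are relevant. That makes the score sensitive to how overlapping chunks are judged.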


Stack I Work In

  • Evaluation: DeepEval, Claude as evaluator, LLM-as-judge patterns, custom eval harnesses, ground-truth scoring
  • RAG: pgvector, Supabase, LangChain, OpenAI embeddings, section-aware chunking
  • Agents: MCP, Claude agents, tool-use patterns, agentic loops, prompt optimization
  • Shipping: Python, TypeScript, Next.js, Vercel, SQL, Docker
  • Models: GPT-4o Vision, Claude Opus, Claude Sonnet, DeepEval for benchmarking


Writing

I write about product thinking, AI systems, and what I learn from building.

View all on Medium →


Background

  • Photon (Co-founder): Built a B2B SaaS payments platform for schools, reaching 75+ schools in India and $100K ARR with an 8-person team
  • Digital Connect: AI product — built and shipped features for university admin workflows
  • BS Computer Science · MSc Business Analytics, Trine University

What Sets Me Apart

Most AI PMs talk about outputs. I focus on whether the system is trustworthy.

That means evaluating the evaluator (DeepEval Issue #2594), designing products around failure modes before launch (SugarShield: 0 false negatives by design), and measuring improvement through behavior change, not vanity metrics (Self-Improving Prompt Agent: 0.10 → 0.80).

I don't just use AI tools. I find where they break, why they break, and what to ship next because of it.


LinkedIn · Email · Medium

Pinned

  1. finrag-eval (Public)

    RAG eval pipeline on Apple's FY 2024 10-K — found confident hallucinations, filed a metric-level bug in DeepEval, and built section-aware chunking.

    Python

  2. gitscope (Public)

    MCP-powered AI agent that analyzes GitHub repos and surfaces structured insights for product managers and founders.

    Python

  3. self-improving-prompt-agent (Public)

    Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 → 0.80 in 10 rounds.

    Python

  4. mechanictrust (Public)

    AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair.

  5. receiptiq (Public)

    AI-powered receipt extraction and finance dashboard using GPT-4o Vision.

    TypeScript