Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -305,7 +305,7 @@ Lockstep versioning — one dispatch bumps every product to the same `vX.Y.Z`. T
- [x] **7a — `VECTOR(N)` column type** *(v0.1.10)*: dense f32 vectors with bracket-array literal syntax (`[0.1, 0.2, ...]`); file format bumped to v4
- [x] **7b — Distance functions** *(v0.1.11)*: `vec_distance_l2/cosine/dot` + `ORDER BY <expr> LIMIT k` so KNN queries work end-to-end
- [x] **7c — Bounded-heap top-k optimization** *(v0.1.12)*
- [x] **7d — HNSW ANN index** *(v0.1.13–15)*: `CREATE INDEX … USING hnsw (col)`; recall@10 ≥ 0.95 at default `M=16, ef_construction=200, ef_search=50`; persisted as a `KIND_HNSW` cell tree
- [x] **7d — HNSW ANN index** *(v0.1.13–15, +SQLR-28)*: `CREATE INDEX … USING hnsw (col) [WITH (metric = '<l2|cosine|dot>')]`; recall@10 ≥ 0.95 at default `M=16, ef_construction=200, ef_search=50`; persisted as a `KIND_HNSW` cell tree, with the metric round-tripping through the synthesized `sqlrite_master` SQL
- [x] **7e — JSON column type + path queries** *(v0.1.16)*: `JSON` / `JSONB` columns stored as canonical text; `json_extract` / `json_type` / `json_array_length` / `json_object_keys`; `$.key`, `[N]`, chained JSONPath subset
- [x] **7g.1 — `sqlrite-ask` crate** *(v0.1.18)*: foundational natural-language → SQL via the [Anthropic API](https://docs.anthropic.com/) (Sonnet 4.6 by default), prompt-cached schema dump, sync `ureq` HTTP.
- [x] **7g.2 — REPL `.ask` + dep-direction flip** *(v0.1.19)*: `.ask <question>` meta-command with `Run? [Y/n]` confirmation. The wiring required dropping the engine dep from `sqlrite-ask` (cargo cycle) — `sqlrite-ask` is now pure over `&str` schemas; the `Connection`/`Database` integration moved to the engine's new `ask` feature. Public surface for callers: `use sqlrite::{Connection, ConnectionAskExt}`.
Expand Down
12 changes: 9 additions & 3 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -170,14 +170,20 @@ Even at 10k scale, the gap is large:

### W10 — vector top-10 (cosine), brute-force vs HNSW

10k 384-dim vectors. Two variants per the plan: brute-force (no index) and HNSW (`CREATE INDEX … USING hnsw`). SQLRite-only — `sqlite-vec` extension wiring is a follow-up (`rusqlite[bundled]` doesn't ship it; loading a pre-compiled `.dylib` at runtime is non-trivial and was out of scope for v1).
10k 384-dim vectors. Two variants per the plan: brute-force (no index) and HNSW (`CREATE INDEX … USING hnsw (embedding) WITH (metric = 'cosine')`, per SQLR-28). SQLRite-only — `sqlite-vec` extension wiring is a follow-up (`rusqlite[bundled]` doesn't ship it; loading a pre-compiled `.dylib` at runtime is non-trivial and was out of scope for v1).

| Variant | SQLRite median | Throughput |
The **W10.v1** numbers below were taken before SQLR-23 (parser-bound) and SQLR-28 (HNSW probe was L2-only, so the HNSW variant silently fell through to brute-force on cosine queries) — they are retained for historical context only. **W10.v3** ships with the cosine-built index + cosine-aware optimizer probe; republish under SQLR-25.

| Variant | SQLRite median (v1, retired) | Throughput |
|---|---|---|
| brute-force | ~122 ms | ~8 ops/s |
| hnsw | ~132 ms | ~7 ops/s |

**Read this as:** at 10k vectors × 384 dim, **HNSW barely beats brute-force**. That's not the index's fault — both numbers are dominated by the **per-iter SQL parse cost** (the 384-element bracket-array literal in the `ORDER BY` clause is ~4 KB of SQL the parser walks every iteration; the actual cosine work is ~3.8M FP ops ≈ a few ms). At a much larger corpus (millions of vectors) HNSW would dominate; at 10k the parser cost masks the algorithmic win. A future "prepare-vector-query-once" path or VECTOR-bind binding would surface the real HNSW vs brute-force gap.
**Read this as (v1, retired):** at 10k vectors × 384 dim, **HNSW barely beats brute-force**. Two reasons compounded:
1. **Per-iter SQL parse cost** — the 384-element bracket-array literal in the `ORDER BY` clause was ~4 KB of SQL the parser walked every iteration. Fixed in SQLR-23 (`Value::Vector` bind).
2. **Cosine queries silently brute-forced on the HNSW path** — the optimizer's `try_hnsw_probe` was L2-only; cosine queries never hit the graph. Fixed in SQLR-28 (per-index metric + matching probe).

W10.v3 measures the *actual* HNSW-vs-brute-force gap with both fixes in place.

### W11 — BM25 top-10

Expand Down
32 changes: 23 additions & 9 deletions benchmarks/src/workloads/vector.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,21 +3,23 @@
//! ```sql
//! CREATE TABLE vecs (id INTEGER PRIMARY KEY, embedding VECTOR(384));
//! -- 10k 384-dim vectors, deterministic per-id.
//! -- HNSW variant adds:
//! CREATE INDEX vecs_hnsw ON vecs USING hnsw (embedding);
//! -- HNSW variant adds (SQLR-28: cosine-built index, matched to the
//! -- query's vec_distance_cosine):
//! CREATE INDEX vecs_hnsw ON vecs USING hnsw (embedding)
//! WITH (metric = 'cosine');
//!
//! -- Hot loop:
//! SELECT id FROM vecs
//! ORDER BY vec_distance_cosine(embedding, [...]) ASC
//! LIMIT 10;
//! ```
//!
//! Two criterion groups land per driver: `W10.v1/brute-force` (no HNSW
//! Two criterion groups land per driver: `W10.v3/brute-force` (no HNSW
//! index — every probe full-scans + bounded-heap top-k) and
//! `W10.v1/hnsw` (with the HNSW index, optimizer probes the graph
//! per [`docs/supported-sql.md`](../../docs/supported-sql.md) "HNSW
//! indexes"). The gap between the two is the headline number for
//! "did Phase 7d's ANN actually deliver?"
//! `W10.v3/hnsw` (with the cosine-built HNSW index, optimizer probes
//! the graph per [`docs/supported-sql.md`](../../docs/supported-sql.md)
//! "HNSW indexes"). The gap between the two is the headline number
//! for "did Phase 7d's ANN actually deliver?"
//!
//! ## Comparator
//!
Expand All @@ -40,10 +42,18 @@ use crate::{Driver, Value, WorkloadId};
/// inlined as a 4 KB bracket-array literal in the SQL string. The
/// brute-force-vs-HNSW gap should widen materially because the
/// per-iter parser cost no longer dominates.
///
/// SQLR-28 — bumped again to v3: the HNSW variant now creates the
/// index `WITH (metric = 'cosine')`, matching the hot-loop SQL's
/// `vec_distance_cosine`. v1/v2 used the optimizer's L2-only probe,
/// which silently fell through to brute-force on a cosine query —
/// the HNSW variant was never actually exercising the graph. Numbers
/// from before v3 are not comparable to v3 numbers and have been
/// retired.
pub const W10: WorkloadId = WorkloadId {
id: "W10",
name: "vector-top10",
version: "v2",
version: "v3",
};

/// `(label, with_hnsw_index)` — two variants per driver.
Expand Down Expand Up @@ -71,9 +81,13 @@ pub fn setup<D: Driver>(
let dataset = vector_dataset();
insert_rows(driver, &mut conn, &dataset)?;
if with_hnsw {
// SQLR-28: build the graph for cosine — matches the hot-loop
// SQL's vec_distance_cosine. Without the metric clause the
// index defaults to L2 and the optimizer's metric gate falls
// through to brute-force, which is exactly the bug v3 fixes.
driver.execute(
&mut conn,
"CREATE INDEX vecs_hnsw ON vecs USING hnsw (embedding)",
"CREATE INDEX vecs_hnsw ON vecs USING hnsw (embedding) WITH (metric = 'cosine')",
)?;
}
Ok((conn, dataset))
Expand Down
5 changes: 3 additions & 2 deletions docs/phase-7-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,7 @@ SELECT id, title FROM docs ORDER BY embedding <-> [0.1, ...] LIMIT 10;
> - **✅ 7d.1 — Pure HNSW algorithm** *(~700 LOC, shipped in v0.1.13).* `src/sql/hnsw.rs` standalone module: insert + search + layer assignment + beam search per layer + L2/cosine/dot distance dispatch. No SQL integration yet — vectors are passed in via a `get_vec` closure so the algorithm doesn't depend on table types. Tests verify recall@k ≥ 0.95 vs brute-force on randomly-generated vector sets; deterministic via a fixed RNG seed.
> - **✅ 7d.2 — SQL integration** *(~500 LOC).* `CREATE INDEX … USING hnsw (col)` parser + engine, INSERT wiring (also calls `hnsw.insert()` incrementally), query optimizer hook (recognizes `ORDER BY vec_distance_l2(col, literal) LIMIT k` and probes the HNSW instead of full-scanning). HNSW lives in memory only at this point; the **CREATE INDEX SQL persists in `sqlrite_master` and reopen rebuilds the graph from current rows** — partial persistence ahead of 7d.3. DELETE/UPDATE on HNSW-indexed tables refused with helpful error pointing at 7d.3.
> - **✅ 7d.3 — Persistence** *(~600 LOC).* New `KIND_HNSW` cell tag and `HnswNodeCell` encoding (varint node_id + per-layer neighbor lists). Each HNSW index gets its own page tree parallel to secondary indexes. Open path loads cells directly into `HnswIndex::from_persisted_nodes` — no algorithm runs, exact bit-for-bit reproduction. Also unblocks DELETE / UPDATE on HNSW-indexed tables: those mark the index `needs_rebuild`, save rebuilds from current rows before staging. ~2× the original 300-LOC estimate because the cell encoding + tests + rebuild path together added more than expected.
> - **✅ 7d.4 (SQLR-28) — Per-index distance metric.** Q2's "deferred per-index metric knob" lands as `CREATE INDEX … USING hnsw (col) WITH (metric = '<l2|cosine|dot>')`. The metric is stored on `HnswIndexEntry` and round-tripped via the synthesized CREATE INDEX SQL in `sqlrite_master` (no file-format bump — pre-SQLR-28 rows omit the WITH clause and decode as L2). The optimizer's `try_hnsw_probe` widens to all three `vec_distance_*` functions but only fires when the query function matches the index's metric; mismatches fall through to brute-force. Surfaced by the SQLR-23 v2 bench: W10 uses cosine, the optimizer was L2-only, and the HNSW variant had been silently brute-forcing the entire time. SQLR-25 (republish v2 numbers) was the gating consumer.
>
> Each 7d.x ships as its own PR + release wave. The user-facing value lands at 7d.2; 7d.3 closes the persistence loop. 7d.1 is foundational but ships a tested algorithmic primitive on its own — useful as documentation of the engine's "from scratch" theme.

Expand Down Expand Up @@ -368,12 +369,12 @@ Q1–Q10 were resolved by the project owner on 2026-04-26. Each question keeps i

### Q2. HNSW parameters: fixed defaults or per-index configurable?

> **Decided: fixed defaults** (`M=16, ef_construction=200, ef_search=50`).
> **Decided: fixed defaults** (`M=16, ef_construction=200, ef_search=50`) for the algorithmic knobs. **Distance metric** *did* land as a per-index `WITH (metric = '<l2|cosine|dot>')` clause in **SQLR-28 / sub-phase 7d.4** — see the 7d split note above. Was deferred from the original 7d.2 cut; surfaced as a gap by the SQLR-23 v2 bench, where W10's cosine query had been silently brute-forcing because the optimizer hook was L2-only.

- **Fixed:** `M=16, ef_construction=200, ef_search=50`. Simpler API, less to test. Matches sqlite-vec's defaults.
- **Configurable:** `CREATE INDEX … USING hnsw (col) WITH (m=32, ef_construction=400)`. Power-user knobs, more code, more test matrix.

**Recommendation:** fixed defaults for MVP. Configurable can land as a follow-up if anyone actually asks.
**Recommendation:** fixed defaults for MVP. Configurable can land as a follow-up if anyone actually asks. (`metric` already came back as a follow-up; `m` / `ef_*` haven't been requested yet.)

### Q3. JSON storage format

Expand Down
15 changes: 9 additions & 6 deletions docs/supported-sql.md
Original file line number Diff line number Diff line change
Expand Up @@ -113,15 +113,18 @@ These are full-citizen indexes — they're visible via `.tables`-adjacent catalo
### HNSW indexes (Phase 7d)

```sql
CREATE INDEX <name> ON <table> USING hnsw (<vector_column>);
CREATE INDEX <name> ON <table> USING hnsw (<vector_column>)
[WITH (metric = '<l2|cosine|dot>')];
```

Builds an [HNSW](https://arxiv.org/abs/1603.09320) approximate-nearest-neighbor index over a `VECTOR(N)` column. The query optimizer recognizes `ORDER BY vec_distance_l2(col, literal) LIMIT k` (or the cosine / dot variants) on an HNSW-indexed column and probes the graph instead of full-scanning. SQLR-23 — the second arg can be either an inline `[...]` literal *or* a bound `Value::Vector(...)` parameter via `Statement::query_with_params`; the optimizer recognizes both, so prepared-statement KNN queries still take the graph shortcut.
Builds an [HNSW](https://arxiv.org/abs/1603.09320) approximate-nearest-neighbor index over a `VECTOR(N)` column. The query optimizer recognizes `ORDER BY vec_distance_l2(col, literal) LIMIT k` (or the cosine / dot variants) on an HNSW-indexed column **whose metric matches the query's distance function**, and probes the graph instead of full-scanning. SQLR-23 — the second arg can be either an inline `[...]` literal *or* a bound `Value::Vector(...)` parameter via `Statement::query_with_params`; the optimizer recognizes both, so prepared-statement KNN queries still take the graph shortcut.

- Recall@10 ≥ 0.95 at default parameters (`M=16`, `ef_construction=200`, `ef_search=50`). Parameters aren't tunable from SQL yet — see Q2 of [`docs/phase-7-plan.md`](phase-7-plan.md).
- The index is built incrementally on `INSERT`. `DELETE` / `UPDATE` mark the index `needs_rebuild`; the next save rebuilds from current rows.
- Persisted as a `KIND_HNSW` cell tree alongside the regular page hierarchy — open path loads the graph bit-for-bit, no algorithm runs.
- Without an HNSW index, the same `ORDER BY vec_distance_… LIMIT k` query still works — it just brute-force-scans every row (Phase 7c's bounded-heap top-k optimization keeps the memory footprint to O(k)).
The `WITH (metric = '…')` clause picks the distance the graph is built for. Three values are recognized: `'l2'` (Euclidean — the default, also accepts `'euclidean'`), `'cosine'`, and `'dot'` (negated dot-product — also accepts `'inner_product'` / `'ip'`). Omitting the clause is equivalent to `metric = 'l2'`, so pre-SQLR-28 catalogs round-trip unchanged. **The metric is not a query-time choice** — the graph topology depends on the metric used during INSERT (neighbour pruning is metric-specific), so a query whose `vec_distance_*` function doesn't match the index's metric falls through to brute-force rather than getting a wrong answer back from the graph. If you need both L2 and cosine probes on the same column, create two indexes.

- Recall@10 ≥ 0.95 at default parameters (`M=16`, `ef_construction=200`, `ef_search=50`). The `M` / `ef_*` knobs aren't tunable from SQL yet — see Q2 of [`docs/phase-7-plan.md`](phase-7-plan.md).
- The index is built incrementally on `INSERT`. `DELETE` / `UPDATE` mark the index `needs_rebuild`; the next save rebuilds from current rows under the same metric.
- Persisted as a `KIND_HNSW` cell tree alongside the regular page hierarchy — open path loads the graph bit-for-bit, no algorithm runs. The metric travels through the synthesized CREATE INDEX SQL in `sqlrite_master`; no file-format bump.
- Without an HNSW index — or with a metric mismatch — the same `ORDER BY vec_distance_… LIMIT k` query still works; it just brute-force-scans every row (Phase 7c's bounded-heap top-k optimization keeps the memory footprint to O(k)).

### FTS indexes (Phase 8)

Expand Down
Loading
Loading