
Fix XLSX ingestion memory spikes with streaming parser #42

Open
plasma16 wants to merge 1 commit into VectifyAI:main from plasma16:fix/xlsx-memory-spike

Conversation


@plasma16 plasma16 commented May 7, 2026

Summary

  • route .xlsx conversion through a streaming openpyxl reader (read_only=True, data_only=True)
  • cap scan bounds (max_rows=5000, max_cols=64) to prevent pathological worksheet ranges from exploding memory
  • stop scanning after sustained empty tails to avoid sparse-sheet runaway processing

Why

Some workbooks report huge used ranges (e.g. max_row=1048571) despite having very little real data, which can cause generic converters to consume excessive RAM.
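For context, the inflated range is visible directly from openpyxl's reported worksheet dimensions. A minimal diagnostic sketch, not part of this PR (the file name is hypothetical):

from openpyxl import load_workbook

# Open in streaming mode with cached values instead of formulas.
wb = load_workbook("sparse.xlsx", read_only=True, data_only=True)
for ws in wb.worksheets:
    # The recorded used range can be far larger than the real data,
    # e.g. a nearly empty sheet reporting max_row=1048571.
    print(ws.title, ws.max_row, ws.max_column)
wb.close()

A converter that trusts these dimensions and materializes the whole range (for example into a DataFrame) pays for every nominal cell, which is what the caps in this PR are meant to avoid.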

Result

Significantly lower memory use during XLSX ingest while preserving useful sheet content for KB compilation.

@plasma16 plasma16 force-pushed the fix/xlsx-memory-spike branch from 9075e4c to c3a1f11 on May 7, 2026 07:30
@KylinMountain
Collaborator

Code review

Found 1 issue:

  1. convert_xlsx_streaming silently truncates spreadsheets at max_rows=5000 and max_cols=64 with no warning or signal back to the caller. When a sheet exceeds either bound, the excess is dropped, the function returns a partial markdown string, convert_document writes it to wiki/sources/<doc_name>.md, and the file hash is registered as fully processed — so the next openkb add will skip it as "already known." The user has no way to discover that a 50k-row workbook only ingested its first 5k rows. The early-exit on empty_streak >= 200 has the same property. Every other branch in convert_document either converts the full document or raises.

The hard caps and the partial-success completion path:

def convert_xlsx_streaming(path: Path, *, max_rows: int = 5000, max_cols: int = 64) -> str:
    """Convert .xlsx to markdown using a memory-safe streaming reader.

    This avoids large in-memory DataFrames and pathological worksheet ranges
    that can cause extreme RAM usage in generic converters.
    """
    wb = load_workbook(filename=str(path), read_only=True, data_only=True)
    lines: list[str] = []
    try:
        for ws in wb.worksheets:
            lines.append(f"# Sheet: {ws.title}")
            lines.append("")
            rows_written = 0
            empty_streak = 0
            for row in ws.iter_rows(min_row=1, max_row=max_rows, min_col=1, max_col=max_cols, values_only=True):
                vals = []
                non_empty = False
                for cell in row:
                    if cell is None:
                        vals.append("")
                        continue
                    text = str(cell).strip()
                    vals.append(text)
                    if text:
                        non_empty = True
                if not non_empty:
                    empty_streak += 1
                    if rows_written == 0:
                        continue
                    # Stop after sustained empty tail to avoid scanning sparse sheets.
                    if empty_streak >= 200:
                        break
                else:
                    empty_streak = 0
                if non_empty:
                    lines.append(" | ".join(vals).rstrip())
                    rows_written += 1
            if rows_written == 0:
                lines.append("(No non-empty cells found within scan limits.)")
            lines.append("")
    finally:
        wb.close()
    return "\n".join(lines)


def convert_document(src: Path, kb_dir: Path) -> ConvertResult:

Suggested fix: when the row/col cap is reached or empty_streak breaks the scan, append an explicit truncation marker to the returned markdown (e.g. > Truncated: scan limits reached at row N / col M), log a warning, and ideally surface a flag on ConvertResult so the caller can decide whether to register the hash as complete.
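One possible shape for that signal, purely illustrative: ConvertResult's actual fields are not shown in this excerpt, so the dataclass and helper below are assumptions.

import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ConvertResult:
    markdown: str
    truncated: bool = False  # assumed flag; lets the caller decide how to register the hash


def finish_sheet_scan(lines: list[str], *, last_row: int, last_col: int, truncated: bool) -> ConvertResult:
    # Hypothetical helper: make truncation visible instead of silently returning a partial result.
    if truncated:
        lines.append(f"> Truncated: scan limits reached at row {last_row} / col {last_col}")
        logger.warning("XLSX scan truncated at row %d / col %d", last_row, last_col)
    return ConvertResult(markdown="\n".join(lines), truncated=truncated)

convert_document could then skip or downgrade the "already known" hash registration whenever truncated is set, so a later openkb add re-ingests the full workbook.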

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.
