
Fix XLSX ingestion memory spikes with streaming parser #42

Open
plasma16 wants to merge 1 commit into VectifyAI:main from plasma16:fix/xlsx-memory-spike

Conversation


@plasma16 plasma16 commented May 7, 2026

Summary

  • route .xlsx conversion through a streaming openpyxl reader (read_only=True, data_only=True)
  • cap scan bounds (max_rows=5000, max_cols=64) to prevent pathological worksheet ranges from exploding memory
  • stop scanning after sustained empty tails to avoid sparse-sheet runaway processing

Why

Some workbooks report huge used ranges (e.g. max_row=1048571) despite having very little real data, which can cause generic converters to consume excessive RAM.
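For context, the inflated range is visible directly from openpyxl's reported worksheet dimensions. A minimal diagnostic sketch, not part of this PR (the file name is hypothetical):

from openpyxl import load_workbook

# Open in streaming mode with cached values instead of formulas.
wb = load_workbook("sparse.xlsx", read_only=True, data_only=True)
for ws in wb.worksheets:
    # The recorded used range can be far larger than the real data,
    # e.g. a nearly empty sheet reporting max_row=1048571.
    print(ws.title, ws.max_row, ws.max_column)
wb.close()

A converter that trusts these dimensions and materializes the whole range (for example into a DataFrame) pays for every nominal cell, which is what the caps in this PR are meant to avoid.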

Result

Significantly lower memory use during XLSX ingest while preserving useful sheet content for KB compilation.

@plasma16 plasma16 force-pushed the fix/xlsx-memory-spike branch from 9075e4c to c3a1f11 on May 7, 2026 07:30
@KylinMountain
Collaborator

Code review

Found 1 issue:

  1. convert_xlsx_streaming silently truncates spreadsheets at max_rows=5000 and max_cols=64 with no warning or signal back to the caller. When a sheet exceeds either bound, the excess is dropped, the function returns a partial markdown string, convert_document writes it to wiki/sources/<doc_name>.md, and the file hash is registered as fully processed — so the next openkb add will skip it as "already known." The user has no way to discover that a 50k-row workbook only ingested its first 5k rows. The early-exit on empty_streak >= 200 has the same property. Every other branch in convert_document either converts the full document or raises.

The hard caps and the partial-success completion path:

def convert_xlsx_streaming(path: Path, *, max_rows: int = 5000, max_cols: int = 64) -> str:
    """Convert .xlsx to markdown using a memory-safe streaming reader.

    This avoids large in-memory DataFrames and pathological worksheet ranges
    that can cause extreme RAM usage in generic converters.
    """
    wb = load_workbook(filename=str(path), read_only=True, data_only=True)
    lines: list[str] = []
    try:
        for ws in wb.worksheets:
            lines.append(f"# Sheet: {ws.title}")
            lines.append("")
            rows_written = 0
            empty_streak = 0
            for row in ws.iter_rows(min_row=1, max_row=max_rows, min_col=1, max_col=max_cols, values_only=True):
                vals = []
                non_empty = False
                for cell in row:
                    if cell is None:
                        vals.append("")
                        continue
                    text = str(cell).strip()
                    vals.append(text)
                    if text:
                        non_empty = True
                if not non_empty:
                    empty_streak += 1
                    if rows_written == 0:
                        continue
                    # Stop after sustained empty tail to avoid scanning sparse sheets.
                    if empty_streak >= 200:
                        break
                else:
                    empty_streak = 0
                if non_empty:
                    lines.append(" | ".join(vals).rstrip())
                    rows_written += 1
            if rows_written == 0:
                lines.append("(No non-empty cells found within scan limits.)")
            lines.append("")
    finally:
        wb.close()
    return "\n".join(lines)


def convert_document(src: Path, kb_dir: Path) -> ConvertResult:

Suggested fix: when the row/col cap is reached or empty_streak breaks the scan, append an explicit truncation marker to the returned markdown (e.g. > Truncated: scan limits reached at row N / col M), log a warning, and ideally surface a flag on ConvertResult so the caller can decide whether to register the hash as complete.
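One possible shape for that signal, purely illustrative: ConvertResult's actual fields are not shown in this excerpt, so the dataclass and helper below are assumptions.

import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ConvertResult:
    markdown: str
    truncated: bool = False  # assumed flag; lets the caller decide how to register the hash


def finish_sheet_scan(lines: list[str], *, last_row: int, last_col: int, truncated: bool) -> ConvertResult:
    # Hypothetical helper: make truncation visible instead of silently returning a partial result.
    if truncated:
        lines.append(f"> Truncated: scan limits reached at row {last_row} / col {last_col}")
        logger.warning("XLSX scan truncated at row %d / col %d", last_row, last_col)
    return ConvertResult(markdown="\n".join(lines), truncated=truncated)

convert_document could then skip or downgrade the "already known" hash registration whenever truncated is set, so a later openkb add re-ingests the full workbook.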

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.
