CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes #4813

Open
Jens-G wants to merge 1 commit into apache:trunk from Jens-G:jens-g/CASSANDRA-21131/trunk

Jens-G commented May 15, 2026

Summary

COPY TO followed by COPY FROM corrupts text column values that contain backslashes: each round-trip doubles the backslash count. Reported in CASSANDRA-21131.

Before (one round-trip):

  • Stored: V\S → exported CSV: V\\\\S → re-imported: V\\S
  • Stored: \"Marianne"\ → re-imported: \\"Marianne"\\

list<text>, set<text>, map<text,text>, tuples and UDTs with text fields are affected in the same way.

Root Cause

format_value_text in formatting.py doubles backslashes unconditionally:

escapedval = val.replace('\\', '\\\\')

This is intentional for terminal display (SELECT output shows V\\S so the backslash is visible). However, ExportProcess.format_value in copyutil.py calls the same function when writing CSV. The csv.writer (configured with escapechar='\\') then escapes backslashes a second time, quadrupling them in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra.

Fix

Add an escape_backslash parameter (default True, preserving existing terminal display behaviour) to format_value_text, format_simple_collection, and all collection formatters. Pass escape_backslash=False from ExportProcess.format_value so the csv.writer handles all backslash escaping exclusively.

Changed functions:

  • format_value_text — new parameter
  • format_simple_collection — new parameter, propagated to element format_value calls
  • format_value_list, format_value_set, format_value_tuple — new parameter, forwarded to format_simple_collection
  • format_value_map — new parameter, propagated through subformat
  • format_value_utype — new parameter, propagated through format_field_value
  • ExportProcess.format_value in copyutil.py — passes escape_backslash=False
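
The parameter threading listed above can be pictured with a stripped-down stand-in. This is not the code from formatting.py: sketch_format_text and sketch_format_list are hypothetical names with simplified signatures, and the quoting of list elements is assumed purely for illustration.

```python
def sketch_format_text(val, escape_backslash=True):
    """Hypothetical stand-in for a text formatter.

    The default (True) keeps the terminal-display behaviour of doubling
    backslashes; False returns the value untouched so that a later CSV
    layer can do all the escaping.
    """
    if escape_backslash:
        return val.replace("\\", "\\\\")
    return val


def sketch_format_list(values, escape_backslash=True):
    """Hypothetical stand-in for a collection formatter.

    It forwards escape_backslash to every element so that nested text
    values follow the same rule as top-level text values.
    """
    inner = ", ".join(
        "'" + sketch_format_text(v, escape_backslash=escape_backslash) + "'"
        for v in values
    )
    return "[" + inner + "]"


# Terminal display path: default behaviour, backslashes doubled.
display_form = sketch_format_list(["a\\b"])

# CSV export path: the caller opts out of backslash doubling.
export_form = sketch_format_list(["a\\b"], escape_backslash=False)

print(display_form)
print(export_form)
```

Keeping the default at True means existing callers on the display path need no changes; only the export path has to pass the new argument.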

Test Plan

Two standalone Python test scripts (no running Cassandra cluster required) are attached to the JIRA ticket and verify the bug and fix:

  • test_cassandra_21131.py — 10 test cases for plain text columns: 5/10 pass before fix → 10/10 after
  • test_cassandra_21131_collections.py — 12 test cases for list/set/map<text>: 3/12 before → 12/12 after

Integration testing against a live cluster with the exact scenario from the bug report (COPY TO → TRUNCATE → COPY FROM → SELECT) is needed before merge.

Notes

  • A separate but related bug (UNICODE_CONTROLCHARS_RE converting control chars like \n to repr-notation \\n during CSV export) was discovered and will be tracked in a separate ticket.
  • The Generated-by: commit token is included per ASF generative tooling policy. The fix was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under human review and direction. All code has been verified manually.

🤖 Generated with Claude Code

…ng backslashes

format_value_text in formatting.py doubles backslashes for terminal display
(so SELECT output renders them visibly). When used via ExportProcess.format_value
for COPY TO, this pre-escaping is applied before csv.writer runs its own
backslash escaping (escapechar='\\'), resulting in quadrupled backslashes in the
CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes
in Cassandra — data corruption that compounds on every round-trip.

The fix adds an escape_backslash parameter (default True, preserving existing
terminal display behaviour) and passes escape_backslash=False from the CSV
export path in ExportProcess.format_value. The parameter is propagated through
format_simple_collection, format_value_list/set/tuple/map, and format_value_utype
so that collection types (list<text>, set<text>, map<text,text>, UDTs) are
covered as well.

Generated-by: Claude Sonnet 4.6 (Anthropic) with human review and direction
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

bschoening commented May 18, 2026

@Jens-G It seems the exporter sends values through the display formatter, which doubles backslashes for human-readable SELECT output, before handing them to the CSV writer. Claude suggests simply not using the display formatter for CSV export of text in copyutil.py. What do you think of that approach?

def format_value(self, val, cqltype):
    if val is None or val == EMPTY:
        return format_value_default(self.nullval, colormap=NO_COLOR_MAP)

    # Text-like values: pass through unmodified. The csv.writer will apply
    # CSV-level escaping using the dialect's escapechar. Running these
    # through the display formatter would double-escape backslashes, see
    # CASSANDRA-21131.
    if cqltype.type_name in ('text', 'varchar', 'ascii'):
        return val
    ...

Jens-G commented May 18, 2026

TBH if it works I'm fine with either approach 👍 🚀

bschoening left a comment

Consider an alternative approach where format_value() bypasses display formatting for text:

if cqltype.type_name in ('text', 'varchar', 'ascii'):
    return val

formatted = formatter(val, cqltype=cqltype,
                      encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
                      float_precision=cqltype.precision, nullval=self.nullval, quote=False,
                      escape_backslash=False,

Consider an alternative approach where format_value() bypasses display formatting for text.

                      escape_backslash=False,
                      decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
                      boolean_styles=self.boolean_styles)
return formatted

Please add unit tests
