
Implement invalidCharHandler for Single-Byte Encodings #388

Open
DecimalTurn wants to merge 4 commits into pillarjs:master from DecimalTurn:dev-invalid3

Conversation


DecimalTurn commented Mar 21, 2026

As suggested in #53 (comment), this PR implements invalidCharHandler as an optional property of EncodeOptions. I remain open to discussing this implementation and making changes to it.

Motivation

To keep the scope of this PR small and focused, it is limited to SBCS only. Even without support for all encodings, I consider this beneficial, since the majority of encodings that cannot represent all Unicode code points are single-byte.

I think this PR would solve an issue with the recent update to the VS Code EditorConfig extension. The issue arose when the extension tried to enforce latin1 encoding on save without checking whether the encoding would cause a loss of information (editorconfig/editorconfig-vscode#474). This could be prevented if VS Code were able to detect lossy conversions and block them with a callback, as mentioned here.

Implementation

As described in the README changes, the new feature would provide the following:

For single-byte encodings, invalidCharHandler can be used to observe unsupported characters and warn or throw. You can also stop early by returning true from the handler, in which case encode() returns null.

Having true stop the encoding and make encode() return null is obviously not mandatory, but it spares users from having to throw inside the handler and wrap the call in a try-catch when they want to stop the encoding on the first error.
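To make the intended contract concrete, here is a small standalone sketch (not the actual iconv-lite code; encodeLatin1 is a hypothetical stand-in, and it reports a plain UTF-16 index rather than the code-point index the PR uses):

```javascript
// Sketch of the proposed handler semantics: encode into a latin1-style
// single-byte buffer, report unmappable characters, and stop early
// (returning null) when the handler returns true.
function encodeLatin1 (str, invalidCharHandler) {
  var buf = Buffer.alloc(str.length)
  for (var i = 0; i < str.length; i++) {
    var charCode = str.charCodeAt(i)
    if (charCode > 0xFF) {
      // Unmappable in latin1: report it, and honor an early stop.
      if (invalidCharHandler(str.charAt(i), i) === true) return null
      charCode = 0x3F // substitute '?', mirroring a default char
    }
    buf[i] = charCode
  }
  return buf
}

var seen = []
var result = encodeLatin1('ab€c', function (ch, index) {
  seen.push(ch + '@' + index)
  return false // keep going; return true to abort and get null
})
// seen is ['€@2']; result holds [0x61, 0x62, 0x3F, 0x63]
```

A handler that simply returns true on the first call gives the "stop on first error" behavior without any try-catch.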

Note that the way GitHub presents the diff obscures the fact that my changes leave SBCSEncoder.prototype.write unchanged in behavior when no invalidCharHandler is present. Here is, in my opinion, a clearer diff that illustrates this:

```diff
SBCSEncoder.prototype.write = function (str) {
  var buf = Buffer.alloc(str.length)
+
+  var encodeBuf = this.encodeBuf
+  var invalidCharHandler = this.invalidCharHandler
+
+  if (typeof invalidCharHandler === "function") {
+    return encodeWithInvalidCharHandler(this, str, buf, encodeBuf, invalidCharHandler)
+  }
+
  for (var i = 0; i < str.length; i++) {
-    buf[i] = this.encodeBuf[str.charCodeAt(i)]
+    buf[i] = encodeBuf[str.charCodeAt(i)]
  }
  return buf
}

+function encodeWithInvalidCharHandler (encoder, str, buf, encodeBuf, invalidCharHandler) {
+  var defaultCharByte = encoder.defaultCharByte
+  var defaultCharCode = encoder.defaultCharCode
+  var codePointIndex = 0
+
+  for (var i = 0; i < str.length; i++) {
+    var charCode = str.charCodeAt(i)
+    var encodedByte = encodeBuf[charCode]
+
+    // `encodeBuf` uses default byte for unmappable chars. Disambiguate by
+    // allowing the codec character that genuinely maps to that default byte.
+    if (encodedByte !== defaultCharByte || charCode === defaultCharCode) {
+      buf[i] = encodedByte
+      codePointIndex++
+      continue
+    }
+
+    // If an unencodable char is a surrogate pair, pass the full pair to the handler once.
+    // Index is Unicode code-point based.
+    if (charCode >= 0xD800 && charCode <= 0xDBFF && i + 1 < str.length) {
+      var nextCharCode = str.charCodeAt(i + 1)
+      if (nextCharCode >= 0xDC00 && nextCharCode <= 0xDFFF) {
+        if (invalidCharHandler(str.slice(i, i + 2), codePointIndex) === true) {
+          return null
+        }
+        buf[i] = encodedByte
+        buf[i + 1] = encodeBuf[nextCharCode]
+        i++
+        codePointIndex++
+        continue
+      }
+    }
+
+    if (invalidCharHandler(str.charAt(i), codePointIndex) === true) {
+      return null
+    }
+    buf[i] = encodedByte
+    codePointIndex++
+  }
+
+  return buf
+}
```
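As a side note on the code-point-based index the handler receives: a surrogate pair occupies two UTF-16 units, so the two numbering schemes diverge after the first astral character. A quick illustration:

```javascript
// '😀' takes two UTF-16 code units but is a single Unicode code point,
// so positions after it differ between the two numbering schemes.
var s = 'a😀b'
var codePoints = Array.from(s) // iterates by code point

// s.length === 4 (UTF-16 units), codePoints.length === 3 (code points):
// 'b' sits at UTF-16 index 3 but at code-point index 2.
```

This is why the implementation above keeps a separate codePointIndex counter instead of passing the loop variable i to the handler.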

Regarding the issue of detecting whether the presence of the default character (such as ?) in the output is a sign that the current character can't be encoded (as discussed in https://github.com/pillarjs/iconv-lite/pull/283/changes), I solved the problem by passing down the defaultCharCode and defaultCharByte values so that we can perform the following check:

```javascript
if (encodedByte !== defaultCharByte || charCode === defaultCharCode)
```

Basically, we consider a character to be valid if it doesn't map to the defaultCharByte OR if its character code was already the defaultCharCode in the original string (i.e. the original string already contained the default character, such as ?).
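The same check can be illustrated with a tiny self-contained table (the lookup function below is a hypothetical latin1-style stand-in, not iconv-lite internals):

```javascript
// Default character '?' as both its code point and its encoded byte.
var DEFAULT_CHAR_CODE = 0x3F
var DEFAULT_CHAR_BYTE = 0x3F

// Tiny latin1-style encode table: identity below 0x100, default byte above.
function lookup (charCode) {
  return charCode <= 0xFF ? charCode : DEFAULT_CHAR_BYTE
}

// The disambiguation check: a char is encodable if it does NOT map to the
// default byte, OR if it literally was the default character to begin with.
function isEncodable (ch) {
  var charCode = ch.charCodeAt(0)
  var encodedByte = lookup(charCode)
  return encodedByte !== DEFAULT_CHAR_BYTE || charCode === DEFAULT_CHAR_CODE
}

// isEncodable('a') → true   (maps to 0x61, not the default byte)
// isEncodable('?') → true   (maps to the default byte, but legitimately)
// isEncodable('€') → false  (only hits the default byte because unmappable)
```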

Simplify approach by only needing encodeBuf with no isEncodeable

docs: Update README

Update invalidCharHandler behavior to return null on early cancellation
