
Implement invalidCharHandler for Single-Byte Encodings #388

Open
DecimalTurn wants to merge 4 commits into pillarjs:master from DecimalTurn:dev-invalid3

Conversation


DecimalTurn commented Mar 21, 2026

As suggested in #53 (comment), this PR implements invalidCharHandler as an optional property of EncodeOptions. I remain open to discussing this implementation and making changes to it.

Motivation

To keep the scope of this PR small and focused, it is limited to SBCS only. Even without support for all encodings, I consider this beneficial, since the majority of encodings that cannot represent all Unicode code points are single-byte.

I think this PR would solve an issue with the recent update to the VS Code EditorConfig extension. The issue arose when the extension tried to enforce latin1 encoding on save without checking whether the encoding would cause a loss of information (editorconfig/editorconfig-vscode#474). This could be prevented if VS Code were able to detect lossy conversions and block them with a callback, as mentioned here.

Implementation

As described in the README changes, the new feature would provide the following:

For single-byte encodings, invalidCharHandler can be used to observe unsupported characters and warn or throw. You can also stop early by returning true from the handler, in which case encode() returns null.

Having true stop the encoding and make encode() return null is obviously not mandatory, but it spares users from having to throw inside the handler and wrap the call in a try-catch when they want to stop the encoding on the first error.
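To make the intended contract concrete, here is a small standalone sketch (not the actual iconv-lite code; encodeLatin1 is a hypothetical stand-in, and it reports a plain UTF-16 index rather than the code-point index the PR uses):

```javascript
// Sketch of the proposed handler semantics: encode into a latin1-style
// single-byte buffer, report unmappable characters, and stop early
// (returning null) when the handler returns true.
function encodeLatin1 (str, invalidCharHandler) {
  var buf = Buffer.alloc(str.length)
  for (var i = 0; i < str.length; i++) {
    var charCode = str.charCodeAt(i)
    if (charCode > 0xFF) {
      // Unmappable in latin1: report it, and honor an early stop.
      if (invalidCharHandler(str.charAt(i), i) === true) return null
      charCode = 0x3F // substitute '?', mirroring a default char
    }
    buf[i] = charCode
  }
  return buf
}

var seen = []
var result = encodeLatin1('ab€c', function (ch, index) {
  seen.push(ch + '@' + index)
  return false // keep going; return true to abort and get null
})
// seen is ['€@2']; result holds [0x61, 0x62, 0x3F, 0x63]
```

A handler that simply returns true on the first call gives the "stop on first error" behavior without any try-catch.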

Note that the way GitHub presents the diff obscures the fact that my changes leave SBCSEncoder.prototype.write unchanged in behavior when no invalidCharHandler is present. Here is, in my opinion, a clearer diff that illustrates this:

```diff
SBCSEncoder.prototype.write = function (str) {
  var buf = Buffer.alloc(str.length)
+
+  var encodeBuf = this.encodeBuf
+  var invalidCharHandler = this.invalidCharHandler
+
+  if (typeof invalidCharHandler === "function") {
+    return encodeWithInvalidCharHandler(this, str, buf, encodeBuf, invalidCharHandler)
+  }
+
  for (var i = 0; i < str.length; i++) {
-    buf[i] = this.encodeBuf[str.charCodeAt(i)]
+    buf[i] = encodeBuf[str.charCodeAt(i)]
  }
  return buf
}

+function encodeWithInvalidCharHandler (encoder, str, buf, encodeBuf, invalidCharHandler) {
+  var defaultCharByte = encoder.defaultCharByte
+  var defaultCharCode = encoder.defaultCharCode
+  var codePointIndex = 0
+
+  for (var i = 0; i < str.length; i++) {
+    var charCode = str.charCodeAt(i)
+    var encodedByte = encodeBuf[charCode]
+
+    // `encodeBuf` uses default byte for unmappable chars. Disambiguate by
+    // allowing the codec character that genuinely maps to that default byte.
+    if (encodedByte !== defaultCharByte || charCode === defaultCharCode) {
+      buf[i] = encodedByte
+      codePointIndex++
+      continue
+    }
+
+    // If an unencodable char is a surrogate pair, pass the full pair to the handler once.
+    // Index is Unicode code-point based.
+    if (charCode >= 0xD800 && charCode <= 0xDBFF && i + 1 < str.length) {
+      var nextCharCode = str.charCodeAt(i + 1)
+      if (nextCharCode >= 0xDC00 && nextCharCode <= 0xDFFF) {
+        if (invalidCharHandler(str.slice(i, i + 2), codePointIndex) === true) {
+          return null
+        }
+        buf[i] = encodedByte
+        buf[i + 1] = encodeBuf[nextCharCode]
+        i++
+        codePointIndex++
+        continue
+      }
+    }
+
+    if (invalidCharHandler(str.charAt(i), codePointIndex) === true) {
+      return null
+    }
+    buf[i] = encodedByte
+    codePointIndex++
+  }
+
+  return buf
+}
```
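As a side note on the code-point-based index the handler receives: a surrogate pair occupies two UTF-16 units, so the two numbering schemes diverge after the first astral character. A quick illustration:

```javascript
// '😀' takes two UTF-16 code units but is a single Unicode code point,
// so positions after it differ between the two numbering schemes.
var s = 'a😀b'
var codePoints = Array.from(s) // iterates by code point

// s.length === 4 (UTF-16 units), codePoints.length === 3 (code points):
// 'b' sits at UTF-16 index 3 but at code-point index 2.
```

This is why the implementation above keeps a separate codePointIndex counter instead of passing the loop variable i to the handler.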

Regarding the issue of detecting whether the presence of the default character (such as ?) in the output is a sign that the current character can't be encoded (as discussed in https://github.com/pillarjs/iconv-lite/pull/283/changes), I solved the problem by passing down the defaultCharCode and defaultCharByte values so that we can perform the following check:

```javascript
if (encodedByte !== defaultCharByte || charCode === defaultCharCode)
```

Basically, we consider a character to be valid if it doesn't map to the defaultCharByte OR if its character code was already the defaultCharCode in the original string (i.e. the original string already contained the default character, such as ?).
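The same check can be illustrated with a tiny self-contained table (the lookup function below is a hypothetical latin1-style stand-in, not iconv-lite internals):

```javascript
// Default character '?' as both its code point and its encoded byte.
var DEFAULT_CHAR_CODE = 0x3F
var DEFAULT_CHAR_BYTE = 0x3F

// Tiny latin1-style encode table: identity below 0x100, default byte above.
function lookup (charCode) {
  return charCode <= 0xFF ? charCode : DEFAULT_CHAR_BYTE
}

// The disambiguation check: a char is encodable if it does NOT map to the
// default byte, OR if it literally was the default character to begin with.
function isEncodable (ch) {
  var charCode = ch.charCodeAt(0)
  var encodedByte = lookup(charCode)
  return encodedByte !== DEFAULT_CHAR_BYTE || charCode === DEFAULT_CHAR_CODE
}

// isEncodable('a') → true   (maps to 0x61, not the default byte)
// isEncodable('?') → true   (maps to the default byte, but legitimately)
// isEncodable('€') → false  (only hits the default byte because unmappable)
```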

Simplify approach by only needing encodeBuf with no isEncodeable

docs: Update README

Update invalidCharHandler behavior to return null on early cancellation
