Implement invalidCharHandler for Single-Byte Encodings#388
Open
DecimalTurn wants to merge 4 commits intopillarjs:masterfrom
Open
Implement invalidCharHandler for Single-Byte Encodings#388DecimalTurn wants to merge 4 commits intopillarjs:masterfrom
DecimalTurn wants to merge 4 commits intopillarjs:masterfrom
Conversation
Simplify approach by only needing encodeBuf with no isEncodeable docs: Update README Update invalidCharHandler behavior to return null on early cancellation
…problematic character
…nstead of utf-16 code unit position
This was referenced Mar 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
As suggested in #53 (comment), this PR implements the
invalidCharHandleras an optional argument part of theEncodeOptions. I remain open to discuss and make changes for this implementation.Motivation
To keep the scope of this PR small and focused, it's limited to only SBCS. I consider that even without having support for all encodings, this would still be beneficial since the majority of encodings that can't represent all Unicode codepoints remain single-byte.
I think this PR would solve an issue with the recent update to the VS Code EditorConfig extension. The issue arise when it tried to enforce
latin1encoding on-save without checking if the encoding would cause a loss of information (editorconfig/editorconfig-vscode#474). This could be prevented if VS Code was able to detect lossy conversions and prevent them with a callback as mentioned here.Implementation
As described in the README changes, the new feature would provide the following:
The fact that returning
truestops the encoding and make the encode function returnnullisn't mandatory obviously, but it can be nice if we don't want to people to use a try-catch and a throw if they want to stop the encoding on the first error.Note that the way GitHub presents the diff doesn't make it clear that my changes would not cause any changes to the
SBCSEncoder.prototype.writefunction if theinvalidCharHandleris not present. Here's a better diff of the changes in my opinion to illustrate that point:SBCSEncoder.prototype.write = function (str) { var buf = Buffer.alloc(str.length) + + var encodeBuf = this.encodeBuf + var invalidCharHandler = this.invalidCharHandler + + if (typeof invalidCharHandler === "function") { + return encodeWithInvalidCharHandler(this, str, buf, encodeBuf, invalidCharHandler) + } + for (var i = 0; i < str.length; i++) { - buf[i] = this.encodeBuf[str.charCodeAt(i)] + buf[i] = encodeBuf[str.charCodeAt(i)] } return buf } +function encodeWithInvalidCharHandler (encoder, str, buf, encodeBuf, invalidCharHandler) { + var defaultCharByte = encoder.defaultCharByte + var defaultCharCode = encoder.defaultCharCode + var codePointIndex = 0 + + for (var i = 0; i < str.length; i++) { + var charCode = str.charCodeAt(i) + var encodedByte = encodeBuf[charCode] + + // `encodeBuf` uses default byte for unmappable chars. Disambiguate by + // allowing the codec character that genuinely maps to that default byte. + if (encodedByte !== defaultCharByte || charCode === defaultCharCode) { + buf[i] = encodedByte + codePointIndex++ + continue + } + + // If an unencodable char is a surrogate pair, pass the full pair to the handler once. + // Index is Unicode code-point based. + if (charCode >= 0xD800 && charCode <= 0xDBFF && i + 1 < str.length) { + var nextCharCode = str.charCodeAt(i + 1) + if (nextCharCode >= 0xDC00 && nextCharCode <= 0xDFFF) { + if (invalidCharHandler(str.slice(i, i + 2), codePointIndex) === true) { + return null + } + buf[i] = encodedByte + buf[i + 1] = encodeBuf[nextCharCode] + i++ + codePointIndex++ + continue + } + } + + if (invalidCharHandler(str.charAt(i), codePointIndex) === true) { + return null + } + buf[i] = encodedByte + codePointIndex++ + } + + return buf +}Regarding the issue of detecting if the presence of
?or�is a sign that the current character can't be encoded (as discussed in https://github.com/pillarjs/iconv-lite/pull/283/changes), I solved the problem by passing down thedefaultCharCodeanddefaultCharBytevalues so that we can perform the following check:Basically, we consider a character to be valid if it doesn't map to the
defaultCharByteOR if the character code was already thedefaultCharCodein the original string (ie. we already had a?or�in the original string).