[fix](decimal) Fix incorrect decimal cast results for scientific-notation strings#63119
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Automated review completed. I did not find blocking issues in the actual PR diff, which is limited to decimal scientific-notation parsing and associated tests. The change keeps the parser scope focused, adds coverage for positive/negative exponents and rounding around the target scale, and preserves failure handling for malformed exponents.\n\nCritical checkpoint conclusions:\n- Goal/test coverage: Implements decimal scientific-notation parsing fixes and adds BE unit tests for representative parsing and rounding cases.\n- Scope: Small and focused on plus tests.\n- Concurrency/lifecycle/config/compatibility: Not applicable; no shared state, lifecycle, config, persistence, or wire/storage format changes in the actual PR patch.\n- Data correctness: Reviewed exponent decimal-point shifting, fractional padding, rounding carry, and overflow/underflow checks; no confirmed defect found.\n- Error handling: Malformed exponent suffixes still return parse failure; exponent overflow returns parse overflow.\n- Performance/memory: No new allocations or hot-path complexity beyond existing linear parsing; no MemTracker concern for this utility path.\n- User focus: No additional user-provided review focus was present.
There was a problem hiding this comment.
Automated review completed. I did not find blocking issues in the actual PR diff, which is limited to decimal scientific-notation parsing and associated tests. The change keeps the parser scope focused, adds coverage for positive/negative exponents and rounding around the target scale, and preserves failure handling for malformed exponents.
Critical checkpoint conclusions:
- Goal/test coverage: Implements decimal scientific-notation parsing fixes and adds BE unit tests for representative parsing and rounding cases.
- Scope: Small and focused on StringParser::string_to_decimal plus tests.
- Concurrency/lifecycle/config/compatibility: Not applicable; no shared state, lifecycle, config, persistence, or wire/storage format changes in the actual PR patch.
- Data correctness: Reviewed exponent decimal-point shifting, fractional padding, rounding carry, and overflow/underflow checks; no confirmed defect found.
- Error handling: Malformed exponent suffixes still return parse failure; exponent overflow returns parse overflow.
- Performance/memory: No new allocations or hot-path complexity beyond existing linear parsing; no MemTracker concern for this utility path.
- User focus: No additional user-provided review focus was present.
|
run buildall |
ad2828f to
3e39b7e
Compare
|
run buildall |
TPC-H: Total hot run time: 29251 ms |
TPC-DS: Total hot run time: 170689 ms |
3e39b7e to
8d5d793
Compare
|
run buildall |
TPC-H: Total hot run time: 29671 ms |
TPC-DS: Total hot run time: 170073 ms |
… very small scientific-notation values Problems: 1. String-to-decimal casting counted exponent characters as significand digits, so values such as "1.4E+2" could miss the exponent scale and return 14 instead of 140. 2. String-to-decimal parsing rounded scientific-notation values up even when implicit zeros placed the significant digit beyond the first discarded decimal scale position. Also add comments to help understand the code.
8d5d793 to
706c35f
Compare
|
run buildall |
TPC-H: Total hot run time: 29361 ms |
TPC-DS: Total hot run time: 170875 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
/review |
There was a problem hiding this comment.
Review opinion: approve. I did not find blocking correctness issues in this PR.
Checklist conclusions:
- Goal and proof: The change targets string-to-decimal scientific notation parsing and rounding for very small values. The parser now keeps the significand end separate from exponent syntax and gates rounding on the first actually discarded fractional digit. Added BE tests cover positive exponent shifts, leading-dot input, zero exponent, negative exponent, and the small-value rounding boundary.
- Scope: The code change is focused on ; the CRC expected-result updates are consistent with decimal parsing affecting the array decimal256 test data loaded from scientific notation CSV values.
- Concurrency/lifecycle: No new concurrency, locks, threads, globals, or lifecycle ownership changes.
- Configuration/compatibility: No new configs, persisted formats, thrift protocol, or rolling-upgrade compatibility concerns.
- Parallel paths: is the shared BE parser used by decimal cast/serde paths, so the fix applies to the relevant decimal widths through the existing template instantiations.
- Error handling/data correctness: Existing parse failure/overflow/underflow status paths are preserved; the reviewed edge cases keep exponent characters out of digit counts and avoid rounding when implicit zeros precede the first significand digit beyond scale.
- Tests: Targeted tests were added. I attempted to run Target system: Linux; Target arch: x86_64
Python 3.12.3
Check JAVA_HOME version
Apache Maven 3.9.14 (996c630dbc656c76214ce58821dcc58be960875b)
Maven home: /usr/share/apache-maven-3.9.14
Java version: 17.0.18, vendor: Eclipse Adoptium, runtime: /usr/lib/jvm/temurin-17-jdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.17.0-1010-azure", arch: "amd64", family: "unix"
cmake version 3.31.6
CMake suite maintained and supported by Kitware (kitware.com/cmake).
ninja 1.13.2
Get params:
PARALLEL -- 1
CLEAN -- 0
ENABLE_PCH -- ON
WITH_TDE_DIR --
Build Backend UT
Update apache-orc submodule ...
Update clucene submodule ...
-- Make program: /usr/local/bin/ninja
-- Use ccache: and
-- Extra cxx flags:
-- GLIBC_COMPATIBILITY is ON
-- USE_LIBCPP is OFF
-- USE_JEMALLOC is OFF
-- USE_UNWIND is ON
-- ENABLE_PCH is ON
-- USE_AVX2 is ON
-- Build type is ASAN_UT
-- Build target arch is x86_64
-- DORIS_HOME is /home/runner/work/doris/doris
-- THIRDPARTY_DIR is /home/runner/work/doris/doris/thirdparty/installed
-- make test: ON
-- make benchmark: OFF
-- build fs benchmark tool: OFF
-- build task executor simulator: OFF
-- build file cache lru tool: OFF
-- build gensrc if necessary
make: Entering directory '/home/runner/work/doris/doris/gensrc'
make -C script
make[1]: Entering directory '/home/runner/work/doris/doris/gensrc/script'
/home/runner/work/doris/doris/gensrc/script/gen_build_version.sh
make[1]: Leaving directory '/home/runner/work/doris/doris/gensrc/script'
make -C proto
make[1]: Entering directory '/home/runner/work/doris/doris/gensrc/proto'
/home/runner/work/doris/doris/thirdparty/installed/bin/protoc --proto_path=/home/runner/work/doris/doris/gensrc/proto --cpp_out=/home/runner/work/doris/doris/gensrc/proto/../build//gen_cpp /home/runner/work/doris/doris/gensrc/proto/descriptors.proto
make[1]: Leaving directory '/home/runner/work/doris/doris/gensrc/proto'
make: Leaving directory '/home/runner/work/doris/doris/gensrc'
-- Configuring incomplete, errors occurred!, but local setup failed before compilation because is missing in this runner.
- Observability/performance: No new observability appears necessary for this parser-only fix. The implementation remains single-pass over the parsed significand/exponent and does not add material allocations.
- User focus: No additional user-provided review focus was present.
There was a problem hiding this comment.
Review opinion: approve. I did not find blocking correctness issues in this PR.
Checklist conclusions:
- Goal and proof: The change targets string-to-decimal scientific notation parsing and rounding for very small values. The parser now keeps the significand end separate from exponent syntax and gates rounding on the first actually discarded fractional digit. Added BE tests cover positive exponent shifts, leading-dot input, zero exponent, negative exponent, and the small-value rounding boundary.
- Scope: The code change is focused on
StringParser::string_to_decimal; the CRC expected-result updates are consistent with decimal parsing affecting the array decimal256 test data loaded from scientific notation CSV values. - Concurrency/lifecycle: No new concurrency, locks, threads, globals, or lifecycle ownership changes.
- Configuration/compatibility: No new configs, persisted formats, thrift protocol, or rolling-upgrade compatibility concerns.
- Parallel paths:
string_to_decimalis the shared BE parser used by decimal cast/serde paths, so the fix applies to the relevant decimal widths through the existing template instantiations. - Error handling/data correctness: Existing parse failure/overflow/underflow status paths are preserved; the reviewed edge cases keep exponent characters out of digit counts and avoid rounding when implicit zeros precede the first significand digit beyond scale.
- Tests: Targeted tests were added. I attempted to run
./run-be-ut.sh --run --filter=FunctionCastToDecimalTest.test_from_string_scientific_notation:FunctionCastToDecimalTest.string_parser_scientific_rounding, but local setup failed before compilation becausethirdparty/installed/bin/protocis missing in this runner. - Observability/performance: No new observability appears necessary for this parser-only fix. The implementation remains single-pass over the parsed significand/exponent and does not add material allocations.
- User focus: No additional user-provided review focus was present.
…tion strings (#63119) Related PR: Bug introduced by #60004 Fix incorrect string-to-decimal parsing for scientific notation. Previously, decimal parsing could count exponent characters as part of the significand, causing values like `"1.4E+2"` to be cast incorrectly as `14` instead of `140`. It could also round very small scientific-notation values incorrectly when the significant digit appeared after implicit fractional zeros, such as `"5e-17"` for scale 15. This change makes the parser track the end of the significand separately from the exponent, applies exponent-based decimal-point shifting correctly, and only rounds when the next real significand digit is the first discarded fractional digit. Also add comments to help understand the code.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: Bug introduced by #60004
Fix incorrect string-to-decimal parsing for scientific notation.
Previously, decimal parsing could count exponent characters as part of the significand, causing
values like
"1.4E+2"to be cast incorrectly as14instead of140. It could also round verysmall scientific-notation values incorrectly when the significant digit appeared after implicit
fractional zeros, such as
"5e-17"for scale 15.This change makes the parser track the end of the significand separately from the exponent,
applies exponent-based decimal-point shifting correctly, and only rounds when the next real
significand digit is the first discarded fractional digit.
Also add comments to help understand the code.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)