Complex performance and correctness issues across multiple encodings require deep TextDecoder expertise.
The issue reports both correctness failures across multiple encodings and significant performance problems in the TextDecoder implementation. While the requirements are reasonably clear from the benchmark comparisons, fixing them requires deep knowledge of text encoding standards and of performance optimization. The maintainer discussion indicates this may involve cross-browser coordination.
Encodings that return invalid results:

- ibm866 (fails on even ascii input)
- koi8-u
- windows-874
- windows-1252
- windows-1253
- windows-1255
- gb18030
- gbk (should be identical to gb18030, but is instead broken)
- big5
- euc-jp
- iso-2022-jp
- shift_jis (fails on even ascii input)
- euc-kr

Unimplemented encodings that throw:

- iso-8859-16
- x-user-defined

If built without ICU, the utf-16le encoding also returns invalid results:
```js
> new TextDecoder('utf-16le').decode(Uint16Array.of(0xd800))
'�'      // correct
'\ud800' // no ICU
```
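The expected value here comes from the WHATWG Encoding Standard, which requires an unpaired surrogate to decode to U+FFFD. A minimal sketch of the same check, written against raw bytes rather than a Uint16Array:

```javascript
// A lone high surrogate (0xD800) encoded as two little-endian bytes.
const bytes = new Uint8Array([0x00, 0xd8]);
const out = new TextDecoder('utf-16le').decode(bytes);

// Per the WHATWG Encoding Standard, an unpaired surrogate must decode to
// U+FFFD REPLACEMENT CHARACTER; an ICU-less build instead leaks the raw
// surrogate code unit ('\ud800') into the result.
console.log(out === '\ufffd' ? 'spec-compliant' : 'broken: ' + JSON.stringify(out));
```

On a build with full ICU this prints `spec-compliant`; the issue is about the other branch.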
Performance issues:

- utf-8 (aka the default) TextDecoder is much slower on ascii input than it can and should be:
  - 1.3x on 4096 bytes, ~3x on 1 MiB input
- buffer.toString() too:
  - it's much slower on ASCII input than a checked js impl (same 1.3x-3x)
- windows-1252, aka new TextDecoder('ascii'), aka new TextDecoder('latin1'):
  - ~2x-4x slower than an optimized impl on ascii input
  - ~6x-12x slower than an optimized impl on latin1 input
  - ~7x-12x slower than an optimized js impl
- iso-8859-3, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-8-i, windows-1253, windows-1255, windows-1257 are >=10x slower than the js impl on ascii input (windows-1252 is only ~2-4x slower)

None of the above requires any changes on the native side; I compared against a somewhat optimized JS implementation.
See https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit
See tests in https://github.co