Fresh Hacker News | WorstFit: Unveiling Hidden Transformers in Windows ANSI

▲WorstFit: Unveiling Hidden Transformers in Windows ANSI(blog.orange.tw)

214 points by notmine1337 9 hours ago | 18 comments

▲vessenes 7 hours ago

This is a tough one. It’s systemic —- MS provides a “best fit” code mapping from wide Unicode to ASCII, which is a known, published, “vibes-based” mapper. This best fit parser is used a lottt of places, and I’m sure that it’s required for ongoing inclusion based on how MS views backward compatibility. It’s linked in by default everywhere, whether or not you know you included it.

The exploits largely revolved around either speccing an unusual code point that “vibes” into say a slash or a hyphen or quotes. These code points are typically evaluated one way (correct full Unicode evaluation) inside a modern programming language, but when passed to shell commands or other Win32 API things are vibes-downed. Crucially this happens after you check them, since it’s when you’ve passed control.

To quote the curl maintainer “curl is a victim” here — but who is the culprit? It seems certain that curl will be used to retrieve user supplied data automatically by a server in the future. When that server mangles user input in one way for validation and another when applied to system libraries, you’re going to have a problem.

It seems to me like maybe the solution is to provide an opt-out of “best fit” munging in the Win32 space, but I’m not a Windows guy, so I speculate. At least then open source providers could just add the opt out to best practices, and deal with the many terrible problems that things like a Unicode wide variant of “ or \ delivers to them.

And of course even if you do that, you’ll interact with officially shipped APIs and software that has not opted out.

▲wongarsu 7 hours ago

The opt-out is to use the unicode windows APIs (the functions ending in "w" instead of "a"). This also magically fixes all issues with paths longer than 260 characters (if you add a "\\?\" prefix or set you manifest correctly), and has been available and recommended since Windows XP.

I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Window 2000.

▲Sharlin 6 hours ago

As mentioned elsewhere in this discussion, 99% of the time the cause is likely the use of standard C functions (or C++ `std::string`…) instead of MS's nonstandard wide versions. Which of course is a ubiquitous practice in portable command-line software like curl.

▲pishpash 4 hours ago

So the culprit is still the software writer. They should have wrapped the C++ library for OS-specific behavior on Windows. Because they are publishing buggy software and calling it cross-platform.

▲bayindirh 4 hours ago

curl first released in 1996, shortly after Windows 95 has born and runs on numerous Windows versions even today. So, how many different versions shall be maintained? Are you going to help one of these versions?

On top of that, how many new gotchas these “modern” Windows functions hide, and how many fix cycles are required to polish them to the required level?

▲thrdbndndn 10 minutes ago

If we're talking about curl specifically, I absolutely think they would (NOT "should") fix/workaround it if there are actually common problems caused by it.

Yes it would have required numerous fix cycles, but curl in my mind is such a polished product and they would have bit the bullet.

▲ 1 hour ago

▲comex 6 hours ago

_Or_ set your application to use UTF-8 for the "A" APIs. Apparently this is supported as of a Windows 10 update from 2019. [1]

[1] https://learn.microsoft.com/en-us/windows/apps/design/global...

▲vessenes 7 hours ago

I think the issue is that native OS things like the windows command line, say, don’t always do this. Check the results of their ‘cd’ commands with Japanese Yen characters introduced. You can see that the path descriptor somehow has updated to a directory name with Yen (or a wide backslash) in it, while the file system underneath has munged, and put them into an actual directory. It’s precisely the problem that you can’t control the rest of the API surface to use W that is the source of the difficulties.

▲cesarb 5 hours ago

> I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Window 2000.

Nowadays, it's either for historical reasons (code written back when supporting Windows 9x was important, or even code migrated from Windows 3.x), or out of a desire to support non-Windows systems. Most operating systems use a byte-based multi-byte encoding (nowadays usually UTF-8) as their native encoding, instead of UTF-16.

▲p_ing 6 hours ago

> paths longer than 260 characters (if you add a "\\?\" prefix or set you manifest correctly)

A long ago released build of Windows 10 did this automatically so no need for adjustments anymore, 32k is the max....

...except for Office! It can't handle long paths. But Office has always been hacky (the title bar, for example).

▲mmastrac 8 hours ago

This is kind of unsurprising, but still new to me even as someone who did Windows development (and some Wine API hacking) for a decade around when this W/A mess came about.

Windows is like the card game Munchkin, where a whole bunch of features can add up to a completely, unbelievably random over-powered exploit because of unintentional synergy between random bits.

I'm happy to see that they are converting the ANSI subsystem to UTF-8, which should, in theory, mitigate a lot of these problems.

I wonder if the Rust team is going to need YetAnotherFix to the process spawning API to fix this...

▲Joker_vD 7 hours ago

> the only thing we can do is to encourage everyone, the users, organizations, and developers, to gradually phase out ANSI and promote the use of the Wide Character API,

This has been Microsoft's official position since NT 3.5, if I remember correctly.

Sadly, one of the main hurdles is the way Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented. Its non-standard "wide" functions like _wfopen(), _wgetenv(), etc. internally use W-functions from Win API. But the standard, "narrow" functions like fopen(), getenv(), etc., instead of using the "wide" versions and converting to-from Unicode themselves (and reporting conversion failures), simply use A-functions. Which, as you see, generally don't report any Unicode conversion failures but instead try to gloss over them using best-fit approach.

And of course, nobody who ports software, written in C, to Windows wants to rewrite all of the uses of standard functions to use Microsoft's non-portable functions because at this point, it becomes a full-blown rewrite.

▲terinjokes 7 hours ago

The position I got reading documentation Microsoft has written in the last two years is the opposite: set activeCodePage in your application manifest to UTF-8 and only ever use the "ANSI" functions.

▲ziml77 6 hours ago

Yes that does seem to be the way going forward. Makes it a lot easier to write cross-platform code. Though library code still has to use the Wide Character APIs because it's up to the application as a whole to opt into UTF-8. Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.

▲terinjokes 4 hours ago

Many libraries I've encountered have defines available now to use the -A APIs; previously they were using -W APIs and converting to/from UTF-8 internally.

As for my application, any wchar conversions being done by the runtime are a drop in the bucket compared to the actual compute.

▲Joker_vD 5 hours ago

Ah, so they've finally given up? Interesting to hear. But I guess the app manifests does give them a way to move forward this way while maintaining the backward-compatible behaviour (for apps without this setting in their manifests).

▲dblohm7 6 hours ago

What really annoys me these days is that if you search for a Win32 API on Google, it will always come up with the -A variant, not the -W variant. I don't know if they've got something weird in their robots.txt or what, but I find it bizarre that an API whose guidelines desire developers to use the -W variants in all greenfield code, instead returns the legacy APIs by default.

▲masfuerte 6 hours ago

In my portable code I #define standard functions like main and fopen to their wide equivalents when building on Windows.

This does mean I can't just use char* and unadorned string literals, so I define a tchar type (which is char on Linux and wchar_t on Windows) and an _T() macro for string literals.

This mostly works without thinking about it.

▲delta_p_delta_x 7 hours ago

> Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented

This has been superseded by the Universal C runtime (UCRT)[1] which is C99-compliant.

▲userbinator 1 hour ago

And of course making everything twice as big as it needs to be is also extremely repugnant.

▲nialv7 5 hours ago

Windows really should provide an API that treats path names as just bytes, without any of these stupid encoding stuff. Could probably have done that when they introduced UNC paths.

▲Joker_vD 5 hours ago

Windows does treat path names as just sequences of uint16_t (which is how NTFS stores them) if you use W-functions and prepend the paths with "\\?\".

▲nialv7 54 minutes ago

oh, that's interesting. do UNC paths not have to be valid UTF-16?

▲rubatuga 17 minutes ago

From what I can tell the largest vulnerability is argument passing to executables in Windows. Essentially it is very difficult to safeguard it. I've seen some CLI programs use the '--' to signify user input at the end, maybe this would solve this for a single argument scenario. Overall, this is an excellent article and vulnerability discovery.

▲layer8 5 hours ago

> until Microsoft chooses to enable UTF-8 by default in all of their Windows editions.

I don’t know how likely this is. There are a lot of old applications that assume a particular code page, or assume 1 byte per character, that this would break. There are also more subtle variations of this, like applications assuming that converting from wide characters to ANSI can’t increase the number of bytes (and hence an existing buffer can be safely reused), which isn’t the case for UTF-8 (but for all, or almost all, existing code pages). It can open up new vulnerabilities.

It would probably cause much less breakage to remove the Best-Fit logic from the win32 xxxA APIs, and instead have all unmappable characters be replaced by a character without any common meta semantics, like “x”.

▲kgeist 5 hours ago

Maybe they can introduce OS API versions (if there's no such thing yet) and require new (or updated) apps targetting new API versions/newer SDKs to assume UTF8 by default? So everything below a certain API version is emulated legacy mode. Windows already has the concept of shims to emulate behavior of different Windows versions.

▲layer8 4 hours ago

Apps can already opt-in to UTF-8 for the ANSI APIs (see https://news.ycombinator.com/item?id=42649122), or use the wide-character APIs.

▲tambre 2 hours ago

One example of such an application is Adobe After Effects [0]. Or at least used to be, I no longer use Windows.

[0] https://tambre.ee/blog/adobe_after_effects_windows_utf-8/

▲Dwedit 5 hours ago

There are two ways to force the "Ansi" codepage to actually be UTF-8 for an application that you write (or an EXE that you patch).

One way is with a Manifest file, and works as of a particular build of Windows 10. This can also be applied to any EXE after building it. So if you want a program to gain UTF-8 support, you can hack it in. Most useful for console-mode programs.

The other way is to use the hacks that "App Locale" type tools use. One way involves undocumented function calls from NTDLL. I'm not sure exactly which functions you need to call, but I think it might involve "RtlInitNlsTables" and "RtlResetRtlTranslations" (not actually sure).

▲segasaturn 7 hours ago

I've been inadvertantly safe from this bug on my personal Windows computer for years thanks to having the UTF-8 mode set, as shown at the bottom of the article. I had it set due to some old, foreign games showing garbled nonsense text on my computer. Have not noticed any bugs or side effects despite it being labelled as "Beta".

▲numpad0 4 hours ago

Interesting, to me that checkbox have done nothing but crashing too many random apps. I guess whether it works depends on the user's home codepage with it off.

▲nitwit005 1 hour ago

There are presumably some similar .Net COM issues when communicating with unmanaged code, as there is an attribute for controlling this conversion: https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

It directly mentions: "Setting BestFitMappingAttribute parameters in this manner provides an added measure of security."

▲cesarb 5 hours ago

> However, resolving this problem isn’t that as simple as just replacing the main() with its wide-character counterpart. Since the function signature has been changed, maintainers would need to rewrite all variable definitions and argument parsing logics, converting everything from simple char * to wchar_t *. This process can be painful and error-prone.

You don't need to convert everything from char * to wchar *. You can instead convert the wide characters you received to UTF-8 (or to something like Rust's WTF-8, if you want to also allow invalid sequences like unpaired surrogates), and keep using "char" everywhere; of course, you have to take care to not mix ANSI or OEMCP strings with UTF-8 strings, which is easy if you simply use UTF-8 everywhere. This is the approach advocated by the classic https://utf8everywhere.org/ site.

▲scoopr 4 hours ago

I was wondering if the beta checkbox the same thing as setting the ActiveCodePage to UTF-8 in the manifest, but the docs[0] clarify that GDI doesn't adhere to per-process codepage, but only a single global one, which is what the checkbox sets.

Bit of a shame that you can't fully opt-in to be UTF-8 with the *A API, for your own apps. But I think for the issues highlighted in the post, I think it would still be a valid workaround/defence-in-depth thing.

[0] https://learn.microsoft.com/en-us/windows/apps/design/global...

▲lilyball 2 hours ago

> Worse still, as the attack exploits behavior at the system level during the conversion process, no standard library in any programming language can fully stop our attack!

What happens if the standard library updates its shell escaping to also escape things like the Yen character and any other character that has a Best-Fit translation into a quote or backslash? Which is to say, what does Windows do for command-line splitting if it encounters a backslash-escaped nonspecial character in a quoted string? If it behaves like sh and the backslash simply disables special handling of the next character, then backslash-escaping any threat characters should work.

▲bangaladore 8 hours ago

I tend to agree that this is not an issue with many of the applications that are mentioned in the post.

Fundamentally this boils down to essentially bugs in functions that are supposed to transform untrusted into trusted input like the example they gave:

`system("wget.exe -q " . escapeshellarg($url));`

`escapeshellarg` is not producing a trusted output with some certain inputs.

▲blibble 6 hours ago

the escaping rules for windows are so complicated (and can vary with configuration) such that it's not possible to do it securely

vs. posix that just dumps the arguments directly into argv

▲hnlmorg 5 hours ago

Windows doesn’t really have an ARGV though. It’s a user space abstraction for compatibility with POSIX.

Windows technically just works on the principle of an executable name + a single argument. And it does this for compatibility with DOS.

So you end up with this stupid escaping rules you’ve described so there are compatibility conventions at the kernel level with earlier implementations of Windows, which in turn maintained compatibility with MS-DOS. While providing a C abstraction that’s compatible with POSIX.

Which is just one of many reasons why it’s a nightmare to write cross platform shells that also target Windows.

▲bangaladore 4 hours ago

> the escaping rules for windows are so complicated (and can vary with configuration) such that it's not possible to do it securely

This is bold claim.

Is it not possible? Or not easy to do correctly?

▲blibble 4 hours ago

all the kernel passes to executables is one long string

and then every program handles it in whatever way it feels is best

as examples: go/java/python all process arguments slightly differently

even microsoft's libc changes handling between versions

given it's not possible to know what parser a specific target program is going to use: it's not possible to generically serialise an array safely

▲layer8 5 hours ago

> And yes, Python’s subprocess module can’t prevent this.

A reasonably sane solution would be for it to reject command line arguments on Windows that contain non-ASCII characters or ASCII characters that aren’t portable across code pages (not all code pages are a superset of US-ASCII), by default, and to support an optional parameter to allow the full range, documenting the risk.

▲ppp999 4 hours ago

Character encoding has been such a mess for so long it's crazy.

▲ok123456 6 hours ago

Bush hid the facts

▲cesarb 5 hours ago

> Bush hid the facts

For those who don't know the reference: https://en.wikipedia.org/wiki/Bush_hid_the_facts it's a vaguely related issue, in which a Windows component misinterprets a sequence of ASCII characters as a sequence of UTF-16 characters. Windows just seems full of these sorts of character-handling bugs, in part due to its long history as a descendant of the codepage-using MS-DOS and 16-bit Windows operating systems.

▲ 6 hours ago

▲mouse_ 8 hours ago

Unicode on modern systems is absolutely terrifying. Anyone remember the black dot of death? https://mashable.com/article/black-dot-of-death-unicode-imes...

▲Randor 8 hours ago

That was a long read. Just be happy that you never had to deal with Trigraphs. https://learn.microsoft.com/en-us/cpp/c-language/trigraphs?v...

▲tiahura 2 hours ago

Imagine no Unicode, It’s easy if you try, No bytes that bloat our systems, No errors make us cry. Imagine all the coders, Living life in ASCII…

Imagine no emojis, Just letters, plain and true, No accents to confuse us, No glyphs in Sanskrit too. Imagine all the programs, Running clean and fast…

You may say I’m a dreamer, But I’m not the only one. I hope someday you’ll join us, And encoding wars will be done.