Rust intentionally provides the simplest possible growable string buffer String, which is literally (under the hood, you can't poke this legitimately) Vec<u8> plus the promise that this is UTF-8 text.
But you might find your needs better served by one (or several) of:
Box<str> -- you don't need capacity, so, don't store it => length == capacity
CompactString -- use the entire 24 bytes for SSO, up to 24 bytes of UTF-8 inline, obviously doesn't make sense if all or the vast majority of your strings are 25 bytes or longer
ColdString -- same idea but for 8 bytes, and also not storing capacity, this only makes sense over Box<str> if you have plenty of <= 8 byte strings
Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated.
Bump allocator: your strings are &str, allocation is super fast with limited fragmentation.
Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer.
Yes. It is exactly how they are described.
https://docs.rs/string_cache/latest/string_cache/struct.Atom...
> Represents a string that has been interned.
These aren't really optimizations. They are specialized implementations that introduce design and architectural tradeoffs.
For example, Rust's Atom represents a string that has been interned, and it's actually an implementation of a design pattern popular in the likes of Erlang/Elixir. This is essentially a specialized implementations of the old Flyweight design pattern, where managing N independent instances of an expensive read-only object is replaced with a singleton instance that's referenced through a key handle.
I would hardly call this an optimization. It actually represents a significant change to a system's architecture. You have to introduce a set of significant architectural constraints into your system to leverage a specific tradeoff. This isn't just a tweak that makes everything run magically leaner and faster.
In my opinion, there's no magic in the software engineering. Everything (or almost everything) is a system that can be described, explained, modified and so on. Applications, libraries, operating systems, kernels, CPUs/RAM/GPU/NPU/xPU/whatever silicon there is, ALUs/etc, transistors, electricity, physics... That's nowhere near "magic". There's always some trade-offs, it's just that you may not be aware of them initially.
`String::as_vec_mut` kinda implies that, since it gives you access to that underlying `Vec` which must then exist somewhere.
In case anyone else was wondering it, yes, it's "unsafe".
Commonly as... conversions are actually no-ops at runtime (the type changes but the data does not, no CPU instructions are emitted) whereas to... conversions might do quite a lot, especially if they bring into existence an actual thing at runtime -- maybe Goose::to_donkey actually needs to go allocate memory for a Donkey and destroy the Goose.
Yes it's unsafe because the Vec doesn't enforce the promise we made about this being UTF-8 text whereas String did, so now that promise is ours to keep and `unsafe` is how we signify that you the programmer took on the responsibility for safety here.
Option<Box<SmithyTraits> -> SmithyTrait^?I think I agree though, especially with Option. Swift’s option syntax (and kotlin’s which is similar) is so much better, a simple question mark in the type. Options are important enough that dedicated syntax makes so much sense. Rust blew their chance here with ? meaning “maybe early return”, it would have been a lot more useful as an Option indicator.
Note that both Option and Result implement that same trait.
Perhaps if try blocks ever become a thing... we can finally use it for our own types ;)
It's especially problematic because traits don't have memory behaviors like this article in most cases - by default they're unsized, because it's a description of behavior, not data, and you can't even use them as a struct field without extra work.
Like, replace "trait" in here with "box" and see how confusing it would be to be describing how you saved memory by boxing your box, because option doesn't box like many other languages do.
Perhaps because this feels like a fairly rust-specific gotcha. Especially if you're coming from languages where there's often not much syntactical distinction made between "this is a pointer because I don't want to be copying it" and "this is a pointer because it's optional."
For instance, it's not until now that I actually understood what the sibling comment about the Enum type size discrepancy lint meant: "This lint obviously cannot take the distribution of variants in your running program into account. It is possible that the smaller variants make up less than 1% of all instances, in which case the overhead is negligible and the boxing is counter-productive. Always measure the change this lint suggests." I had always accidentally read this backwards, thinking it meant something more to the effect of "if most of the instances are actually small, then it's not a problem here, but be aware that some of them are much larger so some of your calls to things with this could end up passing much larger types."
And in this particular context None of Option<CompactString> is the bit pattern for a carefully chosen impossible 24 byte slice, all zeroes is of course a completely valid way to spell 24 of the ASCII NUL U+0000 character so we can't use that to signify None, but many 24 byte slices are not valid UTF-8 encodings.
When some day I get to make my own BalancedI8 in stable Rust (the 8-bit signed integer except without the slightly annoying and rarely needed most negative value -128) then None of Option<BalancedI8> will occupy the bit pattern for -128 which is 0x80
Do you know a solution to this?
So, the long answer with lots of reading material is that you want "Pattern Types" and who knows when that could land in Rust. Pattern types are a simple Refinement Type, you can go absolutely wild with refinement in theory and some niche languages go much deeper but all we want here is to say "Only these values of the type are allowed" and by implication all bit patterns not used for those values are a niche and Rust would optimise accordingly.
Rust doesn't (even unstably) have Pattern Types so you might wonder how NonZeroU32 works for example, or indeed OwnedFd, and similar types. For these types there's a permanently unstable "compiler only" feature flag which allows you to explicitly specify the niche, that's how they're made. So, if you're comfortable writing unstable nightly-only software you can do this today, go look at the guts of NonZeroU32 for example.
If you want to write Rust software for ordinary people who use stable Rust you have two options today and for the indefinite future in which Pattern Types are not stable:
1. Enumerations: An enumeration which has say 5 values clearly doesn't need byte values 0x5 through 0xFF, so Rust stably promises this is a niche. For BalancedI8 that's kinda messy with 255 values but tolerable, some real crates use this trick or one related to it and they work fine - for a hypothetical BalancedI16 it's imaginable to create 65535 values but silly, and for BalancedI32 clearly we should not expect the compiler to accept an enumeration with 4 billion+ named values so no...
2. XOR trick: Rust provides NonZeroI8 for example. We can "make" BalancedI8 by providing accessors which always XOR -128 (0x80) with the value and then actually internally we store the NonZeroI8 instead, at runtime now operations incur an extra XOR which is a very cheap ALU operation. This works for any "Not this" value because of the properties of the XOR operation, so unlike with enums this is practical for BalancedI64 for example.
Eventually rust will likely gain pattern types as a more generally useful version of the features used in nook. There is even some actively ongoing work on this [3]
1: https://github.com/tialaramex/nook/blob/main/src/balanced.rs
Clippy is essentially a linter; and one of its checks catches cases where different enum variants have a significantly different size; with a suggestion to Box the larger variant.
Since this is just a linter, it doesn't actually have any knowledge of how frequently each variant is actually used. It also doesn't address the situation in the article at all.
> a lot of boxes means a fragmented heap. In such case it's not a problem but this might be worth keeping in mind.
A good malloc will be able to handle this without issue due to various optimizations specifically that inherently fight fragmentation. Default Linux malloc (glibc) may have issues but I did say good malloc (and even glibc generally shouldn’t struggle with the pattern described I think).
Without that, if you try to suggest a transformation like this when the schema is first conceived, it will likely be considered premature optimization.