The next step would be reproducible builds (if that's not already the case).
https://arxiv.org/abs/2405.17238 https://arxiv.org/abs/2404.02056 https://arxiv.org/abs/2404.19715 https://www.sciencedirect.com/science/article/pii/S266638992...
If you just want to see whether meaningful differences occur at all, this seems fine. But I wouldn’t use it as a guarantee.
However, LLMs are incredibly naive, so they could be easily fooled by a malicious actor (probably by something as simple as adding a comment that this is definitely NOT a backdoor).
- files not being tracked in the repo
- files in the repo not being included in the published crate
- publishing with --allow-dirty from a local copy of the repo with changes that haven't been committed
- publishing from a commit that hasn't been pushed
I'm sure there are more. (A rough check for the first two cases is sketched below.)
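One way to catch the first two cases is to diff the file list inside the published .crate tarball (which is just a gzipped tar) against `git ls-files` at the corresponding commit. A minimal sketch in Go, where the tarball name and repo path are made up for illustration:

```go
// cratediff.go: list the files inside a .crate tarball and compare
// against the files tracked by git in a local checkout of the repo.
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Assumed paths for illustration.
	crate, repo := "foo-1.2.3.crate", "./foo"

	f, err := os.Open(crate)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	gz, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}

	inCrate := map[string]bool{}
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		// Entries are prefixed with "<name>-<version>/"; strip that.
		if i := strings.IndexByte(hdr.Name, '/'); i >= 0 {
			inCrate[hdr.Name[i+1:]] = true
		}
	}

	out, err := exec.Command("git", "-C", repo, "ls-files").Output()
	if err != nil {
		panic(err)
	}
	inRepo := map[string]bool{}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		inRepo[line] = true
	}

	for p := range inCrate {
		if !inRepo[p] {
			fmt.Println("in crate, not tracked in repo:", p)
		}
	}
	for p := range inRepo {
		if !inCrate[p] {
			fmt.Println("tracked in repo, not in crate:", p)
		}
	}
}
```

Expect some baseline noise, since Cargo adds a few files of its own (a normalized Cargo.toml, and .cargo_vcs_info.json recording the commit when the repo was clean at publish time). The last two cases require comparing file contents against that recorded commit, not just names.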
It definitely does contain generated files: at least one crate has Rust code generated by a Python script that is not in the crate, only in the upstream Git repository.
"the best way"? Please make the argument for why. To do it properly, you must steel-man the alternatives (not shoot down straw-men)
If you introduce a backdoor into the compilation step, you run a much greater risk of detection. As long as there are multiple machines compiling packages and verifying whether the checksums match, any single backdoored machine will immediately be caught. This is much more important for package managers that do their own builds and ship their own binaries than for those who just ship whatever they got from the developer.
Without deterministic compilation, two builds of the exact same code might differ. This makes backdoors very hard to detect unless you have prior suspicion that one is present in a particular program.
Deterministic compilation forces people to embed backdoors directly in the source code repository, which creates an audit trail, is very visible in diffs, much easier to catch in reviews and so on. You can still get away with it (see the XZ situation), but it requires far more work.
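To make the detection mechanism concrete: each builder hashes its output and the digests are compared, so one odd digest out identifies the compromised (or merely nondeterministic) machine. A minimal sketch, with the artifact paths made up for illustration:

```go
// verify.go: given the same package built independently on several
// machines, a single mismatching SHA-256 exposes the odd builder out.
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

func main() {
	// Assumed: the same build output collected from three machines.
	builds := []string{"build-a/pkg.tar", "build-b/pkg.tar", "build-c/pkg.tar"}

	hashes := make(map[string][]string) // hex digest -> builds with that digest
	for _, path := range builds {
		data, err := os.ReadFile(path)
		if err != nil {
			panic(err)
		}
		sum := fmt.Sprintf("%x", sha256.Sum256(data))
		hashes[sum] = append(hashes[sum], path)
	}

	if len(hashes) == 1 {
		fmt.Println("all builders agree")
		return
	}
	// With deterministic compilation, any disagreement is a red flag;
	// without it, this check cannot tell a backdoor apart from noise.
	for sum, who := range hashes {
		fmt.Printf("%s: %v\n", sum[:12], who)
	}
}
```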
It is important to remember that crates.io doesn't store binaries.
Prove me wrong! I'm open to it.
Or be that person who is too lazy to respond with an actual comment, downvotes, and probably assumes they are right.
https://central.sonatype.org/publish/requirements/immutabili...
You mean giving it the Go ahead? ;)
https://github.com/apex/up-examples/blob/master/oss/golang-g...
The crates.io approach instead defends you both from a dependency changing silently and from it disappearing from one day to the next, without having to deal with vendoring.
Go does not delete modules from the module proxy except for copyright/legal reasons (I suspect crates.io would delete for these reasons too), and additionally the checksums of all modules are published in a tamper-proof transparency log (https://sum.golang.org/) so if they did alter or delete a module, it would be detected.
1. The proxy delivers the changed code while issuing a warning to the end user, which, if the warning is overlooked, would mean that the checksum achieved nothing.
2. The proxy ignores the changed code while delivering the old code and a warning to the user, which would introduce the same concern raised here that the code that gets delivered and the code listed on Github have no requirement to be the same.
3. The proxy refuses to deliver any code and issues a warning to the user, which would mean that anyone can effectively remove their code from the proxy by simply changing the tag to something else.
I would be interested to know which one Go actually goes with, because none of these are ideal.
If the Git tag changes, the proxy returns the original code. There is no warning that the Git tag has changed, but it can't reliably detect this anyways because the Git repository could be returning different content to different clients. I don't think there is much difference between crates.io and the Go module proxy in this regard - in neither ecosystem can you assume the Git repo matches what the packaging tool downloads. (I pointed this out here: https://news.ycombinator.com/item?id=40699948)
Where Go is different is that it provides assurance that the module proxy is providing the same code to everyone, eliminating the module proxy as a potential source of compromise. It also ensures that people who disable the module proxy and fetch code directly using the go command get the same code that the module proxy has. To reiterate, this does not help with doing code audits of Git repos - you have to either audit the code in the module proxy, or compute the checksum of the Git repo to make sure it matches the sumdb.
I don't know of a single modern build tool that can do this without requiring or recording this information specifically. Maybe earlier versions of Go? (I know they've gone through a few changes in module/import strategies.)
Ultimately, if you're doing a code audit, you have to compute the checksum of the code that you're looking at, and compare it against the entry in go.sum or the checksum database to make sure you're auditing the right copy.
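That comparison can be scripted with the same library the toolchain uses, golang.org/x/mod/sumdb/dirhash. A minimal sketch, assuming an unpacked local copy of the module and an expected hash copied out of go.sum:

```go
// audit.go: compute the go.sum-style hash of a local module tree and
// compare it against the expected entry from go.sum / sum.golang.org.
package main

import (
	"fmt"

	"golang.org/x/mod/sumdb/dirhash"
)

func main() {
	// Assumed values for illustration.
	dir := "./example.com/mymod@v1.2.3"  // unpacked module source
	prefix := "example.com/mymod@v1.2.3" // module path@version prefix
	expected := "h1:<hash copied from go.sum or sum.golang.org>"

	// Hash1 is the same algorithm used for go.sum entries.
	got, err := dirhash.HashDir(dir, prefix, dirhash.Hash1)
	if err != nil {
		panic(err)
	}
	if got != expected {
		fmt.Println("MISMATCH: you are not auditing the code everyone else gets")
		fmt.Println("got:     ", got)
		fmt.Println("expected:", expected)
		return
	}
	fmt.Println("ok: local tree matches the checksum database entry")
}
```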
However, publishing a list that basically says "these are the least trustworthy Rust users" would cause quite a stir, so I'm not doing that.
They work on a different model, where only packages that are deemed "worthy" are included, and there's a small-ish set of maintainers that are authorized to make changes and/or accept change requests from the community. In contrast, programming language package managers like cargo, pip or npm let anybody upload new packages with little to no prior verification, and place the responsibility of maintaining them solely on their author.
The distribution way of doing things is sometimes necessary, as different distributions have different policies on what they allow in their repositories, might want to change compilation options or installation paths, backport bug and security fixes from newer project versions for compatibility, or even introduce small code changes to make the program work better (or work at all) on that system.
One example of such a repository, for the Alpine Linux distribution, is at https://github.com/alpinelinux/aports
The problem with it is that you need the full Git URL in every file where you import it, which is a pain if the repo changes location, or if you want to use a fork or a local version. Versioning is also tricky, to the point that Go recommends creating a separate branch for a major/breaking version, which requires updating every import statement.
I think a good middle ground would be to have a central repository and/or package configuration file that maps package names to git repos and versions to commits (possibly via tags). And of course use hashes to lock the version to specific contents.
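For illustration, the whole mapping could be as small as this; the names and shape here are hypothetical, not any existing tool's format:

```go
// A hypothetical registry entry: the short package name maps to a repo,
// each version pins a commit, and a content hash locks the actual bytes.
package registry

type Version struct {
	Commit string // full commit SHA the tag resolved to at publish time
	Hash   string // hash of the tree contents, so a moved tag is detected
}

type Package struct {
	Name     string             // what you write in import/config files
	RepoURL  string             // where the code actually lives today
	Versions map[string]Version // "1.4.0" -> pinned commit + content hash
}

// Resolve returns the pinned version, so a relocated repo or a
// retagged release cannot silently change what gets built.
func (p Package) Resolve(version string) (Version, bool) {
	v, ok := p.Versions[version]
	return v, ok
}
```

The point is that the human-facing name is decoupled from the repo location, while the pinned commit and content hash keep resolution tamper-evident.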
Bazel kind of does this, but it doesn't have any built-in version resolution or transitive dependency resolution (although in some cases there are other tools that help). And it can add a lot of complexity that you may not need.
Not tried them, but they look like a reasonable dependency-handling solution on paper: each module can declare its own dependencies and Bazel will figure it out for you like a package manager. Their old workspaces way of doing it was a nightmare: while patterns emerged where repos would export a function to register their dependencies, the first declaration of any name would win, so you weren't guaranteed to have a compatible set of workspaces at the end.
Because what you want exists and has a thriving community, and a package set that outclasses, well, statistically every other package manager in existence.
I swear, it's a daily occurrence for me to see software engineering challenges posited here as damn near impossible that Nix has been solving for over a decade.
What if you could run a single command and have exact insight into the source you're using for every single package on your system, with the context of the dependency graph it exists in?
I cannot wait for this wave to crash and for people to realize how much engineering effort Nix saves, and that all of these things they've wanted for years already exist. But hey, the syntax takes time to get used to, and how do you weigh that against the countless blog posts, hours, and institutional knowledge you need to actually use Docker properly? And then, later on, some Go-based SBOM tool made by a VC-backed startup that fundamentally still does an inferior job to Nix. Sigh.
Well anyway I guess nix will keep being used by hedge funds, algorithmic traders, "advanced defensive capabilities" companies, literal (launched, in space) satellites, wallet manufacturers, etc, while everyone else listens to the syntax decriers.
But the source code that is sent to crates.io is not necessarily the same as what's in the public repo linked from the crate.