155 points by robin_reala 9 days ago | 6 comments
jraph 9 days ago
Good initiative. Now people need to go through this and do the reviews :-)

Next step would be to do reproducible builds (if it's not already the case).

jonahx 9 days ago
Not a Rust dev so maybe a dumb question, but is this more involved than just running diffs? If so, what needs to be done?
johannes1234321 9 days ago
The key thing is interpretation of the diff. Is there a difference because they ran some code generator, so the crate contains generated code that isn't present in the repo, or did they add a backdoor?
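Mechanically, "just running diffs" is the easy part. Purely as a rough sketch (not the actual tool; the crate name, version, repository URL, tag convention, and CDN URL pattern here are assumptions, and curl/tar/git are assumed on PATH):

```rust
// Fetch the published .crate tarball, unpack it, and diff it against a
// checkout of the repository at the matching tag. The hard part, interpreting
// the resulting diff, is not automated here.
use std::process::Command;

fn run(cmd: &str, args: &[&str]) -> std::io::Result<()> {
    let status = Command::new(cmd).args(args).status()?;
    println!("{cmd} exited with {status}");
    Ok(())
}

fn main() -> std::io::Result<()> {
    let name = "example-crate"; // hypothetical
    let version = "1.2.3";      // hypothetical

    // Published tarballs are served from crates.io's CDN (URL pattern assumed here).
    let url = format!("https://static.crates.io/crates/{name}/{name}-{version}.crate");
    run("curl", &["-sSL", "-o", "pkg.crate", url.as_str()])?;

    // The tarball unpacks into a `{name}-{version}` directory.
    run("tar", &["-xzf", "pkg.crate"])?;

    // Check out the repository at the matching tag (tag naming convention assumed).
    let tag = format!("v{version}");
    run("git", &["clone", "--depth", "1", "--branch", tag.as_str(),
        "https://example.com/example-crate.git", "repo"])?;

    // `git diff --no-index` compares two arbitrary directories.
    let crate_dir = format!("{name}-{version}");
    run("git", &["diff", "--no-index", "repo", crate_dir.as_str()])
}
```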
thayne 9 days ago
Most of the diffs are probably innocuous. I suspect the most common diff would be the version line of Cargo.toml, both from CI that automatically updates that line and from people who forgot to update it before making a tag in git.
mmastrac 9 days ago
As someone with a crate that's in the 50MM plus range, this happens all the time. I really should automate this via a GH action.
OptionOfT 9 days ago
Interested to see the crate, and maybe I can help?
mmastrac 9 days ago
https://crates.io/crates/ctor

Patches welcome!

Email in profile as well.

SubiculumCode 9 days ago
First pass gpt?
SubiculumCode 9 days ago
Heavily downvoted, which is fair because I didn't really explain what I meant, which was: would using LLMs to parse the generated diffs, as a first pass, be useful/efficient for spotting and interpreting discrepancies?
arccy 9 days ago
when your goal is to improve security, the unreliability that comes with LLMs is not the answer.
chipdart 9 days ago
I don't think this is a relevant take. Your goal is to implement a system to automatically scan countless packages and run a heuristic to determine if a package is suspicious or not. You're complaining about false positives/false negatives while ignoring that not checking packages at all is not an improvement either.
Zambyte 9 days ago
Personally I think using LLMs to scan is a good idea, but an honest downside is the potential for a false sense of security. I think LLMs here would be useful for finding unintentional security flaws. I don't think they're a great tool for finding intentional security flaws a la the xz situation. People might be less inclined to dig into the code directly if it was stamped with a green check mark by a GPT.
SubiculumCode 9 days ago
The fact that using machine learning, including LLMs, to detect and mitigate malicious code is of interest to a whole lot of people smarter than me really suggests your flippant rejection of their potential is premature.

https://arxiv.org/abs/2405.17238
https://arxiv.org/abs/2404.02056
https://arxiv.org/abs/2404.19715
https://www.sciencedirect.com/science/article/pii/S266638992...

Aeolun 9 days ago
Not necessarily, it reduces false positives. It just doesn’t do anything for false negatives (arguably makes the problem worse).

If you just want to see whether the differences that do occur are valid ones, this seems fine. But I wouldn’t use it as a guarantee.

seoulmetro 9 days ago
Neither does grabbing yet another online third party's untested data?
pornel 9 days ago
It could work for classifying honest/innocent differences.

However, LLMs are incredibly naive, so they could be easily fooled by a malicious actor (probably as easy as adding a comment that this is definitely NOT a backdoor).

SubiculumCode 9 days ago
LLMs are broadly naive, but when fine tuned on a small domain of expertise/knowledge, this problem is less impactful.
grahar64 9 days ago
Deterministic compilation is the best way to let people validate what they are downloading from repositories is what is in the codebases.
progval 9 days ago
crates.io does not host compiled artifacts. If packages on crates.io differ from their Git repository it's because of a custom pre-build step of that particular package, so a deterministic compilation toolchain won't help here.
estebank 9 days ago
There are other possible reasons for them not matching:

- files not being tracked in the repo

- files that are part of the repo but not part of the published crate

- publishing with --allow-dirty from a local copy of the repo with changes that haven't been committed

- publishing from a commit that hasn't been pushed

I'm sure there are more.

pabs3 9 days ago
> crates.io does not host compiled artifacts.

It definitely does contain generated files: at least one crate has Rust code generated by a Python script, where the script is not in the crate, only in the upstream Git repository.

xpe 9 days ago
> Deterministic compilation is the best way to let people validate what they are downloading from repositories is what is in the codebases.

"the best way"? Please make the argument for why. To do it properly, you must steel-man the alternatives (not shoot down straw-men)

miki123211 9 days ago
Because deterministic compilation lets you (or someone else) do this automatically.

If you introduce a backdoor into the compilation step, you run a much greater risk of detection. As long as there are multiple machines compiling packages and verifying whether the checksums match, any single backdoored machine will immediately be caught. This is much more important for package managers that do their own builds and ship their own binaries than for those who just ship whatever they got from the developer.

Without deterministic compilation, two builds of the exact same code might differ. This makes backdoors very hard to detect unless you have prior suspicion that one is present in a particular program.

Deterministic compilation forces people to embed backdoors directly in the source code repository, which creates an audit trail, is very visible in diffs, much easier to catch in reviews and so on. You can still get away with it (see the XZ situation), but it requires far more work.
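As a minimal sketch of the verification step an independent rebuilder would run (assuming the sha2 and hex crates; the artifact path and published digest are placeholders, not real values):

```rust
// With deterministic compilation, building the same source twice yields
// byte-identical output, so a simple hash comparison catches a tampered
// build machine.
use sha2::{Digest, Sha256};
use std::fs;

fn main() -> std::io::Result<()> {
    let published_digest = "<sha256 hex digest published by the build farm>";
    let artifact = fs::read("target/release/example-binary")?; // hypothetical path
    let local_digest = hex::encode(Sha256::digest(&artifact));

    if local_digest == published_digest {
        println!("OK: local rebuild matches the published artifact");
    } else {
        println!("MISMATCH: local {local_digest}, published {published_digest}");
    }
    Ok(())
}
```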

xpe 9 days ago
Ok, but you didn't talk about alternatives. Why not just checksum or sign the source code?

It is important to remember that crates.io doesn't store binaries.

xpe 9 days ago
See also a related comment in the overall thread saying "crates.io does not host compiled artifacts. If packages on crates.io differ from their Git repository it's because of a custom pre-build step of that particular package, so a deterministic compilation toolchain won't help here."

Prove me wrong! I'm open to it.

Or be that person who is too lazy to respond with an actual comment, downvotes, and probably assumes they are right.

k8svet 9 days ago
It means we can build the same thing and check the output hashes. In terms of deriving a method for trusting build infra, it's basically the be-all and end-all. I almost don't know how to answer it.
makeworld 9 days ago
This is why I like what Go does, where you're downloading from Git directly (optionally proxied through Google, yes)
kibwen 9 days ago
I'm not sure what is meant by "downloading from Git"; I assume you mean downloading from Github. And Github is far less secure than what crates.io does, because crates.io is immutable (once published, uploaders can't change anything without opening a support ticket which will get rejected if they don't have a good reason), whereas Github history is trivially rewriteable. This means that if you rely on "v1.2.3" of a library from crates.io, that's always going to give everyone the same code; conversely, relying on a git tag of "v1.2.3" from a random Github repo could be anything at any point.
cgh 9 days ago
It’s interesting you even have to point this out. Maven solved this and other problems literally decades ago but the repository packaging wheel keeps getting reinvented. For example, here’s the page on Maven Central’s immutability policy:

https://central.sonatype.org/publish/requirements/immutabili...

nerdponx 9 days ago
Meanwhile the Python Package Index has recently dropped official support for PGP signatures: https://blog.pypi.org/posts/2023-05-23-removing-pgp/ -- apparently due to too few useful/verifiable signatures, as per https://blog.yossarian.net/2023/05/21/PGP-signatures-on-PyPI...
sosodev 9 days ago
Go modules can be hosted in any Git repository. The Go toolchain also keeps hashes of the selected tag so if you've reviewed it once it will never change without you explicitly giving it the ok.
moffkalast 9 days ago
> giving it the ok

You mean giving it the Go ahead? ;)

Yasuraka 9 days ago
Also any svn/hg repository afaik
leoh 9 days ago
That’s true, except that the lockfile records the revision as a commit SHA.

https://github.com/apex/up-examples/blob/master/oss/golang-g...

fsmv 9 days ago
The goproxy server makes it somewhat immutable for Go too. Once they have a version cached they will never delete it. You can only supersede it with a new version and mark the old version as bad.
estebank 9 days ago
TBH, "somewhat immutable" is not "immutable". The Go approach aids to limit the effects of an attempted attach where you're misled into building your project with different code than originally intended, but does nothing to guarantee continuity over time of dependencies being available. For that you have to rely on vendoring.

The crates.io approach instead defends you both from a dependency changing silently and from it disappearing from one day to the next, without having to deal with vendoring.

agwa 9 days ago
As kibwen noted above, crates.io can be changed if there's a "good reason" so it's not truly immutable either.

Go does not delete modules from the module proxy except for copyright/legal reasons (I suspect crates.io would delete for these reasons too), and additionally the checksums of all modules are published in a tamper-proof transparency log (https://sum.golang.org/) so if they did alter or delete a module, it would be detected.

kibwen 9 days ago
The Go module proxy doesn't retain modules forever, as documented in the link there. And in the case where someone alters the underlying git repo by attaching a previously-used tag to a new commit, one of the following scenarios must then result:

1. The proxy delivers the changed code while issuing a warning to the end user, which if the warning is overlooked would mean that the checksum achieved nothing.

2. The proxy ignores the changed code while delivering the old code and a warning to the user, which would introduce the same concern raised here that the code that gets delivered and the code listed on Github have no requirement to be the same.

3. The proxy refuses to deliver any code and issues a warning to the user, which would mean that anyone can effectively remove their code from the proxy by simply changing the tag to something else.

I would be interested to know which one Go actually goes with, because none of these are ideal.

agwa 8 days ago
In general, the Go module proxy retains modules forever, for the explicit purpose of not breaking builds (https://sum.golang.org/#faq-retract-version). crates.io can delete content also. I don't think there is much difference between crates.io and the Go module proxy in this regard - they both aim to keep source code forever to avoid breaking builds, but will delete content if there's a good reason.

If the Git tag changes, the proxy returns the original code. There is no warning that the Git tag has changed, but it can't reliably detect this anyways because the Git repository could be returning different content to different clients. I don't think there is much difference between crates.io and the Go module proxy in this regard - in neither ecosystem can you assume the Git repo matches what the packaging tool downloads. (I pointed this out here: https://news.ycombinator.com/item?id=40699948)

Where Go is different is that it provides assurance that the module proxy is providing the same code to everyone, eliminating the module proxy as a potential source of compromise. It also ensures that people who disable the module proxy and fetch code directly using the go command get the same code that the module proxy has. To reiterate, this does not help with doing code audits of Git repos - you have to either audit the code in the module proxy, or compute the checksum of the Git repo to make sure it matches the sumdb.

aseipp 9 days ago
> conversely, relying on a git tag of "v1.2.3" from a random Github repo could be anything at any point.

I don't know of a single modern build tool that can do this but doesn't require or record this information specifically? Maybe the earlier versions of Go? (I know they've gone through a few changes in module/import strategies.)

agwa 9 days ago
The problem is that if you clone the Git repository, or view it on GitHub, you have no assurance that you're seeing the same code that the go command or the Go module proxy saw. The author of a malicious module could change the Git tag to point to a different, benign, commit after the Go module proxy stores the malicious copy. There are other tricks an attacker can play as well: https://github.com/golang/go/issues/66653

Ultimately, if you're doing a code audit, you have to compute the checksum of the code that you're looking at, and compare it against the entry in go.sum or the checksum database to make sure you're auditing the right copy.

jesprenj 9 days ago
Related: Proxying can be disabled by setting the environment variable GOPROXY=direct [0]. I put it in my bashrc.

[0] https://www.practical-go-lessons.com/chap-18-go-module-proxi...

Zambyte 9 days ago
The fact that this isn't default is honestly the biggest thing I dislike about Go.
timeon 9 days ago
You can do that with Rust as well if you point the dependency at a git repo (or a local dir).
estebank 9 days ago
Be aware that you cannot publish on crates.io if you do that: either you buy into the system (so that you can ensure that you can rebuild in perpetuity) or not at all (so you end up with a crate that can depend partially on crates.io, but must always be consumed directly from a repo or directory).
mberning 9 days ago
How could you rank them for review priority? Use a combination of repo popularity multiplied by amount of significant differences? Where significant differences are determined by excluding non-code files?
pornel 9 days ago
I'd use popularity (how many people are using the crate indirectly) divided by trust level in the publisher of the crate.

However, publishing a list that basically says "these are the least trustworthy Rust users" would cause quite a stir, so I'm not doing that.
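Purely as an illustration of how such a ranking could combine the signals mentioned in this subthread (all fields and weights are made up, not taken from the actual tool):

```rust
// Crates that many projects depend on, with large "significant" diffs, from
// publishers with little established trust, float to the top of the review queue.
struct CrateStats {
    reverse_dependencies: u64,   // how many crates depend on this one, directly or not
    significant_diff_lines: u64, // diff lines left after ignoring non-code files
    publisher_trust: f64,        // e.g. 1.0 = long-standing team, 0.1 = brand-new account
}

fn review_priority(c: &CrateStats) -> f64 {
    let popularity = (c.reverse_dependencies as f64).ln_1p();
    let suspicion = (c.significant_diff_lines as f64).ln_1p();
    popularity * suspicion / c.publisher_trust.max(0.01)
}

fn main() {
    // A very popular crate with a sizable diff from a new publisher ranks high.
    let c = CrateStats {
        reverse_dependencies: 40_000,
        significant_diff_lines: 350,
        publisher_trust: 0.2,
    };
    println!("review priority: {:.2}", review_priority(&c));
}
```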

swiftcoder 9 days ago
Looking through the code, it already ignores the majority of non-code files
corytheboyd 9 days ago
How crazy would it be to have a package repository that also builds the artifacts it distributes? You’d need a high barrier to entry to save on costs and time sifting through garbage. Perhaps it’s this high barrier that would prevent such a repository from taking off though. Perhaps this is just a really dumb step on a path leading back to simple checksum validations… though with those, you’re only validating that whatever was uploaded is what you downloaded; it doesn’t ensure that it was built from a known set of source files… hard problems.
miki123211 9 days ago
Distro repositories (like the one you have on Debian / Ubuntu / Redhat etc) do this.

They work on a different model, where only packages that are deemed "worthy" are included, and there's a small-ish set of maintainers that are authorized to make changes and/or accept change requests from the community. In contrast, programming language package managers like cargo, pip or npm let anybody upload new packages with little to no prior verification, and place the responsibility of maintaining them solely on their author.

The distribution way of doing things is sometimes necessary, as different distributions have different policies on what they allow in their repositories, might want to change compilation options or installation paths, backport bug and security fixes from newer project versions for compatibility, or even introduce small code changes to make the program work better (or work at all) on that system.

One example of such a repository, for the Alpine Linux distribution, is at https://github.com/alpinelinux/aports

c0balt 9 days ago
That's what nixpkgs does for Nix/NixOS. The package set is continuously built by a CI system and made publicly available: https://github.com/NixOS/nixpkgs#continuous-integration-and-...
thayne 9 days ago
go kind of solves that by making the git repo the source of truth for a package, and hosting a cache for it.

The problem with it is that you need the full git URL in every file where you import it, which is a pain if the repo changes locations, or if you want to use a fork or a local version. Versioning is also tricky, to the point that go recommends creating a separate branch for a major/breaking version, which requires updating every import statement.

I think a good middle ground would be to have a central repository and/or package configuration file that maps package names to git repos and versions to commits (possibly via tags). And of course use hashes to lock the version to specific contents.
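A rough sketch of the data model for that kind of mapping (all names and values here are hypothetical):

```rust
// Package names map to a repository plus versions pinned to both a commit
// and a content hash, so later tag rewrites are detectable.
use std::collections::BTreeMap;

struct PinnedVersion {
    commit: String,         // full commit SHA the release tag pointed to at publish time
    content_sha256: String, // hash of the source tree contents
}

struct IndexEntry {
    repo_url: String,
    versions: BTreeMap<String, PinnedVersion>, // semver string -> pin
}

fn main() {
    let mut index: BTreeMap<String, IndexEntry> = BTreeMap::new();
    let mut versions = BTreeMap::new();
    versions.insert(
        "1.2.3".to_string(),
        PinnedVersion {
            commit: "<full commit sha>".to_string(),
            content_sha256: "<sha256 of the source tree>".to_string(),
        },
    );
    index.insert(
        "example-crate".to_string(),
        IndexEntry {
            repo_url: "https://example.com/example-crate.git".to_string(),
            versions,
        },
    );
    println!("packages in index: {}", index.len());
}
```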

Bazel kind of does this, but it doesn't have any built in version resolution or transitive dependency resolution (although in some cases there are other tools that help). And it can add a lot of complexity that you may not need.

dgoldstein0 9 days ago
bazel has modules now: https://bazel.build/external/module

Not tried them but they look like a reasonable dep handling solution on paper - each module can declare its own dependencies and bazel will figure it out for you like a package manager. Their old workspaces way of doing it was a nightmare: while patterns emerged where repos would export a function to register their dependencies, the first declaration of any name would win, and thus you weren't guaranteed to have a compatible set of workspaces at the end.

kichimi 9 days ago
Isn't this gentoo?
GauntletWizard 9 days ago
No, Gentoo does something quite different - it builds everything on the host machine every time, more or less.
k8svet 9 days ago
Bruh. I mean this as a genuine ask. Have you heard of Nix and is there a reason it didn't land on your radar or was rejected?

Because what you want exists and has a thriving community, and a package set that outclasses, well, statistically every other package manager in existence.

I swear, it's a daily occurrence for me to see software engineering challenges posited here as damn near impossible that Nix has been solving for over a decade.

What if you could run a single command and have exact insight into the source you're using for every single package on your system, with the context of the dependency graph it exists in?

I cannot wait for this wave to crash and for people to realize how much engineering effort is reduced by using Nix. And that all of these things they know they want for years, already exists. But hey, the syntax takes time to get used to and how do you compare that against the countless blog posts and hours and institutional knowledge you need to actually use docker properly. And then later on some Go-based SBOM tool made by a VC-backed startup that fundamentally still does an inferior job to Nix. Sigh.

Well anyway I guess nix will keep being used by hedge funds, algorithmic traders, "advanced defensive capabilities" companies, literal (launched, in space) satellites, wallet manufacturers, etc, while everyone else listens to the syntax decriers.

KolmogorovComp 9 days ago
crates.io already builds the artefacts.

But the source code that is sent to crates.io is not necessarily the same as the one in the public repo linked to the crate.

kibwen 9 days ago
It's possible that crates.io might attempt to build a crate when published as a sort of sanity check (I don't know if this is true, but it's certainly feasible), but it doesn't distribute binaries, it distributes source code.
pabs3 9 days ago
> it doesn't distribute binaries, it distributes source code.

It definitely does contain generated files: at least one crate has Rust code generated by a Python script, where the script is not in the crate, only in the upstream Git repository.

kibwen 8 days ago
Yes, let's clarify: crates.io expects a Rust crate, which itself can contain whatever junk the uploader wants. But crates.io isn't taking your source, building it, and then distributing those executables; at the end of the day it's distributing the source code of a Rust crate as given by whoever published it.
progval 9 days ago
Do you have a source for crates.io building artefacts? I have a couple of crates on it and never saw any sign it tried to compile them, even when they were broken.
corytheboyd 9 days ago
Ah yeah, I suppose that’s what I really mean: a means of verifying that builds link to source that is publicly available. Sounds like the source repository has to be in on it too.
gigatexal 9 days ago
Has anyone analyzed the data?