SQL nulls are weird(jirevwe.github.io)
46 points by subomi 12 hours ago | 11 comments
hiAndrewQuinn 8 hours ago
SQL NULLs aren't weird, they're just based off of Kleene's TRUE-FALSE-UNKNOWN logic!

If you want you can read NULL as UNKNOWN and suddenly a whole bunch of operations involving them become a lot more intuitive:

1. TRUE OR UNKNOWN = TRUE, because you know you have at least one TRUE already.

2. TRUE AND UNKNOWN = UNKNOWN, because you don't know whether you have two TRUEs or not. It's just out there.

3. UNKNOWN XOR UNKNOWN = UNKNOWN, because it could darn near be anything: TRUE XOR TRUE, TRUE XOR FALSE, FALSE XOR FALSE, FALSE XOR TRUE... Internalizing this is where SQL's use of NULL / UNKNOWN really becomes intuitive.

4. (TRUE AND FALSE) XOR (TRUE OR UNKNOWN) = (FALSE) XOR (TRUE) per #1 = TRUE. See, it's consistent, you just need to keep in mind that if you have a lot of known UNKNOWNs they're quite parasitic and your final outcome is likely to be, itself, an UNKNOWN. Just like in real life!

mrkeen 8 hours ago
It's unfortunate to namesquat on 'boolean' if your elements have three possible values. Just call it 'ternary logic' and let individuals decide which system to use.
foobarchu 7 hours ago
Who's name squatting boolean? The bool column is exactly what it claims to be, you just have the option of introducing unknowability if you define it allow nulls.
bunderbunder 6 hours ago
The distinction is that not all formal logic systems are Boolean. Meaning that it is nonsensical and confusing to use "Boolean" as a generic synonym for "truth value" in the same way that it's nonsensical to use "Pantone" as a generic synonym for "color value", including when the specific kind of color value you're talking about is CMYK or HSV and definitely not Pantone.
chuckadams 7 hours ago
Maybe GP was edited, but it doesn't use the word "boolean" anywhere.
hiAndrewQuinn 7 hours ago
Correct, I edited "boolean" out prior to ^^P's comment. My apologies.
7 hours ago
Recursing 5 minutes ago
Surprised that this doesn't mention "IS DISTINCT FROM" ( https://modern-sql.com/caniuse/is-distinct-from )

(Although in rare cases that is even weirder: https://stackoverflow.com/a/58998043 )

demurgos 9 hours ago
> select null = null; returns NULL, because each NULL is basically a placeholder representing any “unknown value”. Two unknown values are not necessarily the same value; we can’t say that they are equal, because we don’t know the value of either of them.

Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.

Note also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.

0: https://www.postgresql.org/docs/15/sql-createtable.html

magicalhippo 8 hours ago
The result of comparisons involving NULL values can result[1][2] in UNKNOWN, and in PostgreSQL for example you can test[3] for this using IS UNKNOWN.

That said, as someone self-taught in SQL, I agree NULL was not a good choice.

Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.

[1]: https://stackoverflow.com/a/79270181

[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...

[3]: https://www.postgresql.org/docs/current/functions-comparison...

otteromkram 13 minutes ago
Also self-taught SQLer and I don't have an issue with NULL.

I also don't use UNIQUE constraints, so maybe that has something to do with it.

layer8 8 hours ago
SQL NULL doesn’t behave like “unknown” in all contexts. That’s one issue of NULL, that it doesn’t really have consistent semantics.
masklinn 7 hours ago
Furthermore if null only means unknown then we need a value for “known absent”, there’s a reason why null is so often used as that.
duncan-donuts 8 hours ago
Introducing “unknown” feels like another kind of hell like undefined in JavaScript.
demurgos 8 hours ago
Just to clarify, I'm not advocating to introduce a new `unknown` keyword. I'm saying that the existing `null` in SQL was not named properly and that the name `unknown` would have been more fitting. SQL's `null` already has the semantics of `unknown` as explained in the part of the article that I quoted.
wvenable 8 hours ago
SQL's use of "null" is probably one of the oldest instances of that concept in computing. It's exactly equivalent to unknown. That is its definition.
demurgos 6 hours ago
Really? I know that SQL is old but I would have expected `null` to refer to pointers at first.

Going by Wikipedia, I see that SQL is from 1974 and C from 1972. Were there earlier uses/drafts where `null` is "unknown" instead of "unset"?

wvenable 5 hours ago
I wouldn't necessarily define `null` as "unknown" -- it's just "no value" -- which is really the same thing and also somewhat equivalent to "unset". But null pointers aren't unset as pointers aren't initialized to null in C and you can explicitly set a pointer to null.

E.F. Codd added nulls to relational model in 1970 so that does pre-date C. The concept is even older than that I imagine.

NoMoreNicksLeft 8 hours ago
>also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.

Why would anyone want to use another database?

stronglikedan 8 hours ago
Simplicity. PG is often overkill for simple apps, where MySQL/Maria/et al is easier to maintain, or even SQLite for very simple apps where zero DB maintenance is preferable.
homebrewer 7 hours ago
Also the reverse — MySQL et al support much more complex replication topologies out of the box, including multi master for the very rare use case when you need it.

It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.

lcnPylGDnU4H9OF 8 hours ago
Why would you say MySQL/Maria/et al are easier to maintain for simple apps than PG?
demurgos 8 hours ago
The main pain point for smaller apps is that every major Postgres version requires an explicit migration of the underlying data representation. I get why it's there, but for simpler apps I would appreciate a flag to do it transparently.
throwaway10235 8 hours ago
I'm not sure what you mean. I have migrated versions without having to update any applications that connects to it?

Maybe it is a driver specific issue? I have used Python/Java, and haven't updated any of my code or dependencies because of a major Postgre update

Volundr 6 hours ago
It's not client side, it's server side. The binary format between Postgres major versions is generally not compatible so you need to do a pg_dump from the old database and do a pg_restore into the new one. pg_upgrade [1] can let you bypass this by migrating the binary data directly, but still requires having both the new and old postgres versions installed. There's also things you can do with replication, but since we're talking simplicity I don't think that really applies.

Personally I think upgrades are the one thing MySQL has on Postgres at this point.

[1] https://www.postgresql.org/docs/current/pgupgrade.html

homebrewer 7 hours ago
It requires manual interventions because the upgrade process is basically dump + restore. MySQL and MariaDB upgrade between major versions automatically — you simply install the next version (or change the version tag in your container definition) and restart the server.

Usually it takes almost no time, altought might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0 when oracle completely rewrote the data format (making things like atomic ddl possible).

panzi 7 hours ago
You need to dump the database on the old PostgreSQL version and then load the dump on the new PostgreSQL version. Some other database servers can just use the old table data or migrate the data files on the fly transparently.
phplovesong 8 hours ago
Usually its a skill issue
badlibrarian 8 hours ago
VACUUM
kstrauser 7 hours ago
If your data's large and changing enough that you have to care about vacuuming, any reasonable database is going to require some tuning, tending and management.

I'd posit that only a tiny fraction of PostgreSQL uses have to know or care that vacuuming is a thing because the autovacuum default handle it for them.

mardifoufs 5 hours ago
Sure, it's never going to be plug and play, but it doesn't mean that all the issues will be equivalent. Vacuuming doesn't really have an equivalent in say, MySQL. It's something you don't have to worry about if you use the latter.

For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.

poincaredisk 8 hours ago
* legacy applications or vendor lock-in

* use of a cloud provider that favours another database (like SQL server)

* some people claim mysql/maria is faster for them, simpler, or has a better support for replication

* use of sqlite for lightweight or portable apps

* domain specific databases that still use sql as their query language

* someone may want to use another database for fun or to learn something new

password4321 8 hours ago
Replication
irrational 8 hours ago
I expected the article to mention how in Oracle NULLs are equal to empty strings. Now that is weird.
hyperman1 6 hours ago
Oh man. I had a talk with a DBA about how oracle could not deal with an adress with no street name - literally a tiny village with 10 houses on 1 nameless town square. It was unsearchable in parts of the app because street='' was interpreted as street is null. DBA kept claiming oracle was right and the town should adapt their naming to our software.

This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs

svieira 8 hours ago
Domain-embedded nulls are the bane of my existence.
datadrivenangel 8 hours ago
SQL NULLs are not weird once you consider how you want relational logic to work when they is a record with non-existent values.
grahamlee 8 hours ago
Exactly this. SQL is based on the relational algebra and that's well-defined, NULL along with other features of SQL work in an entirely regular and predictable way. The only time it's weird is when a developer decides that it should work the way Javascript (or whatever) NULLs work because that's the last time they saw the same word used in a programming language, in which case it's the assumption that's weird.
tzs 6 hours ago
That's not the only time it is weird. There's even a whole book by one of the pioneers of the relational DB model, Date's "Database Technology: Nulls Considered Harmful" [1], covering many of the ways it is weird.

[1] https://www.amazon.com/Database-Technology-Nulls-Considered-...

setr 8 hours ago
The part that’s weird with nulls is that it’s a trinary logic stuffed into a boolean algebra. The use of x = NULL instead of x IS NULL is pretty much always a mistake.

More importantly, x = value instead of (x = value and x IS NOT NULL) is almost always a mistake, and a stupidly subtle one at that. And for this curse, we get… nothing particularly useful from these semantics.

Also the x != NULL case is completely cursed

whstl 8 hours ago
Agreed.

I will die on the hill that regular C-like nulls are the actual thing that's weird.

The real billion dollar mistake [1] was the damage it made on the minds of developers.

[1] https://en.wikipedia.org/wiki/Tony_Hoare

chuckadams 7 hours ago
Even null in programming languages isn't so bad if it's a distinct type. The problem with null in languages like Java is that null is part of every reference type (C's pointers are another world of broken, null being basically just another unsafe invalid address).

Most languages nowadays do get nulls right, even PHP of all things.

poincaredisk 8 hours ago
We should start adjusting that billion for inflation.
bunderbunder 8 hours ago
> ... and this is even less obvious if you’re used to using ORMs.

Which is why I continue to be such an ORM skeptic. I agree that they're convenient. But I do worry that we've now got an entire generation of engineers who regularly interact with relational databases, but have largely been spared the effort of learning how they actually work.

As another commenter pointed out, if you've learned basic relational algebra then the way SQL nulls behave seems obvious and logically consistent. The logic is the same as the logic behind the comparison rules for NaN in IEEE floats. It's the behavior of C-style nulls that is, always and forever, a billion-dollar mistake.

thrance 7 hours ago
My experience with ORMs is that most of the time you end up needing to write some queries by hand, in raw SQL. Usually these are the most complex, that you can't express in your ORM's DSL. My point being, I don't think using an ORM really shields you from having to learn how it works behind the scenes.
kurtbuilds 7 hours ago
If you want equality testing with nulls, you want to use `is (not) distinct from` instead of `=` and `<>` / `!=`.

`1 is not distinct from NULL` => false

`NULL is not distinct from NULL` => true

`0 is not distinct from 1` => false

blast 20 minutes ago
Having that is much better than not having it, but man is it verbose and confusing.
zokier 8 hours ago
SQL nulls in some ways behave in similar to floating point nans. Of course nans are also weird in their own way, but it is a bit comforting that its not so completely singularly weird.
dunham 8 hours ago
Also similar to the bottom value in haskell and exceptions in other languages.
giraffe_lady 8 hours ago
NaN is cool because it's almost like a type that constrains uncertainty. What do we know about this entity? not much! but it's definitely not a number. Calling it anything else would have been a mistake.

Null is more confusing because it means different things in different languages. Sometimes it's a more constrained uncertainty, eg this definitely doesn't exist. But in sql it's a less constrained uncertainty, like "undefined" in math. The value of this thing couldn't make sense in this context, but we can't make assertions about its existence.

mplanchard 7 hours ago
There's another comment in here that talks about thinking of NULL as UNKNOWN, and I quite like that. It makes a lot more sense, and I think it would've been a better choice to standardize on. UNDEFINED would also be an improvement.
adhamsalama 7 hours ago
Except it is of type number, at least in JavaScript.
al2o3cr 8 hours ago
FWIW, you can explicitly change this behavior in Postgres as of version 15 - include "NULLS NOT DISTINCT" when creating the unique index.
jmyeet 8 hours ago
NULL is the absence of a value. If you try and treat it as a value, you're going to have a bad time. So an attempted UNIQUE(email_address, deleted_at) constraint is fundamentally flawed. If you treated NULL as a value that could be unique, you're going to break foreign keys.

But let's continue the logic of deleted_at being NULL indicating an active account, which seems to the intent here. You end up doing things like:

    SELECT /* ... */
    FROM accounts
    WHERE email_address = '...'
    AND deleted_at IS NOT NULL
Depending on your database, that may or may not index well. More problematic, you may end up with privacy leaks if someone forgets the last conditional.

If anything, you want to reverse this so someone has to go out of their way to explicitly select deleted accounts. There are multiple strategies for this eg using an active_accounts view or table.

Lastly, there are lots of potential reasons for an account to be disabled or otherwise not visible/accessible. Takedowns, court orders, site safety, hacked accounts and so on.

Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.

chuckadams 7 hours ago
> Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.

Then don't do that. It's kind of a leap to say soft deletes are categorically bad because someone might confuse "deleted" with "inactive". My users table does the super-advanced thing of having both columns. The ORM also doesn't forget to add the not-null criterion. There's also zero databases in active use where it poses a problem to indexing.

Soft deletes suck in their own way, but none of the alternatives are perfect either.

indeed30 8 hours ago
That's interesting - I believe this is exactly how Sequelize implements soft-deletion.
giraffe_lady 8 hours ago
You put the "is not null" on the index itself and then simply don't use it for the much rarer queries that are on deleted accounts. Or just use a view for active accounts.

Overloading timestamps to carry a boolean on null is awesome as long as you decide that's what you're doing and use one of the several standard techniques to dodge the easily avoided potential downside.

This isn't a valid security concern, more than any other incorrect sql query would be anyway. A dev can always write a bad Q, you need another way to address that it's not more likely here because of the null.

ungut 8 hours ago
The NULLs in unique constraints quirk actually works differently in ORACLE databases, which is infuriating to say the least. Apparently this comes from some ambiguity in some sql standard, anyone know more about this?