SQL NULLs aren't weird, they're just based on Kleene's TRUE-FALSE-UNKNOWN logic!
If you want, you can read NULL as UNKNOWN, and suddenly a whole bunch of operations involving them become a lot more intuitive:
1. TRUE OR UNKNOWN = TRUE, because you know you have at least one TRUE already.
2. TRUE AND UNKNOWN = UNKNOWN, because you don't know whether you have two TRUEs or not. It's just out there.
3. UNKNOWN XOR UNKNOWN = UNKNOWN, because it could darn near be anything: TRUE XOR TRUE, TRUE XOR FALSE, FALSE XOR FALSE, FALSE XOR TRUE... Internalizing this is where SQL's use of NULL / UNKNOWN really becomes intuitive.
4. (TRUE AND FALSE) XOR (TRUE OR UNKNOWN) = (FALSE) XOR (TRUE) per #1 = TRUE. See, it's consistent, you just need to keep in mind that if you have a lot of known UNKNOWNs they're quite parasitic and your final outcome is likely to be, itself, an UNKNOWN. Just like in real life!
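These identities are easy to check directly in SQL. A quick sketch in Postgres, using <> between booleans to stand in for XOR (Postgres has no XOR operator for booleans):

    SELECT TRUE OR NULL;                        -- true:  rule 1
    SELECT TRUE AND NULL;                       -- NULL:  rule 2
    SELECT NULL::boolean <> NULL::boolean;      -- NULL:  rule 3
    SELECT (TRUE AND FALSE) <> (TRUE OR NULL);  -- true:  rule 4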
It's unfortunate to namesquat on 'boolean' if your elements have three possible values. Just call it 'ternary logic' and let individuals decide which system to use.
Who's namesquatting 'boolean'? The bool column is exactly what it claims to be; you just have the option of introducing unknowability if you define it to allow NULLs.
The distinction is that not all formal logic systems are Boolean. Meaning that it is nonsensical and confusing to use "Boolean" as a generic synonym for "truth value" in the same way that it's nonsensical to use "Pantone" as a generic synonym for "color value", including when the specific kind of color value you're talking about is CMYK or HSV and definitely not Pantone.
> select null = null; returns NULL, because each NULL is basically a placeholder representing any “unknown value”. Two unknown values are not necessarily the same value; we can’t say that they are equal, because we don’t know the value of either of them.
Agreed with all of this; it would probably have been better if it had been named `unknown` instead of reusing the `null` keyword.
Note also that since PostgreSQL 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.
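For the curious, a minimal sketch of the syntax (table and column names invented for illustration):

    -- PostgreSQL 15+: NULLs compare equal for uniqueness purposes, so a
    -- second row with the same email and a NULL deleted_at is rejected:
    CREATE UNIQUE INDEX accounts_email_unique
        ON accounts (email_address, deleted_at)
        NULLS NOT DISTINCT;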
Just to clarify, I'm not advocating to introduce a new `unknown` keyword. I'm saying that the existing `null` in SQL was not named properly and that the name `unknown` would have been more fitting. SQL's `null` already has the semantics of `unknown` as explained in the part of the article that I quoted.
I wouldn't necessarily define `null` as "unknown" -- it's just "no value" -- which is really the same thing and also somewhat equivalent to "unset". But null pointers aren't unset: pointers aren't initialized to null in C, and you can explicitly set a pointer to null.
E.F. Codd added nulls to the relational model in 1970, so that does pre-date C. The concept is even older than that, I imagine.
Simplicity. PG is often overkill for simple apps, where MySQL/Maria/et al are easier to maintain, or even SQLite for very simple apps where zero DB maintenance is preferable.
Also the reverse — MySQL et al support much more complex replication topologies out of the box, including multi-master for the very rare use case when you need it.
It's also much easier to tune: most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.
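Something like this, to sketch it (the size is an arbitrary example; innodb_buffer_pool_size has been resizable at runtime since MySQL 5.7):

    SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8 GB: pick what fits your machine
    -- MySQL 8.0+ can derive this (and related settings) from available memory
    -- instead, if innodb_dedicated_server = ON is set in the server config.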
The main pain point for smaller apps is that every major Postgres version requires an explicit migration of the underlying data representation. I get why it's there, but for simpler apps I would appreciate a flag to do it transparently.
It's not client side, it's server side. The binary format between Postgres major versions is generally not compatible, so you need to do a pg_dump from the old database and a pg_restore into the new one. pg_upgrade [1] can let you bypass this by migrating the binary data directly, but it still requires having both the new and old Postgres versions installed. There are also things you can do with replication, but since we're talking simplicity I don't think that really applies.
Personally I think upgrades are the one thing MySQL has on Postgres at this point.
It requires manual intervention because the upgrade process is basically dump + restore. MySQL and MariaDB upgrade between major versions automatically — you simply install the next version (or change the version tag in your container definition) and restart the server.
Usually it takes almost no time, although it might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0, when Oracle completely rewrote the data format (making things like atomic DDL possible).
You need to dump the database on the old PostgreSQL version and then load the dump on the new PostgreSQL version. Some other database servers can just use the old table data or migrate the data files on the fly transparently.
If your data's large and changing enough that you have to care about vacuuming, any reasonable database is going to require some tuning, tending and management.
I'd posit that only a tiny fraction of PostgreSQL users have to know or care that vacuuming is a thing, because the autovacuum defaults handle it for them.
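And when a hot table does outgrow the defaults, the usual fix is a per-table setting; a sketch with an invented table name:

    -- Trigger autovacuum once ~5% of rows are dead (the default is 20%):
    ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.05);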
Sure, it's never going to be plug and play, but that doesn't mean all the issues are equivalent. Vacuuming doesn't really have an equivalent in, say, MySQL. It's something you don't have to worry about if you use the latter.
For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with Postgres than with MySQL. Postgres also has a lot of benefits, obviously.
Oh man. I had a talk with a DBA about how Oracle could not deal with an address with no street name: literally a tiny village with 10 houses on one nameless town square. It was unsearchable in parts of the app because street = '' was interpreted as street IS NULL. The DBA kept claiming Oracle was right and the town should adapt their naming to our software.
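For anyone who hasn't run into it: Oracle treats the empty string as NULL, so (table name invented for illustration):

    INSERT INTO addresses (street) VALUES ('');    -- Oracle stores '' as NULL
    SELECT * FROM addresses WHERE street = '';     -- matches nothing: '' = NULL is UNKNOWN
    SELECT * FROM addresses WHERE street IS NULL;  -- the only way to find that row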
This attitude was so prevalent at the time, I sometimes wonder if the rise of NoSQL was simply people sick of dealing with Oracle DBAs.
Exactly this. SQL is based on the relational algebra and that's well-defined; NULL, along with other features of SQL, works in an entirely regular and predictable way. The only time it's weird is when a developer decides that it should work the way JavaScript (or whatever) nulls work, because that's the last time they saw the same word used in a programming language, in which case it's the assumption that's weird.
That's not the only time it is weird. There's even a whole book by one of the pioneers of the relational DB model, Date's "Database Technology: Nulls Considered Harmful" [1], covering many of the ways it is weird.
The part that's weird with NULLs is that it's a ternary logic stuffed into a Boolean algebra. The use of x = NULL instead of x IS NULL is pretty much always a mistake.
More importantly, x <> value without an accompanying x IS NULL check is almost always a mistake, and a stupidly subtle one at that: rows where x is NULL silently drop out of the result. And for this curse, we get… nothing particularly useful from these semantics.
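A sketch of both traps, against an assumed one-column table:

    -- Hypothetical table t(x int) containing the rows 1, 2, NULL:
    SELECT * FROM t WHERE x = NULL;             -- zero rows: the comparison is UNKNOWN, never TRUE
    SELECT * FROM t WHERE x IS NULL;            -- the only predicate matching the NULL row
    SELECT * FROM t WHERE x <> 1;               -- row 2 only: the NULL row silently vanishes
    SELECT * FROM t WHERE x <> 1 OR x IS NULL;  -- row 2 plus the NULL row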
Even null in programming languages isn't so bad if it's a distinct type. The problem with null in languages like Java is that null is part of every reference type (C's pointers are another world of broken, null being basically just another unsafe invalid address).
Most languages nowadays do get nulls right, even PHP of all things.
> ... and this is even less obvious if you’re used to using ORMs.
Which is why I continue to be such an ORM skeptic. I agree that they're convenient. But I do worry that we've now got an entire generation of engineers who regularly interact with relational databases, but have largely been spared the effort of learning how they actually work.
As another commenter pointed out, if you've learned basic relational algebra then the way SQL nulls behave seems obvious and logically consistent. The logic is the same as the logic behind the comparison rules for NaN in IEEE floats. It's the behavior of C-style nulls that is, always and forever, a billion-dollar mistake.
My experience with ORMs is that most of the time you end up needing to write some queries by hand, in raw SQL. Usually these are the most complex ones, the ones you can't express in your ORM's DSL. My point being, I don't think using an ORM really shields you from having to learn how it works behind the scenes.
SQL NULLs in some ways behave similarly to floating-point NaNs. Of course NaNs are also weird in their own way, but it's a bit comforting that it's not so completely, singularly weird.
NaN is cool because it's almost like a type that constrains uncertainty. What do we know about this entity? not much! but it's definitely not a number. Calling it anything else would have been a mistake.
Null is more confusing because it means different things in different languages. Sometimes it's a more constrained uncertainty, e.g. this definitely doesn't exist. But in SQL it's a less constrained uncertainty, like "undefined" in math. The value of this thing couldn't make sense in this context, but we can't make assertions about its existence.
There's another comment in here that talks about thinking of NULL as UNKNOWN, and I quite like that. It makes a lot more sense, and I think it would've been a better choice to standardize on. UNDEFINED would also be an improvement.
NULL is the absence of a value. If you try to treat it as a value, you're going to have a bad time. So an attempted UNIQUE(email_address, deleted_at) constraint is fundamentally flawed. If you treat NULL as a value that could be unique, you're going to break foreign keys.
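To make the UNIQUE flaw concrete: under the default semantics NULLs are distinct from each other, so both inserts below succeed and you end up with two "active" rows for one email (schema invented for illustration):

    CREATE TABLE accounts (
        email_address text,
        deleted_at    timestamptz,
        UNIQUE (email_address, deleted_at)
    );
    INSERT INTO accounts VALUES ('a@example.com', NULL);
    INSERT INTO accounts VALUES ('a@example.com', NULL);  -- no error: NULL is not equal to NULL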
But let's continue the logic of deleted_at being NULL indicating an active account, which seems to be the intent here. You end up doing things like:
SELECT /* ... */
FROM accounts
WHERE email_address = '...'
AND deleted_at IS NULL
Depending on your database, that may or may not index well. More problematically, you may end up with privacy leaks if someone forgets the last conditional.
If anything, you want to reverse this so someone has to go out of their way to explicitly select deleted accounts. There are multiple strategies for this, e.g. using an active_accounts view or table.
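A minimal sketch of the view approach (names assumed):

    CREATE VIEW active_accounts AS
        SELECT *
        FROM accounts
        WHERE deleted_at IS NULL;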
Lastly, there are lots of potential reasons for an account to be disabled or otherwise not visible/accessible. Takedowns, court orders, site safety, hacked accounts and so on.
Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
> Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
Then don't do that. It's kind of a leap to say soft deletes are categorically bad because someone might confuse "deleted" with "inactive". My users table does the super-advanced thing of having both columns. The ORM also doesn't forget to add the not-null criterion. There are also zero databases in active use where this poses a problem for indexing.
Soft deletes suck in their own way, but none of the alternatives are perfect either.
You put the "is null" predicate on the index itself and then simply don't use it for the much rarer queries that are on deleted accounts. Or just use a view for active accounts.
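E.g. a partial-index sketch (Postgres syntax, names assumed), which as a unique index also enforces one live row per email:

    CREATE UNIQUE INDEX accounts_live_email
        ON accounts (email_address)
        WHERE deleted_at IS NULL;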
Overloading timestamps to carry a boolean on null is awesome as long as you decide that's what you're doing and use one of the several standard techniques to dodge the easily avoided potential downside.
This isn't a valid security concern, any more than any other incorrect SQL query would be. A dev can always write a bad query; you need another way to address that, and it's not more likely here because of the NULL.
The NULLs-in-unique-constraints quirk actually works differently in Oracle databases, which is infuriating to say the least. Apparently this comes from some ambiguity in the SQL standard; anyone know more about this?
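For what it's worth, the difference I've seen is with composite keys (sketched from memory; verify against your Oracle version before relying on it):

    -- Oracle: fully-NULL keys never conflict, but partially-NULL keys do
    -- when the non-NULL parts match:
    CREATE TABLE pairs (a int, b int, UNIQUE (a, b));
    INSERT INTO pairs VALUES (NULL, NULL);
    INSERT INTO pairs VALUES (NULL, NULL);  -- OK in Oracle (and in Postgres)
    INSERT INTO pairs VALUES (1, NULL);
    INSERT INTO pairs VALUES (1, NULL);     -- ORA-00001 in Oracle; allowed in Postgres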