- Publicado en
A Million Dollar Null
Ok, maybe not a million but could've cost a lot of money.
The Story
Once upon a time, I was tasked with getting some historic data for a key project. I was supposed to get the data from a PostgreSQL database and all the data pivoted around a table similar to the following:
create table operation (
operationid bigserial primary key,
creationdate date not null, -- Deprecated, we should only use the timestamp
creationtimestamp timestamptz,
amount bigint, -- The amount in minor units
additionalthings jsonb,
operationtypeid bigint references operation_type,
-- ... more stuff here
)
The task was simple, I had to get all the operations that happened in the past six months, hidrate them with some other data sources and send a csv file so we can get paid. Simple enough, right?
The only things I had to keep in mind where those creationdate
and creationtimestamp
columns. I knew at some time we stopped paying attention to the creationdate
and moved to using creationtimestamp
so I should check if the period I'm interested in overlaps with the period when we were only using the creationdate
field.
> select count(*) from operation
where creationdate != creationtimestamp::date
and creationtimestamp >= now() - interval'6 months'
count
-------
0
(1 row)
Seems like I'm not affected so I can move on to hidrating the data and finishing my Friday early, right? Easy enough, right? RIGHT? That's exactly what I did without noticing that my sanity check was completely meaningless.
See, the creationtimestamp
columns was nullable because it had been introduced way after the creation of the table (enough time for making it infeasible to backfill), so the comparison I made between it and the creationdate
makes no sense.
NULL
compared with anything is always NULL
, thus you cannot say that the columns are different or equal, you just cannot compare them in that way.
Had I done something like the following, I would've noticed that I was missing almost three months of data:
> select min(creationtimestamp) from operation;
min
------------------------
2020-08-15 00:00:00+02
(1 row)
> select count(*) from operation
where creationtimestamp is null
and creationdate > now() - interval '6 months';
count
-------
1800
(1 row)
... Shit.
Conclusion
This is just a watered down version of what actually happened, and of course I'm making the query examples on synthetic data but the essence of the mistake I made is still there. Fortunately a second sanity check someone else did before submitting the data was enough to catch it so I could fix it on time.