📚 Finished reading The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy Ross.
Most everyone that’s been in charge of designing a large database has recommended this to me as a good summary of how to think about the structure a data warehouse.
It teaches the Kimball method - you might notice that’s one of the authors' surnames. So if you’ve come across terminology such as ‘type 2 slowly changing dimension’ or the facts and dimensions used in dimensional modelling et al. then it’s very possible that your interlocutor has read this tome.
Whilst I’m more an analyst than an engineer, I have dabbled in the latter and this book will be good guidance on how to think things through going forward.
A key aspect, and one I’m grateful for, is to focus on making the data easy to understand and analyse in downstream tools, even if it requires substantially more effort - both technical and diplomatic - up front.
It’s been helpful to me to understand why systems I use are set up like they are - e.g. a distinct absence of using null value where I’d initially thought it’d make sense to use them. It’s also a refresher on basic patterns one can use to enable common analysis requirements like ‘how many times has this value changed over time?’ or ‘what would the results look like if they were remapped to the historical structure of the organisation?’
It’s written rather prescriptively - follow this rule or regret it forever! I have been told by practitioners that sometimes it’s necessary, or at least preferable, to break the occasional rule in reality but that their recommendations are good ones for the majority of the time.
It’s also organised slightly confusingly in that most of it is divided up into chapters seemingly aimed at various specific business applications e.g. e-commerce or insurance. But they’re at pains to say that you shouldn’t just read the one that’s most similar to your organisation or the task you’re trying to accomplish as you’ll not understand it fully without all the context provided in the earlier chapters that from their title don’t seem relevant to you. So I’m not entirely certain why they structured it like that in the first place.
But that aside, it’s invaluable reading for anyone designing databases or using other people’s databases that are designed this way.