The Spaghetti
Posted: June 21, 2023
Serge leans back, a mischievous grin playing on his face, and says, "You know, Max, there was a time when I encountered a metric so absurd, I had an existential crisis. It’s all BS. Why even bother? I questioned the point of it all, ya know?" His words pique my curiosity, and I lean in. His eyes twinkle as he begins his story, transporting me to a world where data anomalies run rampant and the quest for truth takes an unexpected turn.
This is Part II of I Smell Color. You can start at the beginning with Confessions of Analytics Professionals.
What stays with me is the buzz in the air. The whole place is on a caffeine high, humming with excitement and innovation. Boy, do I feel like a small fish in a big pond. The logo on the wall is unmistakable. At the time it still felt surreal to work as part of an analytics department, let alone there. Perks galore. All-you-can-eat snacks. Gourmet coffee. They've got it all covered. Nothing gets my analytical mind going like topping off a cup of joe.
I sink down into the loving embrace of my ergonomic chair. Desks so long they could be the hood of a Dodge Viper. Natural light streams in through floor-to-ceiling windows. This is all pre-pandemic, mind you; working from home wasn’t a thing.
It's not just about the creature comforts. This company knows how to inject some serious fun into the mix. Picture this—life-sized foosball. Yeah, you heard me. It's like regular foosball, but each player is an actual person attached to other actual people with a long pole, jiving and juking, trying to score a goal. It's like synchronized swimming, but with less water and more high fives. Teamwork makes the dream work, right?
Then there is my boss, a real wizard when it comes to data. He slices and dices numbers like a chef with a sharp knife and has a deep understanding of the company's operations. Expertise and leadership off the charts. He is our North Star, guiding us through the vast abyss of data. And trust me, in this field, you need someone with a map and a compass, 'cause it's easy to get lost in a sea of ones and zeros.
To be honest, I feel pretty dumb, but as the saying goes, if you’re the smartest person in the room, you're in the wrong room. I must be in the right room ’cause I’m learning a lot. I’m a kid in a candy store, except for nerds: a sage in the stacks; a brain in the books.
And then in an instant everything changes.
One minute, I'm part of a talented analytics team at a marquee tech company, and the next, I'm thrust into the spotlight as my boss and several other leaders abandon ship over irreconcilable differences with management. Talk about being caught in the whirlwind. Suddenly, I find myself wearing the expert hat, not because I earned it, but because everyone who knows better is gone. It's like I'm the captain of a ship with no crew and a map full of uncharted waters.
Now, let me tell you about the state of our data. At the time, we didn’t know. That was the issue. It was a black box. Our data model was opaque, with logic scattered all across the data stack. As we pick around the edges, a picture starts to form. Imagine a dense, thorny briar patch, each thicket representing a tangled mess of information. That's how I see it: unruly, interlacing, and chaotic. Management has a different take. They call it "spaghetti," a swirling plate of tangled noodles. It’s actually not far from the truth. Each report was fed directly from the source, and the logic for each was self-contained and sometimes borrowed.
You want an example of this spaghetti madness? Here's a good one. We have seventeen different conversion metrics. Each of them has its own specific meaning and purpose, valid in its own little world. But when you try to make sense of it across the business, you get seventeen flavors of conversion. Imagine Baskin-Robbins with 31 flavors, all named Vanilla. What's the point? We are looking for clarity, and finding a confusing conundrum of identical-sounding KPIs.
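To make the seventeen-flavors problem concrete, here is a minimal Python sketch. The session data and the three definitions below are made up for illustration and are not the metrics from the story; they only show how one word, "conversion," can honestly mean three different numbers.

```python
# Hypothetical event log: one row per visitor session (illustrative data only).
sessions = [
    {"visitor": "a", "signed_up": True,  "purchased": True},
    {"visitor": "a", "signed_up": True,  "purchased": False},
    {"visitor": "b", "signed_up": True,  "purchased": False},
    {"visitor": "c", "signed_up": False, "purchased": False},
]


def conversion_by_session(rows):
    """Share of sessions that ended in a purchase."""
    return sum(r["purchased"] for r in rows) / len(rows)


def conversion_by_visitor(rows):
    """Share of distinct visitors who purchased at least once."""
    visitors = {r["visitor"] for r in rows}
    buyers = {r["visitor"] for r in rows if r["purchased"]}
    return len(buyers) / len(visitors)


def conversion_of_signups(rows):
    """Share of signed-up visitors who purchased at least once."""
    signed_up = {r["visitor"] for r in rows if r["signed_up"]}
    buyers = {r["visitor"] for r in rows if r["purchased"]}
    return len(buyers & signed_up) / len(signed_up)


# Same data, same metric name, three different answers.
for metric in (conversion_by_session, conversion_by_visitor, conversion_of_signups):
    print(metric.__name__, round(metric(sessions), 2))
```

On this toy data the three "conversion" rates come out to 0.25, 0.33, and 0.5. None of them is wrong; they just answer different questions under the same label.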
But wait, it gets better. Throughout our dataset, we stumble upon conflicting definitions for dimensions. It's like having a dictionary that contradicts itself. The same word has two definitions, and you're left scratching your head: one definition says "up," the other says "down." Trying to make sense of such discordant definitions is like solving a puzzle with mismatched pieces. We found ourselves in a precarious situation.
Looking at this spaghetti mess, I know the road ahead won't be easy. We need to find a way to bring order to the chaos, to streamline our metrics, and align our definitions. But how?
So, here I am, the accidental expert, ready to untangle the spaghetti. Because if there's one thing I've learned, it's that sometimes, even the most twisted paths can lead to surprising discoveries. And who knows, maybe along the way, we'll find a hidden meatball or two.
That singular thought keeps me going. Find the meatball. There is gold in these hills. Start digging.
This is Part II of my new book, I Smell Color. If you made it to this point, consider subscribing. You can sign up here.
How did we get here?
Alright, buckle up ’cause now we are getting into the thick of it. Have you heard of CRISP-DM, the Cross-Industry Standard Process for Data Mining? It's like a roadmap for navigating data analysis. Someone had the bright idea to come up with an acronym for wildly intuitive common sense.
- Start with understanding the business.
- Figure out what questions you want to answer.
- Dive into the data.
- Unleash your analytical skills, exploring the data like a curious detective.
- Once you've uncovered some insights, fine-tune your models.
- Finally, you present your findings.
Now, grab a plate.
Imagine your little world of analytics as a massive bowl of spaghetti, with each noodle representing a report. Each noodle is cooked up using its own recipe, applying CRISP-DM. It's directly modeled from the source data, seasoned based on specific questions and business objectives. Take another noodle, a different report: it's cooked using the same CRISP-DM process, but different questions mean different seasoning. Each noodle, each report, has its own spin on the process, like adding various toppings or spices. Some noodles might be al dente, perfectly capturing the essence of the data, while others may be a bit overcooked, losing some of that data goodness. Mind you, that’s on purpose! Behold the spaghetti bowl of analytics! Hundreds of unique reports, each offering a different perspective, all tangled together by the common thread of the CRISP-DM approach. So, grab your fork and start twirling, because in this delicious world of data, every noodle tells its own story.
That’s a pretty good illustration of how we got to this point. For onesie-twosie analysis, CRISP-DM works. But it doesn’t scale.
The more seasoned (no pun intended) analysts want to do away with data modeling altogether. They want to join everything to everything. They envision it would look like a kind of mosaic. Each tile is a data source, and when assembled, the tiles form a breathtaking masterpiece . . .
in theory.
In reality, it’s more like a patchwork quilt with bald spots and gaps. This monolithic, wide, sparse table of data is large. It’s slow. And it joins everything to everything, creating dependencies and points of failure across reports that, for all practical purposes, aren’t related.
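The bald spots are easy to reproduce. Here is a small Python sketch, assuming three hypothetical sources that share a customer ID but otherwise have nothing to do with each other. The names and numbers are invented for illustration.

```python
# Hypothetical sources keyed by customer_id; each one knows different customers.
orders = {1: {"order_total": 40.0}, 2: {"order_total": 15.5}}
tickets = {2: {"open_tickets": 3}, 3: {"open_tickets": 1}}
newsletter = {4: {"subscribed": True}}


def wide_table(*sources):
    """Full outer join of every source on the shared key."""
    columns = sorted({col for src in sources for row in src.values() for col in row})
    keys = sorted({key for src in sources for key in src})
    table = {}
    for key in keys:
        row = dict.fromkeys(columns)  # every column starts out empty (None)
        for src in sources:
            row.update(src.get(key, {}))
        table[key] = row
    return table


def sparsity(table):
    """Fraction of cells in the table that are empty."""
    cells = [value for row in table.values() for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


wide = wide_table(orders, tickets, newsletter)
for customer_id, row in wide.items():
    print(customer_id, row)
print("empty cells:", round(sparsity(wide), 2))
```

With only three tiny sources, more than half of the cells are already empty (7 of 12). Every new source adds columns that are blank for most rows, and every report that reads the table now depends on all of them.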
So where CRISP-DM is too much, this mosaic model is not enough. The problem with CRISP-DM and the monolith mosaic is that each approaches data as random bits and pieces. But we don’t want to model the data, we want to model the lifeblood of business—transactions, customers, products, and everything in between. We have our source data in one hand and the reports we want to create in the other. But what we don’t have is a winning strategy to get from one hand to the other.
The Solution
The Goldilocks model is business dimensional modeling. Business dimensional modeling takes these vital concepts and arranges them in a way that reflects the reality of the business and facilitates analysis and decision-making.
In the beginning we didn’t know what to call it. So we called our efforts Lego blocks. You can think of business dimensional modeling as creating the building blocks of reporting. It's about organizing the diverse elements of a business into pieces that can be interlocked and unlocked in novel and unique ways while still maintaining a consistent foundational definition. For example, a kid can use a cardboard box as a boat, a table, or even a fort, but we can still agree it’s a cardboard box. In the same way, an idea can connect to another idea or concept along common dimensions and metrics, forming a cohesive and comprehensive view of the business. It represents how we think about the business because it models what we think about the business and ties in the relevant data.
(At this point Serge grabs a napkin, scribbling on it before handing it to me.)
Bazinga!
In contrast, the CRISP-DM and mosaic models are largely dependent on their sources to define the structure. With all the different sources, you lose cohesion. That is how it all gets bogged down.
Nailing down the problem was the first step. Wrapping our heads around business dimensional modeling was the next. This is when we began moving away from working as analysts and started seeing ourselves as analytics engineers. At that point we still hadn’t heard the term analytics engineer, even though that is what we were doing. It wasn’t until we adopted DBT into our data stack that everything really became clear.
DBT, or Data Build Tool, is our magic wand for transforming and modeling data. It's a platform that allows us to wrangle, shape, and organize the data to model the business. With the help of DBT, we can implement the principle of separation of concerns to organize and manage our transformations.
Separation of concerns is a principle in software development that advocates breaking a system into distinct, independent parts, each responsible for one aspect of functionality. For us, that means breaking the transformation process into smaller, manageable chunks. Each transformation focuses on a single concern and is decoupled from the others, which keeps things clear and avoids tangled spaghetti code. By modularizing our transformations, we enhance reusability, maintainability, and collaboration within our team.
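Here is a rough sketch of what that layering can look like. It is written in plain Python rather than DBT so it runs on its own. The three layers (staging, core, mart), the column names, and the revenue rule are all assumptions made for illustration, not the actual models from the story.

```python
# Hypothetical raw rows, as messy as they might arrive from a source system.
raw_orders = [
    {"ID": "1", "Cust": " acme ", "amt": "40.00", "status": "PAID"},
    {"ID": "2", "Cust": "Acme", "amt": "15.50", "status": "refunded"},
    {"ID": "3", "Cust": "globex", "amt": "99.00", "status": "paid"},
]


def stage_orders(rows):
    """Concern 1: cleaning only. Rename, trim, and cast; no business rules."""
    return [
        {
            "order_id": int(r["ID"]),
            "customer": r["Cust"].strip().lower(),
            "amount": float(r["amt"]),
            "status": r["status"].lower(),
        }
        for r in rows
    ]


def core_orders(staged):
    """Concern 2: business rules only. One shared definition of revenue."""
    return [
        {**r, "revenue": r["amount"] if r["status"] == "paid" else 0.0}
        for r in staged
    ]


def mart_revenue_by_customer(core):
    """Concern 3: presentation only. Aggregate the core model for one report."""
    totals = {}
    for r in core:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["revenue"]
    return totals


report = mart_revenue_by_customer(core_orders(stage_orders(raw_orders)))
print(report)
```

Because the revenue rule lives in exactly one place, a second report built on `core_orders` gets the same definition for free, and a change to the rule is a change to one function instead of a hunt through every report.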
One of the key tools DBT offers is the Directed Acyclic Graph (DAG), a map of the path our data takes from source to final destination: the data transformation arc. We start with the source data, which is often messy and unrefined. We use DBT to perform a series of transformations, taking the data on a journey from a multiverse of chaos to a world of understanding. We clean the data, apply business rules, and ensure the data conforms to our business dimensional models. These models, our core business logic, serve as the foundation for reporting.
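For a feel of what a DAG buys you, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). This is not DBT itself, and the model names are hypothetical; it only illustrates the two things a dependency graph gives you: a safe build order and traceable lineage.

```python
from graphlib import TopologicalSorter

# Hypothetical model names; each model maps to the models it depends on.
dependencies = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "core_orders": {"stg_orders", "stg_customers"},
    "mart_sales": {"core_orders"},
    "mart_finance": {"core_orders"},
}


def build_order(deps):
    """Return one valid order in which to build the models, upstream first."""
    return list(TopologicalSorter(deps).static_order())


def lineage(model, deps):
    """Return every upstream model that feeds into `model`."""
    upstream = set()
    stack = [model]
    while stack:
        for parent in deps.get(stack.pop(), ()):
            if parent not in upstream:
                upstream.add(parent)
                stack.append(parent)
    return upstream


order = build_order(dependencies)
print(order)
print(sorted(lineage("mart_sales", dependencies)))
```

The build order guarantees nothing is built before the things it depends on, and `lineage` answers the troubleshooting question: when a number in `mart_sales` looks off, which upstream models could be responsible? If someone introduces a circular dependency, `TopologicalSorter` raises a `CycleError`, which is the "acyclic" part of the name doing its job.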
As we progress along the transformation arc, our data starts to take shape. We can build data marts for specific business areas or functions. These data marts are built with our business dimensional models, ensuring that the data is structured in a way that supports efficient analysis and reporting.
Now, here's the exciting part—reporting on top of our business dimensional models. With the data now organized and modeled in a meaningful way, we can unlock valuable insights and empower decision-makers with actionable information . . . at scale. We can slice and dice the data, apply filters, and drill down into specific dimensions to understand trends, patterns, and outliers. The reports we develop are consistent because they come from a single source of truth, the business dimensional model.
But wait, there's more! DBT doesn't stop at just transforming and modeling data. It also provides features for data testing, documentation, and version control. We can validate the accuracy and integrity of our transformed data, document the entire process, and ensure that changes are tracked and auditable.
And then there is Jinja. I got started with DBT for the documentation, but I stayed for the Jinja. My interest in DBT began with a desire for automated documentation, but Jinja sold it. Once I discovered Jinja, my SQL game was forever changed.
At its core, DBT aligns three technologies to deliver knowledge better: SQL, YAML, and Jinja. You can do a lot with just SQL and YAML. Adding in Jinja makes SQL feel a lot more like traditional development. I kinda missed that. It's like seeing an old friend that you really liked but haven't seen for a while.
It’s next level.
A Better Place
And that, my friend, brings us to today. We have come a long way. We are in a better place, a place where the mythical single source of truth is no longer an elusive dream but a tangible reality within our data stack. A crystal-clear path to coherence.
Not really. But, you know, we’re getting there. Much closer for sure.
The backbone that stands it all up is our collection of business dimensional models. They lay the foundation for all the insights we unravel. They are the sources of truth and, and, AND we can trace 'em back to wherever the hell they came from. We turned a tangled mess of spaghetti into a neat, organized stack of . . . I don’t know . . . it’s a lasagna. We've brought order to the chaos, my friend. That’s the real freakin’ deal.
Each report we build is on a solid source of truth. What's beautiful about it? Each one has the freedom to frame those truths in its own way. No cookie-cutter nonsense here. We bring a fresh perspective to each report, the freakin' art of data storytelling, my friend.
“Damn,” is all I can say.
Later That Night
Later that night I’m at home, my MacBook open to a blank screen, thinking over my conversation with Serge. I’m still laughing about “data lasagna.” Serge is a character. I reflect on the importance of good friends, people who know what you are going through, and a healthy dose of humor with your troubles. Armed with newfound insights (and caffeine), I jot down some notes:
As an analyst I win when I can:
- provide reliable metrics and KPIs
- answer my boss's (stupid) questions
- demonstrate competence and confidence in the data
The success of my work relies on the underlying data model. When it comes to modeling I have three options:
- CRISP-DM (spaghetti)
- Wide Table (threadbare patchwork quilt)
- Business Dimensional Modeling (the only scalable alternative - lasagna)
Business Dimensional Modeling helps me win as an analyst because:
- core business objects create order for how we think about the stuff of our business
- data lineage shows the transformational arc of the data
- separation of concerns helps everyone see the steps in the transformation
As an analyst I don’t want to be beholden to my data sources to organize my thinking for me. I don’t want to reinvent the wheel from source data every time I need to create a new report. I want subject matter expertise and domain knowledge to frame the structure of how I think. I insert the data into that framework to inform decisions, tactics, opportunities, etc. If and when sh*t hits the fan, I can look to the DAGs to troubleshoot.
I think I’m picking up what Serge was putting down. I'm excited to share what I've learned at work. Tomorrow. Today's been a full day.
Seriously, if you made it to this point, do me a solid and subscribe. It's the only way I can see if this is adding value. You can sign up here.
I'll shoot you a short email when the next part drops.