Would you ever build a house without a blueprint? Then you shouldn’t build your data architecture without a roadmap. Otherwise, you’ll end up with uneven surfaces, a weak foundation, and work that is worth less than it should be.
In this post, I’ll share the 8 principles behind any modern and scalable data architecture. Let’s dive in.
Vancouver, where I live, has been dealing with a leaky condo crisis for over 30 years. A leaky condo is an apartment building or other housing unit that allows water to seep in. This leads to rot, decay, and costly repairs.
The seed of this crisis was planted in the late 1980s and early 1990s. There were multiple factors at play here – ineffective building codes, lack of planning for a cooler climate, and lack of accountability for construction companies. All in all, this led to “the biggest and most costly reconstruction of housing stock in Canadian history”.
I mention this story because this is exactly what you don’t want to happen to your data architecture. If you skip critical planning steps or try to cut corners, you may find yourself with a leaky data infrastructure that is limiting the growth of your company.
I recommend that you start at the end, with the outcome that you’re hoping to achieve. This could be faster insights for your company or the ability to build complex machine learning models or simply a better way to uncover eureka moments. The more these outcomes tie into the long term strategy of the company, the better.
The fastest path to your outcome is a straight line. We want to minimize the left and right turns here unless necessary. You might still need to take a detour but we can find ways to deal with these changes.
For example, let’s imagine that you want to upgrade your data architecture but you estimate that this will take 12 months. In the meantime, you want to bring in an external agency to help you sort through your existing data. This is a detour. Instead, you can explore hiring someone internally who can help you now and in the future. You could start training this person on your company data, KPIs, and reports.
When designing your architecture, think in 3 time zones: now, in 6 months, and in 12 months. That’s it. Challenges farther out are too blurry to see yet. Besides, if you don’t solve the problems of the present, you might never get to that future.
I have helped 75+ companies build their data architecture. Each company had a different level of complexity depending on their industry, size, and unique makeup of each team. However, they all had several similarities and principles that we will talk about in this section.
Data collection should happen in multiple ways across all customer touchpoints. This may include websites, mobile apps, product databases, offline sources and others. It’s not enough to just see what customers are doing on your website. We need to also see how they got there, what they purchased, and their interactions with the customer support team.
I was talking to a reporter about data warehouses vs data lakes. The former is meant for structured data while the latter is seen as a dumping ground for data you might use in the future. I argued that data lakes are only useful in rare, specific scenarios, mostly around machine learning. Data needs to be structured and ready for analysis; otherwise, it doesn’t have much short term value.
Building on the previous point, the schema of your data matters. You can design it ahead of time instead of hoping that you can transform your data in a data lake. Doing it ahead of time means making conscious choices on how you track users, what user ID is used across your company, and proactively defining what events and actions are needed.
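To make this concrete, here is a minimal sketch of what defining events ahead of time could look like. The event names, the fields, and the `is_valid_event` helper are all hypothetical; your own tracking plan will look different.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical event names, agreed on before any tracking code ships.
ALLOWED_EVENTS = {"signup_completed", "product_viewed", "order_placed"}


@dataclass(frozen=True)
class TrackedEvent:
    user_id: str  # the single user ID shared across the company
    name: str     # must be one of ALLOWED_EVENTS
    source: str   # for example "web", "mobile" or "support"
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Reject anything that does not match the agreed schema.
        if not self.user_id:
            raise ValueError("user_id is required")
        if self.name not in ALLOWED_EVENTS:
            raise ValueError(f"unknown event: {self.name!r}")


def is_valid_event(user_id: str, name: str, source: str) -> bool:
    """Return True if the event matches the predefined schema."""
    try:
        TrackedEvent(user_id=user_id, name=name, source=source)
    except ValueError:
        return False
    return True
```

The point is not the specific fields. It is that an event nobody planned for gets rejected at the door instead of quietly piling up in storage.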
Data locked up in a vault with no key is useless. This is like the old question about a tree falling in the forest when no one is around to hear it: does it make a sound? I have seen teams with limited data but world-class accessibility do better than teams with world-class data but poor accessibility.
Good accessibility means that everyone in the company can quickly query the data, that they trust it, and that it comes in multiple formats (dashboards, emails, SMS, notifications, CSV, etc.).
You don’t know everything that you will need right now in your data architecture. That’s fine, and we will focus on the next 12 months at most. However, you do need to build modular support into your infrastructure. This means that you can easily plug in other ways to access the data in the future.
For example, choose a data warehouse with extensive support, like Redshift or BigQuery, or choose an ETL tool with deep functionality rather than one with a limited toolset.
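In code, modularity often comes down to agreeing on a small interface and letting each tool plug into it. Here is a rough sketch, assuming a hypothetical `Destination` interface with two stand-in implementations. Neither one talks to a real warehouse; they only illustrate the shape.

```python
from typing import Dict, List, Protocol


class Destination(Protocol):
    """Anything that can receive rows. Hypothetical interface."""

    def write(self, rows: List[Dict]) -> int: ...


class InMemoryWarehouse:
    """Stand-in for a real warehouse client."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def write(self, rows: List[Dict]) -> int:
        self.rows.extend(rows)
        return len(rows)


class CsvExport:
    """Stand-in for a CSV export, added later without touching the rest."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def write(self, rows: List[Dict]) -> int:
        for row in rows:
            self.lines.append(",".join(str(value) for value in row.values()))
        return len(rows)


def publish(rows: List[Dict], destinations: List[Destination]) -> int:
    """Send the same rows to every destination; return total rows written."""
    return sum(destination.write(rows) for destination in destinations)
```

Adding a new way to access the data then means writing one more class with a `write` method, not reworking the pipeline.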
We now live in a privacy-first world. This means collecting only what you need (which is another reason why data lakes aren’t helpful), protecting your data, and complying with regulations like GDPR and CCPA. Don’t wait to tackle privacy until you see your first fine or data deletion request.
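As a small illustration of collecting only what you need, here is a sketch that keeps an allowlist of fields and handles a deletion request. The field names and both helper functions are hypothetical, and this is nowhere near a full GDPR or CCPA implementation.

```python
from typing import Dict, List

# Hypothetical allowlist: the only fields we have decided we actually need.
ALLOWED_FIELDS = {"user_id", "event", "occurred_at"}


def minimize(record: Dict) -> Dict:
    """Drop every field that is not on the allowlist before storing it."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def delete_user(records: List[Dict], user_id: str) -> List[Dict]:
    """Return the records with everything belonging to user_id removed."""
    return [record for record in records if record.get("user_id") != user_id]
```

If a deletion request is a one-line filter today, it is far less likely to become an emergency later.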
Backups are like insurance. You don’t want to use them, they should be invisible, and they need to work when they are called upon. Be especially mindful of how you handle backups as the data volume increases in your data architecture.
Design security into your data from day 1. Like a house, you need different levels of security depending on how sensitive or valuable your data is. Some data can simply be behind door keys while other data should be stored in a vault within your house. Explore options like limiting data to company emails, two-factor authentication, and creating special group permissions. Take advantage of “seamless” security options like fingerprint and face unlock on mobile phones.
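Here is one way the door-keys-versus-vault idea could be sketched. The sensitivity levels, dataset names, groups, and the `example.com` domain are all made up for illustration, and a real setup would lean on your warehouse’s own permission system rather than a hand-rolled check.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    INTERNAL = 1    # behind the door keys
    RESTRICTED = 2  # locked room
    VAULT = 3       # the vault within the house


# Hypothetical datasets and groups, for illustration only.
DATASET_SENSITIVITY = {
    "page_views": Sensitivity.INTERNAL,
    "orders": Sensitivity.RESTRICTED,
    "payment_details": Sensitivity.VAULT,
}
GROUP_CLEARANCE = {
    "everyone": Sensitivity.INTERNAL,
    "analysts": Sensitivity.RESTRICTED,
    "finance": Sensitivity.VAULT,
}


def can_access(email: str, group: str, dataset: str,
               company_domain: str = "example.com") -> bool:
    """Allow access only to company emails with enough group clearance."""
    if not email.lower().endswith("@" + company_domain):
        return False
    clearance = GROUP_CLEARANCE.get(group)
    level = DATASET_SENSITIVITY.get(dataset)
    if clearance is None or level is None:
        return False  # unknown group or dataset: deny by default
    return clearance >= level
```

Note that anything unknown is denied by default. It is much easier to open a door later than to discover one was never locked.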
Even in 2020, you can still find leaky condos throughout Vancouver. Simply look for properties that are listed way under market value. I have seen apartments listed for $250,000 less than market value. These are the leaky condos.
These leaks in your data architecture can lead to lawsuits, wasted effort and shackles on your growth. Some of the corners that I see companies cut all the time include skipping over data privacy, choosing technology at random, and poor schema design.
Assume that this architecture will require significant changes every 12 – 18 months, so build that in. Make it easy to replace pieces of your infrastructure. Keep your strategy simple and scale it over time.
Good construction is done at the blueprint level. This is where you determine the weakest points, unleash your creativity, and win the game. Construction (or implementation) is merely executing on this plan. Remember that as you design your data architecture.