6 Reasons Why Dev Teams Should Own their Data Model

Although it seems like it would be common knowledge that development teams should own their data model, many organizations hand data modeling to a separate, independent team. Here are six reasons why, in my experience, that arrangement causes problems.

1. It biases application architectures toward transactional, relational databases.

Where I've worked, the data modeling teams specialized in relational databases, which contributed to the organizational mindset of handling all data persistence requirements through use of an RDBMS. Unfortunately, the presumption of RDBMS usage limits discussion about application persistence requirements and associated trade-offs:

  • "Is my system transactional?"
  • "Do we need immediate consistency at the expense of availability?"
  • "How do we best support the high write throughput we anticipate?"

It's always worthwhile to have these conversations before diving right into the design of the DDL.

2. It promotes the false belief in a single, canonical representation of data for all use cases.

The data modeling teams I've worked with tend to believe there's an ideal way to represent the data, generally based on one particular view of the world that seeks to maximize constraints and minimize data duplication. In reality, data can be represented in myriad ways to suit a variety of purposes; there is no such thing as a "canonical" data model. Data models exist for particular purposes.

Consider customers and their addresses. One valid view is a many-to-many relationship: a customer may have multiple addresses, and multiple customers may share a single address. By contrast, consider an application purposefully designed to allow purchases without forcing customers to establish an account first. For every purchase, you want to capture just the minimal information necessary to complete the transaction, which includes a customer name and address. In this case, a better data model would combine the address and customer information into a single "purchaser" entity.
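Here's a rough DDL sketch of the two perspectives (the table and column names are purely illustrative):

```sql
-- Perspective 1: the normalized, many-to-many view of customers and addresses
CREATE TABLE customer (
    customer_id  BIGINT PRIMARY KEY,
    name         VARCHAR(200) NOT NULL
);

CREATE TABLE address (
    address_id   BIGINT PRIMARY KEY,
    street       VARCHAR(200) NOT NULL,
    city         VARCHAR(100) NOT NULL,
    postal_code  VARCHAR(20)  NOT NULL
);

CREATE TABLE customer_address (
    customer_id  BIGINT NOT NULL REFERENCES customer(customer_id),
    address_id   BIGINT NOT NULL REFERENCES address(address_id),
    PRIMARY KEY (customer_id, address_id)
);

-- Perspective 2: guest checkout; each purchase captures only the name and
-- address needed to complete that single transaction
CREATE TABLE purchaser (
    purchase_id  BIGINT PRIMARY KEY,
    name         VARCHAR(200) NOT NULL,
    street       VARCHAR(200) NOT NULL,
    city         VARCHAR(100) NOT NULL,
    postal_code  VARCHAR(20)  NOT NULL
);
```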

If the data is modeled using the many-to-many approach, however, the development team would insert a new customer record, address record, and customer-address association record for every purchase, which would violate the data model's intentions and result in "bad" data. But the problem isn't that the application is inserting bad data; it's that the data model does not fit the perspective of the application.

Instead of trying to conform every system to adhere to a single perspective, we should allow multiple perspectives and manage their co-existence.

3. It elevates the intrinsic quality of the data model above other concerns.

The belief in a single, canonical data representation creates an environment where adherence to that ideal representation takes priority over other application concerns: ease of implementation, simplicity of legacy data migration, schema change agility, performance optimizations, etc. In other words, it becomes "valuable" to the organization to have a nicely normalized database with minimal data duplication adhering to all of the various naming and key conventions, even if it's of little value to the application.

"Could we tweak the DDL to simplify Hibernate mappings?" "No, that would go against some of our database standards."

"These soft delete date columns has no meaning or relevance to application--can we remove them?" "No, our data modeling team requires all tables to have them."

"We have a query we run very often. Can we change the model to optimize this query's performance?" "No, that change would introduce too much de-normalization."

"There are concepts in the next general data model that don’t exist in the legacy system; can we defer implementing these concepts until after we've migrated off of the legacy system?" "We've added these concepts based on our requirements discussions with the product owner. We wouldn't want to remove them at this point."

"We would like to add a column to this table to support a quick maintenance release." "Here's the updated data model DDL; we’ve added two new tables and four more constraints. By the way, we noticed some names were not in agreement with our yet-to-be-published standards, so we went ahead and renamed them."

We should prioritize and tackle application development concerns that help us achieve our business objectives. If we want to migrate off of our legacy systems, we need to value things that allow us to migrate quickly (e.g. data structures that agree with legacy semantics). If we want to have aggressive application release cycles, we need to value things that enable us to deliver and adapt quickly (e.g. automated regression testing, database scripts managed alongside the application code).
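As one concrete example of the latter, a versioned migration script checked in alongside the application code (the Flyway-style file name and the table below are illustrative) lets a schema change ship in the same release as the feature that needs it:

```sql
-- src/main/resources/db/migration/V7__add_gift_message_to_order.sql
-- Lives in the application repository and is applied automatically at deploy
-- time, so the schema change and the code that needs it move together.
ALTER TABLE customer_order
    ADD COLUMN gift_message VARCHAR(500);
```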

4. It encourages integration at the database level.

When the organization emphasizes the intrinsic quality and documentation of DDLs above application service interfaces, the database becomes a desirable point of integration. This is especially true when integration service APIs provided by other, parallel projects are inconsistently designed, under development, or at risk of not being delivered on time. "I know how to mitigate that risk, let’s just query their database directly ourselves."

This short-term solution causes long-term problems. Once a database is used by multiple applications, it becomes very difficult to make database changes. Multiple applications with separate budgets and release cycles need to be modified in tandem. Some applications may not be able to support certain degrees of change. If we want to move towards service-oriented, cloud-native applications, then we need to encourage integration at the service level using cloud-friendly protocols.

5. It encourages big upfront design.

When your database is the integration point of your architecture, you attempt to mitigate the risk of DDL changes by future-proofing the data model through big upfront design:

  • Include extraneous columns and tables that are not needed immediately but will support potential future requirements (see the sketch after this list), and
  • Bring together potentially isolatable data sets (e.g. customer management, workflow management, document metadata, etc.) in order to allow a broad set of questions to be answered in the future, if needed.
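For illustration, here's what a "future-proofed" table might look like (a hypothetical example; only the first three columns are needed by the upcoming release):

```sql
CREATE TABLE document (
    document_id         BIGINT PRIMARY KEY,
    title               VARCHAR(200) NOT NULL,
    uploaded_at         TIMESTAMP    NOT NULL,
    workflow_state_id   BIGINT,       -- anticipated workflow-management integration
    retention_class_cd  CHAR(4),      -- anticipated records-retention rules
    legacy_source_id    VARCHAR(50),  -- anticipated second legacy migration
    locale_cd           CHAR(5)       -- anticipated internationalization
);
```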

As the saying goes, "Prediction is very difficult, especially about the future."

Big Upfront Design conflicts with the Agile/DevOps principle of Minimum Viable Product. We should encourage data models that are minimally viable for present use cases and establish processes that allow us to easily make changes to the model with minimal disruption to the application code.

6. It complicates the development process.

Big Upfront Design is one cause of complications. Application developers are forced to deal with data structures that may contain data outside their application’s scope, are designed for future use at the expense of present needs (like legacy migration), or lack defined semantics.

Consider the standard practice of adding BEGIN_EFFECTIVE_DATE, END_EFFECTIVE_DATE, and DELETE_IND columns to lookup tables. The mere existence of these columns creates the possibility for lots of peculiar scenarios. What happens when the end date is set but the delete indicator is false? Is this the same as when the end date is not set but the delete indicator is true? What if the end effective date is before the begin date? What if I have a record with a foreign key relationship to a deleted record? How does this differ from a foreign key relationship to a non-effective record? A better question to ask is: what benefit do these columns provide that justifies the added confusion?
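Here's a sketch of the pattern (the lookup table and query are illustrative) showing how much every consumer of the table has to reason about just to read a handful of lookup values:

```sql
CREATE TABLE order_status_lkp (
    order_status_cd       CHAR(4)      PRIMARY KEY,
    description           VARCHAR(100) NOT NULL,
    begin_effective_date  DATE,
    end_effective_date    DATE,
    delete_ind            CHAR(1) DEFAULT 'N'  -- 'Y'/'N' soft-delete flag
);

-- Which rows are "usable"? Every consumer has to reinvent this filter and
-- decide what an ended-but-not-deleted or deleted-but-still-effective row means.
SELECT order_status_cd, description
  FROM order_status_lkp
 WHERE delete_ind = 'N'
   AND (begin_effective_date IS NULL OR begin_effective_date <= CURRENT_DATE)
   AND (end_effective_date   IS NULL OR end_effective_date   >= CURRENT_DATE);
```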

Things are further complicated by the data modeling process itself. Data modelers often meet independently with the product owners when gathering requirements. Not only does this take up extra time for the POs, but sometimes the development team and the modelers come away with different interpretations of the requirements that then need to get hashed out. The extra feedback loop between the development team and the data modeling team introduces delays. This has been mitigated by having the modelers work one Sprint ahead of the team, which is itself another complication (think of the Agile layered cake analogy with vertical slices—the layers within the slice are staggered across multiple Sprints).

If we empower the development team to create UIs, design APIs, and write application code, shouldn't we also empower them to create DDLs? Why is the DDL held in such esteem that we need an independent group to create it, own it, and house it separately from the rest of the application code?
