In 2018 I started work on an analytical framework to help evaluate the quality of data management systems. What I was looking for was a way to say to potential clients “your system has these problems, it will cost $X to fix them, and the end result will be measurably better.” It’s the “measurably better” claim that’s the really hard part. I’ve got a lot of experience in data management, but an analytical framework beats your life story.
The first step toward a measurement framework was defining failure points. If we can sort out how a system’s components fail, count the actual failures, and identify which components are likely to fail and which aren’t, then we can start to assess a fix and assign it a dollar value. What I concluded was that when we do data management – the kind of thing we do when we build a data warehouse or even a transaction database behind an application – there’s only four operations we ever do, period. Those four operations are the whole of data management, and each of them can fail, so they’re our failure points. I’m going to explain what those four are below.
That claim should provoke some skepticism. Why only four? Surely in the rich history of data management there’s more than four ways to screw up. It’s lucky there’s only four, though, because one of the problems we face in explaining the cost of poor-quality data management is task scope: it’s much easier to sort out what to do with four problems than with 38.
If I’m managing the construction of widgets on an assembly line and I want to make widgets faster, cheaper, or slightly different, I need to be able to divide the construction processes into component parts and abstract those processes so I can figure out how to do more with the same amount of effort. This is obvious, no?
So to improve the overall output produced by data management teams we need to identify some small, discrete set of things those data management teams do. “ETL,” “BI” and “database administration” are architectural options and not frameworks. That is, where we choose to transform data into something useful for our client depends on their preference and our skillsets, but we can do pretty much the same things in the ETL, BI or persistence layers of our system. If we know there’s only four operations happening across all of those layers, we can start to apply some intelligence to particular local problems. And we can start to calculate the cost of doing things the right way versus the wrong way, the cost of improving a single operation so it does its work better, and the cost of the failure of that operation to do its expected job.
This is important. I have former coworkers looking at multiple petabyte-size file servers stuffed with test data who believe there’s value in those systems. The only framework they have for evaluating the utility of that vast catalog of data is subject matter: They know roughly what a particular set of directories is about, but the only quality scale most of us have for such environments is “messy-to-pretty clean.” This is not much more useful than using your thumbs to measure out the lumber you need to build a house.
I also have colleagues who’ve inherited systems built by move-fast-and-break-stuff-oriented co-workers. These people have to turn those systems into reliable, useful reporting tools. But there’s no way to convey the difficulty of that conversion to hopeful managers handing off their latest prototype, to show where the prototype’s failure points will be, or to provide an operating-cost analysis. Now I’ve built a few of those Rube Goldberg systems myself, so let’s not condemn the enthusiastic-but-inexperienced out of hand. But if you need to explain why it costs so much to maintain a half-assed data lake with a swampy layer of assumptions where the semantic layer ought to be, in comparison to something that’s had some thought put into it, all you’ve got again is thumbs.
There’s no way, in short, to explain why good data management practices cost less than bad ones.
Data management
Data management consists of two broad kinds of effort. First, there’s data modeling. At one end of the scale data modeling is the often-handmade, idiosyncratic and painstaking construction of a gigantic ER model in a relational database. At the other end is automated feature recognition applied to some datasets from which we hope to squeeze some short-term value. In between those two extremes, there’s a lot of room, but pretty much all of it requires human judgment. So let’s take the data model as immutable, for our purposes, whether its construction is automated or expensive.
There is always a model. This is something of an axiom in data management: All data is structured, even if the model is just documented poorly. So-called “unstructured data” is really just “unstructured for me”: Whether records are created by dropping form results into an amoeba-like MongoDB or by a device spitting out billions of results, someone spent some time figuring out what those results mean.
That this axiom is true might take a moment of thought. We don’t build forms without also specifying a mechanism for retrieving the data created by the form; we don’t build devices that populate files with white noise just to fill up file servers. When we accidentally do generate white noise or drop records we can’t retrieve we consider that software buggy. So all data has, by its very nature, a structure, a data model, a set of fields and filter criteria on records.
The other kind of effort is data movement. The vast majority of a data management team’s effort is spent on initiating and managing data movement. We’ve got data in one place, with an implicit or explicit data model, and we need to move some of it – some of the fields and some of the records – to another location. The target location may be a file name on an FTP site, a stage table, a Hadoop cluster, a report sent to someone’s inbox, a list of items available for purchase displayed alongside an <add-to-cart> button, or a machine waiting on instructions.
All data movement consists of between one and four operations, inclusive. The data modeling piece is still, for the purposes of our discussion here, a black art. I’ve got some opinions on data modeling you may or may not be interested in, and much of what goes wrong with data movement is caused by the data model. But the data movement piece is not a black art, because there’s only four kinds of things we do on the assembly line.
Semantic assets
To understand why there are only four things we do in data movement, let’s do a quick walkthrough of the concept of the semantic asset.
A semantic asset is a subset of a dataset. Let’s call this dataset a source. That subset is consumed by a target. Multiple assets can be generated out of a single source, and this is something we all hope for.
Good semantic assets are reliably constructed subsets: they have a stable set of fields and the filter criteria for the records included in the asset are clear to anyone who consumes the data. In short, a good semantic asset has a stable data model. Unreliable semantic assets will have indeterminate fields or indeterminate filtering criteria.
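To make that concrete, here’s a minimal sketch in Python. The dataset, fields and filter are all hypothetical; the shape is the point: a good asset is a stable field list plus an explicit filter over a source’s records.

```python
# Hypothetical sketch: a semantic asset as a stable field list plus an
# explicit filter over a source's records.

SOURCE = [
    {"customer_id": 1, "region": "WEST", "status": "ACTIVE", "ltv": 1200.0},
    {"customer_id": 2, "region": "EAST", "status": "CLOSED", "ltv": 310.0},
]

ASSET_FIELDS = ("customer_id", "region", "ltv")  # the stable set of fields

def active_customers(source):
    """The asset: ACTIVE customers only, projected onto ASSET_FIELDS."""
    return [
        {field: record[field] for field in ASSET_FIELDS}  # explicit field subset
        for record in source
        if record["status"] == "ACTIVE"                   # explicit filter criteria
    ]

print(active_customers(SOURCE))
# [{'customer_id': 1, 'region': 'WEST', 'ltv': 1200.0}]
```

An unreliable asset is the same code with the field list or the filter drifting underneath its consumers.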
Let’s abstract a little further. The basic use of a semantic asset is as a simple, defined subset of the source dataset. You could think of the target as really just a set of expectations. This is why all this discussion is semantic and not syntactic: The target expects the asset to be meaningful in very specific ways. (This is true whether the target is an admin in HR looking at a report or a machine.) If the asset is not meaningful then it’s not a very good asset. It’s that failure to mean what it’s supposed to that we’re trying to quantify: How and why does a semantic asset fail to meet its target’s expectations? The data model that makes up the asset is immutable, again, and it may be automated or expensive. But what is the asset expected to do? What transformations or operations, in short, does the asset perform on the source?
Very little of what we do in data management is single-source simple. Most of the time our semantic assets are more complicated than that.
We usually have to combine at least two sources to make a single asset. The two sources might be two tables in a relational database, and we have to join and map them together based on some combination of fields. In more complex and more typical cases the sources themselves are queries or databases or nebulous persistence mechanisms like file formats, and we’re merging two very complex datasets.
In the beginning, before we engage ETL and BI teams, what we’ve got is just sources. Immutable, untouched, lacking any obvious critical faults, sources are just more-or-less stable collections of records and fields divided into various entities, documents, file formats or whatever. It’s not until we try to recombine that data in ways that will be meaningful to people in the organization that we run into trouble. Each recombination is a bet that a particular semantic asset – a particular subset of fields and records – is valuable to an end-user. The data management team is thus in the business of creating obligations to end-users. I prefer to think of the creation of these assets as debts. The output of a data management team is a set of obligations that must be tended, hopefully improved, and acted on if they fail.
Type 2 failures
We’re going to start discussing types of failure with the second type, just for ease of exposition.
When we write queries that join two datasets we need to find a set of fields in each dataset that provide semantically-equivalent definitions of the dataset’s records. We need, in short, keys. In the case of traditional, old-and-busted relational databases the designers have usually encapsulated the dataset’s natural keys into an artificial key. An artificial key allows us to substitute e.g. CUSTOMER_ID for more verbose constructions like {F_NAME, M_NAME, L_NAME, DOB, SSN}, which may or may not be present in all of our datasets. When we join two or more sources we’re looking for common keys. If the key on one side of the join changes then the asset won’t mean the same thing it did before the change occurred. I call this a Type 2 Failure, or “incompatible natural keys.” The failure of the operation occurs because the change in the join keys means the assets before and after the change are no longer semantically equivalent.
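Here’s a hypothetical sketch of that bet in Python. The join still executes and the row counts look fine; the problem is that the keys on the two sides quietly stopped being semantically equivalent.

```python
# Hypothetical sketch of a Type 2 failure: nothing errors out, but the
# join keys no longer identify the same things on both sides.

orders = [
    {"customer_id": "C-1001", "order_total": 250.0},
    {"customer_id": "C-1002", "order_total": 75.0},
]

# Suppose the customer master was migrated and keys were reissued under a
# new scheme; "C-1002" below is no longer the customer it used to be.
customers = {
    "C-1001": {"name": "Acme Corp"},
    "C-1002": {"name": "Entirely Different LLC"},  # recycled key, new meaning
}

asset = [
    {"name": customers[o["customer_id"]]["name"], "total": o["order_total"]}
    for o in orders
    if o["customer_id"] in customers
]
# The asset now attributes order totals to the wrong customers, which is
# exactly the kind of failure no amount of syntax checking will catch.
```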
Let’s think about this failure as a “debt” for a minute: When you join two datasets on a key you’ve made an implicit bet that the key is reliable. If the key is assigned by a venerable ERP system then the bet is that the ERP system is reliable. If the key is assigned by an intern just getting their feet wet with SQL by merging a bunch of spreadsheets copied into a data lake, the bet is that the intern has picked the right fields. If the bet doesn’t work out then you owe a fix to the user of the asset. You need to find a replacement. This is why I refer to these bets as semantic debts: If the bet goes south you’ve got to pay someone with some time and effort to replace the key.
Type 1 failures
Once we get the join figured out, we often have to rename fields we find in the source so they’re meaningful to the target.
Here’s a real-world example I’ve spent days on: Suppose you’re mapping a customer address record from a jurisdiction with five or more levels of geographical specificity into one with the boring old North American standard of four. So you’ve got a source dataset with at least the fields {PREFECTURE, {CITY, VILLAGE, CITY WARD}, SUBAREA, SUBAREA_NUMBER, BLOCK, HOUSE}, and you’ve got the target fields {ADDRESS1, ADDRESS2, CITY, STATE, POSTAL_CODE} to play with. What does that map look like? Is it complex? Are there target field-size considerations? Almost certainly there’s some logic to it. Anyone who’s done a mapping like that can’t help but feel there are corner cases they’ve missed or misunderstood.
I call this a Type 1 Failure, or “incompatible schemas.” The schemas are not just incompatible because the field names are different: Often enough the strings are identical, but the meaning of the fields is different. The thing represented as a CITY in one geographical naming-convention scheme may not be considered a CITY in another; some objects are planets in some models, and mere planetoids in others. We’ve made a bet, again, that the mapping between two sets of fields will hold, specifically that the meaning of the two fields is the same.
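To make the bet visible, here’s a hedged sketch of that address map in Python. The collapsing rules and field sizes are assumptions I’ve invented for illustration, not a recommended mapping; every line encodes a claim about what the source fields mean.

```python
# Hypothetical sketch of a Type 1 bet: collapsing a five-plus-level address
# schema into the four-level North American one. Each comment is a bet.

def map_address(src: dict) -> dict:
    # Bet: SUBAREA, SUBAREA_NUMBER, BLOCK and HOUSE collapse into one street line.
    address1 = " ".join(
        str(src[f]) for f in ("SUBAREA", "SUBAREA_NUMBER", "BLOCK", "HOUSE")
        if src.get(f) is not None
    )
    # Bet: whichever of CITY / VILLAGE / CITY_WARD is populated counts as the city.
    city = src.get("CITY") or src.get("VILLAGE") or src.get("CITY_WARD")
    return {
        "ADDRESS1": address1[:60],        # bet: 60 characters is enough (field size)
        "ADDRESS2": None,                 # bet: we never need a second line
        "CITY": city,
        "STATE": src.get("PREFECTURE"),   # bet: a prefecture means what a state means
        "POSTAL_CODE": None,              # no obvious source field at all
    }
```

When any of those bets turns out to be wrong (when a CITY_WARD and a CITY turn out to mean different things to the target, say), the map keeps running, but the asset stops meaning what the target expects.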
Equivalence failures
So we’ve got two obvious types of failure. One type of failure occurs when we need to join two datasets together. Another occurs when we need to map fields between datasets. In some ways this is just the same problem: Aren’t “keys” really just fields by any other name? This is certainly true. In both cases we’ve got assumptions about the equivalence between fields, so both Type 1 and 2 failures are Equivalence failures.
But there’s a practical and important difference between the two kinds of failures. Natural keys are used to define the boundaries of records; a logician would call them identity conditions. Fields are the boundaries of the attributes or properties of a record; a logician would call them variables. When we move data sometimes we try to figure out whether we can insert the records from a source into a target. We don’t want to unknowingly mix records that have different grains, for example, into our fact tables, customer tables or product tables, because this would be a Type 2 failure. But we also have to figure out which fields from a source belong where in a target. We don’t want to get addresses wrong, for example, or mix up QTY and PRICE, because this is a Type 1 failure.
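One way to picture the practical difference is as two different checks, sketched here with hypothetical names: the Type 2 question is asked about whole records (do both sides carve the world up into the same things?), while the Type 1 question is asked field by field (do these two columns mean the same thing?).

```python
# Illustrative only: two different questions we ask before moving data.

# The Type 2 question: do the record boundaries (the grain, the identity
# conditions) match? Here one side is per order line, the other per order.
source_grain = ("order_id", "line_number")   # one record per order line
target_grain = ("order_id",)                 # one record per order
grain_matches = set(source_grain) == set(target_grain)   # False: Type 2 risk

# The Type 1 question: for each target field, which source field means the
# same thing? Each entry is a bet about meaning, not just a rename.
field_map = {
    "QTY": "quantity_ordered",
    "PRICE": "unit_price",   # confuse this with extended price and you have a Type 1 failure
}
```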
Now I’m not sure what other kinds of problems we might have. Code pages or character encodings in the source that differ from the target? That’s a field-mapping problem, so it’s a Type 1 failure. Left outer joins got you down? Record-boundary problems, so a Type 2 failure. Again, I’m open to the possibility that putting two sources together requires solving a third problem besides mapping record definitions or mapping fields. I just don’t know what that problem would look like.
Type 3 failures
In the first two failure types, we’re dealing with explicit schemas. That is, I’m supposing you’ve got a complete data dictionary, or at least field names and access to the records. What happens if you don’t have human-readable field names you can use as a guide for mapping source to target, or you’ve got an otherwise indeterminate schema? This is a Type 3 failure.
In those cases, which are common, we make our best guess as to what the fields mean. You look inside the file and figure that a set of fields indicate gene locations, for example, using a particular gene repository’s conventions. Or you note that your ERP system has values in the ATTRIBUTE21 field in the ORDER_ITEM table, and because you see a bunch of records starting with “1Z” you figure they’re waybills. In both cases you place a bet on your mapping from an indeterminate field name (or concept) in the source to a determinate target.
This is different from a Type 1 failure, though. In a Type 1 failure we know someone who knows what’s supposed to be in the field – someone has written an access method that moves a string from the PREFECTURE field on a form into the PREFECTURE field in the database, and we can call that person up and say “hey, what’s the equivalent of a Japanese PREFECTURE in California-ese?” In a Type 3 failure we don’t know who that person is, and the most reliable method left to us is to guess. By “reliable” I’m not talking about an unwillingness to talk to DBAs or developers: schemas subject to Type 3 failure are indeterminate because the original documentation was unclear or incomplete.
The dataset may be poly-structured as well, in the sense that Curt Monash uses the term. Some poly-structured cases are like our ATTRIBUTE21 example, in that the meaning of the field may be contextual but otherwise unavailable to us. That is, ATTRIBUTE21 is a waybill for certain order types, a SKU assembly for another set of order types, an order status scheme in a third, and a notes field for a fourth: there are multiple undocumented content schemas in use in the field. (This is not exactly what Monash intends by poly-structured, but I think it’s semantically equivalent.) If you’ve only ever seen the first and fourth order types in your extracts and construct your asset accordingly – some strings start with “1Z,” some read like a short story, and you’ve guessed at what each means – then the first time the other order types show up, your asset has suddenly got a Type 3 failure. You guessed at the two content schemas you had access to, but there are two or more others stored in that field as well.
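A hypothetical sketch of what that guess looks like in code, using the ATTRIBUTE21 example. The patterns and categories are assumptions about the values we happen to have seen, which is exactly what makes this a Type 3 bet.

```python
# Hypothetical sketch of a Type 3 bet: inferring the content schema of an
# undocumented field from the values we've happened to see so far.

def classify_attribute21(value: str) -> str:
    if value.startswith("1Z"):
        return "waybill"      # guess: looks like a carrier tracking number
    if len(value) > 200:
        return "notes"        # guess: free-text notes
    # Anything else belongs to a content schema we never saw and didn't
    # model for: an asset built on the two guesses above now has a
    # Type 3 failure on its hands.
    return "unknown"
```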
There are vast amounts of Type 3 failure waiting for the unsuspecting on file servers – indeed, it might be said that file servers were built for Type 3 failures.
Type 4 failures
What’s the opposite of a schema or model that’s indeterminate? A schema that’s precisely, perfectly determined. There’s a place for everything and everything has a place, within the initial assumptions. If you’re outside the domain prescribed by those assumptions, you’re out of luck.
This shrink-wrapped perfection seems ideal until you’ve got data in a source that needs to be in the target, but there’s no home for it in the target. It’s usually the case that the domain for a given data model is too narrow. Perhaps the modelers assumed there would never be a need to track ad-hoc collections of customers as “households,” for example, or that there were only ever going to be two kinds of marketing channel. If our data model is in 5th normal form then we can usually categorize communication channels that weren’t in the original plan. But if our data model represents “marketing channel” as {ISA_EMAIL, ISA_PRINT} then we’ve got a problem.
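A minimal sketch of that marketing-channel case, with hypothetical names: the target enumerates its world up front as two flags, and a source value from outside that world has nowhere to land.

```python
# Hypothetical sketch of a Type 4 failure: the target model's domain is
# narrower than the source's, so some source data has no home.

# Target model: "marketing channel" is hard-coded as two flags.
target_row = {"campaign_id": 42, "ISA_EMAIL": False, "ISA_PRINT": False}

# Source model: channel is an open-ended code, and a new one has appeared.
source_row = {"campaign_id": 42, "channel": "SMS"}

def load_channel(source: dict, target: dict) -> dict:
    if source["channel"] == "EMAIL":
        target["ISA_EMAIL"] = True
    elif source["channel"] == "PRINT":
        target["ISA_PRINT"] = True
    else:
        # There's nowhere to put "SMS": the target schema has no concept for it.
        raise ValueError(f"Type 4 failure: no home for channel {source['channel']!r}")
    return target
```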
The former or “household” case is the same problem, just at a different scale. Many data warehouses are reverse-engineered from their source transaction systems. This is an approach I’ve used with lots of success many times; aside from the abstraction needed to turn a transaction into a legitimate fact, you just rearrange the source data model into a star schema. It’s fun and easy and fast, and you look like a magician. What inevitably happens to that model, unless you’re unlucky enough to be in an organization that never changes, is that some additional transactions come along that don’t fit into the fact structure. For example, you place a key for customer in the fact table, but you discover that some facts really belong to households and not customers alone, and you’ve got nowhere to put the household concept on the fact table. Clearly some remodeling of the fact and customer and household dimensions is in order.
This is a Type 4 failure. I also call this debt an “epistemic bubble.” This failure occurs because our target model has no room for the data in our sources.
Coverage failures
Type 3 and 4 failures are examples of Coverage failures. In Type 3 failures we don’t know what’s included in the source data model, so our construction of a subset is indeterminate. With Type 4 failures we can’t map our desired subset into our target, because the target doesn’t contain concepts we can map from our subset. The target model is not as rich as the asset. I call these Coverage failures because our model doesn’t cover all of our sources or target.
Are there other ways the model may fail to cover the source and target? I’m not sure how that would work. That’s why I think there are only two kinds of coverage failure.
Conclusion
Looked at in a certain way, Type 3 and 4 failures happen organically when Type 1 and 2 failures can’t be resolved. You get a Type 3 failure when you can’t be sure of the fields in your source but you are sure of your target, and Type 4 failures happen when you’re not sure which field in the target to map a source field into. That’s a useful perspective. But the opposite is also true: Type 1 and 2 failures are the inevitable result of underdetermined schemas at the source or the target. We get incompatible natural keys because people make different assumptions about identity conditions, and we get incompatible schemas because of homonymy, identical field names that mean different things.
If it helps heuristically to see one type of failure originating in a widespread failure to resolve another type, then you’re welcome to that view. My point is to demonstrate that these four bets are all we make. We bet that two known keys can be joined, or that two known fields mean the same thing. Or we guess at the meaning of an unknown source field, or discover we’ve got no place in the target to put a source field.
But all we’re hoping for here is heuristic observations, the kind of thing that can help a consultant diagnose problems with data movement and assess the cost and risk of a fix. I’m open to a more metaphysical or methodological discussion, but my aim at the top was a means to assign a dollar cost to common problems.
In conclusion then, when a semantic asset has a problem, it’s because one (or more) of the following things has happened:
Type 1: The fields we mapped together don’t mean the same thing.
Type 2: The keys used to join two sources don’t mean the same thing.
Type 3: Our guess as to the meaning of a source was wrong.
Type 4: We can’t find a place to put something expected in a target.
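If the point of the taxonomy is to count failures and put a dollar figure on them, the bookkeeping can start out very simple. Here’s a hypothetical sketch: tag each observed problem with one of the four types and multiply by what it typically costs you to fix (the costs below are made up purely for illustration).

```python
from collections import Counter
from enum import Enum

class SemanticDebt(Enum):
    TYPE_1 = "incompatible schemas (field mapping)"
    TYPE_2 = "incompatible natural keys (join)"
    TYPE_3 = "indeterminate source schema (guessed meaning)"
    TYPE_4 = "no home in the target (epistemic bubble)"

# Hypothetical log of failures observed in one pipeline over a quarter.
observed = [SemanticDebt.TYPE_2, SemanticDebt.TYPE_2, SemanticDebt.TYPE_3]

# Made-up average remediation cost per failure type, in dollars.
cost_to_fix = {
    SemanticDebt.TYPE_1: 2_000,
    SemanticDebt.TYPE_2: 8_000,
    SemanticDebt.TYPE_3: 5_000,
    SemanticDebt.TYPE_4: 15_000,
}

counts = Counter(observed)
estimated_debt = sum(cost_to_fix[t] * n for t, n in counts.items())
print(estimated_debt)   # 21000 with these made-up numbers
```

That’s a toy, obviously, but it’s the shape of the argument: countable failures, a cost per failure, and a dollar figure a client can react to.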