Modeling Information: Part 1 - The programmer’s trick
My latest project at work has lead me down an unexpected road of wrapping my head around modeling information from a very high level. It’s made me realize a lot of my own past mistakes of overlooking the obvious. As a self-taught programmer, I’ve picked up some very fundamental bad habits. Since I’m a programmer, I’m going to start there with a simple concrete problem. Through the next parts in this series, we’ll hopefully work our way toward an ultimate generic model for information.
A simple problem
Lets start with something ridiculously simple. It’s probably many developers’ first project: a Web 2.0 contacts database application. Here are some requirements:
- A contact consists of just their name
- Contacts need to be grouped
Here’s what the database probably looks like:

I build it and it’s wonderful and straightforward. But requirements change and I’m pissed. I find out that there are three different ranks of contacts in the group and this needs to be recorded. The different ranks are:
- Junior
- Senior
- Director
Well of course we refactor the database to account for this:

At first this is dandy. But then the creepy fear of requirements changing again comes upon me. What if a new kind of contact comes along? Why should I go through all this extra work? Anyway this database schema isn’t even accurate because technically a contact could be a junior, senior and director at once! This looks wrong anyway. I must protect myself!

ContactKind can now have Junior, Senior and Director. I even give users the ability to add new ContactKinds. Much better. Right?
Wrong: 3 Years Later
So three years later, the company wants to grab some statistics from this wonderful application. They want to find out the total number of directors. They’ll run this report a monthly basis. I say fine, since this is pretty simple:
SELECT COUNT(cga.ContactID) FROM ContactGroupAssociation cga INNER JOIN ContactKind ck ON cga.ContactKindID=ck.ContactKindID WHERE ck.ContactKind='Director'
I run the query and it returns ‘0′. WTF? I take a look in the ContactKind table. The contact kinds are now:
- New Hire
- Level 1 Junior
- Junior: Level 2
- Senior: Level 3
- Level 4 Director
- Director: Level 5
- General Associate Dir.
- CONTACT REMOVED - PLEASE E-MAIL JAN DECK FOR DETAILS
Damnit! What the hell is this?
What went wrong?
At this point, we could take this problem and solve it in numerous ways:
- A lot of us programmers would just refactor the query using a bunch of OR’s and LIKE’s and hope to hell that those damn users don’t come up with some other crazy naming schemes.
- Other programmer’s will form some sort of interface to their report to put the users in charge of figuring out how to get Directors
- Even other programmer’s would change the data to fit our use and maybe “fix” the user interface to avoid these problems in the future (get rid of the ability to add ContactKinds).
- Some managers would say statistics were not in scope for the application, therefore this request is denied. Please go through the enterprise standard change control process so we can refactor the entire program.
- Other managers would say that user’s were abusing the application and were not following the correct procedures. We should create some new business procedures for naming the kinds of contacts consistently.
- Some hardcore database modelers would say that the database model is an inaccurate portrayal of logical model. Wait, where is your logical model?
The point is that in software development this is a huge problem which everyone has a solution for. I want to go down the road of the data modeler.
In this example, ContactKind is used as a generic enumeration attribute for contacts. Each value in the enumeration needs an explanation in order for us to understand it and be able to perform the requisite data mining. Some of those values aren’t even in the same class, like “CONTACT REMOVED - PLEASE E-MAIL JAN DECK FOR DETAILS.” Even in our very original schema of grouped contacts, we don’t know the context of any of those without an explanation. This is where a data dictionary would come in. It may as well had been:

The moral of this story is had a I kept the database schema that I came up with originally, I would not have this problem. That schema was a better representation of the logical model.
As an experienced programmer, many times I will jump the gun and make a change like this without a
second thought. It’s a powerful trick that allows us to develop solutions that can be customized to user’s needs. By expanding the database model to encompass the actual logical model we can push the logical model all the way down to the physical layer.
In the process we lose some granularity on our data. Suddenly there is a difference between the problem our database model tries to address and the current problem we’re addressing. The database schema is useless without data, and the data is useless without some sort of validation and data dictionary to explain it.
In the next part we’ll look further at the relationship of logical models to physical models.

A very well thought out post and all very true. What’s even better is when you have to work with a third party application that pulls data from another third party application and transforms the data into a different schema. After which, they fail to provide anyone with a clear data dictionary to describe this new translation.
It’s kinda like translating German to someone using Russian when you know that they don’t understand Russian.
Thanks!
Haha, when have you had any experience with that? That’s a perfect example!