Complicated Queries
I frequently find myself in the position of having to write fairly complex SQL queries in order to get a list of data that matches a certain criteria. One problem with complicated SQL queries is that it is very hard to know if the result you have is correct, especially when your query involves multi-table joins and sub-selects. Often, a small error in one part of a complex SQL query can go unnoticed because it doesn't greatly change the count of the results.
Unfortunately, sometimes there is just no way to make a simple query to achieve a certain result. Often, a large amount of the information applied to the data in order to create the query requires outside knowledge that is not found in the relationships directly specified in the database. In situations like these a solid technique for verifying the results of the query is important.
SQL checking SQL
I have started using a technique of approaching a query from two different directions in order to check my results. The idea is that if you can think of more than one way to get the same set of data you are looking for, and each way returns the same set or results, the likelihood of that data being correct is greatly increased.
There was a scene in Apollo 13 where Tom Hanks manually calculated some trajectory and 3 guys in Houston all checked his math to make sure it was correct. When they all came up with the same answer, everyone could be pretty sure the calculation was correct.
That is the idea behind SQL checking SQL. If you can write a second query to get the same data, but from a different angle, there is a pretty good chance that the query is correct. The best way to do this would be to have another person write the second query, knowing nothing about how you are doing the first.
A Practical Example
Sounds good, but how can I actually use it? Let's say that you have a database which stores job applications and resumes from applicants. Let's say that you want to create a list of all the applicants last week.
We could approach this problem from two angles.
- Get a list of all the people who had an application where the application date was last week.
- Get a list of all the people who had a resume created.
We have to use some outside information here. We have to know that in our particular program each applicant for a job can only have one resume submitted and they must have one submitted. Then we know that we should see the same list of people who applied for jobs and had resumes created in that time frame.
Of course this is a really simple example and given all the conditions it would be easy to be pretty sure of either of these queries without running both. The problem is when you have a much larger set of tables you are interacting with and shakier sets of relations between the tables and data.
Tips and Tricks
Once you have created both queries to get the same set of resulting data, how can you easily check to see if they contain the same data? The easiest way is to do a difference in SQL. Depending on what flavor of SQL you are using, the syntax is slightly different. You can look up the set operations for your flavor of SQL. Difference is the opposite of union. Difference allows you to see what records exist in the first query and not the second. I like to use difference with the 1st query first and with the 2nd query first, so that I can see what records are in either query, but not in the other. Any records I find, I closely examine to figure out why they are in one and not the other. Each time I do this, I then makes changes to the query that was wrong until all the results match exactly.
This balancing of the two queries helps to highlight problems that could be missed from either viewpoint of the problem. After balancing out both queries to have exactly the same results, there is a really good chance the results are correct. Certainly the chances of your query being correct are increased vs “just winging it.”