I’ve just published a new Pluralsight course, Introduction to Hibernate.
This course was definitely a difficult one to produce. There is so much to cover in Hibernate, and there was no way I could cover it all.
Here is the course description:
In the world of Java one of the most popular and widely used frameworks is Hibernate.
Hibernate is an ORM, or Object Relational Mapper, that allows developers to map Java objects to relational database tables. It is a valuable tool that all Java developers should know how to use. The problem is, learning Hibernate can be difficult. Hibernate is a very large framework, and there are many important concepts to understand to use it effectively.
Are you a developer who…
- has been wanting to learn how to use Hibernate
- struggles to use the framework
- has never really understood Hibernate
This course is designed to give you the knowledge you need to feel confident about how Hibernate works and how to use it.
What this course offers:
This course is designed to make getting started with Hibernate as easy as possible and focus on the most important things you need to know. John starts off this course by teaching you a bit about Hibernate and how it works.
Then you will see how to get it set up with a real MySQL database installation. After that, you will learn the very basics of mapping, which is one of the most important things to understand about Hibernate.
John will show you how to create a basic mapping for a Java object to a relational table and explain to you how the mapping works. He also covers some of the complex mapping situations like mapping collections and different relational mappings like one-to-many and many-to-one.
Once you’ve learned how to map your objects, you’ll want to know how to query them, so he’ll show you how to do this using Hibernate’s built-in HQL and using a more object-oriented approach with the Criteria API.
Finally, he wraps things up by taking a brief tour of some of the more advanced features like caching and interceptors.
Deploying software doesn’t have to be that complicated!
I’ve seen and built many software building and deployment solutions over my career, and I have come to find that most software deployment can be boiled down to a simple process.
I’m not trying to give you a solution for your software deployment automation, nor am I trying to perfectly model your exact process.
What I am trying to do in this post is to help you simplify your process.
If you can identify the parts of your deployment process that fit into the simple steps I am going to outline below, it should be much easier for you to automate your deployment process.
Even though software build processes, infrastructure and components are unique, I have found that most software deployment processes can be simplified into the following steps.
- Build software without configuration
- Create environment-specific configuration
- Create a set of database changes
- Bundle software, configuration, and database changes
- Apply new software
- Apply new configuration
- Apply new database changes
- Start it back up
You might read through these steps and think “well duh.”
You might be tempted to say “my process is more complicated than that.”
I’m not going to argue with you. You are right, your process is probably more complicated than that. But, does it need to be?
Can you simplify your process to fit into these steps?
Sure, the implementation of these steps is likely to be fairly complex and vary for each type of software, but if you can distill the process into these steps, you can much more easily automate that process.
Where people go wrong
The big key to my simple version of deployment is
Build software without configuration
You MUST do this! Departing from this step causes all kinds of pain and complexity. Please don’t try to build your software and the configuration for an environment at the same time. These things must be pulled apart from the get-go, or you will have the pain of trying to tease them apart later – or you will have to create separate builds for each environment.
It is also critical that the same bits that were built by your build server are what is deployed to each environment!
I will say that this isn’t the easiest problem to solve. You may need to have a separate build process that builds up the configuration for an environment.
Separating the two will also force you down the path of building a process to apply that configuration to an environment.
But, if you are willing to accept that this is just a must and bite through this pain, you’ll come out on the other side clean (even though you had to crawl through tunnels of crap.)
The whole story
Now that I’ve hopefully convinced you to separate your configuration from the building of your software, let’s go over the big picture of a deployment using the simple process outlined above.
It all starts out when you build your software. Perhaps you have a continuous integration build server set up that automatically builds the software on each check-in; perhaps you are manually kicking off a script.
Once you have built your software, you have some bits that you should be able to apply to any environment. Nothing that you built here should be machine or environment specific in any way.
Now, you kick off another process, or perhaps one was kicked off simultaneously by your continuous integration server. This builds up the configuration for the environment you are going to deploy to.
A similar process, which could also run simultaneously, is kicked off to generate the list of database changes that need to be applied to the target environment.
Now that you have your bits, configuration and database changes, you are ready to deploy.
If you are smart, you’ve even built these ahead of time and they are just waiting for when you need them.
Next, gather up the artifacts and move them to the deployment target where you actually apply them.
First, unpack your bits and put the new bits into place. (You may or may not need to take your application fully offline to do this.)
Then apply the new configuration on top of your newly installed bits for that environment.
Finally, apply database changes for that environment.
Now you should be completely deployed and can start up your application.
But how do I do it?
Perhaps you agree with me that the actual process should be what I have outlined and described, but now you are at the point of implementing a solution.
How do you actually automate this stuff?
Good question. If you figure out a simple answer, let me know.
This is the point where you might be writing custom tools and scripts to get all this working. The key is to take it one step at a time.
There are at least two tools out there that I know of that help you do this. I can’t speak for either of these tools, since I haven’t used them myself, but I have heard good things about them.
One other thing to consider is how you are going to get the right stuff to the right server. You will want to think about things like:
- Promoting build products
- Preloading promoted products to servers to make deployment faster
- Getting through firewalls by having the software or some other process PULL the upgrade to your target, rather than you PUSHING it there.
- Rollback, or some kind of mitigation strategy if things go wrong. (My recommendation here is not to get fancy. I have NEVER seen a successful rollback, only a database restore followed by a manual code restore. If you mess up bad, just count on restoring the machine and the database.)
Every once in a while I stumble across some SQL keyword that I didn’t really know about, but is extremely useful.
The other day I came across APPLY, or rather CROSS APPLY.
After reading through documentation on how it works and articles about it, I had a bit of trouble understanding it because I couldn’t really find a simple explanation.
I am going to try to explain it as simply as possible so you can start using it right away.
How CROSS APPLY works
The basic idea behind CROSS APPLY is to allow you to join two sets of data together.
If you understand how INNER JOIN works, you already understand CROSS APPLY.
The only difference is that CROSS APPLY also allows you to join against a set of data that is created from, or depends on, each row in the first set.
With a normal join, you would, for example, join two tables that share a common key.
I could join my customer table to my orders table to see all the orders for a particular customer like so:
SELECT *
FROM orders o
JOIN customers c ON o.customerid = c.customerid
WHERE c.companyname = 'Around the Horn'
Notice how the join is operating on a key that exists independently in each table.
We could rewrite this to be exactly the same using the CROSS APPLY syntax instead like so:
SELECT *
FROM orders o
CROSS APPLY (
    SELECT * FROM customers c
    WHERE o.customerid = c.customerid
) AS c
WHERE c.companyname = 'Around the Horn'
We can prove these result sets are exactly the same by using EXCEPT to make sure there are no rows in one set that aren’t in the other, and then flipping it, like so:
SELECT *
FROM orders o
CROSS APPLY (
    SELECT * FROM customers c
    WHERE o.customerid = c.customerid
) AS c
WHERE c.companyname = 'Around the Horn'

EXCEPT

SELECT *
FROM orders o
JOIN customers c ON o.customerid = c.customerid
WHERE c.companyname = 'Around the Horn'
Just run this exact query again swapping the SQL above the EXCEPT with the SQL below and make sure it has no results as well. If both of those queries have no results, then you know the results from each query are the same since EXCEPT will show any results that are in the top query but not in the bottom one.
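Written out in full, the flipped check puts the JOIN version on top:

```sql
SELECT *
FROM orders o
JOIN customers c ON o.customerid = c.customerid
WHERE c.companyname = 'Around the Horn'

EXCEPT

SELECT *
FROM orders o
CROSS APPLY (
    SELECT * FROM customers c
    WHERE o.customerid = c.customerid
) AS c
WHERE c.companyname = 'Around the Horn'
```

If this also returns no rows, neither query contains a row the other does not, so the two result sets are identical.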
So when is CROSS APPLY useful?
Remember how I said it can do more than a simple join? Joins are restricted to joining two sets of data that could be queried independently of each other.
What if I said give me the three most recent orders for each customer?
Take a minute and think about how you would write that query. Go ahead and try to do it. I’ll wait.
There are a few ways to do it without using CROSS APPLY, but none of them are really very easy or perform very well.
Using CROSS APPLY it is simple though:
SELECT *
FROM customers c
CROSS APPLY (
    SELECT TOP 3 o.orderdate, o.shipcity
    FROM orders o
    WHERE o.customerid = c.customerid
    ORDER BY o.orderdate DESC
) AS top3;
So CROSS APPLY is useful whenever you have some data that you would want to be able to join against, but are forced to do some kind of sub-query instead because the data you are trying to join is not going to map well against a single key.
The other instance in which CROSS APPLY will be useful is when you are doing a sub-select that has more than one value you would like to use in your final query.
For example if you were sub-selecting from an Order Details table to match up order ids that had a Quantity greater than 5, that sub-select would need to return exactly one column in order for you to use it in your where clause. If you wanted to use other columns from the sub-select, you would have to do another sub-select for each of these columns.
If you first try to rewrite the sub-select as a JOIN and find that you can’t, you may be able to write it as a CROSS APPLY.
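As a sketch of the Order Details scenario described above (Northwind-style table and column names assumed), CROSS APPLY lets you pull back several columns from the sub-select at once, rather than one sub-select per column:

```sql
-- Each order joined to its detail rows with quantity greater than 5,
-- bringing back multiple columns from the sub-select in one shot
SELECT o.orderid, o.orderdate, d.productid, d.quantity
FROM orders o
CROSS APPLY (
    SELECT od.productid, od.quantity
    FROM [Order Details] od
    WHERE od.orderid = o.orderid
      AND od.quantity > 5
) AS d;
```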
How to know when to use CROSS APPLY
There isn’t a good solid rule you can use to identify when you should use a CROSS APPLY but having the knowledge of CROSS APPLY and how it works can help you when you are trying to tune queries and you are having a difficult time constructing one. It is another option you can try.
Here are some general guidelines of times when you might want to use CROSS APPLY:
- A query where the result set you want to JOIN against is in some way related to the data in the first set. (Example: one column in the first table tells you how many rows in the 2nd table to get)
- A query where you are doing a sub-query, but need more than one value from the sub-query
- Anywhere you are using a Common Table Expression (CTE) could possibly be rewritten as a CROSS APPLY
- A query that has a large set of data it is joining against and then filtering out. You can change it to a CROSS APPLY that does the filter in the CROSS APPLY statement.
- Any time you are trying to join against a table function (this is actually what CROSS APPLY was created for.)
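Since joining against a table function is what CROSS APPLY was created for, here is a sketch using a hypothetical inline table-valued function (again with Northwind-style names):

```sql
-- Hypothetical function returning a customer's N most recent orders
CREATE FUNCTION dbo.GetRecentOrders (@customerid NCHAR(5), @count INT)
RETURNS TABLE
AS
RETURN
(
    SELECT TOP (@count) o.orderid, o.orderdate, o.shipcity
    FROM orders o
    WHERE o.customerid = @customerid
    ORDER BY o.orderdate DESC
);
GO

-- The function is evaluated once per customer row
SELECT c.companyname, recent.orderid, recent.orderdate
FROM customers c
CROSS APPLY dbo.GetRecentOrders(c.customerid, 3) AS recent;
```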
As always, you can subscribe to this RSS feed to follow my posts on Making the Complex Simple. Feel free to check out ElegantCode.com where I post about the topic of writing elegant code about once a week. Also, you can follow me on twitter here.
Last time I wrote some code to allow paging of stored procedure results in SQL Server it sucked.
That was about 3-4 years ago.
I just wrote some code to do it again, and well… It still sucks.
Perhaps I’m doing it wrong, but if you are looking for, “how to implement paging in sql server”, or “how to page sql server stored procedure results”, or “sql server paging”, I’ll give you my best answer for it at this point.
How I roll with CTEs
CTEs, or Common Table Expressions, are a pretty nice feature of SQL Server. They are a bit complex, but basically they let you create a query and treat it as a table to select from. They also allow recursive calls to themselves, which lets you do cool things like query hierarchies of data.
This link offers a pretty good explanation of CTEs.
Anyway, CTEs come in handy combined with the ROW_NUMBER() function to add the ability to page data from an existing stored procedure.
The basic idea here is to modify the original query by wrapping it inside of a CTE that adds a RowNumber column.
Then we can select from that query the rows that are in the range we want.
Vegetables are good for you
Let’s look at an example:
SELECT name, goodnessfactor, color
FROM vegetables
WHERE color = 'purple' OR color = 'green'
Now let’s modify this query to be pageable, given the first and last record we want. (With zero-based page numbers: first record = page number * rows per page, last record = (page number + 1) * rows per page.)
WITH pagedVegetables AS (
    SELECT name, goodnessfactor, color,
           ROW_NUMBER() OVER (ORDER BY name) AS rownumber
    FROM vegetables
    WHERE color = 'purple' OR color = 'green'
)
SELECT *
FROM pagedVegetables
WHERE rownumber > @firstRecord
  AND rownumber <= @lastRecord
Okay, so it is actually not that bad.
A couple of notes to help understand what happened here.
- ROW_NUMBER() requires an OVER clause which basically will determine how you calculate what the row number is. In this case, we are going to sort by name. So, the first alphabetical name will be row number 1, and so on.
- We’ve wrapped the query in a CTE and added an additional column so that we have a row number in our new virtual table. (Think of the CTE as creating a temporary view for us.)
- We are selecting everything out of the original query that is in the range of rows we want. (Depending on how you define your range, you may use different comparison operators. If your range is inclusive on both ends, you would use >= and <=; if it is exclusive, > and <. In this case we are using one of each.)
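Putting it all together, a sketch of what the pageable stored procedure might look like (names are hypothetical, SQL Server 2008+ syntax for the inline DECLAREs):

```sql
CREATE PROCEDURE GetVegetablesPaged
    @pageNumber  INT,   -- zero-based page number
    @rowsPerPage INT
AS
BEGIN
    DECLARE @firstRecord INT = @pageNumber * @rowsPerPage;
    DECLARE @lastRecord  INT = (@pageNumber + 1) * @rowsPerPage;

    WITH pagedVegetables AS (
        SELECT name, goodnessfactor, color,
               ROW_NUMBER() OVER (ORDER BY name) AS rownumber
        FROM vegetables
        WHERE color = 'purple' OR color = 'green'
    )
    SELECT *
    FROM pagedVegetables
    WHERE rownumber > @firstRecord
      AND rownumber <= @lastRecord;
END
```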
A word on total row count
If you need the total number of rows so that you can pass that back to the pager in your code, then you will probably have to select the results from the CTE into a temporary table and return two result sets from it.
I couldn’t really find a way to select the total rows and the data from a single CTE expression. If you know a way, please let me know.
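One option worth trying (I believe this works on SQL Server 2005 and later) is the COUNT(*) OVER () window aggregate, which tacks the total row count onto every row of the result and avoids the temporary table:

```sql
WITH pagedVegetables AS (
    SELECT name, goodnessfactor, color,
           ROW_NUMBER() OVER (ORDER BY name) AS rownumber,
           COUNT(*) OVER () AS totalrows  -- total rows before paging
    FROM vegetables
    WHERE color = 'purple' OR color = 'green'
)
SELECT *
FROM pagedVegetables
WHERE rownumber > @firstRecord
  AND rownumber <= @lastRecord
```

The trade-off is that totalrows is repeated on every row rather than returned as a separate result set.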
I have been working on getting set up this week to develop for my new, awesome employer, TrackAbout.
In doing so, I have once again felt the pain of getting a development environment configured. I forgot how painful it can be. This is in no way a reflection of TrackAbout; the truth is most development environments are a pain to get set up. Unless you’re actively trying to build a painless development environment, it is probably going to be the opposite.
I’ve seen a large number of development environments and I’ve built my share of them. From all this, I have a pretty good idea of what I consider ideal, what we should strive for.
The basic outline
- Install non-scriptable tools or start with a fresh image (basically getting the IDE and SQL Server installed locally)
- Get branch from version control
- Build database
- Build code
- Local deploy
The idea here is that I should be able to either get an image that has my base tools installed, or install them myself, then pull down one source control location and everything else that happens from there is the result of build scripts or some other automated process.
I know, it is easier said than done. Let’s break it down step by step and look at some of the possible solutions.
Install non-scriptable tools or start with a fresh image
If you are in an organization where everyone will have the same hardware, it is much easier to create an image of a developer machine with, say, Visual Studio and SQL Server installed.
Another possible solution is to create a dev VM that is maintained and updated regularly, so that it has all the required tools and you have a uniform structure. I have tried this approach, and I find that the biggest problem is that many times you want to run native to get the performance improvements. As hardware capabilities increase though, I am seeing this as a more viable route.
Finally, if you can’t get either of those situations, it is ideal to put all the tools that must be installed on a network share or some other easily accessible place.
Ideally, you want to keep the number of required tools down to an absolute minimum. In most .NET environments this should be Visual Studio and SQL Server. The other kinds of tools can be handled via dlls (usually).
Get branch from version control
Ideally, you should be able to point a person to one source control location, and that should get everything necessary for them to build and deploy the entire system locally.
If different applications your organization is developing have different branches, then you might need to check out one location per project, but even that can be automated to some degree with a “get latest” script or symbolic links.
Build the database
This one is kind of hard. It requires quite a bit of forethought on how to get this working. The idea here is that I should be able to build the entire database from a set of scripts.
The challenge is putting together a process that can construct the database from scratch, populate the tables the application requires, and also apply patches to existing databases. I won’t go into how to do that here.
Build the code
There is quite a bit lumped in here. From a developer perspective, I should be able to run one build command, the same build that will be run on the continuous integration server, and everything I need should get built for me.
From behind the scenes, this is a difficult step.
- You have to make sure everything works from relative paths or environment variables.
- You have to have your scripts check to see if things are installed and install them if not (registry keys, etc).
- You have to have all the libraries in a place that the build can find on the client machine.
The key to success here is to eliminate configuration differences as much as possible, and to locate the ones that remain in a single place.
Local deploy
It should be very easy to do a local deployment of the application. For .NET developers this usually isn’t a challenge, but in the Java world it can take some thinking on how to do this properly.
At any time, someone should be able to deploy locally to their machine. Ideally, anyone should also be able to take a build from the build server and deploy it with a single command.
It is all about the mindset
Basically, you have to think about a zero-configuration development environment from the beginning if you really want to be successful at it. It is much harder to bolt it on later.
You do have to weigh the effort involved carefully though. Most developers only set up their configuration once or twice. If you are going to have a growing team where you are constantly adding new developers, you should probably put considerable effort into getting as close to zero configuration as possible. On the other hand, if you have a small team and don’t have new developers very often, it might not be worth the extra effort. You have to find the balance.
In all honesty, my experience at my new job has been pretty good in contrast to some of the development environment setups I have seen. There is a huge amount of consistency in configuration locations, which is good.
I’m looking forward to figuring out how to make it easier for the next guy though, once I understand everything better myself.
In my Kanbanand guide I have a rule for infrastructure that says, “Production data never goes to any other environment.”
Someone recently asked what exactly this means.
It means that the database data that is in production never goes into dev, integration, qual, or whatever other environments you may have set up at your workplace. I think more important than the “what” is the “why”, and while I’m at it, I’ll tell you why I won’t give you, as a developer, access to production data at all.
Why production data stays in production
Let’s start with a simple reason: security. Production data in most cases is going to contain at least some amount of sensitive information. If your company leaks sensitive information about customers or clients you can almost guarantee that you will have a lawsuit on your hands.
It is much better to keep production data as safe as possible. The best way to do that is to limit access to it as much as possible. If you move production data to different environments, you greatly increase the number of places that data lives and, correspondingly, the number of people who have access to it.
Aside from the security concerns, there is the problem of relying on exact data to recreate a problem. Debugging an issue seen in production should not require the actual production data. If it does, there is a problem with the tooling you have created to support your application. You might consider creating some development tools that allow you to simulate any production issue in a different environment without having to use real production data.
I know it is much easier to just use production data, but it is a bad habit that masks other problems with your tooling and support. It also adds a large cost to the infrastructure and support of an application if you require frequent data migration between environments.
Why developers stay out of production
The reasoning is very similar. We could talk again about security. We could talk again about the problems with relying on production data to debug a problem in your application.
This really comes down to a discipline and constraint issue. By enforcing this constraint and practicing this discipline, you gain quite a few things out of necessity:
- Logging must be improved to understand what happened in production
- You are forced to consider creating a user click tracking mechanism
- Test data generation is forced to be considered
- Testing becomes more complete because your team is forced to recreate issues independently rather than using production data
It might not be immediately apparent to you how these things are connected, but if you enforce this kind of a policy, you will quickly find that you will need better ways of understanding what the user did.
It may seem counter-productive to make a developer’s job harder by preventing them from accessing production, but in the long run, the tooling, better logging, and more complete testing you are forced to create will save time.
Some tips on making this reality
Yes, I know, easier said than done. Let me help you get it done with these tips:
- Build enough logging into the application to make it easy to understand what the system is doing and the flow through the system.
- Build a detailed level of logging that can be turned on for different parts of a live running production system.
- Build or use tools that take data from your web server logs and application logs and translate that into a digest of how the user clicked through the system. (Doing this well will allow you to recreate most scenarios that may have happened in production.)
- Create migration tools that allow you to migrate and cleanse a particular piece of data. The goal here is not to take the whole database and dump it to another environment, but to take specific sets of data, cleanse them (get rid of all personally identifiable information), and put them somewhere else to examine and debug.
- Have ways to generate volumes of clean test data to put into non-production environments. This may look like a test script that runs through the application and clicks through many scenarios to generate test data, or a bunch of SQL scripts that populate test data tables.
DRY = Don’t Repeat Yourself.
For the most part this is a good principle to follow. But, there is one particular instance where I feel repeating myself makes things more clear.
It is very important for us to remember that principles are important, but ultimately they are very strong guidelines, not absolutes. (It is hard to imagine that I just said that, since I am usually arguing on the merit of principles, but I digress.)
In particular, when creating database tables and naming database columns, most people will create a table structure that looks like:
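In SQL terms, the schema looks something like this (the non-key columns are illustrative):

```sql
CREATE TABLE Author (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE Publisher (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE Book (
    id           INT PRIMARY KEY,
    title        VARCHAR(200),
    author_id    INT REFERENCES Author(id),
    publisher_id INT REFERENCES Publisher(id)
);
```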
Looks perfectly normal and good. But, I don’t like it. Imagine you have to join these tables together to create a query. (A pretty common occurrence.)
SELECT *
FROM Book
JOIN Author ON Book.author_id = Author.id
JOIN Publisher ON Book.id = Publisher.publisher_id
Did you catch my mistake above? Probably not. Most people would overlook it. With the ER diagram above, and not very many columns on the tables, it is a little bit easier to see.
It is very easy to mix up which table is the parent and which is the child. It is very easy to put the .id on the wrong side of the join clause. The correct SQL should have been:
SELECT *
FROM Book
JOIN Author ON Book.author_id = Author.id
JOIN Publisher ON Book.publisher_id = Publisher.id
Bucking the trend
Now take a look at this ER diagram:
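In SQL terms, the renamed schema would look something like this (non-key columns illustrative):

```sql
CREATE TABLE Author (
    author_id INT PRIMARY KEY,
    name      VARCHAR(100)
);

CREATE TABLE Publisher (
    publisher_id INT PRIMARY KEY,
    name         VARCHAR(100)
);

CREATE TABLE Book (
    book_id      INT PRIMARY KEY,
    title        VARCHAR(200),
    author_id    INT REFERENCES Author(author_id),
    publisher_id INT REFERENCES Publisher(publisher_id)
);
```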
The corresponding SQL for selecting from this database is:
SELECT *
FROM Book
JOIN Author ON Book.author_id = Author.author_id
JOIN Publisher ON Book.publisher_id = Publisher.publisher_id
It is a small change, but it makes a huge difference. Writing SQL to join the tables is much easier and much more difficult to get wrong. Yes, I am repeating myself here, but I am doing it for a very good reason. I am making the SQL much more clear and I am making it much easier to talk about these tables.
Think about the last time you had a conversation about database tables. Did you say id? Did you then have to qualify which id you were talking about? If you never have “id”, you eliminate a large amount of ambiguity.
The major argument against what I am suggesting is that you don’t need to name the id for the book, book_id because it is a column on book, so you know that id refers to the id of the book. I do not disagree with this. For the most part that principle is sound. The question though is “which one is easier to read, write and talk about?”
I frequently find myself in the position of having to write fairly complex SQL queries in order to get a list of data that matches a certain criteria. One problem with complicated SQL queries is that it is very hard to know if the result you have is correct, especially when your query involves multi-table joins and sub-selects. Often, a small error in one part of a complex SQL query can go unnoticed because it doesn’t greatly change the count of the results.
Unfortunately, sometimes there is just no way to write a simple query to achieve a certain result. Often, building the query requires outside knowledge that is not captured in the relationships directly specified in the database. In situations like these, a solid technique for verifying the results of the query is important.
SQL checking SQL
I have started using a technique of approaching a query from two different directions in order to check my results. The idea is that if you can think of more than one way to get the same set of data you are looking for, and each way returns the same set of results, the likelihood of that data being correct is greatly increased.
There was a scene in Apollo 13 where Tom Hanks manually calculated some trajectory and 3 guys in Houston all checked his math to make sure it was correct. When they all came up with the same answer, everyone could be pretty sure the calculation was correct.
That is the idea behind SQL checking SQL. If you can write a second query to get the same data, but from a different angle, there is a pretty good chance that the query is correct. The best way to do this would be to have another person write the second query, knowing nothing about how you are doing the first.
A Practical Example
Sounds good, but how can I actually use it? Let’s say that you have a database which stores job applications and resumes from applicants, and that you want to create a list of all the applicants from last week.
We could approach this problem from two angles.
- Get a list of all the people who had an application where the application date was last week.
- Get a list of all the people who had a resume created in that same week.
We have to use some outside information here. We have to know that in our particular program each applicant for a job can only have one resume submitted and they must have one submitted. Then we know that we should see the same list of people who applied for jobs and had resumes created in that time frame.
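With a hypothetical schema (people, applications, and resumes tables), the two angles and the difference check might look like:

```sql
-- Angle 1: people with an application dated last week
SELECT p.personid, p.name
FROM people p
JOIN applications a ON a.personid = p.personid
WHERE a.applicationdate >= @weekStart
  AND a.applicationdate <  @weekEnd

EXCEPT

-- Angle 2: people with a resume created last week
SELECT p.personid, p.name
FROM people p
JOIN resumes r ON r.personid = p.personid
WHERE r.createddate >= @weekStart
  AND r.createddate <  @weekEnd;

-- Then run it again with the two halves swapped;
-- both runs should return no rows.
```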
Of course this is a really simple example and given all the conditions it would be easy to be pretty sure of either of these queries without running both. The problem is when you have a much larger set of tables you are interacting with and shakier sets of relations between the tables and data.
Tips and Tricks
Once you have created both queries to get the same set of resulting data, how can you easily check that they contain the same data? The easiest way is to do a difference in SQL. Depending on what flavor of SQL you are using, the syntax is slightly different (in SQL Server it is EXCEPT; in Oracle it is MINUS), so look up the set operations for your flavor. Difference is the opposite of union: it shows you the records that exist in the first query but not in the second.
I like to run the difference once with the first query on top, and again with the second query on top, so that I can see the records that are in either query but not in the other. Any records I find, I closely examine to figure out why they are in one and not the other. Each time I do this, I make changes to the query that was wrong until all the results match exactly.
This balancing of the two queries helps to highlight problems that could be missed from either viewpoint of the problem. After balancing out both queries to have exactly the same results, there is a really good chance the results are correct. Certainly the chances of your query being correct are increased vs “just winging it.”
I recently ran across a very interesting problem involving database design. What is the best way to store database metadata, or, metadata about data in a database?
I have changed the table names to simplify the problem, so let me describe it as such:
I have 2 tables, fruit and vegetable. Each stores whether or not a fruit or vegetable is tasty. We can ignore the other columns, so it would be like:
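Something like this (the columns are illustrative):

```sql
CREATE TABLE Fruit (
    FruitId INT PRIMARY KEY,
    Name    VARCHAR(50),
    IsTasty BIT
    -- ...other columns we are ignoring
);

CREATE TABLE Vegetable (
    VegetableId INT PRIMARY KEY,
    Name        VARCHAR(50),
    IsTasty     BIT
    -- ...other columns we are ignoring
);
```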
Now let’s say that someone keeps changing the IsTasty setting through the UI of my application, and management is demanding the ability to see when someone last changed it and who is responsible for the change. The tricky part here is that although we are ignoring the other data on the table, there is other data, and we don’t want to track when any data on the table was changed, just this one column.
Solution 1: Track the data in the table by adding columns to track updated on and updated by.
The drawback here is we are adding 2 extra columns per table and anywhere else we wanted to do this would require the same. Imagine if we had more columns in this same table we wanted to track changes on. Arghh…
Solution 2: Track the data in a separate table for the fruit and vegetable tables for just that column.
This seems a little cleaner and more extensible. Now we have created two extra tables and we haven’t added a bunch of extra columns to our existing tables, but if we wanted to track more columns, we would have to add more tables that are copies of these. These tables are also exactly the same as each other except for the foreign key.
Solution 3: Create a new table that has 2 optional foreign keys
Here we are removing the duplication of the tables that are almost exactly alike. It will scale, but we have optional foreign keys and it is very hard to do joins. We also lose knowing what data this table belongs to and automatic referential integrity provided by the database. (Which is the ability for the database to know what tables are linked together and make sure the referential data is correct.) We could create a custom constraint, but each time we need a new column we would have to add a new key, and for a different type of column we would need another table.
Solution 4: Create a meta table to store information about columns in other tables
To me this seems like the best approach. When a column was changed really has nothing to do with a fruit or vegetable. It is data about data or meta-data. By creating a table to store meta-data about any table it is very scalable. We don’t have a foreign key, but we can either create triggers, or use our DAO framework to automatically make inserts to the audit table.
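A sketch of what that meta-data table might look like (names are hypothetical):

```sql
CREATE TABLE ColumnAudit (
    ColumnAuditId INT IDENTITY PRIMARY KEY,
    TableName     VARCHAR(128),  -- e.g. 'Fruit'
    ColumnName    VARCHAR(128),  -- e.g. 'IsTasty'
    RowKey        VARCHAR(128),  -- primary key value of the changed row
    UpdatedOn     DATETIME,
    UpdatedBy     VARCHAR(128)
);
```

A trigger or the DAO layer inserts one row here whenever the tracked column changes, and the same table serves any column on any table.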
Other options? Pros and cons? Thoughts?