If you’re a programmer, chances are you’ve used Stack Overflow at some point in your academic or professional career to help you look for answers to all of your programming questions. But did you know that Stack Overflow also likes to ask their own questions of their users?
Stack Overflow conducts a yearly survey that asks silly questions, such as whether you use tabs or spaces, but it also goes further to explore gender demographics in the software world and the popularity of programming languages.
Stack Overflow makes the results of these yearly surveys publicly available; anyone can use the data for their own purposes, be it to back up a hypothesis or to make a visualization.
I decided to do something with the data for a project, and the first issue I noticed was the file format. The data are presented in CSV format. Although CSV is one of the most common formats, it and Microsoft Excel spreadsheets alike offer almost zero flexibility to programmers: it’s tough to work with CSV files if you want to transform them into something readable.
That’s why I decided to take the files and load them into a database. Using SQL, the standard query language of relational database systems, the survey data can be queried in a simple manner, allowing us to apply filters to the data set, make transformations on the fly, and remove the need to pass files around.
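For example, once the data is sitting in a table, a question like “how many respondents per country use Python?” becomes a few lines of SQL. The table and column names below are hypothetical, just to illustrate the idea:

```sql
-- Hypothetical example: number of respondents per country who report using Python
SELECT country, COUNT(*) AS python_users
FROM survey_2017
WHERE languages_worked_with LIKE '%Python%'
GROUP BY country
ORDER BY python_users DESC;
```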
And that is why after around four hours of finding the correct tools, setting up the database, filtering, formatting, and lastly uploading the data, I was like this at about 4 a.m.:
Follow the White Rabbit.
I’m going to explain how I finally made it work, so you don’t have to go through the same frustrating process. I’ll walk through the steps I took to move the files from my desktop into the database, using MySQL as the target since it’s one of the most common database systems.
Importing CSV into MySQL
I made a quick sketch in my mind:
- Download files
- Set up the database
- Find a way to turn CSV into SQL
The first two steps were trivial: the files were easy to find on Google and free to download. With the database set up, I had both the files and the home they would soon move into.
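Setting up an empty database really is a one-liner; the name so_survey is just a placeholder I’ll reuse in the examples below:

```sql
-- Create an empty database to hold the survey tables (name is a placeholder)
CREATE DATABASE so_survey CHARACTER SET utf8;
```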
I had been using Sequel Pro to access the database, but since it isn’t available outside of the Mac ecosystem, I had to find an alternative to import CSV into MySQL.
The internet suggested that MySQL Workbench has such abilities. I have never been very fond of Workbench (it crashes too often), but I gave it a go. Fifteen minutes later, only 56 rows had been inserted. The processing speed was terrible; at that pace, the upload would have taken ages.
After that, I tried HeidiSQL and additional online tools that promise to import CSV into SQL, but they all failed one after the other due to the size of the files.
The above-mentioned programs claim they can take a CSV file and put it into a database, but in practice they are horribly slow; Workbench in particular is known for it.
My savior answers to the name of csvkit, a suite of command-line tools for working with CSV files. Its csvsql tool can read a CSV file and generate the SQL script that creates the matching table schema. I built a CREATE_TABLE.sql file using the following command:
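Roughly speaking, it looks like this (csvsql is the relevant csvkit tool; the file and table names below are placeholders):

```bash
# Inspect the CSV and generate a MySQL-flavored CREATE TABLE statement from it
csvsql --dialect mysql --tables survey_2017 survey_results_2017_utf8.csv > CREATE_TABLE_2017.sql
```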
Csvkit expects the files to be UTF-8 encoded, which avoids the errors that can happen with other encodings. I converted the files quickly by executing the following command in PowerShell:
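A minimal sketch of that conversion, assuming the Get-Content/Set-Content cmdlets and placeholder file names:

```powershell
# Re-encode the raw CSV as UTF-8 so csvkit can read it without encoding errors
Get-Content .\survey_results_2017.csv | Set-Content -Encoding UTF8 .\survey_results_2017_utf8.csv
```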
Now it was time for some formatting.
I went through the CREATE_TABLE.sql files to do some curation. This involved reading through the generated code to check for any obvious errors and to make the table names match my naming style.
In addition, I had to merge a few columns to make them more readable while in the database.
Some examples of issues I needed to fix in the curation process were:
- The 2015 file had a two-row header
- Because the table columns were taken from the survey questions, I needed to shorten them, since MySQL limits column names to 64 characters (see the sketch below)
- The 2012 file referenced 2011 columns
- Column names were underscore separated
All of the fixes were manual and focused on reordering the contents of the original CSV files so that csvkit could process them properly.
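To give a flavor of the column-name edits, shortening a question-style name looked roughly like this (the column below is made up for illustration, not taken from the actual generated file):

```sql
-- Illustration only: the generated column name came straight from the survey question,
-- e.g. "Which of the following languages have you done extensive development work in?",
-- which blows past MySQL's 64-character identifier limit and had to be shortened.
CREATE TABLE survey_2017 (
  respondent INT,
  languages_worked_with VARCHAR(255)
  -- ...the rest of the shortened, underscore-separated columns...
);
```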
After I finished editing, I executed the scripts against the database, and then I had a schema I could work with!
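If you’re following along with the mysql command-line client, that step is a one-liner (the user and database names are placeholders):

```bash
# Run the curated CREATE TABLE script against the target database
mysql -u root -p so_survey < CREATE_TABLE_2017.sql
```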
During the editing, I had to choose between writing a script to take care of the fixes or applying them by hand. I was pretty sure that automating the process would have taken me longer, because I was tired and not focused. Sure, the manual approach is repetitive and normally slower, but it felt less risky than writing a script in that state. Besides, let’s face it: I would never need that script again.
I went with the dumb manual approach and its fewer risks. Yes, sometimes manual work is faster.
The survey result files from before 2015 are quite messy (don’t expect much from them until further curation is done). They have ordering errors, questions that spill across multiple columns, and so on, so the files are difficult to understand unless manual work is done to fix them.
However, if you have been paying attention to Stack Overflow and then spend around an hour going through the surveys, it becomes clear how the results affected the company’s course of action. I would like to say “kudos” to Stack Overflow for listening to the community over the years.
Upload Time
Now that I had the schemas, also known as tables, it was time to insert the CSV data. Luckily, MySQL can load data directly from files via its LOAD DATA INFILE statement, so I used the following command to do the trick:
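Here is a sketch of that statement, with placeholder file and table names; the exact field and line options depend on each year’s file:

```sql
-- Load the UTF-8 CSV into the matching table, skipping the header row
LOAD DATA LOCAL INFILE 'survey_results_2017_utf8.csv'
INTO TABLE survey_2017
CHARACTER SET utf8
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```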
During the insertion, I ran into a lot of column size issues. When defining a schema, you must specify the maximum string length that each column can hold. Even with the generated schema in place, the 2017 upload output 1,037,277 warnings. My assumption was that Workbench had a calculation error, because that number exceeded the total number of rows inserted.
I had neither the time nor the will to spend on those warnings, so I moved on. I made sure all the rows were inserted by checking the counts, and if some values were truncated, I just had to accept it for the time being.
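If you do want to poke at them, MySQL keeps the warnings from the last statement around, and a simple row count makes for an easy sanity check (the table name is again a placeholder):

```sql
-- Peek at a sample of the warnings produced by the last LOAD DATA statement
SHOW WARNINGS LIMIT 10;

-- Sanity check: the row count should match the number of records the load reported
SELECT COUNT(*) FROM survey_2017;
```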
Below are some details about the warnings I got for each upload:
| Year | File size | Records | Deleted | Skipped | Warnings | Time (sec) |
|------|-----------|---------|---------|---------|----------|------------|
| 2011 | 921 KB | 2,814 | 0 | 0 | 39,298 | 6.031 |
| 2012 | 2,469 KB | 6,244 | 0 | 0 | 38,787 | 17.234 |
| 2013 | 7,746 KB | 9,743 | 0 | 0 | 58,596 | 57.250 |
| 2014 | 5,986 KB | 7,644 | 0 | 0 | 67,869 | 42.734 |
| 2015 | ~27 MB | 26,086 | 0 | 0 | 275,917 | 204.156 |
| 2016 | ~68 MB | 56,030 | 0 | 0 | 541,168 | 498.578 |
| 2017 | ~91 MB | 51,392 | 0 | 0 | 1,037,277 | 666.641 |
As you can see, the number of warnings was incredibly high for every set. I manually checked a few rows per data set, just to make sure they were not all nonsense, and I could not see any truncation or other issues with the text.
There is no guaranteed way to check that all the data are correct, especially at this scale, unless you start a whole new operation (and a whole new article) just to validate that the origin and the destination have the same content.
Recap of the Process
The whole process took around four hours. No regrets here; it was actually quite enjoyable.
As a developer, you gain a lot of knowledge by doing tasks that fall outside your normal everyday work. There is no software architecture or design pattern involved; you have to search for the best tool for the job and not give up when you encounter obstacles along the way.
These might be tasks that you would hate your boss for assigning to you, but they are fun to take on during your own time. And now, if anyone asks, you know how to put CSV into MySQL.
If you’re looking for all of the data assets I mentioned, you can find them on DataCircle or check out Stack Overflow Survey Data Raw for the raw data itself.