There’s something I have been very curious about for some time now: parallel programming. The name sounded great and conveyed the same feeling as “horsepower”, the feeling that you can do impressive things with it.
Unfortunately, occasions to use that kind of technology are pretty rare if you:
- are not in a “number crunching” industry
- have plenty of time to run your calculations
- don’t have any spare hardware
Recently, on a project, we had to process a huge quantity of data (not insanely huge, dozens of GB…) in a short time frame (around one working day). The previous process took around a week, and by tuning the file formats and the algorithms, we reduced that to two or three days. But we needed more. So I remembered that parallel computing idea and started looking into it.
First conclusion: parallel computing is for UNIX/Linux. That was not going to please my customer, who only uses MS Windows. Then the miracle happened: Condor, a grid computing framework with native builds for UNIX and Windows. OK, we had the software… but how do you use it?
Second conclusion: if your process cannot be sliced into independent pieces that can run on their own, you won’t benefit much from parallelism. That sounds obvious, but it was not obvious to me, and I spent some time trying to twist my whole process so it could fit the parallel paradigm.
Third conclusion: even if you can’t split your whole process, maybe there are sections of it that can be split. If that’s the case, you can adapt your process to integrate the parallel part, which means splitting the data and the work before the calculation, then merging the results once it’s done.
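To make the split / compute / merge idea concrete, here is a minimal Python sketch. It is not how my actual process or Condor works: the function names (`split`, `process_chunk`, `merge`, `run_split_merge`) and the sum-of-squares calculation are placeholders I made up for illustration, and a local thread pool stands in for the grid.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_chunks):
    """Cut the input into at most n_chunks contiguous, independent slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """Hypothetical per-chunk calculation: here, a sum of squares."""
    return sum(x * x for x in chunk)


def merge(partials):
    """Combine the partial results into the final answer."""
    return sum(partials)


def run_split_merge(records, n_chunks=4):
    chunks = split(records, n_chunks)
    # Threads keep the sketch simple and portable; for CPU-bound work,
    # ProcessPoolExecutor (or a grid scheduler) would replace this pool.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return merge(partials)
```

The important property is that `process_chunk` only needs its own slice of the data. That is what lets each piece run anywhere, in any order, before `merge` puts the results back together.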
Fourth conclusion: parallel computing is cool. One of Condor’s greatest strengths is that it can harvest cycles on idle machines (during lunch breaks or at night, for example) and run its jobs at those times, then instantly leave the computer when the user returns, so the user does not even notice that the machine was being scavenged moments earlier. Of course, it can also run on dedicated server clusters, which provide a more stable supply of CPU power.
Final conclusion: it really helps. By using parallelism, I was able to reduce my two days to six hours. I can still use my PC while it’s doing crazy number crunching that requires 100% CPU for hours (actually, it is managing a remote quad-core server that does the work). It also became safer because every action is monitored: when a job crashes for any reason, it is restarted somewhere else, and a trace is kept in the logs, so I know that the job went wrong once, twice, … and I can take action accordingly. The best part is that if the ninth job out of ten crashes, I only have to restart that one job and no longer the full batch, saving me hours of frustration…
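The "restart only the job that failed, and keep a trace of it" behaviour can be sketched in a few lines of Python. This is only a single-machine illustration of the idea, not Condor's own mechanism: `run_batch`, its `jobs` dictionary and the `max_attempts` parameter are names I invented for the example.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("batch")


def run_batch(jobs, max_attempts=3):
    """Run each job independently; retry only the ones that fail.

    `jobs` maps a job name to a zero-argument callable (hypothetical API).
    Returns (results, attempts) so failures stay visible afterwards.
    """
    results, attempts = {}, {}
    for name, job in jobs.items():
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                results[name] = job()
                break  # this job succeeded, move on to the next one
            except Exception as exc:
                # Keep a trace of every failure, as a scheduler log would.
                log.warning("job %s failed on attempt %d: %s", name, attempt, exc)
    return results, attempts
```

A job that never succeeds simply ends up missing from `results`, while `attempts` still records how many times it was tried. The jobs that did finish are never rerun.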