Eric Gazoni's Blog

Daily thoughts for computer scientists

Month: March, 2010

Transparency

Note: this post is part of a planned series on the values I wish to promote through my daily work.

What do I mean by transparency?

Based on my rather short but broad experience in the IT business, I must say that sometimes (not to say often) people do business the way they play poker. The purpose of poker, as of most games, is to win against your opponents. The provider tries to win against the customer, the customer tries to win against his own employees, and so on…

To do successful business, we should open our eyes and realize that we don’t play against each other; we play on the same team.

Obviously, this only works as long as everyone keeps playing fair and follows the game’s rules. Let me state some of them:

  • accept that a bad work shouldn’t be awarded
  • accept that getting more implies giving more
  • accept that others can also be right
  • accept that money cannot buy everything, and that not everything may be sold for money

That’s mainly humanism applied to business: remain fair with others and no one will try to fool you. If you send the signal that you don’t respect people, don’t be shocked when they show no respect for you.

Some solutions for a better IT world

Here are a few ideas I try to spread around me on what we could do, at any level, to improve the current IT ecosystem:

Open formats

Most of my customers don’t see the issue with building their whole business around proprietary formats.

Actually, they don’t see the issue while they are doing it; after a few years, when the format has become deeply rooted in the company’s workflows, they start to see that they have shot themselves in the foot.

Proprietary formats don’t play well with others, so it’s sometimes difficult, not to say mind-boggling, to read them with an application that was not licensed to do so. Business processes then cannot be automated (or only partly), which leads to more manual operations, more points of failure, more sources of error, insufficient testing, and finally chaos (to stay polite).

And that’s not yet another open-format geek’s rant; you don’t have to take my word for it. Just ask around you how many times a closed format was the source of a major development delay, or prevented or hindered automation; you might find the answers interesting.

No vendor lock-in

It’s tempting to secure your customer base by preventing customers (sometimes contractually) from evaluating other offers made by potential competitors. But we’ve all seen the great “benefits” that come with monopolistic situations:

  • poor customer support: you don’t need to sweat for your customers, it’s not like they could fly away
  • loss of competitive advantage: we don’t need new features, the old ones are still good enough for them
  • overpriced updates: “We know we have a bug in version 1. It won’t be fixed in this version, but we have version 2, which doesn’t have the bug. Of course it will cost you . Shall I send you an upgrade form now?”

I know I do my job really well, and my customers know it too, so they are willing to pay for my services.

If one day they find someone who is better than me, then I think it’s fair that he gets the opportunity to show his skills, and that I get the opportunity to improve myself to remain competitive.

The choice is on the customer’s side, not on mine.

Open source

This one should be obvious nowadays. We are all using open source code at some point in every project. Some people admit it, others are afraid to.

Come on, it’s not something a smart developer should be ashamed of. Not working extra hard to reinvent the wheel will not get you fired. You don’t expect your surgeon or physician to reinvent medicine for each patient; you just expect him to understand the key principles well enough, to have enough experience, and to know how to apply what he has learned to your specific case.

The same goes for us: we are not the smartest guys on Earth, and we cannot invent a new way to build an e-commerce website for every customer who asks for one. So many other people have already written one, failed, learned from it, failed again, and so on. I prefer relying on those who devoted substantial amounts of time to building the most secure e-commerce software known to man, and giving them proper credit, while earning my money doing what I am best at: advising the customer, writing the small part of the application that is completely customer-specific, helping him link his website with his existing applications, and so on.

And if I can help those guys a little by releasing, for example, a bug fix for their application, I see no good reason not to do so; they deserve it. Everyone enjoys this, because no one gets screwed.

“Open schedule”

This is an especially sensitive topic. It’s not specific to IT, though; it applies to any subcontracting activity. I often wondered why some customers required me to work at their premises when the job could have been done elsewhere, like at my office. It’s because some customers fear that you don’t play fair and bill them more than necessary. By keeping you under their constant physical, visual control, they have the illusion that you are not stealing from them.

I think this idea that you could be over-billing comes from some bad players in the field: either disguised amateurs with no ethics, or crooks with no other intent than making money on the customer’s back. Either way, they left a bitter taste with the customer, who is now punishing all new contractors, and indirectly himself, for those guys. The same goes for plumbers, locksmiths, painters, and so on: because of a few bad guys, one can completely lose confidence in a profession, while 90% of its members are honest and hard-working.

Here comes the RERO principle, “release early, release often”, which is fundamental in Scrum. If you are able to deliver working software on a regular basis, then you are not stealing the customer’s money. The opposite is not true, however: keeping contractors “in-house” does not ensure that the product will be faultless, only that you will be able to watch the developers’ backs for the whole duration of the project (you’d better hope they have sexy backs, then).

Dis-cus-sion

Perhaps the most overlooked idea, yet the easiest to practice. There are situations where you can choose to play it open, or prefer to hide the issue and cross your fingers for the best outcome. For example:

  • as a customer, not having enough cash to pay for all the requirements you set
  • as a developer, not being comfortable with a new technology, or not having heard of it at all
  • as an employee, you find that working six days per week, ten hours per day is not a sustainable pace
  • as a sales rep, knowing your coders won’t deliver the product on time

That’s why people talk.

If you cannot pay for the full product, maybe we can remove some features and fall back within your budget.

If you don’t know a technology, maybe you can get training from someone who does, and either share the training fees with the customer, or pay for it yourself and accept making less profit this month.

If you are working at a death-march pace, at some point something will fail, either in your body or in your life. Reducing your workload will allow you to be more focused and less tired, and thus more productive.

If you know you won’t deliver on time, again, maybe we can remove some features, but let the customer select which ones are important to him.

Failing to communicate is digging the project’s grave.

Don’t let your pride, ego, fear or whatever speak for you. Just take one step in the other’s direction and you might be surprised by the outcome. Even if things don’t turn out the way you expected, at least you remain professional, because ignoring known risks is not something you should allow yourself to do.

Thank you for reading this far; I hope you found some points of interest, or already shared my views on the subject. As with everything, I don’t pretend to hold the truth; these are just my own observations, mixed with many things I have read over the past months.


Again, if you have comments about this post, ideas you would like to defend, or opposite experiences, feel free to express yourself in the comments below.

My Python environment

The early days

When I first heard about Python, it was just after the 2.5 release. I had heard that one of my customers was using it, but I had never seen a line of Python yet. At some point in a project, I needed a Bash script equivalent on Windows, and decided to give Python a try instead of using Windows BAT files.

I installed it on my workstation, and started reading the (excellent) documentation about the tasks I needed to do.

I used IDLE at first because, well, it ships with the Windows Python distribution. It was a very unpleasant experience, I must say (although I have learned to appreciate some features IDLE has that are missing in other editors). Tkinter is a rudimentary toolkit, and the look and feel makes it seem like it was written in the 80s (in fact, it probably was). The editor/runner mix-in concept also felt a bit weird at first. I finally returned to my all-time favorite editor, Notepad++, and ran my scripts from the command line.

Eclipse and Pydev

Later, I (luckily) landed on a new project that included a lot of Python. The team in place was using an editor they were not yet familiar with but that, fortunately, I already knew quite well: Eclipse. I must say the Pydev extension for Eclipse is one of the greatest blessings you can get when working with Python. It provides a lot of interesting features:

  • syntax highlighting
  • code completion
  • real-time code inspection (very useful when using a dynamic language)
  • PyLint integration (once you’ve tasted it, you can never work without it anymore)
  • smart indentation (you almost don’t have to worry about your indents)
  • unittest integration (although I’ve stopped using it)

Eclipse is an excellent product on its own; I had already used it for PHP development for several years, and I was happy it was also my customer’s choice for Python.

However, there were still occasions when Eclipse was not the right tool to use:

  • the integrated console implements only the basic shell
  • executing scripts outside the project path is hard, as is changing the current directory
  • for 10-line scripts, creating a project is a bit overkill

The revelation: IPython

After a couple of months struggling with Eclipse and Notepad++ & python.exe, I discovered IPython and at last found a way to work on small scripts without the overhead of Eclipse, but with all its interesting features:

  • serves as a shell replacement as well as a Python interpreter
  • excellent autocompletion (both for paths and Python code)
  • “magic” functions such as bookmarks and the list of currently defined variables (the “whos” command)
  • PDB (Python Debugger) integration with IPDB, providing code completion and history to PDB
  • post-mortem debugger (“debug” command after a traceback)
  • quick access to docstrings and source code of almost every library
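
Here is a minimal sketch of a typical session, to give an idea (the paths and the script name are made up for the example):

In [1]: cd C:/temp/scripts
In [2]: run cleanup.py     # execute a script inside the interpreter
In [3]: whos               # list the variables the script defined
In [4]: open?              # "?" shows a docstring, "??" the source
In [5]: bookmark scripts   # remember the current directory...
In [6]: cd -b scripts      # ...and jump back to it from anywhere
In [7]: debug              # post-mortem debugger, right after a traceback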

IPython is packaged inside the Python(x,y) distribution together with the Console application, a kind of command-line emulator for Windows. Once configured with a readable setup, it’s probably the best development environment you can find on Windows.


Using fake data to test your software

A little story

A couple of years ago, I was sitting at my desk at a customer’s office. The guy next to me was demonstrating to a colleague an application built by another company. The tool had been built with an obscure WYSIWYG IDE that packaged its own framework and language. Using a GUI-only framework for an important accounting application might sound like a dangerous idea, especially when the language turns out to be really inefficient.

During the demo, the guy opened a couple of customer accounts, scrolled down lists of operations, and displayed a few reports. While his colleague seemed impressed by the tool, the only positive point I could find was the cuteness of the interface. It was really shiny, the colors were elegant, the window decorations were well drawn. But there was something that made me sweat: the (relative) slowness of the product.

It might be hard to spot when you are not a professional developer, but I couldn’t understand how scrolling through the 20 customers present in the demo database could be so jerky. Every 5 items, the application froze for a split second before resuming. Not something the end user would notice, as it’s not part of his job to care about a responsive interface; his job is to work with the data shown by the interface.

A few months later, I came back to the customer, and by this time the roll-out phase of the accounting application had begun. The other company had come to migrate the existing accounts (a few thousand) into the new application. When I arrived, I found a few people gathered around one computer in the IT department. They were all yelling at the screen, so I joined them to see what was going on.

It turned out that the problem I had mentally noted before had become a complete disaster when loaded with the production data. The customer selection screen took several minutes just to load, and when you tried to open a customer record, either the application crashed, or it took half an hour and half of the main memory. I was happy I was not the one responsible for this mess…

Apparently, every screen always loaded all the records in the database into memory, then sorted out the ones it needed and displayed them. That sounds goofy enough, but the worst part was that it did this every time the screen was modified! Every time you selected an item in a list, refresh. Every time you modified a field, refresh. Well, you get the picture.

What to learn

The point of this story is: if the developer had used a larger set of test data, then unless his workstation was some crazy NASA-grade supercomputer, he could never have missed something like this. You cannot put something into production that has not been tested with a reasonable amount of data. “Reasonable” varies according to the business you’re in, of course: for a physics simulation, one million observations will seem low, but for a real estate portfolio management application, one thousand items will sound like overkill.

But “how do I get that much data?” you may ask. There are plenty of tools available for generating big datasets:

  • a Perl module to generate random data: Data::Faker (sorry, I couldn’t find the equivalent in Python)
  • for human related data (name, email, phone, credit card): http://www.fakenamegenerator.com
  • for everything else, any scripting language will do (see the sketch below)
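
To illustrate that last point, here is the kind of throwaway generator I have in mind, using nothing but the standard library (the field formats are made up for the example):

import random

def random_phone():
    # a made-up phone number in a local-looking format
    return '0%d/%03d.%02d.%02d' % (random.randint(1, 9), random.randint(0, 999),
                                   random.randint(0, 99), random.randint(0, 99))

def random_email(first, last):
    # a dummy address on a reserved example domain
    domain = random.choice(['example.com', 'example.org', 'example.net'])
    return '%s.%s@%s' % (first.lower(), last.lower(), domain)

print(random_email('Tata', 'Toto'))   # e.g. tata.toto@example.org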

An example

For a project I am currently working on, I needed to populate my database with human-related data, plus a few “business specific” fields.

I downloaded a small sample (5000 names) from fakenamegenerator.com, then did some post-processing in Python to end up with 5000 dummy people carrying all the fields I needed. I built lists of possible values, then picked randomly from them for each dummy person.

Sorry for the colors: there is no Pygments plugin on wordpress.com… Update: there is one, but I didn’t know where to find it.

from itertools import permutations

# every 3-word arrangement of these fragments makes a plausible company name
COMPANIES = [' '.join(t) for t in permutations(['buzz', 'works', 'sim', 'corp', 'data', 'micro'], 3)]

# acronym = first, middle and last characters of the company name
# (// keeps the index an integer on Python 3 too)
ACRONYMS = list(set(cn[0] + cn[len(cn) // 2] + cn[-1] for cn in COMPANIES))

DEPARTMENTS = [' '.join(t) for t in permutations(['artificial', 'financial', 'science', 'training', 'marketing', 'sales'], 3)]

LANGUAGES = ['de', 'en', 'fr', 'nl']

TITLES = ['Ph.D.', 'MSC', 'BSC', '']

I could refine it further so that there is, for example, a correlation between the name of a company and its acronym, but that’s not relevant for my application.
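
The picking step itself is then one random.choice per field. Here is a minimal sketch of how I assemble a dummy person, assuming NAMES holds the 5000 names downloaded from fakenamegenerator.com (the field names are only illustrative):

import random

def make_dummy(name):
    # draw each business-specific field independently from its list of values
    return {'name': name,
            'company': random.choice(COMPANIES),
            'acronym': random.choice(ACRONYMS),
            'department': random.choice(DEPARTMENTS),
            'language': random.choice(LANGUAGES),
            'title': random.choice(TITLES)}

dummies = [make_dummy(name) for name in NAMES]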

Other benefits

Beyond database performance benchmarks, massive fake data brings several other benefits.

Unicode support

As Tarek Ziadé pointed out recently, his name still breaks web applications because of the “é” in it. Most of us use “cultural” dummies when inputting test data by hand. You don’t want to spend time hunting down realistic examples of Chinese or Russian names for your application, but what if someone with a non-ASCII character in his name wants to join your website? With fake data, you can have one dataset for each culture, so it’s easy to test existing cultures and include new ones as well.
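
Here is a minimal sketch of what I mean by one dataset per culture (the names are just examples I typed in, and register() stands for whatever function is under test):

# -*- coding: utf-8 -*-
NAMES_BY_CULTURE = {
    'fr': [u'Tarek Ziadé', u'François Lefèvre'],
    'ru': [u'Иван Петров'],
    'zh': [u'王伟'],
}

def test_non_ascii_names():
    for culture, names in NAMES_BY_CULTURE.items():
        for name in names:
            register(name)   # must not blow up on the "é" and friends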

Field length

When I was a student, I made a lot of assumptions when writing code. One of my dumbest was that a family name (surname) would never exceed 20 characters. Of course, my tests included the famous French dummy family: “toto toto” and his wife “tata toto”, as well as his son, “tutu toto”. You see? Nothing above 20 characters. When I submitted the program to my teacher, I simply tried to input his name, and you guessed it: it was more than 20 characters long. The program crashed in flames (don’t mess with the length of C strings), and I learned the lesson.
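
With generated test data, that assumption would have died long before my teacher got near the keyboard. A small sketch of the idea (save_person() stands for the code under test):

# probe the declared limit instead of assuming nobody will ever reach it
for length in (19, 20, 21, 100):
    save_person('X' * length)   # the cases above 20 are the interesting ones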

First step towards fuzz testing

One technique I would like to use more often is fuzzing my inputs so they are “monkey-ass-proof”. If your application breaks just because someone typed a character he normally wouldn’t have (like a letter in a number field), then you should worry. Using random data, you can generate a lot of “possible” combinations of inputs (not probable, but possible) and spot your errors more efficiently.
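
Here is a minimal sketch of that idea, assuming parse_age() is the input handler under test:

import random
import string

def fuzz_string(max_length=30):
    # random printable garbage: possible input, if not probable
    length = random.randint(0, max_length)
    return ''.join(random.choice(string.printable) for _ in range(length))

for _ in range(10000):
    try:
        parse_age(fuzz_string())
    except ValueError:
        pass   # a clean rejection is fine; any other exception is a bug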

Open source packaging goofs

I am a true open source enthusiast, and I always advise my customers to use open software whenever an alternative to proprietary software exists. It’s linked to the way many software companies (Microsoft, for instance) try to lock customers in by using closed formats, “forgetting” to provide an “export all data” option, and shipping unstable (thus unusable) interoperability features.

But I must admit it’s not possible to advise people to use software that I (the computer-geek type) can’t even install or try. If I’m not able to get the software working, my customer (usually not computer savvy) probably won’t be either. Recently, I gave some Python packages a try: virtualenv, Pylons, and Trac.

Test bench

I installed virtualenv on a FreeBSD 8.0 box. It installed cleanly and I created my environment, but I was unable to use it because the command in the manual (“$ source activate.sh”) is broken and does not give a clean error message, so I could not debug the issue by myself. I assume the package is tried and tested on Linux boxes, where the source command works differently than on mine.

I also installed Pylons, three times, on my Windows dev box, because it looks like an excellent package. Out of three attempts, I managed to make the tutorial work only once. And as soon as I modified the tutorial’s “Hello, World” into something else, I got strange, unrelated errors every time I reloaded the page. Too bad; I will use Bottle for now.

Finally, yesterday I tried to install the dev branch of the Trac bug tracker. The instructions say you have to check out and install the latest version of Genshi before checking out the latest Trac itself from their SVN server. That’s what I did, and after ten minutes I got an error message saying my version of Genshi was not recent enough for Trac. Probably someone set the version requirements on his computer and tested that it could deploy, but forgot to push the modification to the Genshi tree, and I was … stuck. In the end I was able to download and install the latest stable version, but not to test the new 0.12 features.

Conclusion

Of course, I know those goofs are just exceptions in the long lives of these three excellent packages; I probably came at the wrong time, and those bugs have probably been solved by now. But it shows the real need for the automated integration testing inherent in every software development effort. And that could be a strategic advantage of open source projects over closed source ones, as their reactivity lets them adopt such tools faster than a big corporation could, with its procedures, forms to fill in, and so on. Developers should pay extra attention to making sure their software installs smoothly on every supported platform (virtualization is widespread nowadays), so users are not scared away and don’t completely miss all the shiny features of the package.

Remember that 10 years ago, Linux was labeled “computer guru stuff”: it welcomed you by asking for the size of your hard drive and the number of cylinders. Now look at what Canonical did with Ubuntu: a clean, mouse-driven, colorful installer. Suddenly, people were able to install Linux, and realized that “damn, it’s even better than my old Windows box, and I don’t have to pay for it anymore!”. So please, open source developers, fix your installers 😀

Update: I agree that instead of complaining, I could have filed a bug report with each project. While that’s true, it misses the point of “ready to use” software.

Parallel computing

There’s something I have been very curious about for some time now: parallel programming. The name sounded great; it conveyed the same feeling as “horsepower”, the feeling that you can do impressive things with it.

Unfortunately, occasions to use that kind of technology are pretty rare if you:

  1. are not in a “number crunching” industry
  2. have plenty of time to run your calculations
  3. don’t have some spare hardware

Recently, on a project, we had to process a huge (not insanely huge: dozens of GB…) quantity of data in a short time frame (around one working day). The previous process took around a week, and by tuning the file formats and the algorithms, we reduced the time to two or three days. But we needed more. So I remembered that parallel computing idea, and read up on it.

First conclusion: parallel computing is for UNIX/Linux. That was not going to please my customer, who only uses MS Windows. Then the miracle happened: Condor, a grid computing framework with native builds for UNIX and Windows. OK, we had the software… but how do you use it?

Second conclusion: if your process cannot be sliced into independent pieces that run on their own, you won’t benefit much from parallelism. That sounds obvious; it was not, and I spent some time trying to twist my whole process so it could fit the parallel paradigm.

Third conclusion: even if you can’t split your whole process, maybe there are sections of it that can be. If that’s the case, you can adapt your process to integrate the parallel part, which means splitting the data and the processing before the calculation, then merging the results once it’s done.
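
In Python, the pattern boils down to something like this (the sketch is independent of Condor itself, and the file names are made up):

def split(lines, n):
    # cut the input into n roughly equal, independent slices
    size = max(1, -(-len(lines) // n))   # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# before the calculation: write one input file per job
chunks = split(open('input.dat').readlines(), 10)
for i, chunk in enumerate(chunks):
    open('chunk.%d' % i, 'w').writelines(chunk)

# ... the grid runs each job independently, producing result.0 ... result.9 ...

# once it's done: merge the partial results back together
merged = ''.join(open('result.%d' % i).read() for i in range(len(chunks)))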

Fourth conclusion: parallel computing is cool. One of Condor’s greatest strengths is that it can harvest cycles on idle machines (during lunch breaks or at night, for example), run its jobs at those times, and instantly leave a computer when its user returns, so he doesn’t even notice his machine was scavenged moments before. Of course, it can also run on dedicated server clusters, providing a more stable supply of CPU power.

Final conclusion: it really helps. By using parallelism, I was able to reduce my two days to six hours. I can still use my PC while it does crazy number crunching (actually, while it manages a remote quad-core server doing the crunching) that requires 100% CPU for hours. And it became safer, because every action is monitored: when a job crashes for any reason, it is restarted somewhere else, but a trace is kept in the logs, so I know that a job went wrong once, twice, … and I can take action accordingly. The best part is that if the ninth job out of ten crashes, I only have to restart that one job and no longer the full batch, saving me hours of frustration…

Test driven development in .NET

Several months after discovering TDD in Python thanks to Eric Jones, I don’t think I could develop without it anymore. Now working on a different project in .NET, the first thing I did was reinstall NUnit (I had already given it a try several years ago, but without understanding how to use it, so it was soon abandoned).

I have to admit that having a GUI for the test runner (compared to nosetests) is better from a psychological point of view: all those green and red lights are way more fun than a dark listing with a big “OK” at the end.

But the NUnit runner also has its drawbacks; for instance, there is no way to tell it to stay in the tray/taskbar when you press the close button. I don’t know if it would be a useful option in general, but I have the (bad) reflex of closing windows instead of minimizing them, so I have to start NUnit like ten times a day. I also miss nose’s “--pdb-failures” option a lot 🙂

I recently read a book about unit testing that was reviewed on Slashdot, and combined with my previous experiments in Python, it completely changed the way I write code today. I think unit testing should really be taught in CS classes, as it improves software internal quality by so many orders of magnitude that it’s definitely worth the extra minutes spent writing tests.