Using fake data to test your software

A little story

A couple of years ago, I was sitting at my desk at a customer's office. The guy next to me was demonstrating to a colleague an application built by another company. The tool was built with an obscure WYSIWYG IDE that packaged its own framework and language. Using a GUI-only framework for an important accounting application might sound like a dangerous idea, especially when the language turns out to be really inefficient.

During the demo, the guy opened a couple of customer accounts, scrolled down a list of operations, and displayed a few reports. While his colleague seemed impressed by the tool, the only positive point I could find was the cuteness of the interface. It was really shiny: the colors were elegant, the window decorations were well drawn. But there was something that made me sweat: the (relative) slowness of the product.

It might be hard to spot when you are not a professional developer, but scrolling through the mere 20 customers present in the demo database should never be that irregular. Every 5 items, the application froze for a split second before resuming. Not something the final user would notice, as it's not his job to worry about a responsive interface. His job is to work with the data shown by the interface.

A few months later, I came back to the customer, and by that time the rollout phase of the accounting application had begun. The other company came to migrate the existing accounts (a few thousand) into the new application. When I arrived, I found a few people gathered around one computer in the IT department. They were all yelling at the screen, so I joined them to see what was going on.

It turned out that the problem I had mentally noted before had become a complete disaster when loaded with the production data. The customer selection screen took several minutes just to load, and when you tried to open a customer record, either the application crashed, or it took half an hour and half of the main memory. I was happy I was not the one responsible for this mess…

Apparently, every screen loaded all the records in the database into memory, sorted out the ones it needed, and displayed them. That alone sounds goofy, but the worst part was that it did this every time the screen was modified! Every time you selected an item in a list, refresh. Every time you modified a field, refresh. Well, you get the picture.

What to learn

The point of this story is: if the developer had used a larger set of test data, and unless his workstation looked like a crazy NASA-grade supercomputer, he could never have missed something like this. You cannot put something into production that has not been tested with a reasonable amount of data. "Reasonable" varies according to the business you're in, of course: when doing physics simulation, one million observations will seem low, but if you write a real estate portfolio management application, one thousand items will sound like overkill.

But "how do I get that much data?" you may ask. There are plenty of tools available to generate big datasets:

  • a Perl module to generate random data: Data::Faker (sorry, couldn’t find the equivalent in Python)
  • for human related data (name, email, phone, credit card): http://www.fakenamegenerator.com
  • for everything else, any scripting language will do (see the sketch just after this list)
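
If you go the scripting route, a few lines are enough. Here is a minimal Python sketch (the customers table, its columns, and the tiny name pools are all invented for the example) that loads 100,000 dummy customers into an SQLite database:

import random
import sqlite3

FIRST_NAMES = ['Anna', 'Bruno', 'Chen', 'Dmitri', 'Fatima']
LAST_NAMES = ['Martin', 'Okafor', 'Petrov', 'Wang', 'Zidane']

conn = sqlite3.connect('test.db')
conn.execute('CREATE TABLE IF NOT EXISTS customers (name TEXT, balance REAL)')

# A generator expression keeps memory flat even for very large batches.
rows = ((random.choice(FIRST_NAMES) + ' ' + random.choice(LAST_NAMES),
         round(random.uniform(-5000, 50000), 2))
        for _ in range(100000))
conn.executemany('INSERT INTO customers VALUES (?, ?)', rows)
conn.commit()
conn.close()

With that, scrolling a customer list in your application will feel a lot more like production than it would with a 20-row demo database.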

An example

For a project I am currently working on, I needed to populate my database with human-related data, plus a few "business specific" fields.

I downloaded a small sample (5000 names) from fakenamegenerator.com, then did some Python post-processing in order to have 5000 dummy people with all the fields I needed. I built lists of possible values, then picked from them randomly for each dummy person.


from itertools import permutations

# Company names: every 3-word arrangement of a small pool of words.
COMPANIES = [' '.join(t) for t in permutations(['buzz', 'works', 'sim', 'corp', 'data', 'micro'], 3)]

# A crude acronym: first, middle and last characters of the company name.
# Note the integer division, so this also works on Python 3.
ACRONYMS = list(set(cn[0] + cn[len(cn) // 2] + cn[-1] for cn in COMPANIES))

DEPARTMENTS = [' '.join(t) for t in permutations(['artificial', 'financial', 'science', 'training', 'marketing', 'sales'], 3)]

LANGUAGES = ['de', 'en', 'fr', 'nl']

TITLES = ['Ph.D.', 'MSC', 'BSC', '']
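
The picking step itself is then one random.choice() per field. A minimal sketch, assuming each dummy person is a plain dict (the decorate name and the field names are mine, not from the real project):

import random

def decorate(person):
    """Add the "business specific" fields to a dummy person dict."""
    person['company'] = random.choice(COMPANIES)
    person['acronym'] = random.choice(ACRONYMS)
    person['department'] = random.choice(DEPARTMENTS)
    person['language'] = random.choice(LANGUAGES)
    person['title'] = random.choice(TITLES)
    return person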

I could refine it further, so that there is a correlation between the name of a company and its acronym, for example, but that's not relevant for my application.

Other benefits

Beyond benchmarking database performance, you can get several other benefits from using massive fake data.

Unicode support

As Tarek Ziadé pointed out recently, his name still breaks web applications because of the "é" in it. Most of us use "cultural" dummies when typing test data by hand. You don't want to spend time hunting for realistic examples of Chinese or Russian names for your application, but what if someone with a non-ASCII character in his name wants to join your website? With fake data, you can have one dataset for each culture, so it's easy to test and include new cultures as well.
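
A minimal sketch of what the per-culture approach can look like, with one tiny hand-written pool per culture (a real dataset would come from a generator, and would be much larger):

# -*- coding: utf-8 -*-
# One small pool of names per culture; every entry is a Unicode string.
NAMES = {
    'fr': [u'Tarek Ziadé', u'François Lefèvre'],
    'de': [u'Jürgen Groß', u'Björn Müller'],
    'ru': [u'Дмитрий Иванов', u'Ольга Смирнова'],
    'zh': [u'王小明', u'李娜'],
}

def all_test_names():
    """Yield (culture, name) pairs, for one test pass over all cultures."""
    for culture, names in sorted(NAMES.items()):
        for name in names:
            yield culture, name

Adding a new culture is then just one more entry in the dictionary.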

Field length

When I was a student, I made a lot of assumptions when writing code. One of my dumbest was that a family name (surname) would never exceed 20 characters. Of course, my tests included the famous French dummy family: "toto toto" and his wife "tata toto", as well as his son, "tutu toto". You see? Nothing above 20 characters. When I submitted the program to my teacher, I simply tried to input his name, and, you guessed it, it was more than 20 characters long. The program crashed in flames (don't mess with the length of C strings), and I learned the lesson.
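
Random data kills that kind of assumption very quickly. A small sketch, assuming nothing more than a hypothetical 20 character limit to stress:

import random
import string

def random_name(max_length=120):
    """A name-shaped string of random length, from 1 to max_length characters."""
    length = random.randint(1, max_length)
    return ''.join(random.choice(string.ascii_lowercase)
                   for _ in range(length)).capitalize()

# Count how often a batch of 1000 random names would have broken
# my student program and its 20 character buffer.
broken = sum(1 for _ in range(1000) if len(random_name()) > 20)
print(broken, 'names out of 1000 would have crashed it')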

First step to fuzz testing

One technique I would like to use more often is fuzzing my inputs so they are "monkey ass proof". If your application breaks just because someone typed a character he normally wouldn't have (like a letter in a number field), then you should worry. Using random data, you can generate a lot of "possible" combinations of inputs (not probable, but possible) and then spot your errors more efficiently.
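
As a very first step, here is a sketch of such a generator for a field that expects a number; parse_amount is just a stand-in for whatever your application does with the input:

import random
import string

def fuzz_number_field():
    """Return a mostly valid number, sometimes polluted with a letter."""
    value = str(random.randint(0, 99999))
    if random.random() < 0.3:  # corrupt roughly 30% of the inputs
        pos = random.randrange(len(value))
        value = value[:pos] + random.choice(string.ascii_letters) + value[pos:]
    return value

def parse_amount(text):
    """Stand-in for the code under test."""
    return int(text)

for _ in range(20):
    sample = fuzz_number_field()
    try:
        parse_amount(sample)
    except ValueError:
        print('rejected bad input:', repr(sample))

If "rejected" here were a crash instead of a clean error, you would have found a bug before your users did.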
