Eric Gazoni – Wise Old Geek

Technical Advisor | Software Architect | Open Source Creator

Building useful things for 20 years. Still debugging life.

Tag: Python

  • openpyxl starts being used

    When I started working on openpyxl a few months ago, I didn’t know it would catch that much activity around it. I’m very happy to see that it can apparently help  so many people 🙂

    I’ll try to follow-up on the bug fixes and new features as far as my time permits, and will usually answer emails within the day. Thank you everyone for using the library, even though it is still far from being perfect 😉 Keep posting bugs on the tracker or ideas and requests on the mailing list !

  • openpyxl is on PyPi

    Yes, I know, yet another post about openpyxl 😉

    This time I’m announcing the release of the 1.1.0 version on PyPi.

    As mentioned in the Wiki, you can now just type easy_install openpyxl to get the latest released version, if you are not much into getting the snapshop from bitbucket.

    Also, thanks to Yaroslav’s great job, openpyxl is also packaged under Debian testing !

    Now there is no valid reason for not giving it a try… maybe except that darn memory footprint…

  • openpyxl turns 1.1

    After two weeks of intense activity around openpyxl, I’m releasing version 1.1 today. This new version brings support for dates and number formats.

    Several bugs have been fixed, thanks to the careful testing of two new contributors, Jonathan Peirce and Yaroslav Halchenko, both working on the PsychoPy project.

    Thanks guys for boosting my morale, providing valuable advises and patches !

    Many thanks goes to Marko Loparic for his support and enthusiasm 😉

    You can get the sources for the latest version here http://bitbucket.org/ericgazoni/openpyxl. I expect a lot of bug reports with this new version, as it is stable but not extensively tested yet, and that the number of possibilities have seriously increased with the introduction of number formatting.

    Keep in mind that the memory footprint is still high, but that it is the target for milestone 1.2. It should perform reasonably well if your needs are moderate (<100.000 cells), but if you want to add more data, then it might start consuming RAM pretty quickly. This holds for writing and reading.

    Memory consumption is almost linear, and a 15MB workbook results in 450MB in RAM.

    There is also a new mailing list for the project: http://groups.google.com/group/openpyxl-users. It’s pretty empty for now, but feel free to ask questions there, I’ll be reading it regularly.

    Bug reports will be better handled if they are filed on the project bug tracker: http://bitbucket.org/ericgazoni/openpyxl/issues/new.

    Happy coding !

  • openpyxl reaches 1.0 mark

    After a few more efforts, I am pleased to announce the release of the first version of openpyxl.

    The reader and the writer are working and tested for strings and numbers.

    I have been able to read and write simple Excel 2007 xlsx files from Python and open them with Excel.

    You can clone the repository using Mercurial:

    hg clone https://ericgazoni@bitbucket.org/ericgazoni/openpyxl

    or download the release in zip format.

    Edit: 1.0 release is really outdated, you might want to get a more recent version here.

    The (sparse for now) documentation can be found on the wiki.

    Reader usage (using the “empty_book.xlsx” file from the previous example)

    from openpyxl.reader.excel import load_workbook
    
    wb = load_workbook(filename = r'empty_book.xlsx')
    
    sheet_ranges = wb.get_sheet_by_name(name = 'range names')
    
    print sheet_ranges.cell('D18').value # should display D18
    

    Code is published under the MIT licence, so you can use it for whatever use you need, and I’d be very happy if  you drop me an email if  you use it 🙂

    If you don’t find it useful, spot a bug, or want to suggest an enhancement, you can do so by filling a ticket on the tracker.

    Features that will be added in the next version are listed here, so if you need something in this list, please be patient or send me a message to tell me to hurry 😉

  • openpyxl: simple writer done

    I’ve been very busy on openpyxl the last few days, and I managed to get a working writer for basic data types (strings, numerics).

    For the impatient, you can clone my bitbucket repository:

    hg clone https://ericgazoni@bitbucket.org/ericgazoni/openpyxl
    

    It’s still a work in progress, so expect some quirks here and there, and if that happens, please file a new issue here.

    If you like it, you can also drop a comment below or send me an email (see Contact page).

    Usage is pretty simple as you can see:

    from openpyxl.workbook import Workbook
    from openpyxl.writer.excel import ExcelWriter
    
    from openpyxl.cell import get_column_letter
    
    wb = Workbook()
    
    ew = ExcelWriter(workbook = wb)
    
    dest_filename = r'empty_book.xlsx'
    
    ws = wb.worksheets[0]
    
    ws.title = &quot;range names&quot;
    
    for col_idx in xrange(1, 40):
        col = get_column_letter(col_idx)
        for row in xrange(1, 600):
            ws.cell('%s%s'%(col, row)).value = '%s%s' % (col, row)
    
    ws = wb.create_sheet()
    
    ws.title = 'Pi'
    
    ws.cell('F5').value = 3.14
    
    ew.save(filename = dest_filename)
    
    

    Next features are:

    1. a working reader (so that I can read back files generated by the writer)
    2. dates support
    3. calculations
    4. formatting
  • openpyxl: my python xlsx library

    Update: openpyxl 1.0 is now out !

    At a customer, we read a lot of Excel files. We’ve tried the conventional approaches, that are xlrd and xlwt, pyinex, and COM automation.

    That’s COM that we mainly use, because it’s able to deal with every Excel file format, from the ancient Excel 5 to most recent Excel 2007 Office Open XML format.
    However, we experience from time to time stability issues (Excel is a complex beast, sometimes you don’t fully understand why it is angry).

    We then looked for a native reader for .xlsx format, to get rid of the Excel part of the equation, but unfortunately, there are only two small read-only libraries for now:

    Finally, I thought that I was the only guy who needed a native .xslx writer, and decided to stick with COM for now.
    I wouldn’t be doing this project now without a tweet from Tarek Ziadé, who was also looking for such a library. That meant that we were at least two in need for the same thing, so I simply decided to write it.

    Trust me, the Office Open XML format is open, but it’s also a bit twisted, so I spent a few days gathering documentation, and I finally landed on the PHPExcel library, that was already doing what I needed, but in PHP.

    So now, I’m busy porting the PHPExcel library under Python, which is really easy, because of the similarities between both languages, but I can also benefit from all the nice things that come with Python, so the code is much simpler.

    You can follow my progress on bitbucket: http://bitbucket.org/ericgazoni/openpyxl/

  • Using fake data to test your software

    A little story

    A couple of years ago, I was sitting at my desk at a customer office. The guy next to me was demonstrating an application built by another company to his colleague. The tool was built with an obscure WYSIWYG IDE that packaged its own framework and language. The point of using a GUI only framework for an important accounting application might sound a dangerous idea, especially when this language turns really inefficient.

    During the demo, the guy opened a couple of customer accounts, scrolled down list of operations, and displayed a few reports. While his colleague seemed impressed by the tool, the only positive point I could find was the cuteness of the interface. It was really shiny, the colors were elegant, window decorations were well drawn. But there was something that made me sweat: the (relative) slowness of the product.

    It might be hard to spot when you are not a professional developer, but I can’t imagine how scrolling down on the 20 customers present in the demo database could be so irregular. Every 5 items, the application froze for a split second, before resuming. Not something the final user will notice, as it’s not part of his job to have a responsive interface. His job is to work with the data shown by the interface.

    A few month later, I came back to the customer, and by this time, the roll out phase of the accounting application had begun. The other company came to migrate the existing accounts (a few thousands) into the new application. When I arrived, I found a few people gathered around one computer, in the IT department. They were all yelling at the screen, so I joined them to see what was going on.

    It appears that the problem I mentally noticed before had turned to a complete disaster when loaded with the production data. The customer selection screen took several minutes to just load, when you tried to open a customer record, either the application crashed, either it took half an hour and half of the main memory. I was happy I was not the one responsible for this mess…

    Apparently, it seems that every screen was always loading all the records in the database into memory, then sorted the ones it needed, and displayed them. Sounds just goofy, but the worst was it did this every time the screen was modified ! Every time you selected an item in a list, refresh. Every time you modified a field, refresh. Well, you get the picture.

    What to learn

    The point of this story is: if the developer used a larger set of test data, and unless his workstation looked like a crazy NASA grade supercomputer, he would never have missed something like this. You cannot put something into production that has not been tested with a reasonable amount of data. “Reasonable” varies according to the business you’re into of course, when doing physics simulation, one million observation will seem low, but if you write a real estate portfolio management application, one thousand items will sound overkill.

    But “how do I get that many data ?” you may ask. There are plenty of tools available to generate big datasets:

    • a Perl module to generate random data: Data::Faker (sorry, couldn’t find the equivalent in Python)
    • for human related data (name, email, phone, credit card): http://www.fakenamegenerator.com
    • for everything else, any scripting language will do

    An example

    For a project I am currently working on, I needed to populate my database will human related data, plus a few “business specific” fields.

    I downloaded a small sample (5000 names) from fakenamegenerator.com, then I made some Python post-processing in order to have 5000 dummy people with all the fields I needed. I build lists of possible values, then pick randomly them for each dummy people.

    Sorry for the colors, there is no pygments plugin on wordpress.com… Update: there is one, but I didn’t know where to find it

    from itertools import permutations
    
    COMPANIES = [' '.join(t) for t in list(permutations(['buzz','works','sim','corp','data','micro'], 3))]
    
    ACRONYMS = list(set([cn[0] + cn[len(cn)/2] +  cn[-1] for cn in COMPANIES]))
    
    DEPARTMENTS = [' '.join(t) for t in list(permutations(['artificial','financial','science','training','marketing','sales'], 3))]
    
    LANGUAGES = ['de', 'en', 'fr', 'nl']
    
    TITLES = ['Ph.D.', 'MSC', 'BSC', '']
    

    I could refine it further, so there is a correlation between the name of the company, and the company acronym for example, but it’s not relevant for my application.

    Other benefits

    Except database performance benchmark, you can have several other benefits from using massive fake data.

    Unicode support

    As Tarek Ziadé pointed out recently, his name still breaks web applications, because of the “é” in it. Most of us will use “cultural” dummies when inputting them by hand. You don’t want to spend time investigating realistic examples of chinese or russian names for your application, but what if someone with a non-ascii symbol in his name wants to join you website ? With fake data, you can have one dataset for each culture, so it’s easy to test and include new cultures as well.

    Field length

    When I was a student, I was doing a lot of assumptions when writing code. One of my dumbest was that a family name (surname) would never exceed 20 characters. Of course, my tests included the famous french dummy family:  “toto toto” and his wife “tata toto”, as well as his son, “tutu toto”. You see ? Nothing above 20 characters. When I submitted the program to my teacher, I simply tried to input his name, and you guessed it, it was more than 20 characters long. The program crashed into flames (don’t mess with the length of C strings), and I learned the lesson.

    First step to fuzzy testing

    One technique I would like to use more often is fuzzing my inputs so they are “monkey ass proof”. If your application breaks just because someone typed a character he normally wouldn’t have (like a letter in a number field), then you should worry. Using random data, you can generate a lot of “possible” combinations (not probable, but possible) of inputs and then spot your errors more efficiently.

  • Test driven development in .NET

    Several months after discovering TDD in Python thanks to Eric Jones, I don’t think I could develop anymore without it. Currently working on a different project in .NET, the first thing I did was re-installing NUnit (I already gave it a try several years ago, but without understanding how to use it, it was soon abandoned).

    I have to admit that having a GUI for the test runner (compared to Nosetest) is better on the psychological point of view: having all those green and red lights is way funnier than a dark listing with a big “OK” at the end.

    But NUnit runner also has its drawbacks, for instance there is no way to tell it to stay on the tray/taskbar even when pressing the close button. I don’t know if it would be a useful option but I have the (bad) reflex of closing windows instead of minimizing them, so I have to start NUnit like ten times a day. I also miss the “–pdb-failure” option a lot 🙂

    I’ve recently read a book that was reviewed on Slashdot about unit testing and this, and combined with my previous experiments in Python, it completely changed the way I write code today. I think unit testing should really be taught in CS classes, as it improves software internal quality by so many orders of magnitude that it’s definitely worth the extra minutes spent writing tests.