Hi, I'm Gregor, welcome to my blog where I mostly write about data visualization, cartography, colors, data journalism and some of my open source software projects.

Start using databases, today!


This post is written to welcome dataset, a new library to simplify working with databases in Python.

Let’s face it. Relational databases, such as MySQL, SQLite and PostgreSQL, are pretty cool – but nobody actually uses them. At least not in the day-to-day work with small to medium scale datasets. But why is that? Why do we see an awful lot of data stored in static files in CSV or JSON format, even though

  • they are hard to query (you need to write a custom script every time)
  • they are messy, as they cannot store metadata such as data types
  • it is a pain to update them incrementally, say if some record has changed?

Programmers are lazy people

The answer is that programmers are lazy, and thus they tend to prefer the easiest solution they can find. And in most programming languages, a database isn’t the simplest solution for storing a bunch of structured data. At least in Python, things really shouldn’t be this way. So, say hello to dataset, a new Python library to simplify your everyday work with databases. In a nutshell, dataset makes managing databases as simple as reading and writing plain JSON files. Here’s a brief list of the key features:

  • Automatic schema: You never need to worry about the database schema again. If a table or column is written that does not exist in the database, it will be created automatically.
  • Upserts: Very handy when running a scraper a second time: Records are either created or updated, depending on whether an existing version can be found.
  • Query helpers: There are some nice query helpers for simple queries, such as all rows in a table or all distinct values across a set of columns.
  • Compatibility: Being built on top of SQLAlchemy, dataset seamlessly works with all major databases, such as SQLite, PostgreSQL and MySQL.
  • Scripted exports: Data can be exported based on a scripted configuration, making the process easy and replicable.

Hope this comes in handy for some of you, since I cannot live without the library anymore. If you want to read more, check out the full documentation at

Happy databasing!


Anyong (Jun 04, 2013)

Interesting! What’s the difference between ‘dataset’ and ‘MongoDB’?