Data Persistence and Exchange

There are two aspects to preserving data for long-term use: converting the data back and forth between the in-memory object and the storage format, and working with the storage of the converted data. The standard library includes a variety of modules that handle both aspects in different situations.

Two modules convert objects into a format that can be transmitted or stored (a process known as serializing). It is most common to use pickle for persistence, since it is integrated with some of the other standard library modules that actually store the serialized data, such as shelve. json is more frequently used for web-based applications, however, since it integrates better with existing web service storage tools.
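As a minimal sketch of the two serializers, the example below round-trips the same object through pickle and json. The record dictionary is a hypothetical value chosen only for illustration.

```python
import json
import pickle

# Hypothetical record used only for illustration.
record = {"name": "example", "values": [1, 2, 3], "active": True}

# pickle produces a Python-specific bytes object.
pickled = pickle.dumps(record)
restored_from_pickle = pickle.loads(pickled)

# json produces text that programs in other languages can read.
encoded = json.dumps(record, sort_keys=True)
restored_from_json = json.loads(encoded)

print(type(pickled).__name__, restored_from_pickle == record)
print(encoded)
print(restored_from_json == record)
```

The pickle output is only meaningful to Python, while the JSON text is limited to basic types such as dictionaries, lists, strings, numbers, and booleans.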

Once the in-memory object is converted to a format that can be saved, the next step is to decide how to store the data. A simple flat file with serialized objects written one after the other works for data that does not need to be indexed in any way. When an indexed lookup is needed, Python includes a collection of modules for storing key-value pairs in a simple database using one of the DBM format variants.
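The flat-file approach can be sketched as follows, assuming pickle is the serializer. The helper names write_records and read_records, and the file name records.pkl, are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile


def write_records(path, records):
    """Append each record to the file as a separate pickle."""
    with open(path, "wb") as f:
        for record in records:
            pickle.dump(record, f)


def read_records(path):
    """Yield records in the order they were written."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # pickle.load() raises EOFError at the end of the file.
                return


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.pkl")
    write_records(path, [{"id": 1}, {"id": 2}, {"id": 3}])
    loaded = list(read_records(path))

print(loaded)
```

Finding one particular record in a file like this means reading every record before it, which is why an indexed store is the better fit when lookups by key are needed.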

The most straightforward way to take advantage of the DBM format is shelve. Open the shelve file, and access it through a dictionary-like API. Objects saved to the database are automatically pickled and saved without any extra work by the caller.
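A short sketch of the dictionary-like API, using a temporary directory so nothing is left behind. The key "settings" and the stored values are illustrative assumptions.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_shelf")

    # Values are pickled automatically on assignment.
    with shelve.open(path) as db:
        db["settings"] = {"theme": "dark", "size": 12}

    # Reopen the shelf and read the value back as a normal object.
    with shelve.open(path) as db:
        settings = db["settings"]
        keys = list(db.keys())

print(settings)
print(keys)
```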

One drawback of shelve, though, is that when using the default interface there is no way to predict which DBM format will be used, since it selects one based on the libraries available on the system where the database is created. The format does not matter if an application will not need to share the database files between hosts with different libraries, but if portability is a requirement, use one of the classes in the module to ensure a specific format is selected.
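One possible way to pin the format, sketched here as an assumption rather than the only option, is to open a database with a specific dbm backend and wrap it in shelve.Shelf. This example uses dbm.dumb because it is implemented in pure Python and is available wherever the standard library is; the file name portable_shelf is hypothetical.

```python
import dbm.dumb
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "portable_shelf")

    # Choose the backend explicitly instead of letting shelve.open() pick one.
    with shelve.Shelf(dbm.dumb.open(path, "c")) as db:
        db["key"] = [1, 2, 3]

    # dbm.whichdb() reports which backend it believes created the files.
    detected = dbm.whichdb(path)

    with shelve.Shelf(dbm.dumb.open(path, "r")) as db:
        value = db["key"]

print(detected)
print(value)
```

dbm.dumb is slower than the C-backed variants, so this trades speed for a format that does not depend on system libraries.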

For web applications that work with data in JSON already, using json and dbm provides another persistence mechanism. Using dbm directly is a little more work than shelve because the DBM database keys and values must be strings or bytes (strings are encoded to bytes when stored, and values come back as bytes), and the objects will not be re-created automatically when the value is accessed in the database.
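A minimal sketch of pairing json with dbm, showing the extra encode and decode steps the caller is responsible for. The key "user:1" and the stored document are hypothetical.

```python
import dbm
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "json_store")

    # The caller serializes the object before storing it.
    with dbm.open(path, "c") as db:
        db["user:1"] = json.dumps({"name": "Ada", "langs": ["python"]})

    # The stored value comes back as bytes and must be decoded explicitly.
    with dbm.open(path, "r") as db:
        raw = db["user:1"]
        user = json.loads(raw)

print(type(raw).__name__)
print(user)
```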

The sqlite3 in-process relational database is available with most Python distributions for storing data in more complex arrangements than key-value pairs. It stores its database in memory or in a local file, and all access is from within the same process, so there is no network communication lag. The compact nature of sqlite3 makes it especially well suited for embedding in desktop applications or development versions of web apps.
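As a brief sketch, the example below creates an in-memory database, inserts a few rows, and queries them. The task table and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

# ":memory:" keeps the database in RAM; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task (id INTEGER PRIMARY KEY, title TEXT, done INTEGER)"
)
conn.executemany(
    "INSERT INTO task (title, done) VALUES (?, ?)",
    [("write", 1), ("test", 0), ("ship", 0)],
)
conn.commit()

# Query with a parameter placeholder rather than string formatting.
cursor = conn.execute(
    "SELECT title FROM task WHERE done = ? ORDER BY id", (0,)
)
pending = [row[0] for row in cursor]
conn.close()

print(pending)
```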

There are also modules for parsing more formally defined formats, useful for exchanging data between Python programs and applications written in other languages. xml.etree.ElementTree can parse XML documents, and provides several operating modes for different applications. Besides the parsing tools, ElementTree includes an interface for creating well-formed XML documents from objects in memory. The csv module can read and write tabular data in formats produced by spreadsheets or database applications, making it useful for bulk loading data, or converting the data from one format to another.
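The sketch below writes the same small data set as XML with ElementTree and as CSV with the csv module, then parses each result back. The element names, column names, and row values are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical rows used only for illustration.
rows = [{"name": "alpha", "count": "1"}, {"name": "beta", "count": "2"}]

# Build an XML document from in-memory objects.
root = ET.Element("items")
for row in rows:
    item = ET.SubElement(root, "item", name=row["name"])
    item.text = row["count"]
xml_text = ET.tostring(root, encoding="unicode")

# Parse the XML text back and pull out the attribute values.
parsed = ET.fromstring(xml_text)
names = [item.get("name") for item in parsed.findall("item")]

# Write the same rows as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "count"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the CSV text back into dictionaries keyed by the header row.
round_tripped = [dict(r) for r in csv.DictReader(io.StringIO(csv_text))]

print(xml_text)
print(names)
print(round_tripped == rows)
```

Both formats store values as text, so numeric fields such as count need to be converted by the caller after reading.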