This piece introduces a new Python library: cavro. There is a lot of technical detail and narrative explaining why I felt it was needed (related to dicts!), but if you just need to parse or create Avro messages faster using Python, then visit here: https://cavro.io.
As a long-time Python developer, I've heard the phrase "In Python, everything is a dict" on many occasions. Typically in a slightly condescending tone.
I've always quite liked this label. While it's often used pejoratively, to suggest that Python programs just stuff data into amorphous dicts rather than structuring things "properly" into classes or specialist data structures, the real beauty comes to the fore when you understand that even if you use a class, it's fundamentally powered by dicts.
Granted, things have evolved in Python recently for performance reasons, but the essence remains: accessing a dict key and a class attribute can often be effectively the same thing. This feature is subtly central to Python's dynamic nature and power, and something I missed while writing Ruby.
So, why claim that dicts can be bad?
Pythonâs dict implementation is great. Dicts are fast and efficient. But there are times when, especially in performance-critical code, doing a hash-based lookup can be slower than required. This is an exploration into a situation where the use of dicts can cause problems.
Content Overview
- Enter Avro
- Enter fastavro
- Enter <internal library name>
- The Problem
- A compounding factor
- The Solution
- Current State of Affairs
- Enter cavro
- Try it out!
Enter Avro
This all started about 7 years ago when my team was developing a new app that needed to decode and encode a lot of Apache Avro messages. This was part of a processing pipeline, and our app was reading events and outputting new ones. The peak target was something like 100 messages per second per thread.
We used the official Python avro library (I'm referring to this as apache-avro), and started testing. End to end, we were getting… 0.5 messages per second. Each message was taking, I think, 0.1s to decode, and 2 seconds to encode! This translates to about 4,000,000,000 clock cycles to encode approximately 1k of data.
We needed a 200x improvement to meet our target.
Now, some people might argue "Python is slow, what do you expect?" but that opinion doesn't lead to success. Instead, we set out to understand more, and find solutions.
Enter fastavro
Searching the internet quickly led to an alternate library: fastavro. Someone else had encountered similar problems with the Apache library and written a fast version. This was great! We loaded up the library and tried it, getting (I forget exactly) about a 2x speed improvement.
Our hopes for a quick win with fastavro were dashed: fastavro was faster, but not nearly fast enough for us. This was surprising. Fastavro uses Cython to AOT-compile a lot of logic, so mindlessly blaming Python's speed was starting to seem less credible.
Fastavro has some good performance numbers, is a solid library, and for almost every use-case, is perfectly fine. However, in our situation, we had a typically-enterprise problem that hit a very ugly corner case in how both fastavro and apache-avro libraries handled complex messages. Institutionally, we were doing something that few other people were doing with avro and so were having to deal with the side-effects.
It was at this point, in a fit of disbelief and hubris, that I declared I would write my own version that would be the "best ever"™ and solve all our problems, and went off to a coffee shop to start coding.
Thankfully, this gambit paid off. A few hours later we had a skeleton avro library that could meet our performance targets (I want to say about 50ms per message), and fleshing out the library into something deployable wasn't too much more work. (Ironically, the bureaucracy took far longer than the coding.)
Enter <internal library name>
The library blew fastavro and apache-avro out of the water performance-wise, and beat both on every performance metric.
How is this possible? Did I use some magical proprietary wizardry to hand-unroll parse trees and output inscrutable, hyper-optimised assembly? No.
Instead, we took a clean look at the published Avro specification, ignored the existing implementations, and wrote a codec in a pythonic way using standard OOP techniques, some type hinting, and threw the code through Cython to compile.
The Problem
This section is quite technical and dense; feel free to skip down to "The Solution" below.
If you're just interested in the summary: apache-avro and fastavro both use dicts in different ways that hurt performance, and the key to being faster is to avoid doing that.
Avro is a message format that uses a defined schema (like Protocol Buffers), and whose messages are not self-describing. In order to encode or decode an avro message, you need a schema that tells you what values go where, and what type they are.
A typical process for encoding a message using a defined schema looks a bit like this:
You load the schema into memory, parse it into a usable form, and then use that schema to encode values. Simple really. Note: the load and parse steps only have to be done once.
Codecs are in a class of libraries that are a bit special: their code should be assumed, by default, to be called in a hot loop. If you're writing a codec (and you can't review every possible use-case), you should assume that the actual encoding/decoding is going to be a performance bottleneck for someone. It's therefore important to make these parts efficient by default.
In Avro, schemas are defined as JSON documents, and here's where the first trap lies: reading JSON in Python is trivial, you use json.loads (or equivalent) and get back a first-class plain data object (for convenience, let's call it a dict). Dicts are so convenient to use in Python that it's tempting to stop there and say our parsed schema object is the value returned by the json.loads call, rather than convert it to a more optimal representation.
This superficially makes sense because the job of the encoder is to take the schema and work out how to encode the data. If the schema is:
{"type": "int"}
Then the encoder can say:
if schema['type'] == 'int':
    return encode_int(value)
And that's simple. This approach seems to scale quite nicely: there are only about 15 types to handle, and so the code stays simple.
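For illustration, here's what an `encode_int` helper might look like, following the zig-zag varint encoding described in the Avro specification (this is my own sketch, not any library's actual implementation):

```python
def encode_int(value: int) -> bytes:
    """Encode an int as an Avro zig-zag varint (per the Avro spec)."""
    # Zig-zag maps small magnitudes (positive or negative) to small codes:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    zz = (value << 1) ^ (value >> 63)
    out = bytearray()
    while True:
        byte = zz & 0x7F
        zz >>= 7
        if zz:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

So `encode_int(1)` produces `b'\x02'` and `encode_int(-1)` produces `b'\x01'`, matching the examples in the spec.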
The apache library at the time appeared to be a bit different: they half-unpack the JSON into classes, but still implement the encoding as a chain of if statements:
if writers_schema.type == 'null':
    encoder.write_null(datum)
elif writers_schema.type == 'boolean':
    encoder.write_boolean(datum)
elif writers_schema.type == 'string':
    encoder.write_utf8(datum)
elif writers_schema.type == 'int':
    encoder.write_int(datum)
elif writers_schema.type == 'long':
    encoder.write_long(datum)
elif writers_schema.type == 'float':
    encoder.write_float(datum)
...
The problem with this is that "in Python, everything is a dict", especially in Python 2 (both our enforced Python version, and apache-avro's only supported Python at the time), where instance attributes are looked up in a dictionary.
This code, from a performance point of view, is equivalent to:
if writers_schema['type'] == 'null':
    ...
elif writers_schema['type'] == 'boolean':
    ...
...
So, for each decode/encode operation, you're doing up to 14 dictionary lookups (each time looking up the same value, 'type') just to work out which method to dispatch to.
Unlike ahead-of-time compiled languages/platforms, such as C or the JVM, Python can't optimise away the repeated attribute lookup (although in recent Python versions, attribute lookups have been optimised significantly). This is because another thread could alter the value of writers_schema.type at any time without warning, so the runtime has to re-look-up the value for each if-branch.
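The effect is easy to demonstrate in a sketch: hoisting the attribute read into a local variable does the dictionary lookup once, and the elif chain then compares against a fast local instead (the names here are illustrative, not from either library):

```python
class WriterSchema:
    def __init__(self, type_name):
        self.type = type_name  # stored in the instance __dict__ (Python 2 style)

def dispatch_naive(schema):
    # Every comparison re-reads schema.type: a fresh dict lookup each time,
    # because the runtime can't assume the value hasn't changed.
    if schema.type == 'null':
        return 'write_null'
    elif schema.type == 'boolean':
        return 'write_boolean'
    elif schema.type == 'int':
        return 'write_int'
    return 'write_other'

def dispatch_hoisted(schema):
    t = schema.type  # one lookup; the chain below reads a cheap local
    if t == 'null':
        return 'write_null'
    elif t == 'boolean':
        return 'write_boolean'
    elif t == 'int':
        return 'write_int'
    return 'write_other'
```

Both functions return the same answers; the hoisted version just pays the lookup cost once per call instead of once per branch.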
fastavro at the time was smarter, folding the if/else dispatch into a dict of types, but it still ended up doing several dict lookups (edited for clarity). Ironically, by using dict-based dispatch with Cython, fastavro was reducing Cython's ability to optimise the call patterns, because the dict stopped Cython from knowing the types of the writer functions:
schema_type = extract_record_type(schema) # Performs a dict lookup: schema['type']
return WRITERS[schema_type](fo, datum, schema) # Writers is a dict lookup too.
From the outside, this doesn't look awful, but this small cost starts to build up...
Encoding simple scalar values is fine, but if you have structured data to encode (array/map/record etc.), then the generic write methods have to be called for the structure type, and also separately for each value in the structure.
So, if you have a container with 100 values, apache-avro is performing up to 1,400 dict lookups, and fastavro 200 dict lookups.
While 1,400 dict lookups may not seem excessive given Python's efficient dict implementations, the profilers kept showing this to be the problem.
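The arithmetic behind those figures, spelled out (using the worst-case chain length from the snippet above):

```python
# Back-of-the-envelope dispatch cost for one container of 100 values:
chain_length = 14   # worst-case if/elif comparisons per value in apache-avro
values = 100        # items in the container

apache_lookups = chain_length * values   # up to 1,400 dict lookups
fastavro_lookups = 2 * values            # schema['type'] + WRITERS[...] per item
```

And this is just dispatch overhead, before any actual encoding work happens.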
There is more to the story…
A compounding factor
The schema we were using was quite complex, effectively it was a union of hundreds of different record types. This meant that any value being decoded could represent any one of hundreds of different message types. The JSON schema representation came in at about 3MB at the time.
On decoding, the avro spec handles this efficiently, the rule is to output the index into the list of possible sub-types, so a decoder can jump straight to the right type and start decoding.
On encoding, the library has to choose the best member of the union to use to encode.
Both fastavro and apache-avro accept dicts as values to encode, matching keys in the dictionary to the field names in the schema. The trouble comes when you have records in a union, because the libraries have to use rules to work out which union entry matches the data, and this can only be done by matching fields up to dictionary keys. (fastavro has since gained several custom ways to specify the type directly.)
So now, to encode a value, we have to loop over every type in the union. For each type, we loop over its fields and every key in the dictionary, ensuring they all match up. This involves numerous dictionary lookups in each loop. Suddenly, 1,400 dict lookups become 1,400,000 or more.
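A simplified sketch of why this explodes (the real matching rules also check field types and nested structures, so this is deliberately minimal and the names are illustrative):

```python
def choose_union_member(union_schemas, value):
    """Naive union resolution: trial-match the value's keys against every
    candidate record type until one lines up."""
    for index, schema in enumerate(union_schemas):
        # building this set and comparing keys is dict/set work,
        # repeated for every candidate type in the union
        field_names = {f['name'] for f in schema['fields']}
        if set(value) == field_names:
            return index
    raise ValueError('no matching union member')
```

With hundreds of record types in the union, every single value encoded pays this trial-matching cost before any bytes are written.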
Apache-avro had one further issue, as shown in this example checking if a value matches an array:
False not in [validate(expected_schema.items, d) for d in datum]
In order to check whether any of the items in the list don't match the schema, it recursively checks every item in the list, and then searches the resulting list for False. It's far better here to stop the first time a False is encountered, rather than calculate everything. This pattern was used in multiple places, and added significant overhead.
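The short-circuiting fix is to use `all()` with a generator expression, which stops at the first failing item. A sketch with an illustrative `validate` callback (not apache-avro's real one):

```python
def validate_eager(validate, item_schema, datum):
    # The problematic pattern: validates every item, builds a full list,
    # then searches the list for False.
    return False not in [validate(item_schema, d) for d in datum]

def validate_lazy(validate, item_schema, datum):
    # Short-circuits: stops validating at the first failure.
    return all(validate(item_schema, d) for d in datum)

# Counting calls shows the difference when the first item fails:
calls = []

def is_int(schema, d):
    calls.append(d)
    return isinstance(d, int)

data = ['bad'] + list(range(1000))

validate_eager(is_int, 'int', data)
eager_calls = len(calls)   # all 1,001 items validated

calls.clear()
validate_lazy(is_int, 'int', data)
lazy_calls = len(calls)    # stops after the first item
```

When items are themselves records or arrays, each wasted `validate` call is recursive, which is how this pattern compounds into significant overhead.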
The Solution
How did our library manage to avoid all this? With several simple things:
- On parsing the schema, the library builds a class-based representation of the schema (like apache-avro does), and the classes contain the logic for encoding/decoding values. This way we can take advantage of traditional class-based dispatch patterns, which are far better optimized than if/else or dictionary-based dispatch.
- We map record types to actual Python classes, with slots for field values. This means that values can be passed in that have an explicit type associated with them (and we don't have to do dict-based attribute lookups). Now, when encoding a union schema, if the value is typed, the library can jump directly to the correct sub-schema type for that value, without having to recursively check each option.
With Cython being given structures that match simple C/C++-style structs and classes, it's able to produce optimized code that boils down to vtable-based dispatch, rather than lots of Python if/else statements. And compilers are really good at optimising code that uses vtables.
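A toy version of this shape (my own sketch, not cavro's actual internals): each schema node is a class with its own `encode` method, record types generate `__slots__` classes for values, and union dispatch becomes a direct jump keyed on the value's class:

```python
class SchemaNode:
    def encode(self, value):
        raise NotImplementedError

class IntSchema(SchemaNode):
    def encode(self, value):
        return ('int', value)    # real code would emit zig-zag varint bytes

class StringSchema(SchemaNode):
    def encode(self, value):
        return ('str', value)

class RecordSchema(SchemaNode):
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields     # list of (field_name, SchemaNode)
        # Generate a value class with __slots__: no per-instance __dict__,
        # so field access doesn't go through a hash lookup.
        self.record_class = type(name, (), {'__slots__': [f for f, _ in fields]})

    def encode(self, record):
        # One method call per field: class-based dispatch, not an if/elif chain.
        return [node.encode(getattr(record, fname)) for fname, node in self.fields]

class UnionSchema(SchemaNode):
    def __init__(self, members):
        # Typed values map straight to their union branch...
        self.by_class = {m.record_class: (i, m) for i, m in enumerate(members)}

    def encode(self, record):
        # ...so encoding jumps directly to the right sub-schema,
        # with no trial-matching of fields against dict keys.
        index, member = self.by_class[type(record)]
        return (index, member.encode(record))
```

Usage: build `RecordSchema`s, put them in a `UnionSchema`, instantiate `record_class` values, and encode; no matter how many hundreds of types are in the union, choosing the branch is a single type-keyed step.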
There's nothing novel or unique about this approach; it's just that by writing pythonic Python and using the tools carefully, we were able to outperform our Java peers on these metrics.
Current State of Affairs
In the intervening years, Python has been getting faster. The sorts of simple, repetitive attribute lookups described above, and similar patterns, have been optimized well. The apache-avro library has been slowly moving logic onto its schema class hierarchy, and the performance numbers have been improving a lot!
Unfortunately, at the moment there are several show-stopping bugs in the latest releases (AVRO-3843, AVRO-3842, AVRO-3834, and a few more), so things are a little unstable with the official library. And the overall overhead is still significantly higher than it needs to be.
Fastavro has been fairly consistently flat performance-wise; there are some edge-cases where it suffers from nasty performance cliffs that are hard to mitigate given its approach.
Unfortunately, the way fastavro is designed would make it tricky to adopt any of these performance improvements. It would be ideal to submit some PRs to make things faster, but the change would end up rewriting most of fastavro's internals, based on how schemas are handled. I wouldn't want to impose that on anyone.
Anyone working on fastavro who wants to collaborate or incorporate any of the differences, please feel free, and I'm always happy to discuss!
For people trying to read non-trivial avro data using Python, there still isn't (until now!) a great solution.
So I decided to write a new, separate library that uses this alternate approach:
Enter cavro
Cavro is a fresh implementation of an avro parsing library, using Cython and a class-based schema approach, as described above. It matches or outperforms both apache-avro and fastavro across the board. (If you find a situation where it's slower, please let me know!)
It's been developed to keep the hot path of encoding/decoding fast and efficient in all situations (including complex schema-promotion cases).
It's designed to support wide compatibility through extensive runtime options, and can actually pass 99.5% of both apache-avro's and fastavro's own internal unit-test suites (through the use of an unsupported compatibility shim, avro-compat, which does horrible things to make the cavro API look like the fastavro and apache-avro APIs). The remaining failing tests are either bugs in the original test cases, or situations where the shim is unable to emulate the library well enough.
Cavro has a crude benchmark test suite, designed to show approximate relative performance (not scientific measurements!): https://cavro.io/docs/benchmarks.
As an example of the relative speeds of the libraries: the most recent run has cavro at 12x faster than apache-avro, and 25x faster than fastavro. Different workloads have different relative performance, but cavro is consistently the fastest.
Cavro v1.0.0 has just been released, and should be pip-installable in any Python >= 3.8.
Try it out!
If this is at all relevant to you (I know Avro users are not that common!) please try out cavro, report any issues, make suggestions, and share it.
Github: https://github.com/cavro-py/cavro
Pip install: pip install cavro
Documentation: https://cavro.io/