
Modifying the Python object model


By Jake Edge
May 16, 2018
Python Language Summit

At the 2018 Python Language Summit, Carl Shapiro described some of the experiments that he and others at Instagram did to look at ways to improve the performance of the CPython interpreter. The talk was somewhat academic in tone and built on what has been learned in other dynamic languages over the years. By modifying the Python object model fairly substantially, they were able to roughly double the performance of the "classic" Richards benchmark.

Shapiro said that Instagram is a big user of Python and has been looking for ways to improve the performance of the CPython interpreter for its workloads. So the company started looking at the representation of data in the interpreter to see if there were gains to be made there. It wanted to stick with CPython in order to preserve the existing API, ecosystem, and developer experience.

[Carl Shapiro]

He is just a "casual user" of Python, but has a background in language implementations. He has worked on the Java virtual machine (JVM) and on garbage collection for Go, among other language work. He said he is "not afraid of dynamic languages". He started by looking at the Python runtime. His initial impression was that Python is relatively inefficient compared to other high-level languages. But there is nothing in Python that is so different from those other languages; it shares a lot of common operations with other language runtimes. However, Python has never evolved into a faster implementation as has happened with other languages (e.g. V8 for JavaScript, HotSpot for Java).

He did some data collection with a heavily instrumented CPython interpreter, comparing different kinds of workloads as well as how other languages performed on those workloads. He also ran the modified interpreter against real traffic data gathered from Instagram. The tests gathered counts of the bytecodes used, as well as the CPU instructions executed to handle those bytecodes.

To a first approximation, the breakdown of which bytecodes were used the most tracks closely with other languages, which is what would be expected. Roughly 20% of the bytecodes executed are either CALL_FUNCTION or LOAD_ATTR, which are used to call functions and retrieve object attributes, as the names would imply. But handling those bytecodes required nearly 50% of the CPU instructions that were executed. Those two opcodes turn into hundreds of CPU instructions each (the average was 498 instructions for CALL_FUNCTION and 240 for LOAD_ATTR). Those counts are much higher than the instruction counts for the highly optimized versions in other languages.
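For illustration, those opcodes are easy to see with the standard dis module (opcode names here are from CPython 3.6; later versions rename some of them):

    import dis

    def greet(obj):
        # obj.name compiles to LOAD_ATTR; len(...) to CALL_FUNCTION
        return len(obj.name)

    # The disassembly shows LOAD_GLOBAL (len), LOAD_FAST (obj),
    # LOAD_ATTR (name), CALL_FUNCTION, and RETURN_VALUE.
    dis.dis(greet)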

In addition, when Python is dispatching to a method call, 85% of the time there is only one type involved at a given call site. The interpreter is set up to handle generic method dispatch, but a cache of the most recently used method will work in the vast majority of cases. That percentage rises to 97% if you cache four type/method pairs for the call site. It is not uncommon for high-level languages to have different strategies for a single type versus either four or eight types; call sites that fall outside of those constraints are handled a third way, he said.
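A rough sketch in pure Python of what such a monomorphic cache does (the experiment implemented this at the C level; the CallSiteCache class here is purely illustrative):

    class CallSiteCache:
        """Sketch of a monomorphic inline cache for method lookup."""
        def __init__(self, name):
            self.name = name           # method name at this call site
            self.cached_type = None    # last receiver type seen here
            self.cached_method = None  # looked-up method for that type

        def call(self, obj, *args):
            tp = type(obj)
            if tp is not self.cached_type:   # miss: do the full lookup
                self.cached_method = getattr(tp, self.name)
                self.cached_type = tp
            return self.cached_method(obj, *args)

    # With 85% of call sites seeing a single type, the fast path
    # (one identity check) handles the vast majority of calls.
    site = CallSiteCache("upper")
    print(site.call("hello"))   # 'HELLO'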

Beyond that, the comparison and binary operation implementations were more general than needed by the vast majority of the calls in the Instagram workload. Comparisons and binary operations are normally done on the built-in types (int, dict, str), rather than on user-defined classes. But the code to handle those operations does a lot of extra work to accommodate the dynamic types that are rarely used.
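A simplified sketch of the contrast: the generic binary-operation protocol must consider both operands' types and their __add__()/__radd__() implementations, while a type-specialized fast path can skip almost all of that. The generic_add()/fast_add() functions below are illustrative only, not CPython's actual algorithm, which also handles subclass precedence and more:

    def generic_add(a, b):
        # Simplified sketch of the full protocol.
        add = getattr(type(a), "__add__", None)
        result = add(a, b) if add else NotImplemented
        if result is NotImplemented and type(b) is not type(a):
            radd = getattr(type(b), "__radd__", None)
            result = radd(b, a) if radd else NotImplemented
        if result is NotImplemented:
            raise TypeError("unsupported operand types")
        return result

    def fast_add(a, b):
        # Fast path for the overwhelmingly common case: two ints.
        if type(a) is int and type(b) is int:
            return int.__add__(a, b)   # no generic dispatch needed
        return generic_add(a, b)       # rare case: the full protocol

    print(fast_add(2, 3))       # 5, via the fast path
    print(fast_add("a", "b"))   # 'ab', via the generic protocol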

Some of what Shapiro presented did not sit well with Guido van Rossum, who loudly objected to Shapiro's tone, which was condescending, he said. Van Rossum thought that Shapiro did not really mean to be condescending, but that was how he came across and it was not appreciated. The presentation made it sound like Shapiro and his colleagues were the first to think about these issues and to recognize the inefficiencies, but that is not the case. Shapiro was momentarily flustered by the outburst and its vehemence, but got back on track fairly quickly.

Shapiro's overall point was that he felt Python sacrificed its performance for flexibility and generality, but the dynamic features are typically not used heavily in performance-sensitive production workloads. So he believes it makes sense to optimize for the common case at the expense of the less-common cases. But Shapiro may not be aware that the Python core developers have often preferred simpler, more understandable code that is easier to read and follow, over more complex algorithms and data structures in the interpreter. Some performance may well have been sacrificed for readability.

Continuing, Shapiro said that the statistic he found most interesting in the gathered data concerned object construction and "monkey patching" (e.g. adding new attributes to an object after its creation). The instrumented interpreter found that 70% of objects have all of their attributes set in the object's __init__() method. Another percent or so have attributes added from the function where the object is created. Much of the rest were instances of a single class that is frequently used at Instagram. The implication is that adding object attributes post-initialization is actually not frequent.
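"Monkey patching" in this sense is easy to demonstrate; comparing an instance's __dict__ just after construction with its later state shows the pattern the instrumentation was counting:

    class Point:
        def __init__(self, x, y):
            self.x = x        # set during __init__(): the 70% case
            self.y = y

    p = Point(1, 2)
    print(set(p.__dict__))    # {'x', 'y'}

    p.label = "origin"        # added post-initialization: the rare case
    print(set(p.__dict__))    # {'x', 'y', 'label'}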

In keeping with what other high-level languages have found, Python code does not use the dynamic features of the language all that much. But the data structures used in the interpreter overemphasize those dynamic features that are not widely used, he said. Better native code generation, which is a common place to look for better performance, does not directly address the performance of data structures that are not optimized for the 90% case. What would the object model look like if it were optimized for the most frequent operations, and how much performance would be gained?

A method lookup and call takes two orders of magnitude more instructions than he would like to see. So instead of optimizing the call frames for environment introspection (for stack traces and the like), as they are today, he proposed optimizing for call speed. Also, the garbage collector is reference-count based, which has poor locality. Reference counting is part of the C API, though, so it must be maintained for C extensions; in the experiment, the Instagram developers moved to a tracing garbage collector everywhere else.
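Those reference counts are visible from Python itself; every aliasing assignment writes to the object's header, which is the kind of memory traffic a tracing collector avoids (sys.getrefcount() is CPython-specific and reports one extra reference for its own argument):

    import sys

    obj = object()
    print(sys.getrefcount(obj))   # 2: 'obj' plus getrefcount's argument

    alias = obj                   # a plain assignment bumps the count
    print(sys.getrefcount(obj))   # 3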

The representation of an instance in CPython has a hash table for attributes that is optimized for adding and removing attributes. Since that is not done that often in practice, flattening the representation to an array (with an overflow hash table for additional attributes) was tried. The experiments also featured aggressive caching for attributes.
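A pure-Python sketch of the flattened representation (the experiment did this at the C-struct level; the FlatInstance class and its LAYOUT table are illustrative only): attributes known to the class get fixed array slots, and anything else overflows into a small hash table:

    class FlatInstance:
        """Sketch: fixed array slots plus an overflow table."""
        LAYOUT = {"x": 0, "y": 1}   # attribute name -> slot index

        def __init__(self):
            object.__setattr__(self, "_slots", [None] * len(self.LAYOUT))
            object.__setattr__(self, "_overflow", {})

        def __setattr__(self, name, value):
            idx = self.LAYOUT.get(name)
            if idx is not None:
                self._slots[idx] = value      # common case: array store
            else:
                self._overflow[name] = value  # rare case: hash table

        def __getattr__(self, name):   # called when normal lookup fails
            idx = self.LAYOUT.get(name)
            if idx is not None:
                return self._slots[idx]
            try:
                return self._overflow[name]
            except KeyError:
                raise AttributeError(name) from None

    obj = FlatInstance()
    obj.x = 10            # lands in the array
    obj.color = "red"     # lands in the overflow table
    print(obj.x, obj.color)

CPython's own __slots__ feature provides a similar fixed layout, though without the overflow table for unexpected attributes.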

The changes made to the interpreter were all at the data-structure level; they were conservative changes, overall, Shapiro said. The developers did not go out of their way to optimize any of the changes they made, so there is still a "lot of headroom" in the implementation. They raced the code against CPython 3.6 on the Richards benchmark. The caching for attributes and for methods was the most powerful optimization; each cache cut the run time roughly in half. Without them, the experimental interpreter started out slower, at 3.62 seconds versus CPython's 1.32 seconds; attribute caching dropped that to 1.8 seconds and adding method caching brought it to 0.72 seconds. Other optimizations had a smaller effect, but did get the run time down to 0.65 seconds.
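The benefit of such caching is easy to approximate even from pure Python, by hoisting the attribute lookup out of the call (a micro-benchmark sketch, not the experiment's methodology):

    import timeit

    # The pattern dominated by LOAD_ATTR + CALL_FUNCTION, versus the
    # same call with the lookup already cached in a local variable.
    setup = "s = 'hello'; up = s.upper"
    print(timeit.timeit("s.upper()", setup=setup))  # lookup + call
    print(timeit.timeit("up()", setup=setup))       # call only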

He concluded that the current object data model seems to benefit the uncommon cases more than the common cases—at least for the Instagram workload. He wondered if others had seen that also. There has been a lot of effort on compilation to native code over the years, but just changing the object model provided some big wins; perhaps that is where the efforts should be focused.

Eric Snow asked whether the dictionary versioning was being used to invalidate caches. Shapiro said that the experimental interpreter relies less on dictionaries than CPython does. The instances are now arrays instead of dictionaries, but there is an equivalent signal so that the cache entries can be invalidated. Thomas Wouters asked if he had looked at PyPy. Shapiro said the company had, but there was only a modest bump in performance for its workload. He was not the one who did the work, however. Wouters noted that PyPy is more than "Python with a JIT" because it has its own data model as well.
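The versioning Snow referred to is PEP 509's per-dictionary version tag, added in CPython 3.6. A pure-Python sketch of the idea (the VersionedDict class is illustrative and incomplete; a real implementation would also cover update(), pop(), and the other mutating methods):

    class VersionedDict(dict):
        """Sketch: bump a version tag on every mutation, so cached
        lookups can be validated with one integer compare."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0

        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1

        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    ns = VersionedDict(x=1)
    cached_version, cached_value = ns.version, ns["x"]
    # ... later, the cache entry is valid only if the tag is unchanged:
    if ns.version == cached_version:
        print("cache hit:", cached_value)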

Mark Shannon said that Python 3.7 has added a feature that should provide a boost similar to the method-lookup caching used in the experiment. Shapiro said he had looked at those changes but still believed his proposed mechanism would provide more benefit. An attribute lookup still requires five lookups in CPython, while it is only one lookup in the experimental version. Shannon did not sound entirely convinced of that, however.
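The chain of lookups in question can be sketched in Python (simplified; real CPython checks data descriptors before the instance dictionary, and also handles metaclasses and __getattr__() hooks — the lookup_attribute() function here is illustrative only):

    def lookup_attribute(obj, name):
        """Simplified sketch of generic attribute lookup."""
        d = getattr(obj, "__dict__", {})
        if name in d:                        # probe 1: instance dict
            return d[name]
        for klass in type(obj).__mro__:      # one probe per class
            if name in klass.__dict__:
                attr = klass.__dict__[name]
                if hasattr(attr, "__get__"): # bind methods/descriptors
                    return attr.__get__(obj, type(obj))
                return attr
        raise AttributeError(name)

    class A:
        def hello(self):
            return "hi"

    print(lookup_attribute(A(), "hello")())  # 'hi'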

These changes are not "all or nothing", Shapiro said. The experiment showed that there is a lot of headroom in the data structures themselves. Some parts of the changes could be adopted if they proved compelling. One attendee asked how serious Facebook/Instagram is about getting some or all of the changes into CPython. Shapiro said that the company has not released the code for the experiment (yet, though it was a bit unclear when that might change) but that it did try to feed its changes back upstream. It does not want to have a fork of CPython that it runs in production.


Index entries for this article
Conference: Python Language Summit/2018
Python: Object model



Modifying the Python object model

Posted May 16, 2018 23:17 UTC (Wed) by roc (subscriber, #30627)

> the garbage collector is reference-count based, which has poor locality

This is counter-intuitive. Generally speaking, reference counting has better data-cache locality than tracing GC, because you normally read or write an object's reference count around the same time as you touch the object for other reasons, whereas tracing GC must periodically scan all live objects whether they have recently been used or not.

Compacting or generational GC can improve locality by improving the compactness of allocations, but I would be surprised if Instagram was able to fit that into CPython.

So I wonder what's behind this statement.

Modifying the Python object model

Posted May 16, 2018 23:39 UTC (Wed) by malefic (guest, #37306)

They probably meant the increased memory fragmentation that results from keeping a refcount in every object structure.

Modifying the Python object model

Posted May 17, 2018 1:04 UTC (Thu) by roc (subscriber, #30627)

Refcounts may increase memory usage (but not by much usually), but won't increase fragmentation per se. Unless Python does something really strange.

Modifying the Python object model

Posted May 17, 2018 1:15 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

You need interlocked operations for refcounts if you want multithreading. This causes merry hell with cache lines bouncing around.

Modifying the Python object model

Posted May 17, 2018 10:06 UTC (Thu) by roc (subscriber, #30627)

Sure, but the Python interpreter can't share objects between threads and the article doesn't say anything about changing that.

Modifying the Python object model

Posted May 18, 2018 14:59 UTC (Fri) by mb (subscriber, #50428)

Of course objects can be shared. That is the whole point of threads.

Modifying the Python object model

Posted May 18, 2018 22:00 UTC (Fri) by roc (subscriber, #30627)

OK that's true, but all access to Python objects is protected by the GIL so there is no need to use atomic operations to modify the refcount. (I should have said that objects can't be shared by concurrently executing threads.)

Modifying the Python object model

Posted May 20, 2018 5:20 UTC (Sun) by peterfeiner (subscriber, #97524)

>> the garbage collector is reference-count based, which has poor locality
> This is counter-intuitive. Generally speaking, reference counting has better data-cache locality than tracing GC, because you normally read or write an object's reference count around the same time as you touch the object for other reasons, whereas tracing GC must periodically scan all live objects whether they have recently been used or not.

I think the problem is when the last reference goes away. An object isn't being used anymore, yet its (first) cacheline is made hot. This is bad because the problem cascades when a whole object graph is freed. Lots of disparate cachelines getting fetched just to do bookkeeping.

Modifying the Python object model

Posted May 20, 2018 6:33 UTC (Sun) by njs (subscriber, #40338)

It may have been a garbled reference to the way refcounting causes memory write traffic all over the place. Workloads like Instagram's are often memory-bound, and they rely heavily on starting up a parent process, loading in their main code and data, and then forking it to get lots of workers that all share most of their data through CoW. But writes from refcounts and marking can break the CoW and create unnecessarily high memory usage. (They used to disable the GC entirely, and then later contributed a feature to Python to mark their shared objects as immortal so the GC would skip marking them.)

[1] https://instagram-engineering.com/dismissing-python-garba...
[2] https://bugs.python.org/issue31558
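A minimal sketch of the pre-fork pattern (gc.freeze() is the API that landed in Python 3.7, per [2]; the shared_config name is just a stand-in for real application state):

    import gc
    import os

    # Parent: load the shared code and data once...
    shared_config = {"feature_flags": {"new_ui": True}}

    # ...then move everything tracked by the cyclic GC into a permanent
    # generation that collections never traverse. Marking no longer
    # dirties those pages (refcount writes still can).
    gc.freeze()

    pid = os.fork()   # POSIX-only; workers share pages via CoW
    if pid == 0:
        # Child worker: serves requests against shared_config.
        os._exit(0)
    os.waitpid(pid, 0)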

Modifying the Python object model

Posted May 20, 2018 18:21 UTC (Sun) by rweikusat2 (subscriber, #117920)

The extremely amusing thing here is obviously that they disabled the additional, cycle-breaking tracing collector because it was causing their problems :->.

Modifying the Python object model

Posted May 24, 2018 8:36 UTC (Thu) by njs (subscriber, #40338)

Indeed! That's part of why I'm uncertain and just guessing :-). But it is perhaps easier to fix the scattered memory writes problem with a tracing collector than with refcounting.

Or heck, maybe he meant he wanted a compacting collector... that would certainly improve memory locality in some cases.

Modifying the Python object model

Posted May 17, 2018 15:55 UTC (Thu) by excors (subscriber, #95769)

> Shapiro may not be aware that the Python core developers have often preferred simpler, more understandable code that is easier to read and follow, over more complex algorithms and data structures in the interpreter. Some performance may well have been sacrificed for readability.

Surely the point of writing readable code is to make it easy to come back later and modify it, not to just put it in a glass case and admire its static perfection. But if you adhere so strongly to readability that you reject any attempts to improve the code out of fear they will increase its complexity, then you've destroyed the benefits of that readability.

It's kind of the opposite of technical debt - you've spent a lot of effort building up technical credit by keeping the code readable, but that effort is wasted unless you eventually spend the credit on valuable features or performance. Obviously there's a danger of spending too much and going back into debt, but you shouldn't ignore the danger of spending too little.

Modifying the Python object model

Posted May 19, 2018 12:17 UTC (Sat) by quotemstr (subscriber, #45331)

Right. One can come up with an infinity of good-sounding excuses for what's really a "culture of no" and a pervasive sense of stasis and paralysis. I applaud this work on the Python interpreter and hope the cpython core adopts at least some of it. Lookup caching is a huge win.

Modifying the Python object model

Posted May 19, 2018 20:42 UTC (Sat) by robert_s (subscriber, #42402)

I really don't understand why he didn't perform his analysis on PyPy, where any performance "quick wins" he found would have been lapped up. The two projects have different goals.

Modifying the Python object model

Posted May 21, 2018 3:38 UTC (Mon) by gps (subscriber, #45638)

PyPy is not always feasible for large existing CPython applications because they rely on the CPython API for tons of C/C++ extensions. The PyPy C API emulation layer, even when up to the task, kills performance and increases PyPy's already high memory overhead.

YouTube investigated using PyPy. The experiments concluded that it just wasn't going to be a win, compute-resource-wise, even before counting the additional "different runtime" long-term maintenance burden it would require.

PyPy has a lot of good ideas. It is a great place to experiment with fun internals changes but isn't practical for many things not already written to use it from the start.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds