Evaluating Python Implementations

This section describes the composition of the measured program suites. An in-depth description of the contents of each suite can be found here. The test battery is composed of four levels of programs, ordered by complexity. Most of the programs in each level were taken from existing benchmarking suites and programs. Programs that use third-party non-standard Python libraries or GUI toolkits were not considered, to maximize the number of Python implementations able to execute them. For the same reason, only standard Python code is measured, without any implementation-specific language extension.

A very small fraction of the measured programs were created to test language characteristics that the existing suites did not cover. These custom tests follow the structure of existing benchmark suites. All programs have been stripped of their screen output code to obtain more accurate performance results. When a program needs user input to operate, it has been modified to simulate that input, so that no user interaction is required for it to run. The four levels of programs are:

  1. Microbenchmarks
  2. Benchmarks
  3. Programs
  4. Large Scale Applications
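
The input simulation described above can be sketched as follows. This is only an illustration, not the harness used for the measurements: `run_without_console` and `sample_program` are hypothetical names, and Python 3 is assumed (where the standard input function is `input`). Note also that the measured programs had their output code removed, whereas this sketch redirects output to an in-memory sink, which is a different technique with a similar effect.

```python
import builtins
import contextlib
import io


def run_without_console(program, canned_answers):
    """Run `program` with screen output discarded and user input simulated.

    Hypothetical helper for illustration: `canned_answers` holds the strings
    the program would otherwise ask the user to type.
    """
    answers = iter(canned_answers)
    original_input = builtins.input
    # Every call to input() now returns the next canned answer.
    builtins.input = lambda prompt="": next(answers)
    try:
        # Anything the program prints goes to an in-memory sink.
        with contextlib.redirect_stdout(io.StringIO()):
            return program()
    finally:
        # Always restore the real input function.
        builtins.input = original_input


def sample_program():
    # Stand-in for a measured program that prompts the user and prints.
    n = int(input("How many terms? "))
    total = sum(range(n))
    print("Total:", total)
    return total


result = run_without_console(sample_program, ["10"])
```

With this approach the program body stays untouched, at the cost of still executing the `print` calls; removing them, as was done for the measured programs, avoids that overhead entirely.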


Microbenchmarks

Small programs that deal with only one language characteristic (integer arithmetic, double arithmetic, etc.). They are used to compare the performance of Python implementations with respect to that particular characteristic alone. Existing benchmarks were used whenever possible although, as noted above, some had to be developed to test certain characteristics, especially metaprogramming ones. Several microbenchmarks were ported from other languages with the help of Java2Python. In total, we measured 309 different microbenchmarks.
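
As a rough illustration of what a microbenchmark of this kind looks like, the sketch below times two loops that each exercise a single characteristic. It is not taken from any of the suites listed here; the function names, iteration counts and the choice of the standard `timeit` module are assumptions made for the example.

```python
import timeit


def integer_arithmetic(iterations=100_000):
    """Exercise only integer arithmetic."""
    acc = 1
    for i in range(1, iterations):
        acc = (acc + i * 3 - i // 2) % 1_000_003
    return acc


def double_arithmetic(iterations=100_000):
    """Exercise only floating-point arithmetic."""
    acc = 1.0
    for i in range(1, iterations):
        acc = (acc + i * 0.5) / 1.0001
    return acc


def measure(func, repeat=5):
    """Return the best wall-clock time of `repeat` single runs, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


# One timing per characteristic, so results can be compared separately.
timings = {
    "integer arithmetic": measure(integer_arithmetic),
    "double arithmetic": measure(double_arithmetic),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Keeping one function per characteristic means that a slow result can be attributed to that characteristic alone, which is the point of this level of the test battery.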

Microbenchmark suite Description

A Python translation of the Tommti benchmarking suite. This suite is composed of 13 microbenchmarks that measure typical language characteristics such as arithmetic operations and data structure usage, among others.


Pybench is a collection of 15 files comprising 57 microbenchmarks that provides a standardized way to measure the performance of Python implementations. It measures different aspects of the Python programming language, which gives a separate performance figure for each characteristic instead of a single overall performance estimate.

Java Grande

A Python translation of the Java Grande benchmarks. The purpose of this benchmark suite is to measure and compare alternative Java execution environments in ways that are important to Grande applications. A Grande application is one that uses large amounts of processing, I/O, network bandwidth, or memory; examples include applications in science and engineering. The three-layered structure of increasingly complex programs in this suite fits our testing layout well, so we translated the suite to Python and assigned its layers to the corresponding parts of our test battery. We used the sequential version of these benchmarks. In this part, only section 1 of the benchmarks (low-level operations such as arithmetic and math library operations, method calls, casting, etc.) was used, comprising 9 code files that execute 83 different microbenchmarks.

PyPy translator microbenchmarks

The PyPy source code distribution includes 10 code files that hold up to 82 microbenchmarks. These test individual programming language characteristics in the same way as the other microbenchmarks we used.

Custom Characteristics

We developed some microbenchmarks to measure programming language characteristics that we considered were not adequately covered by the other microbenchmarks we used. In total, we developed 24 microbenchmarks:

  • Process execution: 4 microbenchmarks that test typical process handling primitives.
  • Thread execution: 3 microbenchmarks that test typical thread handling primitives.
  • Bit operations: 5 microbenchmarks that cover bit arithmetic operations in the same way that integer or floating-point arithmetic is covered by other benchmarks.
  • Object comparison: 2 microbenchmarks that test object comparison by state.
  • Complex data type arithmetic: 3 microbenchmarks that cover arithmetic operations with the Complex data type in the same way that integer or floating-point arithmetic is covered by other benchmarks.
  • Rational data type arithmetic: 3 microbenchmarks that cover arithmetic operations with the Rational data type in the same way that integer or floating-point arithmetic is covered by other benchmarks.
  • Complementary data access scenario: 1 microbenchmark that covers specific variable read and write operations that are not tested by other benchmarks. Its structure is similar to that of the Tommti benchmarks.
  • Set management: 1 microbenchmark that covers set management.
  • Complementary String management use cases: 2 microbenchmarks that test two string handling use cases not covered by other testing suites.
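
The kinds of operations named in the list above can be illustrated with a short sketch. This is not the code of the custom microbenchmarks; all names are hypothetical, and it assumes that the Rational data type corresponds to `fractions.Fraction` from the standard library.

```python
from fractions import Fraction


class Point:
    """Instances compare by state (their coordinates), not by identity."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Keep hashing consistent with the state-based __eq__.
        return hash((self.x, self.y))


def bit_operations(n=1000):
    """Shifts, AND, OR and XOR on integers."""
    acc = 0
    for i in range(n):
        acc = (acc ^ ((i << 3) | (i >> 1))) & 0xFFFF
    return acc


def complex_arithmetic(n=1000):
    """Addition, multiplication and division on the built-in complex type."""
    acc = 0j
    step = complex(1, 2)
    for i in range(1, n):
        acc += step * i / complex(i, 1)
    return acc


def rational_arithmetic(n=100):
    """Exact arithmetic with fractions.Fraction (assumed rational type)."""
    acc = Fraction(0)
    for i in range(1, n + 1):
        acc += Fraction(1, i * (i + 1))
    return acc


def set_management(n=1000):
    """Set construction, union, intersection and a membership test."""
    evens = set(range(0, n, 2))
    threes = set(range(0, n, 3))
    return len(evens | threes), len(evens & threes), (6 in evens & threes)


same_state = Point(1, 2) == Point(1, 2)    # equal by state
same_object = Point(1, 2) is Point(1, 2)   # but two distinct objects
```

Each function could then be timed on its own, in the same way as any other microbenchmark.
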
Functional Programming

These benchmarks cover typical functional programming usage scenarios. We developed 7 microbenchmarks following the same structure as existing benchmarking suites. These benchmarks cover:

  • Lambda function creation.
  • Higher-order function usage.
  • Lambda function invocation.
  • Currying.
  • map function.
  • filter function.
  • reduce function.
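
The seven scenarios listed above correspond to constructs like the ones in this sketch. The code is illustrative only and is not taken from the benchmark suite; the function names are hypothetical. It assumes Python 3, where `reduce` lives in `functools` (in Python 2 it is a built-in).

```python
from functools import reduce


def curry_add(a):
    """Curried three-argument addition: curry_add(1)(2)(3) == 6."""
    return lambda b: lambda c: a + b + c


def apply_twice(func, value):
    """Higher-order function: receives another function as an argument."""
    return func(func(value))


def lambda_creation(n=1000):
    # Creates a fresh lambda object on every iteration.
    return [lambda x, k=k: x + k for k in range(n)]


def lambda_invocation(n=1000):
    # Creates one lambda and invokes it repeatedly.
    increment = lambda x: x + 1
    value = 0
    for _ in range(n):
        value = increment(value)
    return value


numbers = range(1, 11)
squares = list(map(lambda x: x * x, numbers))
evens = list(filter(lambda x: x % 2 == 0, numbers))
total = reduce(lambda a, b: a + b, numbers)
```

Separating lambda creation from lambda invocation matters because an implementation may be fast at one and slow at the other.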

Based on the work presented in "How (and why) developers use the dynamic features of programming languages: the case of Smalltalk" (Oscar Callaú, Romain Robbes, Éric Tanter, David Röthlisberger; Empirical Software Engineering, DOI 10.1007/s10664-012-9203-2, 2012) and on our past research ("A hybrid class- and prototype-based object model to support language-neutral structural intercession"; Francisco Ortin, Miguel A. Labrador, Jose M. Redondo; Information and Software Technology, Volume 56, Issue 2, February 2014, Pages 199-219), we developed a benchmarking suite composed of 46 microbenchmarks that covers the metaprogramming characteristics of the Python language, divided into four layers:

  • Introspection features (8 microbenchmarks)
    • Dynamic class lookup (1 microbenchmark)
    • Dynamic method invocation (3 microbenchmarks)
    • Dynamic access to attributes and variables (4 microbenchmarks)
  • Intercession features (24 microbenchmarks)
    • Dynamic class creation (4 microbenchmarks)
    • Add attributes to instances and classes (8 microbenchmarks)
    • Delete attributes from instances and classes (4 microbenchmarks)
    • Add methods to instances and classes (4 microbenchmarks)
    • Delete methods from instances and classes (2 microbenchmarks)
    • Dynamic inheritance usage scenarios (2 microbenchmarks)
  • Computational reflection features (7 microbenchmarks)
    • __getattr__ usage (2 microbenchmarks)
    • __setattr__ usage (2 microbenchmarks)
    • hasattr usage (2 microbenchmarks)
    • __call__ usage (1 microbenchmark)
  • Dynamic code evaluation features (eval, exec, …) (4 microbenchmarks)


Benchmarks

This group of programs contains a compendium of popular open source benchmarks that are either coded in Python or translated from other programming languages. Each benchmark exercises several language characteristics at once to achieve its final result. So, while they are not "real" programs per se, they allow us to test several characteristics in a more realistic environment. We used 61 benchmarks in our measurements. These are:

Benchmark suite Description
Unladen Swallow benchmarks

Unladen Swallow was an optimization branch of CPython, intended to be fully compatible and significantly faster. The project was abandoned after failing to reach its performance milestones, but its source code is still available and contains a suite of performance programs. This suite is composed of benchmarking programs and tests that use commercial products. We included 24 of these benchmarks in this group.

The Computer Language Benchmarks Game (CLSB)

This is a compendium of benchmarking programs of different kinds that have a Python version. They are freely available. For our tests we used 10 of these benchmarks.


Parrot is a virtual machine designed to efficiently compile and execute bytecode for dynamic languages. Parrot currently hosts a variety of language implementations in various stages of completion, including Tcl, JavaScript, Ruby, Lua, Scheme, PHP, Python, Perl 6, APL, and a .NET bytecode translator. Its distribution comes with 7 benchmarks that cover several dynamic language features, including metaprogramming. We adapted these benchmarks to be used in our measurements.

Jolden - TSP

The popular Jolden benchmark suite has a Java version. We translated the TSP benchmark from this suite for use in our measurements.

Java Grande

As described previously, the Java Grande benchmarks are composed of 3 sections of increasingly complex benchmarks. We translated 6 benchmarks from section 2 and 3 benchmarks from section 3 of Java Grande to be included in this group.

Crypto - AES

As cryptography algorithms are commonly used as benchmarking tools, we used a Python implementation of AES (Advanced Encryption Standard) by Josh Davis as a benchmark.

Crypto - MD5

For the same reason, a Python implementation of the MD5 hashing algorithm by Dinu C. Gherman was also used as a benchmark.

Intercessive Pybench

These are 3 benchmarks that use intercession, developed as part of our previous research.

Dynamic inheritance

These are 4 benchmarks that use dynamic inheritance, developed as part of our previous research.


This benchmark is the Python version of the Dhrystone benchmark and is commonly used to compare different implementations of the Python programming language. Pystone is included in the standard CPython distribution.


Programs

This group is composed of applications created for a specific and realistic purpose: scientific calculation, solving popular problems, data manipulation and parsing, etc. These programs serve very different purposes and have different sizes, reflecting the complexity of the problem each one solves. The programs in this group were not designed as benchmarks, but as Python applications that, regardless of their size, were written with a realistic purpose in mind.

To obtain programs that fit in this category, we used the program collection provided by the Shed Skin Python compiler distribution. Shed Skin is an experimental compiler that can translate pure, but implicitly statically typed, Python programs into optimized C++. Although Shed Skin is promising, it currently lacks support for certain Python language features, so we decided not to include it in our measurements yet. However, the example library it provides is excellent and suits our purposes.

The Shed Skin program collection is composed of a wide variety of programs of different kinds and by different authors. We adapted most of them for our measurements. The following table describes each of the 54 programs we used:

Program Description
Adatron Support Vector Machine

An Adatron Support Vector Machine with polynomial kernel placed in the public domain by Stavros Korokithakis.

Ambient Occlusion Renderer

This program is an ambient occlusion renderer written by Syoyo Fujita.

Ant colony optimization

This program generates a random array of distances between cities, then uses Ant Colony Optimization to find a short path traversing all the cities (the Travelling Salesman Problem). By Eric Rollins.

Arithmetic compressor encoder and decoder

Arithmetic coding compressor and decompressor for binary data by David MacKay.


A Python implementation of the BH Olden benchmark. It implements the Barnes-Hut benchmark described in: A hierarchical O(N log N) force-calculation algorithm; Barnes, J., Hut, P.; Nature, 324:446-449, Dec. 1986.

Block compression

A Huffman block compression algorithm by David MacKay

Brainfuck interpreter

A Brainfuck programming language interpreter by Philippe Biondi.

Chaosgame fractals

This program creates chaosgame-like fractals. By Carl Friedrich Bolz


A simple chess-like speed test program written in Python by Jyrki Alakuijala.

Color Patterns Shells

This program implements models for simulating the color patterns on the shells of mollusks.

Connect four game

This program implements the connect four (also known as four-in-a-row) game.

Conway game of life

This program is an implementation of the Game of Life, a cellular automaton devised by the British mathematician John Horton Conway. Its author is Francesco Frassinelli


A program that uses the Dijkstra shortest-distance algorithm, by Gustavo J.A.M. Carneiro.

Dijkstra Bidirectional

Bidirectional Dijkstra search algorithm extracted from the NetworkX program.

Genetic algorithm

A genetic algorithm implementation

Genetic algorithm 2

Another genetic algorithm implementation by Stavros Korokithakis


An implementation of the popular Go game by Mark Dufour

Hq2x filter

Python adaptation of the Hq2x filter demo program by Maxim Stepin

Jpeg decoder

A JPEG decoding program by Tong Lin

Kanoodle puzzle

A program that finds solutions to the popular kanoodle puzzle game by David Austin

Lempel-Ziv compression

A program that performs data compression using the Lempel-Ziv code, by David MacKay.

Linear algebra

A program that uses various linear algebra operations by Mladen Bestvina

Loop nodes

A program that performs loop recognition in a node graph by Leonardo Maffi


Program that generates and serializes Mandelbrot fractals by Tony Veijalainen


An implementation of the Mastermind game, by Sean McCarthy


Another implementation of the Mastermind game by Leonardo Maffi

Maze solver

A random maze generator/solver by ActiveState.

Minimal global illumination renderer

This program is a minimal global illumination renderer (minilight) written by Harrison Ainsworth (HXA7241) and Juraj Sukop.

Neural network

This program implements a back-propagation neural network. Its author is Neil Schemenauer.

Othello game

This program is an implementation of the Othello (also named Reversi or Yang) game by Mark Dufour

Path tracer

This program is an implementation of path tracing, a way of solving the rendering equation using Monte Carlo integration. It consists of a Python port of the work of Jonas Wagner.

Primes by sieve of Atkin

This program computes a finite list of prime numbers that are smaller than a predefined limit by using the sieve of Atkin. Its author is Steve Krenzel

Probabilistic linear context-free rewriting systems parser for natural language

This program is a natural language parser for PLCFRS (probabilistic linear context-free rewriting systems), an extension of context-free grammar which rewrites tuples of strings instead of strings. Its author is Andreas van Cranenburgh

RGB Converter

Conversion functions between RGB and other color systems

Richards task management

A Python translation by Mario Wolczko that implements a task management algorithm originally written by Dr. Martin Richards


This is a Python implementation of the Rsync algorithm: The rsync algorithm, Tridgell A., Mackerras, P.; Technical Report TR-CS-96-05, Canberra 0200 ACT, Australia, 1996.

Rubik solver

This program implements a Rubik's cube solver algorithm. It was adapted by Mark Dufour.

Rubik solver 2

Another Rubik's cube solver using Thistlethwaite's algorithm, originally implemented in C++ by Stefan Pochmann and translated to Python by Mark Dufour.

Satisfiability solver

This program implements a satisfiability solver. Its author is Mark Dufour


This program is a Python implementation of the SHA-1 algorithm by J. Hallén and L. Creighton.

Simple mandelbrot

This is an implementation of the Mandelbrot fractal by Daniel Rosengren


An implementation of the Sokoban game

Solitaire encryption

Python implementation of Bruce Schneier's Solitaire Encryption algorithm by John Dell'Aquila.

Sudoku solvers

We included 5 different Sudoku solvers by various authors.

Tic Tac Toe

An implementation of the Tic Tac Toe game by Peter Goodspeed


This program computes a Voronoi diagram, which is a way of dividing space into a number of regions. A set of points (called seeds) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. The regions are called Voronoi cells.

Yopyra ray tracer

This program is a complete Python raytracer by Carlos Gonzalez Morcillo. It was adapted from the examples included in the Shed Skin experimental Python implementation.

Another block of test programs comes from the BioPython suite. BioPython is a set of freely available tools for biological computation written in Python. Its aim is to develop Python libraries and applications that address the needs of work in bioinformatics. The suite includes a large number of test programs that exercise its different functionalities through unit testing, and therefore make use of metaprogramming features during their execution. We adapted 85 of these programs and separated them into individual programs for our measurements.
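
To show why unit tests exercise metaprogramming features, the sketch below uses the standard `unittest` module, which finds test methods by inspecting a class at run time. It is not BioPython code: `SequenceTests` and its methods are hypothetical stand-ins for one test program of a larger suite.

```python
import io
import unittest


class SequenceTests(unittest.TestCase):
    """Hypothetical stand-in for one test program of a larger suite."""

    def test_reverse(self):
        self.assertEqual("ACGT"[::-1], "TGCA")

    def test_length(self):
        self.assertEqual(len("ACGT"), 4)


loader = unittest.TestLoader()
# The loader inspects the class at run time and collects methods named test*.
names = loader.getTestCaseNames(SequenceTests)
suite = loader.loadTestsFromTestCase(SequenceTests)
# Send the runner's report to an in-memory stream instead of the screen.
runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
result = runner.run(suite)
```

Writing the report to an in-memory stream keeps the run free of screen output, in line with how the other measured programs were prepared.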

Large Scale Applications

The final group comprises realistic commercial-grade Python applications. It also includes programs that use the services of a commercial framework, library or API. These programs use a large number of the programming language characteristics we measure, and are a good opportunity to check the performance of the measured Python implementations under real workloads.

Finding programs for this category is difficult, because we could not include those that use third-party non-standard libraries or GUI toolkits if we wanted them to run on most of the Python implementations we measured. In the end, we included 14 applications in this category:

Program Description
2to3 Python code translation

The 2to3 application is included in the Python 3.X distribution to help developers adapt Python 2 code to the Python 3 specification. This test uses the 2to3 tool to translate itself. It was adapted from the Unladen Swallow examples.

Bazaar version control

Bazaar is a version control system that helps you track project history over time and collaborate easily with others. This test runs Bazaar from the command line, issuing several commands. It was adapted from the Unladen Swallow examples.

BlindElephant web application fingerprinter

The BlindElephant Web Application Fingerprinter attempts to discover the version of a web application by comparing static files at known locations against precomputed hashes for versions of those files in all available releases. Testing was done with a local sample web page.

Cog code generation tool

Runs Cog, a simple code generation tool written in Python.

Django template system

Uses the Django template system to build a 150x150-cell HTML table. It was adapted from the Unladen Swallow examples.

Html5 library

This application parses the HTML 5 spec using html5lib. It was adapted from the Unladen Swallow examples.

Mako template system

Mako is a web template library written in Python. It is used by popular web sites. This application parses a sample web page with this template system. It was adapted from the Unladen Swallow examples.

Pyramid web framework

Runs the Pyramid web framework, a small and fast Python web application development framework

Rietveld code review

This application uses the Rietveld code review app along with the Django template system. It was adapted from the Unladen Swallow examples.

Spam Bayesian classifier

This application runs a “canned” mailbox through a SpamBayes ham/spam classifier. It was adapted from the Unladen Swallow examples.

Spitfire template system

This application uses the Spitfire template system to build a 1000x1000-cell HTML table. It was adapted from the Unladen Swallow examples.

SQLMap Injection testing tool

Runs SQLMap, an open source penetration testing tool that automates the process of detecting and exploiting SQL injection flaws and taking over database servers.

Volatility forensics framework

The Volatility Framework is a completely open collection of tools, implemented in Python under the GNU General Public License, for the extraction of digital artifacts from volatile memory (RAM) samples. A sample RAM dump was used for its execution.

Web2Py web framework

Runs the Web2Py web framework, a free open source full-stack framework for rapid development of fast, scalable, secure and portable database-driven web-based applications.