Python implements its dictionary as a hash table that uses open addressing to resolve collisions. The hash table occupies a contiguous block of RAM broken up into slots, each of which stores one and only one entry. Essentially a python dictionary is just an array in memory where a hash value of each newly inserted key is used to find a slot in the array to place the new entry.
Python lets you use seemingly arbitrary objects as keys to dictionaries by
utilizing the object’s __hash__
method:
Any object that defines a __hash__
method is considered hashable and thus
can be used as the key in a dictionary. __hash__
should return an integer,
which python then uses to find the key a slot in the hash table.
If __hash__
does not return an int python attempts to coerce the value to
an int, which can produce less than desirable effects.
object
has a default __hash__
function that returns a value
related to the identity of the object. In python every object has an identity
which can be observed using the builtin id()
function. Using the identity
makes sense as it is required to be a unique, constant value for the object during
its lifetime. The natural choice for identity then is the object’s
memory address:
Now, if we take the hash of this object we get,
which is…some random number. But, it turns out this seemingly random number is just the identity of the object divided by 16.
(I don’t know why python divides id by 16 for the hash.)
In CPython, the hash table starts with PyDict_MINSIZE
slots which
as of this writing is 8. To find slots for newly added items, python uses the
expression hash(key) & mask
where mask
is always equal to the current capacity
of the dictionary (which must be a power of 2) minus 1. The bitwise representation
of 2^n - 1 is always a sequence of 1’s so the mask will zero out all bits except
the lower order n-1 bits.
For example, say we want to insert the key ‘abc’ with value 5. We first need to find an empty slot in the hash table,
In the hash table this might look something like:
Idx | Hash | Key | Value |
---|---|---|---|
000 | |||
001 | |||
010 | |||
011 | ...100011 | "abc" | 5 |
100 | |||
101 | |||
110 | |||
111 |
So, what happens when we get a collision? Python probes to find the next empty slot. One way to probe would be to search the slots in order from the occupied slot until an empty one is found. Python uses a different form of probing to find the next available slot. The exact details are not important, but python’s probing uses the higher order bits of the hash code to prevent long chains of collisions. (See dictobject.c:35-125).
The hash table gets resized when the dictionary becomes 2/3rds full. The size of the table will be doubled (e.g. to 16) and the lower order 4 bits will then be used to redistribute keys in the block of memory.