Understanding PHP's internal array implementation (PHP's Source Code for PHP Developers - Part 4)
Welcome back to the fourth part of the “PHP’s Source Code for PHP Developers” series, in which we’ll cover how PHP arrays are internally represented and used throughout the code base.
In case you missed them, here are the previous parts of this series:
- Part 1: Structure of the source code and introduction to C
- Part 2: Finding and understanding PHP’s internal function definitions
- Part 3: PHP’s internal value representation: The zval
Everything is a hash table!
Basically, everything in PHP is a hash table. Not only are hash tables used in the underlying implementation of PHP arrays, they are also used to store object properties and methods, functions, variables and pretty much everything else.
And because the hash table is so fundamental to PHP, it is worth having a deeper look into how it works.
So, what is a hash table?
Remember that in C arrays are basically chunks of memory, which you can access by index. Thus arrays in C only have
integer keys and have to be continuous (i.e. you can’t have a key 0
and the next key is 1332423442
). There is no
such thing as an associative array.
And this is where hash tables come in: They convert string keys into normal integer keys using a hash function. The
result can then be used as an index into a normal C array (aka chunk of memory). The problem here obviously is that the
hash function can have collisions, i.e. multiple string keys can yield the same hash. For example in a PHP array with up
to 64 elements the strings "foo"
and "oof"
would have the same hash.
This problem is solved by not storing the value directly at the generated index, but storing a linked list of possible values instead.
HashTable and Bucket
So, after the basic concept of hash tables is clear, let’s have a look at the structures actually used in PHP’s hash table implementation:
The first one is the HashTable
:
typedef struct _hashtable {
uint nTableSize;
uint nTableMask;
uint nNumOfElements;
ulong nNextFreeElement;
Bucket *pInternalPointer;
Bucket *pListHead;
Bucket *pListTail;
Bucket **arBuckets;
dtor_func_t pDestructor;
zend_bool persistent;
unsigned char nApplyCount;
zend_bool bApplyProtection;
#if ZEND_DEBUG
int inconsistent;
#endif
} HashTable;
Let’s quickly go through it:
-
nNumOfElements
specifies how many values are currently stored in the array. This is also the number thatcount($array)
returns. -
nTableSize
specifies the size of the internal C array. It is always the next power of 2 greater or equal tonNumOfElements
. E.g. if an array stores 32 elements, the internal C array also has a size of 32. But if one more element is added, i.e. the array then contains 33 elements, the internal C array is resized to 64 elements.This is done to always keep the hash table efficient in space and time. It is clear that if the internal array is too small there will be many collisions and the performance will degrade. If the internal array is too big on the other hand, we’d be wasting memory. The power-of-2 size is a good compromise.
-
nTableMask
is the table size minus one. This mask is used to adjust the generated hashes for the current table size. For example the actual hash for"foo"
(through the DJBX33A hashing function) is 193491849. If we currently have a table size of 64, we obviously can’t use that as an index into the array. Instead we only take the lower bits of the hash by applying the table mask:hash | 193491849 | 0b1011100010000111001110001001 & mask | & 63 | & 0b0000000000000000000000111111 --------------------------------------------------------- = index | = 9 | = 0b0000000000000000000000001001
-
nNextFreeElement
is the next free integer key, which is used when you append to an array using$array[] = xyz
. -
pInternalPointer
stores the current position in the array. This is used forforeach
iteration and can be accessed using thereset()
,current()
,key()
,next()
,prev()
andend()
functions. -
pListHead
andpListTail
specify the first and last element of the array. Remember: PHP arrays have an order. E.g.['foo' => 'bar', 'bar' => 'foo']
and['bar' => 'foo', 'foo' => 'bar']
contain the same elements, but have a different order. -
arBuckets
is the “internal C array” we were always talking about. It’s defined as aBucket **
, so it can be seen as an array of bucket pointers (we’ll get to what exactly a Bucket is in a minute). -
pDestructor
is the destructor for the values. If a value is removed from the HT this function will be called on it. For a normal array the destructor function iszval_ptr_dtor
.zval_ptr_dtor
will reduce the reference count of thezval
and, if it reaches 0, destroy and free it.
The last four properties aren’t really of interest to us. So let’s just say that persistent
specifies that the hash
table can live between multiple requests, nApplyCount
and bApplyProtection
are used to prevent infinite recursion in
some places and inconsistent
is used to catch incorrect uses of hash tables in debug mode.
Let’s move on to the second important structure: Bucket
:
typedef struct bucket {
ulong h;
uint nKeyLength;
void *pData;
void *pDataPtr;
struct bucket *pListNext;
struct bucket *pListLast;
struct bucket *pNext;
struct bucket *pLast;
const char *arKey;
} Bucket;
-
h
is the hash (without the table mask applied). -
arKey
is used to save string keys.nKeyLength
is the corresponding length. For integer keys those two aren’t used. -
pData
orpDataPtr
is used to store the actual value. For PHP arrays that value is azval
(but it’s also used for other things internally.) Don’t bother with the fact that there are two properties for this. The difference between them is who is responsible for freeing the value. -
pListNext
andpListLast
specify the order of the array elements. If PHP wants to traverse the array it starts off at thepListHead
bucket (specified in theHashTable
struct) and then always takes thepListNext
bucket. The same works in reverse, by starting atpListTail
and always followingpListLast
. (You can do this in userland by callingend()
and then always callingprev()
.) -
pNext
andpLast
form the “linked list of possible values” I mentioned above. ThearBuckets
array stores a pointer to the first possible bucket. If that bucket hasn’t the right key, PHP will look at the bucket whichpNext
points to. This is done until the right bucket is found.pLast
can be used to do the same in reverse.
As you can see PHP’s hash table implementation is fairly complex. This is the price one has to pay for its ultra-flexible array type.
How are hash tables used?
The Zend Engine defines a large number of API functions for the work with hash tables. An overview of low-level hash
table functions can be found in zend_hash.h
. Additionally the ZE defines a set of slightly higher level APIs in
zend_API.h
.
We don’t have time to go through all of these, but we can look at a sample function to see at least some of them in
action. We’ll use array_fill_keys
as that sample function.
Using the technique outlined in the second part you should be able to find the function definition in
ext/standard/array.c
easily. Let’s quickly walk through it now.
As always there’s a set of variable declarations and a zend_parse_parameters
call at the top:
zval *keys, *val, **entry;
HashPosition pos;
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "az", &keys, &val) == FAILURE) {
return;
}
The az
obviously means that the first parameter is an array (fetched into the keys
variable) and the second one
is an arbitrary zval (fetched into the val
variable).
After the parameters are parsed the array to be returned is initialized:
/* Initialize return array */
array_init_size(return_value, zend_hash_num_elements(Z_ARRVAL_P(keys)));
This line contains already three important parts of the array API:
- The
Z_ARRVAL_P
macro fetches the hash table from azval
. zend_hash_num_elements
fetches the number of elements in a hash table (thenNumOfElements
property).array_init_size
initializes an array with a size hint.
So this line initializes an array into return_value
with the same size as the keys
array.
The size hint here is just an optimization. The function could have also called just array_init(return_value)
, in
which case PHP would have to do multiple resizes as more and more elements are added to the array. By specifying an
explicit size, PHP allocates the right amount of memory from the start.
After the return array initialization the function loops through the keys
array using a while
loop with roughly this
structure:
zend_hash_internal_pointer_reset_ex(Z_ARRVAL_P(keys), &pos);
while (zend_hash_get_current_data_ex(Z_ARRVAL_P(keys), (void **)&entry, &pos) == SUCCESS) {
// some code
zend_hash_move_forward_ex(Z_ARRVAL_P(keys), &pos);
}
This can be translated to PHP code easily:
reset($keys);
while (null !== $entry = current($keys)) {
// some code
next($keys);
}
Which is the same as:
foreach ($keys as $entry) {
// some code
}
The only real difference is that the C iteration doesn’t use the internal array pointer, but uses it’s own pos
variable to store the current position.
The code within the loop has two branches: One for integer keys and one for other keys. The integer key branch contains only two lines:
zval_add_ref(&val);
zend_hash_index_update(Z_ARRVAL_P(return_value), Z_LVAL_PP(entry), &val, sizeof(zval *), NULL);
This is pretty straightforward: First the refcount of the value is increased (adding the value to the hash table means
adding another reference to it) and then the value is actually inserted into the hash table. The arguments of the
zend_hash_index_update
macro are the hash table to update Z_ARRVAL_P(return_value)
, the integer index
Z_LVAL_PP(entry)
, the value &val
, the size of the value sizeof(zval *)
and the destination pointer (which we don’t
care about, thus NULL
).
The non-integer-key branch is slightly more complicated:
zval key, *key_ptr = *entry;
if (Z_TYPE_PP(entry) != IS_STRING) {
key = **entry;
zval_copy_ctor(&key);
convert_to_string(&key);
key_ptr = &key;
}
zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);
if (key_ptr != *entry) {
zval_dtor(&key);
}
First the key is converted to a string (unless it already is one) using convert_to_string
. But before this can be done
the entry
has to be copied into a new key
variable. The key = **entry
line takes care of that. Additionally
zval_copy_ctor
has to be called, otherwise complex structures (like strings or arrays) won’t be copied correctly.
The copy is necessary to ensure that the cast doesn’t change the original array. Without the copy the cast would
modify not only our local variable, but also the element in the keys
array (which obviously would be quite unexpected
to the user).
Obviously the copy has to be removed again after the loop, which is what the zval_dtor(&key)
line does. The difference
between zval_ptr_dtor
and zval_dtor
is that zval_ptr_dtor
only destroys the zval if the refcount reaches 0,
whereas zval_dtor
destroys it always, regardless of the refcount. That’s why you’ll find zval_ptr_dtor
used with
“normal” values and zval_dtor
with temporary variables, which aren’t used anywhere else anyways. Also zval_ptr_dtor
frees the zval after destroying it, while zval_dtor
does not. As we never malloc()
d anything, we also don’t have
to free()
, so zval_dtor
is the right choice in this respect too.
Now, let’s look at the last two lines left (the important ones ^^):
zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);
Those are very similar to what was done in the integer-key branch. The difference is that now zend_symtable_update
is
called instead of zend_hash_index_update
and the string key and it’s length are passed in.
The Symtable
The “normal” function for inserting string keys into a hash table is zend_hash_update
, but here zend_symtable_update
is used instead. What’s the difference?
A symtable basically is a special type of hash table, which is used for arrays. The difference to the ordinary hash
table is how it handles numeric string keys: In a symtable the keys "123"
and 123
are considered identical. So if
you store a value in $array["123"]
, you can then retrieve it using $array[123]
.
The underlying implementation could use two ways: Either save both 123
and "123"
using the "123"
key or save both
using the 123
key. PHP obviously chooses the latter option (as integers are smaller and faster than strings).
You can have a little bit of fun with symtables, if you manage to somehow insert the "123"
key without it being cast
to 123
. One way is to exploit the array to object cast:
$obj = new stdClass;
$obj->{123} = "foo";
$arr = (array) $obj;
var_dump($arr[123]); // Undefined offset: 123
var_dump($arr["123"]); // Undefined offset: 123
Object properties are always saved under string keys, even if they are numbers. So the $obj->{123} = 'foo'
line
actually saves "foo"
under the "123"
index, not the 123
index. When doing the array cast this is not changed. But
as both $arr[123]
and $arr["123"]
try to access the 123
index (not the "123"
index that actually exists), both
throw an error. So, congratulations, you’ve created a hidden array element!
In the next part
The next part will again be published on ircmaxell’s blog. It will analyze how objects and classes work internally.