To Dict Or Not To Dict: Comparing Data Structure Sizes

To Dict Or Not To Dict: Comparing Data Structure Sizes

Featured on Hashnode

Searching for the best data structure for your problem can be a tricky business. The chosen data structure should be easy to understand for other developers, run fast in the algorithms where it's used and be memory efficient. You can't always optimize at the start, but you can learn from mistakes. I made such a mistake recently regarding the sizes in memory and on disk, so I'd like to walk you through a simple example.

Basically I was storing keys and values, so obviously a dictionary is the first choice of data structure. But it turned out that was not the most memory efficient solution to choose.

So let's look at 4 options that I tried, a Dict an OrderedDict a Vector{Pair} and an AxisArray. We'll start with a simple example with a 1000 strings and integers:

using OrderedCollections, AxisArrays

# our types
dict = Dict{String,Int}()
dict_ordered = OrderedDict{String,Int}()
vector_pair = Vector{Pair{String,Int}}()

# create some key-value objects
for value in 1:1000
    dict["key$value"] = value
    dict_ordered["key$value"] = value
    push!(vector_pair, "key$value" => value)
end

# and an axis array
axis_array = AxisArray(collect(values(dict)), key = collect(keys(dict)))

If we look at the data structure implementations, we see below that a Dict contains three arrays, slots, keys, vals . I assume the slots are used as a kind of hashes to quickly index the keys. But the Dict pre-allocates 4096 keys and values apparently.

julia> dump(dict)
Dict{String, Int64}
  slots: Array{UInt8}((4096,)) UInt8[0xd9, 0xef, 0xfb, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xba0x00, 0x00, 0x00, 0x00, 0x00, 0x87, 0x00, 0x94, 0x00, 0x00]      
  keys: Array{String}((4096,))
    1: String "key239"
    2: String "key390"
    ...
    4095: #undef
    4096: #undef
  vals: Array{Int64}((4096,)) [239, 390, 722, 0, 0, 0, 0, 0, 0, 4800, 0, 0, 0, 0, 532, 0, 568, 0, 0]
  ndel: Int64 0
  count: Int64 1000
  age: UInt64 0x00000000000003f0
  idxfloor: Int64 1
  maxprobe: Int64 4

The OrderedDict is similar, but doesn't pre-allocate the keys and values.

julia> dump(dict_ordered)
OrderedDict{String, Int64}
  slots: Array{Int32}((4096,)) Int32[239, 390, 722, 0, 0, 0, 0, 0, 0, 4800, 0, 0, 0, 0, 532, 0, 568, 0, 0]
  keys: Array{String}((1000,))
    1: String "key1"
    2: String "key2"
    ...
    999: String "key999"
    1000: String "key1000"
  vals: Array{Int64}((1000,)) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10991, 992, 993, 994, 995, 996, 997, 998, 999, 1000]
  ndel: Int64 0
  maxprobe: Int64 4
  dirty: Bool true

A Pair is just a key (first) and a value (second) in a struct. And then we can store an array of those.

julia> dump(vector_pair[1])
Pair{String, Int64}
  first: String "key1"
  second: Int64 1

An AxisArray is just an array with named axes, where each value in an axis contains our key.

julia> dump(axis_array)
AxisVector{Int64, Vector{Int64}, Tuple{Axis{:key, Vector{String}}}}
  data: Array{Int64}((1000,)) [239, 390, 722, 480, 679, 798, 841, 877, 21, 636413, 667, 717, 697, 398, 256, 334, 823, 532, 568]
  axes: Tuple{Axis{:key, Vector{String}}}
    1: Axis{:key, Vector{String}}
      val: Array{String}((1000,))
        1: String "key239"
        2: String "key390"
        ...
        999: String "key532"
        1000: String "key568"

You can also index an AxisArray similar to a dictionary:

julia> axis_array["key5"]
5

1D sizes

Let's look at the memory size of these 4 beasts:

julia> varinfo(r"dict|vector|axis")
  name               size summary     
  –––––––––––– –––––––––– ––––––––––––––––––––––––––––––––––––––––––––
  axis_array   29.302 KiB 1-dimensional AxisArray{Int64,1,...} 
  dict         81.747 KiB Dict{String, Int64} with 1000 entries             
  dict_ordered 45.356 KiB OrderedDict{String, Int64} with 1000 entries    
  vector_pair  44.856 KiB 1000-element Vector{Pair{String, Int64}}

The AxisArray wins in terms of memory size. I'm also curious about size on disk with JLD2

using JLD2
for type in (:dict, :dict_ordered, :vector_pair, :axis_array)
    jldsave("$type.jld2", x=eval(type))
    println("$type.jld2 size: $(filesize("$type.jld2"))")
end

This gives us the following jld2 sizes:

dict.jld2 size: 105398
dict_ordered.jld2 size: 66303
vector_pair.jld2 size: 105162
axis_array.jld2 size: 50949

Again the AxisArray wins, though the difference shrunk a bit. Although a 2x difference doesn't feel very significant.

2D sizes

In my case I was actually storing a lot of dictionaries, which were actually sharing the same keys. This was my biggest mistake, because you can store such data as a matrix and share the keys of the rows and columns. Let's look at a small example again:

# let's very naively store a vector of dictionaries, to get a feel for the size
dict_vector = [deepcopy(dict) for _ in 1:100]

# let's also store an AxisMatrix with shared rows and columns
axis_matrix = AxisArray(
    rand(Int, 1000, 100),
    row = ["row$r" for r in 1:1000],
    col = ["col$c" for c in 1:100]
)

Now the sizes are significantly different:

julia> varinfo(r"dict_vector|axis_matrix")
  name         size           summary     
  –––––––––––– ––––––––––---- ––––––––––––––––––––––––––––––––––––––––––––
  axis_matrix  804.845 KiB    2-dimensional AxisArray{Int64,2,...} 
  dict_vector  7.984 MiB      100-element Vector{Dict{String, Int64}}

You see an almost 10x difference in memory size. The difference is more than 10x in the JLD2 files:

julia> for type in (:dict_vector, :axis_matrix)
    jldsave("$type.jld2", x=eval(type))
    println("$type.jld2 size: $(filesize("$type.jld2"))")
end

dict_vector.jld2 size: 10352637
axis_matrix.jld2 size:   849101

Differences are so big because most of the data is integers, which are small, and the "keys" of the matrix are strings which can occupy quite some memory.

In the end I opted to switch to AxisArrays for my 2D problem. Though I could only do this by assuming all columns had the same keys.

Conclusion

There's a difference in memory usage between data structures that store keys and value. Dictionaries do not seem to be optimal, however for the simple case I looked at here the difference is not as large as I imagined. A bigger difference only appears when you have extra information, such as repeating keys, and you can use that information to choose a more optimal data structure, such as a matrix with the keys on the axes.