Searching for the best data structure for your problem can be a tricky business. The chosen data structure should be easy for other developers to understand, run fast in the algorithms where it's used, and be memory efficient. You can't always optimize at the start, but you can learn from mistakes. I recently made such a mistake regarding sizes in memory and on disk, so I'd like to walk you through a simple example.
Basically I was storing keys and values, so a dictionary was the obvious first choice of data structure. But it turned out not to be the most memory-efficient option.
So let's look at the 4 options that I tried: a Dict, an OrderedDict, a Vector{Pair} and an AxisArray. We'll start with a simple example with 1000 strings and integers:
using OrderedCollections, AxisArrays
# our types
dict = Dict{String,Int}()
dict_ordered = OrderedDict{String,Int}()
vector_pair = Vector{Pair{String,Int}}()
# create some key-value objects
for value in 1:1000
dict["key$value"] = value
dict_ordered["key$value"] = value
push!(vector_pair, "key$value" => value)
end
# and an axis array
axis_array = AxisArray(collect(values(dict)), key = collect(keys(dict)))
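A quick aside before dumping the internals: a Dict doesn't preserve insertion order, while an OrderedDict does. That's why further below the Dict (and the AxisArray built from it) shows shuffled keys like "key239", while the OrderedDict keeps "key1", "key2", and so on:
# Dict iterates in hash order, OrderedDict in insertion order
first(keys(dict), 3)          # e.g. ["key239", "key390", "key722"], depends on hashing
first(keys(dict_ordered), 3)  # ["key1", "key2", "key3"]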
If we look at the data structure implementations, we see below that a Dict contains three arrays: slots, keys and vals. I assume the slots act as a kind of hash index to quickly locate the keys. Apparently the Dict also pre-allocates 4096 keys and values.
julia> dump(dict)
Dict{String, Int64}
slots: Array{UInt8}((4096,)) UInt8[0xd9, 0xef, 0xfb, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xba … 0x00, 0x00, 0x00, 0x00, 0x00, 0x87, 0x00, 0x94, 0x00, 0x00]
keys: Array{String}((4096,))
1: String "key239"
2: String "key390"
...
4095: #undef
4096: #undef
vals: Array{Int64}((4096,)) [239, 390, 722, 0, 0, 0, 0, 0, 0, 480 … 0, 0, 0, 0, 0, 532, 0, 568, 0, 0]
ndel: Int64 0
count: Int64 1000
age: UInt64 0x00000000000003f0
idxfloor: Int64 1
maxprobe: Int64 4
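To make the distinction explicit: the 4096 is internal, pre-allocated storage, while the logical size is still 1000 entries. Poking at the internal fields (which may change between Julia versions) shows this:
length(dict.keys)   # 4096: internal, pre-allocated storage
length(keys(dict))  # 1000: the actual number of entries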
The OrderedDict is similar, but doesn't pre-allocate the keys and values.
julia> dump(dict_ordered)
OrderedDict{String, Int64}
slots: Array{Int32}((4096,)) Int32[239, 390, 722, 0, 0, 0, 0, 0, 0, 480 … 0, 0, 0, 0, 0, 532, 0, 568, 0, 0]
keys: Array{String}((1000,))
1: String "key1"
2: String "key2"
...
999: String "key999"
1000: String "key1000"
vals: Array{Int64}((1000,)) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000]
ndel: Int64 0
maxprobe: Int64 4
dirty: Bool true
A Pair is just a key (first) and a value (second) in a struct, and we can then store an array of those.
julia> dump(vector_pair[1])
Pair{String, Int64}
first: String "key1"
second: Int64 1
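Keep in mind that a Vector{Pair} has no hash index, so looking up a value by key means scanning the pairs one by one. A minimal sketch, with lookup being a made-up helper just for illustration:
# linear scan over the pairs until the key matches
lookup(v::Vector{<:Pair}, key) = last(v[findfirst(p -> first(p) == key, v)])

lookup(vector_pair, "key5")  # 5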
An AxisArray is just an array with named axes, where the values along an axis act as our keys.
julia> dump(axis_array)
AxisVector{Int64, Vector{Int64}, Tuple{Axis{:key, Vector{String}}}}
data: Array{Int64}((1000,)) [239, 390, 722, 480, 679, 798, 841, 877, 21, 636 … 413, 667, 717, 697, 398, 256, 334, 823, 532, 568]
axes: Tuple{Axis{:key, Vector{String}}}
1: Axis{:key, Vector{String}}
val: Array{String}((1000,))
1: String "key239"
2: String "key390"
...
999: String "key532"
1000: String "key568"
You can also index an AxisArray much like a dictionary:
julia> axis_array["key5"]
5
1D sizes
Let's look at the memory size of these 4 beasts:
julia> varinfo(r"dict|vector|axis")
name               size summary
–––––––––––– –––––––––– ––––––––––––––––––––––––––––––––––––––––––––
axis_array   29.302 KiB 1-dimensional AxisArray{Int64,1,...}
dict         81.747 KiB Dict{String, Int64} with 1000 entries
dict_ordered 45.356 KiB OrderedDict{String, Int64} with 1000 entries
vector_pair  44.856 KiB 1000-element Vector{Pair{String, Int64}}
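As far as I know, varinfo gets these numbers from Base.summarysize, so you can also query a single object directly:
Base.summarysize(axis_array)  # size in bytes, roughly 29.3 KiB here
Base.summarysize(dict)        # roughly 81.7 KiB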
The AxisArray wins in terms of memory size. I'm also curious about the size on disk with JLD2:
using JLD2
for type in (:dict, :dict_ordered, :vector_pair, :axis_array)
jldsave("$type.jld2", x=eval(type))
println("$type.jld2 size: $(filesize("$type.jld2"))")
end
This gives us the following JLD2 file sizes (in bytes):
dict.jld2 size: 105398
dict_ordered.jld2 size: 66303
vector_pair.jld2 size: 105162
axis_array.jld2 size: 50949
Again the AxisArray wins, though the gap shrank a bit; a roughly 2x difference doesn't feel very significant.
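For completeness, reading a value back works by opening the file and fetching it under the name it was saved with; a quick sketch using jldopen:
loaded = jldopen("axis_array.jld2") do file
    file["x"]
end
loaded["key5"]  # 5 again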
2D sizes
In my case I was actually storing a lot of dictionaries that all shared the same keys. This was my biggest mistake, because such data can be stored as a matrix that shares the keys across the rows and columns. Let's look at a small example again:
# let's very naively store a vector of dictionaries, to get a feel for the size
dict_vector = [deepcopy(dict) for _ in 1:100]
# let's also store an AxisMatrix with shared rows and columns
axis_matrix = AxisArray(
    rand(Int, 1000, 100),
    row = ["row$r" for r in 1:1000],
    col = ["col$c" for c in 1:100]
)
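The shared-key matrix can still be indexed by its string keys, similar to the 1D lookup earlier; a small sketch assuming the row and col axes defined above:
axis_matrix[Axis{:row}("row5"), Axis{:col}("col7")]  # a single value
axis_matrix[Axis{:row}("row5")]                      # all columns for "row5"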
Now the sizes are significantly different:
julia> varinfo(r"dict_vector|axis_matrix")
name              size summary
––––––––––– ––––––––––– ––––––––––––––––––––––––––––––––––––––––––––
axis_matrix 804.845 KiB 2-dimensional AxisArray{Int64,2,...}
dict_vector   7.984 MiB 100-element Vector{Dict{String, Int64}}
You see roughly a 10x difference in memory size, and more than 10x in the JLD2 files:
julia> for type in (:dict_vector, :axis_matrix)
jldsave("$type.jld2", x=eval(type))
println("$type.jld2 size: $(filesize("$type.jld2"))")
end
dict_vector.jld2 size: 10352637
axis_matrix.jld2 size: 849101
The differences are so big because most of the data consists of integers, which are small, while the string keys take up quite a bit of memory: the matrix stores each key only once, whereas the vector of dictionaries repeats all 1000 keys 100 times.
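You can get a rough feel for the cost of those duplicated keys; a back-of-the-envelope sketch, not an exact accounting:
# the keys of one dictionary...
key_bytes = Base.summarysize(collect(keys(dict)))
# ...are stored 100 times over in dict_vector, whereas a shared-axis matrix stores its keys only once
key_bytes * 100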
In the end I opted to switch to AxisArrays for my 2D problem, though I could only do this by assuming all columns had the same keys.
Conclusion
There's a difference in memory usage between data structures that store keys and values. Dictionaries do not seem to be optimal; however, for the simple case I looked at here the difference is not as large as I had imagined. A bigger difference only appears when you have extra information, such as repeated keys, and can use that information to choose a more suitable data structure, such as a matrix with the keys on the axes.