Learning how best to use arrays, hashes, and sets

Ruby's collection classes are one of the reasons why it is such a joy to program in Ruby. In most cases, the choice of collection class to use is fairly straightforward. If you need a simple list of values that you are iterating over, or using the collection as a queue or a stack, you generally use an array. If you need a mapping of one or more objects to one or more objects, then you generally use a hash. If you have a large list of objects and want to see whether a given object is contained in it, you generally use a set.

In some cases, it's fine to use either an array or a hash. Often, when iterating over a small list, you could use the array approach:

[[:foo, 1], [:bar, 3], [:baz, 7]].each do |sym, i|
  # ...
end

Or, you could use the hash approach:

{foo: 1, bar: 3, baz: 7}.each do |sym, i|
  # ...
end

Since you are not indexing into the collection, the simpler approach from a design perspective is to use an array. However, because the hash approach is syntactically simpler, the idiomatic way to handle this in Ruby is to use a hash.

For more complex mapping cases, you often want to use a hash, but you may need to decide how to structure the hash. This is especially true when you are using complex keys. Let's take a deeper look at the differences between arrays, hashes, and sets by working through an example that implements an in-memory database.

Implementing an in-memory database

While many programmers often use a SQL database for data storage, there are many cases when you need to build a small, in-memory database using arrays, hashes, and sets. Often, even when you have the main data stored in a SQL database, it is faster to query the SQL database to retrieve the information, and use that to build an in-memory database for the specific class or method you are designing. This allows you to query the in-memory database with similar speed as a hash or array lookup, orders of magnitude faster than a SQL database query.

Let's say you have a list of album names, track numbers, and artist names, where you can have multiple artists for the same album and track. You want to design a simple lookup system so that given an album name, you can find all artists who worked on any track of the album, and given an album name and track number, you can find the artists who worked on that particular track.

In the following examples, you should assume that album_infos is an arbitrary object that has each method that yields the album name, track number, and artist. However, if you would like to have some sample data to work with:

album_infos = 100.times.flat_map do |i|
  10.times.map do |j|
    ["Album #{i}", j, "Artist #{j}"]
  end
end

One approach for handling this is to populate two hashes, one keyed by album name, and one keyed by an array of the album name and track number. Populating these two hashes is straightforward, by setting the value for the key to an empty array if the key doesn't exist, and then appending the artist name. Then you need to make sure the artist values are unique for the hash keyed just by album name:

album_artists = {}
album_track_artists = {}
album_infos.each do |album, track, artist|
  (album_artists[album] ||= []) << artist
  (album_track_artists[[album, track]] ||= []) << artist
end
album_artists.each_value(&:uniq!)

With this approach, looking up values is fairly straightforward, and just involves looking in the appropriate hash with the appropriate key:

lookup = ->(album, track=nil) do
  if track
    album_track_artists[[album, track]]
  else
    album_artists[album]
  end
end

An alternative approach would be to use a nested hash approach, with each album having a hash of tracks:

albums = {}
album_infos.each do |album, track, artist|
  ((albums[album] ||= {})[track] ||= []) << artist
end

With this approach, looking up values is more complex, especially in the case where a track number is not provided, and you have to dynamically create the list:

lookup = ->(album, track=nil) do
  if track
    albums.dig(album, track)
  else
    a = albums[album].each_value.to_a
    a.flatten!
    a.uniq!
    a
  end
end

In general, the first approach using multiple hashes is going to take significantly more memory than the second approach if there is a large number of albums, but it will have a much better lookup performance for albums. The first approach will also take much more time to populate the data structure. The second approach is much lighter on memory and has better lookup performance for albums with tracks as it avoids an array allocation, but will exhibit a far more inferior performance for albums.

Each of these approaches does not depend on the types of objects that album_infos.each yields. You probably made the reasonable assumption that album and artist would be strings, and track would be a number. Let's say you knew in advance that the track number was an integer between 1 and 99. You could use that information to design a different approach. You could still have a single of hash keyed by album name, with a value being an array containing arrays of artist names for each track. Since tracks only go from 1 to 99, you could use the 0 index in the array to store all artist names for the album. Populating this combination of hash and array of arrays isn't too difficult:

albums = {}
album_infos.each do |album, track, artist|
   album_array = albums[album] ||= [[]]
   album_array[0] << artist
   (album_array[track] ||= []) << artist  
end
albums.each_value do |array|
  array[0].uniq!
end

This approach is more memory-efficient than either of the previous approaches, and looking up values is very simple and never allocates an object:

lookup = ->(album, track=0) do
  albums.dig(album, track)
end

Compared to the previous two approaches, this approach uses about the same amount of memory as the nested hash approach. It takes slightly more time to populate compared to the nested hash approach. It is almost as fast as the two hash approach in terms of lookup performance for albums, and is the fastest approach for lookup performance by albums with tracks.

Maybe the needs of your application change, and now you need a feature that allows users to enter a list of artist names, and will return an array with only the artist names that the application knows are on one of the albums. One way to handle this is to store the artists in an array:

album_artists = album_infos.flat_map(&:last)
album_artists.uniq!

The lookup can use an array intersection to determine the values:

lookup = ->(artists) do
  album_artists & artists 
end

The problem with this approach is that Array#& uses a linear search of the array, so this approach is very slow for a large number of artists.

A better performing approach would use a hash, keyed by the artist name:

album_artists = {}
album_infos.each do |_, _, artist|
  album_artists[artist] ||= true
end

The lookup can use the hash to filter the values in the submitted array:

lookup = ->(artists) do
  artists.select do |artist|
    album_artists[artist]
  end
end

This approach performs much better. The code isn't as simple, though it isn't too bad. However, it would be nicer to have simpler code that performed as well. Thankfully, the Ruby Set class can meet this need. Like BigDecimal, Set is not currently a core Ruby class. Set is in the standard library, and you can load it via require 'set'. However, Set may be moved from the standard library to a core class in a future version of Ruby. Using a set is pretty much as simple as using an array in terms of populating the data structure:

album_artists = Set.new(album_infos.flat_map(&:last))

You don't need to manually make the array unique, because the set automatically ignores duplicate values. The lookup code can stay exactly the same as the array case:

lookup = ->(artists) do
  album_artists & artists 
end

Of the three approaches, the hash approach is the fastest to populate and the fastest to look up. The Set approach is much faster to look up than the array approach, but still significantly slower than hash. Set is actually implemented using a hash internally, so in general, it will perform worse than using a hash directly. As a general rule, you should only use a set for code that isn't performance-sensitive and you would like to use a nicer API. For any performance-sensitive code, you should prefer using a hash directly.

In this section, you learned about Ruby's core collection of classes, arrays, hashes, and sets. In the next section, you'll learn about Struct, one of Ruby's underappreciated core classes.

Polished Ruby Programming

By : Evans

Polished Ruby Programming

By: Evans

Overview of this book

Learning how best to use arrays, hashes, and sets

Implementing an in-memory database

Polished Ruby Programming

By : Evans

Polished Ruby Programming

By: Evans

Overview of this book

Learning how best to use arrays, hashes, and sets

Implementing an in-memory database

Delete Bookmark

Confirmation

Buy this book with your credits?