The hash package: hashes come to R

Legacy content from the Open Data Group website.

Perl has hashes. Python has dictionaries. Why doesn’t R have an equivalent? Hash tables and associative arrays are indispensable tools for the programmer. One of the most common and basic tasks of a programmer is to “look up” or “map” a key to a value. In fact, there are projects whose sole raison d’être is making the hash as fast and as efficient as possible.

R actually has two equivalents, both lacking. The first is R’s named vectors and lists. Elements of vectors and lists can be accessed by name, through the standard R methods:


obj$name
obj['name']
obj[['name']]

Vectors are not stored using internal hash tables and as they grow large, performance can suffer. The performance impact is tangible even on small lists. For programs doing many look-ups or look-ups on many objects, this can create a bottleneck.

R’s environments are much closer to Perl hashes and Python’s dictionary. The structure of the environment is a hash table internally and look-ups do not appreciably degrade with object size. To use a R environment, you need to create it and assign key-value pairs to it.


hash = new.env(hash=TRUE, parent=emptyenv(), size=100L)
assign(key, value, hash)
get(key, hash)

We can even get the keys from the hash with the ls function:


ls( env=hash )

This works well and perfomance is good. So what’s the problem?

Usability. In designing, the S language, John Chambers put much thought into how the analyst and statistician interact with data. All variables are designed to be vectors and a standard set of accessors( $, [, [[ ) were defined to retrieve and set slices, subsets or elements of the data. The problem is that R environments don't follow this pattern. And this is where the hash package comes in.

The hash package is designed to provide an R-syntax to R's environments and give programmers a hash. The package provides one constructor function, hash that will take a variety of arguments, always doing the right thing. All of the following work:


hash()
hash( keys=c('foo','bar','baz'), values=1:3 )
hash( foo=1, bar=2, baz=3 )
hash( c( foo=1, bar=2, baz=3 ) )
hash( list( foo=1, bar=2, baz=3 ) )
hash( c('foo','bar','baz'), 1:3 )

It pretty much does what you mean.

The standard accessors: [, [[, $ are also available.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ]
h[[ 'foo' ]]
h$foo

As does their corresponding replacement methods.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ] <- c( 'fred', 'wilma' )
h[[ 'foo' ]] <- 'dino'
h$foo <- 'bam bam'

There you have it, hashes for R.

Comments

John Gant has pointed out that, at present, hash cannot contain hashes.

John writes:

I found a super simple example showing how its not working for me. That eliminates all of my code, which should make finding the issue easier.

> library( hash )
> h1 = hash()
> h2 = hash()
> h2[[ 101 ]] = 102
> h1[[ 100 ]] = h2
Error in FUN(”1″[[1L]], …) : object ‘1′ not found

* * * * *

My Response:

The problem is that hash does not support hashes as values. The problem is:

h1[[ 100 ]] <- h2

In this case the assignment tries to interpret h2, rather than say creating a reference. There are some other problems with
embedding hashes.

h1 h1[["100"]]
[[1]]
An object of type ‘hash’ containing 1 key-value pairs.
101 : 102
## LOOKS GOOD ##
> h1[["100"]][["101"]]
NULL
## UH-OH ##

Much of the problem steps from how R handles environments. It is going to take a bit of thought and some effort this weekend. Until then I would consider hash in hash as unsupported. But I understand that is a very important use case for both me and you.

* * * * *

I am presently testing a fix. If all works well the package will be posted to CRAN soon.


#2 by Christopher Brown on September 29, 2009 - 11:47 pm

The fix is in. Hash 1.0 is hitting CRAN now. This is considered the first production-ready version. Several changes were made to make it more R-ish and speedier.

Feedback is appreciated.


#3 by Christopher on October 5, 2009 - 3:46 pm

This is now fixed as of hash-1.0.

Chris


#4 by Christopher Brown on October 5, 2009 - 4:41 pm

From 2.9.0:

o A mechanism now allows S4 classes to inherit from object types “environment”, “externalptr” and symbol (”name”). See ?setClass.

This will be changed in the Hash package concomitant with R-2.10.0


#5 by Christopher Brown on October 12, 2009 - 6:10 am

Version 1.0.2 was uploaded to CRAN this evening. See the ChangeLog for details. Small-ish bug fixes release only.


#6 by Christopher Brown on October 12, 2009 - 6:11 am

This version does however use the .Data slot to contain the underlying environment. The API did not change at all.

C-


#7 by Denise Mauldin on October 30, 2009 - 7:19 pm

Hi there,

It appears that the fix does not allow for any layer deep of hashes? I’m trying to do a hash of a hash of a hash and getting an error:

Error in get(make.keys(i), x@.Data) : object ‘1′ not found

ie:
key <- 'one'
ikey <- 'two'
val <- 'three'
info <- hash()
info[key] <- hash( keys=c(ikey), values=c(val) )


#8 by maximilian haussler on November 12, 2009 - 11:25 am

I want to mention that often you don’t need hashes in R if you use dataframes and join them with the merge() function.


#9 by Christopher Brown on November 30, 2009 - 12:39 am

Maximilian,

Yes. You can emulate hashes by using data.frames and the merge() function. You can also use named lists. And if all items of the hash are of the same base class, you can even use named vectors.

The problem with each of these emulations is that they do not scale well, O(n). There are fine when n is small but are disastrous when it grows. The hash packages solves this by using R’s environments, i.e. real hash tables. In fact, the hash packages only provides a more intuitive interface.

Try it. Compare using data frames and merge tables against hash on a million records.

Chris


#10 by Christopher Brown on November 30, 2009 - 2:30 am

Denise,

Sorry for the late reply and thanks for the bug report. Indeed this was not behaving as expected. Moments ago, I released hash version 1.10.0 to address this and other minor annoyances. It should be trickling its way through the CRAN mirrors.

Thanks Again,

Chris


#11 by Carlo on February 17, 2010 - 5:44 am

is there a way to do one key to many value lookups?


#12 by ivo welch on February 21, 2010 - 3:34 pm

is the internal implementation really a hash? I created a hash with 160,000 values and then tried to see if a certain value is in the hash. alas, each scalar access seems to take seconds.

in the docs, you should probably mention that the way to test whether an element is in the hash is

is.in.hash= function(value) !is.null( myhash[[ value ]] )


#13 by christopher on April 25, 2010 - 5:58 am

Carlo,

There preferred way to do many value look-ups is use the ‘values’ function that takes an optional ‘keys’ argument.

Chris


#14 by christopher on April 25, 2010 - 6:11 am

Ivo,

Yes. There was a problem with an earlier implementation of the package in which performance was poor on large hashes. This has been fixed.

The ‘has.key’ method is an S4 method for testing whether the key is assigned to the hash.


#15 by Gabor on July 10, 2010 - 8:19 pm

I had an R function that worked until I updated R and some packages to latest versions. Now I am getting the following problem when I assign to some hashes.

Error in assign(i, value, x@.xData) : invalid first argument

The following code will produce this.

RIs <- hash(1,1:100)
RI <- rnorm(100,mean=.5,sd=.05)
i=1
RIs[[i]] <- RI

Please help.


#16 by Gabor on July 12, 2010 - 5:33 pm

Explicitly forcing i to a string using toString() solves this problem.


#17 by esther on August 12, 2010 - 2:48 pm

Brilliant package! Took me a few hours to get my head round – but here’s the line of code I’ve come up with – I needed to add a ‘gender’ – column to my data. It works and it’s VERY fast!

relevant_records["gender"] <- ifelse( has.key(as.character(relevant_records$participant_id),female_hash), "female", "male" )

Thanks!

Esther :-)