Friday, December 27, 2013

VBA Collections: Using as a Hash Table for Fast String Lookup in Excel and Access

VBA collections are pretty simple beasts. They only have three methods: Add, Remove and Item, and a Count property. When adding values to the collection, we can include a string key for lookup.
Some sample code:
Dim TestColl As New Collection
TestColl.Add 153, "Greece"
TestColl.Add 23, "Japan"
now we can use an integer index to retrieve the n-th 1-based item (returns 23)
TestColl.Item(2)
or the string to retrieve the item by key (returns 153)
TestColl.Item("Greece")
We can also iterate the values:
Dim v As Variant

For Each v In TestColl 
  Debug.Print v
Next
There is some debate as to whether the collection uses a proper hashed lookup to retrieve the value when given a string key. Many sites/blogs recommend using the Dictionary object in the System.Scripting namespace.
While this may mostly work, over the years I have learned to avoid external dependencies in MS Access wherever possible. Hence I determined to investigate the lookup performance of the native VBA collection for both integer and string based retreival.

The following code builds a collection of 100,000 integer values and then looks up a particular value at the start, middle or end of the collection 10,000 times, so as to return a measurable time.
Public Sub TestCollectionItemLookup()
Dim TestColl As New Collection
Dim i As Long
Dim k As Long
Dim StartTime As Single
Dim Iterations As Long
    Iterations = 10000
    Debug.Print "Iterations=" & Iterations

    For i = 0 To 100000
        TestColl.Add i, CStr("k" & i)
    Next
    
    ' By Int Index, i=2
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(2)
    Next
    
    Debug.Print "By Int Index, i=2: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By Int Index, i=500
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(500)
    Next
    
    Debug.Print "By Int Index, i=500: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By Int Index, i=50000
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(50000)
    Next
    
    Debug.Print "By Int Index, i=50000: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By Int Index, i=99999
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(99999)
    Next
    
    Debug.Print "By Int Index, i=99999: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By String Index, i=500
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CStr("k" & 500))
    Next
    
    Debug.Print "By String Index, i=500: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By String Index, i=99999
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CStr("k" & 99999))
    Next
    
    Debug.Print "By String Index, i=99999: t=" & Format(Timer - StartTime, "0.000000")
    
    '*****************************************************************
    
    ' By Int Index, random lookup
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CLng(Rnd * 100000))
    Next
    
    Debug.Print "By Int Index, random lookup: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By String Index, random lookup
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CStr("k" & CLng(Rnd * 100000)))
    Next
    
    Debug.Print "By String Index, random lookup: t=" & Format(Timer - StartTime, "0.000000")
End Sub
Executing this sub printed the following results for me:
   
Iterations = 10000
By Int Index, i=2: t=0.007813
By Int Index, i=500: t=0.031250
By Int Index, i=50000: t=5.406250
By Int Index, i=99999: t=16.195310
By String Index, i=500: t=0.031250
By String Index, i=99999: t=0.027344
By Int Index, random lookup: t=6.164063
By String Index, random lookup: t=0.027344
We can draw the following conclusions:
- lookup in a collection by integer index just uses a linear for/next loop with a value comparison. Values at the start of the collection will be retrieved far more quickly than ones at the end
- lookup by string key IS hashed, and it makes no difference where in the array the values lies, although key lookup is at least an order of magnitude slower than direct array lookup would be (we would expect integer lookup for i=2 to be 2-3 times slower than an array lookup)
- the code also checks 10,000 tries of random retrieval, in case of some kind of caching was speeding up the key lookup after the first retrieval, but the results indicate this not to be the case, with the result very similar to the repeated lookup of a single value

Thus, VBA collections can be recommended for hashed string lookup for large collections, however should never be used for integer indexed lookup for large collections.
If integer lookup was required, it would be the best strategy to create an array and copy the values into it, then use the array for positional retrieval.

No comments:

Post a Comment