The syntax of the Excel MATCH function is MATCH(lookup_value, lookup_array, [match_type]). The 3rd argument of the MATCH function, match_type, is a number that can be 1, 0, or -1. When it is 1 or -1 the range being searched (lookup_array) should be sorted and Excel carries out a binary search. For a value of 1 the data should be sorted in ascending order and the returned result is the index into lookup_array corresponding to the largest value that is less than or equal to the lookup_value. For a value of -1 the lookup_array should be in descending order and the returned index refers to the smallest value that is greater than or equal to the lookup_value. When the 3rd argument is 0, the data do not need to be sorted and Excel executes a linear search, returning the index of an exact match. In all cases, if no valid index satisfies the criterion MATCH returns the error #N/A.
Of interest are the performance aspects of the linear and the binary search. Is the performance of a binary search sufficiently superior to a linear search that even when we want an exact match, it justifies sorting a large data set and using an additional formula to simulate an exact match? Essentially something along the lines of the following: Suppose a 100,000 element unsorted lookup_array is in B with the lookup values in C starting with row 1. Then, after sorting lookup_array in ascending order, enter in D1 =MATCH(C1,$B$1:$B$100000,1) and in E1 =IF(INDEX($B$1:$B$100000,D1)=C1,D1,NA()). Cell E1 will now have the same result as if we had used =MATCH(C1,$B$1:$B$100000,0).
Expectation and Result Summary
To get started, I tested the MATCH function in three configurations:
1) a binary search, i.e., match_type=1,
2) a linear search, i.e., match_type=0, with the lookup_array in sorted order, and
3) a linear search with the lookup_array containing unsorted data.
For each of those configurations, I tested 3 sets of lookup values, where each set had 10,000 values generated randomly. The reason for using three data sets was to work around the possibility that a particular data set was somehow biased. For each of these sets, I timed how long it took to enter the MATCH formula in all the relevant cells in one step and have Excel calculate the results. I repeated this last step 3 times to identify any wrinkles in the process of entering the formula.
What did I expect? Well, the performance of the binary search is of the order of log (N). So, given that the smallest data set tested had 100,000 elements, I expected it to perform the best. Also, it should scale very well as the number of data elements increased. The performance of the linear search is of order N. So, it should take twice as long if the data set being searched doubled in size. I also expected that the performance of the linear search to be independent of whether the lookup_array was sorted or not.
As expected the binary search outperformed the linear search. When searching 100,000 elements the binary search was about 80 times faster and when searching 1 million records it was about 1,650 times faster! Also, the binary search took less than 10% longer searching a million records vs. searching 100,000 records (0.0486 seconds vs. 0.0534 seconds). What was unexpected was that the linear search was faster while searching a sorted lookup_array vs. an unsorted array. I don’t know why. One possibility is that the random numbers were biased towards the lower values. Of course, it was to eliminate such a bias that I generated 3 sets of lookup numbers. So, I cannot account for the performance difference of the linear search when searching a sorted array vs. an unsorted array.
Time (seconds) |
|||
Lookup |
Binary |
Linear |
Linear |
100,000 |
0.0486 |
3.67 |
4.44 |
500,000 |
0.0512 |
41.58 |
44.59 |
1,000,000 |
0.0534 |
86.00 |
92.36 |
Given the performance of the binary search algorithm, it was almost a given that sorting an unsorted array and using 2 formulas to find an exact match would be faster than the linear search. Implementing the algorithm resulted in speeds 10 to 50 times faster than for a linear search, the former for 100,000 records and the latter for 1,000,000 records.
Simulated Exact Match |
|
Lookup Range |
Time (seconds) |
100,000 |
0.38 |
500,000 |
0.92 |
1,000,000 |
1.68 |
As mentioned above, I tested 3 configurations: Binary Search, Linear Search with the lookup_array sorted, and Linear Search with the lookup_array unsorted. These algorithms were applied to 3 ranges of lookup_array with 100,000, 500,000, and 1,000,000 million records respectively. For each, I generated 3 lookup_value data sets, each 10,000 elements large.
In the results below, HowMany indicates the size of the lookup_array, SearchCount the number of elements being looked up, UseBinary indicates if I was using a binary search and OrigSorted indicates whether the lookup_array was sorted or not. For a binary search, the array had to be sorted and, consequently, the state was predetermined.
The Search Set column indicates the 3 data sets used for the lookup values (10,000 elements each). The Iteration column is the 3 iterations of entering the MATCH function for the 10,000 elements. The Time column is the time it took to get the formulas entered. Excel was in Automatic Calculation mode. The hardware configuration was a Core 2 Duo processor with 4GB of memory. The software was Windows 7 64bit and Excel 2010 RC 64bit.
HowMany SearchCount UseBinary OrigSorted
100000 10000 True N/A
Search Set Iteration Time
1 1 6.640625E-02
1 2 0.0546875
1 3 0.046875
2 1 4.296875E-02
2 2 4.296875E-02
2 3 0.046875
3 1 0.046875
3 2 4.296875E-02
3 3 0.046875
>>> Done
HowMany SearchCount UseBinary OrigSorted
100000 10000 False True
Search Set Iteration Time
1 1 3.632813
1 2 3.605469
1 3 3.601563
2 1 3.609375
2 2 4.035156
2 3 3.757813
3 1 3.582031
3 2 3.597656
3 3 3.59375
>>> Done
HowMany SearchCount UseBinary OrigSorted
100000 10000 False False
Search Set Iteration Time
1 1 4.359375
1 2 4.347656
1 3 4.339844
2 1 4.34375
2 2 4.410156
2 3 4.472656
3 1 4.757813
3 2 4.570313
3 3 4.339844
>>> Done
HowMany SearchCount UseBinary OrigSorted
500000 10000 True N/A
Search Set Iteration Time
1 1 5.859375E-02
1 2 5.078125E-02
1 3 0.046875
2 1 5.078125E-02
2 2 5.078125E-02
2 3 5.078125E-02
3 1 5.078125E-02
3 2 0.0546875
3 3 0.046875
>>> Done
HowMany SearchCount UseBinary OrigSorted
500000 10000 False True
Search Set Iteration Time
1 1 41.57813
1 2 41.96484
1 3 42.10156
2 1 41.58984
2 2 41.92969
2 3 41.67578
3 1 41.17969
3 2 41.14844
3 3 41.01953
>>> Done
HowMany SearchCount UseBinary OrigSorted
500000 10000 False False
Search Set Iteration Time
1 1 44.25391
1 2 43.82422
1 3 44.24609
2 1 44.89844
2 2 44.41406
2 3 44.94922
3 1 44.875
3 2 44.40625
3 3 45.42578
>>> Done
HowMany SearchCount UseBinary OrigSorted
1000000 10000 True N/A
Search Set Iteration Time
1 1 0.0546875
1 2 0.0546875
1 3 5.078125E-02
2 1 0.0546875
2 2 5.078125E-02
2 3 5.078125E-02
3 1 5.859375E-02
3 2 0.0546875
3 3 5.078125E-02
>>> Done
HowMany SearchCount UseBinary OrigSorted
1000000 10000 False True
Search Set Iteration Time
1 1 86.00781
1 2 85.61328
1 3 85.59375
2 1 85.5625
2 2 85.76563
2 3 85.49219
3 1 86.49219
3 2 86.57422
3 3 86.88672
>>> Done
HowMany SearchCount UseBinary OrigSorted
1000000 10000 False False
Search Set Iteration Time
1 1 92.28516
1 2 92.25391
1 3 92.60547
2 1 91.91016
2 2 92.23828
2 3 92.67578
3 1 92.57031
3 2 92.53906
3 3 92.14063
>>> Done
Exact Match HowMany SearchCount
1000000 10000
Search Set Iteration Time
1 1 1.601563
1 2 1.722656
1 3 1.734375
2 1 1.597656
2 2 1.726563
2 3 1.71875
3 1 1.609375
3 2 1.726563
3 3 1.714844
>>> Done
Exact Match HowMany SearchCount
500000 10000
Search Set Iteration Time
1 1 0.859375
1 2 0.953125
1 3 0.953125
2 1 0.8359375
2 2 0.9570313
2 3 0.9648438
3 1 0.8398438
3 2 0.9648438
3 3 0.9609375
>>> Done
Exact Match HowMany SearchCount
100000 10000
Search Set Iteration Time
1 1 0.3085938
1 2 0.4140625
1 3 0.4101563
2 1 0.3203125
2 2 0.40625
2 3 0.4257813
3 1 0.3046875
3 2 0.4101563
3 3 0.3984375
>>> Done
The code is modularized. If someone wants to use it for more tests, it should be fairly easy to reuse components or change it.
Option Explicit
Option Base 0
Sub generateData(ByRef Where As Range, _
ByVal HowMany As Long, ByVal Sorted As Boolean, _
Optional ByVal SimData As Boolean = False)
If SimData Then Sorted = False
Dim Arr: ReDim Arr(HowMany - 1, 0)
Dim I As Long
For I = LBound(Arr, 1) To UBound(Arr, 1)
Arr(I, 0) = IIf(SimData, I * 2, I)
Next I
Dim Dest As Range: Set Dest = Where.Resize(HowMany, 1)
Dest.Value = Arr
If Not Sorted Then
Dest.Offset(0, 1).FormulaR1C1 = "=RAND()"
With Where.Parent
.Sort.SortFields.Clear
.Sort.SortFields.Add Key:=Dest.Offset(0, 1), SortOn:=xlSortOnValues, _
Order:=xlAscending, DataOption:=xlSortNormal
With .Sort
.SetRange Dest.Resize(, 2)
.Header = xlGuess
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
End With
Dest.Offset(0, 1).Delete Shift:=xlToLeft
End If
End Sub
Sub generateTargets(ByRef Where As Range, _
ByVal HowMany As Long, ByVal Rng As Long)
Dim Rslt() As Long: ReDim Rslt(HowMany - 1, 0)
Dim I As Long
For I = LBound(Rslt) To UBound(Rslt)
Rslt(I, 0) = Int(Rnd() * Rng)
Next I
Where.Resize(HowMany, 1).Value = Rslt
End Sub
Sub Test()
Cells.Delete
Const HowMany As Long = 1000000, SearchEleCount As Long = 10000, _
OrigSorted As Boolean = False, UseBinary As Boolean = False
Debug.Print "HowMany", "SearchCount", "UseBinary", "OrigSorted"
Debug.Print HowMany, SearchEleCount, UseBinary, IIf(UseBinary, "N/A", OrigSorted)
'MsgBox HowMany & "," & SearchEleCount & "," & UseBinary
generateData Cells(1, 2), HowMany, IIf(UseBinary, True, OrigSorted)
Dim I As Integer
Debug.Print "Search Set", "Iteration", "Time"
For I = 1 To 3
Dim Targets() As Long
Columns(3).Resize(, 2).Delete
generateTargets Cells(1, 3), SearchEleCount, HowMany
Dim J As Integer, AccumTime As Single
For J = 1 To 3
Dim StartTime As Single
StartTime = Timer
Cells(1, 4).Resize(SearchEleCount, 1).FormulaR1C1 = _
IIf(UseBinary, "=MATCH(RC[-1]," _
& Cells(1, 2).Resize(HowMany, 1).Address(True, True, xlR1C1) _
& ",1)", _
"=MATCH(RC[-1]," _
& Cells(1, 2).Resize(HowMany, 1).Address(True, True, xlR1C1) _
& ",0)")
Debug.Print I, J, Timer - StartTime
Next J
Next I
Debug.Print ">>> Done"
End Sub
The code to simulate the exact match via the sort and 2 formulas is below.
Sub sortAscending(ColRng As Range)
With ActiveSheet
.Sort.SortFields.Clear
.Sort.SortFields.Add Key:=ColRng, SortOn:=xlSortOnValues, Order:=xlAscending, _
DataOption:=xlSortNormal
With .Sort
.SetRange ColRng
.Header = xlGuess
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
End With
End Sub
Sub simulateExactSearch()
Const HowMany As Long = 100000, SearchCount As Long = 10000, _
OrigSorted As Boolean = False, UseBinary As Boolean = False
Debug.Print "Exact Match", "HowMany", "SearchCount"
Debug.Print , HowMany, SearchCount
Cells.Delete
generateData Cells(1, 2), HowMany, False, True
Dim SaveArr: SaveArr = Cells(1, 2).Resize(HowMany).Value
Debug.Print "Search Set", "Iteration", "Cumulative Time"
Dim I As Integer
For I = 1 To 3
Columns(3).Resize(, 3).Delete
generateTargets Cells(1, 3), SearchCount, HowMany * 2
Dim J As Integer
For J = 1 To 3
Dim StartTime As Single: StartTime = Timer
sortAscending Cells(1, 2).Resize(HowMany, 1)
Cells(1, 4).Resize(SearchCount, 1).FormulaR1C1 = _
"=MATCH(RC[-1]," _
& Cells(1, 2).Resize(HowMany, 1).Address(True, True, xlR1C1) _
& ",1)"
Cells(1, 5).Resize(SearchCount, 1).FormulaR1C1 = _
"=IF(INDEX(" _
& Cells(1, 2).Resize(HowMany, 1).Address(True, True, xlR1C1) _
& ",RC[-1])=RC[-2],RC[-1],NA())"
Debug.Print I, J, Timer - StartTime
Cells(1, 2).Resize(HowMany).Value = SaveArr
Next J
Next I
Debug.Print ">>> Done"
End Sub