Let’s start with a practical example:
We have a list of IDs that need to be processed. This list can hold from just a dozen of IDs to up to a few 100,000 IDs. Each ID is processed only once in each iteration and the order of IDs is not important.
This whole thing is presenting a bottleneck in our application, so we decided to take a look at other iterable data structures that Python offers.
To demonstrate this, we will create a simple example:
- Function that generates a shuffled list of positive and negative numbers:
1
2
3
4
5
6
7
8
9
10
11
12
13
| from random import shuffle
def get_numbers(power=2):
low_end = -10 ** power
high_end = 10 ** power
# generate list
arr = [i for i in range(low_end, high_end)]
# shuffle list
shuffle(arr)
# remove 25% of numbers, less numbers will have it's pair
remove_n_elements = int(len(arr) * .25)
arr = arr[remove_n_elements:]
return arr
|
This will return:
1
2
| get_numbers(2) # list of 150 unique numbers, between -100 and 100
get_numbers(3) # list of 1500 unique numbers, between -1000 and 1000
|
- Then we will define a simple function that will iterate through list, tuple or set and find pairs, where the sum of both numbers equals sum:
1
2
3
4
5
6
| def find_pairs(arr, sum):
pairs = {}
for i in arr:
if (sum - i) in arr:
pairs.add((i, sum - i))
return pairs
|
- And let’s also create a function that will measure the execution time of find_pairs() function:
1
2
3
4
5
6
7
| from time import time
def measure_time(nums, sum):
start = time()
pairs = find_pairs(nums, sum)
end = time()
return end - start # seconds
|
Okay, now we can measure time, based on number of elements.
150 numbers:
1
2
3
4
| numbers = get_numbers(2) # list of 150 numbers
measure_time(numbers, 17) # ~ 0.000271 seconds
measure_time(tuple(numbers), 17) # ~ 0.000224 seconds
measure_time(set(numbers), 17) # ~ 0.000046 seconds
|
With 150 numbers, using set is ~5.89 times faster than using list.
1,500 numbers:
1
2
3
4
| numbers = get_numbers(3) # list of 1,500 numbers
measure_time(numbers, 17) # ~ 0.025494 seconds
measure_time(tuple(numbers), 17) # ~ 0.026308 seconds
measure_time(set(numbers), 17) # ~ 0.000418 seconds
|
With 1,500 numbers, using set is ~60.99 times faster than using list.
15,000 numbers:
1
2
3
4
| numbers = get_numbers(4) # list of 15,000 numbers
measure_time(numbers, 17) # ~ 3.825252 seconds
measure_time(tuple(numbers), 17) # ~ 3.786150 seconds
measure_time(set(numbers), 17) # ~ 0.006523 seconds
|
With 15,000 numbers, using set is ~586.42 times faster than using list.
150,000 numbers:
1
2
3
4
| numbers = get_numbers(5) # list of 150,000 numbers
measure_time(numbers, 17) # ~ 834.580523 s = 13 min 54.58 sec
measure_time(tuple(numbers), 17) # ~ 854.708630 s = 14 min 14.70 sec
measure_time(set(numbers), 17) # ~ 0.07081294 seconds
|
With 150,000 numbers, using set is ~11,785.70 times faster than using list.
Conclusion:
list and tuple work pretty much with same speed, but set made a big difference, here.
By using the set instead of list we significantly sped up the code.