Created
April 16, 2023 17:00
-
-
Save p1-dta/2c49d5c3eb6bafdd9675bf0bb9001bf1 to your computer and use it in GitHub Desktop.
Count how many people are alive per year, based on their birthyear and deathyear, aims for simplicity.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import collections | |
import random | |
import timeit | |
pop = [ | |
(b := random.randint(1900, 2000), random.randint(b + 1, b + 100)) | |
for _ in range(10000) | |
] | |
# 5.049 s (with 10000 people [1,100] years old) | |
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]: | |
living_per_year = {} | |
for birth, death in population: | |
for year in range(birth, death): | |
if year in living_per_year: | |
living_per_year[year] += 1 | |
else: | |
living_per_year[year] = 1 | |
return living_per_year | |
print(sorted(tuple(count_living_per_year(pop).items()))) | |
print( | |
timeit.timeit( | |
"count_living_per_year(pop)", | |
globals=globals(), | |
number=100, | |
) | |
) | |
# 3.697 s (with 10000 people [1,100] years old) | |
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]: | |
living_per_year = {} | |
for birth, death in population: | |
for year in range(birth, death): | |
living_per_year[year] = living_per_year.get(year, 0) + 1 | |
return living_per_year | |
print(sorted(tuple(count_living_per_year(pop).items()))) | |
print( | |
timeit.timeit( | |
"count_living_per_year(pop)", | |
globals=globals(), | |
number=100, | |
) | |
) | |
# 3.929 s (with 10000 people [1,100] years old) | |
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]: | |
living_per_year = collections.defaultdict(int) | |
for birth, death in population: | |
for year in range(birth, death): | |
living_per_year[year] += 1 | |
return living_per_year | |
print(sorted(tuple(count_living_per_year(pop).items()))) | |
print( | |
timeit.timeit( | |
"count_living_per_year(pop)", | |
globals=globals(), | |
number=100, | |
) | |
) | |
# 4.084 s (with 10000 people [1,100] years old) | |
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]: | |
return collections.Counter( | |
year for birth, death in population for year in range(birth, death) | |
) | |
print(sorted(tuple(count_living_per_year(pop).items()))) | |
print( | |
timeit.timeit( | |
"count_living_per_year(pop)", | |
globals=globals(), | |
number=100, | |
) | |
) |
Here's a faster numpy
implementation:
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]:
return dict(np.stack(np.unique(np.concatenate([np.arange(*item) for item in population]), return_counts=True), axis=1))
Having to return a dict
slows it down a bit.
Here's an updated numpy
method, but it's not as quick as the optimized collections.Counter
method:
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]:
births, deaths = np.array(population).T
def counter(arr, min, max):
temp = np.zeros(((max - min) + 1))
counter = np.unique(np.array(arr), return_counts=True)
temp[counter[0] - min] = counter[1]
return temp
min, max = births.min(), deaths.max()
births_count = counter(births, min, max)
deaths_count = counter(deaths, min, max)
delta_living = births_count - deaths_count
living_per_year = np.cumsum(delta_living)
return dict(np.stack((np.arange(min, max + 1), living_per_year), axis=1))
Looks like the numpy.unique
method is not quite as efficient.
ChatGPT helped speed it up a little more using bincount
instead of unique
.:
def count_living_per_year(population: list[tuple[int, int]]) -> dict[int, int]:
births, deaths = np.array(population).T
min_year = births.min()
max_year = deaths.max()
years = np.arange(min_year, max_year + 1)
birth_counts = np.bincount(births - min_year, minlength=max_year - min_year + 1)
death_counts = np.bincount(deaths - min_year, minlength=max_year - min_year + 1)
living_per_year = np.cumsum(birth_counts - death_counts)
return dict(zip(years, living_per_year))
I looked at bincount
but wasn't sure how to set the minlength
- looks obvious now!
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
A conceptually simpler approach that achieves the same, and gets me down to ~~20 milliseconds on my machine (between 17 and 22 across several runs) or about 0.02s : https://gist.github.com/JSzitas/921726d1a91320991b19331465a1f06e
As an advantage, the years are, obviously, sorted. If you insist that we do not know the range ahead of time, you can just use e.g.
Where you would adjust the min_year and the max_year as new entries were added (and accordingly add to the front or back of the deque). This should still be fairly fast, but I did not go to the trouble of that, because I am not paid to make this fast :) Same reason for why there is no multi-threading or SIMD.
P.S.: If you insist on using using maps you might want to use unordered_map instead of map, your code will be faster.