I’m querying an issue tracking system for a set of issues.
The way the data comes back, I know when each issue was created and when it was changed, so I can iterate over all lifetime events for all issues.
I want to do some calculations like “how many open issues were there in each week from the start of the project until today”.
To do this, I’m creating a DataFrame, initially populated with NaNs, with a MultiIndex over all the days the project has been live and all the issue keys found (some of which won’t have existed on every day):
import datetime
import numpy as np
import pandas as pd

lifetime_days = pd.date_range(start_date, datetime.date.today())
issue_keys = sorted(issue.key for issue in issues)
index = pd.MultiIndex.from_product([lifetime_days, issue_keys], names=['date', 'key'])
df = pd.DataFrame(np.nan, index=index, columns=['status', 'resolved'])
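For reference, from_product lays the rows out date-major, which matters for the fill question at the end. With two hypothetical keys:

index[:4]
# [(day 1, 'PROJ-1'), (day 1, 'PROJ-2'),
#  (day 2, 'PROJ-1'), (day 2, 'PROJ-2')]
# i.e. dates vary slowest, so adjacent rows are usually different keys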
The plan then is to iterate over all the changes, inserting each event at the relevant (date, key) index and recording its status and resolved values; then use ffill() to forward-fill the missing values, resample to weekly frequency, and use groupby operations to calculate things (number of issues by state each week; total number of issues by week), roughly as sketched below.
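In this sketch, change.date, change.status and change.resolved are placeholders for whatever my tracker client actually exposes:

for issue in issues:
    for change in issue.changes:
        # Record the issue's state as of the day of this change.
        df.loc[(change.date, issue.key), 'status'] = change.status
        df.loc[(change.date, issue.key), 'resolved'] = change.resolved

# Carry each issue's last-known state forward between changes
# (whether a plain ffill() does the right thing here is my sub-question below).
df = df.ffill()

# pd.Grouper(freq='W') plays the role of the weekly resample here:
# issues per state per week, and issues with any recorded state per week.
weekly_by_status = df.groupby([pd.Grouper(level='date', freq='W'), 'status']).size()
weekly_total = df.groupby(pd.Grouper(level='date', freq='W'))['status'].count()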
Question: does this sound like a sensible way to use Pandas? Sub-question: does ffill() work in this case? I need it to pad values based on the previous day’s value for the same issue key, not the previous issue key’s value on the same day, if that makes sense.
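To make that concrete, here’s a toy frame (the keys and dates are invented). My worry is that on the date-major layout a plain ffill() pulls from the previous row, which is a different key on the same day; grouping by the key level first looks like it gives the per-key fill I’m after, but I’d like to confirm that’s the idiomatic approach:

idx = pd.MultiIndex.from_product(
    [pd.date_range('2024-01-01', periods=3), ['PROJ-1', 'PROJ-2']],
    names=['date', 'key'])
toy = pd.DataFrame(index=idx, columns=['status'], dtype='object')
toy.loc[(pd.Timestamp('2024-01-01'), 'PROJ-1'), 'status'] = 'Open'

toy.ffill()                       # 'Open' leaks into the PROJ-2 rows
toy.groupby(level='key').ffill()  # 'Open' fills down PROJ-1 only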