To train our neural sequencer model properly, we need the ordered sequence of each user's interactions with articles over a given period of time. The input dataset for model training can be described as:
The last X articles the user has interacted with in the last Y days in chronological order.
X: integer variable in the range [0, 20], inclusive
Y: integer variable in the range [20, 50], inclusive
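As an illustration of the feature definition above, here is a minimal plain-Python sketch for a single user. The function name `last_x_articles`, the `(timestamp, article_id)` event shape and the reference time `now` are assumptions made for this example only; they are not taken from the existing pipeline.

```python
from datetime import datetime, timedelta


def last_x_articles(events, now, x, y_days):
    """Return the last `x` article ids seen in the last `y_days` days, oldest first.

    `events` is an iterable of (timestamp, article_id) tuples for ONE user
    (hypothetical shape, assumed for this sketch).
    """
    if x <= 0:
        # X may legitimately be 0; note that recent[-0:] would return everything.
        return []
    cutoff = now - timedelta(days=y_days)
    # Keep only the interactions inside the Y-day lookback.
    recent = [(ts, article) for ts, article in events if cutoff <= ts <= now]
    # Chronological order, then keep the X most recent ones.
    recent.sort(key=lambda event: event[0])
    return [article for _, article in recent[-x:]]


now = datetime(2024, 1, 31)
events = [
    (datetime(2024, 1, 30), "c"),
    (datetime(2023, 12, 1), "a"),  # older than Y days, dropped
    (datetime(2024, 1, 10), "b"),
    (datetime(2024, 1, 20), "d"),
]
feature = last_x_articles(events, now, x=2, y_days=30)
```

With these sample events, `feature` holds the two most recent articles inside the 30-day lookback, in chronological order.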
To generate the above-mentioned feature, we use an Apache Beam pipeline with a session window strategy. According to the investigation done by Axel (BTCACE-638: PDP Usecase - identify optimal session window duration, status: CLOSED), we cannot identify an optimal session window gap that fulfils our requirement. We therefore need an alternative to address this issue.
One alternative is a sliding window strategy. It guarantees that we get correct features for the Y-day duration, but because each event is duplicated into every window that contains it, we would hold about Y times the data in memory. This is a working but not optimal solution.
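The duplication can be made concrete with a small plain-Python sketch that lists the sliding windows containing one event. It assumes a window size of Y days and a window period of one day (the period is an assumption; the ticket does not state it) and does not use the Beam API.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sliding_windows_for(ts, size, period):
    """Return every [start, end) window of length `size`, started every
    `period` seconds, that contains the timestamp `ts` (all in seconds)."""
    # Start of the latest window that begins at or before ts.
    last_start = ts - (ts % period)
    # Walk backwards one period at a time while the window still covers ts.
    return [
        (start, start + size)
        for start in range(last_start, last_start - size, -period)
    ]


y_days = 30  # example value of Y
event_ts = 100 * SECONDS_PER_DAY + 3600  # an arbitrary event timestamp
windows = sliding_windows_for(
    event_ts, size=y_days * SECONDS_PER_DAY, period=SECONDS_PER_DAY
)
copies = len(windows)  # one copy of the event per window
```

Under these assumptions a single event belongs to Y windows, which is where the factor of about Y in memory comes from.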
Another alternative, which would be the optimal solution, is to create a custom Beam window strategy so that we get only one window per user for the last Y days.
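The intended assignment rule can be sketched in plain Python as follows. This is only an illustration of the idea, not a Beam implementation: the function name, the fixed reference time `run_ts` and the choice to drop out-of-range events are all assumptions. In Beam, logic of this kind would presumably live in a custom `WindowFn`, and whether that is feasible is exactly what this ticket has to determine.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def assign_single_window(event_ts, run_ts, y_days):
    """Assign an event to at most one [start, end) window covering the last
    `y_days` days before `run_ts` (all timestamps in seconds).

    Hypothetical helper for illustration; not part of the Beam API.
    """
    start = run_ts - y_days * SECONDS_PER_DAY
    if start <= event_ts < run_ts:
        # Every in-range event lands in the same single window.
        return [(start, run_ts)]
    # Events outside the lookback get no window (assumed behaviour).
    return []


run_ts = 200 * SECONDS_PER_DAY
in_range = assign_single_window(190 * SECONDS_PER_DAY, run_ts, y_days=30)
too_old = assign_single_window(100 * SECONDS_PER_DAY, run_ts, y_days=30)
```

Compared with the sliding window strategy, each event is held once instead of once per overlapping window.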
However, we lack experience in this area. In this ticket, we want to assess the feasibility of the custom window strategy.
- The feasibility of the custom window strategy is determined.
- If it is feasible to create a custom window strategy:
  - the solution is proposed, discussed and agreed with the team
  - the custom window strategy is documented
  - the custom window strategy is implemented and verified in the production environment (can be split into a separate ticket)