This paper presents an impressive technical achievement in combining survey and administrative data through machine learning methods. The scale of the calibration exercise (over 7,000 targets) and the sophisticated use of quantile regression forests represent significant advances in microsimulation methodology. From a European perspective, where we have extensive experience with data fusion and reweighting in models like EUROMOD and MIDAS, I offer the following observations.
While the open-source nature of the implementation is commendable, several methodological details need clarification:
- The QRF hyperparameters are not fully specified. What values were used for forest size, tree depth, and minimum samples per leaf? See the sketch after this list for the kind of specification I would expect.
- The dropout regularization approach is interesting but needs theoretical justification. Why 5%? Was this optimized?
- The L0 regularization path mentioned in the code should be discussed in the paper
- Consider providing a more detailed algorithmic description, perhaps in pseudo-code
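To illustrate the level of detail I have in mind, here is a minimal sketch of the hyperparameters that should be reported. I am assuming an interface along the lines of the quantile-forest package's RandomForestQuantileRegressor; the authors' actual implementation may differ, and the values below are purely illustrative, not recommendations.

```python
# Illustrative only: the hyperparameters a reader would need to reproduce the
# QRF step, assuming an interface like quantile-forest's RandomForestQuantileRegressor.
from quantile_forest import RandomForestQuantileRegressor

qrf = RandomForestQuantileRegressor(
    n_estimators=100,     # forest size
    max_depth=None,       # tree depth (None = grow until leaves are pure)
    min_samples_leaf=1,   # minimum samples per leaf
    random_state=0,       # seed (relevant to the stability question below)
)
# qrf.fit(X_donor, y_donor)
# y_imputed = qrf.predict(X_recipient, quantiles=[0.1, 0.5, 0.9])
```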
The validation against 7,000+ targets is impressive but raises concerns:
- With so many targets, some will agree with the benchmarks by chance alone. Consider adjusting for multiple testing; see the sketch after this list
- The paper should discuss the relative importance of different targets. Are all targets weighted equally?
- Cross-validation or out-of-sample testing would strengthen confidence in the approach
- How stable are the results across different random seeds?
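On the multiple-testing point, something along the following lines would suffice. This is a sketch under my own assumptions: p_values is a hypothetical array of per-target discrepancy p-values, and the placeholder values are random.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical: one p-value per calibration target, testing whether the
# enhanced dataset's estimate differs from the administrative benchmark.
rng = np.random.default_rng(0)
p_values = rng.uniform(size=7_000)  # placeholder for the authors' actual p-values

# Benjamini-Hochberg controls the false discovery rate across all targets,
# so reported agreement is not driven by chance alone.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{int(reject.sum())} of {len(p_values)} targets show a significant discrepancy after FDR control")
```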
Coming from a dynamic microsimulation background, I'm concerned about the static nature of the enhancement:
- How will relationships between variables evolve over time?
- The 2015-2024 gap is problematic for forward-looking policy analysis
- Consider discussing how the enhanced dataset could feed into dynamic models
- What are the implications for modeling behavioral responses to policy changes?
For the international microsimulation community, it would be valuable to discuss:
- How transferable is this methodology to other countries?
- What are the minimum data requirements (survey and administrative) for applying this approach?
- Could this method help harmonize microsimulation datasets across countries?
- How does this compare to European experiences with combining EU-SILC with administrative data?
- Convergence diagnostics: The gradient descent optimization should include formal convergence criteria beyond a fixed iteration count; see the first sketch after this list.
- Weight distribution: The extreme variation in weights (some zero, high standard deviation) is concerning. Have you examined the effective sample size? Consider trimming extreme weights; see the second sketch after this list.
- Sparse solutions: The brief mention of L0 regularization suggests you've explored sparse weighting schemes. This deserves fuller treatment, as it could address the weight-variance issue.
- Missing data: How are missing values in the original datasets handled before imputation?
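On convergence diagnostics, a minimal sketch of what I mean follows. The interface (loss_fn, grad_fn, the learning rate, and the tolerances) is hypothetical and not taken from the paper; the point is simply that the loop should stop on an explicit criterion, not only on an iteration cap.

```python
import numpy as np

def reweight(loss_fn, grad_fn, w0, lr=0.1, max_iter=5_000,
             rel_tol=1e-8, grad_tol=1e-6):
    """Gradient descent with explicit stopping rules (hypothetical interface)."""
    w = np.asarray(w0, dtype=float).copy()
    prev_loss = loss_fn(w)
    for it in range(1, max_iter + 1):
        g = grad_fn(w)
        w = w - lr * g
        loss = loss_fn(w)
        if np.linalg.norm(g) < grad_tol:
            return w, it, "gradient norm below tolerance"
        if abs(prev_loss - loss) < rel_tol * max(abs(prev_loss), 1.0):
            return w, it, "relative loss change below tolerance"
        prev_loss = loss
    return w, max_iter, "iteration cap reached without converging"

# Toy quadratic objective as a stand-in for the real calibration loss.
target = np.array([3.0, -1.0])
w, n_iter, reason = reweight(lambda w: 0.5 * np.sum((w - target) ** 2),
                             lambda w: w - target, np.zeros(2))
print(n_iter, reason)
```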
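On the weight distribution, the effective sample size and a simple trimming check can be reported in a few lines. The lognormal weights and the 99th-percentile cap below are my own illustrative choices, not the paper's.

```python
import numpy as np

def effective_sample_size(w):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def trim_weights(w, upper_quantile=0.99):
    """Cap weights at a quantile, then rescale so the weighted total is unchanged."""
    w = np.asarray(w, dtype=float)
    capped = np.minimum(w, np.quantile(w, upper_quantile))
    return capped * (w.sum() / capped.sum())

# Illustrative heavy-tailed weights, not the paper's actual weights.
w = np.random.default_rng(0).lognormal(mean=0.0, sigma=2.0, size=10_000)
print(effective_sample_size(w), effective_sample_size(trim_weights(w)))
```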
- The notation could be more consistent (e.g., bold for vectors/matrices).
- Some figures are hard to read in black and white printing.
- The computational requirements (time, memory) should be documented.
- Consider adding a flowchart of the overall methodology.
This paper makes significant methodological contributions that will be of great interest to the microsimulation community. The technical innovation is substantial, and the open-source implementation is particularly valuable. However, the paper needs revision to address the validation concerns and provide more methodological detail. With these improvements, it would make an excellent addition to IJM.
I particularly encourage the authors to position their work more explicitly within the international microsimulation literature and to discuss the broader applicability of their methods beyond the US context.