@inkrement
Created November 21, 2024 15:55
some thoughts about reproducibility

Reproducibility in R and Marketing Research

Ensuring reproducibility in R-based marketing research projects involves adopting practices that enable others (or your future self) to replicate and extend your analysis.

Language-Agnostic Tips

  • Version Control (e.g., Git):

    • Use Git to track changes in your project files. This keeps the project history intact and makes collaboration easier.
    • Always write clear commit messages that describe what changed and why.
    • IMHO: branching, forking, and other advanced Git features are overkill for small projects.
  • Reproducibility:

    • Document your system environment (e.g., operating system, R version, compiler version).
    • Set seeds for random processes (e.g., set.seed(123) in R) to ensure consistent outputs.
    • Save intermediate outputs, such as pre-processed data or model results, to avoid recalculating them unnecessarily. This also makes it easy to compare intermediate outputs between runs.
  • Backups:

    • Leverage platforms like GitHub or GitLab for source code backups.
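    • For larger files (e.g., datasets >100 MB), use dedicated storage outside the repository (e.g., object storage such as AWS S3, or a shared drive such as Google Drive).
    • Consider also committing your R workspace as a serialized object (e.g., an .RData file written by save.image()).
    • Store data in compressed formats (e.g., Zstandard, .zst) to save space; fast codecs like Zstandard can also speed up loading and saving when disk or network I/O is the bottleneck.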
  • Trackability:

    • Use clear labels and file organization for all components of your project. For example:
      • /data for raw and processed datasets.
      • /scripts for analysis scripts.
      • /results for output files, such as visualizations or tables.
    • Clearly document each step in the analysis pipeline.
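
The reproducibility tips above (fixed seeds, saved intermediate outputs, a documented environment) could be combined roughly as follows. This is only a sketch: the `cached()` helper, the `results` folder under `tempdir()`, and the file names are made up for illustration, and `mtcars` stands in for real project data.

```r
# Minimal sketch; directory, file, and helper names are hypothetical.

# 1. Fix the seed so random draws are repeatable
set.seed(123)
draws_a <- rnorm(5)
set.seed(123)
draws_b <- rnorm(5)
identical(draws_a, draws_b)  # same seed, same draws

# 2. Cache an expensive intermediate result on disk
out_dir <- file.path(tempdir(), "results")  # stand-in for the project's /results folder
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the saved result
  }
  value <- compute()       # otherwise compute once ...
  saveRDS(value, path)     # ... and save it for later runs
  value
}

model_path <- file.path(out_dir, "model_fit.rds")
fit <- cached(model_path, function() stats::lm(mpg ~ wt, data = mtcars))

# 3. Record OS, R version, and loaded packages next to the results
env_path <- file.path(out_dir, "session_info.txt")
writeLines(utils::capture.output(utils::sessionInfo()), env_path)
```

Passing the computation as a function means it only runs when no saved result exists. To force a fresh run, delete the `.rds` file.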

R-Specific Tips

  • Avoid Side-Effects:

    • Instead of attaching entire packages with library(), use namespace-qualified calls. For example:
      readr::read_csv("file.csv")
      This avoids masking conflicts between identically named functions in different packages (e.g., dplyr::filter() and stats::filter()).
  • File Formats and Editors:

    • Prefer Quarto (.qmd) over R Markdown (.Rmd) for writing reports and reproducible documents. Quarto supports multiple languages (R, Python, Julia), works in editors beyond RStudio (e.g., VS Code), and its plain-text source diffs cleanly in Git.
    • Avoid Jupyter notebooks (.ipynb) in version-control-heavy workflows: they are JSON files that embed cell outputs and metadata, which makes diffs noisy and merges harder in Git.
  • Environment Versioning:

    • Use renv to snapshot and lock package versions. This lets others replicate your analysis with the same package versions. renv records the R version in the lockfile but does not install R itself.
      renv::init() # Initializes a project-specific library
      renv::snapshot() # Records the state of your library
    • Include the renv.lock file in your version control so collaborators can reinstall the exact package versions with renv::restore().
  • Reproducible Workflows:

    • Organize your code into modular scripts, such as:
      • data_preparation.R
      • analysis.R
      • visualization.R
  • Scalable Collaboration:

    • Write functions for repetitive tasks instead of duplicating code.
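
As a rough illustration of the last tip, combined with namespace-qualified calls, here is one small function reused across several variables instead of copy-pasted summary code. `summarise_metric()` is a hypothetical helper, and the built-in `mtcars` data stands in for real project data.

```r
# Hypothetical helper: one function instead of duplicated summary code per variable.
summarise_metric <- function(x) {
  c(
    mean   = mean(x, na.rm = TRUE),
    median = stats::median(x, na.rm = TRUE),  # namespace-qualified call
    sd     = stats::sd(x, na.rm = TRUE)
  )
}

# Apply the same helper to several columns
metrics <- c("mpg", "hp", "wt")
summary_table <- t(vapply(
  mtcars[metrics],
  summarise_metric,
  FUN.VALUE = c(mean = 0, median = 0, sd = 0)  # expected shape of each result
))

print(round(summary_table, 2))
```

A function like this could live in its own script (e.g., next to data_preparation.R) and be loaded with source(), so a fix to the summary logic only has to be made once.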