Irene Steves isteves

@isteves
isteves / pkg_db_connection.md
Last active January 10, 2021 09:20
Managing a DB connection in an R package

In our department, there's almost always just a single database that we want to connect to. Thus, managing the connection throughout our code quickly becomes annoying and redundant:

conn <- odbc::dbConnect(odbc::odbc(), ...)

DBI::dbGetQuery(conn, statement1)
DBI::dbGetQuery(conn, statement2)
DBI::dbGetQuery(conn, statement3)
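The same pattern carries over to other stacks. Here is a minimal Python sketch, using sqlite3 and hypothetical `get_conn()` / `get_query()` helpers, of a module that caches one shared connection so callers never have to pass it around:

```python
import sqlite3

_conn = None  # module-level cache, analogous to a package-private connection object

def get_conn(path=":memory:"):
    """Return the shared connection, opening it on first use."""
    global _conn
    if _conn is None:
        _conn = sqlite3.connect(path)
    return _conn

def get_query(statement):
    """Run a query against the shared connection."""
    return get_conn().execute(statement).fetchall()
```

Because `_conn` is cached at module level, every call to `get_query()` reuses the same connection, mirroring a package-private connection object in R.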
@isteves
isteves / glue_in_function.md
Created January 26, 2021 19:31
Using glue::glue() inside of another function

Using glue inside of another function

The key is passing the calling environment via the .envir argument!

test_glue <- function(cmd, e = parent.frame()) {
  crayon::red(glue::glue(cmd, .envir = e))
}

test_fxn <- function(name) {
@isteves
isteves / neo4j.md
Last active October 31, 2021 17:42
neo4j learnings

Undirected: (a)-[r]-(b)
Directed: (a)-[r]->(b)

where a and b are nodes and r is the relationship (link) between them.

In the following calls, the curly brackets hold extra parameters (JSON form):

CALL apoc.import.graphml("file://graph.graphml", {})
CALL apoc.import.graphml("file://graph.graphml", {readLabels: true})

There are properties and labels. Labels are what you see as different colors in neo4j, and they are defined in a graphml file as shown below (see ":Person"). Properties are other attributes that you can query by, such as age ("> 30 years old").
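As a purely illustrative (hypothetical) graphml fragment, a node's label typically appears in a labels attribute or data entry, while properties like name and age are ordinary data entries:

```xml
<node id="n0" labels=":Person">
  <data key="labels">:Person</data>
  <data key="name">Alice</data>
  <data key="age">42</data>
</node>
```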

@isteves
isteves / resources.md
Last active January 25, 2022 09:39
Resource collection
@isteves
isteves / pyspark_tricks.md
Last active May 25, 2022 11:40
PySpark tricks

PySpark tricks

"Exploding" aggregations

If you want to apply the same aggregation to many columns, you can write it this way to be more succinct:

cols_min = ["size", "age"]

df \
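The snippet above is truncated, so here is a pure-Python sketch of the same idea (hypothetical data; in PySpark the pattern would be built with a comprehension passed to df.agg): generate the per-column aggregations programmatically instead of writing one line per column.

```python
rows = [{"size": 3, "age": 40}, {"size": 1, "age": 25}]
cols_min = ["size", "age"]

# One comprehension covers every column, instead of one line per aggregation.
mins = {f"min_{c}": min(r[c] for r in rows) for c in cols_min}
print(mins)  # {'min_size': 1, 'min_age': 25}
```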
@isteves
isteves / tidyverse2pyspark.md
Last active February 28, 2023 13:34
tidyverse2pyspark_translation

Tidyverse to pyspark translations

Adding count of a column as a new column

df %>% add_count(some_col)

# assumes: from pyspark.sql.functions import count
#          from pyspark.sql.window import Window
df.withColumn("n", count("*").over(Window.partitionBy("some_col")))
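To make the semantics concrete, here is a pure-Python sketch (hypothetical data) of what add_count / the windowed count produces: every row gains an n column holding the size of its group.

```python
from collections import Counter

rows = [{"some_col": "a"}, {"some_col": "a"}, {"some_col": "b"}]

# Count rows per group, then attach the group size to each row, like add_count().
counts = Counter(r["some_col"] for r in rows)
with_n = [{**r, "n": counts[r["some_col"]]} for r in rows]
print(with_n)  # [{'some_col': 'a', 'n': 2}, {'some_col': 'a', 'n': 2}, {'some_col': 'b', 'n': 1}]
```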