
@dvu4
dvu4 / subtract_vs_left_join_in_pyspark.md
Last active April 23, 2025 19:30
Troubleshooting splitting data into two separate groups with subtract() vs. left join

This approach can lead to duplicate rows in df_holdout and df_target:

  • subtract() is row-based and requires an exact row match.

  • If df has duplicate rows, subtract() doesn't guarantee that only one instance is removed.

# Filter 20% of the data for the holdout group
df_holdout = df.sample(fraction=0.2, seed=42)
# Everything else becomes the target group
df_target = df.subtract(df_holdout)
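
A safer split uses a unique id plus a left anti join instead of subtract(). A minimal sketch (the row_id column is an assumption added for illustration, not from the original gist):

import pyspark.sql.functions as F

# Tag each row with a unique id so duplicate rows stay distinguishable
df = df.withColumn("row_id", F.monotonically_increasing_id())
df_holdout = df.sample(fraction=0.2, seed=42)

# Keep only the rows whose id is absent from the holdout set
df_target = df.join(df_holdout.select("row_id"), on="row_id", how="left_anti")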
    
@dvu4
dvu4 / write_count_record_to_file.md
Last active April 23, 2025 18:51
Troubleshooting writing the total row count to a .ok file in PySpark

Write the record count to a .ok file

This code is functionally correct, but it's inefficient and overly complex for what it does: writing a single integer (record count) to a file.

def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):

    ok_output_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/tmp_ok_output/"
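
A simpler alternative is to count on the driver and write the number directly. A minimal sketch, assuming a Databricks environment where dbutils is available (the output path is illustrative):

# Compute the count on the driver; it is a single integer, no DataFrame write needed
count = df.count()
ok_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/output.ok"
dbutils.fs.put(ok_path, str(count), True)  # True = overwrite if the file exists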

Check the GPU version on a Mac

1. Check the GPU info

system_profiler SPDisplaysDataType

This will display something like:
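(Illustrative output from an Apple silicon Mac; the exact fields vary by model:)

Graphics/Displays:

    Apple M1:

      Chipset Model: Apple M1
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 8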

Check the Mac chip version

1. Check the Chip Type

sysctl -n machdep.cpu.brand_string

This will display something like:
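(Illustrative output; an Intel Mac prints the CPU model instead, e.g. Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz:)

Apple M1 Pro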

@dvu4
dvu4 / insert_in_sublime.md
Last active April 8, 2025 23:56
Insert text at the beginning or end of every line in Sublime Text

Insert text at the beginning or end of every line in Sublime Text

Context:

I have 1,000 lines of text structured like the following:

f54g
f5g546
2122v
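
The usual approach, sketched here since the gist preview is truncated: open Find → Replace and enable regular expressions, then use a pattern like

Find:    ^(.*)$
Replace: '$1',

Here $1 is the captured line, so this wraps every line in quotes and appends a comma; match ^ alone to insert text only at the start of each line, or $ alone for the end.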
@dvu4
dvu4 / create_pem_file.md
Last active December 23, 2024 22:11
Create a certificate .pem file from .p12

Generate .pem File from .p12 Certificate

This guide explains how to extract a .pem file from a .p12 file using OpenSSL and troubleshoot common errors encountered during the process.

1. Generate the .pem File

Run the following command to extract the .pem file:

openssl pkcs12 -in /Users/dvuiw/Desktop/customer.p12 -nokeys -out /Users/dvuiw/Desktop/certificate.pem -nodes -password pass:123456789
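
If the private key is also needed, a similar command extracts it (a sketch reusing the paths above; key.pem is an illustrative name):

openssl pkcs12 -in /Users/dvuiw/Desktop/customer.p12 -nocerts -out /Users/dvuiw/Desktop/key.pem -nodes -password pass:123456789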
@dvu4
dvu4 / Divide the column containing JSON string into separate columns in PySpark.md
Created August 15, 2024 21:46
Divide the column containing JSON string into separate columns in PySpark

Divide the column containing JSON string into separate columns in PySpark

  • Create a DataFrame
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.window import Window

# Number the rows within each (rxa_claim_id, ITEM_PRODUCT_CODE) group, newest event first
window_spec = Window.partitionBy("rxa_claim_id", "ITEM_PRODUCT_CODE").orderBy(F.col("EVENT_TIMESTAMP").desc())
df_valid_claim = df_valid_claim.withColumn("row_number", F.row_number().over(window_spec))
df_valid_claim.display()

# Count how many rows each group contains
window_spec_2 = Window.partitionBy("rxa_claim_id", "ITEM_PRODUCT_CODE")
df_valid_claim = df_valid_claim.withColumn("row_number_count", F.count("row_number").over(window_spec_2))
df_valid_claim.display()
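
For the JSON-splitting task in the title, the standard approach is from_json with an explicit schema. A minimal sketch (the payload column and its fields are hypothetical, not from the original gist):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one column holding a JSON string
df = spark.createDataFrame([('{"name": "Ann", "city": "Austin"}',)], ["payload"])

schema = StructType([
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
])

# Parse the JSON string, then promote its fields to top-level columns
df = df.withColumn("parsed", F.from_json(F.col("payload"), schema))
df.select("parsed.*").show()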
@dvu4
dvu4 / demo-mermaid.md
Last active October 17, 2024 18:30
Draw Diagrams in Markdown with Mermaid

Convert a date in MM/dd/yyyy format to yyyy-MM-dd

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, unix_timestamp

# Initialize Spark session
spark = SparkSession.builder.appName("date_format_conversion").getOrCreate()

# Example data
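data = [("01/15/2024",), ("12/03/2023",)]  # illustrative values; the original gist is truncated here
df = spark.createDataFrame(data, ["date_str"])

# Parse MM/dd/yyyy into a unix timestamp, then re-format it as yyyy-MM-dd
df = df.withColumn(
    "date_formatted",
    from_unixtime(unix_timestamp("date_str", "MM/dd/yyyy"), "yyyy-MM-dd"),
)
df.show()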