
@dvu4
dvu4 / subtract_vs_left_join_in_pyspark.md
Last active April 23, 2025 19:30
Troubleshooting splitting data into two separate groups with subtract() vs. left join

This approach can lead to duplicate rows in df_holdout and df_target:

  • subtract() is row-based and requires an exact row match.

  • If df has duplicate rows, subtract() doesn't guarantee that only one instance is removed.

# Filter 20% of the data for the holdout group
df_holdout = df.sample(fraction=0.2, seed=42)
# Everything else becomes the target group
df_target = df.subtract(df_holdout)
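
A safer split uses a unique id plus a left anti join instead of subtract(). A minimal sketch (the row_id column is an assumption added for illustration, not from the original gist):

import pyspark.sql.functions as F

# Tag each row with a unique id so duplicate rows stay distinguishable
df = df.withColumn("row_id", F.monotonically_increasing_id())
df_holdout = df.sample(fraction=0.2, seed=42)

# Keep only the rows whose id is absent from the holdout set
df_target = df.join(df_holdout.select("row_id"), on="row_id", how="left_anti")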
    
@dvu4
dvu4 / write_count_record_to_file.md
Last active April 23, 2025 18:51
Troubleshooting writing the total row count to a .ok file in PySpark

Write the record count to a .ok file

This code is functionally correct, but it's inefficient and overly complex for what it does: writing a single integer (record count) to a file.

def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):

    ok_output_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/tmp_ok_output/"
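
A simpler alternative is to count on the driver and write the number directly. A minimal sketch, assuming a Databricks environment where dbutils is available (the output path is illustrative):

# Compute the count on the driver; it is a single integer, no DataFrame write needed
count = df.count()
ok_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/output.ok"
dbutils.fs.put(ok_path, str(count), True)  # True = overwrite if the file exists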

Check the GPU version on a Mac

1. Check the GPU info

system_profiler SPDisplaysDataType

This will display something like:
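(Illustrative output from an Apple silicon Mac; the exact fields vary by model:)

Graphics/Displays:

    Apple M1:

      Chipset Model: Apple M1
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 8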

Check the Mac chip version

1. Check the Chip Type

sysctl -n machdep.cpu.brand_string

This will display something like:
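(Illustrative output; an Intel Mac prints the CPU model instead, e.g. Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz:)

Apple M1 Pro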

@dvu4
dvu4 / insert_in_sublime.md
Last active April 8, 2025 23:56
Insert text at the beginning or end of every line in Sublime Text

Insert text at the beginning or end of every line in Sublime Text

Context:

I have 1,000 lines of text structured like the following:

f54g
f5g546
2122v
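
The usual approach, sketched here since the gist preview is truncated: open Find → Replace and enable regular expressions, then use a pattern like

Find:    ^(.*)$
Replace: '$1',

Here $1 is the captured line, so this wraps every line in quotes and appends a comma; match ^ alone to insert text only at the start of each line, or $ alone for the end.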
@dvu4
dvu4 / create_pem_file.md
Last active December 23, 2024 22:11
Create a certificate .pem file from .p12

Generate .pem File from .p12 Certificate

This guide explains how to extract a .pem file from a .p12 file using OpenSSL and troubleshoot common errors encountered during the process.

1. Generate the .pem File

Run the following command to extract the .pem file:

openssl pkcs12 -in /Users/dvuiw/Desktop/customer.p12 -nokeys -out /Users/dvuiw/Desktop/certificate.pem -nodes -password pass:123456789
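
If the private key is also needed, a similar command extracts it (a sketch reusing the paths above; key.pem is an illustrative name):

openssl pkcs12 -in /Users/dvuiw/Desktop/customer.p12 -nocerts -out /Users/dvuiw/Desktop/key.pem -nodes -password pass:123456789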
@dvu4
dvu4 / Divide the column containing JSON string into separate columns in PySpark.md
Created August 15, 2024 21:46
Divide the column containing JSON string into separate columns in PySpark

Divide the column containing JSON string into separate columns in PySpark

  • Create a DataFrame
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.window import Window

# Number the rows within each (rxa_claim_id, ITEM_PRODUCT_CODE) group, newest event first
window_spec = Window.partitionBy("rxa_claim_id", "ITEM_PRODUCT_CODE").orderBy(F.col("EVENT_TIMESTAMP").desc())
df_valid_claim = df_valid_claim.withColumn("row_number", F.row_number().over(window_spec))
df_valid_claim.display()

# Count how many rows each group contains
window_spec_2 = Window.partitionBy("rxa_claim_id", "ITEM_PRODUCT_CODE")
df_valid_claim = df_valid_claim.withColumn("row_number_count", F.count("row_number").over(window_spec_2))
df_valid_claim.display()
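
For the JSON-splitting task in the title, the standard approach is from_json with an explicit schema. A minimal sketch (the payload column and its fields are hypothetical, not from the original gist):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one column holding a JSON string
df = spark.createDataFrame([('{"name": "Ann", "city": "Austin"}',)], ["payload"])

schema = StructType([
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
])

# Parse the JSON string, then promote its fields to top-level columns
df = df.withColumn("parsed", F.from_json(F.col("payload"), schema))
df.select("parsed.*").show()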
@dvu4
dvu4 / demo-mermaid.md
Last active October 17, 2024 18:30
Draw Diagrams in Markdown with Mermaid

Convert a date in MM/dd/yyyy format to yyyy-MM-dd

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, unix_timestamp

# Initialize Spark session
spark = SparkSession.builder.appName("date_format_conversion").getOrCreate()

# Example data
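data = [("01/15/2024",), ("12/03/2023",)]  # illustrative values; the original gist is truncated here
df = spark.createDataFrame(data, ["date_str"])

# Parse MM/dd/yyyy into a unix timestamp, then re-format it as yyyy-MM-dd
df = df.withColumn(
    "date_formatted",
    from_unixtime(unix_timestamp("date_str", "MM/dd/yyyy"), "yyyy-MM-dd"),
)
df.show()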