@datavudeja
datavudeja / getsizes.ps1
Created May 21, 2026 14:29 — forked from cpeddle/getsizes.ps1
Folder size markdown generator
param(
[string]$RootPath = "C:\",
[string]$OutputPath = "C:\projects\laptop-cleanup\folder_analysis\",
[string[]]$ExcludeFolders = @("laptop-cleanup"),
[ValidateRange(0, [int64]::MaxValue)]
[int64]$MinimumDirectorySizeToRecurseBytes = 50MB,
[switch]$ShowAccessDeniedWarnings
)
# Create output directory if it doesn't exist
#requires -version 2
<#
.SYNOPSIS
Gets folder sizes using COM and by default with a fallback to robocopy.exe, with the
logging only option, which makes it not actually copy or move files, but just list them, and
the end summary result is parsed to extract the relevant data.
There is a -ComOnly parameter for using only COM, and a -RoboOnly parameter for using only
datavudeja / book-archiver.ps1
Created April 17, 2026 19:10 — forked from BurstX/book-archiver.ps1
This script archives documents in a folder (and its subfolders) that may benefit from archiving (e.g. jpg and epub files are skipped; PDFs and other files stay archived only if they compress to a given ratio, etc.)
<#
.SYNOPSIS
.PARAMETER Ratio
A real number within (0, 1). The compressed-to-original size ratio at which an archive is accepted.
Files that cannot be compressed to this ratio or better, relative to the original,
are not archived (left as is).
#>
param(
import numpy as np
import pandas as pd
#load dataset
df = pd.read_csv("data.csv")
# axis 0 -> row -> i
# axis 1 -> col -> j
# get cols
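The axis convention noted above can be made concrete with a small sketch. It uses a made-up in-memory DataFrame instead of `data.csv`, and the column names `a` and `b` are purely illustrative.

```python
import pandas as pd

# Hypothetical data standing in for data.csv
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Column labels as a plain list
cols = list(df.columns)

# axis=0 runs down the rows, giving one result per column
col_sums = df.sum(axis=0)

# axis=1 runs across the columns, giving one result per row
row_sums = df.sum(axis=1)

# axis=1 in drop means "the label is a column label"
without_b = df.drop("b", axis=1)
```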
datavudeja / pandas_cheetsheet.py
Created March 4, 2026 12:11 — forked from Ezhvsalate/pandas_cheetsheet.py
Pandas cheat sheet: some useful commands for data preprocessing
# Read data from csv
data = pd.read_csv('data.csv', sep=',', index_col='Number')
# Write data to csv
data.to_csv("data_wo_sensitive_lemmatized.csv", index=False, encoding='utf-8', sep=';')
# Read and concatenate several files into one dataframe
# (requires `import glob`; `columns` is a list of column names defined elsewhere)
files = glob.glob('*.csv')
small_dfs = [pd.read_csv(fp, names=columns) for fp in files]
df = pd.concat(small_dfs)
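A self-contained sketch of the read-and-concatenate pattern above. The temporary directory, file names, file contents, and the `columns` list are all assumptions made so the example can run on its own. `ignore_index=True` is an optional addition, not part of the original snippet.

```python
import glob
import os
import tempfile

import pandas as pd

columns = ["id", "value"]  # hypothetical layout of the header-less files

with tempfile.TemporaryDirectory() as tmp:
    # Write two small header-less CSV files to read back
    for name, rows in [("a.csv", "1,x\n2,y\n"), ("b.csv", "3,z\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(rows)

    # sorted() makes the concatenation order deterministic
    files = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    small_dfs = [pd.read_csv(fp, names=columns) for fp in files]

    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    df = pd.concat(small_dfs, ignore_index=True)
```

Without `ignore_index=True`, each file's own row numbers are kept, so index labels repeat across the combined frame.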
datavudeja / pandas.py
Created March 4, 2026 12:09 — forked from stiles/pandas.py
Pandas cheat sheet
# List unique values in a DataFrame column
df['Column Name'].unique()
# To extract a single column (subset the dataframe), use brackets [ ] or attribute notation.
df.height
df['height']
# Both return the same Series (from http://www.stephaniehicks.com/learnPython/pages/pandas.html
# or http://www.datacarpentry.org/python-ecology-lesson/02-index-slice-subset/).
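A short sketch of the two column-access forms, using made-up data (the `height` and `count` columns are illustrative only). It also shows one case where the two forms differ: attribute access fails to reach a column whose name clashes with an existing DataFrame attribute, so bracket notation is the safer default.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"height": [150, 162, 171], "count": [1, 2, 3]})

by_attribute = df.height
by_brackets = df["height"]
same = by_attribute.equals(by_brackets)

# "count" is also a DataFrame method, so df.count is the method, not the column
count_is_method = callable(df.count)
count_column = df["count"]
```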
datavudeja / data_quality_checks.py
Created February 18, 2026 17:32 — forked from LeGi0N09/data_quality_checks.py
Python: Automated data quality validation framework
import pandas as pd
from typing import Dict, List
class DataQualityValidator:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.issues = []

    def check_nulls(self, columns: List[str], threshold: float = 0.05):
        """Check if null percentage exceeds threshold"""
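The preview above stops at the docstring. One way such a null check might be written is sketched here. The method body, the issue-message format, the `return self` chaining, and the sample data are assumptions, not the original author's code.

```python
from typing import List

import pandas as pd


class DataQualityValidator:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.issues = []

    def check_nulls(self, columns: List[str], threshold: float = 0.05):
        """Check if null percentage exceeds threshold"""
        # Assumed behaviour: record one issue per offending column
        for col in columns:
            null_fraction = self.df[col].isna().mean()
            if null_fraction > threshold:
                self.issues.append(
                    f"{col}: {null_fraction:.1%} nulls exceeds {threshold:.1%}"
                )
        return self  # returning self to allow chaining is an assumption


# Hypothetical sample data: column "a" is 25% null, column "b" has no nulls
sample = pd.DataFrame({"a": [1, None, 3, 4], "b": [1, 2, 3, 4]})
validator = DataQualityValidator(sample).check_nulls(["a", "b"])
```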
datavudeja / PIPE.py
Created February 9, 2026 15:27 — forked from emherrer/PIPE.py
[Functions] Some useful functions #python #fun #funciones #def #pipe #words #keywords
from functools import wraps
import datetime as dt
import pandas as pd
def log_start(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
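The decorator preview is cut off before the wrapper returns. A minimal sketch of how a timing decorator of this shape could be finished and used with `DataFrame.pipe` follows. The log message, the `drop_missing` step, and the sample data are hypothetical.

```python
import datetime as dt
from functools import wraps

import pandas as pd


def log_start(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        elapsed = dt.datetime.now() - tic
        # Assumed log format; the original message is not shown
        print(f"{func.__name__} took {elapsed}, shape={result.shape}")
        return result
    return wrapper


@log_start
def drop_missing(df):  # hypothetical pipeline step
    return df.dropna()


raw = pd.DataFrame({"x": [1.0, None, 3.0]})
clean = raw.pipe(drop_missing)
```

Because of `@wraps(func)`, the decorated function keeps its original `__name__`, which is what makes the log line readable.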
datavudeja / nonprint-char_remover.py
Created February 6, 2026 14:41 — forked from GDBSD/nonprint-char_remover.py
Remove non-printing characters from a Pandas dataframe
def remove_non_printing_chars(df):
    """Clean a dataframe column to remove any non-printing characters.
    We've encountered values like tabs in some of the data.
    :param df: Pandas dataframe
    :return: Pandas dataframe
    """
    clean_df = df.copy(deep=True)
    clean_df = clean_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
    for col in list(clean_df.columns):
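The loop body is not shown in the preview. As a hedged sketch of one possible approach, the version below removes ASCII control characters (tabs, newlines, and similar) from string columns with a regular expression. The character class and the sample data are assumptions, and the original gist may use a different rule.

```python
import pandas as pd


def remove_non_printing_chars(df):
    """Return a copy of df with control characters removed from string columns."""
    clean_df = df.copy(deep=True)
    for col in list(clean_df.columns):
        # Only touch columns that hold strings
        if pd.api.types.is_string_dtype(clean_df[col]):
            # Assumed rule: drop ASCII control characters, then trim whitespace
            clean_df[col] = (
                clean_df[col]
                .str.replace(r"[\x00-\x1f\x7f]", "", regex=True)
                .str.strip()
            )
    return clean_df


# Hypothetical data containing a tab and a trailing newline
dirty = pd.DataFrame({"name": ["a\tb", " c\n"], "n": [1, 2]})
cleaned = remove_non_printing_chars(dirty)
```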
# Apply lambda functions to a pandas DataFrame
# When the new column depends on other columns
df = df.assign(Product=lambda x: (x['Field_1'] * x['Field_2'] * x['Field_3']))
# When selected rows should be transformed based only on their own values:
# with axis=1 the lambda receives each row, so x.name is the row's index label.
# apply returns a new DataFrame, so the result is assigned back to df.
df = df.apply(lambda x: np.square(x) if x.name in ['a', 'e', 'g'] else x, axis=1)
# To compare with the previous element of a column, use shift
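A small sketch of the `shift` comparison mentioned above, on a made-up `price` column (the column names and values are illustrative). `shift(1)` moves every value down one row, so each row sees its predecessor, and the first row has nothing to compare against.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 15]})  # hypothetical column

# Previous row's value; the first row becomes NaN
df["prev_price"] = df["price"].shift(1)

# Difference from the previous row
df["change"] = df["price"] - df["prev_price"]

# Comparisons against NaN evaluate to False, so the first row is False
df["went_up"] = df["price"] > df["prev_price"]
```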