dixitm20 / readme.md
Last active December 11, 2023 10:13
Script for easy identification, restoration & validation of deleted objects in Amazon S3 buckets.

restore-s3-deletes

Overview

A bash script to list or remove delete markers in a versioning-enabled Amazon S3 bucket. Removing an object's delete marker restores the deleted object.

This makes the script useful for identifying, restoring, and validating accidental deletes in Amazon S3.
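The gist itself does this with bash and the AWS CLI; purely as an illustration of the underlying API calls, here is a minimal boto3 sketch of the same idea. The bucket name and prefix are placeholder assumptions.

```python
import boto3

s3 = boto3.client('s3')
bucket, prefix = 'my-bucket', 'accidentally/deleted/'  # hypothetical values

# In a versioned bucket, a delete hides the object behind a delete marker;
# removing the marker makes the previous version current again.
paginator = s3.get_paginator('list_object_versions')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for marker in page.get('DeleteMarkers', []):
        if marker['IsLatest']:  # only the current marker hides the object
            print(f"restoring {marker['Key']}")
            s3.delete_object(Bucket=bucket, Key=marker['Key'],
                             VersionId=marker['VersionId'])
```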

Requirements

dixitm20 / ZeppelinSetup.md
Created September 21, 2022 13:02
Zeppelin Setup For Use With Local Spark MetaStore

Zeppelin Setup For Use With Local Spark MetaStore

Download and Set Up Zeppelin

Step 1) Define the environment variables and aliases below in ~/.bash_profile OR ~/.bashrc:

# SPARK_CONF_DIR: Local dir containing the Spark configuration files
export SPARK_CONF_DIR='/Users/dixitm/Workspace/conf/spark-conf-dir'

# DATA_PLATFORM_ROOT: Local root dir where the Spark catalog & metastore are set up
export DATA_PLATFORM_ROOT="/Users/dixitm/Workspace/data/local-data-platform"
dixitm20 / MultiplePythonVersionsWithPyenv.md
Created June 19, 2022 19:10
Managing Multiple Python Versions With Pyenv
dixitm20 / Spark-Data-Engineer-Assignment.md
Last active September 4, 2024 08:16
Assignment For Data Engineer

Spark: Scala / PySpark Exercise

Create a Spark application, written in Scala or PySpark, that reads in the provided signals dataset, processes the data, and stores the entire output as specified below.

For each entity_id in the signals dataset, find the item_id with the oldest month_id and the item_id with the newest month_id. In some cases they may be the same item. If two different items share the same month_id, take the item with the lower item_id. Finally, sum the count of signals for each entity and output it as total_signals. The correct output should contain one row per unique entity_id.

Requirements:

  1. Create a Scala SBT project or a PySpark project (if you know Scala, please use it, as we give that higher preference).
  2. Use the Spark Scala/PySpark API and DataFrames/Datasets.
  • Please do not use Spark SQL with a raw SQL string!
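For illustration only, a minimal PySpark sketch of one possible approach follows (it is not the reference solution). The input/output paths and the signal_count column name are assumptions, since the dataset schema is not reproduced in this preview.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('signals-exercise').getOrCreate()
signals = spark.read.parquet('signals/')  # hypothetical input path/format

# Rank each entity's items by month; ties on month_id break toward the lower item_id.
oldest_w = Window.partitionBy('entity_id').orderBy(F.col('month_id').asc(), F.col('item_id').asc())
newest_w = Window.partitionBy('entity_id').orderBy(F.col('month_id').desc(), F.col('item_id').asc())

ranked = (signals
          .withColumn('oldest_rn', F.row_number().over(oldest_w))
          .withColumn('newest_rn', F.row_number().over(newest_w)))

oldest = ranked.filter('oldest_rn = 1').select('entity_id', F.col('item_id').alias('oldest_item_id'))
newest = ranked.filter('newest_rn = 1').select('entity_id', F.col('item_id').alias('newest_item_id'))
totals = signals.groupBy('entity_id').agg(F.sum('signal_count').alias('total_signals'))

# One row per unique entity_id, as required.
result = oldest.join(newest, 'entity_id').join(totals, 'entity_id')
result.write.mode('overwrite').parquet('output/')  # hypothetical output path
```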
dixitm20 / lambda_function.py
Last active August 17, 2021 15:43
Truncate DynamoDB Tables
import json
import os

import boto3

dynamodb = boto3.resource('dynamodb')

def truncate_table(table_name):
    table = dynamodb.Table(table_name)
    # Only the key attributes are needed to delete an item.
    key_names = [k['AttributeName'] for k in table.key_schema]
    # NOTE: the gist preview truncates at scan_kwargs; the pagination loop
    # below is a standard scan-and-batch-delete reconstruction.
    scan_kwargs = {}
    with table.batch_writer() as batch:
        while True:
            page = table.scan(**scan_kwargs)
            for item in page['Items']:
                batch.delete_item(Key={k: item[k] for k in key_names})
            if 'LastEvaluatedKey' not in page:
                break
            scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
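Purely as an illustration, a hypothetical Lambda entry point might look like the sketch below; the TABLE_NAMES environment variable is an assumption (it would also account for the json and os imports), not necessarily what the original gist uses.

```python
# Hypothetical handler; TABLE_NAMES is an assumed environment variable.
def lambda_handler(event, context):
    for name in os.environ['TABLE_NAMES'].split(','):
        truncate_table(name.strip())
    return {'statusCode': 200, 'body': json.dumps('tables truncated')}
```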