A bash script to list or delete delete-markers in a versioning-enabled Amazon S3 bucket. Deleting an object's delete marker restores the deleted object.
This script is useful for identifying, restoring, and validating accidental deletes on Amazon S3.
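Since the script itself is not included in this excerpt, the boto3 sketch below illustrates the same idea; the bucket and prefix values are placeholders, not taken from the original script. It lists the current delete markers under a prefix and removes each one, which restores the previous version of the object.

import boto3

s3 = boto3.client('s3')
bucket, prefix = 'my-bucket', 'path/to/restore/'   # placeholder values

# List delete markers under the prefix (paginated)
paginator = s3.get_paginator('list_object_versions')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for marker in page.get('DeleteMarkers', []):
        if marker['IsLatest']:
            print(marker['Key'], marker['VersionId'])
            # Removing the delete marker restores the previous version of the object
            s3.delete_object(Bucket=bucket, Key=marker['Key'], VersionId=marker['VersionId'])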
Step 1) Define the env variables and aliases below in ~/.bash_profile or ~/.bashrc:
# SPARK_CONF_DIR: local dir holding the Spark configuration files
export SPARK_CONF_DIR='/Users/dixitm/Workspace/conf/spark-conf-dir'
# DATA_PLATFORM_ROOT: local root dir where the Spark catalog & metastore are set up
export DATA_PLATFORM_ROOT="/Users/dixitm/Workspace/data/local-data-platform"
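As a rough illustration only (the warehouse sub-directory and config usage here are assumptions, not taken from the actual setup), a local PySpark session could derive its warehouse location from DATA_PLATFORM_ROOT, while spark-submit picks up SPARK_CONF_DIR automatically:

import os
from pyspark.sql import SparkSession

data_platform_root = os.environ['DATA_PLATFORM_ROOT']

# Point the Spark SQL catalog/warehouse at the local data platform directory
spark = (SparkSession.builder
         .appName('local-data-platform')
         .config('spark.sql.warehouse.dir', os.path.join(data_platform_root, 'warehouse'))
         .enableHiveSupport()
         .getOrCreate())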
Use the commands below to install pyenv:
$ sudo apt-get install -y make build-essential libssl-dev zlib1g-dev \
Use the steps below to run the Elasticsearch container:
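The actual steps are not included in this excerpt; as a placeholder sketch (the image tag, container name, and port mapping are assumptions), a single-node Elasticsearch container can be started from Python with the Docker SDK:

import docker

client = docker.from_env()
# Run a single-node Elasticsearch container, exposing the REST API on localhost:9200
client.containers.run(
    'docker.elastic.co/elasticsearch/elasticsearch:8.13.4',
    name='elasticsearch',
    detach=True,
    ports={'9200/tcp': 9200},
    environment={'discovery.type': 'single-node', 'xpack.security.enabled': 'false'},
)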
Create a Spark application, written in Scala or PySpark, that reads the provided signals dataset, processes the data, and stores the entire output as specified below.
For each entity_id in the signals dataset, find the item_id with the oldest month_id and the item_id with the newest month_id. In some cases this may be the same item. If two different items share the same month_id, take the item with the lower item_id. Finally, sum the count of signals for each entity and output it as total_signals. The correct output should contain one row per unique entity_id.
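One possible PySpark sketch of this logic is shown below. The column names entity_id, item_id, month_id, and count follow the description above, while the input and output paths and the use of Parquet are assumptions.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName('signals-aggregation').getOrCreate()
signals = spark.read.parquet('/path/to/signals')   # input path is a placeholder

# Oldest item: earliest month_id, ties broken by lower item_id; newest item: latest month_id
oldest_w = Window.partitionBy('entity_id').orderBy(F.col('month_id').asc(), F.col('item_id').asc())
newest_w = Window.partitionBy('entity_id').orderBy(F.col('month_id').desc(), F.col('item_id').asc())

result = (signals
          .withColumn('oldest_item_id', F.first('item_id').over(oldest_w))
          .withColumn('newest_item_id', F.first('item_id').over(newest_w))
          .groupBy('entity_id', 'oldest_item_id', 'newest_item_id')
          .agg(F.sum('count').alias('total_signals')))

result.write.mode('overwrite').parquet('/path/to/output')   # output path is a placeholder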
import json
import os
import boto3

dynamodb = boto3.resource('dynamodb')

def truncate_table(table_name):
    # Delete every item by scanning only the table's key attributes
    table = dynamodb.Table(table_name)
    key_names = [k['AttributeName'] for k in table.key_schema]
    scan_kwargs = {'ProjectionExpression': ', '.join(key_names)}
    with table.batch_writer() as batch:
        for item in table.scan(**scan_kwargs)['Items']:
            batch.delete_item(Key={k: item[k] for k in key_names})
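A hypothetical invocation is shown below (the table name is illustrative). Note that a single scan call returns at most 1 MB of items, so for larger tables the scan would need to be repeated with ExclusiveStartKey / LastEvaluatedKey pagination.

truncate_table('my-dev-table')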