Amal G Jose (amalgjose)
@amalgjose
amalgjose / python_git_to_adls_sync.py
Last active May 23, 2024 05:56
Python program to clone or copy a git repository to Azure Data Lake Storage (ADLS Gen 2). This program is helpful for people who use Spark and Hive scripts in Azure Data Factory, since Azure Data Factory needs the Hive and Spark scripts on ADLS. Developers can commit the code to git, and the repository can be synced to ADLS using this program…
import os
# Modify this path as per the execution environment
os.environ['GIT_PYTHON_GIT_EXECUTABLE'] = r"C:\Program Files\Git\bin\git.exe"
import uuid
import git
import shutil
from git import RemoteProgress
from azure.storage.blob import BlobServiceClient
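The preview above is truncated. Below is a minimal, self-contained sketch of the full clone-and-upload flow under those imports; sync_repo_to_adls, the temp-dir handling, and connection-string auth are illustrative assumptions, not the gist's actual code.

import os
import shutil
import tempfile
import git
from azure.storage.blob import BlobServiceClient

def sync_repo_to_adls(repo_url, connection_string, container_name):
    # Clone into a throwaway directory (illustrative approach)
    local_path = tempfile.mkdtemp()
    git.Repo.clone_from(repo_url, local_path)
    container = BlobServiceClient.from_connection_string(
        connection_string).get_container_client(container_name)
    for root, dirs, files in os.walk(local_path):
        dirs[:] = [d for d in dirs if d != ".git"]  # skip git metadata
        for name in files:
            file_path = os.path.join(root, name)
            blob_path = os.path.relpath(file_path, local_path).replace(os.sep, "/")
            with open(file_path, "rb") as data:
                container.upload_blob(name=blob_path, data=data, overwrite=True)
    shutil.rmtree(local_path)  # remove the local clone after the sync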
@amalgjose
amalgjose / download_adls_directory.py
Last active December 4, 2023 16:15
Python program to download a complete directory or file from Microsoft Azure ADLS. This program can recursively download a complete directory from Azure Data Lake Storage. It uses the Azure Blob Storage API to iterate over the directories and files and download the data. This is tested as of 05-October-2020. For more details, refer to https…
# coding: utf-8
import os
from azure.storage.blob import BlobServiceClient
class DownloadADLS:
    def __init__(self, connection_string, container_name):
        service_client = BlobServiceClient.from_connection_string(connection_string)
        self.client = service_client.get_container_client(container_name)
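The preview stops at the constructor. A hedged sketch of the recursive download as a standalone function; download_adls_directory and its parameters are illustrative names, and blobs are assumed to be plain files (no zero-byte directory markers).

import os
from azure.storage.blob import BlobServiceClient

def download_adls_directory(connection_string, container_name, prefix, local_dir):
    client = BlobServiceClient.from_connection_string(
        connection_string).get_container_client(container_name)
    # list_blobs with a name prefix walks the "directory" recursively
    for blob in client.list_blobs(name_starts_with=prefix):
        target = os.path.join(local_dir, blob.name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as handle:
            client.download_blob(blob.name).readinto(handle)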
@amalgjose
amalgjose / upload_directory_to_adls.py
Last active August 19, 2024 11:05
Python program to upload a directory or a folder to Azure Data Lake Storage Gen 2 (ADLS Gen 2). For more details, read the detailed article on my blog https://amalgjose.com
import os
from azure.storage.blob import BlobServiceClient
# Install the following package before running this program
# pip install azure-storage-blob
def upload_data_to_adls():
    """
    Function to upload local directory to ADLS
    :return:
    """
@amalgjose
amalgjose / adls_file_write.py
Created September 29, 2020 13:36
Python program to write a file into Microsoft Azure Data Lake Storage (ADLS Gen2) file system. For more details, refer to https://amalgjose.com
from azure.storage.filedatalake import DataLakeServiceClient
# install the following package
# pip install azure-storage-file-datalake
# Get the below details from your storage account
storage_account_name = ""
storage_account_key = ""
container_name = ""
directory_name = ""
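The preview ends at the configuration block. A minimal sketch of the write itself with the same SDK; the file name and payload below are made up for illustration.

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url=f"https://{storage_account_name}.dfs.core.windows.net",
    credential=storage_account_key)
file_system = service_client.get_file_system_client(file_system=container_name)
directory = file_system.get_directory_client(directory_name)
file_client = directory.create_file("sample.txt")  # hypothetical file name
data = b"hello from adls gen2"                     # hypothetical payload
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))  # commit the appended bytes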
@amalgjose
amalgjose / list_all_sagemaker_instances.py
Created September 29, 2020 10:50
Python program to list the details of all AWS SageMaker notebook instances running in an AWS account and export the data as a CSV file.
import csv
import boto3
client = boto3.client('sagemaker', region_name='us-east-1')
response = client.list_notebook_instances(MaxResults=100)
notebooks = response['NotebookInstances']
print("Total Number of Notebook Instances ----->", len(notebooks))
notebook_list = []
for notebook in notebooks:
notebook_dict = dict()
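The loop above is truncated. A self-contained sketch that collects a few common response fields and writes the CSV; the chosen columns and output file name are assumptions, and a paginator replaces the single MaxResults call so accounts with many instances are covered.

import csv
import boto3

client = boto3.client('sagemaker', region_name='us-east-1')
rows = []
for page in client.get_paginator('list_notebook_instances').paginate():
    for notebook in page['NotebookInstances']:
        rows.append({
            'Name': notebook['NotebookInstanceName'],
            'Status': notebook['NotebookInstanceStatus'],
            'InstanceType': notebook['InstanceType'],
            'CreationTime': notebook['CreationTime'],
        })
print("Total Number of Notebook Instances ----->", len(rows))
with open('sagemaker_notebooks.csv', 'w', newline='') as handle:
    writer = csv.DictWriter(handle, fieldnames=['Name', 'Status', 'InstanceType', 'CreationTime'])
    writer.writeheader()
    writer.writerows(rows)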
@amalgjose
amalgjose / spark_adls_filesystem_operations.py
Created September 23, 2020 06:34
PySpark program that interacts with Azure Data Lake Storage Gen 2 using the HDFS API. Delete and check operations are demonstrated in this program. You can modify it to perform all the other file system operations. For more details, refer to https://amalgjose.com
from pyspark.sql import SparkSession
# Author: Amal G Jose
# Reference: https://amalgjose.com
# prepare spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()
# spark context
sc = spark.sparkContext
# set ADLS file system URI
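The preview cuts off before the file system handle is created. A hedged sketch of the usual JVM-gateway route to Hadoop FileSystem operations from PySpark; the abfss URI and target path are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()
sc = spark.sparkContext
adls_uri = "abfss://<container>@<account>.dfs.core.windows.net"  # placeholder URI
jvm = sc._jvm
conf = sc._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI(adls_uri), conf)
path = jvm.org.apache.hadoop.fs.Path(adls_uri + "/some/path")  # hypothetical path
if fs.exists(path):        # check operation
    fs.delete(path, True)  # recursive delete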
@amalgjose
amalgjose / stream_to_s3.py
Last active July 22, 2022 10:24
Python program to stream data from a URL and upload it directly to S3 without saving it to local disk. This program is well suited for AWS Lambda functions and works well with large files, acting as a relay between the source and S3. For more details, refer to https://amalgjose.com/2020/08/13/python-program-to-stream-da…
import boto3
import requests
authentication = {"USER": "", "PASSWORD": ""}
payload = {"query": "some query"}
session = requests.Session()
response = session.post("URL",
                        data=payload,
                        auth=(authentication["USER"],
                              authentication["PASSWORD"]),
                        stream=True)  # stream=True avoids buffering the whole body
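A compact, self-contained sketch of the relay pattern the description outlines, assuming boto3's upload_fileobj on the S3 side; the URL, bucket, and key are placeholders.

import boto3
import requests

session = requests.Session()
# stream=True leaves the body unread so it can be relayed chunk by chunk
response = session.get("https://example.com/largefile", stream=True)
response.raise_for_status()
response.raw.decode_content = True  # transparently decompress if needed
s3 = boto3.client('s3')
# upload_fileobj reads the file-like object in chunks (multipart under the hood)
s3.upload_fileobj(response.raw, "my-bucket", "path/largefile")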
@amalgjose
amalgjose / stream_data.py
Last active September 10, 2020 03:47
Sample program to stream large data from a REST or HTTP endpoint using Python. For more details, refer to https://amalgjose.com/2020/08/10/program-to-stream-large-data-from-a-rest-endpoint-using-python/
import requests
session = requests.Session()
authentication = {"USER":"", "PASSWORD":""}
payload = {"query":"some query"}
local_file = "data.json"
# This is a dummy URL. You can replace this with the actual URL
URL = "https://sampledatadowload.com/somedata"
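The preview stops before the download loop. A hedged sketch of the streaming read; a plain GET is used here for brevity, while the gist itself posts a payload with basic auth, and the 8 KB chunk size is an arbitrary choice.

import requests

session = requests.Session()
URL = "https://sampledatadowload.com/somedata"  # dummy URL from the gist
local_file = "data.json"
with session.get(URL, stream=True) as response:
    response.raise_for_status()
    with open(local_file, "wb") as handle:
        # Write the body in chunks instead of loading it all into memory
        for chunk in response.iter_content(chunk_size=8192):
            handle.write(chunk)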
@amalgjose
amalgjose / convert_datatype.py
Last active February 5, 2023 02:44
How to convert or change the data types of columns in a pandas DataFrame? For more details, please refer to the blog https://amalgjose.com/2020/05/22/how-to-convert-or-change-the-data-type-of-columns-in-pandas-dataframe
import pandas as pd
# create a sample dataframe
df = pd.DataFrame({'emp_id': ['111', '112', '113'], 'salary': ['40000', '50000', '60000'], 'name':['amal', 'sabitha', 'edward']})
# print the dataframe
print(df)
# print the datatypes in the dataframe
print(df.dtypes)
# now let us convert the data type of salary to integer
df = df.astype({'salary':'int'})
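A short follow-up on the same DataFrame: when a column may contain non-numeric junk, pd.to_numeric with errors='coerce' is a safer alternative to astype, since invalid entries become NaN instead of raising.

# Hypothetical continuation; astype above already converted the clean values
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
print(df.dtypes)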
@amalgjose
amalgjose / check_internet_speed.py
Last active February 5, 2023 02:44
Python program to check the internet speed. For more details, visit http://amalgjose.com/2020/05/17/python-program-to-check-the-internet-speed-or-bandwidth
import time
import sqlite3
import speedtest
from datetime import datetime, timedelta
class internet_speed_calculator():
    def __init__(self):
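The class body is cut off in the preview. A hedged, function-style sketch of the measurement itself; the gist's sqlite3 and datetime imports suggest it also persists results over time, which is omitted here, and measure_speed is an illustrative name.

import speedtest

def measure_speed():
    st = speedtest.Speedtest()
    st.get_best_server()                 # pick the nearest test server
    download_mbps = st.download() / 1e6  # results are in bits per second
    upload_mbps = st.upload() / 1e6
    return download_mbps, upload_mbps, st.results.ping

down, up, ping = measure_speed()
print(f"Download: {down:.2f} Mbps, Upload: {up:.2f} Mbps, Ping: {ping:.1f} ms")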