Deepak j-thepac

"""
Convert JSON data to rows and columns
[{"data":[["2","2xg","2Q"],["1","3xg","3Q"]],"schema":[{"columnName":"CASE_UID","ordinal":0,"dataTypeName":"varchar"},{"columnName":"QUOTE_ID","ordinal":1,"dataTypeName":"varchar"},{"columnName":"OPP_NO","ordinal":2,"dataTypeName":"varchar"}]}]
"""
from pyspark.sql.functions import *
from pyspark.sql.types import *
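
The conversion described in the docstring can be sketched in plain Python before any Spark code is involved. This is a minimal sketch, assuming the payload always has the shape shown above (a one-element list with `data` and `schema` keys); the helper name `to_rows_and_columns` is hypothetical and not part of the original gist.

```python
import json

# Sample payload copied from the docstring above.
payload = (
    '[{"data":[["2","2xg","2Q"],["1","3xg","3Q"]],'
    '"schema":[{"columnName":"CASE_UID","ordinal":0,"dataTypeName":"varchar"},'
    '{"columnName":"QUOTE_ID","ordinal":1,"dataTypeName":"varchar"},'
    '{"columnName":"OPP_NO","ordinal":2,"dataTypeName":"varchar"}]}]'
)


def to_rows_and_columns(raw):
    """Return (column names, list of row dicts) for one payload (hypothetical helper)."""
    doc = json.loads(raw)[0]
    # Order the columns by their declared ordinal, not by list position.
    schema = sorted(doc["schema"], key=lambda col: col["ordinal"])
    columns = [col["columnName"] for col in schema]
    rows = [dict(zip(columns, values)) for values in doc["data"]]
    return columns, rows


columns, rows = to_rows_and_columns(payload)
print(columns)
for row in rows:
    print(row)
```

With an active Spark session, the same `columns` list and the raw `data` lists could likely be passed to `spark.createDataFrame(data, columns)` to get a DataFrame; that step is left out here so the sketch runs without Spark.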

NGINX with DOCKER

Summary

  • Serving Static Content (HTML, CSS, JavaScript)
  • Reverse Proxy: forwards client requests to backend servers, which handle them.
  • API Gateway: acts as a single entry point for all API requests and distributes traffic to the backend services.
  • SSL/TLS Termination: decrypts incoming HTTPS traffic at NGINX so the backend servers receive plain HTTP; re-encrypting the traffic toward the backends is optional.
  • Caching: NGINX can also act as an HTTP cache, improving website performance by storing frequently requested content and serving it from its cache instead of asking the backend each time.
  • Load Balancing: distributes incoming requests across multiple backend servers.

Mongo

Note: For Mongo Cloud

  • Set Password: Select Cluster > Security > Select User > Edit
  • Add IP: Select Cluster > Security > Network Access > Add IP Address

ref:

SQL VS MongoDB

  • Database = Database
j-thepac / pyspark.md
Last active March 16, 2023 11:16
Install PySpark on Mac

Pre-Req:

  • Install Python 3.9
  • Find the location of Python ($ which python) and keep it handy
  • pip3 install ipython #optional
  • pip3 install pyspark
  • Download the Apache Spark zip > Unzip to a path
j-thepac / scikit-learn.md
Last active April 4, 2023 07:38
ML / AI

ML

dataset

  • input data (features or predictors)
    • Example - student's age, gender, previous grades, etc.
    • numpy array or pandas DataFrame
    • denoted by the variable X.
  • target data (response or labels)
    • Example - student's final grade or pass/fail status.
    • numpy array or pandas DataFrame
    • denoted by the variable Y.

Versioning

  1. Calendar Versioning
  2. Semantic Versioning

Calendar Versioning

https://calver.org/

  • Ubuntu 16.04 = Ubuntu released April 2016 (YY.MM)
  • PyCharm 2022.3.2
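
To make the Ubuntu example concrete, here is a small sketch that reads a `YY.MM` calendar version back into a year and a month. It assumes the two-digit year belongs to the 2000s; the names `CalVer` and `parse_ubuntu_calver` are made up for illustration.

```python
from typing import NamedTuple


class CalVer(NamedTuple):
    year: int
    month: int


def parse_ubuntu_calver(version: str) -> CalVer:
    """Parse a YY.MM version string such as '16.04' into (year, month)."""
    # Ignore any point-release suffix such as the '.3' in '16.04.3'.
    yy, mm = version.split(".")[:2]
    year, month = 2000 + int(yy), int(mm)  # assumption: years are 20YY
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in version {version!r}")
    return CalVer(year, month)


release = parse_ubuntu_calver("16.04")
print(release.year, release.month)
```

A semantic version cannot be read this way, because its numbers describe compatibility rather than a release date.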
j-thepac / ETL.md
Last active January 13, 2025 13:27

ETL

BigData:

  • OLTP (Online Transaction Processing)
  • OLAP (Online Analytical Processing)

Data processing systems

  • OLTP (Online Transaction Processing): used for managing current, day-to-day transactional data.
    • ACID
      • Atomicity (the entire transaction happens at once or nothing happens)
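
Atomicity is easiest to see with a failed transfer. The sketch below uses Python's built-in `sqlite3` module and an in-memory database; the `accounts` table, the names, and the `transfer` helper are invented for the example. The second `UPDATE` violates a `CHECK` constraint, so the first `UPDATE` is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


# alice only has 100, so the debit fails and bob's credit is undone too.
print(transfer(conn, "alice", "bob", 500))
print(balances(conn))
```

After the failed call both balances are unchanged, which is the "nothing happens" half of the definition.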

System

  • SOP (standard operating procedure)
  • RCA (root cause analysis)
  • Modular
  • Scalable
  • Schema-agnostic (NoSQL): no fixed schema
  • Model
  • Decouple
from pyspark.sql.window import Window
"""
Aggregate: min, max, avg, count, and sum.
Ranking: rank, dense_rank, percent_rank, row_number, and ntile.
Analytical: cume_dist, lag, and lead.
Custom boundary: rangeBetween and rowsBetween.
"""
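
The difference between the ranking functions is easier to see on a tiny example. The sketch below is plain Python, not the PySpark API: it only imitates what `row_number`, `rank`, `dense_rank`, and `lag` compute over a single partition ordered by score descending. The helper names `rank_rows` and `lag_values` are hypothetical.

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) tuples, highest score first."""
    ordered = sorted(scores, reverse=True)
    result = []
    rank = dense_rank = 0
    previous = None
    for row_number, score in enumerate(ordered, start=1):
        if score != previous:
            rank = row_number   # rank jumps past tied rows
            dense_rank += 1     # dense_rank never leaves gaps
            previous = score
        result.append((score, row_number, rank, dense_rank))
    return result


def lag_values(values, offset=1, default=None):
    """Return, for each position, the value `offset` rows earlier (or default)."""
    return [values[i - offset] if i >= offset else default for i in range(len(values))]


for row in rank_rows([90, 80, 90, 70]):
    print(row)
print(lag_values([10, 20, 30]))
```

In PySpark the same results would come from the functions of the same names in `pyspark.sql.functions`, applied with `.over(...)` on a `Window` specification.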

Delta Lake

quick link

Issues in Spark:

  • Cannot update or change existing data
  • No schema enforcement
  • No delta (incremental) load
  • Data can be corrupted during an overwrite

Advantages of Delta Lake