Anurag Chatterjee ace-racer

🎯
Focusing
  • London
  • 18:40 (UTC)
View GitHub Profile
@ace-racer
ace-racer / validate.py
Created February 24, 2024 03:13
Validate incoming data using generated expectations
import great_expectations as ge
import sys
import json
import os


def validate_data(file_path: str, expectation_suite_path: str):
    # read the dataset into a Great Expectations DataFrame
    ge_df = ge.read_csv(file_path)
    result_format: dict = {
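The gist is cut off after `result_format` is started. A minimal sketch of how such a validation helper might continue, assuming the legacy `great_expectations` API where `ge.read_csv` returns a dataset with a `.validate()` method; the `"SUMMARY"` result format and the `summarize_result` helper are assumptions for illustration, not the author's code:

```python
import json


def load_expectation_suite(expectation_suite_path: str) -> dict:
    """Read a previously generated expectation suite from disk."""
    with open(expectation_suite_path) as f:
        return json.load(f)


def summarize_result(validation_result: dict) -> dict:
    """Reduce a validation result dict to a small pass/fail summary."""
    results = validation_result.get("results", [])
    failed = [r["expectation_config"]["expectation_type"]
              for r in results if not r.get("success", False)]
    return {
        "success": validation_result.get("success", False),
        "evaluated": len(results),
        "failed": failed,
    }


def validate_file(file_path: str, expectation_suite_path: str) -> dict:
    """Sketch: validate a CSV against a saved suite (needs great_expectations installed)."""
    import great_expectations as ge  # imported here so the sketch stays importable without GE
    ge_df = ge.read_csv(file_path)
    suite = load_expectation_suite(expectation_suite_path)
    result = ge_df.validate(expectation_suite=suite, result_format="SUMMARY")
    # older GE versions return a plain dict, newer ones an object with to_json_dict()
    result_dict = result if isinstance(result, dict) else result.to_json_dict()
    return summarize_result(result_dict)
```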
@ace-racer
ace-racer / expectations.json
Created February 24, 2024 03:07
Generated expectations
{
  "data_asset_type": "Dataset",
  "expectation_suite_name": "default",
  "expectations": [
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": {
        "max_value": 50,
        "min_value": 1
      },
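To make the generated suite concrete: an expectation entry is just a type name plus kwargs, and evaluating it reduces to a bounds check on an observed value. This is a toy re-implementation for illustration only, not Great Expectations' actual code:

```python
def expect_table_row_count_to_be_between(row_count, min_value=None, max_value=None):
    """Toy version of the row-count expectation: check observed rows against bounds."""
    ok = ((min_value is None or row_count >= min_value) and
          (max_value is None or row_count <= max_value))
    return {"success": ok, "result": {"observed_value": row_count}}
```

The `min_value`/`max_value` kwargs in the generated suite are presumably derived from whatever sample was profiled when the suite was created.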
@ace-racer
ace-racer / create_expectations.ipynb
Created February 24, 2024 03:01
Create expectations using GE
@ace-racer
ace-racer / 02_load_tweets_es.py
Created October 29, 2022 17:52
Load Tweets to Elasticsearch using Pandas and Python Elasticsearch client
import tqdm
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
import pandas as pd
FILE_LOC = 'staging/TweetsElonMusk.csv'
INDEX_NAME = 'elonmusktweets'
df = pd.read_csv(FILE_LOC)
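The snippet ends right after the CSV is read. A sketch of how the rows might then be streamed into Elasticsearch with `streaming_bulk`, assuming one action per row and the row index as document id; these choices (and the progress-bar wiring) are illustrative, not necessarily the author's:

```python
def generate_actions(records, index_name):
    """Yield one bulk-index action per row dict."""
    for i, row in enumerate(records):
        yield {"_index": index_name, "_id": i, "_source": row}


def load_tweets(df, index_name, hosts="http://localhost:9200"):
    """Sketch: stream the DataFrame into Elasticsearch (requires a running cluster)."""
    import tqdm
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import streaming_bulk

    client = Elasticsearch(hosts=hosts)
    records = df.to_dict(orient="records")
    ok_count = 0
    # streaming_bulk yields one (ok, item) tuple per document as it is indexed
    for ok, _ in tqdm.tqdm(
            streaming_bulk(client, generate_actions(records, index_name)),
            total=len(records)):
        ok_count += ok
    return ok_count
```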
@ace-racer
ace-racer / 01_create_index.py
Last active October 29, 2022 14:29
Create index in elasticsearch running locally using Python
from elasticsearch import Elasticsearch

# the 8.x Python client requires a full URL including scheme and port
client = Elasticsearch(hosts='http://localhost:9200')
client.indices.create(index='elonmusktweets')
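The gist relies on dynamic mapping. A sketch of the same step with an explicit mapping and an exists-check so the script is safe to re-run; the field names here are assumptions, not taken from the actual tweets CSV:

```python
def build_index_body():
    """Hypothetical explicit mapping for the tweets index (field names assumed)."""
    return {
        "mappings": {
            "properties": {
                "tweet": {"type": "text"},
                "date": {"type": "date"},
            }
        }
    }


def create_index(client, index_name):
    """Sketch: create the index only if it does not already exist."""
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name,
                              mappings=build_index_body()["mappings"])
```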
@ace-racer
ace-racer / docker-compose.yml
Created October 29, 2022 13:50
Elasticsearch with Kibana docker compose
version: '3.7'
services:
  # Elasticsearch Docker Images: https://www.docker.elastic.co/
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
@ace-racer
ace-racer / mlflow_operations.py
Last active January 23, 2021 10:44
IRIS classification with MLFlow
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
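Only the imports survive in this snippet, but they suggest a train-then-log workflow. A sketch under that assumption: a plain scikit-learn training function, plus an MLflow wrapper that logs params, the metric, and the model. The parameter values, metric name, and artifact path are assumptions, not the author's:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_iris(n_estimators=10, random_state=42):
    """Train a RandomForest on IRIS and return the model and its test accuracy."""
    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   random_state=random_state)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


def train_and_log(n_estimators=10):
    """Sketch: record the run in MLflow (needs mlflow and a local or remote store)."""
    import mlflow
    import mlflow.sklearn
    with mlflow.start_run():
        model, acc = train_iris(n_estimators=n_estimators)
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_metric("accuracy", acc)
        mlflow.sklearn.log_model(model, "model")
    return acc
```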
@ace-racer
ace-racer / build.sbt
Created January 10, 2021 14:06
Build sbt for Word Counter Scala application
name := "WordCounter"
version := "0.1"
scalaVersion := "2.12.6"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"
@ace-racer
ace-racer / WikiContentWordCounter.scala
Created January 10, 2021 14:03
Word counter Spark job using scala
package org.spark.learning

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lower, regexp_replace, col, explode, count, desc}
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover}

object WikiContentWordCounter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
@ace-racer
ace-racer / word_count_extended.py
Last active December 31, 2020 04:06
Spark job to count the occurrences of words after removing stop words
# Base code adapted from the samples shipped with the Spark installation
import sys
import os
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import lower, regexp_replace
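The snippet stops at the imports, but `lower` and `regexp_replace` suggest the text is normalized before stop words are removed. A sketch of that pipeline: a pure-Python `normalize` mirroring the column transforms, and a Spark function assuming line-oriented input in the default `value` column (both assumptions, since the original job body is not shown):

```python
import re


def normalize(text: str) -> str:
    """Lowercase and strip non-letter characters, mirroring lower + regexp_replace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def count_words(spark, file_path, text_column="value"):
    """Sketch: count word occurrences after stop-word removal (requires pyspark)."""
    from pyspark.sql.functions import (col, count, desc, explode, lower,
                                       regexp_replace, split)
    from pyspark.ml.feature import StopWordsRemover

    df = spark.read.text(file_path)
    # lowercase, drop punctuation/digits, then split each line into a word array
    cleaned = df.select(
        split(regexp_replace(lower(col(text_column)), r"[^a-z\s]", ""),
              r"\s+").alias("words"))
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    return (remover.transform(cleaned)
            .select(explode(col("filtered")).alias("word"))
            .where(col("word") != "")
            .groupBy("word")
            .agg(count("*").alias("occurrences"))
            .orderBy(desc("occurrences")))
```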