Nathan Cooper ncoop57

🤓
I'm a nerd.
View GitHub Profile
use arrow::{
    ipc::{reader::FileReader, writer::FileWriter},
    record_batch::RecordBatch,
};
use std::fs::File;

fn hash_text_column(input_path: &str, output_path: &str) {
    // FileReader wraps an open file rather than a path.
    let input_file = File::open(input_path).unwrap();
    let input_reader = FileReader::try_new(input_file, None).unwrap();
    let input_schema = input_reader.schema().clone();
@ncoop57
ncoop57 / minhash_stackexchange.py
Last active January 26, 2023 07:57
Pyspark Minhash
import time
import os
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH
from pyspark.sql.functions import col
from spark_session_builder import build_spark_session
spark = build_spark_session("spark://cpu64-dy-c6i-16xlarge-1:7077", 32, 128)
db = spark.read.parquet("/fsx/shared/pilev2_parquet/StackExchange_ver4_non_local_dedupped/dataset.parquet").limit(1_000_000) # Stage 0 & 1
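The imports in the preview above (tokenizer, n-grams, hashed term frequencies, MinHashLSH) suggest a near-duplicate detection pipeline, although the rest of the gist is not shown here. As a rough illustration of the underlying MinHash idea only, here is a minimal pure-Python sketch. It is not the Spark implementation and not the author's code: the function names, the use of `hashlib.blake2b` as the seeded hash family, and the defaults of 3-word shingles and 64 hash functions are all assumptions made for this example.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> set:
    """Lowercase the text, split on non-word characters, return word n-grams."""
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    if len(tokens) < n:
        # Too short for a full n-gram: fall back to one shingle (or none).
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _seeded_hash(item: str, seed: int) -> int:
    """Hypothetical hash family: blake2b salted with the seed, 64-bit output."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all items."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    return [min(_seeded_hash(item, seed) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose signatures agree in most positions are likely near-duplicates. Spark's `MinHashLSH` adds banding and distributed joins on top of this idea, which this sketch leaves out.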
import boto3
s3 = boto3.resource("s3")
my_bucket = s3.Bucket("s-eai-neox")
file_paths = []
for my_bucket_object in my_bucket.objects.filter(Prefix="data/codepile/group1/"):
    # print(my_bucket_object.key)
    file_paths.append(f"s3a://s-eai-neox/{my_bucket_object.key}")
print(len(file_paths))
from spark_session_builder import build_spark_session
file_paths = file_paths[100:200]
base_model: NousResearch/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: true
load_in_4bit: false
strict: false
datasets:
- path: answerdotai/tiny_programs_haiku3_critiques
@ncoop57
ncoop57 / test.ipynb
Created November 8, 2024 04:49
My Dialog
@ncoop57
ncoop57 / example.py
Created November 8, 2024 04:50
My example gist
print("Hello")
@ncoop57
ncoop57 / test.ipynb
Created November 8, 2024 04:52
My Dialog
@ncoop57
ncoop57 / test.ipynb
Created November 8, 2024 05:19
My Dialog

Title | Authors | Venue | Cited by | Year
Studying the usage of text-to-text transfer transformer to support code-related tasks | A Mastropaolo, S Scalabrino, N Cooper, DN Palacio, D Poshyvanyk, ... | 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE …, 2021 | 249 | 2021
A systematic literature review on the use of deep learning in software engineering research | C Watson, N Cooper, DN Palacio, K Moran, D Poshyvanyk | ACM Transactions on Software Engineering and Methodology (TOSEM) 31 (2), 1-58, 2022 | 117 | 2022
An empirical study on the usage of BERT models for code completion | M Ciniselli, N Cooper, L Pascarella, D Poshyvanyk, M Di Penta, G Bavota | 2021 IEEE/ACM 18th International Conference on Mining Software Repositories …, 2021 | 84 | 2021
An empirical study on the usage of transformer models for code completion | M Ciniselli, N Cooper, L Pascarella, A Mastropaolo, E Aghajani, ... | IEEE Transactions on Software Engineering 48 (12), 4818-4837, 2021 | 83 | 2021
Translating video recordings of mobile app usages into replayable scenarios | C Bernal