import kuzu
import kuzu.connection
import kuzu.database


def create_tables(conn: kuzu.connection.Connection) -> None:
    try:
        # Create a Person node table
        conn.execute(
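The preview is cut off inside the execute call. A minimal sketch of how such a call typically continues, assuming a simple Person schema; the table columns below are illustrative, not taken from the original gist:

import kuzu


def create_tables_example(conn: kuzu.Connection) -> None:
    try:
        # Assumed example schema: a Person node table keyed on name.
        conn.execute(
            "CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY (name))"
        )
    except Exception as e:
        # The original wraps creation in a try block; here we just report and continue.
        print(f"Skipping table creation: {e}")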
# Dockerfile for a Spark environment with Python 3.10. The image is based on the
# miniconda3 image, installs OpenJDK 17, Spark 3.5.1 (with Hadoop 3 and Scala 2.13)
# and Poetry, and then installs the Python packages specified in the pyproject.toml file.
FROM continuumio/miniconda3

RUN apt-get update && \
    apt-get install -y curl apt-transport-https openjdk-17-jdk-headless wget build-essential git \
        autoconf automake libtool pkg-config libpq5 libpq-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
{
  "DATA_SOURCE": "TEST",
  "RECORD_ID": "1",
  "RECORD_TYPE": "PERSON",
  "NAME_LIST": [
    {
      "NAME_TYPE": "PRIMARY",
      "NAME_FULL": "KIM SOO IN"
    }
  ],
{
  "DATA_SOURCE": "TEST",
  "RECORD_ID": "6",
  "RECORD_TYPE": "ORGANIZATION",
  "NAME_LIST": [
    {
      "NAME_TYPE": "PRIMARY",
      "NAME_ORG": "Random Company, LTD."
    }
  ],
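Records in this shape are typically kept one per line and streamed into an entity-resolution loader. A small Python sketch of reading them, assuming the records live in a hypothetical data/records.jsonl file (the loading API itself is not shown here):

import json

# Hypothetical file name; each line holds one record like the two examples above.
with open("data/records.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for record in records:
    # DATA_SOURCE and RECORD_ID identify a record; NAME_LIST carries
    # NAME_FULL for persons and NAME_ORG for organizations.
    names = [n.get("NAME_FULL") or n.get("NAME_ORG") for n in record.get("NAME_LIST", [])]
    print(record["DATA_SOURCE"], record["RECORD_ID"], record["RECORD_TYPE"], names)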
#!/usr/bin/env bash
# The quoted block below (opened by ": '") is a no-op in bash; it carries the
# Windows batch/PowerShell version of this download script.
: '
@echo off
powershell -ExecutionPolicy Bypass -Command "$ErrorActionPreference='Stop'; $ProgressPreference='SilentlyContinue';
$output_file = 'data/full-oldb.LATEST.zip'
$extract_dir = 'data'
Write-Host "`nDownloading the ICIJ Offshore Leaks Database to $output_file`n"
Invoke-WebRequest -Uri 'https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip' -OutFile $output_file
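For environments without PowerShell, a rough Python equivalent of the same download step (a sketch, not part of the original script; it assumes the truncated remainder extracts the archive into $extract_dir and that the data/ directory already exists):

import urllib.request
import zipfile

url = "https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip"
output_file = "data/full-oldb.LATEST.zip"
extract_dir = "data"

print(f"Downloading the ICIJ Offshore Leaks Database to {output_file}")
urllib.request.urlretrieve(url, output_file)

with zipfile.ZipFile(output_file) as zf:
    zf.extractall(extract_dir)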
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class CosineSentenceBERT(nn.Module):
    # SBERT_MODEL (the Hugging Face id of the base sentence transformer) is
    # assumed to be defined earlier in the gist.
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
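The rest of the class is cut off. In the standard Sentence-BERT setup, token embeddings are mean-pooled and a pair is scored by cosine similarity; a sketch of that idea under this assumption, reusing the imports above (the helper names are illustrative):

# Sketch: mean-pool token embeddings, then score a pair by cosine similarity.
def mean_pool(token_embeddings, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)


def cosine_score(model, tokenizer, sent_a, sent_b):
    enc_a = tokenizer(sent_a, return_tensors="pt", padding=True, truncation=True)
    enc_b = tokenizer(sent_b, return_tensors="pt", padding=True, truncation=True)
    emb_a = mean_pool(model(**enc_a).last_hidden_state, enc_a["attention_mask"])
    emb_b = mean_pool(model(**enc_b).last_hidden_state, enc_b["attention_mask"])
    return F.cosine_similarity(emb_a, emb_b)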
#!/bin/bash
#
# Quickly extract all unique address, person and company name records from pairs.json: https://www.opensanctions.org/docs/pairs/
# Note: non-commercial use only, affordable licenses available at https://www.opensanctions.org/licensing/
#

# Get the data
wget https://data.opensanctions.org/contrib/training/pairs.json -O data/pairs.json
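The downloaded file is large, so it is usually processed line by line rather than loaded whole. A small Python sketch of that pattern, assuming pairs.json is newline-delimited JSON (the exact record schema is documented on the pairs page linked above):

import json
from collections import Counter

key_counts = Counter()
with open("data/pairs.json", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        record = json.loads(line)
        key_counts.update(record.keys())

print(key_counts)  # inspect which fields each pair record carries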
class SentenceBERT(torch.nn.Module):
    # SBERT_MODEL is assumed to be defined earlier in the gist; model_name is kept
    # for reference, but the tokenizer and model are loaded from the locally
    # fine-tuned checkpoint below.
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
        self.model = AutoModel.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
        self.ffnn = torch.nn.Linear(dim * 3, 1)
        # Freeze the weights of the pre-trained model
        for param in self.model.parameters():
            param.requires_grad = False
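The dim * 3 input of ffnn points at the classic SBERT classification head, which concatenates the two pooled sentence embeddings with their element-wise absolute difference. A sketch of the likely remainder of the forward pass, based on that layer shape rather than on the truncated gist, reusing the mean_pool helper sketched earlier:

# Sketch: pool each sentence, concatenate [u, v, |u - v|], then score with ffnn.
def forward_pair(sbert, enc_a, enc_b):
    u = mean_pool(sbert.model(**enc_a).last_hidden_state, enc_a["attention_mask"])
    v = mean_pool(sbert.model(**enc_b).last_hidden_state, enc_b["attention_mask"])
    features = torch.cat([u, v, torch.abs(u - v)], dim=1)  # shape: (batch, dim * 3)
    return torch.sigmoid(sbert.ffnn(features)).squeeze(-1)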
System: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created 27 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic they express (ex. 'different street number') and a Label (1.0 for positive match, 0.0 for negative). | |
The training data covers two categories of corner cases. The first is when similar addresses in string distance aren't the same. The second is the opposite: when dissimilar addresses in string distance are the same. Your task is to read a pair of Addresses, their Description and their Label and generate 100 different examples that express a similar semantic. Your job is to create variations of these records. For some of the records, implement the logic in the Descript |
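A minimal sketch of how such address pairs can be fed to sentence-transformers for fine-tuning, assuming the standard InputExample / CosineSimilarityLoss training loop; the concrete pairs below are illustrative, not taken from the 27 curated records:

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Illustrative pairs only; the real training data is the 27 curated records.
train_examples = [
    InputExample(texts=["12 Main St, Springfield", "14 Main St, Springfield"], label=0.0),   # different street number
    InputExample(texts=["12 Main Street, Springfield", "12 Main St., Springfield"], label=1.0),  # same address, different spelling
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)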
import numpy as np
import pk
import seaborn as sns

# Drug with half-life hl=8 and time-to-peak t_max=1 (units as defined by the pk package).
drug = pk.Drug(hl=8, t_max=1)

# 5 day simulation
conc = drug.concentration(
    60,
    1,
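The call above is cut off, but the seaborn import suggests the resulting concentration curve is plotted afterwards. A self-contained sketch of that idea which does not depend on the pk package, using a plain first-order decay curve built from the hl=8 half-life (the dose of 100 and the 60 sample points are illustrative):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

hl = 8                            # half-life from the snippet above
t = np.linspace(0, 5 * 24, 60)    # 5-day window sampled at 60 points
conc = 100 * 0.5 ** (t / hl)      # first-order decay from an illustrative dose of 100

sns.lineplot(x=t, y=conc)
plt.xlabel("time")
plt.ylabel("concentration")
plt.show()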