Kazuki Inamura (kzinmr), Tokyo, Japan
# Fetch some text content in two different categories
from wikipediaapi import Wikipedia
wiki = Wikipedia('RAGBot/0.0', 'en')
docs = [{"text": x,
"category": "person"}
for x in wiki.page('Hayao_Miyazaki').text.split('\n\n')]
docs += [{"text": x,
"category": "film"}
for x in wiki.page('Spirited_Away').text.split('\n\n')]
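These categorized chunks are set up for retrieval; a minimal sketch of a downstream step, assuming sentence-transformers for embedding (the model name, query, and metadata filter are illustrative, not part of the gist):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed embedding model
film_docs = [d for d in docs if d["category"] == "film"]  # metadata filter
doc_emb = model.encode([d["text"] for d in film_docs], normalize_embeddings=True)
q_emb = model.encode(["Who directed Spirited Away?"], normalize_embeddings=True)
best = film_docs[int(np.argmax(doc_emb @ q_emb[0]))]  # dot product = cosine on normalized vectors
print(best["text"][:200])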
@hamelsmu
hamelsmu / webhook-circleback.py
Created April 25, 2024 04:59
Generate a project proposal automatically from a meeting transcript
from fastapi import Request, HTTPException
from pydantic import BaseModel, HttpUrl
from modal import Secret, App, web_endpoint, Image
from typing import Optional, List
from example import proposal
import os

app = App(
    name="circleback",
    image=Image.debian_slim().pip_install("openai", "pydantic", "fastapi"),
)

class Attendee(BaseModel):
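The preview cuts off at the Attendee model; a hypothetical completion, only to show the shape such a Pydantic model takes (the field names are guesses, not the gist's actual schema):

class Attendee(BaseModel):
    # illustrative fields; the real schema is not shown in the preview
    name: str
    email: Optional[str] = None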
@aksh-at
aksh-at / modal_fast_whisper.py
Created February 23, 2024 18:29
Insanely fast whisper on Modal
import base64
import tempfile
from typing import Optional
from pydantic import BaseModel
from modal import Image, Secret, Stub, build, enter, gpu, web_endpoint
whisper_image = (
    Image.micromamba()
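The preview cuts off inside the image definition; a plausible continuation of the Modal image chain (the package list is an assumption, not the gist's):

whisper_image = (
    Image.micromamba()
    .apt_install("ffmpeg")  # assumed: audio decoding for Whisper input
    .pip_install("transformers", "torch")  # assumed package list
)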
@veekaybee
veekaybee / normcore-llm.md
Last active April 24, 2025 04:56
Normcore LLM Reads

Anti-hype LLM reading list

Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod are eagerly sought.

Foundational Concepts

Pre-Transformer Models

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine-tuning", learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case for RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

@fscm
fscm / install_cmake.md
Last active March 25, 2025 16:39
[macOS] Install CMake

Instructions on how to install the CMake tool on macOS.

Uninstall

The first step should be to uninstall any previous CMake installation. This step can be skipped if no CMake version was previously installed.

To uninstall any previous CMake installations, use the following commands:
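The gist preview truncates before the commands; typical commands for removing a CMake.app installation (the exact paths are assumptions and depend on how CMake was originally installed):

sudo rm -rf /Applications/CMake.app
sudo rm -f /usr/local/bin/cmake /usr/local/bin/ccmake /usr/local/bin/cpack /usr/local/bin/ctest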

@lovit
lovit / huggingface_konlpy.md
Last active November 20, 2024 18:00
huggingface + KoNLPy

Huggingface

  • Provides a wide range of NLP-related packages; three in particular are useful for training language models

| package | note |
| --- | --- |
| transformers | Transformer-based (masked) language model algorithms, plus pretrained models |
| tokenizers | Train and use the tokenizers consumed by transformers; shipped as a package separate from transformers |
| nlp | Datasets and evaluation metrics |
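A minimal sketch of the combination this gist is about, assuming KoNLPy's Komoran as a morpheme-level pre-tokenizer before training a Huggingface WordPiece tokenizer (the file names and parameters are illustrative):

from konlpy.tag import Komoran
from tokenizers import BertWordPieceTokenizer

komoran = Komoran()

# Pre-tokenize raw Korean text into morphemes so WordPiece learns subwords
# over morphemes rather than over raw space-separated units (eojeol).
with open('corpus.txt', encoding='utf-8') as fin, \
        open('corpus.morphs.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(komoran.morphs(line.strip())) + '\n')

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=['corpus.morphs.txt'], vocab_size=30000)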
@kzinmr
kzinmr / cluster_df.py
Last active September 10, 2024 10:05
from collections import defaultdict, Counter
from functools import reduce, partial
from itertools import chain
from operator import add
import numpy as np
from sklearn.cluster import KMeans

def dict_of_list(keys, values):
    # Group values by key into a {key: [values]} dict.
    # (body reconstructed; the gist preview truncates after the assert)
    assert len(keys) == len(values)
    d = defaultdict(list)
    for k, v in zip(keys, values):
        d[k].append(v)
    return dict(d)

def flatten(l):
    # Flatten a list of lists into a single list
    return list(chain.from_iterable(l))
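A short usage sketch under assumed inputs, showing how dict_of_list pairs with KMeans to group row indices by cluster label:

X = np.random.rand(100, 8)  # hypothetical feature matrix
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)
clusters = dict_of_list(labels.tolist(), list(range(len(X))))  # cluster id -> row indices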
@kzinmr
kzinmr / pmi.py
Last active September 10, 2024 10:07
PMI calculation
"calculate PMI(A,B)=P(A,B)/P(A)P(B) for every token A and B in a window"
from itertools import tee, combinations
from collections import Counter
def count_bigram(sentence, window=5):
# ['A','B','C','D', 'E', 'F', 'G'], 4 ->
# [['A', 'B', 'C', 'D'],
# ['B', 'C', 'D', 'E'],
# ['C', 'D', 'E', 'F'],
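A sketch of the PMI step itself, assuming count_bigram's pair counts plus unigram counts over the same sentences (the function name and the smoothing-free formula are illustrative):

from math import log

def pmi_scores(sentences, window=5):
    unigrams = Counter(tok for s in sentences for tok in s)
    bigrams = Counter()
    for s in sentences:
        bigrams.update(count_bigram(s, window))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    # PMI(A, B) = log( P(A, B) / (P(A) * P(B)) )
    return {(a, b): log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
            for (a, b), c in bigrams.items()}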