Mehmet Öner Yalçın oneryalcin

Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made NLP (Natural Language Processing) a breeze. In this post, we are going to take a look at tokenization using a hands-on approach with the help of the Tokenizers library. We are going to load a real-world dataset containing 10-K filings of public firms and see how to train a tokenizer from scratch based on the BERT tokenization scheme. In the process, we will understand tokenization in detail and some gotchas to keep an eye out for.
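To make the "BERT tokenization scheme" a little more concrete before we touch any library, here is a minimal pure-Python sketch of the greedy longest-match-first sub-word splitting that WordPiece-style tokenizers apply to a single word. It is only an illustration under stated assumptions: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, made up for this example, and this is not the Tokenizers library's API or implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-word tokens by greedy longest-match-first lookup.

    Hypothetical helper for illustration only; `vocab` is any set of known pieces.
    Pieces that continue a word are stored with the `##` prefix, as in BERT vocabularies.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (an assumption for this sketch, not a trained one).
toy_vocab = {"token", "##izer", "##s", "fil", "##ing"}

print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("filings", toy_vocab))     # ['fil', '##ing', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Training a tokenizer from scratch amounts to learning a vocabulary like `toy_vocab` from a corpus; the lookup step sketched here is what then turns raw words into pieces.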

Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.

For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

Things I believe

This is a collection of the things I believe about software development. I have worked for years building backend and data processing systems, so read the below within that context.

Agree? Disagree? Feel free to let me know at @JanStette.

Fundamentals

Keep it simple, stupid. You ain't gonna need it.

Backend Architectures

Twitter

ror, scala, jetty, erlang, thrift, mongrel, comet server, mysql, memcached, varnish, kestrel(mq), starling, gizzard, cassandra, hadoop, vertica, munin, nagios, awstats

Default Django Logging Tree

`app.py`

#!/usr/bin/env python
import os

import django
import logging_tree

Requirements

Minikube requires VT-x/AMD-v virtualization to be enabled in the BIOS. To check that it is enabled on OS X / macOS, run:

sysctl -a | grep machdep.cpu.features | grep VMX

If there's output, you're good!

	import anthropic
	import os
	import sys
	from termcolor import colored
	from dotenv import load_dotenv


	class ClaudeAgent:
	    def __init__(self, api_key=None, model="claude-3-7-sonnet-20250219", max_tokens=4000):
	        """Initialize the Claude agent with API key and model."""

	# Windsurf Memory Bank

	I am Windsurf, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.

	## Memory Bank Structure

	The Memory Bank consists of required core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:

	```mermaid
	flowchart TD

	import openai
	import pinecone
	from sentence_transformers import SentenceTransformer

	class GPTConversationManager:
	    def __init__(self, api_key, pinecone_api_key, index_name):
	        self.api_key = api_key
	        openai.api_key = self.api_key
	        self.conversation_history = []
	        self.pinecone_api_key = pinecone_api_key

	"""
	Upsert gist

	Requires at least postgres 9.5 and sqlalchemy 1.1

	Initial state:

	[]
	Initial upsert:

	#!/usr/bin/python3
	""" Demonstration of logging feature for a Flask App. """

	from logging.handlers import RotatingFileHandler
	from flask import Flask, request, jsonify
	from time import strftime

	__author__ = "@ivanleoncz"

	import logging