Peter Chan cgpeter96

cgpeter96 / tokenization.cpp
Created February 3, 2023 09:15 — forked from luistung/tokenization.cpp
C++ version of the BERT tokenizer
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include <utf8proc.h>
// Unicode normalization forms (UAX #15): https://unicode.org/reports/tr15/#Norm_Forms
// ICU uchar.h (Unicode character properties) reference: https://ssl.icu-project.org/apiref/icu4c/uchar_8h.html