MeMartijn / jina_text_segmenter.py
Created October 29, 2024 15:01
Jina AI's Segmenter ported to Python

import regex
from typing import List

# Define constants
MAX_HEADING_LENGTH = 7
MAX_HEADING_CONTENT_LENGTH = 200
MAX_HEADING_UNDERLINE_LENGTH = 200
MAX_HTML_HEADING_ATTRIBUTES_LENGTH = 100
MAX_LIST_ITEM_LENGTH = 200
MAX_NESTED_LIST_ITEMS = 6
noisychannel / moses-built-ttable.sh
Created April 23, 2015 21:58
MOSES: Build phrase table
#!/usr/bin/env bash
# Change these variables
ROOT_DIR=/export/a04/gkumar/experiments/scale-2015/1
EXTERNAL_BIN_DIR=/export/a04/gkumar/code/mosesdecoder/tools
F_EXT=pa
E_EXT=en
MAX_PHRASE_LENGTH=10
CORPUS=/export/a04/gkumar/experiments/scale-2015/data/trans