From https://github.com/laurieburchell/open-lid-dataset
Paper: https://arxiv.org/pdf/2305.13820.pdf
Model: https://data.statmt.org/lid/lid201-model.bin.gz Licensed under GPLv3
"ന്": 26465,
"ക്": 28298,
"ത്": 31691,
"ക്ക": 41627,
"ന്ന": 45828,
"▁പ": 46110,
"▁ക": 49867,
"തി": 50292,
"്ട": 52078,
"ും": 55511,
# Print a tab-separated file with all languages supported by MT providers
import requests
from typing import List
mtlabels = {
    "Apertium": "Ⓐ",
    "Elia": "Ⓔ",
    "Google": "Ⓖ",
    "MinT": "Ⓜ",
    "Yandex": "Ⓨ",
    "LingoCloud": "Ⓛ",
! U+00021 BC=ON BLK=Basic_Latin SC=Common EXCLAMATION MARK
? U+0003F BC=ON BLK=Basic_Latin SC=Common QUESTION MARK
։ U+00589 BC=L BLK=Armenian SC=Armenian ARMENIAN FULL STOP
؝ U+0061D BC=AL BLK=Arabic SC=Arabic ARABIC END OF TEXT MARK
؞ U+0061E BC=AL BLK=Arabic SC=Arabic ARABIC TRIPLE DOT PUNCTUATION MARK
؟ U+0061F BC=AL BLK=Arabic SC=Common ARABIC QUESTION MARK
۔ U+006D4 BC=AL BLK=Arabic SC=Arabic ARABIC FULL STOP
܀ U+00700 BC=AL BLK=Syriac SC=Syriac SYRIAC END OF PARAGRAPH
܁ U+00701 BC=AL BLK=Syriac SC=Syriac SYRIAC SUPRALINEAR FULL STOP
܂ U+00702 BC=AL BLK=Syriac SC=Syriac SYRIAC SUBLINEAR FULL STOP
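Part of each row above (code point, bidi class, character name) can be reproduced with Python's standard `unicodedata` module. The sketch below is only an illustration and is not the script that generated the listing. The helper name `describe` is hypothetical, and the block (`BLK`) and script (`SC`) fields are left out because the standard library does not expose them.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return one tab-separated row: character, code point, bidi class, name."""
    code_point = f"U+{ord(ch):05X}"            # zero-padded to 5 hex digits, as in the listing
    bidi_class = unicodedata.bidirectional(ch)  # e.g. "ON", "L", "AL"
    # Fall back to a placeholder if the running Python's Unicode tables
    # are too old to know the character.
    name = unicodedata.name(ch, "<unnamed>")
    return f"{ch}\t{code_point}\tBC={bidi_class}\t{name}"


if __name__ == "__main__":
    # A few sentence terminators taken from the listing above.
    for ch in ["!", "?", "\u0589", "\u061F", "\u06D4"]:
        print(describe(ch))
```

Results depend on the Unicode version bundled with the interpreter, so recently added characters may show up as `<unnamed>` on older Python releases.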
Manjari | Manjula |
---|---|
Maintained by the designer | Maintainer unknown |
Updates are available | Since there is no maintainer, updates are not expected |
Source code is available | Only the TTF binary is available. A script to convert the original Manjari-Regular variant to this TTF is also available |
OTF, TTF, and webfont versions are provided. The OTF is closest to the design; the TTF is a quadratic-curve approximation | Only a TTF version is provided |
Regular, Bold, Thin variants are available | Only Regular is provided |
Public issue tracker is available | No issue tracker |
Contains a large set of glyphs. With OpenType rules, either the 1971 style or the 2022 style can be used | Supports only the subset of glyphs needed for the Government Script Reformation 2022. The font still contains all glyphs, but the code to form them has been removed, so the font file is larger than necessary |
# Misc cleanup on corpus
# sed -i -f corpora-cleanup.sed corpus/*.txt
# Chillu normalization: consonant + virama + ZWJ (U+200D) -> atomic chillu
s/ന്‍/ൻ/g
s/ള്‍/ൾ/g
s/ല്‍/ൽ/g
s/ര്‍/ർ/g
s/ന്‍/ൻ/g
s/ണ്‍/ൺ/g
# Remove ZWNJ at end of words
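The chillu rules above can be sketched in Python with explicit escapes, which makes the invisible joiner visible. This assumes the rules target the legacy encoding (consonant + virama U+0D4D + zero-width joiner U+200D) and map it to the atomic chillu code points U+0D7A to U+0D7E. The names `CHILLU_MAP` and `normalize_chillu` are hypothetical and do not come from the original script.

```python
ZWJ = "\u200d"     # ZERO WIDTH JOINER
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA

# Assumed mapping: legacy sequence (consonant + virama + ZWJ) -> atomic chillu.
CHILLU_MAP = {
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
}


def normalize_chillu(text: str) -> str:
    """Replace every legacy chillu sequence with its atomic code point."""
    for legacy, atomic in CHILLU_MAP.items():
        text = text.replace(legacy, atomic)
    return text


if __name__ == "__main__":
    legacy_word = "\u0d05\u0d35\u0d28" + VIRAMA + ZWJ  # ends in a legacy chillu N
    normalized = normalize_chillu(legacy_word)
    print([f"U+{ord(c):04X}" for c in normalized])
```

A consonant followed by a virama without the joiner is left alone, so conjuncts are not rewritten.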
const glyphs = require('./glyphs.json').glyphs
const ligatures = require('./ligatures.json').ligatures
const getGlyphValue = (glyphname) => {
  const glyph = glyphs.find(g => g.glyph === glyphname);
  return glyph && glyph.value;
}
const process = () => {
  const ligaturesLength = ligatures.length;
import scrapy
from scrapy.http import Request
class HeadlineCatcher(scrapy.Spider):
    name = "headlinecatcher"
    start_urls = ["http://www.prd.kerala.gov.in/pressrelease"]
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
I hereby claim:
To claim this, I am signing this object:
youtube-dl -f bestaudio --extract-audio --audio-format mp3 --audio-quality 0 -o "%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=abdlshfjskdhfuwhrklk"