Luis Mondragón luismond

Localization Engineer specializing in NLP & MT | Building tools for multilingual systems | I wrangle multilingual data and occasionally argue with regex

TM2TB
Mexico
in/luismondragon

luismond / gist:c4eef23a152185ed8e20

Created January 19, 2016 16:53

	from apiclient.discovery import build
	service = build('translate', 'v2', developerKey='')

luismond / save_author_names.py

Created March 10, 2019 00:49

Save Reddit author names

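	import time

	# `reddit` (an authenticated praw.Reddit instance) and `sub` (a subreddit name) are assumed to be defined elsewhere.
	def save_author_names():
	    # Deleted accounts have `author` set to None, so skip them.
	    authors = [cmnt.author.name for cmnt in reddit.subreddit(sub).comments(limit=None) if cmnt.author is not None]
	    with open('authors'+'_mexico_'+str(time.time())+'.txt', 'w', encoding='utf8') as f:
	        for a in authors:
	            f.write(a+'\n')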
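The file-writing half of this gist can be tried without Reddit credentials. The following is only a sketch under assumptions: `save_names`, `prefix`, and `directory` are hypothetical names that do not appear in the gist, and the names are passed in as a plain list instead of being fetched through PRAW.

```python
import time
from pathlib import Path


def save_names(names, prefix="authors_mexico", directory="."):
    """Write one name per line to a timestamped UTF-8 file and return its path."""
    # Drop missing entries (for example, None placeholders for deleted accounts).
    cleaned = [name for name in names if name]
    path = Path(directory) / f"{prefix}_{int(time.time())}.txt"
    with open(path, "w", encoding="utf8") as f:
        for name in cleaned:
            f.write(name + "\n")
    return path
```

Returning the path makes it easy to reopen the file later, since the timestamp in the name is not known in advance.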

luismond / download_vtt.py

Created March 10, 2019 00:59

Downloads the .vtt subtitle file of a YouTube video

	import youtube_dl

	def download_vtt(url, lang):
	    ydl_opts = {
	        'quiet': True,
	        'subtitleslangs': [lang],
	        'writeautomaticsub': True,  # fetch auto-generated subtitles
	        'skip_download': True       # do not download the video itself
	    }
	    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
	        ydl.download([url])

luismond / faker_es.py

Created March 10, 2019 01:56

Faker test for Spanish locales

	from faker import Faker
	fake = Faker('es_MX')

	for n in range(10):
	    print(fake.name())

	'''
	Humberto Menchaca Berríos
	Lic. Irma Menchaca
	Elisa Barrera

luismond / faker_es_jobs.py

Created March 10, 2019 02:05

	from faker import Faker
	fake = Faker('es_MX')

	for n in range(10):
	    print(fake.job())

	'''
	Geologist, wellsite
	Sports development officer
	Telecommunications researcher

luismond / faker_es_jobs_translated.py

Created March 10, 2019 02:09

	from faker import Faker
	from translate import Translator

	fake = Faker('es_MX')
	translator = Translator(to_lang="es")

	for n in range(10):
	    print(translator.translate(fake.job()))

	'''

luismond / strip_punctuation.py

Created March 13, 2019 19:51

Strip punctuation function

	def strip_punct(line):
	    line = str(line)
	    charset = set()
	    for ch in line:
	        charset.update(ch)
	    punct = [ch for ch in charset if not ch.isalpha()]
	    if ' ' in punct:
	        punct.remove(' ')
	    for ch in punct:
	        line = line.replace(ch, ' ').lower()
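The preview above stops before the function returns anything. A compact variant of the same idea might look like the sketch below. It is not the author's full gist: the `return` statement is an assumption, and this version always lowercases the result, whereas the loop above lowercases only when at least one non-letter character is found.

```python
def strip_punct(line):
    """Replace every non-alphabetic character with a space and lowercase the result."""
    line = str(line)
    # str.isalpha is Unicode-aware, so accented letters such as 'é' and 'ñ' are kept.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in line)
    return cleaned.lower()


# Runs of spaces left behind by punctuation can be collapsed afterwards.
print(" ".join(strip_punct("¡Hola, Mundo!").split()))  # prints "hola mundo"
```

Note that digits are also replaced, because they are not alphabetic characters.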

luismond / stanfordnlp_get_lemmas_spanish.py

Last active March 15, 2019 00:33

	import stanfordnlp
	MODELS_DIR = 'C:\\Users\\user\\stanfordnlp_resources\\'
	nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma', models_dir=MODELS_DIR, lang='es')

	def get_lemmas(line):
	    line = nlp(line)
	    tagged = [[w.lemma for w in sent.words if w.pos == 'ADV' or w.pos == 'ADJ' or w.pos == 'VERB']
	              for sent in line.sentences]
	    return ' '.join([w for sent in tagged for w in sent])
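The filtering step can be illustrated without downloading the Spanish models. In the sketch below, `Word` is a hypothetical stand-in for the tagger's word objects (it is not a stanfordnlp class), and `content_lemmas` and `CONTENT_POS` are names introduced only for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for tagged words; only the two fields used by the filter.
Word = namedtuple("Word", ["lemma", "pos"])

CONTENT_POS = {"ADV", "ADJ", "VERB"}


def content_lemmas(sentences, keep=CONTENT_POS):
    """Join the lemmas of words whose POS tag is in `keep`, in reading order."""
    return " ".join(w.lemma for sent in sentences for w in sent if w.pos in keep)


sentences = [
    [Word("el", "DET"), Word("perro", "NOUN"), Word("correr", "VERB"), Word("rápidamente", "ADV")],
    [Word("ser", "AUX"), Word("grande", "ADJ")],
]
print(content_lemmas(sentences))  # prints "correr rápidamente grande"
```

A set membership test keeps the tag list in one place, which makes it simple to add or drop categories such as `NOUN`.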

luismond / get_bilingual_data_from_tmx.py

Last active March 12, 2021 13:59

	# Get bilingual data from the European Commission translation memories
	# https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory#More%20details%20/%20Reference%20publication

	# I needed to extract only the EN-ES bilingual data from the TMX files for my machine translation experiment.
	# Their Java TM exporter was not working on my side,
	# so I wrote this script to get the data.

	import xmltodict
	import pandas as pd
	import os
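The preview ends at the imports, so the extraction logic itself is not visible here. As a rough illustration of pulling EN-ES pairs out of TMX, the sketch below uses the standard library's `xml.etree.ElementTree` instead of the `xmltodict` and `pandas` approach of the gist. `extract_pairs` and its parameters are hypothetical names, and the input is assumed to be an already decoded TMX string.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, src="en", tgt="es"):
    """Return (source, target) segment pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Accept either xml:lang or a plain lang attribute (an assumption
            # made here to cover files that use the plain form).
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both languages.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs
```

The returned list of tuples can be passed straight to `pandas.DataFrame(pairs, columns=["en", "es"])` if a table is needed.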

luismond / Getting_bilingual_data_from_tmx.ipynb

Last active March 19, 2019 19:47
