@luismond
Last active March 12, 2021 13:59
# Get bilingual data from the European Commission translation memories
# https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory#More%20details%20/%20Reference%20publication
# I needed to extract just the EN-ES bilingual data from the TMX files for my machine translation experiment.
# Their Java TM exporter was not working on my side,
# so I wrote this script to get the data.
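import os

import pandas as pd
import xmltodict

# Collect every TMX file in the current working directory.
tmxs = [f for f in os.listdir() if f.endswith('.tmx')]


def get_segments(file, source, target):
    """Return (source, target) segment pairs from one TMX file."""
    # The TMX files are read as UTF-16.
    with open(file, 'r', encoding='utf16') as fd:
        doc = xmltodict.parse(fd.read())
    tus = doc['tmx']['body']['tu']
    # xmltodict yields a dict, not a list, when the body holds a single <tu>.
    if isinstance(tus, dict):
        tus = [tus]
    tuvs = [el['tuv'] for el in tus]

    def get_pairs(tuv):
        # Join the segments whose language code matches each side.
        seg_source = ' '.join(t['seg'] for t in tuv if t['@lang'] == source)
        seg_target = ' '.join(t['seg'] for t in tuv if t['@lang'] == target)
        return (seg_source, seg_target)

    return [get_pairs(tuv) for tuv in tuvs]


# Flatten the per-file lists into one DataFrame (column 0: EN, column 1: ES).
segments = [get_segments(file, 'EN-GB', 'ES-ES') for file in tmxs]
segments = pd.DataFrame([x for y in segments for x in y])
segments.head()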
'''
0 Commission Regulation (EU) 2016/266 Reglamento (UE) 2016/266 de la Comisión
1 of 7 December 2015 de 7 de diciembre de 2015
2 amending, for the purpose of its adaptation to... que modifica, con vistas a su adaptación al pr...
3 (Text with EEA relevance) (Texto pertinente a efectos del EEE)
4 THE EUROPEAN COMMISSION, LA COMISIÓN EUROPEA,
'''
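
# Optional sketch, not part of the original script: the same EN-ES extraction
# written against the standard library's xml.etree.ElementTree.iterparse
# instead of xmltodict, so a large TMX file is streamed rather than loaded whole.
# Assumptions: iter_pairs and _sample are hypothetical names made up for this
# illustration, and each <tuv> carries the plain 'lang' attribute that the
# script above reads as '@lang'. Unlike the script above, units that lack
# either language are skipped instead of producing empty strings.
import io
import xml.etree.ElementTree as ET


def iter_pairs(tmx_file, source, target):
    """Yield (source, target) segment pairs from a TMX path or binary file object."""
    # The XML parser picks the encoding up from the file itself.
    for _, tu in ET.iterparse(tmx_file, events=('end',)):
        if tu.tag != 'tu':
            continue
        texts = {}
        for tuv in tu.iter('tuv'):
            seg = tuv.find('seg')
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                texts[tuv.get('lang')] = ''.join(seg.itertext())
        if source in texts and target in texts:
            yield texts[source], texts[target]
        tu.clear()  # release the finished translation unit


# Made-up two-unit sample; the second unit has no Spanish side and is skipped.
_sample = (
    b'<tmx><body>'
    b'<tu><tuv lang="EN-GB"><seg>Hello</seg></tuv>'
    b'<tuv lang="ES-ES"><seg>Hola</seg></tuv></tu>'
    b'<tu><tuv lang="EN-GB"><seg>Only English</seg></tuv></tu>'
    b'</body></tmx>'
)
sample_pairs = list(iter_pairs(io.BytesIO(_sample), 'EN-GB', 'ES-ES'))
print(sample_pairs)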