Building an English-Vietnamese parallel corpus by hand from bilingual stories was painful and wasted a lot of time, so instead I collected as many existing corpora as I could find across the internet. My final dataset consists of about 2.5M sentence pairs. You can find all the corpora here: link
I use OpenNMT to train my NMT model. Thanks to SYSTRAN and HarvardNLP for open-sourcing this project; it helps me and many others understand how an industrial translation system might work. The parameters of my model are as follows:
- Preprocessing: using the `aggressive` tokenizer provided by OpenNMT
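To give a feel for what `aggressive` mode does, here is a rough stdlib-only approximation (the real tokenizer ships with OpenNMT, e.g. `tools/tokenize.lua -mode aggressive` or the `pyonmttok` package; this regex sketch just illustrates the behavior of splitting on every punctuation mark and separating letters from digits):

```python
import re

def aggressive_tokenize(text):
    # Approximation of OpenNMT "aggressive" tokenization:
    # - runs of letters stay together
    # - runs of digits stay together (and are split off from letters)
    # - every punctuation character becomes its own token
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)

print(aggressive_tokenize("it's 10km to Hà Nội."))
# → ['it', "'", 's', '10', 'km', 'to', 'Hà', 'Nội', '.']
```

Splitting this aggressively shrinks the vocabulary (e.g. "10km" no longer needs its own entry), which matters when the corpus is only a few million sentence pairs.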