Tuan Nguyen (tuan3w)
@tuan3w
tuan3w / docker_install.sh
Last active May 3, 2017 02:27
docker_install.sh
#!/bin/bash
sudo apt-get update
sudo apt-get install -y \
    linux-image-extra-$(uname -r) \
    linux-image-extra-virtual
sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl
import tensorflow as tf
from tensorflow.contrib.framework import arg_scope
from tensorflow.contrib.layers.python.layers.utils import smart_cond
from tensorflow.python.ops.gen_array_ops import _concat_v2 as concat_v2
from layers import *
class Model(object):
def __init__(self, config,
inputs, labels, enc_seq_length, dec_seq_length, mask,
@tuan3w
tuan3w / SignRandomProjectionLSH.scala
Last active February 9, 2017 07:56
SignRandomProjectionLSH
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
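The Scala body of this gist is elided above. As a rough illustration of the sign-random-projection idea behind it (a Python sketch of mine, not the gist's code): each vector is hashed to the sign pattern of its dot products with a fixed set of random hyperplanes, so nearby vectors tend to share signature bits.

```python
import random

def sign_rp_hash(vec, planes):
    # one bit per random hyperplane: 1 if the vector lies on its positive side
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
a = [1.0, 2.0, 3.0, 4.0]
# the signature is invariant to positive scaling of the input vector
print(sign_rp_hash(a, planes) == sign_rp_hash([2 * v for v in a], planes))
```

Because only the sign of each projection matters, the Hamming distance between two signatures approximates the angle between the original vectors.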
@tuan3w
tuan3w / Intro.md
Last active May 7, 2021 14:02
Pre-trained model for English -> Vietnamese NMT

Datasets

I had a hard time trying to build an English-Vietnamese parallel corpus from bilingual stories; it just wasted a lot of time. So instead I tried to collect as many existing corpora as possible from around the internet. My final dataset consists of about 2.5M sentence pairs. You can find all corpora here: link

Model

I use OpenNMT to train my NMT model. Thanks to SYSTRAN and HarvardNLP for open-sourcing this project; it helps me and many others understand how an industrial translation system might work. The parameters of my model are as follows:

  • Preprocessing: using the aggressive tokenizer provided by OpenNMT
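To give a feel for what "aggressive" tokenization roughly does, here is my own crude stand-in (not OpenNMT's actual tokenizer): it splits text into letter runs, digit runs, and individual punctuation marks.

```python
import re

def aggressive_tokenize(text):
    # split into runs of letters, runs of digits, or single punctuation marks,
    # loosely mimicking an "aggressive" tokenization scheme (illustrative only)
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text, re.UNICODE)

print(aggressive_tokenize("it's 2.5M pairs!"))
```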
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import math
IMAGE_PIXELS = 28
# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
"Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
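The two string flags above are typically parsed into a job-name-to-hosts mapping that `tf.train.ClusterSpec` consumes. A TensorFlow-free sketch of that parsing step (function name is my own):

```python
def parse_cluster(ps_hosts, worker_hosts):
    # build the dict shape that tf.train.ClusterSpec accepts:
    # job name -> list of "host:port" strings
    return {
        "ps": ps_hosts.split(",") if ps_hosts else [],
        "worker": worker_hosts.split(",") if worker_hosts else [],
    }

spec = parse_cluster("ps0:2222", "w0:2222,w1:2222")
print(spec)
```

Each task then identifies itself by job name and task index within this spec.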
name: tensorflow
dependencies:
- backports=1.0=py27_0
- decorator=4.0.10=py27_0
- get_terminal_size=1.0.0=py27_0
- ipython=5.0.0=py27_0
- ipython_genutils=0.1.0=py27_0
- libgfortran=3.0.0=1
- mkl=11.3.3=0
- numpy=1.11.1=py27_0
@tuan3w
tuan3w / fast_io.py
Created June 21, 2016 15:14
fast way to read big file line by line
from functools import partial
import codecs

def fast_read(name, nbytes):
    # read the file in fixed-size chunks and yield complete lines;
    # 'prev' carries a partial trailing line across chunk boundaries
    with codecs.open(name, 'r', 'utf-8') as f:
        prev = ''
        f_read = partial(f.read, nbytes)
        for text in iter(f_read, ''):
            lines = (prev + text).split('\n')
            prev = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                yield line
        if prev:
            yield prev
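For comparison, here is a standalone end-to-end version of the same chunked-read idea (my own sketch, with the hypothetical name `iter_lines_chunked`), exercised against a small temporary file:

```python
import codecs
import os
import tempfile

def iter_lines_chunked(path, chunk_size=1 << 16):
    # read the file in large chunks and yield complete lines,
    # carrying any partial trailing line over to the next chunk
    with codecs.open(path, 'r', 'utf-8') as f:
        tail = ''
        for chunk in iter(lambda: f.read(chunk_size), ''):
            parts = (tail + chunk).split('\n')
            tail = parts.pop()  # last piece may be incomplete
            for line in parts:
                yield line
        if tail:
            yield tail

fd, path = tempfile.mkstemp()
os.close(fd)
with codecs.open(path, 'w', 'utf-8') as f:
    f.write('alpha\nbeta\ngamma')
lines = list(iter_lines_chunked(path, chunk_size=4))
os.remove(path)
print(lines)  # ['alpha', 'beta', 'gamma']
```

Reading in large chunks amortizes the per-call overhead that a plain line-by-line loop pays on every line, which is where the speedup on big files comes from.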
@tuan3w
tuan3w / ALS2.scala
Last active June 16, 2020 20:23
Implementation of Biased Matrix Factorization on Spark
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
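The Spark implementation follows below; to show what "biased" matrix factorization means, here is a toy SGD version in Python (my own sketch, not the ALS2.scala code): the prediction adds a global mean plus per-user and per-item bias terms to the usual dot product of latent factors.

```python
import random

def train_biased_mf(ratings, n_users, n_items, k=2, steps=200, lr=0.05, reg=0.02):
    # prediction: global mean + user bias + item bias + dot(user_vec, item_vec)
    random.seed(0)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = [0.0] * n_users
    bi = [0.0] * n_items
    P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)
                Q[i][f] += lr * (e * pu - reg * qi)
    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(P[u][f] * Q[i][f] for f in range(k))
    return predict

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0)]
predict = train_biased_mf(ratings, n_users=2, n_items=2)
print(predict(0, 0))
```

The bias terms absorb "this user rates high" and "this item is popular" effects, so the latent factors only have to model the residual user-item interaction; an ALS variant solves for the same parameters in closed form per block instead of by SGD.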
package com.vcc.bigdata.monitoring.graphite
import java.net.Socket
import java.io.PrintWriter
import java.util.Collection
import scala.collection.JavaConversions._
import java.io.DataOutputStream
import java.io.OutputStreamWriter
import java.io.BufferedWriter
import java.nio.charset.Charset
#!/usr/bin/env bash
. ./common.sh
NR_HUGEPAGES=128
NR_CPUS=$(n_cpus)
NIC=${SERVER_NIC:-eth0}
# First IRQ of given NIC
function first_irq() {
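The `first_irq` body is cut off above; presumably it scans `/proc/interrupts` for the first IRQ line mentioning the NIC. A standalone sketch of that idea (my own guess at the logic, fed a sample table so it runs without real hardware):

```shell
# hypothetical sketch: return the first IRQ number whose /proc/interrupts
# line mentions the given NIC; an alternate source file can be passed for
# illustration/testing
first_irq() {
    local nic="$1" src="${2:-/proc/interrupts}"
    awk -v nic="$nic" '$NF ~ nic { sub(":", "", $1); print $1; exit }' "$src"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
           CPU0       CPU1
 24:     123456          0   PCI-MSI  eth0-rx-0
 25:        789          0   PCI-MSI  eth0-tx-0
EOF
irq=$(first_irq eth0 "$sample")
echo "$irq"  # prints 24
rm -f "$sample"
```

Knowing the first IRQ lets a tuning script pin the NIC's receive queues to specific CPUs via `/proc/irq/<n>/smp_affinity`.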