Hai Liang W. hailiang-wang

@drorata
drorata / gist:146ce50807d16fd4a6aa
Last active June 3, 2024 06:00
Minimal Working example of Elasticsearch scrolling using Python client
# Initialize the scroll
# Note: search_type='scan' was removed in Elasticsearch 5.0; this call targets older clusters and clients.
page = es.search(
    index='yourIndex',
    doc_type='yourType',
    scroll='2m',
    search_type='scan',
    size=1000,
    body={
        # Your query's body
    })
@CptMauli
CptMauli / empty-eclipse.target
Created May 2, 2014 07:24
a target platform for use with Eclipse SCADA
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?pde version="3.8"?><target name="simple" sequenceNumber="12">
<locations>
<location path="${env_var:ECLIPSE_432_HOME}" type="Profile"/>
<location path="${project_loc:builder_external}/builder/lib" type="Directory"/>
<location includeAllPlatforms="false" includeConfigurePhase="true" includeMode="planner" includeSource="true" type="InstallableUnit">
<unit id="org.apache.commons.beanutils" version="1.8.0.v201205091237"/>
<unit id="org.apache.commons.collections" version="3.2.0.v2013030210310"/>
<unit id="com.google.guava" version="12.0.0.v201212092141"/>
<unit id="com.google.gson" version="2.1.0.v201303041604"/>
@syllog1sm
syllog1sm / gist:10343947
Last active September 19, 2024 23:54
A simple Python dependency parser
"""A simple implementation of a greedy transition-based parser. Released under BSD license."""
from os import path
import os
import sys
from collections import defaultdict
import random
import time
import pickle
SHIFT = 0; RIGHT = 1; LEFT = 2;
@zviri
zviri / clusterdump.sh
Created December 3, 2013 10:11
Mahout cheat-sheet
# -dt: format {Integer => String}; -d: dictionary {id => word}; -i: input;
# -o: output (local filesystem); -b: format length; -n: number of top terms to print;
# --distanceMeasure: default is Euclidean distance.
# Comments are kept off the continued lines: text after a trailing backslash breaks the line continuation.
mahout clusterdump \
  -dt sequencefile \
  -d reuters-vectors/dictionary.file-* \
  -i reuters-kmeans-clusters/clusters-3-final \
  -o clusters.txt \
  -b 10 \
  -n 10 \
  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure
@luw2007
luw2007 / 词性标记.md
Last active December 30, 2024 12:48
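Part-of-speech tags: covers the ICTPOS3.0 POS tag set, the ICTCLAS Chinese POS tag set, the POS tags that appear in the jieba dictionary, and some POS tags that can be ignored in simhash

Word classes

  • Content words: nouns, verbs, adjectives, stative words, distinguishing words, numerals, measure words, pronouns
  • Function words: adverbs, prepositions, conjunctions, particles, onomatopoeia, interjections

ICTPOS3.0 POS tag set

n noun

nr personal name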

@ttezel
ttezel / gist:4138642
Last active July 27, 2024 14:46
Natural Language Processing Notes

# A Collection of NLP notes

## N-grams

### Calculating unigram probabilities:

P( wi ) = count( wi ) / count( total number of words )

In English...
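
The unigram estimate above can be sketched in a few lines of Python. This is a minimal illustration, not code from the notes: the function name `unigram_probabilities` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Return {word: count(word) / total number of words}.

    Hypothetical helper for illustration; `tokens` is any list of words.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty token list yields an empty dict, so no division by zero occurs.
    return {word: count / total for word, count in counts.items()}


# Assumed toy corpus, split on whitespace.
tokens = "the cat sat on the mat".split()
probs = unigram_probabilities(tokens)
print(probs["the"])  # "the" is 2 of the 6 tokens
```

Because every count is divided by the same total, the probabilities over the whole vocabulary sum to 1.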

@glennblock
glennblock / fork forced sync
Created March 4, 2012 19:27
Force your forked repo to be the same as upstream.
git fetch upstream
git reset --hard upstream/master
@erning
erning / hz2py.md
Created November 4, 2011 05:51
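Chinese characters to Pinyin

Chinese characters to Pinyin (hz2py)

We want Anjuke's search engine to tolerate homophone errors better, and pinyin-based error tolerance is a good way to do that. This requires a component that converts Chinese characters to pinyin. The same component has other uses as well, such as looking up residential community names or personal names by the first letters of their pinyin.

We therefore need a general-purpose service that converts Chinese characters to pinyin.

Features

The basic function is romanization of Chinese: given a passage of Chinese text, output the text converted to Hanyu Pinyin.

Full-width punctuation, spaces, and similar characters in the source text should be converted to their half-width equivalents. Where Chinese characters and English text are not separated by a space in the source, a space should be inserted between them after conversion.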
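
The normalization and spacing requirements above can be sketched in Python as follows. This is an illustration under stated assumptions, not the hz2py implementation: `TOY_PINYIN` is a hypothetical three-character table standing in for a full dictionary, Unicode NFKC normalization is used as one possible way to fold full-width forms to half-width, and separating consecutive syllables with a space is a choice made for this example.

```python
import unicodedata

# Hypothetical toy table; a real service would load a full character-to-pinyin dictionary.
TOY_PINYIN = {"安": "an", "居": "ju", "客": "ke"}


def to_half_width(text):
    """Fold full-width forms (e.g. U+FF0C comma, U+3000 space) to half-width.

    NFKC leaves characters without a compatibility mapping, such as the
    ideographic full stop U+3002, unchanged.
    """
    return unicodedata.normalize("NFKC", text)


def hz2py(text, table=TOY_PINYIN):
    """Convert characters found in `table` to pinyin; keep everything else as-is."""
    out = []
    prev_was_pinyin = False
    for ch in to_half_width(text):
        if ch in table:
            # Separate a syllable from whatever precedes it unless a space is already there.
            if out and not out[-1].endswith(" "):
                out.append(" ")
            out.append(table[ch])
            prev_was_pinyin = True
        else:
            # Insert a space when a letter or digit directly follows converted text.
            if prev_was_pinyin and ch.isalnum():
                out.append(" ")
            out.append(ch)
            prev_was_pinyin = False
    return "".join(out)


print(hz2py("安居客App,hello"))  # an ju ke App,hello
```

Characters missing from the table pass through unchanged, which makes gaps in the dictionary visible in the output rather than silently dropped.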