我们希望安居客的搜索引擎能够更好的做到同音字的容错,采用拼音容错是一个不错的方法。因此,需要一个将汉字转换为拼音的组件。同时,汉字转拼音组件还可以有多个用途,例如以拼音的首字母来检索小区名、人名等。
这样我们需要一个通用的将汉字转换为拼音的服务。
基本功能就是中文拉丁化,输入一段中文文本,输出转变为汉语拼音的文本。
要求原文中的全角标点符号、空格等应该转为对应的半脚符号。原汉字与英文间如果没有空格分隔,转换为拼音后应该加入空格分隔。
git fetch upstream | |
git reset --hard upstream/master |
#A Collection of NLP notes
##N-grams
###Calculating unigram probabilities:
P( wi ) = count ( wi ) ) / count ( total number of words )
In english..
mahout clusterdump \ | |
-dt sequencefile \ # format: {Integer => String} | |
-d reuters-vectors/dictionary.file-* \ # dictionary: {id => word} | |
-i reuters-kmeans-clusters/clusters-3-final \ # input | |
-o clusters.txt \ # output (local filesystem) | |
-b 10 \ # format length | |
-n 10 # number of top terms to print | |
--distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure # default is euclidean distance |
"""A simple implementation of a greedy transition-based parser. Released under BSD license.""" | |
from os import path | |
import os | |
import sys | |
from collections import defaultdict | |
import random | |
import time | |
import pickle | |
SHIFT = 0; RIGHT = 1; LEFT = 2; |
<?xml version="1.0" encoding="UTF-8" standalone="no"?> | |
<?pde version="3.8"?><target name="simple" sequenceNumber="12"> | |
<locations> | |
<location path="${env_var:ECLIPSE_432_HOME}" type="Profile"/> | |
<location path="${project_loc:builder_external}/builder/lib" type="Directory"/> | |
<location includeAllPlatforms="false" includeConfigurePhase="true" includeMode="planner" includeSource="true" type="InstallableUnit"> | |
<unit id="org.apache.commons.beanutils" version="1.8.0.v201205091237"/> | |
<unit id="org.apache.commons.collections" version="3.2.0.v2013030210310"/> | |
<unit id="com.google.guava" version="12.0.0.v201212092141"/> | |
<unit id="com.google.gson" version="2.1.0.v201303041604"/> |
# Initialize the scroll | |
page = es.search( | |
index = 'yourIndex', | |
doc_type = 'yourType', | |
scroll = '2m', | |
search_type = 'scan', | |
size = 1000, | |
body = { | |
# Your query's body | |
}) |
<img src="http://musa-hw-cafe.qiniudn.com/Screen%20Shot%202014-10-16%20at%2011.12.12.png" width="320px" height="100px"/> | |
宝莲灯是位置感知服务的客户端。 | |
* 位置感知服务是什么? | |
基于近场通信技术,渲染实时室内地图,在消费场所,为消费者与消费者、消费者与商户提供基于位置服务的社交网络。 | |
* 典型应用场景 |
/* ~/Library/KeyBindings/DefaultKeyBinding.Dict | |
This file remaps the key bindings of a single user on Mac OS X 10.5 to more | |
closely match default behavior on Windows systems. This makes the Command key | |
behave like Windows Control key. To use Control instead of Command, either swap | |
Control and Command in Apple->System Preferences->Keyboard->Modifier Keys... | |
or replace @ with ^ in this file. | |
Here is a rough cheatsheet for syntax. | |
Key Modifiers |