
Maybe treper

  • Shanghai
@treper
treper / CooccurrenceMatrix.scala
Last active December 20, 2015 09:59
Calculate a word co-occurrence matrix. Scala's built-in JSON parser is not thread-safe, so use json4s instead.
import org.json4s._
import org.json4s.native.JsonMethods._
import scala.collection.mutable.ArrayBuffer

// scala.util.parsing.json.JSON is not thread-safe; parse with json4s instead
def parseLine(line: String): ArrayBuffer[String] = {
  val jsonstr = line.split("\t")(1)
  val parsed = parseOpt(jsonstr)   // json4s: returns Option[JValue]
  val result = ArrayBuffer[String]()
  if (parsed.isDefined) {
    val itemIdfArray = ArrayBuffer[(Int, Int)]()
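The gist is truncated before the counting step. As a minimal sketch of the same idea in plain Python (names and sample data are hypothetical, not from the gist), a co-occurrence matrix can be built by counting unordered word pairs within each document:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count unordered word pairs that co-occur in the same document."""
    counts = Counter()
    for doc in docs:
        # sorted() + set() makes each pair unique and order-independent
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts

docs = [["cat", "dog"], ["cat", "dog", "fish"], ["dog", "fish"]]
counts = cooccurrence_counts(docs)
```

A sparse `Counter` keyed by pairs avoids materializing the full vocabulary-by-vocabulary matrix.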
@treper
treper / LabelPropagation.scala
Created July 31, 2013 08:20
Label propagation using GraphX
val g = GraphLoader.textFile(sc, fname, a => 1.0F).withPartitioner(numVPart, numEPart).cache()
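The gist stops after loading the graph. As a hedged, GraphX-free sketch of the label propagation idea itself (all names hypothetical): each node repeatedly adopts the majority label among its neighbors, updated asynchronously so the toy example converges deterministically:

```python
from collections import Counter, defaultdict

def label_propagation(edges, labels, iters=5):
    """Asynchronously relabel each node with its neighbors' majority label."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    labels = dict(labels)
    for _ in range(iters):
        for node in sorted(adj):  # fixed order keeps the sketch deterministic
            votes = Counter(labels[n] for n in adj[node])
            labels[node] = votes.most_common(1)[0][0]
    return labels

# star graph: the hub starts with a minority label and flips to 'a'
result = label_propagation([(1, 2), (1, 3), (1, 4)],
                           {1: "b", 2: "a", 3: "a", 4: "a"})
```

Synchronous updates can oscillate on bipartite structures, which is why this sketch updates in place.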
@treper
treper / new.scala
Last active December 20, 2015 12:19
import scala.collection.mutable.ArrayBuffer

val name  = """\[\d+,\d+,\d+\]""".r     // matches [item,idf,count] triples
val name2 = """\[(\d+),(\d+),\d+\]""".r // captures the first two fields

def parseLine(line: String): ArrayBuffer[String] = {
  val jsonstr = line.split("\t")(1)
  val result = ArrayBuffer[String]()
  val m = name.findAllIn(jsonstr)
  val itemIdfArray = ArrayBuffer[Pair[Int, Int]]()
  m.foreach { a =>
    val name2(item, idfnum) = a
    itemIdfArray += Pair(item.toInt, idfnum.toInt)
  }
  if (itemIdfArray.length > 1) {
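The same extraction is easy to check in Python. This sketch (sample input invented for illustration) uses an equivalent regex to pull `(item, idf)` pairs out of `[item,idf,count]` triples:

```python
import re

# matches [item,idf,count] triples and captures the first two fields
triple = re.compile(r"\[(\d+),(\d+),\d+\]")

def parse_pairs(s):
    return [(int(item), int(idf)) for item, idf in triple.findall(s)]

pairs = parse_pairs("key\t[12,3,5][7,1,2]")
```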
@treper
treper / autoencoder.cpp
Created August 4, 2013 04:35
draft code
#include <iostream>
#include <Eigen/Dense>  // Eigen/Array was merged into Dense in Eigen 3

using namespace Eigen;
using namespace std;

// Element-wise logistic sigmoid: 1 / (1 + exp(-x))
void sigmoid(MatrixXf& input, MatrixXf& output)
{
    output = (1 + (-input.array()).exp()).inverse().matrix();
}
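As a quick numeric sanity check of the sigmoid definition, independent of the Eigen draft, here is the scalar form in plain Python:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# useful identity for testing: sigmoid(x) + sigmoid(-x) == 1
mid = sigmoid(0.0)
```

Note that inverting `1 + exp(x)` instead of `1 + exp(-x)` yields `sigmoid(-x)`, a mirrored curve; the identity above catches that sign error.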
val tags = file
  .filter(line => label_name_map.contains(line.split("\t")(0).toLong))
  .map(line => line.split("\t")(1).toLong -> label_name_map(line.split("\t")(0).toLong))
  .sortByKey(false)
tags.saveAsTextFile("hdfs://finger-test2:54310/home/TagHierarchy/tag_count_sorted")
/**
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
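The pseudocode is cut off before the cluster-expansion step. As a hedged, minimal Python sketch of the full loop (brute-force `region_query`, Euclidean distance, `-1` marks noise; data and parameters are illustrative only):

```python
import math

def dbscan(D, eps, min_pts):
    """Minimal DBSCAN: returns a cluster id per point, -1 for noise."""
    labels = [None] * len(D)
    c = -1

    def region_query(p):
        # brute-force eps-neighborhood (includes p itself)
        return [q for q in range(len(D)) if math.dist(D[p], D[q]) <= eps]

    for p in range(len(D)):
        if labels[p] is not None:
            continue
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = -1  # noise; a later cluster may still claim it
            continue
        c += 1
        labels[p] = c
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = c  # border point previously marked noise
            if labels[q] is None:
                labels[q] = c
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:  # q is a core point too
                    seeds.extend(q_neighbors)
    return labels

clusters = dbscan([(0, 0), (0, 1), (1, 0), (10, 10)], eps=1.5, min_pts=3)
```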
@treper
treper / tag_cluster_kmeans.scala
Last active December 22, 2015 19:39
Cluster Tudou tags using k-means. Tag vectors are generated with word2vec and filtered against the Tudou tag database.
import spark.util.Vector

val word_vec_size = 150

def parseVector(line: String): Vector = {
  new Vector(line.split(' ').slice(1, word_vec_size + 1).map(_.toDouble))
}

def closestPoint(p: Vector, centers: Array[Vector]): Int = {
  var index = 0
  var bestIndex = 0

import scala.math.sqrt

// cosine of the angle between two equal-length vectors
def cosineDist(a: Vector, b: Vector): Double = {
  if (a.length == b.length) {
    (a dot b) / sqrt(a.squaredDist(Vector.zeros(a.length)) * b.squaredDist(Vector.zeros(b.length)))
  }
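Note that `cosineDist` as written computes cosine *similarity* (1 for parallel vectors, 0 for orthogonal), not a distance. A plain-Python version of the same formula, with invented test vectors:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    assert len(a) == len(b)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

To use it as a distance for clustering, take `1 - cosine_similarity(a, b)`.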
# Convert a Mahout data set format to scikit-learn
# Mahout descriptor: -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
# scikit-learn
import sys
import argparse
import numpy
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
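The script is cut off before the conversion itself. As a hedged sketch of the first step (assuming the Mahout convention that an integer repeats the type token that follows, with N = numerical, C = categorical, L = label), the descriptor string can be expanded into one type per column:

```python
def expand_descriptor(descriptor):
    """Expand a Mahout-style descriptor: an integer repeats the next type token."""
    cols = []
    count = 1
    for tok in descriptor.split():
        if tok.isdigit():
            count = int(tok)
        else:
            cols.extend([tok] * count)
            count = 1  # reset after each type token
    return cols

cols = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
```

The resulting list tells the converter which columns need label-encoding before they can be fed to scikit-learn's tree ensembles.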