@dapangmao
dapangmao / s.bash
Last active September 1, 2020 16:28
How to set up a spark cluster on digitalocean
sudo openvpn --config *.ovpn
apt-get update
apt-get install vim
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz
tar zxf spark-1.3.0-bin-hadoop2.4.tgz
hadoop fs -mkdir /spark
hadoop fs -put spark-1.3.0-bin-hadoop2.4.tgz /spark
hadoop fs -du -h /spark
cp spark-env.sh.template spark-env.sh
@dapangmao
dapangmao / blog.md
Last active August 29, 2015 14:17
Deploy a minimal Spark cluster

### Why a minimal cluster

  1. Testing

  2. Prototyping

### Requirements

I need a cluster that lives for a short time and handles ad-hoc data analysis requests, or more specifically, running Spark. I want it to spin up quickly so I can load data into memory, and I don't want to keep the cluster running perpetually. Therefore, a public cloud may be the best fit for my needs.

  1. Intranet speed
@dapangmao
dapangmao / blog.md
Last active April 5, 2016 15:57
Spark example

### Transform RDD to DataFrame in Spark

from pyspark.sql import Row
import os

rdd = sc.textFile('C:/Users/chao.huang.ctr/spark-playground//class.txt')
def transform(x):
    args = x.split()
    funcs = [str, str, int, float, float]
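
The preview cuts off inside `transform`. A minimal sketch of how it might finish and produce a DataFrame, assuming the five converters correspond to the classic class-dataset columns (the field names below are my guess, as are the shell-provided `sc` and `sqlCtx`):

```python
from pyspark.sql import Row

def transform(x):
    args = x.split()
    funcs = [str, str, int, float, float]
    # assumed field names -- the gist preview does not show them
    names = ['name', 'sex', 'age', 'height', 'weight']
    return Row(**{n: f(a) for n, f, a in zip(names, funcs, args)})

# Spark 1.3+ API: build the DataFrame from the RDD of Rows
df = sqlCtx.createDataFrame(rdd.map(transform))
df.printSchema()
```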
@dapangmao
dapangmao / gist:84618a65ac5f921db76a
Created March 18, 2015 15:00
Two ways to transform RDD to DataFrame in Spark
1. Infer the schema directly from the RDD
sqlCtx.inferSchema(rdd1)
2. Build the schema explicitly with Row objects
from pyspark.sql import Row
import os
current_path = os.getcwd()
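
A minimal side-by-side sketch of the two approaches, assuming a Spark 1.x shell with `sc` and `sqlCtx` defined (the sample data is made up):

```python
from pyspark.sql import Row

rows = sc.parallelize([('a', 1), ('b', 2)]).map(
    lambda t: Row(name=t[0], value=t[1]))

# Way 1: the pre-1.3 API -- infer the schema from an RDD of Rows
df1 = sqlCtx.inferSchema(rows)

# Way 2: the Spark 1.3 API -- create the DataFrame directly
df2 = sqlCtx.createDataFrame(rows)
```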
@dapangmao
dapangmao / solution.md
Last active August 29, 2015 14:17
maintain a median

### On a single machine

import heapq
class find_median(object):
    def __init__(self):
        self.first_half = []   # max heap, stored as negated values
        self.second_half = []  # min heap; holds the extra element when the count is odd
        self.N = 0
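
The preview stops at the constructor. A sketch of how the two-heap scheme usually continues; the method names `insert` and `median` are my additions, not the gist's:

```python
import heapq

class find_median(object):
    def __init__(self):
        self.first_half = []   # max heap, stored as negated values
        self.second_half = []  # min heap; holds the extra element when N is odd
        self.N = 0

    def insert(self, x):
        # route every value through the max heap, pushing its largest right
        heapq.heappush(self.first_half, -x)
        heapq.heappush(self.second_half, -heapq.heappop(self.first_half))
        # rebalance: the min heap may lead by at most one element
        if len(self.second_half) > len(self.first_half) + 1:
            heapq.heappush(self.first_half, -heapq.heappop(self.second_half))
        self.N += 1

    def median(self):
        if self.N % 2:
            return self.second_half[0]
        return (self.second_half[0] - self.first_half[0]) / 2.0
```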
@dapangmao
dapangmao / blog.md
Last active January 14, 2016 20:07
Notes on circumventing the Great Firewall while back in China

#### When gods fight, mortals suffer

After being back in China for a while, I found the blocking severe. The staples Google, Facebook, and Wikipedia are nowhere to be reached, which is dispiriting, and even a technical site like GitHub is too slow to log into. This time the GFW seems to have taken its usual DNS-poisoning playbook a step further: it hijacked the Baidu Analytics JavaScript and, with it, overseas traffic to domestic websites. Any browser visiting a site that embeds Baidu Analytics ends up requesting two GitHub Pages (1 and 2) every two seconds. GitHub proved remarkably resilient: under a distributed denial-of-service attack of this scale it merely slowed down and never went down.

Then Google and Mozilla distrusted CNNIC's SSL certificates. That is also a nuisance: someone overseas buying a train ticket on 12306 with Chrome or Firefox will be greeted by a certificate warning.

#### The actual circumvention setup

Only then did I realize that a way over the wall is a necessity. Conveniently, I had DigitalOcean's cheapest VPS at hand.

//http://code2flow.com
A Q is raised;
Name of the function;
Input type and output type;
Test case;
Constraints / time / space requirements;
if (Q in [Leetcode, CareerCup] or similar)
{
Recall the answer;
class Solution:
    def combinationSum(self, candidates, target):
        self.res = []
        self.dfs(sorted(candidates), [], target)
        return self.res
    def dfs(self, candidates, current, target):
        if target == 0:
            self.res.append(current)
            return
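
The preview ends inside `dfs`. A hedged completion following the standard Combination Sum pattern (sorted candidates, each number reusable); the recursive branch below is my reconstruction, not the gist's:

```python
class Solution:
    def combinationSum(self, candidates, target):
        self.res = []
        self.dfs(sorted(candidates), [], target)
        return self.res

    def dfs(self, candidates, current, target):
        if target == 0:
            self.res.append(current)
            return
        for i, c in enumerate(candidates):
            if c > target:
                break  # candidates are sorted, so later ones cannot fit
            # pass candidates[i:] so c itself may be reused
            self.dfs(candidates[i:], current + [c], target - c)
```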
@dapangmao
dapangmao / graph.md
Last active August 29, 2015 14:20
Graph
  - Undirected graph
"""
Given a 2d grid map of '1's (land) and '0's (water), count the number of islands. An island is surrounded by water and is formed by connecting adjacent lands horizontally or vertically. You may assume all four edges of the grid are all surrounded by water.

Example 1:

11110
11010
11000
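
The sample grid is truncated by the preview. For reference, a compact flood-fill counter (my sketch, not necessarily the gist's own solution):

```python
def num_islands(grid):
    """Count islands by flood-filling every unvisited '1' cell."""
    if not grid:
        return 0
    rows, cols = len(grid), len(grid[0])
    seen = set()

    def sink(r, c):
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] == '1' and (r, c) not in seen:
            seen.add((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                sink(r + dr, c + dc)

    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == '1' and (r, c) not in seen:
                count += 1
                sink(r, c)
    return count

print(num_islands(["11110", "11010", "11000", "00000"]))  # -> 1
```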
@dapangmao
dapangmao / pattern.md
Last active August 29, 2015 14:21
Design pattern

From here

  1. Decorator
from functools import wraps

def makebold(fn):
    @wraps(fn)
    def wrapped():
        return "<b>" + fn() + "</b>"
    return wrapped