Yun YAN Puriney

OS X Screencast to animated GIF

This gist shows how to create a GIF screencast using only free OS X tools: QuickTime, ffmpeg, and gifsicle.

Instructions

To capture the video (filesize: 19MB), using the free "QuickTime Player" application:

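Using sporadic breast cancer as a case study, the article proposes integrating big data across multiple levels to analyze biological data and gain a more accurate understanding of nature.

The article introduces fairly "trendy" concepts such as meta-dimensional analysis and multi-staged analysis (or systems genomics approaches). The problems come from biology, but this is still data analysis: once stripped of the biological background, each one is a classic, textbook machine learning case.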

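Handling a single dataset

A journey of a thousand miles is made of single steps: before integrating big data, you must first examine each individual dataset carefully. For a single dataset, the article mentions at least the following considerations: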

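Data quality control. As the saying goes, dragons beget dragons, phoenixes beget phoenixes, and a rat's son knows how to dig holes: garbage data will certainly produce garbage results (garbage in, garbage out).
Data reduction. Once big data arrives, someone wants all the pairwise relationships among 5 million SNPs; after all, aren't you computer people supposed to be good at this? Some well-funded biology big shot slams the table: buy a server and brute-force it, it is only five million choose two. Enviable yet regrettable, the rich throw hardware at the problem; pitiable yet lucky, the underdogs grind out algorithms. Back to the root of the analysis problem: you have used far too many independent variables and raised the dimensionality far too high. Dilated pupils do not let you see more; no wonder the grown-ups say never to do a biology PhD. The article lists some classic algorithms for data reduction, such as ReliefF, chi-square statistics, PCA, factor analysis, genetic algorithm, and linkage disequilibrium. Incidentally, track down each of the cited references and you have a series of introductory readings in computational biology.

It is easy to see that the article lumps dimension reduction and feature selection together, and I think the two concepts deserve a mention. Unlike the internet industry, biological research cares more about the interpretability of models and predictions. For example, a biologist can hardly give a concrete explanation of the principal components (usually two) reported by a dimension reduction method like PCA, because a principal component that blends many independent variables has no direct biological interpretation. In contrast, if you can discard some unimportant independent variables through a series of computational methods that rival black magic yet do have a logical explanation, namely feature selection, then congratulations: you have taken another step toward being an algorithm-grinding underdog. On feature selection, here are a few related resources I have read: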
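To make the scale and the feature selection idea above a bit more concrete, here is a minimal sketch in plain Python. It counts the SNP pairs mentioned in the text, then ranks binary features by a chi-square statistic against a binary label and keeps the top ones under their original names, which is what keeps the result interpretable. The feature names (`snp_a`, `snp_b`, `snp_c`), the toy data, and both helper functions are assumptions made up for illustration; none of them come from the article being discussed.

```python
import math

# "Five million choose two": the number of unordered SNP pairs.
N_SNPS = 5_000_000
N_PAIRS = math.comb(N_SNPS, 2)  # 12,499,997,500,000 pairs


def chi_square_score(feature, label):
    """Chi-square statistic of the 2x2 table for a binary feature and a binary label."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row = [sum(observed[0]), sum(observed[1])]
    col = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            expected = row[f] * col[y] / n
            if expected > 0:  # skip empty margins to avoid dividing by zero
                score += (observed[f][y] - expected) ** 2 / expected
    return score


def select_top_features(features, label, k):
    """Return the k (name, score) pairs with the highest chi-square score."""
    scored = [(name, chi_square_score(values, label)) for name, values in features.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data, invented for this example: 8 samples, 3 binary features.
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly associated with the label
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
    "snp_c": [0, 0, 0, 1, 1, 1, 1, 1],  # mostly associated with the label
}
top = select_top_features(features, label, k=2)
```

Unlike a principal component, each entry in `top` still refers to a single named variable, so the selection can be read directly. Real SNP data would need more care, for example genotypes with three levels and multiple-testing correction.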

See https://yihui.shinyapps.io/voice for the live demo. Make sure you have turned on your microphone and allowed the web browser to access it. Credits go to annyang and also to @wch for showing me a minimal Shiny example. You can do four things on the scatterplot using your voice input:

say "title something" to change the plot title, e.g. title good morning
say "color a color name" to change the color, e.g. color blue
say "bigger" or "smaller" to change the size of points
say "regression" to add a linear regression line
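The four commands above amount to a small phrase-to-action mapping. The following is a rough sketch of that idea in Python, not the app's actual code (the demo itself runs on Shiny and annyang). The function names, the state dictionary, and the step size of 1 for "bigger" and "smaller" are all assumptions for illustration.

```python
def parse_command(utterance):
    """Map a recognized phrase to an (action, argument) pair, or None if unrecognized."""
    words = utterance.strip().split()
    if not words:
        return None
    head = words[0].lower()
    rest = " ".join(words[1:])
    if head == "title" and rest:
        return ("title", rest)
    if head == "color" and rest:
        return ("color", rest.lower())
    if head in ("bigger", "smaller") and not rest:
        return ("size", head)
    if head == "regression" and not rest:
        return ("regression", None)
    return None


def apply_command(state, command):
    """Return a new plot state with the parsed command applied."""
    new_state = dict(state)
    if command is None:
        return new_state
    action, arg = command
    if action == "title":
        new_state["title"] = arg
    elif action == "color":
        new_state["color"] = arg
    elif action == "size":
        step = 1 if arg == "bigger" else -1
        new_state["point_size"] = max(1, state["point_size"] + step)
    elif action == "regression":
        new_state["show_regression"] = True
    return new_state


# Hypothetical initial plot state.
state = {"title": "", "color": "black", "point_size": 2, "show_regression": False}
for phrase in ["title good morning", "color blue", "bigger", "regression"]:
    state = apply_command(state, parse_command(phrase))
```

Keeping parsing separate from applying the change means unrecognized phrases fall through as `None` and leave the plot untouched.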

	# Created by http://www.gitignore.io
	.xcodeproj/
	### Xcode ###
	DerivedData/*
	build/
	*.pbxuser
	!default.pbxuser
	*.mode1v3
	!default.mode1v3
	*.mode2v3

	{
	"bold_folder_labels": true,
	"color_scheme": "Packages/User/SublimeLinter/Solarized (Light) (SL).tmTheme",
	"fade_fold_buttons": false,
	"font_size": 16,
	"highlight_line": true,
	"ignored_packages":
	[
	"Markdown",
	"Vintage"

	;; -*- mode: emacs-lisp -*-
	;; This file is loaded by Spacemacs at startup.
	;; It must be stored in your home directory.

	(defun dotspacemacs/layers ()
	"Configuration Layers declaration.
	You should not put any user code in this function besides modifying the variable
	values."
	(setq-default
	;; Base distribution to use. This is a layer contained in the directory

	# Create a non-root user and give it sudo privileges
	# https://www.digitalocean.com/community/tutorials/initial-server-setup-with-ubuntu-14-04

	sudo apt-get update
	sudo apt-get install git
	sudo apt-get install ruby
	# Install Linuxbrew (Homebrew for Linux)
	ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/linuxbrew/go/install)"
	# Add the following line to ~/.bashrc (e.g. with: vim ~/.bashrc)
	export PATH="$HOME/.linuxbrew/bin:$PATH"

	# Use regression as an example to test my Xavier initializer

	# Data
	data(BostonHousing, package="mlbench")
	train.ind = seq(1, 506, 3)
	train.x = data.matrix(BostonHousing[train.ind, -14])
	train.y = BostonHousing[train.ind, 14]
	test.x = data.matrix(BostonHousing[-train.ind, -14])
	test.y = BostonHousing[-train.ind, 14]
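The initializer under test is not shown in this excerpt. For orientation, here is a minimal Python sketch of the commonly cited Xavier (Glorot) uniform scheme, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). It is an assumption that the author's version follows this variant; the function names, the hidden layer width of 10, and the seed are made up for illustration. The fan-in of 13 matches the 13 predictor columns left after column 14 is dropped above.

```python
import math
import random


def xavier_limit(fan_in, fan_out):
    """Bound of the uniform distribution in Glorot/Xavier uniform initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(fan_in, fan_out, seed=0):
    """Return a fan_in x fan_out weight matrix (list of lists) drawn from U(-limit, limit)."""
    limit = xavier_limit(fan_in, fan_out)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


# 13 input features, hypothetical hidden layer of 10 units.
weights = xavier_uniform(fan_in=13, fan_out=10)
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant from layer to layer.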

	//
	// Yun Yan
	//
	// [[Rcpp::plugins(cpp11)]]
	#include <bitset>
	#include <unordered_set>
	#include <RcppArmadillo.h>
	// [[Rcpp::depends(RcppArmadillo)]]

	// #include <RcppEigen.h>

	#!/usr/bin/env python
	import numpy as np
	from graphviz import Graph

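	#==== Generate data ====
	# nodes = map(str, [2, 4, 5, 6, 7, 8])
	n = 20
	# Draw n random integers from 1 to 99 (the upper bound of randint is exclusive)
	nodes = np.random.randint(1, 100, n)
	# Keep the sorted unique values as string labels; list() keeps the result reusable in Python 3
	nodes = list(map(str, sorted(set(nodes))))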
	nInf = np.inf