Paulo Haddad (paulochf)
@techniq
techniq / audit_mixin.py
Created March 16, 2013 01:05
Useful SQLAlchemy Mixins
from datetime import datetime
from sqlalchemy import Column, Integer, DateTime, ForeignKey
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declared_attr
from flask_security import current_user
class AuditMixin(object):
    created_at = Column(DateTime, default=datetime.now)
    updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
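
A minimal usage sketch, not part of the gist preview: a declarative model picks up the audit timestamp columns by inheriting the mixin. Article and its columns are hypothetical.

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Article(Base, AuditMixin):
    # created_at / updated_at are inherited from AuditMixin
    __tablename__ = 'article'
    id = Column(Integer, primary_key=True)
    title = Column(String(255))
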
@gagnec
gagnec / gist:5542288
Created May 8, 2013 18:05
Pika async consumer example
#!/usr/bin/env python26
import logging
import pika
logging.basicConfig()
class Consumer(object):
"""
A RabbitMQ topic exchange consumer that will call the specified function
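
The preview is truncated above. For context, here is a minimal blocking consumer along the same lines: a sketch using pika's synchronous BlockingConnection API (pika 1.x signatures) rather than the gist's asynchronous Consumer class; the exchange, queue and routing key are placeholders.

import pika

def on_message(channel, method, properties, body):
    # Handle one delivered message.
    print(body)

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='logs', exchange_type='topic')
queue = channel.queue_declare(queue='', exclusive=True).method.queue
channel.queue_bind(exchange='logs', queue=queue, routing_key='#')
channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=True)
channel.start_consuming()
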
@powerlim2
powerlim2 / SKhelper.py
Last active January 11, 2021 18:35
This module eases analysis with scikit-learn in Python. It provides a few functionalities that the current scikit-learn library does not offer. Please feel free to download and use it.
# coding=UTF8
#########################################################################
# This class is to help sklearn to handle statistical process #
# Author: Joon Lim from Master of Science in Analytics at Northwestern #
# Date: 04.23.2013 #
#########################################################################
''' This module is built on top of numpy and sklearn. '''
@davidbalbert
davidbalbert / pyqwt.sh
Last active June 22, 2019 19:59
PyQwt in a virtualenv
brew install qt # should already be done
brew install qwt # should already be done
brew install portaudio # should already be done
brew install wget # makes some downloading easier
# set up your virtualenv (`workon friture` if you've already created it)
mkvirtualenv friture
cd /tmp
@schlamar
schlamar / example.py
Last active December 20, 2024 08:10
mplog: Python advanced multiprocessing logging.
import logging
import multiprocessing
import time
import mplog
FORMAT = '%(asctime)s - %(processName)s - %(levelname)s - %(message)s'
logging.basicConfig(level=logging.DEBUG, format=FORMAT)
existing_logger = logging.getLogger('x')
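
The preview stops before mplog is used. As a rough sketch of the same idea using only the standard library (this is not the gist's mplog API), logging.handlers.QueueHandler and QueueListener route records from worker processes through a single queue to the parent's handlers.

import logging
import logging.handlers
import multiprocessing

def worker(queue):
    # Workers send records to the shared queue instead of writing directly.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.DEBUG)
    logging.getLogger('x').info('hello from %s', multiprocessing.current_process().name)

if __name__ == '__main__':
    queue = multiprocessing.Queue(-1)
    # The listener runs in the parent and forwards records to normal handlers.
    listener = logging.handlers.QueueListener(queue, logging.StreamHandler())
    listener.start()
    procs = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()
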
@yong27
yong27 / apply_df_by_multiprocessing.py
Last active April 12, 2023 04:35
pandas DataFrame apply multiprocessing
import multiprocessing
import pandas as pd
import numpy as np
def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)
def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
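    # Hedged sketch of a plausible continuation: the gist preview is truncated
    # here, so this is not the original code verbatim.
    pool = multiprocessing.Pool(processes=workers)
    # Split the frame into one chunk per worker and apply func to each chunk.
    result = pool.map(_apply_df, [(chunk, func, kwargs)
                                  for chunk in np.array_split(df, workers)])
    pool.close()
    return pd.concat(list(result))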
@Integralist
Integralist / GitHub curl.sh
Last active February 6, 2025 20:47 — forked from madrobby/gist:9476733
Download a single file from a private GitHub repo. You'll need an access token as described in this GitHub Help article: https://help.github.com/articles/creating-an-access-token-for-command-line-use
curl --header 'Authorization: token INSERTACCESSTOKENHERE' \
--header 'Accept: application/vnd.github.v3.raw' \
--remote-name \
--location https://api.github.com/repos/owner/repo/contents/path
# Example...
TOKEN="INSERTACCESSTOKENHERE"
OWNER="BBC-News"
REPO="responsive-news"
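# Hedged sketch of a plausible continuation: the preview is truncated here and
# this is not the gist verbatim. FILE_PATH is a placeholder.
FILE_PATH="path/to/file"
curl --header "Authorization: token ${TOKEN}" \
     --header 'Accept: application/vnd.github.v3.raw' \
     --remote-name \
     --location "https://api.github.com/repos/${OWNER}/${REPO}/contents/${FILE_PATH}"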
@jiffyclub
jiffyclub / assert_frames_equal.ipynb
Last active October 27, 2020 17:02
Example of a function to compare two DataFrames independent of row/column ordering and with handling of null values.
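
The notebook preview does not render here. As a rough sketch of the idea described above (this is not jiffyclub's implementation), one way to compare two frames regardless of row and column order is to align the columns, sort the rows by value, and let pandas' own comparison handle null values.

import pandas as pd

def assert_frames_equal(actual, expected):
    # Align column order, then sort rows by value so row order does not matter.
    actual = actual.sort_index(axis=1)
    expected = expected.sort_index(axis=1)
    actual = actual.sort_values(by=list(actual.columns)).reset_index(drop=True)
    expected = expected.sort_values(by=list(expected.columns)).reset_index(drop=True)
    # assert_frame_equal already treats matching NaN positions as equal.
    pd.testing.assert_frame_equal(actual, expected)
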
@RAbraham
RAbraham / spark-scala-worksheet
Last active August 3, 2021 22:32
Execute Apache Spark in a Scala IDE worksheet
package org.apache.spark.graphx
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
object repl {
  val sc = new SparkContext("local", "test") //> sc : org.apache.spark.SparkContext = org.apache.spark.SparkContext@3724af13
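  // Sketch of a plausible continuation, not the gist verbatim; the preview is
  // truncated here. The worksheet evaluates each val and prints it inline.
  val rdd: RDD[Int] = sc.parallelize(1 to 10)
  val doubledSum = rdd.map(_ * 2).reduce(_ + _)  // 110 in the worksheet output
}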
@hadley
hadley / ds-training.md
Created March 13, 2015 18:49
My advice on what you need to do to become a data scientist...

If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?

I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:

  • Statistical knowledge
  • Programming/hacking skills
  • Domain expertise

Statistical knowledge