BeatriceMoissinac

  • Oregon State University
@BeatriceMoissinac
BeatriceMoissinac / hadoop.sh
Created June 21, 2018 01:38
[HDFS] #AWS #EMR #S3 #Hadoop
# List the files in the current user's HDFS home directory
hadoop fs -ls
BeatriceMoissinac / bin.py
Created May 25, 2018 21:31
[Bin data] How to put continuous data into bins #Python
import pandas as pd
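# df is an existing DataFrame; column '1' holds the numeric values to bin.
# Bins are right-closed by default: (0, 50], (50, 100], (100, 200].
df['bin'] = pd.cut(df['1'], [0, 50, 100, 200], labels=['0-50', '50-100', '100-200'])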
         0    1        file      bin
0  person1   24     age.csv     0-50
1  person2   17     age.csv     0-50
2  person3   98     age.csv   50-100
3  person4    6     age.csv     0-50
4  person2  166  Height.csv  100-200
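To make the bin boundaries explicit, here is a minimal plain-Python sketch of the same right-closed binning that pd.cut applies by default. It is a stand-in using the standard library's bisect module, not pandas itself, and the helper name bin_label is hypothetical.

```python
from bisect import bisect_left

# Same edges and labels as the pd.cut call above.
edges = [0, 50, 100, 200]
labels = ['0-50', '50-100', '100-200']


def bin_label(value):
    """Return the label of the right-closed interval (lo, hi] containing value.

    Values outside (0, 200] get None, mirroring the NaN pd.cut would produce.
    bin_label is a hypothetical helper, not part of pandas.
    """
    if not edges[0] < value <= edges[-1]:
        return None
    # bisect_left returns the index of the first edge >= value,
    # so the interval index is one less than that.
    return labels[bisect_left(edges, value) - 1]


ages = [24, 17, 98, 6, 166]
binned = [bin_label(v) for v in ages]
print(binned)
```

Note that a value sitting exactly on an edge, such as 50, falls into the lower bin ('0-50'), and 0 itself falls outside every bin.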
BeatriceMoissinac / Export Query
Last active May 3, 2018 02:14
[Mongo Export] How to deal with queries for MongoDB #Mongo #Json #Shell
mongoexport --db car --collection groundtruth --query '{"groundtruth": "movies"}' --out movies.json
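Quoting the query by hand gets error-prone as filters grow. One possible approach, sketched here under the assumption that the command is assembled from a script, is to serialize the filter with json.dumps and let shlex.quote handle the shell quoting. The database, collection, and output names are the ones from the command above; nothing here calls MongoDB.

```python
import json
import shlex

# Filter expressed as a Python dict, then serialized to strict JSON.
query = {"groundtruth": "movies"}

args = [
    "mongoexport",
    "--db", "car",
    "--collection", "groundtruth",
    "--query", json.dumps(query),
    "--out", "movies.json",
]

# shlex.quote wraps only the arguments that need it (here, the JSON query).
command = " ".join(shlex.quote(a) for a in args)
print(command)
```

The args list can also be passed directly to subprocess.run, which avoids shell quoting altogether.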
BeatriceMoissinac / unzip command.sh
Created May 1, 2018 23:35
[Unzip Files on S3] #AWS #Shell #S3 #EMR
# UNZIP SESSION FILES
# Reference: https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
# http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
# Hadoop has to be installed on the cluster (add Name=Hadoop)
# As a step
aws emr add-steps --profile $KEY --cluster-id $CLUSTER --steps Type=CUSTOM_JAR,Name="S3DistCp",ActionOnFailure=CONTINUE,Jar="command-runner.jar",Args=[s3-dist-cp,--src,s3://osu-is/aruba/raw/Client_Session_1479414885.zip,--dest,hdfs:///output,--outputCodec,none]
# Or as a direct command in an SSH console (quote the pattern so the shell leaves it intact)
# s3-dist-cp --src s3://osu-is/aruba/raw --dest hdfs:///output --srcPattern '.*\.zip' --outputCodec=none
# aws s3 cp --profile $KEY src/main/shell/unzipper.sh s3://osu-is/jar
BeatriceMoissinac / Notes
Created April 13, 2018 16:07
[How to do options in command line] #Shell
BeatriceMoissinac / time.py
Created January 31, 2018 00:59
[Convert Unix Time] How to convert timestamp to unix time in Python #Python
from datetime import datetime
import time
#-------------------------------------------------
# conversions to strings
#-------------------------------------------------
# datetime object to string
dt_obj = datetime(2008, 11, 10, 17, 53, 59)
date_str = dt_obj.strftime("%Y-%m-%d %H:%M:%S")
print(date_str)
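The excerpt above stops at string formatting. A minimal sketch of the Unix-time conversion the title mentions follows, assuming the naive datetime is meant as UTC (an assumption; the original does not state a time zone).

```python
from datetime import datetime, timezone

dt_obj = datetime(2008, 11, 10, 17, 53, 59)

# Assumption: interpret the naive datetime as UTC before converting.
# Without tzinfo, .timestamp() would use the machine's local time zone instead.
unix_ts = int(dt_obj.replace(tzinfo=timezone.utc).timestamp())
print(unix_ts)  # 1226339639

# Unix time back to a timezone-aware datetime in UTC.
round_trip = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
print(round_trip.strftime("%Y-%m-%d %H:%M:%S"))  # 2008-11-10 17:53:59
```

If the timestamps are in local time rather than UTC, drop the replace call and the result will shift by the local offset.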
BeatriceMoissinac / Create a cluster
Last active February 8, 2023 18:09
[AWS EMR] How to create and manage clusters on AWS EMR #AWS
# vim: syntax=shell
# Shell assignments take no leading "$"; the "$" is only used when reading the variable
JAR=/usr/lib/spark/lib/spark-examples.jar
KEY=MoissinB
# Create cluster with 1st step
aws emr create-cluster --profile $KEY \
--name "Moissinb Cluster" \
--release-label emr-5.10.0 \
--applications Name=Spark \
BeatriceMoissinac / Install commands
Last active January 22, 2018 19:35
[Installation for IS pipeline] What to install to setup the IS pipeline? #Shell
# vim: syntax=shell
# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Java
brew install java
# Install Scala
brew install scala