#import all data, add column headers and run checks
dist <- read.delim("~/Documents/my blog/million song database/7plus songs/output1.txt", header=FALSE)
colnames(dist)<-c('length', 'freq')
dist
dist_time <- read.csv("~/Documents/my blog/million song database/7plus songs/output2.txt", header=FALSE)
import pandas as pd
#open and split file then convert to df
with open("P:\\A.tsv.a.txt", "r") as f:
    lines = [line.strip().split("\t") for line in f]
df=pd.DataFrame(lines)
#pull out columns for further split
#range objects cannot be joined with + in Python 3, so convert to lists first
cols=list(range(18,22))+list(range(33,42))
arrays=df.loc[1:5,cols].values
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
print(X)
#scale the data
from sklearn.preprocessing import StandardScaler
SS=StandardScaler()
XS=SS.fit_transform(X)
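The scaling step above standardizes each column to zero mean and unit variance. A minimal pure-Python sketch of that arithmetic follows, assuming the population standard deviation (which is what StandardScaler uses). The helper name `standardize` and the toy data are hypothetical and only illustrate the calculation.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    # A constant column has std 0; map it to 0.0 rather than divide by zero.
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

# hypothetical toy data: 3 samples, 2 features
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize(toy)
```

Both toy columns end up with identical scaled values, since standardization removes differences in offset and scale between features.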
#import data
import pandas as pd
plays = pd.read_table("usersha1-artmbid-artname-plays-sample.tsv", usecols=[0, 2, 3], names=['user', 'artist', 'plays'])
users = pd.read_table("usersha1-profile-sample.tsv", usecols=[0, 1], names=['user', 'gender'])
#print(plays.head())
#print(users.head())
#drop users whose gender is unknown
users=users.dropna()
#dummy code up gender
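The preview stops before the dummy-coding step itself. One possible way to do it is sketched below in plain Python. It assumes the gender column holds the strings 'm' and 'f', which the snippet does not confirm. The helper `dummy_code` and the sample values are hypothetical.

```python
def dummy_code(values, positive):
    """Return 1 where the value equals `positive`, else 0."""
    return [1 if v == positive else 0 for v in values]

# hypothetical sample of the gender column after dropna()
genders = ['m', 'f', 'f', 'm']
is_male = dummy_code(genders, 'm')
```

With the pandas frame from the snippet, the same idea could be written as `(users['gender'] == 'm').astype(int)`, under the same assumption about the column's values.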
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plays = pd.read_table("usersha1-artmbid-artname-plays-sample.tsv", usecols=[0, 2, 3], names=['user', 'artist', 'plays'])
users = pd.read_table("usersha1-profile-sample.tsv", usecols=[0, 1], names=['user', 'gender'])
users=users.dropna()
#in terminal connect to the master node
ssh [email protected] -i ~/aws_key_pair.pem
#then fire up spark
MASTER=yarn-client /home/hadoop/spark/bin/pyspark
lines = sc.textFile('s3n://jthomson/lastfm_listens/listens/usersha1-artmbid-artname-plays.tsv')
data = lines.map(lambda l: l.split('\t'))
ratings = data.map(lambda d: (d[0], d[2], 1))
users_lkp = ratings.map(lambda s: s[0]).distinct().zipWithUniqueId()
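The last line above builds a lookup from each distinct user string to a numeric id. A rough local sketch of the same idea, without Spark, is shown below. Note that `zipWithUniqueId` only guarantees unique ids, not consecutive ones, whereas this sketch assigns consecutive ids in order of first appearance. The helper `build_lookup` and the sample tuples are hypothetical.

```python
def build_lookup(keys):
    """Map each distinct key to an integer id, in order of first appearance."""
    lookup = {}
    for k in keys:
        if k not in lookup:
            lookup[k] = len(lookup)
    return lookup

# hypothetical (user, artist, rating) tuples shaped like `ratings` above
ratings = [("u1", "radiohead", 1), ("u2", "bjork", 1), ("u1", "bjork", 1)]
users_lkp = build_lookup(r[0] for r in ratings)

# replace the user string with its integer id
encoded = [(users_lkp[u], artist, rating) for u, artist, rating in ratings]
```

In Spark the equivalent replacement would typically be done with a join against the lookup RDD rather than a local dictionary.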
#start a terminal at the folder where spark is installed
#in the command line run this to fire up a pyspark instance
./bin/pyspark
###########################
### LOADING IN THE DATA ###
###########################
#load in the file and examine
lines = sc.textFile('usersha1-artmbid-artname-plays.tsv')
import nltk
#with open('sample.txt', 'r') as f:
#    sample = f.read()
#article taken from the bbc
sample="""Renewed fighting has broken out in South Sudan between forces loyal to the president and vice-president. A reporter in the capital, Juba, told the BBC gunfire and large explosions could be heard all over the city; he said heavy artillery was being used. More than 200 people are reported to have died in clashes since Friday. The latest violence came hours after the UN Security Council called on the warring factions to immediately stop the fighting. In a unanimous statement, the council condemned the violence "in the strongest terms" and expressed "particular shock and outrage" at attacks on UN sites. It also called for additional peacekeepers to be sent to South Sudan.
Chinese media say two Chinese UN peacekeepers have now died in Juba. Several other peacekeepers have been injured, as well as a number of civilians who have been caught in crossfire. The latest round of violence erupted when troops loy |
import nltk
import gensim
sample="""Renewed fighting has broken out in South Sudan between forces loyal to the president and vice-president. A reporter in the capital, Juba, told the BBC gunfire and large explosions could be heard all over the city; he said heavy artillery was being used. More than 200 people are reported to have died in clashes since Friday. The latest violence came hours after the UN Security Council called on the warring factions to immediately stop the fighting. In a unanimous statement, the council condemned the violence "in the strongest terms" and expressed "particular shock and outrage" at attacks on UN sites. It also called for additional peacekeepers to be sent to South Sudan.
Chinese media say two Chinese UN peacekeepers have now died in Juba. Several other peacekeepers have been injured, as well as a number of civilians who have been caught in crossfire. The latest round of violence erupted when troops loyal to President Salva Kiir and first Vice-President Riek Machar began sho |
import pandas as pd
import re
import numpy as np
import nltk
import gensim
#import data. contains identifier and tweet
#DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement
tweets=pd.read_csv('tweets.txt', sep='\t', index_col=False)