- Log in to AWS
- Go to a sensible region
- Start a new instance with Ubuntu Trusty (14.04). Compute-optimised instances have a high vCPU:memory ratio and the lowest-cost CPU time; c4.2xlarge is a decent choice.
- Set the security group (firewall) to have ports 22, 80, and 443 open (SSH, HTTP, HTTPS)
- If you want a static IP address (for long-running instances) then select Elastic IP for this VM
- If you want to use HTTPS, you'll probably need a paid certificate, or to use Amazon's Route 53 to get a non-Amazon domain (to avoid region blocking).
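The firewall step above can also be scripted. The following is a minimal sketch, not a complete provisioning script: it only builds the rule list for ports 22, 80, and 443 in the `IpPermissions` layout that the EC2 API accepts (for example via boto3's `authorize_security_group_ingress`). The helper name `ingress_rules` and the open `0.0.0.0/0` source range are assumptions made for illustration; in practice you may want to restrict SSH to your own address.

```python
def ingress_rules(ports, cidr="0.0.0.0/0"):
    """Build one TCP ingress rule per port in the EC2 IpPermissions layout.

    `cidr` is an assumed source range; 0.0.0.0/0 allows traffic from anywhere.
    """
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


# SSH, HTTP, HTTPS, as in the list above
rules = ingress_rules([22, 80, 443])
for rule in rules:
    print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```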
import re

def search(text, n):
    '''Searches text for the word 'place' and retrieves the n words on either side of it, which are returned separately.'''
    word = r"\W*(\w+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]

t = "The world is a small place, we should try to take care of it."
search(t, 3)
# (('is', 'a', 'small'), ('we', 'should', 'try'))
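The snippet above hardcodes the word being searched for and raises an `AttributeError` when there is no match. A possible variant, sketched here under the assumption that the keyword should be a parameter, escapes the keyword and returns `None` when nothing is found. The name `search_around` is hypothetical.

```python
import re


def search_around(text, keyword, n):
    """Return the n words before and after the first match of keyword, or None."""
    word = r"\W*(\w+)"
    # re.escape keeps regex metacharacters in the keyword from being interpreted
    pattern = r"{}\W*{}{}".format(word * n, re.escape(keyword), word * n)
    match = re.search(pattern, text)
    if match is None:
        return None
    groups = match.groups()
    return groups[:n], groups[n:]


t = "The world is a small place, we should try to take care of it."
found = search_around(t, "place", 3)
missing = search_around(t, "planet", 3)
print(found)
print(missing)
```

Like the original, this only matches when at least n words exist on both sides of the keyword.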
import numpy as np
import pandas as pd

# Create fake data (example taken from Stack Overflow)
df_example = pd.DataFrame({'CG': np.random.randint(0, 5, 100), 'Morph': np.random.choice(['S', 'E'], 100), 'R': np.random.rand(100) * -100})

def my_agg(x):
    # Sort by R ascending, so the most negative value comes first
    x = x.sort_values('R')
    # Morph of the row with the smallest R
    morph = x.head(1)['Morph'].values[0]
    # Gap between the two smallest R values (requires at least two rows)
    diff = x.iloc[0]['R'] - x.iloc[1]['R']
    diff2 = -2.5*np.log10(sum(10**(-0.4*x['R'])))
    # Fraction of the remaining rows whose Morph is 'S'
    prop = (x['Morph'].iloc[1:] == 'S').mean()
    return pd.Series([morph, diff, diff2, prop], index=['morph', 'diff', 'diff2', 'prop'])
import multiprocessing
import pandas as pd
import numpy as np

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
#!/usr/bin/env python
import os

path = 'data'
os.chdir(path)
files = sorted(os.listdir(os.getcwd()), key=os.path.getmtime)
oldest = files[0]
newest = files[-1]
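The snippet above changes the working directory and fails with an `IndexError` on an empty directory. A possible alternative, sketched here with `pathlib`, leaves the working directory alone, skips subdirectories, and returns `(None, None)` when there are no files. The function name `oldest_and_newest` and the temporary demo files are assumptions for illustration.

```python
import os
import tempfile
from pathlib import Path


def oldest_and_newest(directory):
    """Return (oldest, newest) regular files in directory by modification time."""
    files = sorted(
        (p for p in Path(directory).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    if not files:
        return None, None
    return files[0], files[-1]


# Demo on a throwaway directory with explicitly set modification times
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("b.txt", 2_000), ("a.txt", 1_000), ("c.txt", 3_000)]:
        path = Path(tmp) / name
        path.write_text("example")
        os.utime(path, (mtime, mtime))
    oldest, newest = oldest_and_newest(tmp)
    print(oldest.name, newest.name)
```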
This gist is a detailed walkthrough on how to deploy Python Dataflow pipelines in GCP to run without external IPs. Full code samples are available below.
This walkthrough assumes you have already authenticated with the gcloud login commands and have the appropriate IAM privileges to execute these operations.
Since we plan to use no external IPs on our Dataflow worker nodes, we must package all our application dependencies for an offline deployment. I highly recommend using a virtual environment, as your global dependencies will include far more than your single application requires.
Dump your application dependencies into a single file.
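One way to produce such a file from inside the active virtual environment is sketched below. It uses the standard library's `importlib.metadata` to list installed distributions as `name==version` lines, which approximates what `pip freeze` reports but does not handle editable or VCS installs the same way. The function names and the default file name `requirements.txt` are assumptions, not something the walkthrough prescribes.

```python
from importlib import metadata


def freeze_lines():
    """Return sorted 'name==version' lines for distributions in this environment."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the pinned lines to path and return how many were written."""
    lines = freeze_lines()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines))
        if lines:
            handle.write("\n")
    return len(lines)


requirement_lines = freeze_lines()
print(len(requirement_lines), "pinned packages")
```

Run it with the virtual environment's interpreter so that only the application's dependencies are listed.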
import pandas as pd
from google.cloud import firestore

db = firestore.Client()
users = list(db.collection(u'users').stream())
users_dict = list(map(lambda x: x.to_dict(), users))
df = pd.DataFrame(users_dict)
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def my_function(df, by="id", column="v", value=1.0):
    schema = "{} long, {} double".format(by, column)