Stas Batururimi sbatururimi

@sbatururimi
sbatururimi / spark_tips_and_tricks.md
Created January 18, 2019 13:52 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 relative to 8-byte (64-bit) integers.
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a much larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
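The partition-sizing advice above can be reduced to a small calculation. The following is only a sketch: the function name `partitions_for_target_size` and the example size estimate are hypothetical, and the 128 MB default is taken from the empirical figure quoted in the tips.

```python
def partitions_for_target_size(total_bytes, target_partition_mb=128):
    """Return a partition count that keeps each partition at or below the target size.

    Hypothetical helper: the caller must supply the estimated size of the data
    after the memory-expanding operation.
    """
    if total_bytes <= 0:
        return 1
    target_bytes = target_partition_mb * 1024 * 1024
    # Integer ceiling division, so any remainder gets its own partition
    # (this errs on the higher side, as the tips recommend).
    return (total_bytes + target_bytes - 1) // target_bytes


# Assumed example: roughly 10 GiB of data expected after the expansion step.
estimated_bytes = 10 * 1024**3
num_partitions = partitions_for_target_size(estimated_bytes)
print(num_partitions)
```

With PySpark, the resulting count could then be passed to `df.repartition(num_partitions)` on the output of the flatMap step.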
% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
-Dselenium.grid.binary=.../geckodriver \
-Dselenium.enable.headless=true \
-followRedirects \
-dumpText https://nutch.apache.org
# Terminal Cheat Sheet
pwd # print working directory
ls # list files in directory
cd # change directory
~ # home directory
.. # up one directory
- # previous working directory
help # get help
-h # get help
sbatururimi / linux_mount_partition.txt
Last active June 27, 2019 11:15
Mount partition (disk) for all users, source: https://askubuntu.com/a/548445
To mount a partition at startup for all users, we need an entry in the fstab file. At present, the HDD is mounted for whichever user logs in, which gives access permissions only to that user. With an entry in fstab, the partition is mounted by root with access for all users. This r/w access can be restricted later on.
sudo blkid lists all partitions available on your system. Note the UUID of the NTFS partition that you want to mount at boot. In your case, it seems to be 00148BDE148BD4D6.
Now create a folder, for example sudo mkdir /media/ExtHDD01. This is the folder where your external HDD partition will be mounted. The folder will be owned by root. To let other users read and write in it, set the permissions accordingly: sudo chmod -R 777 /media/ExtHDD01 would be good enough. Now you need to edit your fstab file. To do so, type the following command.
sudo nano /etc/fstab
Go to the bottom of the file and add the following line.
UUID=001
sbatururimi / tmux.md
Created May 23, 2019 06:08 — forked from andreyvit/tmux.md
tmux cheatsheet

tmux cheat sheet

(C-x means ctrl+x, M-x means alt+x)

Prefix key

The default prefix is C-b. If you (or your muscle memory) prefer C-a, you need to add this to ~/.tmux.conf:

remap prefix to Control + a

set nocompatible " required
filetype off " required
" set the runtime path to include Vundle and initialize
set rtp+=~/.vim/bundle/Vundle.vim
call vundle#begin()
" alternatively, pass a path where Vundle should install plugins
"call vundle#begin('~/some/path/here')
sbatururimi / memory_check.py
Last active March 13, 2020 11:47
Checking memory in Jupyter notebook
import sys

import numpy as np


def show_mem_usage():
    '''Displays memory usage from inspection
    of global variables in this notebook'''
    # Globals of the caller's frame, i.e. the notebook namespace
    gl = sys._getframe(1).f_globals
    vars = {}
    for k, v in list(gl.items()):
        # for pandas dataframes
        if hasattr(v, 'memory_usage'):
            mem = v.memory_usage(deep=True)
            if not np.isscalar(mem):
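A simplified, standard-library-only sketch of the same idea (walk a namespace and report how large each variable is) might look like the following. It is not the gist's implementation: it assumes the shallow size from `sys.getsizeof` is an acceptable proxy and does not call pandas' `memory_usage`. The names `namespace_sizes` and `demo` are hypothetical.

```python
import sys


def namespace_sizes(namespace):
    """Return (name, shallow size in bytes) pairs, largest first."""
    sizes = []
    for name, value in list(namespace.items()):
        # Skip private and dunder names such as __builtins__.
        if name.startswith('_'):
            continue
        # sys.getsizeof reports the shallow size only; contained objects
        # are not followed.
        sizes.append((name, sys.getsizeof(value)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Assumed demo namespace; in a notebook one could pass globals() instead.
demo = {'small': b'', 'big': bytes(10_000), '_hidden': bytes(50_000)}
report = namespace_sizes(demo)
for name, size in report:
    print(f'{name}: {size} bytes')
```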
sbatururimi / docker_terminal_size.txt
Last active June 26, 2019 14:51
Docker resize of terminal window
docker exec -it container_name /bin/bash -c "export COLUMNS=`tput cols`; export LINES=`tput lines`; exec bash"
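The same command can be assembled from Python, reading the host terminal size with `shutil.get_terminal_size` instead of `tput`. This is a sketch under assumptions: `docker_exec_command` is a hypothetical helper, and `container_name` is the placeholder from the command above.

```python
import shlex
import shutil


def docker_exec_command(container_name):
    """Build the docker exec argument list with the host terminal size baked in."""
    # Falls back to 80x24 when no terminal is attached.
    size = shutil.get_terminal_size(fallback=(80, 24))
    inner = (
        f"export COLUMNS={size.columns}; "
        f"export LINES={size.lines}; exec bash"
    )
    return ["docker", "exec", "-it", container_name, "/bin/bash", "-c", inner]


command = docker_exec_command("container_name")
# Print a shell-quoted version for inspection.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` would start the shell; building it as a list avoids shell-quoting issues on the host side.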
sbatururimi / jupyter_snippets.py
Last active July 7, 2020 07:30
Snippets for Jupyter: autotime+notifications $(jupyter --data-dir)/nbextensions/snippets/snippets.json
{
    "snippets" : [
        {
            "name" : "timing-notifications",
            "code": [
                "%load_ext autoreload",
                "%load_ext jupyternotify",
                "%load_ext autotime"
            ]
        }
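A file of this shape can also be generated with Python's `json` module, which guarantees that brackets and commas are well-formed. The sketch below includes only the single snippet entry visible above; any further entries in the original file are not reproduced, and the variable names are hypothetical.

```python
import json

# Only the entry shown in the preview above; further entries are unknown.
snippets_config = {
    "snippets": [
        {
            "name": "timing-notifications",
            "code": [
                "%load_ext autoreload",
                "%load_ext jupyternotify",
                "%load_ext autotime",
            ],
        }
    ]
}

snippets_json = json.dumps(snippets_config, indent=4)
print(snippets_json)
```

The resulting text would be written to the `snippets.json` path named in the gist description.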
sbatururimi / Data_augmentation_keras_opencv_brightness_hsv.py
Created July 10, 2019 15:09 — forked from avsthiago/Data_augmentation_keras_opencv_brightness_hsv.py
Data augmentation using Keras ImageDataGenerator and OpenCV. Also with brightness augmentation.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 29 09:57:55 2018
@author: avsthiago
"""
from keras.preprocessing.image import ImageDataGenerator
import numpy as np