Stas Batururimi sbatururimi

% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
-Dselenium.grid.binary=.../geckodriver \
-Dselenium.enable.headless=true \
-followRedirects \
-dumpText https://nutch.apache.org
sbatururimi / spark_tips_and_tricks.md
Created January 18, 2019 13:52 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8.
  • Partition DataFrames into evenly distributed partitions of ~128MB each (empirical finding). When in doubt, err on the side of more partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
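As a rough sketch of the ~128MB sizing rule in the tips above, the helper below estimates a partition count from an estimated dataset size. The function name `estimate_num_partitions` and the idea of estimating the total size up front are assumptions made for illustration, not something the gist prescribes; in PySpark the result would typically be passed to `DataFrame.repartition`.

```python
import math

# ~128MB per partition, the empirical target mentioned in the tips.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_num_partitions(estimated_total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Return a partition count that keeps each partition at or below target_bytes.

    Hypothetical helper: estimated_total_bytes is the caller's own estimate of
    the data size after the expanding operation (e.g. after flatMap).
    """
    if estimated_total_bytes <= 0:
        return 1
    # Round up so that we err on the side of more, smaller partitions.
    return max(1, math.ceil(estimated_total_bytes / target_bytes))


# Example: an estimated 10 GiB of data after the expansion step.
print(estimate_num_partitions(10 * 1024 ** 3))
```

Rounding up rather than down follows the advice to prefer more partitions over fewer.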
sbatururimi / sshd_config
Last active January 10, 2019 10:10 — forked from HacKanCuBa/sshd_config
Modern secure SSH daemon config
# Modern secure (OpenSSH Server 7+) SSHd config by HacKan
# Refer to the manual for more info: https://www.freebsd.org/cgi/man.cgi?sshd_config(5)
# Server fingerprint
# Regenerate with: ssh-keygen -f /etc/ssh/ssh_host_rsa_key -N '' -t rsa -b 4096
HostKey /etc/ssh/ssh_host_rsa_key
# Regenerate with: ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -N '' -t ed25519
HostKey /etc/ssh/ssh_host_ed25519_key
# Log for audit, even users' key fingerprint
sbatururimi / .bash_profile
Last active January 10, 2019 11:50
A simple bash profile with colorization
# terminal color
export PS1="\[\033[00;35m\][GPU_Andrew] \[\033[00;32m\]\\u:\\W\\$\[\033[00m\] "
# alternative
# export PS1="\[\033[00;36m\][\\d, \\t] \[\033[00;32m\]\\u:\\W\\$\[\033[00m\] "
# vi as vim
alias vi=vim
# alias for ssh proxy (uncomment if needed)
sbatururimi / .vimrc
Created November 15, 2018 07:07
vimrc
" vim plugins with pathogen
execute pathogen#infect()
syntax on
"filetype plugin indent on
filetype indent on
set autoindent
" number
set number
sbatururimi / google-pylint.rc
Created November 8, 2018 08:40
A set of rules to be used with pylint: pylint --rcfile=google-pylint.rc
[MASTER]
# Specify a configuration file.
#rcfile=
# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
# Profiled execution.
sbatururimi / Parsing url
Created October 31, 2018 08:51
Parsing URLs to obtain, for example, the host
regex = r"^[a-z][a-z0-9+\-.]*:\/\/([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])"
\A
[a-z][a-z0-9+\-.]*:// # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named or IPv4 host
|\[[a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPv6+ host
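To illustrate how the pattern above might be used, here is a minimal Python sketch that reads the host from capture group 2. The helper name `host_of` and the sample URLs are assumptions for illustration; `re.IGNORECASE` is also an assumption, added so that uppercase schemes and hosts match, which the original snippet does not state.

```python
import re

# One-line pattern from the snippet above.
# Group 1 is the optional "user@" part, group 2 is the host.
URL_RE = re.compile(
    r"^[a-z][a-z0-9+\-.]*:\/\/([a-z0-9\-._~%!$&'()*+,;=]+@)?"
    r"([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])",
    re.IGNORECASE,  # assumption: match case-insensitively
)


def host_of(url):
    """Return the host part of url, or None if the pattern does not match."""
    match = URL_RE.match(url)
    return match.group(2) if match else None


print(host_of("https://nutch.apache.org"))      # named host
print(host_of("http://[::1]:8080/index.html"))  # bracketed IPv6 host
```

Because the pattern is anchored only at the start, `match` stops after the host and ignores any port, path, or query that follows.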
sbatururimi / pyspark_pass graphframes package.txt
Last active June 27, 2019 11:20
Install GraphFrames with Jupyter notebook
mkdir ~/jupyter
cd ~/jupyter
wget https://github.com/graphframes/graphframes/archive/release-0.6.0.zip
unzip release-0.6.0.zip
cd graphframes-release-0.6.0
build/sbt assembly
cd ..
# Copy necessary files to root level so we can start pyspark.
cp graphframes-release-0.6.0/target/scala-2.11/graphframes-assembly-0.6.0-spark2.3.jar .
sbatururimi / remote_desktop.txt
Last active June 11, 2019 06:35
Access a remote Ubuntu desktop
1. Install tightvnc
```bash
sudo apt-get install tightvncserver
```
2. Set up VNC to start the default desktop
```bash
vi ~/.vnc/xstartup
```
3. Insert and save:
sbatururimi / URL parsing Regex.js
Created September 4, 2018 15:03 — forked from metafeather/URL parsing Regex.js
URL parsing regex.js
/*
A single regex to parse and break up a full URL including query parameters and anchors e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
*/
Url.regex = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/;
url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
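For comparison, the same expression can be tried outside the browser. The sketch below ports the pattern to Python unchanged and reads the protocol and host from groups 2 and 3, mirroring `RegExp.$2` and `RegExp.$3` above. The function name `split_url` and the returned dictionary are illustrative assumptions, not part of the original gist.

```python
import re

# Same pattern as Url.regex above, written as a Python raw string.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$"
)


def split_url(url):
    """Return a dict with the whole match, protocol, and host, or None on no match."""
    match = URL_RE.match(url)
    if match is None:
        return None
    return {
        "url": match.group(0),       # whole match, like RegExp['$&']
        "protocol": match.group(2),  # like RegExp.$2
        "host": match.group(3),      # like RegExp.$3
    }


parts = split_url(
    "https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash"
)
print(parts["protocol"], parts["host"])
```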