- If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by roughly a factor of 8 compared with 8-byte (64-bit) values.
- Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
- Pay particular attention to the number of partitions when using `flatMap`, especially if the following operation will result in high memory usage. The `flatMap` op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g., converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of `flatMap` to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
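
The sizing rule in the tips above can be sketched as plain arithmetic. This is a minimal sketch, not a Spark API: `suggested_num_partitions` is a hypothetical helper, and the row count and per-row size are made-up illustration values. It assumes you can estimate the total in-memory size of the expanded data, and it rounds up so the result errs on the higher side.

```python
# ~128 MB per partition, the empirical target from the tips above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggested_num_partitions(estimated_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Hypothetical helper: smallest partition count that keeps each
    partition at or below target_bytes, assuming evenly distributed data."""
    if estimated_bytes <= 0:
        return 1
    # Integer ceiling division, so we always round up (more partitions).
    return (estimated_bytes + target_bytes - 1) // target_bytes


# Illustration only (assumed numbers): 1,000,000 rows after flatMap, each
# later expanded to a vector of 10,000 float64 values (8 bytes each).
rows_after_flatmap = 1_000_000
bytes_per_expanded_row = 10_000 * 8
n = suggested_num_partitions(rows_after_flatmap * bytes_per_expanded_row)
print(n)
```

In PySpark the resulting count would typically be passed to `DataFrame.repartition(n)` on the `flatMap` output, before the memory-heavy op runs.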
% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
    -Dselenium.grid.binary=.../geckodriver \
    -Dselenium.enable.headless=true \
    -followRedirects \
    -dumpText https://nutch.apache.org
# Modern secure (OpenSSH Server 7+) SSHd config by HacKan
# Refer to the manual for more info: https://www.freebsd.org/cgi/man.cgi?sshd_config(5)
# Server fingerprint
# Regenerate with: ssh-keygen -f /etc/ssh/ssh_host_rsa_key -N '' -t rsa -b 4096
HostKey /etc/ssh/ssh_host_rsa_key
# Regenerate with: ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -N '' -t ed25519
HostKey /etc/ssh/ssh_host_ed25519_key
# Log for audit, even users' key fingerprint
# terminal color
export PS1="\[\033[00;35m\][GPU_Andrew] \[\033[00;32m\]\\u:\\W\\$\[\033[00m\] "
# alternative
# export PS1="\[\033[00;36m\][\\d, \\t] \[\033[00;32m\]\\u:\\W\\$\[\033[00m\] "
# vi as vim
alias vi=vim
# alias for ssh proxy (uncomment if needed)
" vim plugins with pathogen
execute pathogen#infect()
syntax on
"filetype plugin indent on
filetype indent on
set autoindent
" number
set number
[MASTER]
# Specify a configuration file.
#rcfile=
# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
# Profiled execution.
regex = r"^[a-z][a-z0-9+\-.]*:\/\/([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])"
\A
[a-z][a-z0-9+\-.]*:// # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named or IPv4 host
|\[[a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPv6+ host
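
The commented pattern above can be compiled as-is with `re.VERBOSE`, which is what lets the `# Scheme`, `# User`, and host comments live inside the pattern. The following is a minimal sketch covering only the part of the pattern visible above: `URL_PREFIX` and `split_authority` are hypothetical names, and `re.IGNORECASE` is an assumption, since the snippet does not show which flags the author used.

```python
import re

# Verbose form of the pattern shown above.
# ASSUMPTION: re.IGNORECASE is not in the original snippet; it is added here
# so that upper-case schemes and hosts also match.
URL_PREFIX = re.compile(
    r"""
    \A
    [a-z][a-z0-9+\-.]*://                  # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?         # User
    ([a-z0-9\-._~%]+                       # Named or IPv4 host
    |\[[a-z0-9\-._~%!$&'()*+,;=:]+\])      # IPv6+ host
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_authority(url):
    """Hypothetical helper: return (user, host), or None if the URL prefix
    does not match. The user is None when no 'user@' part is present."""
    m = URL_PREFIX.match(url)
    if m is None:
        return None
    user = m.group(1)[:-1] if m.group(1) else None  # drop the trailing '@'
    return user, m.group(2)


print(split_authority("https://user@example.com/path"))
print(split_authority("http://[::1]:8080/"))
```

Only the scheme, user, and host are handled here; port, path, and query are outside what the visible pattern covers.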
mkdir ~/jupyter
cd ~/jupyter
wget https://github.com/graphframes/graphframes/archive/release-0.6.0.zip
unzip release-0.6.0.zip
cd graphframes-release-0.6.0
build/sbt assembly
cd ..
# Copy necessary files to root level so we can start pyspark.
cp graphframes-release-0.6.0/target/scala-2.11/graphframes-assembly-0.6.0-spark2.3.jar .
1. Install tightvnc
```bash
sudo apt-get install tightvncserver
```
2. Set up VNC to start the default desktop
```
vi ~/.vnc/xstartup
```
3. Insert and save:
/*
A single regex to parse and breakup a full URL including query parameters and anchors e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
*/
Url.regex = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/;
url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
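
For comparison, the same pattern can be tried outside JavaScript. This is a minimal Python sketch, not the author's code: `URL_RE` and `parse_url` are hypothetical names, and only the three fields visible in the snippet above (`url`, `protocol`, `host`) are mapped, since the rest of the original object is cut off. It reads capture groups from the match object instead of the legacy static `RegExp.$n` properties.

```python
import re

# Same pattern as Url.regex above, written as a Python raw string.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$"
)


def parse_url(url):
    """Hypothetical helper: return the fields visible in the snippet above,
    or None if the pattern does not match."""
    m = URL_RE.match(url)
    if m is None:
        return None
    return {
        "url": m.group(0),       # whole match, like RegExp['$&']
        "protocol": m.group(2),  # like RegExp.$2
        "host": m.group(3),      # like RegExp.$3
    }


example = "https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash"
print(parse_url(example))
```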