- Apache Spark 2.4.0
- Spark shell: `/home/ubuntu/spark/bin/spark-shell`
- Python 3.7.1 (Anaconda)
- Java 8
- Ruby 2.5.1
- jq
- GNU Parallel
- Jupyter Notebook
- Datathon notebook
- 16 virtual cores
- 30G of RAM (one machine has 60G)
- Data: `/mnt/data` (179G of free space)
- #climatemarch
- #elxn42
- #marchforscience
- #panamapapers
- #womensmarch
- Each collection's directory will have a `warcs` directory containing all of its ARCs/WARCs.
- Each collection's directory will have a `derivatives` directory containing the scholarly derivatives created on cloud.archivesunleashed.org. More info about those files can be found here.
- See this for a look at the directory structure; a rough sketch also follows below.
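Putting the pieces above together, the layout should look roughly like this. The collection directory names are an assumption based on the hashtags, and `womensmarch` stands in for any of the five collections:

```
/mnt/data/
├── womensmarch/            # one directory per collection
│   ├── warcs/              # the collection's ARCs/WARCs
│   └── derivatives/        # derivatives from cloud.archivesunleashed.org
├── elxn42/
│   ├── warcs/
│   └── derivatives/
└── ...
```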
- To run the Apache Spark shell with `aut` on each machine, run the following command (once you're in the shell, you can try the snippet below to confirm everything works):

```
~/spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0"
```
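As a quick sanity check, a minimal aut job in the shell looks like this; the collection path is just an example, so point it at one of the real `warcs` directories under `/mnt/data`:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load a collection's WARCs (example path; substitute a real one),
// keep the valid pages, and count the ten most frequent domains.
RecordLoader.loadArchives("/mnt/data/womensmarch/warcs/", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```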
- In the course of your project, you might need to use additional flags. These should work well on each machine: `--master local[*]` uses all 16 cores, `--driver-memory 12G` gives the driver 12G of the machine's RAM, and the two timeout settings keep long-running jobs from being killed mid-task.

```
~/spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0" --master local[*] --driver-memory 12G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s
```
- The permissions on the key should be 600. You can do this with the following command on your own laptop before shelling in:

```
chmod 600 /path/to/archives-hackathon.key
```
- You can shell into the machines with the following command:

```
ssh -i /path/to/archives-hackathon.key [email protected]
```
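If you'd rather not pass the key path every time, one option is an entry in `~/.ssh/config`. This is only a convenience sketch: the `datathon` alias is arbitrary, `<machine-ip>` is the address you'll receive on day one, and the `ubuntu` user is an assumption based on the `/home/ubuntu` paths above.

```
# Hypothetical ~/.ssh/config entry; fill in the real IP when you get it.
Host datathon
    HostName <machine-ip>
    User ubuntu
    IdentityFile /path/to/archives-hackathon.key
```

With that in place, `ssh datathon` is enough to connect.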
- I will provide the key and the IP address of each machine after teams are formed on the first day of the datathon.