- Apache Spark 2.4.0
- Spark shell: `/home/ubuntu/spark/bin/spark-shell`
- Python 3.7.1 (Anaconda)
- Java 8
- Ruby 2.5.1
- jq
- GNU Parallel
- Jupyter Notebook
- Datathon notebook
- 16 virtual cores
- 30G of RAM (one machine has 60G)
- Data: `/mnt/data` (179G of free space)
- #climatemarch
- #elxn42
- #marchforscience
- #panamapapers
- #womensmarch
- Each collection's directory will have a `warcs` directory containing all of its ARCs/WARCs.
- Each collection's directory will have a `derivatives` directory containing the scholarly derivatives created on cloud.archivesunleashed.org. More info about those files can be found here.
- See this for a look at the directory structure; a rough sketch also follows below.
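Putting the pieces above together, the layout should look roughly like this. The collection directory names are an assumption based on the hashtags, and `womensmarch` stands in for any of the five collections:

```
/mnt/data/
├── womensmarch/            # one directory per collection
│   ├── warcs/              # the collection's ARCs/WARCs
│   └── derivatives/        # derivatives from cloud.archivesunleashed.org
├── elxn42/
│   ├── warcs/
│   └── derivatives/
└── ...
```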
- To run the Apache Spark shell with `aut` on each machine, run the following command (once you're in the shell, you can try the snippet below to confirm everything works):

```
~/spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0"
```
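As a quick sanity check, a minimal aut job in the shell looks like this; the collection path is just an example, so point it at one of the real `warcs` directories under `/mnt/data`:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load a collection's WARCs (example path; substitute a real one),
// keep the valid pages, and count the ten most frequent domains.
RecordLoader.loadArchives("/mnt/data/womensmarch/warcs/", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```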
- In the course of your project, you might need to use additional flags. These should work well on each machine: `--master local[*]` uses all 16 cores, `--driver-memory 12G` gives the driver 12G of the machine's RAM, and the two timeout settings keep long-running jobs from being killed mid-task.

```
~/spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0" --master local[*] --driver-memory 12G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s
```
- The permissions on the key should be 600. You can do this with the following command on your own laptop before shelling in:

```
chmod 600 /path/to/archives-hackathon.key
```
- You can shell into the machines with the following command:

```
ssh -i /path/to/archives-hackathon.key [email protected]
```
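If you'd rather not pass the key path every time, one option is an entry in `~/.ssh/config`. This is only a convenience sketch: the `datathon` alias is arbitrary, `<machine-ip>` is the address you'll receive on day one, and the `ubuntu` user is an assumption based on the `/home/ubuntu` paths above.

```
# Hypothetical ~/.ssh/config entry; fill in the real IP when you get it.
Host datathon
    HostName <machine-ip>
    User ubuntu
    IdentityFile /path/to/archives-hackathon.key
```

With that in place, `ssh datathon` is enough to connect.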
- I will provide the key and the IP address of each machine after teams are formed on the first day of the datathon.