- Apache Spark 2.4.0
- Spark shell: /home/ubuntu/spark/bin/spark-shell
- Python 3.7.1 (Anaconda)
- Java 8
- Ruby 2.5.1
- jq
/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit --master local\[2\] --driver-memory 4g --conf spark.driver.maxResultSize=0 --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.80.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities --output /home/nruest/Projects/au/sample-data/3.0.0-testing/audio/csv
/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit --master local\[2\] --driver-memory 4g --conf spark.driver.maxResultSize=0 --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.80.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /home/nruest/Projects/au/sample-data/geocities --output /home/nruest/Projects/au/sample-data/3.0.0-testing/domains/csv
/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit --master local\[2\] --driver-memory 4g --conf spark.driver.maxResultSize=0 --class io.archivesunleashed.app.CommandLineAppR
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.
URL | MD5 | COUNT MD5 | COUNT FILENAME
---|---|---|---
http://www.geocities.com/clipart/pbi/c.gif | c4746081d66bc2abc269f22ca27ebb46 | 2,705 | 373,198
http://pic.geocities.com/images/pixel.gif | b4682377ddfbe4e7dabfddb2e543e842 | 3,336 | 18,685
http://www.google.com/images/cleardot.gif | fc94fb0c3ed8a8f909dbc7630a0987ff | 69,625 | 747
http://www.google.com/clear.gif | 55fade2068e7503eae8d7ddf5eb6bd09 | 2,551 | 13,852
https://killersites.com/killerSites/resources/dot_clear.gif | b4682377ddfbe4e7dabfddb2e543e842 | 3,336 | 1,780
https://mail.google.com/mail/images/cleardot.gif | fc94fb0c3ed8a8f909dbc7630a0987ff | 69,625 | 747
http://visit.geocities.yahoo.com/visit.gif | 4f59788bde58d15d541a9c116d0e850d | 2,729,121 | 2,731,243
http://blingee.com/images/spaceball.gif | 325472601571f31e1bf00674c368d335 | 18,537,796 | 39
http://www-cdr.stanford.edu/~petrie/blank.gif | accba0b69f352b4c9440f05891b015c5 | 1,341 | 26,292
http://it.geocities.com/grannoce/camere/thumb/camera_blu_001.jpg,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
http://ar.geocities.com/angeles_uno/PLAYMATES/1999/JUNIO/KIMBERLY_SPICER/06_small.jpg,06_small.jpg,jpg,image/jpeg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
http://in.geocities.com/nileshtx/images/DSC01219.jpg,DSC01219.jpg,jpg,image/jpeg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
http://ar.geocities.com/aliaga_fernandoo/ediciones/ed7/imagenes/menu/MENU7_r11_c21.
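
The rows above sort by MD5, so identical images served from different URLs end up next to each other (the two logo.gif rows share one hash). A minimal Python sketch of grouping such rows by hash follows. It assumes the column order matches the select(...) call in the Scala snippet further down (url, filename, extension, mime_type_web_server, mime_type_tika, width, height, md5); the function name group_urls_by_md5 and the inline sample are illustrative only and not part of aut.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, taken from the select(...) call in the Scala snippet.
COLUMNS = ["url", "filename", "extension", "mime_type_web_server",
           "mime_type_tika", "width", "height", "md5"]


def group_urls_by_md5(lines):
    """Map each MD5 to the list of URLs whose image has that hash."""
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) != len(COLUMNS):
            continue  # skip truncated or malformed rows
        record = dict(zip(COLUMNS, row))
        groups[record["md5"]].append(record["url"])
    return groups


# Three complete rows copied from the CSV output above.
sample = io.StringIO(
    "http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e\n"
    "http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07\n"
    "http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07\n"
)

# Keep only hashes that appear under more than one URL.
duplicates = {md5: urls
              for md5, urls in group_urls_by_md5(sample).items()
              if len(urls) > 1}

for md5, urls in duplicates.items():
    print(md5, len(urls))
```

For a real run, an open file handle on the derivative CSV could be passed in place of the inline sample; rows with the wrong number of fields (such as a truncated last line) are skipped rather than guessed at.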
import io.archivesunleashed._
import io.archivesunleashed.df._

// Load the web archive collection and extract image details as a DataFrame.
val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF();

// Select the image columns and sort by MD5 hash, descending.
images.select($"url", $"filename", $"extension", $"mime_type_web_server",
    $"mime_type_tika", $"width", $"height", $"md5")
  .orderBy(desc("md5"))
├── albany
│ ├── environmental-advocates
│ │ ├── derivatives
│ │ └── warcs
│ ├── gillibrand
│ │ ├── derivatives
│ │ └── warcs
│ ├── ny-civil-liberties
│ │ ├── derivatives
$ ./spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout='10000000' --conf spark.executor.heartbeatInterval='600s' --conf spark.driver.maxResultSize='4G' --jars ~/git/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar
2018-11-30 09:08:03 WARN Utils:66 - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-11-30 09:08:03 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-11-30 09:08:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1543586887449).
Spark session available as 'spark'.
Welcome to