| docker run --rm -it -v "/my/data:/data" aut:0.50.0 /spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.50.0" --driver-memory 7G |
| .select($"crawl_date", $"url", RemoveHTMLDF(ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))) |
| import io.archivesunleashed._ | |
| import io.archivesunleashed.matchbox._ | |
| RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc) | |
| .webpages() | |
| .keepLanguagesDF(Set("de")) | |
| .select($"crawl_date", $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content"))) | |
| .write.csv("/political_actors_data/plain-text-noboilerplate-df/") |
| import io.archivesunleashed._ | |
| import io.archivesunleashed.matchbox._ | |
| RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc) | |
| .webpages() | |
| .keepLanguagesDF(Set("de")) | |
| .select($"crawl_date", $"url", RemoveHTMLDF($"content")) | |
| .write.csv("/political_actors_data/plain-text-df/") |
| seed,status,all_count,new_count,all_size,new_size | |
| https://blog.aarp.org/aarp-celebrates-stonewall-50th-anniversary-during-lgbt-pride-month,Crawled,10356,9664,5826862764,5804623908 | |
| https://www.vogue.com/article/stonewall-inn-50th-anniversary-interview/,Redirected,25227,16941,5471678002,5385428188 | |
| http://nyfos.org/stonewall-at-50/,Crawled,69082,68923,5371521735,5369687319 | |
| https://en.wikipedia.org/wiki/Stonewall_50_%E2%80%93_WorldPride_NYC_2019,Crawled,40728,39016,5416393613,5369366999 | |
| https://www.atlanta.net/events/detail/stonewall-50-exhibit-at-atlanta-city-hall/122642/,Crawled,66378,62970,5423253752,5369149241 | |
| https://www.thedailybeast.com/stonewall-50-dont-forget-the-black-and-brown-lgbtq-struggle/,Crawled,61033,56887,5448153720,5369118822 | |
| https://kywnewsradio.radio.com/categories/stonewall-50/,Crawled,34508,34174,5414148507,5369005035 | |
| http://www.roosevelthouse.hunter.cuny.edu/events/fifty-years-stonewall-now-go/,Crawled,86378,86261,5370103576,5368828744 | |
| https://soundcloud.com/workingclasshistory/stonewall-r |
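
The seed report above has six columns (seed, status, all_count, new_count, all_size, new_size), and its last row is cut off after the seed URL. A minimal sketch of reading such a report in Python while skipping incomplete rows might look like the following. The function name `parse_seed_report` and the dictionary layout are illustrative assumptions, not part of any crawler's tooling, and the sample text reuses three rows from the report as shown.

```python
import csv
import io

# Column names taken from the header row of the report above.
EXPECTED_COLUMNS = ["seed", "status", "all_count", "new_count", "all_size", "new_size"]


def parse_seed_report(text):
    """Parse seed-report CSV text (hypothetical helper).

    Returns (rows, skipped): rows is a list of dicts with integer counts
    and sizes; skipped is the number of rows without exactly six fields.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header}")

    rows, skipped = [], 0
    for fields in reader:
        # A truncated or blank row has the wrong number of fields; skip it.
        if len(fields) != len(EXPECTED_COLUMNS):
            skipped += 1
            continue
        seed, status, *numbers = fields
        all_count, new_count, all_size, new_size = (int(n) for n in numbers)
        rows.append({
            "seed": seed,
            "status": status,
            "all_count": all_count,
            "new_count": new_count,
            "all_size": all_size,
            "new_size": new_size,
        })
    return rows, skipped


# Three rows copied from the report above; the last one is truncated.
SAMPLE = """seed,status,all_count,new_count,all_size,new_size
http://nyfos.org/stonewall-at-50/,Crawled,69082,68923,5371521735,5369687319
https://www.vogue.com/article/stonewall-inn-50th-anniversary-interview/,Redirected,25227,16941,5471678002,5385428188
https://soundcloud.com/workingclasshistory/stonewall-r
"""

rows, skipped = parse_seed_report(SAMPLE)
for row in rows:
    print(row["status"], row["new_count"], row["seed"])
print("skipped rows:", skipped)
```

Counting skipped rows instead of raising an error keeps a partially copied report usable while still showing that something was dropped.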
| from google.colab import files | |
| uploaded = files.upload() |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> | |
| <!--page.tpl.php--> | |
| <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr"> | |
| <head> | |
| <title>Consumers need protection from genetically modified foods | NDP</title> | |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> | |
| <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /> | |
| <link type="text/css" rel="stylesheet" media="all" href="/sites/all/modules/nice_menus/nice_menus.css?f" /> |
| ianmilligan1@Ians-MacBook-Pro-3:~/dropbox/git/aut$ python ~/dropbox/git/aut/src/main/python/tf/detect.py --web_archive "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz" --aut_jar /Users/ianmilligan1/dropbox/git/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar --spark /Users/ianmilligan1/dropbox/spark-2.4.3-bin-hadoop2.7/bin --master spark://Ians-MacBook-Pro-3.local:7077 --img_model ssd --filter_size 50 50 --output_path /Users/ianmilligan1/desktop/aut-image-tf-testing | |
| 19/07/10 15:44:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | |
| Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties | |
| Setting default log level to "WARN". | |
| To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). | |
| height >= 50 and width >= 50 | |
| [Stage 0:> (0 + 1) / 1 |
| scala> :paste | |
| // Entering paste mode (ctrl-D to finish) | |
| import io.archivesunleashed._ | |
| import io.archivesunleashed.matchbox._ | |
| RecordLoader.loadArchives("example.arc.gz", sc) | |
| .keepValidPages() | |
| .keepDomains(Set("www.archive.org")) | |
| .keepLanguages(Set("fr")) |
| 2018-11-30 09:36:20,149 [main-ScalaTest-running-CommandLineAppTest] ERROR CommandLineApp - _AUTCmdTestOutputDir already exists | |
| Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.191 sec <<< FAILURE! | |
| command line app tests(io.archivesunleashed.CommandLineAppTest) Time elapsed: 0.108 sec <<< ERROR! | |
| java.lang.IllegalArgumentException | |
| at io.archivesunleashed.app.CommandLineApp.verifyArgumentsOrExit(CommandLineApp.scala:219) | |
| at io.archivesunleashed.app.CommandLineAppRunner$.test(CommandLineApp.scala:344) | |
| at io.archivesunleashed.CommandLineAppTest$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(CommandLineAppTest.scala:76) | |
| at io.archivesunleashed.CommandLineAppTest$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(CommandLineAppTest.scala:75) | |
| at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) | |
| at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) |