NOTE - The R scripts are not optimised; they were created simply to prove this can be done.
Problem to solve - extract all the records from the aerial photography webmap on Historic England's website and create a CSV file. This relates to a section 21 refusal of a FOIA request by Andy Mabbett: https://www.whatdotheyknow.com/request/847052/response/2037681/attach/html/4/Mr%20Mabbett%20FOI.docx.html
He has uploaded an example image to Wikimedia Commons here: https://commons.wikimedia.org/wiki/File:Historic_England_Aerial_Photo_Explorer_-_raf_540_78_sffo_0003_-_screenshot_-_01.png
All scripts were written and tested on an Intel-based Mac running macOS, with the latest R, RStudio, Docker, and ImageMagick. Guides to installing these packages can be found elsewhere; most of them I installed via Homebrew.
For the oblique photography layer:
- Look at this StackOverflow Q&A: https://stackoverflow.com/questions/50161492/how-do-i-scrape-data-from-an-arcgis-online-map
- Get the ID for the application: 9adb70fef4fa4844ba0e091a12e66455
- Find the URL for the layer you want to query: https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29
- Work out what you want to query for using the HTML page https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29/query?where=0%3D0&outFields=%2A&f=html
- Write a script to page through the queried data set and create a CSV file with only the columns you need and a newly generated URL column (a sketch of the paging approach follows below).
Final script: See scrapePhotoDataEsri.R
Constraints on this - the default query limit is 50 records per request, and the layer holds 394,390 records in total.
I didn't want to search only for RAF photos, as there may be more useful information in the other records.
To parse just RAF photography, add an appropriate filter to the where clause.
The final parameter (f) can be html, pjson, or pgeojson.
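A minimal sketch of the paging approach, assuming the httr and jsonlite packages (the output file name is illustrative; the exact columns kept and the URL-generation logic live in scrapePhotoDataEsri.R):

```r
library(httr)
library(jsonlite)

base_url <- paste0(
  "https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/",
  "Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29/query"
)

page_size <- 50      # default query limit noted above
total     <- 394390  # total records in the oblique layer
pages     <- list()

# Page through the layer using resultOffset/resultRecordCount
for (offset in seq(0, total - 1, by = page_size)) {
  resp <- GET(base_url, query = list(
    where             = "0=0",
    outFields         = "*",
    resultOffset      = offset,
    resultRecordCount = page_size,
    f                 = "pjson"
  ))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                     flatten = TRUE)
  pages[[length(pages) + 1]] <- parsed$features
}

# Combine the pages; the real script then keeps only the needed
# columns and derives a URL column before writing the CSV
records <- do.call(rbind, pages)
write.csv(records, "obliques.csv", row.names = FALSE)
```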
For the vertical photography layer, the process is the same:
- Look at this StackOverflow Q&A: https://stackoverflow.com/questions/50161492/how-do-i-scrape-data-from-an-arcgis-online-map
- Get the ID for the application: 9adb70fef4fa4844ba0e091a12e66455
- Find the URL for the layer you want to query: https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Verts_07_02_22_WGS84_Date_view/FeatureServer/32/
- Work out what you want to query for using the HTML page https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Verts_07_02_22_WGS84_Date_view/FeatureServer/32/query?where=0%3D0&outFields=%2A&f=html
- Write a script to page through the queried data set and create a CSV file with only the columns you need and a newly generated URL column.
Final script: See scrapeVertical.R
Constraints on this - the default query limit is 50 records per request, and the layer holds 47,230 records in total.
I didn't want to search only for RAF photos, as there may be more useful information in the other records.
Each image, in theory, has a British National Grid easting/northing pair; the scripts below convert these to a lat/lng pair.
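A minimal sketch of that conversion using the sf package (EPSG:27700 is British National Grid, EPSG:4326 is WGS84; the sample values are illustrative, and the actual scripts may implement this differently):

```r
library(sf)

# Illustrative easting/northing pair
pts <- data.frame(easting = 530000, northing = 180000)

# Treat the pair as British National Grid, then reproject to WGS84
bng <- st_as_sf(pts, coords = c("easting", "northing"), crs = 27700)
wgs <- st_transform(bng, crs = 4326)

st_coordinates(wgs)  # X = longitude, Y = latitude
```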
The final parameter (f) can be html, pjson, or pgeojson.
You can visualise these data using Simon Willison's Datasette libraries at https://aerial-photos-exploration.glitch.me/data/. The mapping plugin is active in this instance, which may take a few minutes to wake up.
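To reproduce such an instance from the CSV, something along these lines should work (file names are illustrative, and datasette-cluster-map is an assumption about which mapping plugin the instance uses):

```sh
pip install csvs-to-sqlite datasette datasette-cluster-map
csvs-to-sqlite verticals.csv aerial.db
datasette aerial.db
```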
Using RSelenium and a Dockerised Selenium server, it is possible to automate downloading these images. My Docker knowledge is a little noddy, but this is how I got it working:
- Install Docker
- Pull a Docker standalone Selenium image (I'm using this one: https://github.com/SeleniumHQ/docker-selenium)
docker pull selenium/standalone-chrome
- Map a directory on your machine for storing downloaded images - I mapped a directory in my home folder.
- Start Docker with Selenium standalone Chrome using a command like the one below:
docker run -d -p 4444:4444 -v /Users/Danielpett/rafImages:/home/seluser/Downloads --shm-size="2g" selenium/standalone-chrome:4.1.4-20220427
- Use the automatedDownload.R script to get an image, based on this example code: https://gist.github.com/stuartlangridge/82c87a601a7e2ae566e640d93138ed85 (a minimal sketch of the RSelenium side follows this list)
- Downloaded images will appear in the mapped volume
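A minimal sketch of the RSelenium side, assuming the container above is listening on port 4444 (the URL and CSS selector are placeholders; the real navigation logic lives in automatedDownload.R):

```r
library(RSelenium)

# Connect to the Selenium server running in the Docker container
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port             = 4444L,
  browserName      = "chrome"
)
remDr$open()

# Placeholder URL: the real script navigates to each image's page
remDr$navigate("https://example.org/some-photo-record")

# Placeholder selector: click whichever element triggers the download
btn <- remDr$findElement(using = "css selector", value = "a.download")
btn$clickElement()

Sys.sleep(10)  # give the browser time to save the file to the mapped volume
remDr$close()
```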
Another script, batchAutomated.R, allows one to download all the images (which may take a long time). It downloads each image and converts it to a JPG using ImageMagick. Timeout increases were needed both to capture decent-sized images and to pause processing while images are saved.
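A minimal sketch of the conversion step using the magick R package (the paths are illustrative; batchAutomated.R loops over every downloaded file and adds the timeouts mentioned above):

```r
library(magick)

# Read a downloaded image from the mapped volume and write it out as a JPG
img <- image_read("/Users/Danielpett/rafImages/example.png")
image_write(img, path = "/Users/Danielpett/rafImages/example.jpg",
            format = "jpeg")

Sys.sleep(5)  # pause so downloads can finish before the next conversion
```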