Last active
December 29, 2015 03:39
This script removes the sidebar and navigation information from the HTML files of the Puppet Labs documentation in the "puppetdocs-latest" folder. It uses Nokogiri to select the content of the div with the "primary-content" class, strips out the final "Back to top" link at the bottom, and then writes the result to a separate folder.
#!/usr/bin/env ruby
require 'nokogiri'
# This script grabs the main content out of the Puppet Labs documentation
# and writes the cleaned HTML files to a new directory.
#
# To use, first download http://docs.puppetlabs.com/puppetdocs-latest.tar.gz
# Create a TEMP folder at the top level of a working directory.
# Extract puppetdocs-latest at the top level, then copy it into the TEMP directory.
#
# puppetdocs-latest             # Copy that this script overwrites with the cleaned files
# TEMP
#   |--- puppetdocs-latest      # Directory with the original data
#   |--- extract-content.rb
#   |--- html-input-files.txt   # A pruned list of HTML files
#
# To create html-input-files.txt, run this command from inside TEMP:
#   $ find puppetdocs-latest -type f -name "*.html" > html-input-files.txt
filename = 'html-input-files.txt'

File.foreach(filename) do |line|
  # Strip the trailing newline to get the path of the HTML file
  path = line.chomp
  next if path.empty?
  puts path

  # Parse the file; the block form closes the original file automatically
  doc = File.open(path) { |f| Nokogiri::XML(f) }

  # Select all of the content that has a class of primary-content.
  # Only the first match is used; the second instance is erroneous.
  content = doc.css('.primary-content').first

  # Remove the "Back to top" link at the bottom of the page.
  # The leading "." keeps the search inside the selected content, and the
  # nil checks avoid a crash on pages without such a link.
  if content
    last_link = content.xpath('.//blockquote/p/a').last
    last_link.remove if last_link && last_link.inner_html == "↑ Back to top"
  end

  # Write the cleaned content to the same relative path one directory up
  File.open(File.join('..', path), 'w') do |out|
    out.write(content.to_s)
  end

  puts path + " FINISHED"
end
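The input list that the script reads can also be built in Ruby instead of with `find`. The following is a minimal sketch, not part of the original gist: the helper names `list_html_files` and `write_input_list` are hypothetical, and it assumes the same layout as above (run from inside TEMP, with `puppetdocs-latest` next to the script). Unlike `find`, this glob pattern does not descend into dot-directories by default.

```ruby
# Hypothetical helpers (not from the original gist) that build the list of
# HTML files using only the Ruby standard library.

# Return every regular file ending in .html under +root+, sorted by path.
# Roughly equivalent to: find root -type f -name "*.html"
def list_html_files(root)
  Dir.glob(File.join(root, '**', '*.html'))
     .select { |path| File.file?(path) }
     .sort
end

# Write one path per line to +list_path+, the format the extraction script
# expects, and return the number of paths written.
def write_input_list(root, list_path)
  paths = list_html_files(root)
  File.write(list_path, paths.map { |path| "#{path}\n" }.join)
  paths.length
end
```

Calling `write_input_list('puppetdocs-latest', 'html-input-files.txt')` from inside TEMP would produce the file the script opens. The list can still be pruned by hand afterwards, as the comment in the script suggests.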