Skip to content

Instantly share code, notes, and snippets.

@nitrogenlogic
Last active October 23, 2016 00:43
Show Gist options
  • Save nitrogenlogic/5200766 to your computer and use it in GitHub Desktop.
Save nitrogenlogic/5200766 to your computer and use it in GitHub Desktop.
This quick and dirty script imports posts and images exported by the Posterous backup feature into Octopress. Requires the escape_utils and nokogiri gems. Doesn't import comments.See comments below the gist for more instructions.
#!/usr/bin/env ruby
# This quick and dirty script imports posts and images exported by the
# Posterous backup feature into Octopress. Requires the escape_utils and
# nokogiri gems. Doesn't import comments.
#
# Videos and images are copied into a post-specific image directory used
# by my customized Octopress setup. Encoded videos are downloaded from
# Posterous. Images will probably need to be compressed/optimized afterward.
#
# Links to other posts in the same import will try to be converted. You will
# need to edit the generate_* functions below if your permalink format is
# different from /:year/:month/:day/:title/.
#
# Links, images, videos, special characters/question marks, etc. should be
# verified after running this script.
#
# Posterous seems to have broken any UTF-8 characters in the exported
# wordpress_export_1.xml, but you can work around this by concatenating all the
# *.xml files under posts/ and replacing all <item> tags in
# wordpress_export_1.xml with the concatenated <item> tags from posts/*.xml.
# You may also want to remove all CR characters from the .xml file first.
#
# Run from the base directory of your Octopress setup.
#
# Usage:
# cd [octopress_base_dir]
# ./posterous_import.rb /path/to/wordpress_export_1.xml [base_path]
# ./posterous_import.rb --links /path/to/wordpress_export_1.xml [base_path]
#
# base_path is the base path of your blog's URLs (e.g. '/' or '/blog').
#
# The --links invocation generates a directory and index.html under source/ for
# each Posterous permalink, allowing an old Posterous domain to be setup with
# 301 redirects to new post locations. The --links invocation does not import
# any posts. This is useful if you use a permalink format that differs from
# Posterous's (which is the default behavior).
#
# This script is not guaranteed to work with any Posterous archive other than
# my own. Do what you want with this script; attribution is appreciated, but
# optional. Comments and corrections are welcome.
#
# In hindsight it may have been easier to fix up the archived HTML posts or
# individual XML files instead of using the RSS feed.
#
# Created 2013 by Mike Bourgeous - Released under CC0
require 'rss'
require 'yaml'
require 'fileutils'
require 'escape_utils'
require 'nokogiri'
# Fixes references to Posterous in document tags of the given type. Only
# attributes that appear to contain a Posterous URL will be processed.
#
# If no block is given, tries to find a file matching the tag's attribute under
# [srcdir], or if [srcdir] is nil, downloads the URI contained in [attr]. The
# matching file, if one is found, will be copied into [destdir], and the tag's
# [attr] attribute changed to point at [serverdir]/filename. Posterous image
# name abbreviation is taken into account, but this has not been tested with a
# wide variety of names.
#
# If a block is given, the block will be called once for each matching tag and
# the contents of its [attr] attribute, and the return value of the block used
# to replace the tag's [attr] attribute.
#
# After the attribute is updated, an immediately surrounding <a> tag linking to
# Posterous, if one exists, will be removed.
#
# doc - The parsed Nokogiri document.
# srcdir - The directory in which to find replacement files, or nil to download
# the originals.
# destdir - The directory to which to copy replacement files.
# serverdir - The name of destdir on the server (used for updating image tags).
# tag - The name of the tags to update.
# attr - The attribute of the tags to update.
def fix_sources doc, srcdir, destdir, serverdir, tag='img', attr='src', &bl
puts "\tFixing #{tag} tags' #{attr} attribute"
tags = doc.css(tag)
postregex = %r{https?://[^/]*posterous.com/}
tags.each do |img|
next unless img[attr] =~ postregex
shortname = img[attr].split('/').last.split('.scaled').first
ext = shortname.split('.').last.downcase
puts "\t#{tag}: #{shortname}"
if block_given?
img[attr] = yield img, img[attr]
else
if srcdir == nil
# Download the file
puts "\t\tDownloading #{shortname}"
File.open(File.join(destdir, shortname), "w") do |file|
file.write(URI.parse(img[attr]).read)
end
in_img = shortname
else
# Find matching files
matches = Dir.entries(srcdir).select {|imgfile|
imgfile.downcase.end_with?(ext) &&
imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
}
if matches.length == 0
matches = Dir.entries(srcdir).select {|imgfile|
imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
}
if matches.length == 0
puts "\n\n\n########\nNo match found for #{img[attr]} in #{srcdir}\n########\n\n"
next
end
end
if matches.length > 1
reduced = matches.select {|imgfile|
imgfile.include?(shortname)
}
if reduced.length == 1
matches = reduced
else
puts "\n\n\n########\nMore than one match found for #{shortname}:"
puts matches
puts "You will need to double-check #{tag} tags in #{filename}\n\n"
end
end
in_img = matches.first
puts "\t\tUsing #{in_img} for #{shortname}"
# Copy the file into the destination directory
FileUtils.cp(File.join(srcdir, in_img), destdir)
end
# Update the tag's attribute
img[attr] = EscapeUtils.escape_uri(File.join(serverdir, in_img))
end
# Remove a link wrapping the image, if one exists
parent = img.parent
if parent.node_name == 'a' && parent['href'] =~ postregex
puts "\t\tRemoving parent link: #{parent['href']}"
parent.replace(img)
end
end
end
# Writes each item from the given RSS feed into ./source/_posts (use Dir.chdir
# to change directories first if necessary). Posts will be marked as
# unpublished if the post's link starts with '/private/'.
#
# rss - The File containing the RSS feed. The images will be found relative to
# the feed.
# basedir - The server directory in which the blog's posts and images/
# directory reside.
def generate_posts rss_file, basedir='/'
basedir = "/#{basedir}" unless basedir.start_with? '/'
basedir = "#{basedir}/" unless basedir.end_with? '/'
dir = File.dirname(File.expand_path(rss_file))
rss = File.read(rss_file)
feed = RSS::Parser.parse(rss, false)
item_map = Hash[*feed.items.map{|item|
link = item.link.split('/').last
[link, {:item => item, :filename => item.pubDate.strftime("source/_posts/%Y-%m-%d-#{link}.html")}]
}.flatten]
feed.items.each do |item|
post_uri = URI.parse(item.link)
permalink = item.link.split('/').last
filename = item_map[permalink][:filename]
date = item.pubDate
header = {
'layout' => "post",
'title' => item.title,
'date' => date,
'comments' => true,
'categories' => item.categories.select{|cat| cat.domain == "tag"}.map{|cat| cat.content},
'published' => !post_uri.path.start_with?('/private/')
}
puts "Generating #{filename}#{header['published'] ? '' : ' (unpublished)'}"
imgdir = "source/images/#{date.strftime('%Y/%m/%d')}/#{permalink}/"
serverdir = '/' + imgdir.split('/', 2).last
FileUtils.mkdir_p(imgdir)
outfile = File.new(filename, "w")
outfile.puts header.to_yaml
outfile.puts "---"
# Fix up images and video
html = Nokogiri::HTML("<div id=\"import_#{permalink}\">#{EscapeUtils.unescape_html(item.content_encoded)}</div>")
images = html.css('img')
fix_sources html, date.strftime("#{dir}/image/%Y/%m"), imgdir, serverdir
fix_sources html, nil, imgdir, serverdir, 'source'
fix_sources html, nil, nil, nil, 'video', 'poster' do nil end
# Fix up links to other posts
fix_sources html, nil, nil, nil, 'a', 'href' do |tag, href|
link_uri = URI.parse(href)
next unless post_uri.host == link_uri.host
link_shortname = href.split('/').last.split('#').first
if item_map.include? link_shortname
link = item_map[link_shortname][:item]
href = link.pubDate.strftime("#{basedir}%Y/%m/%d/#{link_shortname}/")
href += "##{link_uri.fragment}" if link_uri.fragment
puts "\t\tUsing #{link.title} (#{href})"
else
puts "\t######## No match found for #{href}"
end
href
end
outfile.puts html.css("div#import_#{permalink}").first.children.map{|node| node.to_html}.join
outfile.close
end
nil
end
# Generates a redirecting link from the permalink of each item from the given
# RSS feed to the corresponding post generated by generate_posts().
#
# rss - The File containing the RSS feed.
# basedir - The server directory in which the blog's posts and images/
# directory reside.
def generate_links rss_file, basedir='/'
basedir = "/#{basedir}" unless basedir.start_with? '/'
basedir = "#{basedir}/" unless basedir.end_with? '/'
dir = File.dirname(File.expand_path(rss_file))
rss = File.read(rss_file)
feed = RSS::Parser.parse(rss, false)
item_map = Hash[*feed.items.map{|item|
link = item.link.split('/').last
[link, {:item => item, :filename => item.pubDate.strftime("source/#{link}/index.html")}]
}.flatten]
feed.items.each do |item|
post_uri = URI.parse(item.link)
permalink = item.link.split('/').last
filename = item_map[permalink][:filename]
dirname = File.dirname(filename)
href = item.pubDate.strftime("#{basedir}%Y/%m/%d/#{permalink}/")
title = item.title
FileUtils.mkdir_p(dirname)
outfile = File.new(filename, "w")
outfile.write <<-HTML
<!DOCTYPE html>
<html>
<head>
<title>#{title}</title>
<meta http-equiv="Refresh" content="0; url=#{href}">
<link href="#{basedir}stylesheets/screen.css" rel="stylesheet" type="text/css">
</head>
<body>
<a style="color: inherit; text-decoration: none" href="#{href}">#{title}</a>
</body>
</html>
HTML
outfile.close
end
nil
end
if __FILE__ == $0
raise 'No RSS feed given' unless $ARGV.length > 0
if $ARGV[0] == '--links'
raise 'No RSS feed given' unless $ARGV.length > 1
generate_links $ARGV[1], $ARGV[2] || '/'
else
generate_posts $ARGV[0], $ARGV[1] || '/'
end
end
@nitrogenlogic
Copy link
Author

A bit of manual work still has to be done before and after with my script, such as gathering the XML files and minifying images. To generate the single XML file my script would need (since wordpress_export_1.xml has replaced all UTF-8 characters with question marks), I would do something like this:

cd /path/to/space-[numbers, name, etc.]
cat head.xml posts/*.xml > fixed_export.xml
echo '</channel></rss>' >> fixed_export.xml
cd /path/to/new/blog
./posterous_import.rb /path/to/space-[numbers, name, etc.]/fixed_export.xml

@ivanromanko
Copy link

Thanks a lot! This great script saved me a lot of time!

@marcamillion
Copy link

This seems to be giving me a problem with Posterous posts that were archived. What seems to be happening is it is reading the wordpress_export_1.xml file, and that is referencing a post in 2010-05, but the earliest date in the images directory is 2010-07.

Not quite sure how to approach this.

Thoughts?

@marcamillion
Copy link

Ok here is something else I have learned....this is an example of one of a snippet from 1 of my posts:

<h3>Know when to change tables - by Tony Hsieh (CEO of Zappos)</h3>
<div class='post_info'>
<span class='post_time'>June 21 2010, 11:46 PM</span>
<span class='author'>&nbsp;by Marc Gayle</span>
</div>
</div>
<div class='post_body'><p><div class='p_embed p_image_embed'>
<img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>
</div>

The filename of the image, is also specified in the fixed_exports.xml as can be seen here:

<content:encoded><![CDATA[<p><div class='p_embed p_image_embed'>
<img alt="Media_httpfarm3static_mayii" height="375" src="http://getfile4.posterous.com/getfile/files.posterous.com/import-yfku/JEptCojDvjcGozqkthctiGidGfysDAhpicfjqplvoaatkwFHqezzfJlyuBnl/media_httpfarm3static_mAyIi.jpg.scaled500.jpg" width="500" />

This is the error that parsing this file generated:

Generating source/_posts/2010-06-22-know-when-to-change-tables-by-tony-hsieh-ceo-of-zappos.html
    Fixing img tags' src attribute
    img: media_httpfarm3static_mAyIi.jpg
/Dropbox/My Blog/posterous_import.rb:101:in `open': No such file or directory - /Dropbox/My Blog/Marc Gayle/image/2010/06 (Errno::ENOENT)
    from /Dropbox/My Blog/posterous_import.rb:101:in `entries'
    from /Dropbox/My Blog/posterous_import.rb:101:in `block in fix_sources'

So the trick is, when the image is not found at the default image/year/month/day path, to either search the directory structure for the filename, or to actually find the path within the individual html file included in the archive - in this case <img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>.

Any thoughts on the best way to approach this?

@marcamillion
Copy link

For what it's worth, I have forked this and updated it to fix the issues I was having.

https://gist.github.com/marcamillion/5493177

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment