Created
June 19, 2009 16:57
I've just put a large chunk of my Twitter archives up on the
Talis Platform service. Talis Platform is a 'cloud'-based
triplestore hosting service. More at http://n2.talis.com

A triplestore is like a database but for graphs of RDF triples.
The cool thing about RDF and the triplestore is that you
basically have a completely schema-less datastore. You don't have
to figure out "Oh, there's integers going in this field and
strings going in that". You just upload a big pile of RDF and the
triplestore keeps it all there. This is obviously not as efficient
as using a database, so if you want to grow to Google size, it may
not be the best solution. But because it's cloud-based I don't
have to think about that either - that's up to Talis! ;)
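To make the schema-less point concrete, here is a toy in-memory version of the idea in Ruby. It is only a sketch: the TinyStore class, its methods and the status ID are made up for illustration and say nothing about how Talis Platform works internally. Every fact is a (subject, predicate, object) triple, and nil acts as a wildcard when querying.

```ruby
# Toy triplestore: a hypothetical illustration, not the Talis API.
class TinyStore
  def initialize
    @triples = []
  end

  # Any subject/predicate/object goes in; no schema is declared up front.
  def add(subject, predicate, object)
    @triples << [subject, predicate, object]
    self
  end

  # nil in any position matches anything.
  def match(subject = nil, predicate = nil, object = nil)
    pattern = [subject, predicate, object]
    @triples.select do |triple|
      pattern.zip(triple).all? { |want, have| want.nil? || want == have }
    end
  end
end

store = TinyStore.new
# Strings and booleans sit side by side; the status ID 1 is invented.
store.add("http://twitter.com/tommorris", "foaf:nick", "tommorris")
store.add("http://twitter.com/tommorris/statuses/1", "sioc:content", "Hello")
store.add("http://twitter.com/tommorris/statuses/1", "twitter:truncated", false)

nicks = store.match(nil, "foaf:nick", nil)
puts nicks.map { |t| t[2] }.inspect
```

Adding a new kind of fact later is just another call to add with a new predicate; nothing has to be migrated.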

Turning Twitter data into RDF is pretty easy. The approach I found
easiest was to use the API, which returns either XML or JSON.
I used XML as I already have an XSLT stylesheet that does most
of the work.

Prerequisites:
* a Unix-based OS
* curl
* xsltproc
* Ruby 1.8.6+ (or JRuby 1.3.0)
* nokogiri gem

I had old archive data from Twitter, from back when it used the old
archive method. In that data, tweets that are at-replies to other
tweets only have the ID of the other user, not the screen name. But
the URI of a tweet is constructed from the screen name, so you need
to look up the screen name from the ID using the
/users/show.xml?user_id=(val) method. The code to do that is in
transform.rb.

transform.rb is a bit of a lazy hack. If you run it over old archive
data, it WILL crash. That's because open-uri raises an exception when
it gets a 404 status. Silly really, as 404 is a perfectly valid status,
and is semantically meaningful. <http://twitter.com/tommorris>
returning 404 means there is no @tommorris on Twitter. ;)
When it hit a 404, I took whatever number it returned and manually
grepped for it in the file, figured out who the at-reply was to and
then added that person's details to the YAML file.
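One way to avoid the crash is to rescue the exception open-uri raises and treat a 404 as "no such user". This is a rough sketch, not what transform.rb does: fetch_or_nil and its fetcher argument are names invented here, the user IDs are placeholders, and the check relies on open-uri putting the HTTP status line (such as "404 Not Found") in the exception message.

```ruby
require "open-uri"
require "stringio"

# Hypothetical helper: returns the response body, or nil on a 404.
# fetcher is any callable that takes a URL and returns a String; in real use
# it could wrap open-uri (Kernel#open on Ruby 1.8, URI.open on newer Rubies).
def fetch_or_nil(url, fetcher)
  fetcher.call(url)
rescue OpenURI::HTTPError => e
  # Anything other than a 404 is still a real error, so re-raise it.
  raise unless e.message.start_with?("404")
  nil
end

# Stand-in fetchers so the sketch runs without touching the network.
found   = lambda { |url| "<user><screen_name>tommorris</screen_name></user>" }
missing = lambda { |url| raise OpenURI::HTTPError.new("404 Not Found", StringIO.new) }

p fetch_or_nil("http://twitter.com/users/show.xml?user_id=1", found)
p fetch_or_nil("http://twitter.com/users/show.xml?user_id=2", missing)
```

A caller can then skip the at-reply, or log the ID for manual lookup, whenever nil comes back.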

The XSLT used is below, but if you want to do this yourself I
recommend waiting a few days. I'm planning on rewriting the XSLT a
bit soon to make it suck less. The code is in twitter-rdf.xsl

As for actually doing the transformations and loading them into the
Talis store, I used IRB (interactive Ruby shell) to invoke xsltproc
and curl.

irb> `ls *.xml`.split.each{|i| `xsltproc ~/Code/twitter-rdf.xsl #{i} > #{i.split('.')[0] + ".rdf"}` }
irb> `ls *.rdf`.split.each{|i| `curl -v --digest -u "(username):(password)" --retry 10 --retry-delay 10 -H "Content-Type:application/rdf+xml" --data @#{i} http://api.talis.com/stores/(storename)/meta` }
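The two one-liners interpolate filenames straight into a shell string, which is fine at an irb prompt but fragile for odd filenames. Here is a sketch of the same two steps as a script that builds argument lists for system instead. The helper names are invented, the credentials and store name are still placeholders, and nothing is executed until the commented-out loops at the bottom are enabled.

```ruby
# Hypothetical helpers; the options mirror the irb lines above.

# foo.xml -> foo.rdf
def rdf_name(xml_file)
  File.basename(xml_file, ".xml") + ".rdf"
end

# xsltproc's -o option writes the result to a file, replacing the shell ">".
def xslt_command(xml_file, stylesheet)
  ["xsltproc", "-o", rdf_name(xml_file), stylesheet, xml_file]
end

def upload_command(rdf_file, user, password, store)
  ["curl", "-v", "--digest", "-u", "#{user}:#{password}",
   "--retry", "10", "--retry-delay", "10",
   "-H", "Content-Type:application/rdf+xml",
   "--data", "@#{rdf_file}",
   "http://api.talis.com/stores/#{store}/meta"]
end

# Print one command to check it before running anything.
p xslt_command("12.xml", "twitter-rdf.xsl")

# stylesheet = File.expand_path("~/Code/twitter-rdf.xsl")
# Dir["*.xml"].each { |f| system(*xslt_command(f, stylesheet)) }
# Dir["*.rdf"].each { |f| system(*upload_command(f, "(username)", "(password)", "(storename)")) }
```

Passing an array to system skips the shell entirely, so spaces or quotes in a filename cannot change the command.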
require "rubygems"
require "nokogiri"
require "open-uri"
require "yaml"

# Look up the screen name for a numeric user ID, caching results in hash.
# open-uri raises OpenURI::HTTPError on a 404, which is not rescued here.
def username_lookup(val, hash)
  if hash[val].nil?
    print "-- #{val}"
    doc = Nokogiri::XML(open("http://twitter.com/users/show.xml?user_id=#{val}").read)
    screenname = doc.search("screen_name")[0].content.to_s
    print " = #{screenname}\n"
    hash[val] = screenname
    sleep 30 # so as not to exceed the Twitter API limit
  end
  hash[val]
end

(1..76).each do |f|
  hash = YAML.load_file("/home/tom/twitter_usernames.yml")
  puts "Processing #{f}.xml"
  path = "/home/tom/twitter_archive/#{f}.xml"
  origarchive = Nokogiri::XML(File.read(path))

  # Add a screen name to at-replies that only carry a user ID
  origarchive.search("status").each do |status|
    reply_id = status.search("in_reply_to_user_id")[0]
    next if reply_id.content == "" || status.search("in_reply_to_screen_name").size > 0
    newnode = Nokogiri::XML::Node.new("in_reply_to_screen_name", origarchive)
    newnode.content = username_lookup(reply_id.content.to_s, hash)
    reply_id.add_next_sibling(newnode)
  end

  # Block form closes (and flushes) each file once it has been written
  File.open(path, "w") { |out| origarchive.root.write_to(out) }
  File.open("/home/tom/twitter_usernames.yml", "w") { |out| YAML.dump(hash, out) }
end
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:twitter="http://rdf.opiumfield.com/twitter/0.1/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:template match="text()"/>
  <xsl:param name="username"/>
  <xsl:template match="users">
    <rdf:RDF>
      <rdf:Description rdf:about="">
        <foaf:primaryTopic rdf:resource="http://twitter.com/{$username}"/>
      </rdf:Description>
      <foaf:Agent rdf:about="http://twitter.com/{$username}">
        <xsl:apply-templates select="user" mode="link"/>
        <rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{$username}/rdf/history" />
      </foaf:Agent>
      <xsl:apply-templates select="user" mode="details"/>
    </rdf:RDF>
  </xsl:template>
  <xsl:template match="user" mode="link">
    <foaf:knows rdf:resource="http://twitter.com/{screen_name}"/>
  </xsl:template>
  <xsl:template match="user" mode="details">
    <foaf:Agent rdf:about="http://twitter.com/{screen_name}">
      <foaf:nick>
        <xsl:value-of select="screen_name"/>
      </foaf:nick>
      <foaf:name>
        <xsl:value-of select="name"/>
      </foaf:name>
      <xsl:if test="string-length(url) > 0">
        <foaf:homepage rdf:resource="{url}"/>
      </xsl:if>
      <xsl:if test="status">
        <foaf:made rdf:resource="http://twitter.com/{screen_name}/statuses/{status/id}" />
      </xsl:if>
      <rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{screen_name}/rdf"/>
      <rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{screen_name}/rdf/history" />
    </foaf:Agent>
    <xsl:if test="status">
      <xsl:apply-templates select="status" />
    </xsl:if>
  </xsl:template>
  <xsl:template match="statuses[@type='array']">
    <rdf:RDF>
      <xsl:apply-templates select="status" />
    </rdf:RDF>
  </xsl:template>
  <xsl:template match="status">
    <xsl:variable name="screen_name">
      <xsl:choose>
        <xsl:when test="../screen_name">
          <xsl:value-of select="../screen_name" />
        </xsl:when>
        <xsl:when test="user/screen_name">
          <xsl:value-of select="user/screen_name" />
        </xsl:when>
      </xsl:choose>
    </xsl:variable>
    <sioc:Post rdf:about="http://twitter.com/{$screen_name}/statuses/{id}">
      <rdf:type rdf:resource="http://rdfs.org/sioc/types#MicroblogPost" />
      <sioc:content xml:lang="en">
        <xsl:value-of select="text"/>
      </sioc:content>
      <xsl:if test="truncated/text() = 'true'">
        <twitter:truncated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</twitter:truncated>
      </xsl:if>
      <xsl:choose>
        <xsl:when test="in_reply_to_screen_name/text() != '' and in_reply_to_status_id/text() != ''">
          <sioc:reply_to>
            <sioc:Post rdf:about="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}/statuses/{normalize-space(in_reply_to_status_id/text())}">
              <foaf:maker>
                <foaf:Agent>
                  <foaf:weblog rdf:resource="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}" />
                </foaf:Agent>
              </foaf:maker>
            </sioc:Post>
          </sioc:reply_to>
        </xsl:when>
        <xsl:when test="in_reply_to_screen_name/text() != ''">
          <sioc:reply_to>
            <rdf:Description>
              <foaf:maker>
                <foaf:Agent>
                  <foaf:weblog rdf:resource="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}" />
                </foaf:Agent>
              </foaf:maker>
            </rdf:Description>
          </sioc:reply_to>
        </xsl:when>
      </xsl:choose>
      <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
        <xsl:value-of select="substring(created_at, 27, 4)"/>
        <xsl:text>-</xsl:text>
        <xsl:if test="substring(created_at, 5, 3) = 'Jan'">
          <xsl:text>01</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Feb'">
          <xsl:text>02</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Mar'">
          <xsl:text>03</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Apr'">
          <xsl:text>04</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'May'">
          <xsl:text>05</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Jun'">
          <xsl:text>06</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Jul'">
          <xsl:text>07</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Aug'">
          <xsl:text>08</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Sep'">
          <xsl:text>09</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Oct'">
          <xsl:text>10</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Nov'">
          <xsl:text>11</xsl:text>
        </xsl:if>
        <xsl:if test="substring(created_at, 5, 3) = 'Dec'">
          <xsl:text>12</xsl:text>
        </xsl:if>
        <xsl:text>-</xsl:text>
        <xsl:value-of select="substring(created_at, 9, 2)"/>
        <xsl:text>T</xsl:text>
        <xsl:value-of select="substring(created_at, 12, 8)"/>
        <xsl:text>Z</xsl:text>
      </dcterms:created>
      <dcterms:source rdf:resource="http://twitter.com/{$screen_name}"/>
      <foaf:maker>
        <foaf:Agent>
          <foaf:weblog rdf:resource="http://twitter.com/{$screen_name}"/>
          <xsl:if test="user/url/text() != ''">
            <foaf:homepage rdf:resource="{user/url}" />
          </xsl:if>
        </foaf:Agent>
      </foaf:maker>
    </sioc:Post>
  </xsl:template>
</xsl:stylesheet>
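The dcterms:created block in the stylesheet turns Twitter's created_at string into an xsd:dateTime by fixed character positions. For spot-checking the stylesheet's output, the same slicing can be sketched in Ruby. The method name and the sample timestamp are made up here; like the XSLT, the sketch assumes the offset in created_at is always +0000.

```ruby
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec]

# Hypothetical helper mirroring the XSLT substring() calls.
# XPath substring() counts from 1, Ruby string slices count from 0.
def created_at_to_xsd(created_at)
  year  = created_at[26, 4]
  month = MONTHS.index(created_at[4, 3]) + 1
  day   = created_at[8, 2]
  time  = created_at[11, 8]
  format("%s-%02d-%sT%sZ", year, month, day, time)
end

puts created_at_to_xsd("Fri Jun 19 16:57:00 +0000 2009")
# prints 2009-06-19T16:57:00Z
```

An unknown month abbreviation makes MONTHS.index return nil and the method raise, whereas the stylesheet would silently emit an empty month.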