#encoding: UTF-8
require 'rubygems'
require 'open-uri' # needed for open(url) below
require 'pismo'
require 'punkt-segmenter'
require 'htmlentities'
begin
  require './markov.rb' # Markov chain generator; not included in this gist
rescue LoadError
  warn 'markov.rb not found; falling back to the minimal sketch below'
end
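# A minimal sketch of the interface markov.rb is assumed to provide
# (MarkovChainer.new(order), #add_sentence, #generate_sentence). This is a
# hypothetical stand-in, not the original file; the guard below keeps the
# real implementation whenever markov.rb loaded successfully above.
unless defined?(MarkovChainer)
  class MarkovChainer
    def initialize(order)
      @order = order                               # words of context per state
      @beginnings = []                             # sentence-opening states
      @transitions = Hash.new { |h, k| h[k] = [] } # state => next-word choices
    end

    def add_sentence(sentence)
      words = sentence.split
      return if words.size <= @order
      @beginnings << words.first(@order)
      # record every (order-word context -> following word) transition
      words.each_cons(@order + 1) { |*head, tail| @transitions[head] << tail }
    end

    def generate_sentence
      return nil if @beginnings.empty?
      state = @beginnings.sample
      out = state.dup
      # walk the chain, sampling a next word until a dead end (or a length cap)
      while !(choices = @transitions[state]).empty? && out.size < 60
        out << choices.sample
        state = out.last(@order)
      end
      out.join(' ')
    end
  end
end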
CLEAN_TEXT = <<-EOF
In the winter of 2012, I wrote a bot because I was sad. The aftermath of presidential elections can be hard on journalists — although it is even worse for the loser, of course — because after months of planning and breathing election coverage, all of our work is over in a single night. So, I will confess to a minor amount of despondency as I shut down the loaders and baked out the final version of the election pages. Is that all there is? Colleagues decamped to tropical beaches to restore themselves, one paper cocktail-umbrella at a time, but I stayed at home to look for something new to correct my drift. New projects loomed on the horizon, but it was November and who wants to start something huge just before the rush of the holidays? Which is how I began writing haiku. More precisely, it's how I came to write a computer bot inspired by haikuleaks that looked for haiku hidden within the New York Times. The process is pretty simple:
* Load the New York Times homepage and look for articles you haven't seen before.
* Pull down the text of the article and separate it into sentences.
* Tokenize each sentence into words and look up the number of syllables in each word.
* If the syllable counts match the haiku pattern, write the sentence out as a possible haiku.
There are some additional details. For instance, knowing how many syllables are in a word was not trivial, but was helped immensely by the CMU Pronouncing Dictionary. Tokenization can sometimes get confused, and there are sometimes articles we don't want to find haikus in (e.g., stories about terrorism or plane crashes). The system is also not fully automatic: a human moderates haikus found by the bot before they are published to the site. But the entire process of discovery is fully bot-driven, finding haiku in the most unexpected places and making me smile to this day.
The definition of a bot is amorphous. We generally think of bots as autonomous programs that perform discrete—and often silly—tasks, often for an indefinite period. For instance, there are the bots that pull and scramble text in unexpected ways like @TwoHeadlines, @4myrealfriends and @pentametron. Other bots might regularly pull from a remote data source like the English dictionary or ships crossing under the Tower Bridge. I've even seen web crawlers called bots these days, although this is often because of media confusion with spam botnets or other malicious actors. Generally, though, when we say "bot," we mean a mostly harmless agent that acts on behalf of its creator in some fashion but has limited ambition and basic operations.
But is that all there is? This definition is largely accurate, but it misses the point: bots are also magic. I am aware this sounds wildly hokey, but I have a deep emotional regard for my favorite bots that doesn't make much logical sense. It's not because they perform some Turing-test trickery; there's no confusing the best bots with humans. Rather, it's because they occasionally, by pure chance, output something of unexpected beauty. You know how the bot's rules work—most are only a few hundred lines of code—and yet it can still surprise you. Magic and whimsy are frustratingly elusive in computer science, often encountered in university courses or hackathons before being muscled aside by the needs of practical programming introductions. My first serious university programming textbook has a freaking wizard on the front cover, while the last few introductions to programming frameworks I've read have each taught me how to code to-do apps. Is it any wonder we find the incoherent ramblings of a bot so appealing? They remind us of who we were when we first started to program.
But this is Source and not the New York Review of Bots, so what about bots and the news? As part of Botweek, we've seen a few serious cases of bots being used for news, reporting on earthquakes or nailbiters in sports. Furthermore, Ben Welsh has been ranting about writing journalistic bots for years, and Derek Willis has written reporting triggers on Congressional bills and campaign-finance filings. But it feels like we could do so much more. The current generation of bots looks for easily defined news triggers like an earthquake alert or a bill being passed. Humans write the rules and the bots execute on them. But what about things that aren't so obvious? For instance, if an incumbent is not able to raise funds from an industry he used to rely on for backing, that's a great sign of future election troubles. Political talking points could be traced back to concerted pushes by party officials and think tanks. Outliers and aberrations in crime statistics could reflect either policing success or tampering with the numbers. These are all further rules we could potentially enumerate in our code were we exhaustive enough, but it feels like the next step is to make the bots more intelligent by adding learning to them. We need a bot with a sense of aesthetics.
This is easier said than done, though. Training a bot would be a tedious process that could go wrong in various ways. And training the bot to flag certain things doesn't mean it wouldn't miss others. But once our bots can accept feedback and adjust their behavior, they become more than mere shadows of ourselves. All of which makes for some exciting hypothetical discussions: must your bot work under the same code of ethics that covers your journalism? Yes. Are you legally responsible if your bot causes some damage? Maybe. Is there any established journalism created by learning machines? Depends on whether you count humans as learning machines. Obviously, there are a lot of details that would need to be worked out—and I can't imagine trying to run the phrase Hidden Markov Model by a copy editor—but someone will clearly do it someday. That someone could be you.
On the flip side, we must also be savvier about how bots are intruding into the areas we report on. For starters, everywhere online is polluted by bots pretending to be people. Weeding out the bots and figuring out their motives will be a priority for any project you do with social media or other online content. As more and more organizations and agencies produce terrible generated content, there will also always be a utility in figuring out how to reverse-engineer the raw data embedded within bot-written content. What better way to counter the bots out there than to write some bots of your own?
But what if you just want to write silly bots too? Excellent. Even a "useless" bot can serve a worthy purpose. It can challenge you to try some new programming techniques. It can add some welcome silliness to your Twitter timeline. Or it can just bring you a little warmth when the days are getting dark and cold around you. That's not much in the grand scheme of things, but it will suffice.
EOF
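# The haiku-detection process described in the essay above, sketched in Ruby.
# This is a hedged illustration, not part of the original bot: `syllables` is
# a caller-supplied hash of word => syllable count standing in for the CMU
# Pronouncing Dictionary lookup the essay mentions, and the helper name is
# hypothetical.
def haiku_lines(sentence, syllables, pattern = [5, 7, 5])
  words = sentence.split
  lines = pattern.map do |target|
    count = 0
    line = []
    while count < target
      word = words.shift or return nil # ran out of words mid-line
      syl = syllables[word.downcase.gsub(/\W/, '')] or return nil # unknown word
      count += syl
      line << word
    end
    return nil unless count == target # a word may not straddle a line break
    line.join(' ')
  end
  words.empty? ? lines : nil # the whole sentence must be used up
end
# Usage (hypothetical): haiku_lines(sentence, cmu_syllable_counts) returns
# three lines matching 5-7-5, or nil if the sentence doesn't scan.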
BOTWEEK_URLS = [
  'https://source.opennews.org/articles/welcome-bot-week/',
  'https://source.opennews.org/articles/how-break-news-while-you-sleep/',
  'https://source.opennews.org/articles/open-source-bot-factory/',
  'https://source.opennews.org/articles/bot-find-source-serendipity/',
  'https://source.opennews.org/articles/bot-or-be-botted/',
  'https://source.opennews.org/articles/botmaking-primer/',
  'https://source.opennews.org/en-US/articles/bots-with-thoughts/',
]
texts = BOTWEEK_URLS.map do |url|
  puts "Fetching #{url}"
  doc = open(url) # open-uri fetches the page
  Pismo::Document.new(doc, :reader => :cluster).body # extract the article text
end
markov = MarkovChainer.new(1) # chain order; changed this from 3
tokenizer = Punkt::SentenceTokenizer.new(CLEAN_TEXT) # init with corpus of all sentences
texts.each do |text|
  sentences = tokenizer.sentences_from_text(text, :output => :sentences_text)
  sentences.each do |sentence|
    # skip very short sentences; changed this threshold from 20
    markov.add_sentence(sentence) unless sentence.length < 30
  end
end
puts "<i>To close #botweek, we got an assist from <a href='https://source.opennews.org/en-US/people/jacob-harris/'>Jacob Harris</a> to <a href='https://gist.github.com/sinker/9843753'>create a our own bot.</a> Its parting thoughts:</i>" | |
10.times do # changed this from 200
  puts markov.generate_sentence
end
#markov.clear! # Clear the temporary dictionary.
puts "IF YOU HAVE READ THIS FAR, HAPPY #BOTWEEK. SEE YOU NEXT YEAR." |