Created
May 14, 2014 00:34
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal here is to train a computer to recognize an article about Distributed Denial of Service attacks and put it into a queue of articles that need to be read or processed later. The source material is the VERIS Community Database (vcdb.org) which already has several dozen records on denial of service incidents and several associated news articles about each record."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import json\n",
"import os\n",
"from datetime import datetime\n",
"import uuid\n",
"import random\n",
"from readability.readability import Document # pip install readability-lxml\n",
"import urllib\n",
"import BeautifulSoup\n",
"\n",
"random.seed('follow @bfist on Twitter')\n",
"vcdb_path = '/Users/v527234/Documents/development/python/vcdb/data/json'\n",
"\n",
"# i = getIncident('blahblahblah.json')\n",
"def getIncident(inString):\n",
"    # Use a context manager so the file handle is closed promptly\n",
"    with open(inString) as f:\n",
"        return json.load(f)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 75
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# get a list of articles about Denial of Service incidents\n",
"articles = []\n",
"for eachFile in os.listdir(vcdb_path):\n",
"    fullName = os.path.join(vcdb_path, eachFile)\n",
"    i = getIncident(fullName)\n",
"    if 'DoS' in i['action'].get('hacking', {}).get('variety', []):\n",
"        for article in i.get('reference', '').split(';'):\n",
"            articles.append(article)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 76
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Clean the URLs: drop the trailing dates and remove empty entries\n",
"articles = [value.split(' ')[0] for value in articles]\n",
"# Filter with a comprehension; popping while enumerating skips elements\n",
"articles = [value for value in articles if value != \"\"]\n",
"articles = list(set(articles))\n",
"print len(articles)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"76\n"
]
}
],
"prompt_number": 77
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Now split our articles into two pools, a training set and a testing set.\n",
"random.shuffle(articles)\n",
"half = len(articles) // 2  # explicit integer division\n",
"training = articles[:half]\n",
"testing = articles[half:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 78
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# https://pypi.python.org/pypi/readability-lxml\n",
"# http://stackoverflow.com/questions/11151000/is-there-a-way-to-use-readability-and-python-to-extract-just-text-not-html\n",
"article_text = urllib.urlopen(training[0]).read()\n",
"readable_article = Document(article_text).summary()\n",
"readable_article = BeautifulSoup.BeautifulSoup(readable_article)\n",
"readable_article = readable_article.getText()\n",
"print readable_article"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<a href=\"#skip-target\">Continue reading below</a>\n",
"Boston Children\u2019s Hospital was hit with a series of cyberattacks this week that tried to take down its website, officials said Wednesday. However, it does not appear the attackers were successful in crippling the website, and Children\u2019s said that so far no patient data or its internal systems had been compromised.\u201cOver the weekend and through today, Boston Children\u2019s Hospital\u2019s website has been the target of multiple attacks designed to bring down the site by overwhelming its capacity,\u201d spokesman Rob Graham said in a statement.Continue reading belowChildren\u2019s added it had contacted law enforcement authorities, who are now investigating the source of the attacks. No groups or individuals have claimed responsibility.Cybersecurity officials say that it is not surprising that a hospital would be targeted in a cyberattack.\u201cAfter all, a hospital knows just about everything there is to know about their patients, which is very valuable information for criminals,\u201d said Eric Cowperthwaite, vice president for advanced security and strategy at Core Security of Boston.Michael B. Farrell can be reached [email protected]. Follow him on Twitter@GlobeMBFarrell.\n"
]
}
],
"prompt_number": 88
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, so I've got an article, and readability does a reasonable job of extracting the text. Not perfect, but it's a start. Now how do I feed this and all the other articles into a machine learning monster so I can test it against my training list? It seems like I'm drowning in options and not finding a lot of documentation to tell me what I should try."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
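
The notebook's closing question (how to feed the extracted text into a classifier) has many possible answers. One of the simplest is a bag-of-words Naive Bayes model, sketched below. This is only an illustration under stated assumptions, not the notebook author's method: `tokenize`, `NaiveBayes`, the labels `"dos"` and `"other"`, and the toy sentences are all hypothetical stand-ins for real article text. Note also that a classifier needs negative examples; the notebook only collects DoS articles, so articles from non-DoS incidents would have to be gathered as well. The sketch uses only the standard library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes(object):
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        """Add one labeled document to the model."""
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = float(sum(self.doc_counts.values()))
        words = tokenize(text)
        best_label, best_score = None, None
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = float(sum(counts.values()) + len(self.vocab))
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if best_score is None or score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data standing in for extracted article text.
clf = NaiveBayes()
clf.train("attackers flooded the website with traffic to take it offline", "dos")
clf.train("botnet overwhelmed the bank site in a denial of service attack", "dos")
clf.train("laptop with patient records stolen from an employee car", "other")
clf.train("phishing email exposed employee payroll records", "other")

prediction = clf.predict("hackers flooded the hospital website with traffic")
```

In the notebook's terms, each URL in `training` would be fetched, reduced to text as in the readability cell, and passed to `train` with its label; the URLs in `testing` would then go through `predict` to measure how often the model is right.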