V David Zvenyach vdavez

FIS Scraper

Currently, the CFO makes fiscal impact statements (FIS) available at http://app.cfo.dc.gov/services/fiscal_impact/search.asp. But the site offers no bulk data, so I scraped it.

Once the PDFs were obtained, they were converted to text, and the text was inserted back into the JSON. Final result: searchable JSON.
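The merge step above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the `pdf` and `text` keys, the rule that `report.pdf` maps to `report.txt`, and the function names `attach_text` and `search` are all assumptions made for the example.

```python
import os


def attach_text(records, text_dir, pdf_key="pdf", text_key="text"):
    """Return copies of the records with the extracted text added under text_key."""
    merged = []
    for record in records:
        # Hypothetical naming rule: report.pdf -> report.txt inside text_dir.
        stem = os.path.splitext(os.path.basename(record[pdf_key]))[0]
        txt_path = os.path.join(text_dir, stem + ".txt")
        enriched = dict(record)  # copy so the input records are left untouched
        if os.path.exists(txt_path):
            with open(txt_path, encoding="utf-8") as handle:
                enriched[text_key] = handle.read()
        else:
            # Mark records whose PDF was never converted instead of failing.
            enriched[text_key] = None
        merged.append(enriched)
    return merged


def search(records, term, text_key="text"):
    """Case-insensitive substring search over the attached text."""
    term = term.lower()
    return [r for r in records if r.get(text_key) and term in r[text_key].lower()]
```

Calling `search(attach_text(records, "txt"), "revenue")` would then return every record whose extracted text mentions the term.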

Get the laws

There's a great crosswalk of all of the laws here: https://raw2.github.com/openlawdc/browser/gh-pages/js/dc_laws.js

Then, using the helpful scraper put together by Sunlight, I'll be able to get the Enrolled version of all of the bills.

Then, I'll wget the files and upload them to S3.

Converting a Word Document to Markdown in Two Moves

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

The Solution

As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, it doesn't work in the other direction.
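For reference, the direction that does work (markdown into .docx) can be driven from a script. The sketch below only builds and runs a pandoc command line; the helper names `pandoc_command` and `convert` are made up for this example, and it assumes pandoc is installed and on the PATH.

```python
import shutil
import subprocess


def pandoc_command(source, target, from_format="markdown", to_format="docx"):
    """Build the pandoc argument list for converting source into target."""
    return ["pandoc", "-f", from_format, "-t", to_format, "-o", target, source]


def convert(source, target):
    """Run pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.check_call(pandoc_command(source, target))
```

For example, `convert("notes.md", "notes.docx")` runs `pandoc -f markdown -t docx -o notes.docx notes.md`.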

	[
	{
	"date": "Tuesday, Jan 7, 2014 at 1:06 p.m.",
	"url": "/shows/2014-01-07/al-qaidas-new-rise-middle-east",
	"summary": "Iraqi armed forces are battling militants to reclaim control of the city of Fallujah in Iraq's Anbar province. For the first time since U.S. forces defeated insurgents in 2006-2007, the region bordering war-torn Syria has become a hub for an al Qaida affiliate called the Islamic State of Iraq and Syria. Experts join Kojo to understand the rise of militancy in Iraq and its traces in neighboring countries like Syria.",
	"guests": [
	{
	"credentials": "Vice president, Middle East Institute",
	"guest": "Paul Salem"
	},

	cd ~
	sudo apt-get update
	sudo apt-get install openjdk-7-jre-headless -y

	### Check http://www.elasticsearch.org/download/ for latest version of ElasticSearch and replace wget link below

	# NEW WAY / EASY WAY
	wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.0.deb
	sudo dpkg -i elasticsearch-1.3.0.deb

	#!/usr/bin/env python

	import re
	import os
	import glob
	import json
	import pymongo
	from pymongo import MongoClient
	import shutil

	/*Known Bugs
	[x] It's probably better to build a function "inRecess" to test whether a House is in recess for more than three days (ex. Aug. 12)
	[ ]
	*/

	var _holidays = {
	'M': {//Month, Day
	'01/01': "New Year's Day",
	'07/04': "Independence Day",
	'11/11': "Veterans Day",

	{
	"congress": [
	"2013-1-1",
	"2013-1-2",
	"2013-1-3",
	"2013-1-4",
	"2013-1-21",
	"2013-1-22",
	"2013-1-23",
	"2013-1-29",
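The "inRecess" test mentioned in the Known Bugs note could look something like the sketch below, written in Python rather than JavaScript. It assumes the dates listed under "congress" are days in session and reads "in recess for more than three days" as "inside a run of more than three consecutive non-session days". Both are assumptions, and the names `parse_day` and `in_recess` are invented for the example.

```python
from datetime import date


def parse_day(text):
    """Parse unpadded 'YYYY-M-D' strings such as '2013-1-4'."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)


def in_recess(day, session_days, min_gap=3):
    """True if day falls in a gap of more than min_gap days between sessions."""
    sessions = sorted(parse_day(s) for s in session_days)
    for previous, following in zip(sessions, sessions[1:]):
        if previous < day < following:
            # Count only the non-session days strictly between the two sessions.
            return (following - previous).days - 1 > min_gap
    # Session days themselves, and dates outside the listed range, are not
    # treated as recess in this sketch.
    return False
```

With the dates shown above, `in_recess(date(2013, 1, 10), days)` is true because January 10 sits in the gap between the January 4 and January 21 sessions.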

	#!/usr/bin/env python
	import os
	import mechanize
	import cookielib
	import json
	import urllib

	#initialize outfile
	out = open('glob.html', 'w')

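	#!/usr/bin/env python
	import re
	from subprocess import call

	##This is the definition for the function to return the dollar value. But it doesn't work because the D&F formats are inconsistent

	def dandftext(url):
	    # Keep the third backslash-separated field, which is used as the PDF file name
	    url = re.split('\\\\',url)[2]
	    # Download the D&F PDF and convert it to text with pdftotext
	    call('wget http://app.ocp.dc.gov/intent_award/D_F/' + url, shell=True)
	    call('pdftotext ' + url, shell=True)
	    # Escape the dot so that only a literal ".pdf" is split off
	    url_text = re.split(r'\.pdf', url)[0] + '.txt'
	    df = open(url_text,'r')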