@vdavez
vdavez / blob.json
Last active August 29, 2015 13:57
Kojo Nnamdi Show Scraper
[
{
"date": "Tuesday, Jan 7, 2014 at 1:06 p.m.",
"url": "/shows/2014-01-07/al-qaidas-new-rise-middle-east",
"summary": "Iraqi armed forces are battling militants to reclaim control of the city of Fallujah in Iraq's Anbar province. For the first time since U.S. forces defeated insurgents in 2006-2007, the region bordering war-torn Syria has become a hub for an al Qaida affiliate called the Islamic State of Iraq and Syria. Experts join Kojo to understand the rise of militancy in Iraq and its traces in neighboring countries like Syria.",
"guests": [
{
"credentials": "Vice president, Middle East Institute",
"guest": "Paul Salem"
},
vdavez / es.sh
Last active August 29, 2015 13:57
cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y
### Check http://www.elasticsearch.org/download/ for latest version of ElasticSearch and replace wget link below
# NEW WAY / EASY WAY
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.0.deb
sudo dpkg -i elasticsearch-1.3.0.deb
vdavez / README.MD
Last active August 29, 2015 13:55
FIS Scraper

FIS Scraper

Currently, the CFO makes fiscal impact statements (FIS) available at http://app.cfo.dc.gov/services/fiscal_impact/search.asp. But the site offers no bulk data, so this project scrapes it.

Once the PDFs were obtained, they were converted to plain text, and that text was inserted back into the JSON. The final result is searchable JSON.
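
The merge step described above might look roughly like the following. This is only a sketch, not the scraper's actual code: it assumes the PDFs have already been converted to `.txt` files (for example with a tool such as `pdftotext`), that each record names its PDF under a hypothetical `pdf` key, and that the text is stored under a hypothetical `text` key.

```python
import json
from pathlib import Path

def attach_text(records, text_dir, file_key="pdf", text_key="text"):
    """Copy each record, adding the contents of its matching .txt file.

    The key names are assumptions for illustration. Records without a
    matching text file get an empty string.
    """
    text_dir = Path(text_dir)
    merged = []
    for record in records:
        # "report.pdf" is expected to have been converted to "report.txt"
        txt_path = text_dir / (Path(record[file_key]).stem + ".txt")
        text = txt_path.read_text(encoding="utf-8") if txt_path.exists() else ""
        merged.append({**record, text_key: text})
    return merged

def search(records, term, text_key="text"):
    """Return the records whose text contains the term, ignoring case."""
    term = term.lower()
    return [r for r in records if term in r.get(text_key, "").lower()]
```

The merged list can then be written out with `json.dump`, and `search(merged, "budget")` gives a simple case-insensitive lookup over the extracted text.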

vdavez / extractenrolled.py
Created January 21, 2014 22:30
The process by which I built the openlims list of enrolled bills
#!/usr/bin/env python
import re
import os
import glob
import json
import pymongo
from pymongo import MongoClient
import shutil
vdavez / readme.md
Created January 13, 2014 15:34
Scrape the DC Laws
vdavez / eff_date.js
Last active January 1, 2016 12:09
Automatically calculate the effective date of a DC Law! To see it in action: http://jsfiddle.net/g5mYh/2/
/*Known Bugs
[x] It's probably better to build a function "inRecess" to test whether a House is in recess for more than three days (ex. Aug. 12)
[ ]
*/
var _holidays = {
'M': {//Month, Day
'01/01': "New Year's Day",
'07/04': "Independence Day",
'11/11': "Veterans Day",
vdavez / both_in_session_days.json
Last active January 1, 2016 11:09
The days that Congress is in session (built using the script below & at http://jsfiddle.net/YV8B3/). This will be updated in a few days to add a neat feature using http://beta.congress.gov/congressional-record/browse-by-date/ to automatically populate the days in session...
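
The dates in the file below are written without zero padding (for example `2013-1-1`) under a `congress` key. As a minimal sketch of how such a file could be consumed, assuming that layout holds for the whole file, the following Python turns the strings into `date` objects for membership checks. The function names are hypothetical and are not part of the original script.

```python
import json
from datetime import date

def parse_session_day(value):
    """Parse a 'YYYY-M-D' string (no zero padding) into a date."""
    year, month, day = (int(part) for part in value.split("-"))
    return date(year, month, day)

def load_session_days(raw_json, key="congress"):
    """Return the set of in-session dates listed under the given key."""
    return {parse_session_day(v) for v in json.loads(raw_json)[key]}

# A few entries copied from the start of the file, for illustration only.
sample = '{"congress": ["2013-1-1", "2013-1-2", "2013-1-3", "2013-1-4", "2013-1-21"]}'
session_days = load_session_days(sample)
```

A check such as `date(2013, 1, 3) in session_days` then answers whether Congress was in session on a given day.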
{
"congress": [
"2013-1-1",
"2013-1-2",
"2013-1-3",
"2013-1-4",
"2013-1-21",
"2013-1-22",
"2013-1-23",
"2013-1-29",
vdavez / docx2md.md
Last active June 17, 2024 19:40
Convert a Word Document into MD

Converting a Word Document to Markdown in Two Moves

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

The Solution

As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, it doesn't work in the other direction.

vdavez / get_decisions.py
Created October 15, 2013 10:31
Can anyone figure out why this doesn't work? A sample record of the JSON data referred to: { "description": "Opinion", "url": "GetDoc.asp?Database=CAB_DOCS&docnum=25884&version=1&minLevel=0", "date_filed": "10/9/2013", "case_number": "P-0943", "file_size": "48222", "row_id": "0" },
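
As an aid to debugging, a record of this shape can be inspected with the standard library alone. The sketch below is Python 3 (the script itself relies on Python 2 modules such as `cookielib`) and only parses the sample record; it does not fetch anything, and the helper name `doc_params` is hypothetical.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Sample record copied from the description above (trailing comma dropped).
record_json = '''{ "description": "Opinion", "url": "GetDoc.asp?Database=CAB_DOCS&docnum=25884&version=1&minLevel=0", "date_filed": "10/9/2013", "case_number": "P-0943", "file_size": "48222", "row_id": "0" }'''

def doc_params(record):
    """Return the query parameters of a record's relative URL as a flat dict."""
    query = urlsplit(record["url"]).query
    # parse_qs returns a list per key; each key appears once here
    return {key: values[0] for key, values in parse_qs(query).items()}

record = json.loads(record_json)
params = doc_params(record)
```

Note that the `url` field is relative, so it has to be joined to the site's base address before it can be requested.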
#!/usr/bin/env python
import os
import mechanize
import cookielib
import json
import urllib
#initialize outfile
out = open('glob.html', 'w')
vdavez / extract_df.py
Created October 10, 2013 01:48
Scrape the D&Fs for the dc-contracts
#!/usr/bin/env python
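import re
from subprocess import call

## This is the definition for the function to return the dollar value. It doesn't work yet because the D&F formats are inconsistent.
def dandftext(url):
    # Keep the third backslash-separated component (presumably the PDF file name)
    url = re.split('\\\\', url)[2]
    call('wget http://app.ocp.dc.gov/intent_award/D_F/' + url, shell=True)
    call('pdftotext ' + url, shell=True)
    # pdftotext writes its output next to the PDF with a .txt extension
    url_text = re.split(r'\.pdf', url)[0] + '.txt'
    df = open(url_text,'r')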