I hereby claim:
- I am thiagomarzagao on github.
- I am thiagomarzagao (https://keybase.io/thiagomarzagao) on keybase.
- I have a public key ASAVdCRfhIPn2ajeJwty8IPOGKV0NmlWGuSLIU5iSELYego
To claim this, I am signing this object:
library(tm)

setwd('/Users/thiagomarzagao/Dropbox/dataScience/UnB-CIC/aulaText/')

# load the first 1000 rows of the sample (read.table defaults to header = FALSE,
# so the columns are named V1, V2, ...)
comprasnet <- read.table('subset.csv',
                         stringsAsFactors = FALSE,
                         sep = ',',
                         nrows = 1000)

# build a corpus from the text column and compute the tf-idf matrix
corpus <- Corpus(VectorSource(comprasnet$V2))
corpus <- tm_map(corpus, PlainTextDocument)
tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
To produce the ADS I relied on supervised learning. I tried three different approaches, compared the results, and picked the approach that worked best. More specifically, I tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperformed the alternatives.
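For illustration, here is a minimal Wordscores sketch in Python/numpy. It is not the code behind the ADS: the function name, the array-based interface, and the raw-score-only output (no rescaling step) are simplifications for the sketch.

import numpy as np

def wordscores(ref_counts, ref_scores, virgin_counts):
    # Wordscores (Laver, Benoit & Garry 2003), raw scores only (no rescaling).
    # ref_counts: (n_refs, n_words) array of term counts for the reference texts
    # ref_scores: (n_refs,) known scores of the reference texts
    # virgin_counts: (n_virgin, n_words) array of term counts for the texts to be scored
    seen = ref_counts.sum(axis=0) > 0                  # words that appear in the reference corpus
    refs = ref_counts[:, seen].astype(float)
    virg = virgin_counts[:, seen].astype(float)
    F = refs / refs.sum(axis=1, keepdims=True)         # relative word frequencies within each reference text
    P = F / F.sum(axis=0, keepdims=True)               # P(reference text | word)
    word_scores = P.T @ np.asarray(ref_scores, float)  # expected reference score conditional on each word
    Fv = virg / virg.sum(axis=1, keepdims=True)        # relative word frequencies within each virgin text
    return Fv @ word_scores                            # frequency-weighted average of word scores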
I created a <a href="http://democracy-scores.org">web application</a> where anyone can tweak the training data and see how the results change (no coding required). <u>Data and code</u>. The two corpora (A and B) are available in <a href="http://math.nist.gov/MatrixMarket/formats.html#MMformat">MatrixMarket format</a>. Each corpus is accompanied by other files: an internal index; a Python pickle with a dictionary mapping word IDs to words; and a Python pickle with a dictionary mapping words to word IDs. Here are the links: <a href="https://s3.amazonaws.com/thiagomarzagao/corpor
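Loading one of the corpora in Python is straightforward; a minimal sketch, assuming placeholder file names (the real names come from the S3 links above):

import pickle
from scipy.io import mmread

dtm = mmread('corpus_A.mtx').tocsr()  # documents x words, sparse (file name is a placeholder)
with open('id2word_A.pkl', 'rb') as f:
    id2word = pickle.load(f)          # word ID -> word
with open('word2id_A.pkl', 'rb') as f:
    word2id = pickle.load(f)          # word -> word ID
print(dtm.shape, len(id2word))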
* The main model
xtscc pressfree leftism presvrl leftismXpresvrl ief gdpcapita

* Effect of VRLpres at the in-sample min, average, and max values of presidential leftism
lincom _b[presvrl] + _b[leftismXpresvrl]*1.5
lincom _b[presvrl] + _b[leftismXpresvrl]*7.194014
lincom _b[presvrl] + _b[leftismXpresvrl]*18

* Figure 1 data (the 'xtscc' add-on does not accept the # operator, so the 'margins'
* command could not be used to graph the interaction)
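For reference, the quantity those lincom lines report is the conditional marginal effect of presvrl, _b[presvrl] + _b[leftismXpresvrl]*leftism, evaluated at the in-sample minimum, mean, and maximum of presidential leftism. A tiny Python sketch with placeholder coefficients (not estimates from the model above):

b_presvrl, b_interaction = 0.10, -0.02  # placeholder coefficients, not real estimates
for leftism in (1.5, 7.194014, 18.0):   # in-sample min, mean, max of presidential leftism
    print(leftism, b_presvrl + b_interaction * leftism)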
import time
import random
import telepot
from telepot.loop import MessageLoop

bot = telepot.Bot('xxxxx')
family_group_id = 'xxxxx'

def handle(msg):
    # placeholder logic (assumption): echo text messages back to the family group
    if 'text' in msg:
        bot.sendMessage(family_group_id, msg['text'])

MessageLoop(bot, handle).run_as_thread()
while True:
    time.sleep(10)  # keep the script alive for the listener thread
import time
import requests

destination = '/Volumes/UNTITLED/wimoveis/paginas/'
base_url = 'https://www.wimoveis.com.br/'
num_pages = 1557 # number of results pages

for i in range(1, num_pages + 1): # range() excludes the right endpoint, so add 1 to reach the last page
    print('page', i)
    query_url = base_url + 'apartamentos-venda-distrito-federal-goias-pagina-{}.html'.format(i)
    response = requests.get(query_url)
    # save the results page to disk (file naming scheme is an assumption)
    with open(destination + 'pagina_{}.html'.format(i), mode = 'w') as f:
        f.write(response.text)
    time.sleep(1) # be polite to the server
import os
from bs4 import BeautifulSoup

hrefs = []
path = '/Volumes/UNTITLED/wimoveis/paginas/'
for fname in os.listdir(path):
    print(fname)
    if ('.html' in fname) and ('._' not in fname):
        with open(path + fname, mode = 'r') as f:
            html = f.read()
        soup = BeautifulSoup(html, 'html.parser')
        # collect links to the individual listings; a generic anchor search is used
        # here (assumption) -- the real selector depends on the page markup
        for a in soup.find_all('a', href=True):
            hrefs.append(a['href'])
import os
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

basepath = '/Volumes/UNTITLED/wimoveis/anuncios/'
hrefs = pd.read_csv('hrefs.csv') # get URLs
hrefs = set(hrefs['href']) # remove duplicate URLs
import numpy as np
import pandas as pd
import forestci as fci
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# set seed
np.random.seed(42) # seed value is a placeholder
import numpy as np
import pandas as pd
from collections import Counter
from bs4 import BeautifulSoup
from scipy import stats
from matplotlib import pyplot as plt

path = '/path/to/export.xml'
with open(path) as f:
    xml = f.read()
soup = BeautifulSoup(xml, 'lxml') # parse the Apple Health export (parser choice is an assumption)