Ettore Rizza ettorerizza

@ettorerizza
ettorerizza / extract_belgian_municipalities.py
Created July 9, 2017 14:06
Naive Jython method to detect Belgian municipality names in OpenRefine, based on a gazetteer
import io
import sys
sys.path.append(r'D:\jython2.7.0\Lib\site-packages')
from unidecode import unidecode
# TEST
value = "carette leuven"
# io.open (rather than the Python 2 built-in open) accepts an encoding in Jython 2.7
with io.open(r"C:\Users\Boulot\Desktop\communes.tsv", 'r', encoding="utf8") as f:
    lieux = [unidecode(name.strip().lower().replace("-", " ")) for name in f]
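The matching idea behind this gist can be sketched as follows; the `normalize` helper uses the stdlib `unicodedata` as a stand-in for `unidecode`, and the three-town gazetteer is invented for the demo:

```python
import unicodedata

def normalize(s):
    # ASCII-fold accents (stdlib stand-in for unidecode), then
    # lowercase and treat hyphens as spaces, as in the gist
    folded = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return folded.strip().lower().replace("-", " ")

def find_municipalities(value, gazetteer):
    # Naive substring search of every gazetteer entry in the normalized value
    normalized = normalize(value)
    return [place for place in gazetteer if place in normalized]

gazetteer = [normalize(n) for n in ["Leuven", "La Louvière", "Sint-Niklaas"]]
print(find_municipalities("carette leuven", gazetteer))  # ['leuven']
```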
@ettorerizza
ettorerizza / airbnb.r
Created July 24, 2017 07:58 — forked from t-andrew-do/airbnb.r
AirBnB Scraping Script
library(stringr)
library(purrr)
library(rvest)
#------------------------------------------------------------------------------#
# Author: Andrew Do
# Purpose: A bunch of utility functions for the main ScrapeCityToPage. The goal
# is to be able to scrape up to a specified page number for a given city and
# then to store that information as a data frame. The resulting data frame will
# be raw and will require additional cleaning, but the structure is more or less
@ettorerizza
ettorerizza / open_refine_to_R.py
Created August 1, 2017 11:27
Translate an OpenRefine "cluster and edit" JSON export into R code
#! python3
import json
import sys
import os
# Takes a "cluster and edit" JSON file as input and returns R code
if len(sys.argv) < 2:
    print("USAGE: ./utils/open_refine_to_R.py [edits.json] > r_file.R")
    sys.exit(1)
json_file = sys.argv[-1]
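The translation step can be sketched like this; the input shape (a list of objects mapping variant strings to a chosen value) and the R column name `col` are assumptions for the demo, not the exact OpenRefine export format:

```python
import json

# Hypothetical "cluster and edit" shape: each entry lists the variant
# spellings ("from") and the value they should be merged into ("to")
edits_json = '[{"from": ["Leuven ", "LEUVEN"], "to": "Leuven"}]'

def to_r_code(edits_json, column="col"):
    lines = []
    for edit in json.loads(edits_json):
        for variant in edit["from"]:
            # One R recode statement per variant spelling
            lines.append('df$%s[df$%s == "%s"] <- "%s"'
                         % (column, column, variant, edit["to"]))
    return "\n".join(lines)

print(to_r_code(edits_json))
```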
@ettorerizza
ettorerizza / scrape_patrom.py
Last active August 7, 2017 21:02
Scrape patronyms from the database http://patrom.fltr.ucl.ac.be
#! python3
import requests
from bs4 import BeautifulSoup
import string
import pandas as pd
url = "http://patrom.fltr.ucl.ac.be/contemporain/query.cfm"
letters = list(string.ascii_lowercase)
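The per-letter crawl this sets up can be sketched as below; the form field name `lettre` is a guessed placeholder, since the site's real form parameter would have to be read from the page source:

```python
import string

BASE_URL = "http://patrom.fltr.ucl.ac.be/contemporain/query.cfm"

def letter_queries():
    # One (url, params) pair per initial letter; 'lettre' is a guessed
    # field name, to be replaced with the site's real form parameter
    return [(BASE_URL, {"lettre": letter}) for letter in string.ascii_lowercase]

queries = letter_queries()
print(len(queries))  # 26
```

Each pair would then be fetched with `requests.get` and the result parsed with BeautifulSoup, ideally with a short pause between requests.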
@ettorerizza
ettorerizza / google_books_links.py
Created August 17, 2017 08:29
Jython: use the Google Books API with OpenRefine records
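No code is shown in this preview; a minimal stdlib sketch of calling the public Google Books volumes endpoint might look like this (the `first_volume_link` helper performs a live request, so it is defined but not called here):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def google_books_url(query):
    # Public volumes endpoint; basic searches need no API key
    return "https://www.googleapis.com/books/v1/volumes?" + urlencode({"q": query})

def first_volume_link(query):
    # Fetch the first matching volume's canonical link, or None
    with urlopen(google_books_url(query)) as resp:
        data = json.load(resp)
    items = data.get("items", [])
    return items[0]["volumeInfo"].get("canonicalVolumeLink") if items else None

print(google_books_url("Candide"))
```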
@ettorerizza
ettorerizza / parseNeckarJson.py
Last active September 9, 2017 15:56
Python parser for Wikidata NECKAr dumps (http://event.ifi.uni-heidelberg.de/?page_id=429)
import pandas as pd
import simplejson as json
import gzip
def getTargetIds(jsonData):
    data = json.loads(jsonData)
    return (str(data.get('id', 'null')),
            str(data.get('norm_name', 'null')),
            str(data.get('description', 'null')),
            str(data.get('date_birth', 'null')),
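Assuming the dump is gzip-compressed JSON Lines (one entity per line), the parser above can be driven as sketched here; the three-field variant of `getTargetIds` and the sample record are for the demo only:

```python
import gzip
import io
import json

def get_target_ids(json_line):
    # Trimmed-down version of the gist's getTargetIds
    data = json.loads(json_line)
    return (str(data.get('id', 'null')),
            str(data.get('norm_name', 'null')),
            str(data.get('description', 'null')))

# Fake one-line gzipped dump standing in for a real NECKAr file
sample = gzip.compress(b'{"id": "Q42", "norm_name": "Douglas Adams"}\n')
with gzip.open(io.BytesIO(sample), 'rt', encoding='utf8') as f:
    rows = [get_target_ids(line) for line in f]
print(rows)  # [('Q42', 'Douglas Adams', 'null')]
```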
@ettorerizza
ettorerizza / count_lines.py
Created September 28, 2017 18:15
Select all .txt files in a folder, count their lines, and write the result to a CSV file
import csv
import copy
import os
import sys
import glob
os.chdir(r"FOLDER_PATH")
names = {}
for fn in glob.glob('*.txt'):
    with open(fn, encoding="utf8") as f:
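The preview stops before the counting and CSV writing; a self-contained sketch of the full idea, demonstrated on a temporary folder rather than the gist's hard-coded FOLDER_PATH:

```python
import csv
import glob
import os
import tempfile

def count_lines(folder):
    # Map each .txt file name to its number of lines
    counts = {}
    for fn in glob.glob(os.path.join(folder, '*.txt')):
        with open(fn, encoding='utf8') as f:
            counts[os.path.basename(fn)] = sum(1 for _ in f)
    return counts

def write_counts(counts, out_path):
    # One CSV row per file, with a header
    with open(out_path, 'w', newline='', encoding='utf8') as out:
        writer = csv.writer(out)
        writer.writerow(['file', 'lines'])
        for name, n in sorted(counts.items()):
            writer.writerow([name, n])

# Demo on a throwaway folder
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'a.txt'), 'w', encoding='utf8') as f:
    f.write('one\ntwo\n')
counts = count_lines(tmp)
write_counts(counts, os.path.join(tmp, 'counts.csv'))
print(counts)  # {'a.txt': 2}
```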
@ettorerizza
ettorerizza / LCS.py
Created December 28, 2017 13:49
Get the least common subsumer (LCS) between two Wikidata items
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
array = ["Q32815", "Q34627"]
query = {"query": """
SELECT ?classe ?classeLabel WHERE {
wd:%s wdt:P279* ?classe .
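The query above collects all superclasses of one item via `wdt:P279*`; intersecting the two items' superclass sets gives their common subsumers, from which the LCS is the most specific one (the depth ranking is not shown here). A sketch, with `fetch_superclasses` performing a live WDQS request and therefore only defined, and the intersection demoed on hand-picked class sets:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

def superclasses_query(qid):
    return "SELECT ?classe WHERE { wd:%s wdt:P279* ?classe . }" % qid

def fetch_superclasses(qid):
    # Standard WDQS call; returns the set of superclass QIDs
    url = ENDPOINT + "?" + urlencode({"query": superclasses_query(qid),
                                      "format": "json"})
    req = Request(url, headers={"User-Agent": "lcs-sketch/0.1"})
    with urlopen(req) as resp:
        data = json.load(resp)
    return {b["classe"]["value"].rsplit("/", 1)[-1]
            for b in data["results"]["bindings"]}

def common_subsumers(classes_a, classes_b):
    return classes_a & classes_b

# Hand-picked sets: Q515 = city, Q486972 = human settlement, Q2221906 = geographic location
print(common_subsumers({"Q515", "Q486972"}, {"Q486972", "Q2221906"}))  # {'Q486972'}
```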
@ettorerizza
ettorerizza / jq.R
Created December 28, 2017 20:43
How to use jq with R
library(jqr)
data <- readr::read_file("tweets.json")
data %>% keys()
data %>% jq("{id: .id, hashtag: .entities.hashtags[].text}",
"[.id, .hashtag]") %>% jsonlite::toJSON()
stri <- "--h"
@ettorerizza
ettorerizza / stanford_ner_europeana
Created February 4, 2018 12:08
Test of the Stanford NER tagger with Europeana's CRF models trained on newspapers: http://lab.kbresearch.nl/static/html/eunews.html
# -*- coding: utf-8 -*-
"""
Test du Stanford NER tagger avec les modèles CRF d'Europeana
entrainés sur des journaux :
http://lab.kbresearch.nl/static/html/eunews.html
La fonction est lente --> songer au multiprocessing
"""
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
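The docstring suggests multiprocessing to speed up the slow tagging function. A minimal sketch of that idea, using a stand-in `tag_sentence` so it runs without the Stanford jar and model files (the real function would call `tagger.tag(word_tokenize(sentence))`):

```python
from multiprocessing import Pool

def tag_sentence(sentence):
    # Stand-in for tagger.tag(word_tokenize(sentence)): the real call
    # returns (token, entity_label) pairs from the CRF model
    return [(token, "O") for token in sentence.split()]

def tag_corpus(sentences, processes=4):
    # Tag sentences in parallel; each worker handles one sentence at a time
    with Pool(processes) as pool:
        return pool.map(tag_sentence, sentences)

if __name__ == "__main__":
    print(tag_corpus(["Paris est une ville", "Bruxelles aussi"], processes=2))
```

Batching several sentences per task would reduce the per-call overhead further, since each `tag` call pays the cost of invoking the Stanford JVM process.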