Ettore Rizza ettorerizza

@ettorerizza
ettorerizza / extract_belgian_municipalities.py
Created July 9, 2017 14:06
Naive Jython method to detect Belgian municipality names in OpenRefine, based on a gazetteer
import io
import sys
sys.path.append(r'D:\jython2.7.0\Lib\site-packages')
from unidecode import unidecode
# TEST
value = "carette leuven"
# io.open (rather than the Python 2 built-in open) accepts an encoding in Jython 2.7
with io.open(r"C:\Users\Boulot\Desktop\communes.tsv", 'r', encoding="utf8") as f:
    lieux = [unidecode(name.strip().lower().replace("-", " ")) for name in f]
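The matching idea behind this gist can be sketched as follows; the `normalize` helper uses the stdlib `unicodedata` as a stand-in for `unidecode`, and the three-town gazetteer is invented for the demo:

```python
import unicodedata

def normalize(s):
    # ASCII-fold accents (stdlib stand-in for unidecode), then
    # lowercase and treat hyphens as spaces, as in the gist
    folded = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return folded.strip().lower().replace("-", " ")

def find_municipalities(value, gazetteer):
    # Naive substring search of every gazetteer entry in the normalized value
    normalized = normalize(value)
    return [place for place in gazetteer if place in normalized]

gazetteer = [normalize(n) for n in ["Leuven", "La Louvière", "Sint-Niklaas"]]
print(find_municipalities("carette leuven", gazetteer))  # ['leuven']
```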
@ettorerizza
ettorerizza / airbnb.r
Created July 24, 2017 07:58 — forked from t-andrew-do/airbnb.r
AirBnB Scraping Script
library(stringr)
library(purrr)
library(rvest)
#------------------------------------------------------------------------------#
# Author: Andrew Do
# Purpose: A bunch of utility functions for the main ScrapeCityToPage. The goal
# is to be able to scrape up to a specified page number for a given city and
# then to store that information as a data frame. The resulting data frame will
# be raw and will require additional cleaning, but the structure is more or less
@ettorerizza
ettorerizza / open_refine_to_R.py
Created August 1, 2017 11:27
Translate an OpenRefine "cluster and edit" JSON export into R code
#! python3
import json
import sys
import os
# Takes a "cluster and edit" JSON file as input and returns R code
if len(sys.argv) < 2:
    print("USAGE: ./utils/open_refine_to_R.py [edits.json] > r_file.R")
    sys.exit(1)
json_file = sys.argv[-1]
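The translation step can be sketched like this; the input shape (a list of objects mapping variant strings to a chosen value) and the R column name `col` are assumptions for the demo, not the exact OpenRefine export format:

```python
import json

# Hypothetical "cluster and edit" shape: each entry lists the variant
# spellings ("from") and the value they should be merged into ("to")
edits_json = '[{"from": ["Leuven ", "LEUVEN"], "to": "Leuven"}]'

def to_r_code(edits_json, column="col"):
    lines = []
    for edit in json.loads(edits_json):
        for variant in edit["from"]:
            # One R recode statement per variant spelling
            lines.append('df$%s[df$%s == "%s"] <- "%s"'
                         % (column, column, variant, edit["to"]))
    return "\n".join(lines)

print(to_r_code(edits_json))
```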
@ettorerizza
ettorerizza / scrape_patrom.py
Last active August 7, 2017 21:02
Scrape patronyms from the database http://patrom.fltr.ucl.ac.be
#! python3
import requests
from bs4 import BeautifulSoup
import string
import pandas as pd
url = "http://patrom.fltr.ucl.ac.be/contemporain/query.cfm"
letters = list(string.ascii_lowercase)
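The per-letter crawl this sets up can be sketched as below; the form field name `lettre` is a guessed placeholder, since the site's real form parameter would have to be read from the page source:

```python
import string

BASE_URL = "http://patrom.fltr.ucl.ac.be/contemporain/query.cfm"

def letter_queries():
    # One (url, params) pair per initial letter; 'lettre' is a guessed
    # field name, to be replaced with the site's real form parameter
    return [(BASE_URL, {"lettre": letter}) for letter in string.ascii_lowercase]

queries = letter_queries()
print(len(queries))  # 26
```

Each pair would then be fetched with `requests.get` and the result parsed with BeautifulSoup, ideally with a short pause between requests.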
@ettorerizza
ettorerizza / google_books_links.py
Created August 17, 2017 08:29
Jython: use the Google Books API with OpenRefine records
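No code is shown in this preview; a minimal stdlib sketch of calling the public Google Books volumes endpoint might look like this (the `first_volume_link` helper performs a live request, so it is defined but not called here):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def google_books_url(query):
    # Public volumes endpoint; basic searches need no API key
    return "https://www.googleapis.com/books/v1/volumes?" + urlencode({"q": query})

def first_volume_link(query):
    # Fetch the first matching volume's canonical link, or None
    with urlopen(google_books_url(query)) as resp:
        data = json.load(resp)
    items = data.get("items", [])
    return items[0]["volumeInfo"].get("canonicalVolumeLink") if items else None

print(google_books_url("Candide"))
```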
@ettorerizza
ettorerizza / parseNeckarJson.py
Last active September 9, 2017 15:56
Python parser for Wikidata NECKAr dumps (http://event.ifi.uni-heidelberg.de/?page_id=429)
import pandas as pd
import simplejson as json
import gzip
def getTargetIds(jsonData):
    data = json.loads(jsonData)
    return (str(data.get('id', 'null')),
            str(data.get('norm_name', 'null')),
            str(data.get('description', 'null')),
            str(data.get('date_birth', 'null')),
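Assuming the dump is gzip-compressed JSON Lines (one entity per line), the parser above can be driven as sketched here; the three-field variant of `getTargetIds` and the sample record are for the demo only:

```python
import gzip
import io
import json

def get_target_ids(json_line):
    # Trimmed-down version of the gist's getTargetIds
    data = json.loads(json_line)
    return (str(data.get('id', 'null')),
            str(data.get('norm_name', 'null')),
            str(data.get('description', 'null')))

# Fake one-line gzipped dump standing in for a real NECKAr file
sample = gzip.compress(b'{"id": "Q42", "norm_name": "Douglas Adams"}\n')
with gzip.open(io.BytesIO(sample), 'rt', encoding='utf8') as f:
    rows = [get_target_ids(line) for line in f]
print(rows)  # [('Q42', 'Douglas Adams', 'null')]
```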
@ettorerizza
ettorerizza / count_lines.py
Created September 28, 2017 18:15
Select all .txt files in a folder, count their lines, and write the result to a CSV file
import csv
import copy
import os
import sys
import glob
os.chdir(r"FOLDER_PATH")
names = {}
for fn in glob.glob('*.txt'):
    with open(fn, encoding="utf8") as f:
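The preview stops before the counting and CSV writing; a self-contained sketch of the full idea, demonstrated on a temporary folder rather than the gist's hard-coded FOLDER_PATH:

```python
import csv
import glob
import os
import tempfile

def count_lines(folder):
    # Map each .txt file name to its number of lines
    counts = {}
    for fn in glob.glob(os.path.join(folder, '*.txt')):
        with open(fn, encoding='utf8') as f:
            counts[os.path.basename(fn)] = sum(1 for _ in f)
    return counts

def write_counts(counts, out_path):
    # One CSV row per file, with a header
    with open(out_path, 'w', newline='', encoding='utf8') as out:
        writer = csv.writer(out)
        writer.writerow(['file', 'lines'])
        for name, n in sorted(counts.items()):
            writer.writerow([name, n])

# Demo on a throwaway folder
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'a.txt'), 'w', encoding='utf8') as f:
    f.write('one\ntwo\n')
counts = count_lines(tmp)
write_counts(counts, os.path.join(tmp, 'counts.csv'))
print(counts)  # {'a.txt': 2}
```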
@ettorerizza
ettorerizza / LCS.py
Created December 28, 2017 13:49
Get the least common subsumer (LCS) between two Wikidata items
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
array = ["Q32815", "Q34627"]
query = {"query": """
SELECT ?classe ?classeLabel WHERE {
wd:%s wdt:P279* ?classe .
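The query above collects all superclasses of one item via `wdt:P279*`; intersecting the two items' superclass sets gives their common subsumers, from which the LCS is the most specific one (the depth ranking is not shown here). A sketch, with `fetch_superclasses` performing a live WDQS request and therefore only defined, and the intersection demoed on hand-picked class sets:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

def superclasses_query(qid):
    return "SELECT ?classe WHERE { wd:%s wdt:P279* ?classe . }" % qid

def fetch_superclasses(qid):
    # Standard WDQS call; returns the set of superclass QIDs
    url = ENDPOINT + "?" + urlencode({"query": superclasses_query(qid),
                                      "format": "json"})
    req = Request(url, headers={"User-Agent": "lcs-sketch/0.1"})
    with urlopen(req) as resp:
        data = json.load(resp)
    return {b["classe"]["value"].rsplit("/", 1)[-1]
            for b in data["results"]["bindings"]}

def common_subsumers(classes_a, classes_b):
    return classes_a & classes_b

# Hand-picked sets: Q515 = city, Q486972 = human settlement, Q2221906 = geographic location
print(common_subsumers({"Q515", "Q486972"}, {"Q486972", "Q2221906"}))  # {'Q486972'}
```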
@ettorerizza
ettorerizza / jq.R
Created December 28, 2017 20:43
How to use jq with R
library(jqr)
data <- readr::read_file("tweets.json")
data %>% keys()
data %>% jq("{id: .id, hashtag: .entities.hashtags[].text}",
"[.id, .hashtag]") %>% jsonlite::toJSON()
stri <- "--h"
@ettorerizza
ettorerizza / stanford_ner_europeana
Created February 4, 2018 12:08
Test of the Stanford NER tagger with Europeana's CRF models trained on newspapers: http://lab.kbresearch.nl/static/html/eunews.html
# -*- coding: utf-8 -*-
"""
Test du Stanford NER tagger avec les modèles CRF d'Europeana
entrainés sur des journaux :
http://lab.kbresearch.nl/static/html/eunews.html
La fonction est lente --> songer au multiprocessing
"""
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
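The docstring suggests multiprocessing to speed up the slow tagging function. A minimal sketch of that idea, using a stand-in `tag_sentence` so it runs without the Stanford jar and model files (the real function would call `tagger.tag(word_tokenize(sentence))`):

```python
from multiprocessing import Pool

def tag_sentence(sentence):
    # Stand-in for tagger.tag(word_tokenize(sentence)): the real call
    # returns (token, entity_label) pairs from the CRF model
    return [(token, "O") for token in sentence.split()]

def tag_corpus(sentences, processes=4):
    # Tag sentences in parallel; each worker handles one sentence at a time
    with Pool(processes) as pool:
        return pool.map(tag_sentence, sentences)

if __name__ == "__main__":
    print(tag_corpus(["Paris est une ville", "Bruxelles aussi"], processes=2))
```

Batching several sentences per task would reduce the per-call overhead further, since each `tag` call pays the cost of invoking the Stanford JVM process.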