import pandas as pd
from covea.claudia.core.nlp.etudeponctuelle import save_as_table, load_dataframe_from_delta
from netme0a.settings.paths_to_data import *
from transformers import AutoTokenizer
import umap
import seaborn as sns
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

model_name = "dangvantuan/sentence-camembert-large"
library(data.table)
library(sf)
library(tidyr)
library(leaflet)
library(dplyr)

# Download the most recent geocoded FINESS file and rename it finess_geocoded_latest.csv
# The file stacks two tables: read the first block, then skip past it to read the second.
fi1 = fread("finess_geocoded_latest.csv", encoding = "Latin-1", colClasses = "character")
n = nrow(fi1)
fi2 = fread("finess_geocoded_latest.csv", encoding = "Latin-1", colClasses = "character", skip = n + 1, sep = ";")
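The two `fread` calls above work because the FINESS export stacks two tables in one file: the first read stops at the break, and the second skips past it (`skip = n + 1` jumps over the first block's header plus its `n` rows). A hedged Python equivalent of the same trick, on a made-up two-block file — `finess_geocoded_latest.csv` and its real layout are not reproduced here, and `nrows` stands in for fread's automatic stop:

```python
import io
import pandas as pd

# Invented stand-in for a file that stacks two ';'-separated tables
# back to back, the way the FINESS export does.
raw = (
    "id;name\n"
    "1;hopital A\n"
    "2;clinique B\n"
    "id;lon;lat\n"
    "1;2.35;48.85\n"
    "2;5.37;43.30\n"
)

# Read the first block. Here we cap it explicitly with nrows; the R code
# instead lets fread stop on its own when the layout changes.
part1 = pd.read_csv(io.StringIO(raw), sep=";", nrows=2, dtype=str)
n = len(part1)

# Skip the first block (its header plus its n rows) to reach the second table.
part2 = pd.read_csv(io.StringIO(raw), sep=";", skiprows=n + 1, dtype=str)
```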
library(shiny)
library(httr)
library(shinydashboard)
library(magrittr)

mykey = paste(sample(LETTERS, 20, replace = TRUE), collapse = "") # should be secret

# OAuth setup --------------------------------------------------------
# Most OAuth applications require that you redirect to a fixed and known
# set of URLs. Many only allow you to redirect to a single URL: if this
# is the case for you, you'll need to create an app for testing with a localhost
## Word2vec & GloVe

Some researchers publish pretrained word2vec, GloVe, or LSA models, which can be thought of as lookup tables mapping each word to a numeric vector in a space of a given size (generally much smaller than the vocabulary size, which, recall, is large). This space is, in a sense, a "semantic" space.

GloVe, word2vec, and LSA all produce vectorizations of language, i.e. representations in a real vector space of a chosen dimension N:

- word2vec: relies on a simple multilayer perceptron (neural network) with one hidden layer (of size N), trained to predict a word from its context, or vice versa. The N-dimensional vectorization is given by the weights of the hidden-layer neurons.
- GloVe (global vectors): relies on factorizing the term co-occurrence matrix.
- LSA: relies on the singular value decomposition of the term-document matrix.
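The LSA variant above can be sketched in a few lines of numpy: take the SVD of a term-document matrix and keep the top N components as term embeddings. The toy matrix and vocabulary below are invented purely for illustration; real LSA pipelines weight counts (e.g. with tf-idf) before factorizing.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Counts are invented purely for illustration.
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],  # cat
    [1, 2, 0, 0],  # dog
    [1, 1, 0, 0],  # pet
    [0, 0, 2, 1],  # stock
    [0, 0, 1, 2],  # market
], dtype=float)

# LSA: truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
N = 2  # chosen embedding dimension
term_vectors = U[:, :N] * s[:N]  # each row is an N-dimensional term embedding

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms that co-occur in the same documents end up close in the latent space.
print(cosine(term_vectors[0], term_vectors[1]))  # cat vs dog: high
print(cosine(term_vectors[0], term_vectors[3]))  # cat vs stock: near zero
```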
#------------------
# Data Preparation
#------------------
library(data.table)
library(ggplot2)
library(plotly)

# Read datasets
# Download the data from http://www.saedsayad.com/datasets/BikeRental.zip
train <- read.csv("data/BikeRental/bike_rental_train.csv")
test <- read.csv("data/BikeRental/bike_rental_test.csv")
# https://joparga3.github.io/Udemy_text_analysis/#document-similarity-cosine-similarity-and-latent-semantic-analysis
library(data.table)
library(tm)
library(SnowballC)
library(Rtsne)
library(irlba)
library(plotly)

# articles = fread("data_text_mining/lemonde_csv_formation.csv", encoding = 'UTF-8')
# data scraped with the scraping_lemonde gist
scrape = pbapply::pblapply(list.files("lemonde_scraping/"), function(x){
library(rvest)
library(data.table)

annees = 1980:2019
annee = sample(annees, 1)
pbapply::pblapply(annees, function(annee){
  pbapply::pblapply(1:250, function(i){
    tryCatch({
      url = sprintf(paste0("https://www.lemonde.fr/recherche/",
                           "?search_keywords=a&start_at=01/01/%s&",
library(data.table)

prep_data = FALSE
if(prep_data){
  # https://www.insee.fr/fr/statistiques/2520034
  carreaux = foreign::read.dbf("external_data/carroyage_200m.dbf")
  carreaux = data.table(carreaux)
  head(carreaux)
  carreaux[, idINSPIRE := as.character(idINSPIRE)]
####################################################
#### Argh: scientific notation strikes again! ######
####################################################
# Fix: scipen = 999 penalizes scientific notation in printed output
options(scipen = 999)

one_mat = NN[mater == mater_id]
lon_lat_mat = one_mat[1, c("lon_mater", "lat_mater")] %>% paste(collapse = ",")

divide_and_conquer <- function(data, mater_id, position = "0"){
  print(paste(position, nrow(data)))
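The `divide_and_conquer` function is cut off here, but its signature suggests it recursively splits the data, logging each chunk's position in the recursion tree via the `position` string. A minimal Python sketch of that general pattern — the halving rule, size threshold, and `process` callback are all assumptions, not taken from the gist:

```python
# Hypothetical sketch: recursively halve `data` until chunks fit under
# max_size, then apply `process` to each leaf chunk. The binary string
# `position` ("0", "00", "01", ...) records where each chunk sits in the
# recursion tree, mirroring the R function's `position` argument.
def divide_and_conquer(data, position="0", max_size=2, process=len):
    print(position, len(data))
    if len(data) <= max_size:
        return [process(data)]
    mid = len(data) // 2
    return (divide_and_conquer(data[:mid], position + "0", max_size, process)
            + divide_and_conquer(data[mid:], position + "1", max_size, process))

results = divide_and_conquer(list(range(10)))
```

With `process=len`, `results` collects the leaf chunk sizes, which sum back to the original length.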