@Altons
Created January 15, 2013 11:57
R function for detecting language in text
# Program name: langScore.R
# Author: Alberto Negron
# date: 15/01/2013
# Description: Detect the language of plain text by counting stopword matches per language
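langScore <- function(post)
{
  # Params:
  #   post: Plain text (a character string).
  # Returns: the name of the best-matching language, or 'unknown'.
  # Stop with a clear error if tm is missing; require() would only warn.
  if (!requireNamespace("tm", quietly = TRUE)) stop("langScore() requires the 'tm' package")
  # Languages with stopword lists in the tm package
  langList <- c('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
                'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish')
  # Named list of stopword vectors, one entry per language
  langStopwords <- list()
  for (lang in langList) langStopwords[[lang]] <- tm::stopwords(kind = lang)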
  # Split the text on whitespace and normalize tokens to lower case
  tokens <- tolower(unlist(strsplit(post, "[[:space:]]+")))
  # Score each language by the number of tokens found in its stopword list
  scores <- vapply(langStopwords, function(sw) sum(tokens %in% sw), integer(1))
  # No stopword matched: unsupported language, or only links or hashtags
  if (sum(scores) == 0) return('unknown')
  # Name of the highest-scoring language (the first one wins on a tie)
  return(names(which.max(scores)))
}
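
The scoring idea can be tried without installing tm. The sketch below is only an illustration: `toyStopwords` and `toyLangScore` are hypothetical names, and the three short stopword lists are hand-picked assumptions, far smaller than what `tm::stopwords()` returns.

```r
# Hypothetical, hand-picked stopword lists (assumption: tiny stand-ins for tm's lists)
toyStopwords <- list(
  english = c("the", "is", "and", "of", "on"),
  spanish = c("el", "la", "es", "y", "de"),
  french  = c("le", "la", "est", "et", "de")
)

# Same idea as langScore(): count the tokens that are stopwords in each language
toyLangScore <- function(post, stopwordLists = toyStopwords) {
  tokens <- tolower(unlist(strsplit(post, "[[:space:]]+")))
  scores <- vapply(stopwordLists, function(sw) sum(tokens %in% sw), integer(1))
  if (sum(scores) == 0) return("unknown")
  names(which.max(scores))
}

toyLangScore("The cat is on the table")      # "english"
toyLangScore("El gato es de la casa")        # "spanish"
toyLangScore("#rstats http://example.com")   # "unknown"
```

The second call shows why counting matters: "de" and "la" are also in the French list, but Spanish matches four tokens against two for French.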