@vlad-bezden
Created August 13, 2019 13:56
An example of how to clean up text for text classification in ML
package main

import (
	"fmt"
	"regexp"
	"strings"
)

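// nonLetters matches runs of characters other than ASCII letters.
// It is compiled once at package level rather than on every call.
var nonLetters = regexp.MustCompile("[^a-zA-Z]+")

// preProcess lowercases text and keeps only ASCII letters, so digits,
// punctuation, and non-ASCII characters are dropped.
func preProcess(text string) string {
	// Replace each run of non-letter characters with a single space
	text = nonLetters.ReplaceAllString(text, " ")
	text = strings.ToLower(text)
	// Tokenize on whitespace, which also trims leading and trailing spaces
	tokens := strings.Fields(text)
	// Join the tokens back into a single space-separated string
	return strings.Join(tokens, " ")
}
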
func main() {
	test := `
The idea of designing a new language has been inspired by common dissatisfaction
with C++. Go was announced in 2009 and in March 2012 version 1.0 was released.
Go is used in Google’s production, as well as by many other companies
and open-source projects.
`
	r := preProcess(test)
	fmt.Println(r)
}

/*
OUTPUT (printed as a single line, wrapped here for readability):
the idea of designing a new language has been inspired by common dissatisfaction with c
go was announced in and in march version was released go is used in google s
production as well as by many other companies and open source projects
*/
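
The output above shows two side effects of the ASCII-only pattern: "Google’s" is split into "google s", and any accented letter would be treated as a separator. The following is a minimal sketch of one possible variant, not part of the original gist, that removes apostrophes before cleaning and keeps every Unicode letter via `\p{L}`. The name `preProcessUnicode` and the choice to delete apostrophes outright are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// unicodeNonLetters matches runs of anything that is not a Unicode letter.
var unicodeNonLetters = regexp.MustCompile(`[^\p{L}]+`)

// apostropheRemover deletes ASCII and typographic apostrophes.
// Assumption: dropping them entirely is acceptable for the classifier.
var apostropheRemover = strings.NewReplacer("'", "", "’", "")

// preProcessUnicode is a hypothetical variant of preProcess that keeps
// non-ASCII letters and joins words around apostrophes.
func preProcessUnicode(text string) string {
	text = apostropheRemover.Replace(text)
	text = unicodeNonLetters.ReplaceAllString(text, " ")
	text = strings.ToLower(text)
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fmt.Println(preProcessUnicode("Google’s café is open-source."))
}
```

Whether "googles" or "google s" is the better token depends on the downstream vocabulary, so treat this as one option rather than a recommended default.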