Created August 13, 2019 13:56
An example of how to clean up text for text classification in ML
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonAlpha matches runs of characters that are not ASCII letters.
// It is compiled once at package level instead of on every call.
var nonAlpha = regexp.MustCompile("[^a-zA-Z]+")

// preProcess normalizes text for classification: it keeps only ASCII
// letters, lowercases them, and collapses all whitespace to single spaces.
func preProcess(text string) string {
	// Replace every run of non-letter characters with a single space
	text = nonAlpha.ReplaceAllString(text, " ")
	text = strings.ToLower(text)
	// Tokenize on whitespace, which also drops leading, trailing,
	// and repeated whitespace
	tokens := strings.Fields(text)
	// Join the tokens back into a single string
	return strings.Join(tokens, " ")
}

func main() {
	test := `
The idea of designing a new language has been inspired by common dissatisfaction
with C++. Go was announced in 2009 and in March 2012 version 1.0 was released.
Go is used in Google’s production, as well as by many other companies
and open-source projects.
`
	r := preProcess(test)
	fmt.Println(r)
}

/*
OUTPUT (printed as a single line; wrapped here for readability):
the idea of designing a new language has been inspired by common dissatisfaction with c
go was announced in and in march version was released go is used in google s
production as well as by many other companies and open source projects
*/
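The output above shows a side effect of the ASCII-only character class: digits and punctuation are dropped as intended, but any non-ASCII letter is dropped as well, so "Café" would become "caf". The following is a minimal sketch of one possible variant, not part of the original gist, that keeps all Unicode letters by using the `\p{L}` class from Go's `regexp` syntax. The names `nonLetter` and `preProcessUnicode` are hypothetical and introduced only for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonLetter (hypothetical name) matches runs of characters that are
// not Unicode letters. \p{L} is the Unicode "Letter" category.
var nonLetter = regexp.MustCompile(`[^\p{L}]+`)

// preProcessUnicode (hypothetical name) follows the same steps as
// preProcess, but treats any Unicode letter as part of a word.
func preProcessUnicode(text string) string {
	// Replace every run of non-letter characters with a single space
	text = nonLetter.ReplaceAllString(text, " ")
	// strings.ToLower is Unicode-aware, so "É" becomes "é"
	text = strings.ToLower(text)
	// Split on whitespace and rejoin to collapse excess whitespace
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fmt.Println(preProcessUnicode("Café Müller opened in 2019!"))
	// prints: café müller opened in
}
```

With the ASCII-only pattern, the same input would yield "caf m ller opened in". Whether accented letters should be kept, stripped, or folded to ASCII depends on the classifier and the corpus, so this is one option among several.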