Last active
August 29, 2015 14:19
php hashtag splitter / what does a given hashtag stand for? / hashtag segmentation algorithm
This should be a three-tier process:
1.- Using uppercase letters as word boundaries: #BarackObama => Barack Obama
2.- Querying a populated knowledge base or corpus of context such as obloop.com, first-thoughts.org or ritetag.com
3.- Using an English dictionary to work out which words the hashtag is composed of: #socialmedia => social media
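The tiers above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, and it covers only tiers 1 and 3; tier 2 is left out because it depends on an external service whose API is not described here. The function names and the tiny WORDS set are hypothetical placeholders; a real run would load one of the word lists mentioned in this note.

```python
import re

# Hypothetical toy dictionary, for illustration only.
WORDS = {"social", "media", "barack", "obama"}


def split_on_case(tag):
    """Tier 1: split at uppercase boundaries (BarackObama -> Barack, Obama)."""
    body = tag.lstrip("#")
    # Acronym runs, capitalised or lowercase words, and digit runs.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)


def split_with_dictionary(text, words):
    """Tier 3: dictionary split with the fewest words, or None if impossible."""
    text = text.lower()
    # best[i] is the shortest known segmentation of text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


def segment(tag, words=WORDS):
    """Try the case-based split first, then fall back to the dictionary."""
    parts = split_on_case(tag)
    if len(parts) > 1:
        return parts
    body = tag.lstrip("#")
    # If the dictionary cannot cover the text, return it unsplit.
    return split_with_dictionary(body, words) or [body]


print(segment("#BarackObama"))
print(segment("#socialmedia"))
```

Preferring the split with the fewest words is just one possible tie-breaker, assumed here for simplicity; word frequencies from a corpus would likely rank candidates better.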
To take into account:
#nothere or #alwaysagain can be split into multiple semantically valid segmentations (for example "not here" or "no there").
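To make the ambiguity concrete, here is a small Python sketch that lists every dictionary-valid segmentation instead of returning just one. The function name and the word set are hypothetical and chosen only so that both example hashtags have more than one reading. The plain recursion can take exponential time on long inputs, so treat it as an illustration rather than a tuned implementation.

```python
# Hypothetical word set, chosen so that both examples are ambiguous.
WORDS = {"no", "not", "here", "there", "always", "again", "a", "gain"}


def all_segmentations(text, words):
    """Return every way to split text into words from the given set."""
    text = text.lower()
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Combine this prefix with every segmentation of the remainder.
            for rest in all_segmentations(text[end:], words):
                results.append([head] + rest)
    return results


for tag in ("nothere", "alwaysagain"):
    for option in all_segmentations(tag, WORDS):
        print(tag, "=>", " ".join(option))
```

Choosing among the candidates is where the knowledge base or corpus from step 2 would come in; this sketch does not attempt that ranking.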
C++ Challenge => https://www.hackerrank.com/challenges/url-hashtag-segmentation
Algorithm in Java => http://www.usna.edu/Users/cs/nchamber/courses/nlp/f12/labs/lab1.html
Useful assets: English Dictionaries/Corpora =>
https://raw.githubusercontent.com/piyushbansal/hashtag-segmentation/master/one-grams.txt
http://www.pregunton.org/pregunta.php?id=1423
http://s3.amazonaws.com/hr-testcases/479/assets/words.txt