@computerphysicslab
Last active August 29, 2015 14:19
php hashtag splitter / what does a given hashtag stand for? / hashtag segmentation algorithm
This should be a three-tier process:
1.- Use uppercase cues: #BarackObama => Barack Obama
2.- Query a populated knowledge base or context corpus such as obloop.com, first-thoughts.org or ritetag.com
3.- Use an English dictionary to work out the word composition: #socialmedia => social media
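Tiers 1 and 3 can be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions, not a reference implementation: WORDS is a hypothetical toy dictionary (a real run would load one of the word lists linked further down), the function names are made up for this sketch, and "fewest words wins" is just one possible tie-breaking rule. Tier 2 (knowledge-base lookup) is left out because it depends on an external service.

```python
import re

# Hypothetical toy dictionary; a real run would load a full word list.
WORDS = {"social", "media", "so", "me"}


def split_on_case(tag):
    """Tier 1: split a hashtag on uppercase boundaries."""
    body = tag.lstrip("#")
    # Acronym runs, capitalised or lowercase words, and digit runs.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)


def segment(text, words):
    """Tier 3: return a segmentation with the fewest dictionary words, or None."""
    text = text.lower()
    n = len(text)
    best = [None] * (n + 1)  # best[i] = shortest word list covering text[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def split_hashtag(tag, words=WORDS):
    """Try the case-based split first, then fall back to the dictionary."""
    parts = split_on_case(tag)
    if len(parts) > 1:
        return parts
    found = segment(tag.lstrip("#"), words)
    return found if found else parts


if __name__ == "__main__":
    print(split_hashtag("#BarackObama"))
    print(split_hashtag("#socialmedia"))
```

The case-based split runs first because it is cheap and needs no dictionary; the dictionary pass only runs when the hashtag carries no uppercase hints.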
To take into account:
Some hashtags, such as #nothere or #alwaysagain, can be split into more than one semantically valid segmentation (not here / no there, always again / always a gain).
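To see the ambiguity concretely, a small sketch can enumerate every dictionary-valid segmentation instead of returning just one. The word set and the function name all_segmentations are assumptions made for this example; choosing among the candidates (for instance by word frequency from a one-gram list) is a separate step not shown here.

```python
from functools import lru_cache


def all_segmentations(text, words):
    """Return every way to split text into words from the given set."""
    text = text.lower()
    words = frozenset(words)

    @lru_cache(maxsize=None)
    def from_index(i):
        # All segmentations of text[i:]; an empty tail has one (empty) split.
        if i == len(text):
            return [[]]
        results = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in words:
                for rest in from_index(j):
                    results.append([piece] + rest)
        return results

    return from_index(0)


if __name__ == "__main__":
    # Hypothetical toy dictionary, just large enough to show the ambiguity.
    toy = {"no", "not", "here", "there", "always", "again", "a", "gain"}
    for tag in ("nothere", "alwaysagain"):
        print(tag, all_segmentations(tag, toy))
```

The number of candidates can grow quickly with a large dictionary, so a real splitter would likely score and prune them rather than list them all.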
C++ Challenge => https://www.hackerrank.com/challenges/url-hashtag-segmentation
Algorithm in Java => http://www.usna.edu/Users/cs/nchamber/courses/nlp/f12/labs/lab1.html
Useful assets: English Dictionaries/Corpora =>
https://raw.githubusercontent.com/piyushbansal/hashtag-segmentation/master/one-grams.txt
http://www.pregunton.org/pregunta.php?id=1423
http://s3.amazonaws.com/hr-testcases/479/assets/words.txt