This is meant to be an approximate port of Web Words, written in Elixir. I was appalled at the number of external dependencies the original needs (MongoDB, RabbitMQ), so I decided to see if I could come up with something similar without them.
Also, I wasn't smart enough to understand the Scala code – my brain melted and my eyes bled. My apologies to all Scala programmers out there.
The application takes a Wikipedia URL and crawls the other Wikipedia pages it links to. As it goes through the pages, it calculates word frequencies and presents the results in "real-time" (whatever that means). Stop words such as 'a', 'the' and 'of' are ignored.
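To make the frequency count concrete, here is a minimal sketch of counting words while skipping stop words. The module name `WordFreqSketch` and its tiny stop-word list are made up for illustration and are not taken from the application's code.

```elixir
defmodule WordFreqSketch do
  # Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
  @stop_words ~w(a an the of and to in is)

  # Returns a map of word => number of occurrences, ignoring stop words.
  # Splitting on anything that is not a-z is crude: "don't" becomes "don" and "t".
  def count(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^a-z]+/, trim: true)
    |> Enum.reject(&(&1 in @stop_words))
    |> Enum.frequencies()
  end
end
```

Calling `WordFreqSketch.count("The cat sat; the CAT!")` gives `%{"cat" => 2, "sat" => 1}`. `Enum.frequencies/1` needs Elixir 1.10 or later.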
- A web interface (using Phoenix) receives a request to crawl a Wikipedia site.
- The request is then forwarded to the main ExWiKiDx application, which checks whether the site has been crawled before. It does this via a cache that uses ETS (Erlang Term Storage).
- If the page has not been cached, the application sends a request to the Indexer supervisor.
- The supervisor spawns an Indexer process. The Indexer process parses the page, produces an Index, and stores it in ETS. It then sends a message to the web process to say that the data is ready.
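The cache-then-index flow above can be sketched roughly as follows. This is not the application's code: the module name, the table name, the `{:index_ready, url}` message shape and the use of `Task.start/1` in place of a supervised Indexer process are all assumptions made to keep the example short.

```elixir
defmodule CacheSketch do
  # Hypothetical table name.
  @table :index_cache_sketch

  # Creates the named ETS table. Call once, from a long-lived process,
  # because the table disappears when its owner exits.
  def init do
    :ets.new(@table, [:set, :public, :named_table])
    :ok
  end

  # On a cache hit, returns {:cached, index}.
  # On a miss, runs index_fun in a separate process, stores the result in ETS,
  # and messages reply_to once it is there. Returns :indexing straight away.
  def fetch(url, index_fun, reply_to \\ self()) do
    case :ets.lookup(@table, url) do
      [{^url, index}] ->
        {:cached, index}

      [] ->
        Task.start(fn ->
          index = index_fun.(url)
          :ets.insert(@table, {url, index})
          send(reply_to, {:index_ready, url})
        end)

        :indexing
    end
  end
end
```

The table is `:public` so that the indexing process can write to it while the caller only reads. In a real application the indexing work would more likely run under a supervisor, as described above, so that crashes are handled.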
- URLFetcher: Downloads a page given a URL.
- IndexActor: Indexes a page given a URL. The indexer extracts all the links on the page and keeps only the Wikipedia links. It then uses Tasks to crawl those pages and perform a frequency count on them.
- SpiderActor: Given a URL, computes an Index. An Index is a structure that contains word counts and links.
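As a rough illustration of the pieces described above, the sketch below shows one possible shape for an Index, a filter that keeps only Wikipedia links, and a Task-based crawl that adds up word counts. The module name, field names and function names are hypothetical, and `index_fun` stands in for whatever actually downloads and indexes a page.

```elixir
defmodule IndexSketch do
  # An Index as described above: word counts plus the links found on the page.
  defstruct word_counts: %{}, links: []

  # Keeps only absolute links whose host is wikipedia.org or a subdomain of it.
  # Relative links (no host) are dropped in this sketch.
  def wikipedia_links(links) do
    Enum.filter(links, fn link ->
      case URI.parse(link) do
        %URI{host: host} when is_binary(host) ->
          host == "wikipedia.org" or String.ends_with?(host, ".wikipedia.org")

        _ ->
          false
      end
    end)
  end

  # Indexes each URL concurrently with index_fun (url -> %IndexSketch{})
  # and sums the word counts of all the resulting indexes.
  def crawl(urls, index_fun) do
    urls
    |> Task.async_stream(index_fun, timeout: 10_000)
    |> Enum.reduce(%{}, fn {:ok, %__MODULE__{word_counts: counts}}, acc ->
      Map.merge(acc, counts, fn _word, a, b -> a + b end)
    end)
  end
end
```

`Task.async_stream/3` limits concurrency to the number of schedulers by default, which keeps a crawl from opening an unbounded number of connections at once.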