ExWiKiDx - A Wikipedia Indexer in Elixir

This is meant to be an approximate port of Web Words, written in Elixir. I was appalled at the number of external dependencies the original needed (MongoDB, RabbitMQ), so I decided to see if I could come up with something similar without them.

Also, I wasn't smart enough to understand the Scala code – my brain melted and my eyes bled. My apologies to all Scala programmers out there.

How it works

The application takes a Wikipedia URL and crawls other Wikipedia pages from there. As it goes through the pages, it calculates word frequencies and presents the results in "real-time" (whatever that means). Stop words (e.g. 'a', 'the', 'of') are ignored.
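To make the word-frequency step concrete, here is a minimal sketch of counting words while skipping stop words. The module name `WordCountSketch`, the tiny stop-word list, and the tokenizing regex are all assumptions for illustration and are not taken from the actual ExWiKiDx code. It also assumes Elixir 1.10 or later for `Enum.frequencies/1`.

```elixir
defmodule WordCountSketch do
  # Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
  @stop_words MapSet.new(~w(a an the of and to in is))

  # Returns a map of word => count, ignoring case and stop words.
  def frequencies(text) do
    text
    |> String.downcase()
    # Split on anything that is not a lowercase letter, digit, or apostrophe.
    |> String.split(~r/[^a-z0-9']+/, trim: true)
    |> Enum.reject(&MapSet.member?(@stop_words, &1))
    |> Enum.frequencies()
  end
end
```

Calling `WordCountSketch.frequencies("The cat sat on the mat. The CAT!")` counts "cat" twice and drops every "the".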

Overview: A Request Step-by-step

A web interface (using Phoenix) receives a request to crawl a Wikipedia site.
The request is then forwarded to the main ExWiKiDx application, which checks whether the site has already been crawled. It does this via a cache that uses ETS (Erlang Term Storage).
If the page has not been cached, the application sends a request to the Indexer supervisor.
The supervisor spawns an Indexer process. The Indexer process parses the page, builds an Index, and stores it in ETS. It then sends a message to the web process to say that the data is ready.
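The cache check in the steps above could look roughly like the sketch below. The module name `CacheSketch`, the table name, and the `indexer` callback (a stand-in for whatever the Indexer process does) are assumptions, not the real ExWiKiDx API. It assumes OTP 21 or later for `:ets.whereis/1`, and it leaves out the supervisor and the message back to the web process.

```elixir
defmodule CacheSketch do
  # Hypothetical table name, chosen only for this sketch.
  @table :exwikidx_cache_sketch

  # Creates the named ETS table if it does not exist yet.
  # Note: the table is owned by the calling process and dies with it.
  def init do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:set, :public, :named_table])
    end

    :ok
  end

  # Looks up a previously stored index for the URL.
  def fetch(url) do
    case :ets.lookup(@table, url) do
      [{^url, index}] -> {:ok, index}
      [] -> :miss
    end
  end

  # Stores the index under the URL, replacing any older entry.
  def put(url, index) do
    true = :ets.insert(@table, {url, index})
    :ok
  end

  # Returns the cached index, or runs `indexer` on a miss and caches its result.
  def fetch_or_index(url, indexer) when is_function(indexer, 1) do
    case fetch(url) do
      {:ok, index} ->
        {:cached, index}

      :miss ->
        index = indexer.(url)
        put(url, index)
        {:indexed, index}
    end
  end
end
```

The first call for a URL returns `{:indexed, index}`; later calls return `{:cached, index}` without invoking the callback again.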

Actors

URLFetcher : Downloads a page given a URL.
IndexActor : Indexes a page given a URL. The indexer extracts all the links on the page and keeps only the Wikipedia links. It then uses Tasks to crawl those pages and perform a frequency count on them.
SpiderActor : Given a URL, computes an Index. An Index is a structure that contains word counts and links.
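As a rough illustration of the Index structure, the link filtering, and the Task-based crawling described above, here is a small sketch. The names `CrawlSketch`, `wikipedia_links/1`, and `crawl/2` are hypothetical, and so are the struct fields. The actual fetching and parsing is passed in as `index_fun` so the sketch needs no HTTP client.

```elixir
defmodule CrawlSketch do
  defmodule Index do
    # Word counts and outgoing links for one page (field names are assumptions).
    defstruct words: %{}, links: []
  end

  # Keeps only absolute links whose host is wikipedia.org or a subdomain of it.
  def wikipedia_links(links) do
    Enum.filter(links, fn link ->
      case URI.parse(link) do
        %URI{host: host} when is_binary(host) ->
          host == "wikipedia.org" or String.ends_with?(host, ".wikipedia.org")

        _ ->
          false
      end
    end)
  end

  # Indexes each URL in its own task. `index_fun` is a hypothetical function
  # that takes a URL and returns an %Index{}. Tasks that time out or return
  # something else are dropped from the result.
  def crawl(urls, index_fun) when is_function(index_fun, 1) do
    urls
    |> Task.async_stream(index_fun,
      max_concurrency: 4,
      timeout: 10_000,
      on_timeout: :kill_task
    )
    |> Enum.flat_map(fn
      {:ok, %Index{} = index} -> [index]
      _ -> []
    end)
  end
end
```

`Task.async_stream/3` caps the number of pages fetched at once, which seems a sensible default for a crawler; the concurrency and timeout values here are arbitrary.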

benjamintanweihao/exwikidx.markdown
