@benjamintanweihao
Created September 26, 2014 11:28
ExWiKiDx - A Wikipedia Indexer in Elixir

This is meant to be an approximate port of Web Words, written in Elixir. I was appalled at the number of external dependencies the original needs (MongoDB, RabbitMQ), so I decided to see if I could come up with something similar without them.

Also, I wasn't smart enough to understand the Scala code – my brain melted and my eyes bled. My apologies to all Scala programmers out there.

How it works

The application takes a Wikipedia URL and crawls the Wikipedia pages linked from it. As it goes through the pages, it calculates word frequencies and presents the results in "real-time" (whatever that means). Stop words (e.g. 'a', 'the', 'of') are ignored.
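The word-frequency step could look roughly like the sketch below. The module name `WordFreq` and the short stop-word list are assumptions made for illustration, not taken from the project.

```elixir
defmodule WordFreq do
  # A tiny, illustrative stop-word list (an assumption); a real list would be longer.
  @stop_words MapSet.new(~w(a an the of and to in is))

  # Returns a map of word => count for `text`, ignoring stop words.
  def count(text) do
    text
    |> String.downcase()
    # Split on anything that is not a lowercase letter, digit or apostrophe.
    |> String.split(~r/[^a-z0-9']+/, trim: true)
    |> Enum.reject(&MapSet.member?(@stop_words, &1))
    |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
  end
end
```

Calling `WordFreq.count("The cat and the hat of the cat")` counts only `"cat"` and `"hat"`, since every other word is in the stop-word list.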

Overview: A Request Step-by-step

  • a web interface (using Phoenix) receives a request to crawl a Wikipedia site.

  • the request is then forwarded to the main ExWiKiDx application. It checks whether the site has been crawled before, using a cache backed by ETS (Erlang Term Storage).

  • if the page has not been cached, the application then sends a request to the Indexer supervisor.

  • the supervisor spawns an Indexer process. The Indexer process then parses the page, returns an Index, and stores it in ETS. Meanwhile, the Indexer sends a message to the web process to inform it that the data is ready.
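The cache check and the "data is ready" message from the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the module name `IndexCache`, the table name `:exwikidx_cache`, the `{:index_ready, url, index}` message shape and the `index_fun` callback are all hypothetical, and a plain unsupervised `Task` stands in for the supervised Indexer process.

```elixir
defmodule IndexCache do
  # Hypothetical table name.
  @table :exwikidx_cache

  # Create the named ETS table once, e.g. at application start.
  # It is :public so that worker processes can insert into it.
  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    :ok
  end

  # Return the cached index for `url`, or compute it with `index_fun`,
  # store it in ETS and return it.
  def fetch(url, index_fun) do
    case :ets.lookup(@table, url) do
      [{^url, index}] ->
        {:cached, index}

      [] ->
        index = index_fun.(url)
        :ets.insert(@table, {url, index})
        {:indexed, index}
    end
  end

  # Index in a separate process and notify `reply_to` (by default the caller,
  # standing in for the web process) when the data is ready.
  def index_async(url, index_fun, reply_to \\ self()) do
    Task.start(fn ->
      {_status, index} = fetch(url, index_fun)
      send(reply_to, {:index_ready, url, index})
    end)
  end
end
```

The web process would call `IndexCache.index_async(url, &some_indexer/1)` and then wait for the `{:index_ready, url, index}` message. Two concurrent requests for the same uncached URL would both index it in this sketch; avoiding that needs extra coordination.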

Actors

  • URLFetcher: Downloads a page given a URL.
  • IndexActor: Indexes a page given a URL. The indexer extracts all the links on the page and keeps only the Wikipedia links. It then uses Tasks to crawl those pages and perform a frequency count on them.
  • SpiderActor: Given a URL, computes an Index. An Index is a structure that contains word counts and links.
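As a rough sketch of the link filtering and Task-based crawling described above: the names `PageIndex` and `LinkCrawler`, the regex-based link extraction, the assumption that Wikipedia links appear as relative `/wiki/...` hrefs, and the `index_fun` callback (which would download and index one page) are all illustrative choices, not the project's actual code.

```elixir
defmodule PageIndex do
  # Hypothetical struct; the description only says an Index holds word counts and links.
  defstruct word_counts: %{}, links: []
end

defmodule LinkCrawler do
  # Extract href values from `html` and keep only those that look like
  # Wikipedia article links (assumed here to start with "/wiki/").
  def wikipedia_links(html) do
    ~r/href="([^"]+)"/
    |> Regex.scan(html, capture: :all_but_first)
    |> List.flatten()
    |> Enum.filter(&String.starts_with?(&1, "/wiki/"))
    |> Enum.uniq()
  end

  # Merge two word-count maps by adding the counts of shared words.
  def merge_counts(a, b), do: Map.merge(a, b, fn _word, x, y -> x + y end)

  # Index every link in its own Task, then fold the per-page counts together.
  # `index_fun` takes a link and returns a %PageIndex{}.
  def crawl(links, index_fun) do
    links
    |> Enum.map(fn link -> Task.async(fn -> index_fun.(link) end) end)
    |> Enum.map(&Task.await/1)
    |> Enum.reduce(%{}, fn %PageIndex{word_counts: counts}, acc ->
      merge_counts(acc, counts)
    end)
  end
end
```

A regex is enough for a sketch, but a real HTML parser would be more robust against unusual markup. `Task.await/1` uses its default 5-second timeout, so slow pages would need a longer one.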