@chrismedrela
Last active March 15, 2016 00:14
"Improving numerical routines in Scala Breeze" GSoC 2014 proposal.

Abstract

Breeze is a great numerical processing library. However, it lacks some high-level functions that you can find in other libraries like SciPy. The second issue is that Breeze lacks documentation. This makes the entry barrier higher for new contributors.

My proposal is to revamp the documentation and to introduce interpolation and integration facilities.

My main principle will be to lower the entry barrier as much as possible and to get people excited about Breeze so that it gains many new contributors. In my opinion, reaching critical mass is the most important thing at this moment.

Improving documentation

At the beginning, I'm going to improve the documentation of the existing Breeze modules and then revamp the tutorial. This is a chance for me to get to know the internals of Breeze. I'm going to spend about 3-4 weeks on this step.

I believe that good documentation consists of three components:

  • A step-by-step tutorial shows the key concepts (like vectors and matrices) and how Breeze "feels"; it should be quick and easy.
  • Topical guides are for those who have read the tutorial. In Breeze there is a natural one-to-one relation between modules and topics, so this is equivalent to high-level module documentation.
  • A low-level, deep-dive reference is included at the end of each topical guide.

If time permits, I will investigate whether it's possible to treat code snippets in the documentation as doctests. Introducing doctests means that we won't have to worry about out-of-date snippets anymore.

The main Breeze page

The main page, that is, the first page users see, should be as short as possible -- everything that can be moved to other pages (e.g. installation and contributors) will be moved out. People don't have time to read minor details; first they need to know whether Breeze is what they are looking for, and to get excited about the project.

The first paragraph should contain the most important information like:

  • what Breeze is;
  • why you should care about it;
  • why it's worth the effort to learn it;
  • what license Breeze uses;
  • what the latest release is;
  • what the other components are (breeze-viz, breeze-learn, breeze-process).

The second paragraph will be a set of links. Everything should be easy to reach, ideally in two steps -- go to the home page and find the appropriate link. There shouldn't be too many links (installation, contribution; tutorial and full documentation; bug tracker and source code -- both on GitHub).

The next paragraph should show the power of Breeze and get people excited. So a very simple way to play with Breeze is a must:

$ sbt
set libraryDependencies ++= Seq("org.scalanlp" % "breeze_2.10" % "0.7-SNAPSHOT")
set libraryDependencies ++= Seq("org.scalanlp" % "breeze-natives_2.10" % "0.7-SNAPSHOT")
set resolvers ++= Seq("Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/")
set resolvers ++= Seq("Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/")
set scalaVersion := "2.10.3"
console

Then we will show the most amazing things you can do in Breeze in just a few lines of code. This part will consist mostly of code examples, with no deep descriptions. The first impression is very important. At the end, there will be a link to the tutorial.

Switching to GitHub Pages and Jekyll

The current documentation consists of wiki pages on GitHub. This causes two problems. First, the wiki and the code live in separate repositories, so there are two different workflows for working on documentation and on code. There is also no association between the two, so you don't know which version of the code a given version of the documentation describes.

The second problem is that people would get more excited if Breeze had its own website instead of using GitHub wiki pages. There is already an initial effort for Breeze and Epic (see the scalanlp webpage). My goal would be to enhance it and to move all documentation to this site.

GitHub Pages is a good choice because it's free and integrated with Jekyll. That means the HTML pages of the documentation will be generated directly from Markdown files. The technology is also very easy to use.

Improving numerical routines

After revamping the existing documentation I will focus on the main part of this proposal -- introducing interpolation and integration modules.

My goal is not to implement every possible facility. Instead, for each family of algorithms I will implement only one algorithm and design an interface that all algorithms from that family must fulfill. Together with good documentation, this will lower the entry barrier. I find this better in the long term because, at this moment, attracting more contributors matters more than completeness. The new contributors will implement the other algorithms.

I'm going to write documentation and/or tests before the implementation so that other people can see what the API will look like and can comment on and discuss it.

This project carries little risk: the code can be merged into the master branch after each family of algorithms is implemented. I find an iterative approach very suitable for this project.

I'm currently implementing linear interpolation, so you can get a feel for what I'd like to do this summer.

Brief plan

I will start by implementing interpolation. Univariate linear interpolation is already in progress. Then I will implement 1-d splines of degree at most 3. After that, I will move on to multivariate interpolation and implement both n-d linear interpolation and 2-d splines (of degree at most 3). If time permits, I will also implement other interpolators, such as the barycentric and Krogh ones.
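
To make the first step concrete, here is a minimal sketch of univariate linear interpolation behind a shared interface. It uses plain Scala collections rather than Breeze vectors, and the names `UnivariateInterpolator` and `LinearInterpolator` are hypothetical; they illustrate the "one interface per family" idea and are not Breeze's actual API. The sketch assumes strictly increasing nodes and rejects points outside the node range instead of extrapolating.

```scala
// Hypothetical interface for the univariate interpolator family
// (an illustration only, not Breeze's actual API).
trait UnivariateInterpolator {
  def apply(x: Double): Double
}

// Piecewise-linear interpolation through the nodes (xs(i), ys(i)).
// Assumes xs is strictly increasing; points outside [xs.head, xs.last] are rejected.
class LinearInterpolator(xs: IndexedSeq[Double], ys: IndexedSeq[Double])
    extends UnivariateInterpolator {
  require(xs.length == ys.length, "xs and ys must have the same length")
  require(xs.length >= 2, "at least two nodes are needed")
  require(xs.indices.tail.forall(i => xs(i - 1) < xs(i)), "xs must be strictly increasing")

  def apply(x: Double): Double = {
    require(x >= xs.head && x <= xs.last, s"$x is outside the interpolation range")
    // Index of the first node that is >= x; the enclosing segment is [xs(hi - 1), xs(hi)].
    val hi = math.max(1, xs.indexWhere(_ >= x))
    val lo = hi - 1
    // Relative position of x within the segment, between 0 and 1.
    val t = (x - xs(lo)) / (xs(hi) - xs(lo))
    ys(lo) + t * (ys(hi) - ys(lo))
  }
}
```

For example, with nodes (0, 0), (1, 2) and (3, 6), evaluating the interpolator at 0.5 gives 1.0 and at 2.0 gives 4.0. A spline interpolator added later would implement the same trait, so calling code would not need to change.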

For the rest of the time I will focus on integration. Again, I will start with single integrals: I will implement the trapezoid and Simpson methods with equidistant nodes. Then I will move on to n-d integrals and implement the Monte Carlo method.
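
The three methods named above can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: the names `UnivariateIntegrator`, `Trapezoid`, `Simpson` and `MonteCarlo` are hypothetical and not part of Breeze, the Monte Carlo variant is the plain one (uniform sampling over a box, no variance reduction), and the default seed is arbitrary.

```scala
import scala.util.Random

// Hypothetical interface for the single-integral family (not Breeze's actual API).
trait UnivariateIntegrator {
  def integrate(f: Double => Double, a: Double, b: Double): Double
}

// Composite trapezoid rule on n equal subintervals of [a, b].
class Trapezoid(n: Int) extends UnivariateIntegrator {
  require(n >= 1, "n must be positive")
  def integrate(f: Double => Double, a: Double, b: Double): Double = {
    val h = (b - a) / n
    val interior = (1 until n).map(i => f(a + i * h)).sum
    h * (0.5 * (f(a) + f(b)) + interior)
  }
}

// Composite Simpson rule on n equal subintervals; n must be even.
class Simpson(n: Int) extends UnivariateIntegrator {
  require(n >= 2 && n % 2 == 0, "n must be a positive even number")
  def integrate(f: Double => Double, a: Double, b: Double): Double = {
    val h = (b - a) / n
    val odd = (1 until n by 2).map(i => f(a + i * h)).sum   // weight 4
    val even = (2 until n by 2).map(i => f(a + i * h)).sum  // weight 2
    h / 3 * (f(a) + 4 * odd + 2 * even + f(b))
  }
}

// Plain Monte Carlo estimate of the integral of f over the box
// [lower(0), upper(0)] x ... x [lower(d-1), upper(d-1)].
object MonteCarlo {
  def integrate(f: IndexedSeq[Double] => Double,
                lower: IndexedSeq[Double],
                upper: IndexedSeq[Double],
                samples: Int,
                rng: Random = new Random(42)): Double = {
    require(lower.length == upper.length, "bounds must have the same dimension")
    require(samples > 0, "samples must be positive")
    val volume = lower.indices.map(i => upper(i) - lower(i)).product
    var sum = 0.0
    var k = 0
    while (k < samples) {
      // Draw one point uniformly from the box.
      val point = lower.indices.map(i => lower(i) + rng.nextDouble() * (upper(i) - lower(i)))
      sum += f(point)
      k += 1
    }
    // Mean value of f times the volume of the box.
    volume * sum / samples
  }
}
```

The trapezoid rule is exact for linear integrands and Simpson's rule for polynomials up to degree 3, which gives convenient test cases; the Monte Carlo estimate is random, so tests for it need a tolerance rather than an exact value.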

If time permits, I will also focus on enhancing existing modules like signal processing, optimization and statistical functions.

About me

My name is Christopher Mędrela and I'm a student at the University of Science and Technology in Kraków (Poland). My time zone is UTC+01:00. My email address is chris.medrela+gsoc2014 at gmail.com. I have a GitHub account: https://github.com/chrismedrela.

I have been contributing to open source projects since 2011, mainly to Django, for which I've written a lot of patches. Last year I participated in GSoC and successfully revamped the Django check framework (proposal, merge).

I'm fluent in Python, so I can easily comprehend SciPy. I'm interested in other languages too. I first encountered Scala about a year ago. Before switching to Python, I was coding in Java, so Scala is not a completely new language for me. I'm familiar enough with Scala to manage this project, and the branch where I'm implementing linear interpolation proves that.

I can use the tools necessary to manage this project. During the last GSoC I mastered git. This proposal is written in Markdown, so I have already started using it. I know the basics of sbt; otherwise I couldn't have written the pull request.

During the last GSoC it turned out that my English is good enough for real-time conversation, although it's not fluent.

I'd like to internally shift the GSoC dates to start on 21 April (one month earlier) and finish after 12 weeks. Google said: "We don't police what you deliver to your org and when, simply that you meet the milestones of the program as laid out." Since we will do everything earlier, the deadlines are not a problem. David Hall doesn't object to this either. The reason for the shift is that I'd like to do an internship in the late summer, and this is the only way I can avoid a clash with GSoC.

During GSoC, I'm not going to have a job, holidays, or any other time-consuming activity except for classes at university. I'm not going to apply for internships that would collide with GSoC. I'd like to reserve one week to prepare for exams (23 - 27 June), so I will be able to work on GSoC for eleven weeks.

@dlwh

dlwh commented Mar 17, 2014

I think there's nothing to worry about. If you successfully completed a GSoC doing basically the same thing last year, I see no problem.

@hubertp

hubertp commented Mar 20, 2014

How many and what exams do you have (Polish names are ok, I can understand them)? Which year are you currently in? I just want to get a rough idea on the workload (apart from the one you already provided).

@chrismedrela
Author

I'm a second-year student (major: Automatic Control and Robotics) and I will have three exams this term:

  • Basics of Automatic Control -- Podstawy Automatyki
  • Electrical Engineering -- Elektrotechnika
  • Automation Apparatus (no idea how to translate it into English) -- Aparatura Automatyzacji
