Sameer Deshmukh
Email - [email protected]
GitHub - @v0dro
Twitter - @v0dro
Blog - v0dro.github.io
I am the author and maintainer of daru, a Ruby DataFrame library for data analysis and manipulation. I have been contributing to Ruby's scientific ecosystem by working on libraries maintained under the Ruby Science Foundation (SciRuby) for more than a year. I have worked extensively on nmatrix and rb-gsl, and have made minor contributions to several others.
I love speaking at conferences and have delivered several talks about my work on using Ruby for scientific computing and data analysis at Ruby conferences around the world, among them Deccan Ruby Conf 2015, Red Dot Ruby Conf 2016 and Ruby Kaigi 2016. I am also scheduled to speak on 'Scientific Computation in Ruby' at Ruby World Conference 2016. I received a travel grant from SciRuby for Ruby Kaigi and am receiving another travel grant from India's Josh Software and Emerging Technologies Trust for the Ruby World Conference.
I was selected as a Google Summer of Code 2015 student under SciRuby to further develop daru and integrate it with other important SciRuby projects like statsample, statsample-glm, statsample-timeseries, rb-gsl, nmatrix etc. I was the admin for SciRuby for GSoC 2016 and also co-mentored a student to further improve daru. I received the Ruby Association Grant in 2015 to build an interface between nmatrix and FFTW, and also integrated rb-gsl with nmatrix. You can read my grant accomplishment report here.
In my third year at college I co-authored a research paper on 'Automatic Speech Recognition of Marathi Consonants' (Marathi is my mother tongue). The paper has been published by IEEE. You can see it here.
I will be graduating in the first week of November and am not taking up a job since I want to pursue a master's degree (which will start in September 2017). Therefore, I will be working on Rubex full time until the grant period is over.
Rubex - A new language for writing Ruby extensions.
In the past few months, Ruby has experienced a spurt of scientific and data analysis libraries like daru, nmatrix, numo, statsample, tensorflow.rb, spice_rub etc. In a very high level language like Ruby, these kinds of libraries have the important requirement of being able to interface with high speed libraries written in C, like GSL, ATLAS, FFTW, etc.
For interfacing C libraries with Ruby, authors can use libraries like FFI or Rice for directly creating Ruby interfaces to C libraries from Ruby code itself. But the problem with FFI or Rice is that they are often not very useful for creating complex C interfaces that need a lot of 'glue' code or use exotic data types.
Thus, the only alternative left for serious external library interfacing is manually writing C code, wrapping it as an extension with the CRuby C API, and then calling these externally written functions from Ruby. The library author is thus forced to write code in C.
C is a very low level language and writing C extensions for Ruby poses the following challenges:
- Difficult and irritating to write.
- Time consuming to debug.
- Tough to trace memory leaks.
- Change of mindset from high level to low level language.
- Familiarity with MRI C API.
- Changes in the MRI C API break the whole gem.
- Need to care about small things.
Having worked with scientific libraries for more than a year, I have gained ample experience in this domain and have had to work with many C extensions. In my opinion, the best way to be able to easily write C extensions for Ruby is to create a new language, called rubex (stands for RUBy EXtension language), that will allow one to write C extensions without leaving the comfort and usefulness of Ruby.
The aim of rubex will be only to make it easier to write CRuby C extensions. It will not focus on making Ruby faster through C extensions that look like Ruby (the way Cython or Pyrex do for Python), since speed has already become a very important concern for the Ruby core developers, as is reflected by their Ruby 3x3 agenda and the new Guild multi-threading interface.
There already exist such languages whose sole job is to make it easier to write extensions for other high level languages. Pyrex and Cython are prime examples of such languages written for the Python Programming Language.
There has been some effort from the Ruby community to make it easier to write C extensions. I have been talking with SciRuby developers for developing such a language for a while now.
To cite some concrete examples, the now defunct ruby2c project (presentation) was dedicated to generating C code from Ruby code. The authors focused on generating CRuby API-compatible C code that would run faster than the equivalent pure Ruby code since C is a low level language. However, the full Ruby syntax was never implemented and the project is still in beta phase. Also, this project focuses more on creating a sort of a 'workaround' for making Ruby fast by compiling it to C. It is not very concerned with writing C extensions. Speed is a matter of concern when releasing Ruby 3.0 and in my opinion, this project would soon be in a position of being phased out because Ruby itself will be very fast.
Another somewhat similar project that has tried to build a bridge between Ruby and C is the rubyinline project. Rubyinline makes it very simple to embed C code in any Ruby script. This somewhat reduces the complexity of writing a C extension since the user does not need to configure the extconf or even follow the conventions of the CRuby interpreter. But this approach again forces the programmer to code in C, and leads to a lack of consistency and to toolchain friction since the C code is embedded in the Ruby code. Moreover, rubyinline is meant for porting small CPU-intensive portions of code to C so that they execute faster than Ruby code. It is not meant for the creation of large projects that depend on C extensions, nor does it have any explicit mechanism for interfacing Ruby with external C libraries. If Ruby data types are to be used in these C code snippets, it is necessary to know the CRuby C API. Debugging when something goes wrong is also tough.
Ricsin is another project, by Mr. Koichi Sasada, similar to Rubyinline above. It also lets you embed C snippets within Ruby code, but Ricsin takes this a step further by letting users specify C code inside Ruby loops and conditionals by simply passing a string of C code to a designated Ricsin method. This simplifies things greatly, but it still has the overhead of having to know the CRuby C API. Ricsin does not have facilities for interfacing with external C libraries and also does not provide a way to specify linking options in case a binary file is to be linked with a program during compilation. Moreover, the project seems very old and not actively maintained.
In all of the above examples, the thing that stands out is that users have to resort to writing C code, remembering the C API and manually linking to external C libraries if they want to do some serious interfacing with external C libraries.
Rubex will solve this problem and make things extremely straightforward and simple.
I have been contributing to Ruby's scientific computing ecosystem for more than a year with the Ruby Science Foundation. C extensions are very important for scientific computing since most of the heavy lifting involving numerical calculations between thousands of numbers needs to happen in native code, using battle tested and highly optimized libraries like ATLAS, LAPACK, GSL, FFTW, etc.
There are many places here where Rubex can be potentially used. In this section I will list out a few places where a dedicated language for writing C extensions was sorely missed, or where it can find some very interesting applications.
Using Rubex for interfacing with ATLAS/LAPACKE in NMatrix
NMatrix is a linear algebra library for CRuby. It stores data in the form of fixed-length C arrays and internally uses the ATLAS, LAPACKE, GSL and FFTW libraries for fast computations of various functions. NMatrix is a very large project and is mostly written in C and C++. The complexity of this project is huge, especially for newcomers.
I have worked extensively on NMatrix and also wrote the nmatrix-fftw plugin that interfaces NMatrix data types with the FFTW library, and what I noticed is that it would have been a lot simpler to write the plugin if a language like Rubex were present. For example, have a look at this function. It is a function that is called directly from the Ruby runtime. Most of its body is dedicated to type conversions and data allocation. Hence a lot of effort has actually gone into simply creating an interface between C and Ruby compared to the actual effort of interfacing the FFTW function for creating the plan. This effort could have been reduced with a Rubex function that is callable from Ruby. The function would simply need to declare the data types that the data is supposed to be in and the Rubex compiler would implicitly convert the Ruby objects to the specified data types. So the above nm_fftw_create_plan function in Rubex would be:
def nm_fftw_create_plan(
  const int* rb_shape, int rb_size, int rb_dim, int rb_flags, int rb_direction,
  int rb_type, rb_real_real_kind
)
  # Tentative abstraction over ALLOC and ALLOC_N macros.
  fftw_data* data = Rubex.allocate(data)
  # This is a Rubex pure C function.
  nm_fftw_actually_create_plan(data, rb_size, rb_dim, rb_shape,
    rb_direction, rb_flags, rb_type, rb_real_real_kind)
  # Each member of the fftw_data struct will be a member of a Ruby Hash that will
  # be returned to the Ruby script that calls this method.
  return data
end
It would also be very helpful if the NMatrix C API can be wrapped with Rubex so that it becomes more accessible.
Using Rubex to speed up computations with daru
Daru is a DataFrame library for Ruby (similar to pandas in Python) written by me. It works by storing data in a primary 2D data structure called the Daru::DataFrame. Currently daru is a pure Ruby library, but many methods in daru take much longer to execute than their counterparts in pandas simply because they do not interface with any high speed C libraries for the major heavy lifting.
Most of the statistical analysis requirements of daru are satisfied by libraries like statsample and statsample-glm. For this example, let me elaborate on a regression performed with the help of statsample-glm. Statsample-glm is currently too slow for practical usage since it cannot scale to large data sets. Resolving this issue is critical if the gem is to be widely adopted.
One of the ways in which the regression can be made fast is by performing all the intensive calculations in native code. We are currently trying to achieve high performance by using NMatrix, however that approach isn't working out too well since we inevitably end up converting data from Ruby to C and vice versa. One possible solution is to use a C extension that can directly access the NMatrix data and perform the computations on it in C. The effort required for this can be significantly reduced if a language like Rubex is employed.
Rubex is basically a superset of the Ruby Programming Language that will allow users to write Ruby C extensions for the CRuby interpreter without having to leave the comfort of Ruby.
When writing C extensions, it is very important to be able to support data types and easily interface with C libraries. Thus, rubex will require a special syntax for this purpose. Instead of reinventing the wheel and creating a new syntax for specifying data types and other C-level constructs like arrays, the wisest thing to do would be to adopt the syntax of another Ruby-like language that supports data types, i.e. Crystal.
The Crystal project has been around for ~5 years and has made considerable progress in that time. The developers were ex-Rubyists who basically wanted the succinctness and expressiveness of Ruby combined with the type-checking and speed of C. Crystal also features a very elegant syntax for calling functions from other C libraries.
To summarize, the objectives of the rubex project are as follows:
- Provide a Ruby-like syntax for writing C extensions.
- Completely abstract the CRuby C API from the user.
- Allow Ruby code to co-exist with rubex-specific syntax.
- Ultimately generate C code for writing Ruby extensions.
Rubex aims to make writing C extensions as intuitive as writing Ruby code. A very simple example would be a recursive implementation of a function that computes the factorial of a given number. The method for this is called factorial and the class in which it resides is called Fact. The code in rubex would look like this:
class Fact
  def factorial(i64 n)
    return (n > 1 ? n*factorial(n-1) : 1)
  end
end
The rubex compiler will compile this code into the equivalent C code, and also make the appropriate calls to the CRuby C API which will perform interconversions between Ruby and C data types and register the class and its instance method with the interpreter using the appropriate calls to rb_define_class() and rb_define_method().
Making C extensions in this manner will greatly simplify the process, and allow for more succinct and readable extensions that look very similar to Ruby. To see exactly how simple writing extensions will become, the current way of writing the same factorial function in the Fact class would look something like this with pure C code:
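#include <ruby.h>

/* Plain C helper: computes n! recursively. */
static int
calc_factorial(int n)
{
  return (n > 1 ? n*calc_factorial(n-1) : 1);
}

/* Method body for Fact#factorial: unboxes the argument, boxes the result. */
static VALUE
cfactorial(VALUE self, VALUE n)
{
  return INT2NUM(calc_factorial(NUM2INT(n)));
}

/* Called by the interpreter when the extension is required. */
void Init_factorial(void)
{
  VALUE cFact = rb_define_class("Fact", rb_cObject);
  rb_define_method(cFact, "factorial", cfactorial, 1);
}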
Now imagine growing this to solving a non-trivial problem, and the benefits imparted by rubex in terms of productivity and simplicity become increasingly apparent. Users will simply need to call a command or rake task that will generate the relevant C code and create a shared object binary, that can be then imported into any Ruby program with a call to require.
Here's an example of a somewhat more complex rubex method:
def adder(n)
  # declare a C array of int 32 data type and size n
  a = StaticArray(i32, n)
  i32 i = 0
  i32 sum = 0
  n.times { |p| a[p] = p*5 }
  for 0 <= i < n do
    sum += a[i]
  end
  sum
end
The Rubex syntax is defined in the Rubex README.
In this section I will elaborate on how I plan to go about with the implementation of the rubex compiler, including the tools and algorithms that I will use. The whole rubex compiler will be written in pure Ruby and be distributed over rubygems.org so that it becomes simple for users to install and use.
I have already started making a proof-of-concept implementation of Rubex and it can be seen on this repo.
There are two major components for the implementation of the rubex compiler:
- Lexical analysis and parsing.
- Code generation.
For the lexical analysis part, I will use the Oedipus Lex gem. The gem is under active maintenance and sports a syntax similar to that of flex.
For parsing, I will use the racc gem. It is an LALR(1) parser generator. Racc is also a very widely used and stable gem.
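As a rough illustration of what the lexing stage has to produce, here is a minimal hand-written tokenizer built on Ruby's standard StringScanner. It is only a stand-in and not Oedipus Lex: the real lexer would be generated from a lexer specification, and the token names used here (:kDEF, :tDTYPE and so on) are hypothetical.

```ruby
require 'strscan'

# Hypothetical token rules for a tiny subset of Rubex. Order matters:
# keywords and type names are tried before generic identifiers.
TOKEN_RULES = [
  [:kDEF,    /def\b/],
  [:kEND,    /end\b/],
  [:kRETURN, /return\b/],
  [:tDTYPE,  /(?:i32|i64|f32|f64|char)\b/],
  [:tIDENT,  /[a-z_][A-Za-z0-9_]*/],
  [:tINT,    /\d+/],
  [:tOP,     /[-+*\/(),=]/]
].freeze

# Returns an array of [type, text] pairs. A racc parser pulls pairs of
# this shape, one at a time, from its next_token method.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)

    # find stops at the first rule whose pattern matches at the current
    # position; a successful scan also advances the scanner.
    type, = TOKEN_RULES.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless type

    tokens << [type, scanner.matched]
  end
  tokens
end
```

Calling tokenize("def adder(i32 a, i32 b)") yields pairs such as [:kDEF, "def"], [:tIDENT, "adder"] and [:tDTYPE, "i32"], which the grammar rules can then match against.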
Code generation is the most important and complex part of the process, involving writing algorithms that will take the output generated by the parser and translate that to the equivalent C function calls.
For leads on the code generation process I have gone through the Pyrex code (Pyrex is the predecessor of Cython), and also the source of the parser gem and have found many useful things. I chose Pyrex since it does a lot of the work that the first few versions of rubex will be doing and the code base is relatively smaller and more understandable than Cython.
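To make the code generation step more concrete, the following is a minimal sketch of a generator that turns a single method description into the C source of an extension. Everything about its shape is an assumption made for illustration: MethodNode, generate_extension and the rubex_ function prefix are hypothetical names, and the body is passed in as an already translated C expression. Only the C API names it emits (NUM2INT, INT2NUM, NUM2LL, LL2NUM, rb_define_method, rb_cObject) are real.

```ruby
# Hypothetical AST node for a Rubex method; args is an array of
# [rubex_type, name] pairs.
MethodNode = Struct.new(:name, :args, :return_type, :body_expr)

# C type and Ruby C API conversion macros for each supported Rubex type.
C_TYPES = {
  'i32' => { ctype: 'int',       from_ruby: 'NUM2INT', to_ruby: 'INT2NUM' },
  'i64' => { ctype: 'long long', from_ruby: 'NUM2LL',  to_ruby: 'LL2NUM'  }
}.freeze

def generate_extension(ext_name, node)
  # Every argument arrives from Ruby as a VALUE and is unboxed into a
  # local C variable carrying the name the user wrote.
  params = node.args.map { |_, name| "VALUE rb_#{name}" }
  locals = node.args.map do |type, name|
    info = C_TYPES.fetch(type)
    "  #{info[:ctype]} #{name} = #{info[:from_ruby]}(rb_#{name});"
  end
  to_ruby = C_TYPES.fetch(node.return_type)[:to_ruby]

  <<~C
    #include <ruby.h>

    static VALUE
    rubex_#{node.name}(#{(['VALUE self'] + params).join(', ')})
    {
    #{locals.join("\n")}
      return #{to_ruby}(#{node.body_expr});
    }

    void Init_#{ext_name}(void)
    {
      rb_define_method(rb_cObject, "#{node.name}", rubex_#{node.name}, #{node.args.size});
    }
  C
end
```

For example, generate_extension('adder', MethodNode.new('adder', [['i32', 'a'], ['i32', 'b']], 'i32', 'a + b')) returns a string containing a rubex_adder function that unboxes both arguments and returns INT2NUM(a + b).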
The grant period starts from 23rd October and goes on till the 10th of March. I expect to cover a lot of ground over these 4 months. By the end of the grant period v0.1 of Rubex will be released.
In my opinion, the most important use of rubex is for writing C extensions for interfacing with external C libraries. Keeping that in mind, functionality that allows us to do this will be given priority over everything else. Following is the functionality that I think should be given priority for this purpose:
- Being able to create functions callable from external Ruby code.
- Create rake tasks for easily compiling Rubex code to C code.
- Declaration and translation of primitive C types.
- Translate basic Ruby-style expressions to C code.
- Encapsulating methods in classes and modules.
- Ruby-style if-elsif-else blocks.
- Creating C static arrays.
- Looping support with Rubex's own syntax for for-loops and while loops.
- Support for struct, enum, union and typedefs.
- Interface for building C extensions.
The milestones listed above are only the most important parts of Rubex and not the full functionality, since it will be too much to achieve during the grant period.
For achieving each milestone, I will first make improvements to the lexer and parser, then make improvements to the Abstract Syntax Tree (AST) generation, and then finally the code generation. This process will be repeated for every milestone and hence the project will be built incrementally and adequate tests will be written at each phase. I have started reading the Language Implementation Patterns book to get a better understanding of implementing a language.
The milestones for the project will be as follows:
For the mid term review, the following functionality of Rubex will be complete:
- Being able to create functions callable from external Ruby code.
- Create rake tasks for easily compiling Rubex code to C code.
- Declaration and translation of primitive C types.
- Translate basic Ruby-style expressions to C code.
I will now elaborate on each of the above:
Being able to create functions callable from external Ruby code.
The first milestone will help the project get off the ground. I have already started with the programming of this (see this commit). I will also include a very basic expression conversion logic in the code generator so that a useful method can be created. At the end of this first milestone the compiler will be able to convert the following code snippet into C code (and interface it with the Ruby interpreter):
def adder(i32 a, i32 b)
  return a + b
end
For the purpose of creating a basic prototype, I will support the 32 bit integer data type (i32). Type conversions or declarations will not be handled at this stage. However, since these kinds of functions must return Ruby objects, it is important to keep track of the data types of the variables that are being returned so that the appropriate function/macro from the Ruby C API can be used for converting them to the correct Ruby data type (like INT2NUM(a+b) in the above snippet). A symbol table will be introduced for this purpose. Since functionality for encapsulating methods into classes is not yet complete, this function will be loaded into Ruby's Object class.
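The role of the symbol table can be sketched in a few lines of Ruby. This is only an illustration of the idea, assuming a single flat scope; the class name, its methods and the 'object' marker for untyped variables are hypothetical, while INT2NUM, LL2NUM and DBL2NUM are the actual Ruby C API macros.

```ruby
# Minimal symbol table: remembers the declared Rubex type of each
# variable and picks the macro needed to turn a C value of that type
# into a Ruby object.
class SymbolTable
  TO_RUBY = {
    'i32' => 'INT2NUM',
    'i64' => 'LL2NUM',
    'f64' => 'DBL2NUM'
  }.freeze

  def initialize
    @types = {}
  end

  def declare(name, type)
    raise ArgumentError, "#{name} already declared" if @types.key?(name)

    @types[name] = type
  end

  def type_of(name)
    @types.fetch(name) { raise KeyError, "undeclared variable #{name}" }
  end

  # Wraps a C expression so that it evaluates to a Ruby VALUE. Untyped
  # variables ('object') already are VALUEs and pass through unchanged.
  def to_ruby(expr, type)
    type == 'object' ? expr : "#{TO_RUBY.fetch(type)}(#{expr})"
  end
end

table = SymbolTable.new
table.declare('a', 'i32')
table.declare('b', 'i32')
RETURN_EXPR = table.to_ruby('a + b', table.type_of('a'))
```

Here RETURN_EXPR holds the string INT2NUM(a + b), which is what the generated C function would return for the adder method above.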
Create rake tasks for easily compiling Rubex code to C code.
The above code snippet is useless if it cannot be easily converted to C code. Hence, for this purpose, I will create rake tasks that will easily allow the user to create C extension files from Rubex files. Rubex rake tasks will be of the form rake rubex *, where * is a command for making Rubex do something specific.
A rake task called rake rubex compile will be created first to compile the above code snippet into a usable C extension. This rake task will perform the following jobs:
- Read files with a .rubex extension in a /rubex directory inside the current working directory (pwd) and call the Rubex compiler to convert them to the equivalent C code, all of which will go into a directory called /ext (similar to the structure of current gems with extensions).
- Generate an extconf.rb file that will contain rules using the mkmf interface for compiling this generated C code into a shared object file (.so).
- Run the extconf.rb file, compile and link the generated C extensions, and make a shared object file (.so) that can be used by a Ruby script for calling the relevant Rubex functions (with a call to require).
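The second job in the list above, generating extconf.rb, could look roughly like the sketch below. The helper names and the ext/<name>/ directory layout are assumptions modelled on typical gems with extensions; create_makefile is the standard mkmf entry point.

```ruby
require 'fileutils'

# Returns the text of a minimal extconf.rb for the given extension name.
def extconf_source(ext_name)
  <<~RUBY
    require 'mkmf'

    create_makefile('#{ext_name}')
  RUBY
end

# Writes ext/<name>/extconf.rb under root and returns the path written.
def write_extconf(root, ext_name)
  dir = File.join(root, 'ext', ext_name)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, 'extconf.rb')
  File.write(path, extconf_source(ext_name))
  path
end
```

The third job then amounts to running ruby extconf.rb followed by make inside that directory, which is the usual way a mkmf-based extension is built. A rake task would wrap these steps so that the user never has to run them by hand.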
Thus, the first two milestones will set the stage for the rest of the project.
Adequate testing will be performed with rspec for making sure that the functionality works as expected.
Declaration and translation of primitive C types.
Once a scaffold has been properly created, I will proceed to add more features. I will start by making sure that Rubex has the ability to declare basic C data types. I have listed the supported data types in the project README. This milestone will also include supporting C types in argument lists of functions. A rule of Rubex is that if the data type of a variable is not specified, it is assumed to be of type VALUE (Ruby object). This functionality will also be covered in this milestone. So for example, if a user simply specifies a = 3.4, it will be translated to VALUE a = rb_float_new(3.4).
I will also write functionality for creating simple character literals in this milestone (single ASCII characters enclosed within single quotes) so that the char data type can also be used. In case a character literal is assigned to a variable without a type (i.e. a Ruby object), it will be assigned to the VALUE type with rb_str_new2().
Note that string literals or type conversions will not be handled in this milestone. That is a somewhat complex topic and will be handled later.
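The declaration rules described in this milestone can be sketched as a small translation function. This is an illustration under assumptions, not the planned implementation: the function name and the mapping of Rubex type names to C type names are hypothetical, and only float and single-character literals are handled for untyped variables. rb_float_new and rb_str_new2 are existing Ruby C API functions.

```ruby
# Hypothetical mapping from Rubex type names to C type names.
C_TYPE_NAMES = {
  'i32' => 'int', 'i64' => 'long long',
  'f32' => 'float', 'f64' => 'double', 'char' => 'char'
}.freeze

# Translates one declaration into a C statement. type is nil when the
# user wrote no type, in which case the variable becomes a Ruby VALUE.
def translate_declaration(type, name, literal = nil)
  if type
    decl = "#{C_TYPE_NAMES.fetch(type)} #{name}"
    return literal ? "#{decl} = #{literal};" : "#{decl};"
  end

  case literal
  when /\A\d+\.\d+\z/
    "VALUE #{name} = rb_float_new(#{literal});"
  when /\A'([A-Za-z0-9])'\z/
    # An untyped character literal becomes a one-character Ruby String.
    "VALUE #{name} = rb_str_new2(\"#{Regexp.last_match(1)}\");"
  else
    raise ArgumentError, "unsupported literal: #{literal.inspect}"
  end
end
```

With this sketch, translate_declaration(nil, 'a', '3.4') gives VALUE a = rb_float_new(3.4); while translate_declaration('f32', 'c', '5.432') gives float c = 5.432;.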
Once this milestone has been achieved, users will be able to write the following code:
def adder(i64 a, i64 b)
  f32 c = 5.432
  i32 i, j = 4
  char ch = 'a'
  yy = 't'
  return a + b
end
Translate basic Ruby-style expressions to C code.
I will start by translating various Ruby-like expressions like arithmetic (something like (a+b)*c*d-f), return statements and logic expressions (AND, OR etc.) into the equivalent C code. This will also include initialization of variables at the time of declaration.
Note that advanced features like converting between Ruby types and C types will not be handled at this stage. For example, in an expression a + b, both a and b must be C-compatible data types like integers or floats. Operations between pure C data types will be supported.
Note that Rubex will not have type conversion error detection as of now and such errors on part of the programmer will not be picked up by the compiler.
After this stage, code snippets similar to this one will be supported:
def adder(int left, float right)
  float mid = left - right
  float t = (mid / 3) * (left + right) - left / mid
  float z
  z = t - 3
  return z
end
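The expression translation for this milestone can be pictured as a recursive walk over the parsed expression. The sketch below assumes a hypothetical tree representation (nested three-element arrays) and assumes that Ruby's and/or map to C's && and ||. It parenthesizes every sub-expression, so the emitted C never depends on operator precedence.

```ruby
# Operators that map one-to-one onto C operators in this sketch.
C_OPERATORS = {
  '+' => '+', '-' => '-', '*' => '*', '/' => '/',
  'and' => '&&', 'or' => '||'
}.freeze

# An expression is either a leaf (a variable name or numeric literal)
# or a three-element array [operator, left, right].
def emit_expression(node)
  return node.to_s unless node.is_a?(Array)

  op, left, right = node
  "(#{emit_expression(left)} #{C_OPERATORS.fetch(op)} #{emit_expression(right)})"
end

# Tree for: (mid / 3) * (left + right) - left / mid
TREE = ['-',
        ['*', ['/', 'mid', 3], ['+', 'left', 'right']],
        ['/', 'left', 'mid']].freeze
```

emit_expression(TREE) produces (((mid / 3) * (left + right)) - (left / mid)). One thing the real translator will have to keep in mind is that integer division differs between the two languages: Ruby rounds towards negative infinity, whereas C truncates towards zero.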
For the final term review, the following functionality of Rubex will be complete:
- Ruby-style if-elsif-else blocks.
- Creating C static arrays.
- Looping support with Rubex's own syntax for for-loops and while loops.
- Support for struct, enum, union and typedefs.
- Interface for building C extensions.
Ruby-style if-elsif-else blocks.
Conditional statements are one of the most important building blocks of any language and this will be the first priority for the final review. Conditional blocks in Rubex will look exactly the way they do in Ruby, the only difference being that an if-else block will not return any value the way it does in Ruby.
After this milestone is achieved, use of the following constructs will be possible:
def adder(i32 a, long int c)
  i32 b
  if a > 3
    b = 3
  elsif a < -4
    b = 2
  else
    b = 5
  end
  b = 100 if c + 3 < 400
  if c - 4 == 44 then b = -43 end
  return b
end
Creating C static arrays.
Here I will add support for creating C arrays using the StaticArray() keyword or with the C-style [] operator after a data type.
Looping support with Rubex's own syntax for for-loops and while loops.
I will implement looping support for Rubex during this milestone. Since loops in Ruby involve the use of blocks and Rubex at this stage will not support blocks, these loops will feature a slightly different syntax that will very closely resemble Ruby's looping syntax, but with some caveats. These kinds of loops will exist alongside Ruby-style loops once support for those has been implemented at a later stage.
After this milestone, the following code will be made possible:
def adder(i32 arr_size)
  arr = StaticArray(i32, arr_size)
  i32 i
  for 0 <= i < arr_size do
    arr[i] = i*5
  end
  i = 0
  while i < arr_size do
    arr[i] = i*5
    i += 1
  end
end
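One possible way the range-style for header shown above could be lowered to C is sketched here. The exact translation is an assumption on my part: the sketch works on the raw header text with a regular expression purely for brevity, whereas the real compiler would work on the parsed tree.

```ruby
# Matches headers of the form: for <lower> <=|< <var> <|<= <upper> do
FOR_HEADER = /\Afor\s+(\S+)\s*(<=?)\s*(\w+)\s*(<=?)\s*(\S+)\s+do\z/

def translate_for_header(line)
  match = FOR_HEADER.match(line.strip)
  raise ArgumentError, "not a Rubex for-loop header: #{line}" unless match

  lower, lower_op, var, upper_op, upper = match.captures
  # A strict lower bound (0 < i) means the first value is lower + 1.
  start = lower_op == '<=' ? lower : "#{lower} + 1"
  "for (#{var} = #{start}; #{var} #{upper_op} #{upper}; #{var}++) {"
end
```

For the header used in the example, translate_for_header('for 0 <= i < arr_size do') gives for (i = 0; i < arr_size; i++) {, and the loop body and closing brace would be emitted by the statements that follow.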
Support for struct, enum, union and typedefs.
This milestone will involve creation of the ADTs supported by C, i.e. structs, enums and unions. Once this milestone has been achieved, users will be able to create Abstract Data Types from primitive C types. Functionality for accessing the data within an ADT will also be implemented.
The alias keyword will be introduced during this milestone. It will allow users to create typedefs for custom data types. For example, the data type struct Node can be aliased to ANode by using the alias keyword. ANode can then be used for declaring variables:
def foo()
  struct Node do
    int a
    f64 f
  end
  alias ANode = struct Node
  ANode a
  a.a = 3
  a.f = 4.5
end
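The C that such a struct and alias might be turned into can be sketched with two small emitters. The function names and the type mapping are hypothetical and serve only to show the intended correspondence between the Rubex constructs and a C struct definition plus typedef.

```ruby
# Hypothetical mapping of the field types used in the example to C types.
C_FIELD_TYPES = { 'int' => 'int', 'i32' => 'int', 'f64' => 'double' }.freeze

# fields is an array of [rubex_type, field_name] pairs.
def emit_struct(name, fields)
  members = fields.map { |type, field| "  #{C_FIELD_TYPES.fetch(type)} #{field};" }
  "struct #{name} {\n#{members.join("\n")}\n};"
end

# An alias becomes a plain C typedef.
def emit_alias(new_name, existing)
  "typedef #{existing} #{new_name};"
end
```

For the Node example, emit_struct('Node', [['int', 'a'], ['f64', 'f']]) yields a struct with an int a member and a double f member, and emit_alias('ANode', 'struct Node') yields typedef struct Node ANode;.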
Interface for building C extensions.
Once basic functionality like data types, conditionals and looping is in place, I will go ahead and implement the most important function of Rubex, i.e. interfacing with C libraries. It will be done as described in the project README.
The milestones that I have listed above are by no means exhaustive. There will still be lots of work remaining to make Rubex production-ready. However, the above goals, upon completion, will set the stage for Rubex to become a full-fledged production-ready language that can be used in critical applications. I will be working full time on this project (at least 50 hours/week) and will try my best to finish these bonus milestones as well:
- Being able to create pure C functions (with the cdef keyword).
- Encapsulating methods in classes and modules.
- Support for string literals.
- Support for variable and function pointers.
- Interfaces to basic C I/O functions like printf, scanf etc.
- Support for Ruby-style for and while loops.
- Type checking to detect operations between potentially incompatible types (for example between a Ruby String and C int). This will be handled at the Rubex level and errors will be thrown by the Rubex compiler.
I have been corresponding with Mr. Koichi Sasada for the past few weeks. He has been very kind to offer his support as mentor for this project. I am very grateful to him for his support.
- Compile-time error checking.
- Exception handling functionality with Ruby-style begin-rescue-ensure blocks.
- Debugging functionality.
- NMatrix-aware compilation (similar to numba).
- Interface many C standard libraries with rubex so these C functions are directly callable from Ruby code.
- Being able to create Ruby symbols.
- Implicit type conversions between Ruby and C types.
- Provide functionality to acquire and release the GIL in CRuby.
- Extend rubex so that it can generate both CRuby and JRuby extensions.