GSoC 2024 - OpenRefine - Customized Clustering Project

GSoC 2024 Final Work Product

Student: Zyad Taha
GitHub: @zyadtaha
LinkedIn: Zyad Taha
Organization: OpenRefine
Mentor: @wetneb

Customized Clustering Functions for OpenRefine

This project enhances OpenRefine by enabling users to create custom clustering functions using GREL, Jython, or Clojure without needing to develop Java-based extensions. Users can now define their own binning and kNN-based clustering methods tailored to their specific data needs. This flexibility allows for more personalized and precise clustering, empowering users to experiment with innovative strategies and fostering a more engaged community.

Progress Report

Before GSoC, I inspected the OpenRefine codebase, explored clustering documentation and algorithms, and learned about expression languages. I engaged with the community through the forum and contributed three PRs (#1, #2, #3), which prepared me for further contributions during GSoC.

During GSoC, I started a discussion on the OpenRefine forum to gather feedback on the feature design and began working on the backend while awaiting responses. The same thread was later used to discuss the feature design, implementation details, and problems encountered during development.

PR #1: User-Defined Keyers and Distances Implementation

This PR introduces the following:

UserDefinedKeyer: A class that takes a single argument (the value being clustered) and returns its corresponding key (bin).
UserDefinedDistance: A class that takes two arguments (the two values being compared) and returns their distance (for example, the number of single-character differences between two strings).

Both classes are registered for use in the compute-clusters endpoint, and the clustering expression from the HTTP request is passed to their constructors. Unit tests for these classes are also included.
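To make the two roles concrete, here is a minimal Python sketch of a keyer and a distance written as plain functions. It only illustrates the concepts and is not the Java code from the PR; the names `cluster_by_key`, `simple_keyer`, and `length_distance` are hypothetical.

```python
from collections import defaultdict


def cluster_by_key(values, keyer):
    """Group values whose keys collide; keep bins with more than one distinct value."""
    bins = defaultdict(set)
    for value in values:
        bins[keyer(value)].add(value)
    return [sorted(group) for group in bins.values() if len(group) > 1]


def simple_keyer(value):
    """Hypothetical keyer: lowercase the value and collapse whitespace."""
    return " ".join(value.lower().split())


def length_distance(a, b):
    """Hypothetical distance: absolute difference in string length."""
    return abs(len(a) - len(b))


clusters = cluster_by_key(["New York", "new york ", "Paris", "NEW  YORK"], simple_keyer)
print(clusters)
```

A keyer only needs one value at a time, so binning scales well; a distance compares pairs of values, which is why distance-based clustering is the more expensive of the two.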

PR #2: Dialog for Custom Clustering Functions Management

This pull request adds a new "Manage clustering functions" button and dialog to the clustering user interface in OpenRefine.

The new dialog allows users to:

View currently defined clustering functions.
Add new custom clustering functions by providing a function name and expression.
Edit and remove existing custom functions.

The management dialog features a clean and organized interface with a two-tab layout to separate Keying and Distance functions. It includes scrolling support for when there are many functions, visual styling for the edit/remove buttons, and overall dialog improvements. Localization support has also been added.

Issues Resolved:

Custom functions initially did not persist across different projects or page refreshes. This has been fixed, and custom functions are now properly retained.
Previously, there was no feedback when no custom functions were defined. A placeholder message is now displayed: "No custom functions. Add a new function below."
The column widths for 'Name' & 'Action' were constantly resizing. This issue has been addressed, and the column widths now remain consistent regardless of the expression length.

Additionally, Cypress tests have been included to ensure the new functionality works as expected.

PR #3: Add LevenshteinDistance GREL Function to OpenRefine

This pull request introduces a new GREL function, levenshteinDistance(), to OpenRefine. The levenshteinDistance() function calculates the Levenshtein distance between two strings, defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. This implementation uses the simile-vicino library to perform the distance calculation.

Originally named editDistance(), the function was renamed to levenshteinDistance() to specify the exact algorithm used, as "edit distance" can refer to different methods. This renaming helps users understand which type of distance calculation is being performed.

Motivation: The primary goal of this pull request is to establish levenshteinDistance() as the default, pre-filled expression in the expression textarea when creating a custom distance clustering function. This gives users a reliable and widely used method for string comparison from the outset. levenshteinDistance() is particularly useful for tasks involving string similarity or difference, such as data deduplication and fuzzy matching.
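For readers unfamiliar with the metric, the definition above can be sketched with the standard dynamic-programming approach. This is a plain Python illustration, not the simile-vicino code the PR relies on; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.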

Testing: Comprehensive unit tests have been added to verify the function's accuracy and reliability, covering a range of scenarios to ensure it performs as expected.

PR #4: Introducing Real-Time Expression and Clusters Preview

This pull request introduces the following updates:

Clusters Preview Tab: Added a "Clusters Preview" tab to dynamically preview how your function works in real-time. This tab updates as you type expressions, allowing for immediate adjustments. This tab also includes small statistics showing the total number of clusters compared to the total number of rows, helping users gauge how their expression affects clustering.
Redesigned Expression Preview Tab: The "Expression Preview" tab has been redesigned for distance-type clustering. Previously, the tab previewed each cell value individually. Now, users can input two example values to preview the distance between them. The fields are pre-filled with the initial value from the column, and the distance is recalculated as the user types.
Default Function: Set levenshteinDistance() as the default GREL function and starting expression when writing a new distance-type expression.
Error Messaging: Added error messaging to notify users if the distance expression does not return a numerical value. Also ensured backward compatibility with the new preview interface by maintaining notifications for parsing errors or if a function expects more arguments, as was present in the previous interface.
Adjustable Parameters: Included a row in the distance-type preview for adjusting the radius and blocking characters, enabling dynamic clustering based on these settings.
Code Refactoring: Refactored the expression and clusters preview dialogs to reduce code duplication and improve maintainability.
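The interplay of radius, blocking characters, and the numeric check on the distance can be sketched as follows. This is a simplified Python illustration under stated assumptions: values are blocked by shared character n-grams, and pairs within the radius are merged into connected groups. OpenRefine's actual kNN implementation may differ, and all names here (`knn_clusters`, `char_diff`, `ngrams`) are hypothetical.

```python
from itertools import combinations


def ngrams(text, n):
    """Lowercased character n-grams; strings shorter than n form their own block."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)} or {text}


def char_diff(a, b):
    """Hypothetical distance: differing positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def knn_clusters(values, distance, radius=1.0, blocking_chars=3):
    distinct = sorted(set(values))
    # Blocking: only values sharing at least one n-gram are ever compared.
    blocks = {}
    for value in distinct:
        for gram in ngrams(value, blocking_chars):
            blocks.setdefault(gram, set()).add(value)

    parent = {value: value for value in distinct}

    def find(value):
        while parent[value] != value:
            parent[value] = parent[parent[value]]
            value = parent[value]
        return value

    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            d = distance(a, b)
            # Mirror the preview's check: a distance must be a number.
            if isinstance(d, bool) or not isinstance(d, (int, float)):
                raise TypeError("distance expression must return a number")
            if d <= radius:
                parent[find(a)] = find(b)

    groups = {}
    for value in distinct:
        groups.setdefault(find(value), []).append(value)
    return [group for group in groups.values() if len(group) > 1]


rows = ["Paris", "Pariz", "London", "Londen", "Berlin"]
clusters = knn_clusters(rows, char_diff, radius=1, blocking_chars=3)
print(f"{len(clusters)} clusters from {len(rows)} rows: {clusters}")
```

A larger radius merges more values into each cluster, while a larger blocking size reduces the number of pairs compared at the risk of missing some matches.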

PR #5: Enable Clustering Using Custom Functions from Dropdown Menu

This pull request introduces several key updates:

Custom clustering functions are now accessible via the dropdown menu, allowing users to select and apply these functions more easily.
Clustering can now be run with the custom function chosen from the dropdown menu.
A Cypress test is included to ensure complete coverage.
A user hint has been added, along with a link to documentation, to guide users on leveraging custom clustering expressions for improved results.

PR #6: Document LevenshteinDistance Function on OpenRefine's Website

Added detailed documentation for the levenshteinDistance function.
Included usage examples and explanations of parameters.

PR #7: Add User-Friendly Guide for Custom Clustering Functions on OpenRefine's Website

Provided step-by-step instructions on how to create a custom clustering function.
Included use cases for the feature, with relatable examples to aid understanding.

Scope of Future Development

Further optimize the performance of distance clustering, especially for larger datasets (like lieux-de-tournage-a-paris.csv), as discussed here.
Store newly created functions in the History tab of the Expression Preview dialog; currently, a function saved with an expression language does not appear there.
Consider implementing options to export and import custom functions, to help with sharing functions between users and synchronizing them across multiple OpenRefine instances (e.g., one Docker-based OpenRefine instance per project).
The Clusters Preview tab currently shows "No clusters returned" instead of a specific error message when an expression returns a string instead of a number. Error reporting could be improved by including details about failed computations and invalid values; an issue has been opened for this.
Determine the best Levenshtein distance algorithm for OpenRefine (Simile Vicino, Apache, or PassJoin); a discussion thread has been opened for this.
Review the OpenRefine contributing guidelines for Windows: I spent more than 7 days trying to build OpenRefine on Windows before ultimately switching to Linux.

What I Learned

Gained proficiency in contributing to large, complex codebases.
Enhanced full-stack development skills using Java and JavaScript.
Developed expertise in unit testing with TestNG and end-to-end testing with Cypress.
Mastered version control with Git and GitHub.
Collaborated effectively with a team of four contributors alongside my mentor.
Improved the ability to design and implement user-friendly features.
Strengthened technical writing and communication skills.
Gained experience in remote work.

Gratitude

I am very grateful to my mentor, Antonin Delpeuch, and the OpenRefine community for this opportunity. I deeply appreciate the time and effort Antonin dedicated to mentoring me and helping me complete this project. It would have been impossible without his support. I have learned a great deal throughout the project and look forward to staying connected with OpenRefine and continuing to contribute to this awesome organization!
