`PydSEAMSlib` : Enhanced Python bindings for d-SEAMS

d-SEAMS (Deferred Structural Elucidation Analysis for Molecular Simulations) is a free and open-source post-processing engine for the analysis of molecular dynamics (MD) trajectories, with a particular focus on the classification of ice structures. It was developed by Rohit Goswami, Amrita Goswami, and Jayant Kumar Singh, and introduced in a 2020 publication in the Journal of Chemical Information and Modeling.

The primary objective of this project was to enhance pySEAMS, a Python interface for the d-SEAMS core engine. Last year, under GSoC 2023, I started writing pybind11 bindings to the engine to slowly transition away from the single-purpose yodaStruct binary with its embedded Lua. This project focused on improving the usability and accessibility of pySEAMS by creating comprehensive documentation, refining the installation process, and integrating new features to make the tool more user-friendly.

Weekly Progress

Throughout the 12-week project, I documented my progress through regular (mandatory) short Mastodon updates on a server managed by my umbrella organization, the PSF. These provide a detailed account of the development process. The posts themselves can be found here.

Key Enhancements:

Creation of a Dockerfile:
- To streamline the installation process, a Dockerfile was created for d-SEAMS to facilitate easier use on arm64 machines, in particular M1 macOS machines. Users can set up the environment in a single step, with hardware closer to that provided by a standard Docker image, ensuring that the tool is more accessible.
Sphinx Documentation:
- Comprehensive documentation was developed using Sphinx, detailing each Python binding and its arguments.
- This resource is now available on a dedicated project website, making it easier for users to understand and utilize the full capabilities of pySEAMS.
- I also created, via prompting, light and dark logos for the project.
Refactoring of build Dependencies:
- The project addressed compatibility issues with macOS by refactoring the boost dependencies in pySEAMS, along with static linkage issues for the wheel generation.
- Beyond this, I made several changes to my copy of the C++ fork for facilitating the bindings
  - Some of these changes manifested in upstream merged pull requests as well.
Python Packaging Tools:
- To align pySEAMS with standard Python packaging practices, including PEPs 518, 621 and 725, a pyproject.toml and an __init__.py were integrated into the project, along with meson-python.
- As a compiled extension with additional Python wrappers and helpers, this package, with its meson-python build system, was at the forefront of its kind since the distutils deprecation.
  - Luckily, the mentors have extensive experience (including on NumPy) to help guide the packaging efforts.
- This not only made the tool more Pythonic for end users but also facilitated easier packaging and distribution.
Integration with ASE:
- The Atomic Simulation Environment (ASE) was integrated as a primary interface for pySEAMS.
- This makes it easier for users to interact with the tool, particularly those already familiar with ASE, which is taught across the world to graduate students in the physical sciences.
Benchmarking and Approval Testing:
- Extensive benchmarking was conducted to ensure that the functions within pySEAMS operate efficiently.
- Approval tests were also added to verify that the results from pySEAMS are consistent with those obtained directly from d-SEAMS.
Pip Installation:
- The improved pySEAMS can now be pip-installed, further simplifying the installation process and making the tool more accessible to the broader scientific community.
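
Once the package is pip-installable, a quick way to confirm that an environment actually has it is to query the installed distribution metadata. The sketch below uses only the standard library. The distribution name `pydseamslib` is an assumption taken from the import name used later in this report, and `installed_version` is a hypothetical helper, not part of the project.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pydseamslib" is an assumed distribution name (taken from the import name).
    version = installed_version("pydseamslib")
    if version is None:
        print("pydseamslib is not installed in this environment")
    else:
        print(f"pydseamslib {version} is installed")
```

A check like this can be handy at the top of a notebook or CI job, before any analysis is attempted.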

Current State of the Project

As an interesting aside, the project name pyseams was deemed too similar to existing projects on PyPI, so the name had to be changed to PydSEAMSlib.

Thin-binding usage

By calling the underlying C++ functions almost directly, we can sidestep the need for Lua altogether; this was the final part of last year's project and has since been expanded.

# This is equivalent to running the lua_inputs/config.yml file
# after building yodaStruct from seams-core
from pydseamslib import cyoda

trajectory = "subprojects/seams-core/input/traj/exampleTraj.lammpstrj"

# Get the frame
resCloud = cyoda.readLammpsTrjreduced(
    filename=trajectory,
    targetFrame=1,
    typeI=2,  # oxygenAtomType
    isSlice=False,
    coordLow=[0, 0, 0],
    coordHigh=[0, 0, 0],
)

# Calculate the neighborlist by ID
nList = cyoda.neighListO(
    rcutoff=3.5,
    yCloud=resCloud,
    typeI=2,  # oxygenAtomType
)

# Get the hydrogen-bonded network for the current frame
hbnList = cyoda.populateHbonds(
    filename=trajectory,
    yCloud=resCloud,
    nList=nList,
    targetFrame=1,
    Htype=1,  # hydrogen atom type
)

# Hydrogen-bonded network using indices not IDs
hbnList = cyoda.neighbourListByIndex(
    yCloud=resCloud,
    nList=hbnList,
)

# Gets every ring (non-primitives included)
rings = cyoda.ringNetwork(
    nList=hbnList,
    maxDepth=6,
)

# Does the prism analysis for quasi-one-dimensional ice
cyoda.prismAnalysis(
    path="runOne/",  # outDir
    rings=rings,
    nList=hbnList,
    yCloud=resCloud,
    maxDepth=6,
    atomID=0,
    firstFrame=1,  # targetFrame
    currentFrame=1,  # frame
    doShapeMatching=False,
)
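
To make the neighbour-list and ring steps of this pipeline more concrete, here is a minimal pure-Python sketch of what a cutoff-based neighbour list and a depth-limited ring search do conceptually. It is an illustration under simplifying assumptions (no periodic boundaries, brute-force distance checks, index-based lists) and is not the seams-core implementation. `neighbour_list` and `find_rings` are hypothetical helpers, not part of the `cyoda` API.

```python
from itertools import combinations
from math import dist


def neighbour_list(points, rcutoff):
    """Brute-force neighbour list: for each point, indices within rcutoff."""
    neighbours = [[] for _ in points]
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= rcutoff:
            neighbours[i].append(j)
            neighbours[j].append(i)
    return neighbours


def find_rings(neighbours, max_depth):
    """Enumerate simple cycles with 3..max_depth nodes by depth-first search.

    Every ring is reported once, starting from its smallest index, so
    non-primitive rings are included as well.
    """
    rings = []

    def extend(path):
        start, last = path[0], path[-1]
        for nxt in neighbours[last]:
            if nxt == start and len(path) >= 3:
                # Keep only one of the two traversal directions.
                if path[1] < path[-1]:
                    rings.append(list(path))
            elif nxt > start and nxt not in path and len(path) < max_depth:
                extend(path + [nxt])

    for start in range(len(neighbours)):
        extend([start])
    return rings


if __name__ == "__main__":
    # Four points on a square of side 3.0; the diagonal exceeds the cutoff.
    square = [(0, 0, 0), (3, 0, 0), (3, 3, 0), (0, 3, 0)]
    nl = neighbour_list(square, rcutoff=3.5)
    print(nl)                           # [[1, 3], [0, 2], [1, 3], [0, 2]]
    print(find_rings(nl, max_depth=6))  # [[0, 1, 2, 3]]
```

The production code works on the hydrogen-bonded network rather than a plain distance cutoff, so this only conveys the shape of the data passed between the steps.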

`ase` compatibility layer

Crucially, the example above was still tied strongly to the C++ code in terms of I/O, making it difficult to use the algorithms for other types of input data. To this end, an adapter interface was added to the PydSEAMSlib project, which makes the previous code look like:

import shutil
from pathlib import Path

import numpy as np
from ase.io import read as aseread

from pydseamslib import cyoda
from pydseamslib.adapters import _ase

TRAJ = Path("subprojects/seams-core/input/traj/exampleTraj.lammpstrj")
strTRJ = str(TRAJ.absolute())

# Construct a pointcloud
atms = aseread(strTRJ)
lammps_to_ase = {1: "H", 2: "O"}
atms = _ase.map_LAMMPS_IDs_to_atomic_symbols(lammps_to_ase, atms)
only_O_mask = [x.symbol == "O" for x in atms]
molOID = np.repeat(np.arange(1, sum(only_O_mask) + 1), 1)
pcd = _ase.to_pointcloud(
    atms, lammps_to_ase, only_O_mask, molOID, TRAJ, currentFrame=[1]
)
# Calculate the neighborlist by ID
nl = cyoda.neighListO(
    rcutoff=3.5,
    yCloud=pcd,
    typeI=2,  # oxygenAtomType
)
# Get the hydrogen-bonded network for the current frame
hl = cyoda.populateHbonds(
    filename=strTRJ,
    yCloud=pcd,
    nList=nl,
    targetFrame=1,
    Htype=1,  # hydrogen atom type
)
# Hydrogen-bonded network using indices not IDs
hL = cyoda.neighbourListByIndex(
    yCloud=pcd,
    nList=hl,
)
# Gets every ring (non-primitives included)
Rgs = cyoda.ringNetwork(
    nList=hL,
    maxDepth=6,
)
# Ensure the directory is not present before beginning
outDir = "runOne/"  # / is important for the C++ engine..
if Path(outDir).exists():
    shutil.rmtree(outDir)
# Does the prism analysis for quasi-one-dimensional ice
cyoda.prismAnalysis(
    path=outDir,
    rings=Rgs,
    nList=hL,
    yCloud=pcd,
    maxDepth=6,
    atomID=0,
    firstFrame=1,  # targetFrame
    currentFrame=1,  # frame
    doShapeMatching=False,
)

This is much more flexible, since the basic unit that the seams-core algorithms work with via the PydSEAMSlib bindings is now the ASE Atoms object, which in turn can be constructed from a host of programs. Such objects are also common intermediaries in, for example, machine learning workflows for training new potential energy surfaces.
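
The bookkeeping in the adapter example (mapping LAMMPS type IDs to symbols, masking the oxygens, and numbering them) can be illustrated without any dependencies. The sketch below is plain Python over a hypothetical list of LAMMPS type IDs. It mirrors the `lammps_to_ase`, `only_O_mask` and `molOID` steps but does not call ASE or PydSEAMSlib, and `oxygen_bookkeeping` is a hypothetical helper.

```python
def oxygen_bookkeeping(types, mapping):
    """Map LAMMPS type IDs to symbols, mask the oxygens, and number them."""
    symbols = [mapping[t] for t in types]
    only_o_mask = [s == "O" for s in symbols]
    # One running ID per oxygen atom, starting at 1.
    mol_ids = list(range(1, sum(only_o_mask) + 1))
    return symbols, only_o_mask, mol_ids


if __name__ == "__main__":
    # Hypothetical type IDs for two water molecules, each ordered O, H, H.
    lammps_types = [2, 1, 1, 2, 1, 1]
    # Same mapping as in the adapter example: 1 is hydrogen, 2 is oxygen.
    lammps_to_symbol = {1: "H", 2: "O"}

    symbols, mask, mol_ids = oxygen_bookkeeping(lammps_types, lammps_to_symbol)
    print(symbols)  # ['O', 'H', 'H', 'O', 'H', 'H']
    print(mask)     # [True, False, False, True, False, False]
    print(mol_ids)  # [1, 2]
```

Note that `np.repeat(..., 1)` in the adapter example leaves the array unchanged, so the plain `range` here produces the same numbering.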

Verification

To reduce boilerplate and ease maintenance, I learned to use Approval Tests, which allow for matching the two codes (and results from seams-core) with minimal boilerplate, e.g.:

from approvaltests import verify_file
from approvaltests.namer.default_namer_factory import NamerFactory

from pathlib import Path

def test_prisms():
    import subprocess
    gitroot = Path(subprocess.run(["git", "rev-parse", "--show-toplevel"],
                                  check=True,
                                  capture_output=True).stdout.decode("utf-8").strip())
    # Validate the run results
    verify_file(Path(f"{gitroot}/runOneRef/topoINT/dataFiles/system-prisms-1.data"), options=NamerFactory.with_parameters("systemPrisms"))
    verify_file(Path(f"{gitroot}/runOneRef/topoINT/nPrisms.dat"), options=NamerFactory.with_parameters("nPrisms"))

This can be used to approve the results, where the runOneRef folder is generated from yodaStruct -c lua_inputs/config.yml in seams-core. Once this has been checked into the repository, the analysis can be rewritten in terms of the thin bindings and the ASE adapter to ensure there are no deviations from the original data.
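
Conceptually, an approval test compares a freshly produced ("received") file against a previously reviewed ("approved") copy and fails on any difference. The sketch below shows that idea with only the standard library. It is a simplified stand-in for illustration, not how the `approvaltests` package is implemented; `matches_approved` is a hypothetical helper and the file contents are placeholders rather than real d-SEAMS output.

```python
import tempfile
from pathlib import Path


def matches_approved(received: Path, approved: Path) -> bool:
    """Return True when the received file is byte-identical to the approved one."""
    if not approved.exists():
        # Nothing has been approved yet, so the result cannot be verified.
        return False
    return received.read_bytes() == approved.read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        approved = Path(tmp) / "nPrisms.approved.dat"
        received = Path(tmp) / "nPrisms.received.dat"

        approved.write_text("1 0 0\n")  # placeholder contents
        received.write_text("1 0 0\n")
        print(matches_approved(received, approved))  # True

        received.write_text("1 1 0\n")  # simulate a regression
        print(matches_approved(received, approved))  # False
```

The appeal of the real library is that the approved copy lives in version control, so any change in output shows up as a reviewable diff.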

Merged Code

During the project, several pull requests (PRs) were successfully merged, contributing to the core functionality and stability of PydSEAMSlib and the rest of the d-SEAMS ecosystem.

`seams-core`

This is the C++ engine. Apart from work on my own fork, I also opened relevant issues upstream, namely:

Issue #47 :: Highlighting that, although the bindings implement a local rename for type (i.e. it is bound to c_type), it makes little sense not to rename it upstream.

`PydSEAMSlib`

This is the binding repository I began last summer and have continued working on, this time via PRs, since it is already under the d-SEAMS organization this year.

CI Integration:
- PR #3 - Addition of basic CI tests.
Bug Fixes:
- PR #6 - Fixing critical bugs.
Python Packaging Tools:
- PR #7 - Adding Python packaging tools.
Documentation:
- PR #8 - Developing documentation for pySEAMS at the API level.
- PR #10 - Added more Sphinx integration with pybind11.
- PR #18 - Wrote more user documentation and set up CI to generate the website.

ASE Bindings and Tests: (monolithic PR)
- PR #11 - Integrating ASE bindings and approval tests for the 1D ice nucleation study, which also reproduces data from the original publication. Also includes a CI for the tests, and enforces consistent formatting.

Future Directions

While significant progress was made during the project, there are areas that would benefit from further development:

Enhanced Documentation:
- Although substantial documentation was created, there is always room for improvement, particularly in explaining more advanced use cases and integrations with other tools.
- In particular, one prerequisite for the next publication is to have all the examples from the paper reproduced as notebooks within the documentation (this effort is ongoing, and at the 45% mark).
Additional Testing:
- Expanding the test suite to cover more edge cases and complex scenarios would help ensure the robustness of the tool in various research contexts.
- Additionally, it would be instructive to see how to expand the analysis away from LAMMPS, since that was the primary input trajectory generator for d-SEAMS.
Physical Chemistry and Biological Applications:
- Exploring the application of pySEAMS in physical chemistry and biology, particularly in the context of genetic disease research, could open new avenues for the tool's usage.
- To prevent spending time on primary research, I did not get to focus much on the descriptors or scikit-learn / networkx usage, since there was a lot of code / documentation to write in a relatively short period; I hope to rectify this soon.

Personal Development

Key Milestones in My Growth:

From Initial Python Bindings to Mastering Reproducible Science:
- Last year, I took my first steps into Python bindings with pybind11. This year, I expanded on that by focusing on the broader implications of my work in the context of reproducible science. I not only managed the Python release process, ensuring that pySEAMS was robust and easily installable, but also worked on reproducing scientific data and figures using the tools I helped develop. This involved setting up experiments, running simulations, and ensuring that the results could be reliably reproduced by others. This work underscored the importance of precision and consistency in scientific research, where even small discrepancies can lead to significant differences in outcomes.
Advancing from Basic CI to Full DevOps, GitOps, and Scientific Reproducibility:
- Building on my initial experience with continuous integration, this year I took on the challenge of integrating DevOps and GitOps practices into the scientific research process. I automated CI/CD pipelines to handle everything from code testing to the packaging and deployment of PydSEAMSlib, ensuring that every release was not only stable but also conducive to reproducible research. I utilized tools like Meld to compare and merge code, ensuring that differences were managed effectively and that the software remained consistent across different environments. This was a significant step up from last year’s work, reflecting a deeper understanding of how these practices are essential for maintaining the integrity and reproducibility of scientific research.
Comprehensive Documentation and Reproducible Workflows:
- While I focused on documentation last year, this year’s work took that further by developing comprehensive guides that included reproducible workflows. I used Sphinx to create a documentation site that not only explained how to use pySEAMS but also provided detailed instructions for setting up and running simulations in a way that others could replicate exactly. This included bundling static libraries and other dependencies to ensure that environments could be recreated precisely, a critical factor in achieving reproducible science. This approach to documentation goes beyond mere explanation—it’s about creating a reliable foundation for future researchers to build upon.
Transitioning from Contributor to Leader in Reproducible Science:
- My role evolved from being a contributor to a leader within the d-SEAMS project, where I was responsible for ensuring that the tools we developed supported the principles of reproducible science. This involved managing pull requests, enforcing coding standards, and making high-level decisions about the direction of the project with a focus on ensuring that others could replicate our work. I also bundled static libraries and other dependencies to ensure that pySEAMS was as portable as possible, making it easier for researchers to reproduce results across different systems and environments. This role taught me the importance of leadership in scientific research, where the accuracy and reproducibility of results are paramount.
Enhancing Automation and Reproducibility in Scientific Research:
- Automation played a key role in this project, not just in terms of software deployment but in ensuring that scientific results could be reproduced consistently. I developed CI/CD pipelines that automated the testing and deployment of pySEAMS, ensuring that every aspect of the software was rigorously tested and that results could be reproduced without manual intervention. This was a natural progression from last year’s CI work, but with a stronger emphasis on the reproducibility of scientific experiments and data, which is crucial for validating research findings.

Reflection and Future Aspirations

This project has been more than just an exercise in software development—it has been a deep dive into the challenges and responsibilities of ensuring reproducible science. Moving from foundational software engineering skills to mastering complex processes like DevOps, GitOps, and scientific reproducibility has given me a much broader perspective on what it takes to not only develop software but to do so in a way that supports the broader scientific community.

By integrating tools like Meld for code comparison within the approval tests, bundling static libraries for wheel builds, and creating detailed reproducible workflows, I’ve contributed to making PydSEAMSlib a reliable tool for scientific research. This experience has shown me that the real challenge in software development, especially in the scientific domain, lies in ensuring that others can build upon your work with confidence.

As I look to the future, I am excited to continue applying these skills in both software development and scientific research. I aim to contribute further to the d-SEAMS project and other initiatives where I can continue to refine these skills and tackle new challenges in the pursuit of reproducible, impactful science.

RuhiRG/gsoc24_PydSEAMSlib_fin_report.md
