Pattern.split() pitfalls in Kotlin

This is about a little pitfall I stumbled upon recently while trying to use regular expressions in Kotlin to split strings. It is relevant to interoperability between Java's Pattern.split() and Kotlin's CharSequence.split().

The Gist

Pattern is part of the Standard Java Library and is documented in the Standard Java API Specifications.
Kotlin's CharSequence, String, and Regex classes have a split() method that is documented in the Kotlin Standard Library documentation. Everything that applies to CharSequence.split() also applies to Regex.split() and String.split().
Pattern.split() uses the limit parameter differently from how CharSequence.split() uses it. Despite having the same default limit of 0, their results can be different when trailing empty strings are involved.
Pattern.split() discards any trailing empty strings by default (limit = 0).
CharSequence.split() keeps any trailing empty strings by default (limit = 0).
Pattern.split() accepts a negative limit. This will keep any trailing empty strings.
CharSequence.split() only accepts non-negative values for limit. It will throw an IllegalArgumentException if passed a negative value.
The Kotlin documentation for split() (version 1.3.72 as of this writing) does not explicitly state how limit affects trailing empty strings, nor does it mention that the behavior differs from Pattern.split().
Pattern.split() returns a String[] which is seen in Kotlin as Array<(out) String!>! (per IntelliJ IDEA type hint).
CharSequence.split() returns a List<kotlin.String>.

In Kotlin,

// *** Given the following ***

// Regular expression to split into pairs of characters.
// Result will have trailing "" for even-length strings.
val expr = "(?<=\\G.{2})"

val regex = expr.toRegex()        // kotlin.text.Regex
val pattern = expr.toPattern()    // java.util.regex.Pattern

// *** All expressions below are true ***

// when no trailing "" in result
regex.split("abc").last() != ""
"abc".split(regex).last() != ""
pattern.split("abc").last() != ""

// but List<String> is never == Array<(out) String!>!
regex.split("abc") != pattern.split("abc")
"abc".split(regex) != pattern.split("abc")

// need to use toList() to make them comparable
regex.split("abc") == pattern.split("abc").toList()
"abc".split(regex) == pattern.split("abc").toList()

// Kotlin keeps trailing "" by default (limit = 0)
regex.split("abcd") == listOf("ab", "cd", "")
"abcd".split(regex) == listOf("ab", "cd", "")

// Pattern discards trailing "" by default (limit = 0)
pattern.split("abcd").toList() == listOf("ab", "cd")

// Pattern keeps trailing "" when limit is negative
pattern.split("abcd", -1).toList() == listOf("ab", "cd", "")
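
The expressions above do not exercise the negative limit on the Kotlin side. The following is a minimal sketch of that case, assuming the same regular expression. The helper rejectsLimit is a made-up name used only for illustration.

```kotlin
// Hypothetical helper: returns true if Kotlin's split() refuses the given limit.
fun rejectsLimit(s: String, regex: Regex, limit: Int): Boolean =
    try {
        s.split(regex, limit)
        false
    } catch (e: IllegalArgumentException) {
        true
    }

fun main() {
    val expr = "(?<=\\G.{2})"
    val regex = expr.toRegex()
    val pattern = expr.toPattern()

    // Kotlin's split() throws IllegalArgumentException for a negative limit
    println(rejectsLimit("abcd", regex, -1))

    // Pattern.split() accepts a negative limit and keeps the trailing ""
    println(pattern.split("abcd", -1).toList())
}
```

So the -1 trick is only available when calling Pattern.split() directly. It cannot be passed through any of the Kotlin split() variants.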

Avoiding pitfalls with Pattern.split() in Kotlin

Pattern.split() returns a String[] whereas CharSequence.split() returns a List<String>.
Use Pattern.split().toList() when comparing with CharSequence.split().
Pattern.split() discards any trailing empty strings by default whereas CharSequence.split() keeps them by default.
Pattern.split(limit = -1) behaves the same way as CharSequence.split(limit = 0).
CharSequence.split(), Regex.split(), kotlin.String.split(Regex), and kotlin.String.split(Pattern) all work the same way.
In Kotlin, Pattern.split(CharSequence) is not symmetrical with CharSequence.split(Pattern) when the result has trailing empty strings.
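
The advice above can be wrapped in a small helper. This is only a sketch: splitKeepingTrailingEmpties is a hypothetical extension function, not part of the Kotlin or Java standard libraries. It passes a limit of -1 to Pattern.split() and converts the array to a list, so the result can be compared directly with what Kotlin's split() returns.

```kotlin
import java.util.regex.Pattern

// Hypothetical helper (not a library function): split with java.util.regex.Pattern,
// keep trailing empty strings, and return a List like Kotlin's split() does.
fun Pattern.splitKeepingTrailingEmpties(input: CharSequence): List<String> =
    split(input, -1).toList()

fun main() {
    val pattern = "(?<=\\G.{2})".toPattern()

    // Both sides keep the trailing "" and both are lists, so they compare equal
    println(pattern.splitKeepingTrailingEmpties("abcd") == "abcd".split(pattern))
}
```

Whether such a helper is worth having depends on how often the two styles are mixed in one codebase. Sticking to the Kotlin split() variants avoids the issue altogether.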

The Long Version

Note: The following is a somewhat revisionist account of the actual events, condensed for the sake of brevity. In reality, I went through a longer investigation process before I found out what was actually going on. In other words, I'm really much slower on the uptake than what this telling may lead you to believe. ¯\_(ツ)_/¯

Splitting strings in Java

While working on an exercise to split a string into groups of two characters, I tried a one-line solution that used a regular expression to do the job. This was the solution in Java:

public static String[] splitToPairs(String s) {
  return s.split("(?<=\\G.{2})");
}

The problem here is that the string will be compiled to a regular expression on the fly every time splitToPairs() is invoked. To avoid this, we can create a precompiled Pattern once and reuse it for all invocations. Since String.split() isn't overloaded to take a Pattern, we have to flip the call to Pattern.split(String) instead. No biggie. After refactoring, we get this:

// Compile once, reuse many times
private static final Pattern BY_PAIRS = Pattern.compile("(?<=\\G.{2})");

public static String[] splitToPairs(String s) {
  return BY_PAIRS.split(s);
}

This also shows that there is symmetry between String.split() and Pattern.split() which is a good thing. Here's a JUnit 5 test that captures this:

private static final String expr = "(?<=\\G.{2})";
private static final Pattern byPairs = Pattern.compile(expr);

@ParameterizedTest(name="\"{0}\" should be split as [{1}]")
@CsvSource({
    "a, a",
    "ab, ab",
    "abc, ab/c",
    "abcd, ab/cd",
    "abcde, ab/cd/e",
    "abcdef, ab/cd/ef"
})
void split_is_symmetrical(String s, String pairs) {
    String[] expectedPairs = pairs.split("/");
    assertAll(
        () -> assertArrayEquals(expectedPairs, s.split(expr)),
        () -> assertArrayEquals(expectedPairs, byPairs.split(s))
    );
}

Splitting strings in Kotlin

Now to try this in Kotlin. Just as I did in Java, I started with a straightforward solution:

fun splitToPairs(s: String) = s.split("(?<=\\G.{2})")

Running the same kind of tests I had in Java, I was a bit surprised when some of the cases failed. It turns out that java.lang.String.split(String) works differently from kotlin.String.split(String), and any tests that passed in Kotlin did so purely by coincidence. The parameter names in the method signatures reveal the different intents: in Java, it's split(String regex), while in Kotlin it's split(vararg delimiters: String). Kotlin treats the argument as a literal delimiter, not as a regular expression.

Kotlin does, however, provide two other versions that accept regular expressions: split(regex: Regex, limit: Int = 0) and split(regex: Pattern, limit: Int = 0).
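
To make the difference between these overloads concrete, here is a small sketch using the same expression as above. It only uses the three standard split() variants already mentioned. The expected output in the comments assumes the behavior described in this write-up.

```kotlin
fun main() {
    val expr = "(?<=\\G.{2})"

    // split(String): the argument is a literal delimiter, which never occurs in "abcd"
    println("abcd".split(expr))              // [abcd]

    // split(Regex) and split(Pattern): the argument is a regular expression
    println("abcd".split(expr.toRegex()))    // [ab, cd, ]
    println("abcd".split(expr.toPattern()))  // [ab, cd, ]
}
```

This also explains the coincidental passes: for inputs of one or two characters, such as "a" and "ab", the unsplit string happens to be the expected result.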

Into the pit we fall

First, I ported the JUnit 5 test I had from Java to Kotlin. Kotlin has a couple of convenient extensions for String that can create precompiled Pattern and Regex objects: toPattern() and toRegex(), respectively. So instead of Pattern.compile(expr), I used expr.toPattern().

private val expr = "(?<=\\G.{2})"
private val pattern = expr.toPattern()

@ParameterizedTest(name="\"{0}\" should be split as [{1}]")
@CsvSource(
    "a, a",
    "ab, ab",
    "abc, ab/c",
    "abcd, ab/cd",
    "abcde, ab/cd/e",
    "abcdef, ab/cd/ef"
)
fun `split should be symmetrical`(s: String, pairs: String) {
    val expectedPairs = pairs.split("/")
    assertAll(
        { assertEquals(expectedPairs, s.split(pattern)) },
        { assertEquals(expectedPairs, pattern.split(s)) }
    )
}

I was a bit surprised when all the tests failed. Looking at the first stack trace, I realized that the return types are different. This was the first pitfall. Pattern.split() returns a String[] while kotlin.String.split() returns a List<String>. Again, no biggie. I just used toList() to make them compatible:

    assertAll(
        { assertEquals(expectedPairs, s.split(pattern)) },
        { assertEquals(expectedPairs, pattern.split(s).toList()) }
    )

This time, half the tests failed. Now what? I expected all of them to pass. After adding a message to each assertion, I saw that it was the s.split(pattern) assertion that failed:

s.split(pattern) ==> expected: <[ab]> but was: <[ab, ]>
Comparison Failure: 
Expected :[ab]
Actual   :[ab, ]

This is the main pitfall. This test reveals that CharSequence.split(Pattern) gives different results from Pattern.split(CharSequence) even though they use the same limit of 0.

That is, in Kotlin, Pattern.split(CharSequence, 0) is not always symmetrical with CharSequence.split(Pattern, 0). However, Pattern.split(CharSequence, -1) is symmetrical with CharSequence.split(Pattern, 0), which is non-intuitive at best. I've already been over most of this before so I won't rehash it here.
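
The asymmetry can be seen side by side with a short sketch. SplitResults and splitAllWays are hypothetical names introduced here for illustration; only the split() calls themselves come from the standard libraries.

```kotlin
import java.util.regex.Pattern

// Hypothetical holder for the three ways of splitting compared in the text.
data class SplitResults(
    val kotlinLimit0: List<String>,
    val patternLimit0: List<String>,
    val patternLimitNeg1: List<String>
)

// Hypothetical helper: run all three variants against the same input.
fun splitAllWays(s: String, pattern: Pattern) = SplitResults(
    kotlinLimit0 = s.split(pattern, 0),             // CharSequence.split(Pattern, 0)
    patternLimit0 = pattern.split(s, 0).toList(),   // Pattern.split(CharSequence, 0)
    patternLimitNeg1 = pattern.split(s, -1).toList() // Pattern.split(CharSequence, -1)
)

fun main() {
    val pattern = "(?<=\\G.{2})".toPattern()
    for (s in listOf("abc", "abcd")) {
        val r = splitAllWays(s, pattern)
        println("\"$s\": symmetric with limit 0: ${r.kotlinLimit0 == r.patternLimit0}, " +
                "symmetric with limit -1: ${r.kotlinLimit0 == r.patternLimitNeg1}")
    }
}
```

For odd-length input such as "abc" all three results agree. For even-length input such as "abcd" only the limit of -1 on the Pattern side matches Kotlin's default.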

With my new understanding, I got all the tests to pass by modifying the expected results and using a limit of -1 with Pattern.split().

private val expr = "(?<=\\G.{2})"
private val pattern = expr.toPattern()
private val regex = expr.toRegex()

@ParameterizedTest(name="\"{0}\" should be split as [{1}]")
@CsvSource(
        "a, a",
        "ab, ab/",
        "abc, ab/c",
        "abcd, ab/cd/",
        "abcde, ab/cd/e",
        "abcdef, ab/cd/ef/"
)
fun `split should be symmetrical`(s: String, pairs: String) {
    val expectedPairs = pairs.split("/")
    assertAll(
        { assertEquals(expectedPairs, s.split(pattern), "s.split(pattern)") },
        { assertEquals(expectedPairs, pattern.split(s, -1).toList(), "pattern.split(s, -1)") }
    )
}

Even though the test passes, it doesn't express what I learned very well. So I refactored it into tests that are more coherent and expressive:

import org.junit.jupiter.api.Assertions.*
import org.junit.jupiter.api.TestInstance
import org.junit.jupiter.api.assertAll
import org.junit.jupiter.params.ParameterizedTest
import org.junit.jupiter.params.provider.EmptySource
import org.junit.jupiter.params.provider.MethodSource

@TestInstance(TestInstance.Lifecycle.PER_CLASS)
internal class SplitStringsWithRegexVsPatternTest {

    private val expr = "(?<=\\G.{2})"
    private val regex = expr.toRegex()
    private val pattern = expr.toPattern()

    private fun withTrailingEmpties() = arrayOf("ab", "abcd", "abcdef")
    private fun withoutTrailingEmpties() = arrayOf("", "a", "abc", "abcde")

    private fun keepsTrailingEmpties(results: List<String>) = results.last() == ""
    private fun discardsTrailingEmpties(results: List<String>) = !keepsTrailingEmpties(results)

    @ParameterizedTest(name = "splitting \"{0}\"")
    @MethodSource("withTrailingEmpties")
    fun `Kotlin keeps trailing empty strings by default`(s: String) {
        assertAll(
                { assertTrue(keepsTrailingEmpties(s.split(pattern)), "s.split(pattern)") },
                { assertTrue(keepsTrailingEmpties(s.split(regex)), "s.split(regex)") },
                { assertTrue(keepsTrailingEmpties(regex.split(s)), "regex.split(s)") }
        )
    }

    @ParameterizedTest(name = "splitting \"{0}\"")
    @MethodSource("withTrailingEmpties")
    fun `Pattern keeps trailing empty strings with negative limit`(s: String) {
        assertTrue(keepsTrailingEmpties(pattern.split(s, -1).toList()))
    }

    @ParameterizedTest(name = "splitting \"{0}\"")
    @MethodSource("withTrailingEmpties")
    fun `Pattern discards trailing empty strings by default`(s: String) {
        assertTrue(discardsTrailingEmpties(pattern.split(s).toList()))
    }

    @ParameterizedTest(name = "splitting \"{0}\"")
    @MethodSource("withoutTrailingEmpties")
    fun `Pattern and Kotlin give the same results when no trailing empty strings`(s: String) {
        assertAll(
                { assertTrue(pattern.split(s).toList() == s.split(pattern)) },
                { assertTrue(pattern.split(s).toList() == s.split(regex)) },
                { assertTrue(pattern.split(s).toList() == regex.split(s)) }
        )
    }

    @ParameterizedTest(name = "splitting \"{0}\"")
    @MethodSource("withTrailingEmpties")
    fun `Pattern and Kotlin give different results when there are trailing empty strings`(s: String) {
        assertAll(
                { assertTrue(pattern.split(s).toList() != s.split(pattern)) },
                { assertTrue(pattern.split(s).toList() != s.split(regex)) },
                { assertTrue(pattern.split(s).toList() != regex.split(s)) }
        )
    }
}

Conclusion

While interoperability between Kotlin and Java seems to be generally good and straightforward, there is a fundamental yet subtle difference between the behavior of Pattern.split() and Kotlin's split() implementation in the Regex, CharSequence, and String classes with respect to the limit parameter and the treatment of trailing empty strings. This can lead to surprising results if you are unaware of the difference, or of when and why it shows up.

The Kotlin team is aware of this issue and said that they have created a task for updating the documentation to emphasize the difference. Hopefully, that will be enough to help people avoid the pitfalls described above.

Links

Reddit: https://www.reddit.com/r/Kotlin/comments/gls1ko/stringsplitpattern_is_not_symmetrical_with/

CodeRanch: https://coderanch.com/t/730500/languages/behavior-java-util-regex-Pattern

jlacar/PatternSplitLimitPitfallInKotlin.md