@lolmaus
Last active September 8, 2021 09:36

Behavior-Driven Development: Overcoming Cucumber

⚠ Draft transcript of a talk by Andrey Mikhaylov (lolmaus)

Slides: https://slides.com/andreymikhaylov-lolmaus/bdd-cucumber-ember-berlin


Hey everyone! My name is Andrey, I'm a frontend developer at kaliber5, where I specialize in EmberJS.

There are more than enough talks, articles and even books about the Behavior-Driven Development methodology and the Cucumber tool. In addition to telling you what they are and how awesome they are, I'll share how terrible Cucumber can be and how to deal with it.

I use the term BDD in the broadest sense. I do not mean the assertion style; in fact, I believe you can do BDD with assert just as much as with expect and should.

And I do not even mean the famous double-loop test workflow, where you are supposed to write an acceptance test first, then unit tests, and finally proceed to writing code. By the way, doing it in a waterfall fashion is so challenging that I find it impossible. But you're not supposed to. Instead, you write a few lines of tests to cover the few lines of code you're about to write next. Then you do it over and over again, incrementally. I believe you can do BDD in the broadest sense even if you write tests after the code.

So what is BDD? It is a methodology that involves all team members: not only developers, but also managers, designers, representatives of the client and so on. BDD revolves around user stories. A user story is a short text describing how a certain feature is supposed to be used: it describes the initial state of the app, what actions the user takes and what the expected outcome is.

A typical feature is covered by multiple user stories. For example, let's take the feature of logging in and signing up. How many user stories do you think you need to cover it? At first glance, it may seem that you need two: one to log in and one more to sign up. But then you remember about invalid passwords, password recovery, validation of the email field and so on. If you forget to implement one of these, your app will be faulty, some users will be unhappy, and the client will be unhappy.

User stories thread through the whole feature lifecycle, starting with its conception and finishing with deployment to production. In the beginning, the whole team converges to brainstorm all possible user stories for the upcoming feature. The designer then uses the stories to create mockups. The developer uses the stories to implement acceptance tests. Finally, the CI server executes those tests during deployment.

So this is how many user stories I came up with off the top of my head for the sign-up and log-in feature. And those are just the names of the stories. Behind each item on this list there should be a text describing what the user is actually doing. This includes some very rare but still important edge cases. For example, the user clicks a recovery link from the email, but the link has expired or has previously been clicked. What should the user see? Or the user clicks the password recovery link from the email, but they had previously remembered the password and logged in, so clicking the link leads them to a logged-in app. What should they see?

Let’s look at key benefits of the user story approach.

  • As you have seen with the sign up example, user stories reveal the true scope of your feature.
  • And this lets you estimate time requirements more accurately. Of course, estimates are never precise, but the error margin will be MUCH smaller, and you will avoid being overly optimistic.
  • User stories serve as a specification for your feature…
  • …and as a single source of truth that resolves disagreements about what should be done and how.
  • User stories make it easy for developers to track their progress. Each story can be represented in your issue tracker as a separate issue or as a checklist in a single issue that covers the whole feature. The developer sees how much work they have done and how much is left. Also, checking those items one by one gives satisfaction.
  • User stories are written from the user's perspective, naturally, and not from the developer's perspective. There is a book about this called “The Inmates Are Running the Asylum” by Alan Cooper, which says, basically, that putting developers in charge of designing user interfaces and interactions turns the undertaking into a madhouse, because developers are so prone to cutting corners and prioritizing geeky stuff over user experience. BDD prevents that.
  • And finally, BDD as a methodology integrates really well into automated testing. For this purpose, I use Cucumber.

Cucumber started over 14 years ago as a library for the Ruby language. It later evolved into a full-fledged test framework and was ported to almost every popular programming language, especially the ones used to implement user interfaces. For JavaScript, there is the official CucumberJS, the unofficial YaddaJS, which I happen to be using, and who knows how many more on NPM.

In Cucumber, acceptance tests are written as user stories in a human-readable language. The file extension is .feature, and the syntax is called Gherkin. A feature file is the equivalent of an EmberJS acceptance test file: the name of the feature maps to the name of the test module, and scenario names are the names of test cases.

Inside each scenario there are steps, one per line, which are executed sequentially. There are three types of steps: Given steps set up the initial state of the app, When steps execute user actions, and Then steps make assertions.
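Put together, a small scenario in Gherkin syntax might look like this (the feature and step wording are invented for illustration):

```gherkin
Feature: Blog posts

  Scenario: Viewing the list of posts
    Given there are 3 posts in the database
    When I visit the posts page
    Then there should be 3 posts on the page
```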

Cucumber is not an AI, so behind each step there must be an implementation in your programming language. You can think of step implementations as functions, and step names in the feature file as calls to those functions.

Steps can be parameterized. For example, if you have the steps Then there should be 2 posts on the page and Then there should be 3 posts on the page, you do not have to create two separate steps. You can use one step, and the number of posts to assert will be passed as an argument into the function.
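As a rough sketch of the mechanism (not the actual Cucumber or Yadda API), a parameterized step is a regex paired with a function, and the captured group becomes the function's argument:

```javascript
// Minimal sketch of a parameterized step: one regex, one implementation.
// The capture group is passed to the function as a string, so we convert
// it to a number by hand here.
const step = {
  pattern: /^there should be (\d+) posts on the page$/,
  run(countString) {
    const count = Number(countString);
    return `asserting that the page shows ${count} posts`;
  },
};

// Look up and invoke the implementation for a given step name.
function runStep(name) {
  const match = name.match(step.pattern);
  if (!match) throw new Error(`No implementation for step: ${name}`);
  return step.run(...match.slice(1));
}

console.log(runStep('there should be 2 posts on the page'));
console.log(runStep('there should be 3 posts on the page'));
```

One regex thus covers both the 2-post and the 3-post variants of the step.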

Cucumber uses regular expressions to parse step names; you write regular expressions right inside the step definitions. The YaddaJS library also has a nice syntax called converters, or dictionaries. I call them macros. You use the dollar sign to indicate a macro, and a macro can be programmed to convert the value for you. Regular expressions yield strings, which you need to convert by hand, while macros can automatically deliver numbers, DOM elements and JSON, and convert tables into data structures. Very convenient.
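The macro idea can be sketched like this; the names and expansion logic below are my own illustration, not the real YaddaJS dictionary API:

```javascript
// Sketch of the macro idea: a macro is a named placeholder that expands
// to a regex fragment and carries a converter, so step implementations
// receive typed values instead of raw strings.
const macros = {
  num: { pattern: '(\\d+)', convert: (raw) => Number(raw) },
  text: { pattern: '"([^"]*)"', convert: (raw) => raw },
};

// Expand "$num"/"$text" placeholders into a regex and remember which
// converter applies to each capture group.
function compileStep(template) {
  const converters = [];
  const source = template.replace(/\$(\w+)/g, (_, name) => {
    converters.push(macros[name].convert);
    return macros[name].pattern;
  });
  return { regex: new RegExp(`^${source}$`), converters };
}

const { regex, converters } = compileStep('there should be $num posts titled $text');
const match = 'there should be 3 posts titled "Hello World"'.match(regex);
const args = match.slice(1).map((raw, i) => converters[i](raw));
// args is now [3, 'Hello World']: a number and a string, no manual parsing.
console.log(args);
```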

Cucumber also has test matrices: the same test case can be invoked multiple times with different sets of parameters. But Cucumber does not have conditionals and loops. This is intentional, in order to keep the readability of tests as high as possible.
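In Gherkin, such a test matrix is written as a Scenario Outline with an Examples table; each row runs the scenario once with its own parameters (an invented example):

```gherkin
Scenario Outline: Pagination
  Given there are <total> posts in the database
  When I visit the posts page
  Then there should be <shown> posts on the page

  Examples:
    | total | shown |
    | 3     | 3     |
    | 25    | 10    |
```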

Let’s look at key benefits of Cucumber.

  • Cucumber provides a separation of concerns. When you read tests, most of the time you want to see the high-level logic of the test: which components have been clicked and what the expectations are. You rarely care about the exact implementation: which helper function was used to invoke the click, which Mirage invocations were used to seed the database.

    Such low-level details are visual noise most of the time. With Cucumber, all the secondary details are abstracted away, letting you focus on the important bits and save a lot of time.

  • This makes it easy to validate tests. Very often, acceptance tests are notoriously difficult to comprehend, effectively being write-only code. With Cucumber, this is not the case.

  • And since tests are written in a human readable language, not only developers but all other stakeholders can validate those tests.

  • Code becomes very reusable. You can reuse the same step implementations in many scenarios and many feature files, and you don't have to reimplement them for every acceptance test.

  • As a result, the speed of writing tests improves a lot. You can compose new test cases by rearranging the step names and changing their params without having to write a line of code.

  • It becomes easy to maintain tests, because step implementations are short and battle-tested across many features, so you rarely have to look inside them.

  • Cucumber enforces the BDD methodology, increasing discipline and preventing designers and developers from cutting corners.

  • It increases code coverage…

  • …and makes the code much more uniform.

  • Cucumber makes it convenient to develop the frontend ahead of the backend, with the help of the Mirage mock backend. You can pause any user story at any step, and every time your dev server reloads, the app will appear in the intermediate state that you need, saving you many clicks.

Too good to be true, right? Right. 😞 It turns out that though Cucumber has an army of avid followers, it also has a comparable number of haters. Many of those haters started as followers: they got enthusiastic about Cucumber after reading articles and listening to talks, managed to persuade their teams to adopt Cucumber, and spent months rewriting existing tests and creating new ones. But then they got bitterly disappointed.

Basically, every other Cucumber promise goes unfulfilled. In practice, Cucumber increases friction in the team and friction in development. This typically ends in one of two ways. Either the team decides to rewrite everything back and waste months of work. Or… there is a joke in Russian: the mice were crying and bleeding, but they kept eating the cactus. So the second outcome is that the team decides to keep Cucumber but stays very unhappy with it. Either way, the whole team will frown upon the developer who came up with the initiative. Guess who that developer was on my team?

But I did not give up. I identified the problems, and I figured out that they are not inherent to Cucumber but rather related to the way Cucumber is used. I can compare this to trying out cooking, cutting my hands with the kitchen knife, and then going around preaching that kitchen knives are terrible and unsafe and that everyone should stop using kitchen knives in cooking. That would be ridiculous, right?

The first Cucumber issue was that our library of step implementations kept growing indefinitely: basically, every feature required introducing new steps. A large library of step implementations is very difficult to navigate. No IDE has an addon that would automatically look up the implementation for a given step, and steps are distributed across multiple files, across projects and reusable libraries. Speaking of multiple projects, the same steps may behave slightly differently from project to project. Developers become lazy, and instead of looking up existing implementations they just create new ones, multiplying the problem.

Regular expressions in different steps can overlap, so that the same step can be matched by more than one implementation. YaddaJS has heuristics that prefer a more specific step name over a greedy regex, but they do not always work the way you expect. When the wrong step is picked, it can still pass. The test then proceeds assuming a certain action happened while it didn't. This ends up as a false negative, and the assertion error message gives you no clue why the test failed.
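The ambiguity is easy to demonstrate with a sketch (the step names are invented): two patterns, one greedy and one specific, both match the same step name, and which implementation runs depends on the framework's heuristics rather than on your intent:

```javascript
// Sketch of the overlap problem: a greedy pattern and a specific pattern
// both match the same step name.
const steps = [
  { pattern: /^I click (.+)$/, name: 'generic click' },
  { pattern: /^I click the (.+) button$/, name: 'button click' },
];

// Return the names of every step implementation whose pattern matches.
function matchingSteps(stepName) {
  return steps.filter((s) => s.pattern.test(stepName)).map((s) => s.name);
}

console.log(matchingSteps('I click the submit button'));
// Both patterns match: ['generic click', 'button click']
```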

Secondly, every step is a black box. The step name can say one thing, but the step implementation can do something else, or nothing. Step names hide important details. This test says: “Given there are 3 posts in the database. Then the title of the first post should be Hello World.” Why is it Hello World and not Lorem Ipsum? In order to validate that this test makes the right assertions, you need to look into the actual step implementation. Maybe the post title is generated in a Mirage factory. Or a Mirage trait. Maybe “Hello World” is hardcoded into every post. Maybe the step implementation uses a lorem ipsum JS library. Maybe posts are generated from fixtures. Having to look into step implementations every time you read tests is terrible. And needless to say, non-developers can't do this at all, so Cucumber features lose any value to them.

The third problem is that steps can implicitly depend on each other. For example, a step could say “Then the previously expanded item should contain details”. The problem is in the words “previously expanded”: the step cannot simply look up an expanded item, because there can be pre-expanded items on the page; we need to know exactly which one has been expanded.

Cucumber has a solution for this: step implementations share the same context. The step that expands an item can memorize the expanded DOM element in a property on this, and another step can read it from there. The problem is that this makes the logic very tangled, and the complexity grows over time, increasing the friction of both maintaining tests and building upon existing ones.
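A sketch of this coupling (invented step names, not real Cucumber code): the When step writes to the shared this context, and the Then step silently reads from it, so the two only work in the right order:

```javascript
// Sketch of implicit coupling through a shared step context: the When
// step writes to `this`, the Then step reads from it. Reorder or reuse
// the steps separately and the Then step breaks.
const context = {};

const whenIExpandItem = function (index) {
  // ...the real implementation would click the item here...
  this.expandedItem = `item #${index}`; // hidden side channel
};

const thenExpandedItemShowsDetails = function () {
  if (!this.expandedItem) {
    throw new Error('No item was expanded in a previous step');
  }
  return `asserting details are visible in ${this.expandedItem}`;
};

whenIExpandItem.call(context, 3);
console.log(thenExpandedItemShowsDetails.call(context));
```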

The fourth problem is using CSS selectors to target elements on the page. A non-developer would probably think that such a step is written in a foreign language, and even a developer cannot tell whether it's a navigation menu, a sidebar or a list of posts. You cannot solve this problem by simply saying “let's use semantic CSS selectors”. If there's a technical possibility to cut corners, corners will be cut.

The last problem I have time to mention today is how difficult it is to debug Cucumber tests. If you integrate Cucumber into your framework in the simplest way, there will be no means of debugging. You can't see which step implementation was used for which step, and you cannot see which step the test failed at. All you see is the name of the scenario and an unhelpful error message with a meaningless stack trace. Working in such an environment is extremely unproductive.

Now let’s talk solutions!

First of all, I realized that the user can only perform so many types of actions on a page: the user can click something, fill something into a field, read something on the page and so on. Overall, there are maybe a couple dozen common actions and another few dozen less common ones. For example, picking an item from a power select using the type-ahead search.

But even these rare steps are still very generic; they are not tied to any specific feature. I decided that we should only use such absolutely generic steps: no feature-specific steps. This decision reduced our library of steps from several hundred to a few dozen. Luckily, we had not reached the milestone of 1000 steps; that would have been a sad anniversary to celebrate.

Those highly reusable steps are very short and atomic. They are easy to remember, and the same step names appear in every test, so you can copy-paste them. And since they're so small and battle-tested, there is almost no need to maintain the implementations.

The solution to the second problem is to extract all the truth from step implementations into the actual feature files: no implicit logic. This makes features more verbose, especially in the seeding part, and more technical, but still understandable to designers and managers. This is a good price to pay for never having to look into step implementations.
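For instance, instead of hiding the seed data in a Mirage factory, the feature file can spell it out in a table (the step wording is invented for illustration):

```gherkin
Given there are the following posts in the database:
  | id | title       |
  | 1  | Hello World |
  | 2  | Lorem Ipsum |
Then the title of the first post should be "Hello World"
```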

As for the third problem, we agreed to avoid using the this context of step implementations. Step names should reference everything explicitly. Thus a step saying “the previously expanded item” turns into “the third item”, for example. This may seem a simple rule, obvious in hindsight, but it resolved so much friction.

The fourth problem is the CSS selectors. We agreed to stop using them and instead rely on semantic test selectors via the ember-test-selectors addon. You put a data-test attribute on every element that you want to interact with from your tests, marking it with a semantic name.

But the resulting syntax is way too technical for non-developers and hard to read even for developers, so I came up with a custom DSL, a custom syntax I called Labels, that makes it much more readable. First of all, I drop the square brackets. Then I remove the data-test prefix and capitalize the label in order to make Labels stand out in the step name.

And then I also reversed the order of compound selectors. So instead of saying When I click data-test-post:nth-child(2) data-test-expand-button, I now say When I click the expand button of the second post. Much more natural to read, and it unambiguously maps to CSS selectors.

Then I forbade using arbitrary CSS selectors. Only Labels are allowed in feature files.
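A toy translator shows the idea; the ordinal list, label names and 1-based :eq indexing below are assumptions made for illustration, not the actual ember-bdd implementation:

```javascript
// Sketch of the Labels idea: translate a human-readable phrase into a
// data-test selector string.
const ordinals = { first: 1, second: 2, third: 3 };

// "second post" -> "[data-test-post]:eq(2)", "expand button" -> "[data-test-expand-button]"
function labelToSelector(label) {
  const words = label.trim().split(/\s+/);
  let index = '';
  if (ordinals[words[0]]) {
    index = `:eq(${ordinals[words[0]]})`;
    words.shift();
  }
  return `[data-test-${words.join('-')}]${index}`;
}

// Split on " of ", translate each part, then reverse so the ancestor
// comes first, as in a compound CSS selector.
function phraseToSelector(phrase) {
  return phrase
    .split(' of ')
    .map((part) => labelToSelector(part.replace(/^the\s+/, '')))
    .reverse()
    .join(' ');
}

console.log(phraseToSelector('the expand button of the second post'));
// -> "[data-test-post]:eq(2) [data-test-expand-button]"
```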

There is a problem with the nth-child pseudo-selector: it doesn't work the way we want it to. For example, I want to select the second post, and by “second post” I mean the post that is second among all posts on the page. But in this example, data-test-post:nth-child(2) will pick every post that is the second child of its immediate parent. You can see that it picks the wrong item, and it picks more than one item.

To fix this, I borrowed the idea of the :eq selector from the jQuery library. The data-test-post:eq(2) pseudo-selector makes a flat list of all posts and picks the second one.
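A sketch of how such a custom :eq(n) pseudo-selector can be resolved without native browser support; the queryAll callback stands in for document.querySelectorAll so the example stays DOM-free, and the 1-based indexing follows the talk's “second post” example, whereas jQuery's :eq is 0-based:

```javascript
// Strip :eq(n) from the selector, query all matches as one flat list,
// then pick by index. Unlike :nth-child, which counts within each
// element's own parent, :eq counts across the whole flat result list.
function resolveEq(selector, queryAll) {
  const match = selector.match(/^(.*?):eq\((\d+)\)$/);
  if (!match) return queryAll(selector);
  const [, base, n] = match;
  return [queryAll(base)[Number(n) - 1]]; // 1-based, per the talk's example
}

// Fake DOM query over a flat list of post ids, for illustration only:
const posts = ['post-1', 'post-2', 'post-3', 'post-4'];
const queryAll = (sel) => (sel === '[data-test-post]' ? posts : []);

console.log(resolveEq('[data-test-post]:eq(2)', queryAll)); // ['post-2']
```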

The fifth problem is debugging. When integrating Cucumber into our framework, I created a custom wrapper around the invocation of EmberJS tests. When an error happens in a test, this wrapper catches the error, enhances the error message and re-throws it.
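The wrapper idea can be sketched as follows (invented names; the real integration wraps EmberJS test invocations): each step runs inside a try/catch, and on failure the error message is enhanced with the scenario and step names before re-throwing:

```javascript
// Run each step in order, logging progress; when a step throws, re-throw
// with the scenario and step names prepended, keeping the original stack.
function runScenario(scenarioName, steps) {
  const log = [];
  for (const { name, run } of steps) {
    try {
      run();
      log.push(`ok   ${name}`);
    } catch (error) {
      log.push(`FAIL ${name}`);
      const enhanced = new Error(
        `Scenario "${scenarioName}" failed at step "${name}":\n${error.message}`
      );
      enhanced.stack = error.stack; // keep the original stack trace
      throw enhanced;
    }
  }
  return log;
}

try {
  runScenario('Viewing posts', [
    { name: 'Given there are 3 posts', run: () => {} },
    { name: 'Then I see 3 posts', run: () => { throw new Error('found 2 posts'); } },
  ]);
} catch (error) {
  console.log(error.message);
}
```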

This is what our test output looks like. You can see a list of steps in the order they were executed. You can see which step implementation was used for which step.

For failing steps, in addition to the step name and the step implementation, there is a list of arguments that the step implementation was called with. If an argument is a Label, you don't see [object Object], which is what a DOM element looks like when cast to a string. Instead, you see the label name, the CSS selector it corresponds to, and the number of matching DOM elements. You also see a meaningful error message and a meaningful stack trace.

And that concludes it. When we started following these five rules, and a few others that I don't have time to explore today, using Cucumber became efficient and enjoyable.

At kaliber5.de, we dedicate some of our work time to open source causes. My open source effort has been an Ember addon I called ember-bdd. It is in active development and not yet feature-complete, but when it's finished, it will offer a drop-in Cucumber suite featuring all the know-how that I have shared with you today.

In the GitHub repo of the addon, I have enabled the Discussions feature, which is a more casual way to discuss things than GitHub Issues. I encourage everyone to chime in, ask questions, share ideas and become early adopters of the ember-bdd library.

Thank you! ^_^
