status.gallery: a LowJS reactive web app — Part 2

This is the second part of a set of articles on a real-world application (called “status.gallery”) which uses a reactive Spring WebFlux backend with Thymeleaf templates and Kotlin Coroutines, Kotlin Flows and Server-Sent Events using HTML content.

The set of articles covers:

  1. LowJS frontend (Hotwire, fragments, SSE, Bootstrap)
  2. Reactive backend (Spring WebFlux, Kotlin Coroutines, Kotlin Flows)
  3. Types and application design (Arrow, Mockk, async logging)
  4. The build (Gradle Kotlin build script, browser dependencies, JVM dependencies)

which also form the basis of a talk for YOW! 2021. Remember, too, the code is available to view on Bitbucket.

Moving on to the backend, we’re delivering an SSE any time the status of a system we’re observing (an “Observee”) changes. Recall that each Observee can have one or more instances (an “Observee Instance”), and the aggregate recent health status of each instance is used to calculate the overall health of the system.

A table showing 4 observees and their current health status. The Google observee has 2 instances and is showing as currently unhealthy.
Some or all of the “Google” Observee’s instances are not healthy, so various rules determine it’s aggregate health status to be “Unhealthy”

What does that mean? Well if we had a service called “Payment Gateway” in a distributed system, we’d say there was a “Payment Gateway Observee”. Let’s also say this service has n+1 redundancy and we have two instances being monitored, so we say there are two “Observee Instances” of the “Payment Gateway Observee”.

We check those instances periodically (e.g. every 30 seconds), recording the results of the observations of each of those Observee Instances. Depending on rules around “degrading” and “recovering” health (more on those below), one or both of those instances may become “ailing” or “unhealthy”. Then, depending on rules about tolerating how many instances can be in a non-healthy state, the Observee may become “ailing” or “unhealthy” as an aggregate status.

From the outset I wanted to achieve the following:

  • A fully reactive system that didn’t involve shared mutable state (or if there were any, it was abstracted away from me as an implementation detail of the reactive toolkit, preferably using lock- and wait-free data structures)
  • Because calculating the current health status of each Observee Instance potentially involves looking at the past several observations (the Health Calculator), it needs to be captured even if “nobody is watching”
  • Use automated tests and a REPL to drive the design, particularly of the reactive portions of the system
  • Despite the cognitive load of multiple asynchronous, non-blocking, moving parts the reactive stream processing code would be fairly easy to follow (even if the underlying library mechanisms were not)

Continuing the journey started earlier in the year with Kotlin Coroutines and Spring WebFlux, instead of using Project Reactor this time I went with Kotlin Flows. They have matured significantly over the past year, interleave nicely with coroutines, have good analogies to other Reactive Streams concepts (making some of them even easier and clearer), and helped achieve the “nobody is watching” goal very simply.

Kotlin Flows

At the core of status.gallery, there are three Flows. Marble diagrams are great at describing how reactive streams work.

For this particular application, and to show the way the three Flows work together, I created an Explainer Presentation with animated marble diagrams alongside written commentary about how Flows are used in this app.

Please take a few minutes to walk through that along with the following text. For deeper insight, read up on the various types of Kotlin Flows and how their characteristics can be used to change how shared and state flows work in the face of zero, one, or more subscribers, slow readers, slow writers, and the various reactive streams operations that can be performed on them.

Observation events

The flow deepest from the frontend is a “shared flow” which streams even if there are no subscribers. This shared flow collects ObserveUrlEvent events, those being the result of periodic requests to each of the ObserveeInstances that have been configured. In our case, the only subscriber is the same thing making those periodic requests. If nobody is watching, observation events keep coming in, but the oldest events are dropped once the flow gets to its capacity. In this way, it acts like the ring buffer in the Disruptor (my original thought was indeed to use the Disruptor library as I already include it for Log4J2 asynchronous logging).

The types used in the ObserveUrlEvent are interesting and I cover them in detail in part 3). Suffice it to say, the observation results in success or failure and the recent list of these (held in that ring buffer) are used to calculate whether the instance is healthy or not. This is done using optional “degrade” and “recover” values.

Instance State events

The next flow is a “state flow” (which is a type of shared flow). A state flow only holds 1 value at a time. In this app, we only emit an event when there’s been a change in the health calculation of the instance.

By default, if a request to one an instance fails, the instance is immediately determined to be Unhealthy. As soon as a request succeeds, the instance is immediately determined to be Healthy.

Thumbnail of a marble diagram showing a “shared flow” ring buffer feeding a “state flow” event stream
Thumbnail of a marble diagram showing a “shared flow” ring buffer feeding a “state flow” event stream
Marble diagram

But, as we know, a lot of things can go wrong between the Observer and the Observee that don’t mean the Observee is unhealthy (i.e. a false negative), and also that even when a system is unhealthy it may be able to respond sufficiently enough to appear healthy for a brief moment (i.e. a false positive).

To that end, we allow the Observee to be configured with settings used to “degrade health after this many more failures” and “recover health after this many more successes”. So a degrade value of 1 means that two failures in a row means the instance is Ailing and three failures in a row means the instance is Unhealthy. A recover value of 1 means two successes in a row are needed before the instance is Healthy again. The various scenarios are specified in the HealthCalculator tests.

Observee State events

The next flow aggregates the state of the one-or-more Instances of an Observee into a single health status.

When we have multiple instances, we may treat the group as healthy even if some of them aren’t healthy. Whether, and to what degree, that is true comes down to the manner in which consumers connect to the system, so this is configurable per Observee as a tolerance configuration value.

Marble diagram showing 3 state flows being combined into 1 state flow
Marble diagram showing 3 state flows being combined into 1 state flow
Marble diagram

The health status of each Instance is used to determine the overall health of the Observee. The tolerance level is used to decide when to move an unhealthy group between healthy and ailing, or ailing and unhealthy. Unless all Instances are in an unknown state, such instances are not included in this determination. The various scenarios are specified in the ToleranceCalculator tests.

Bounce events

I have included an optional “bounce” capability which will re-emit the last event to the state after the given number of minutes. A “bounce” is the opposite of the “debounce” concept in reactive streams which suppresses quickly repeated events (e.g. a user double-clicking a button that only needs to be clicked once). In our case, because we are generally sending events to a consumer on a different host (the browser), intermittent failures may lose a change in status and it might be a long time between changes.

So, if the bounceMinutes configuration value is set, the previous event will be re-emitted after those number of minutes have passed since it was emitted. The event data is unchanged, but to distinguish the bounced events from the original, the time stamp of the bounce is included — this also ensures any distinctUntilChanged() or debounce functions working on the stream don’t go and suppress the bounce event.

Status Controller

This controller provides the resources for the status gallery. There are four resources presented:

  1. /status An HTML list of ObserveeInfo, one for each Observee
  2. /status.stream A Turbo Stream of ObserveeInfo, as each event is emitted from the ObserveeState flow
  3. /observee/{observeeNumber}/status.stream A Turbo Stream of ObserveeInfo for a given Observee, as each event is emitted from the ObserveeState flow
  4. /observee/{observeeNumber}/instance/{instanceNumber}/status.stream A Turbo Stream of InstanceInfo for a given Instance of a given Observee, as each event is emitted from the InstanceState flow.

The last three each present a Kotlin Flow, each of which is comprised of an originating State Flow transformed as required by the representation of the resource we want.

status.stream

This takes the list of flows of ObserveeEvents for each Observee, turns that list into a flow, merges the flows (using flatMapMerge), maps each ObserveeEvent to an ObserveeInfo, and presents the resultant flow to the Thymeleaf template integration to Spring WebFlux as a ReactiveDataDriverContextVariable.

Thus, the template is interpolated with each ObserveeInfo and sent as an event in a text/event-stream channel to the browser. See Part 1 for more detail.

The other flows aren’t particularly special and aren’t used by the frontend yet. They simply map from ObserveeEvent to ObserveeInfo and InstanceEvent to InstanceInfo respectively. The intention is to be able to drill-down on the frontend to see detail of each observee or any of it’s instances — but given the primary use case is a big dashboard screen, it’s lower priority functionality.

So far, we’ve dived into the main aspects of “status.gallery” #LowJS frontend using Hotwired and reactive backend using Spring WebFlux, Kotlin Coroutines, and Kotlin Flows.

In part 3, we’ll look a little closer at the application design and types used in the Kotlin backend, in particular the immutable objects and pure functional constructs being used to test and model the system.

CTO | Restaurateur | Invest/Advise/Ex: Mass Dynamics, Cookaborough, Secure Code Warrior, Canva, Atlassian, Hashrocket, ThoughtWorks, OzEmail | YOW! Conferences