Disclaimer: while the views expressed in this article are mine and mine alone, the project introduced here was part of engineering efforts carried out with Withings.
Alright. I know it. You know it. We all know it: pretty much every website and app out there tracks its users nowadays. If it's not the annoying, half-the-webview cookie banner, it's going to be a full-on modal that can't be dismissed and takes a solid 3 minutes to disappear when you hit "Reject all." Depending on the country you're currently in, it could actually be much worse: you'll get no warning, no notification, nothing: just some sweet-sweet silent tracking, formerly known as "spying."
For as long as websites and apps have been able to generate financial profit, teams of dedicated professionals have made it their mission to study exactly how their products are being used and how they could adjust even the tiniest details of the experience to improve revenue performance. With the rise of machine learning in data analytics, this has now gone full-blast, with massive amounts of user data flowing into automated pipelines that yield neat tables, figures and graphs for marketing teams (and AIs) to ponder.
Given the serious computational cost of running these pipelines, SaaS quickly became the go-to solution for analytics. To this day, Google's platform still has a massive head start on potential competitors and very few businesses take the time to set up their own infrastructure for this. However, while many users remain unaware of what online privacy means, companies (especially in the EU) are progressively being forced to care about it. The GDPR, for example, is the reason why all those cookie banners seemingly appeared out of nowhere about a decade ago: it became illegal to gather personal data from EU citizens without a proper consent procedure.
Since Google remains unrivalled on EU grounds, companies which wanted to avoid the legal hassle started looking into self-hosted solutions. One of the early and well-known options is Matomo, formerly known as Piwik. Mostly geared towards web applications, this solution allowed website owners to include their own little JS snitch and ship the data to their own dashboards. Matomo is still very popular to this day, but it has one drawback: it hasn't properly evolved to fit today's world of mobile apps. For those, there are a couple of solutions out there today. I'd say the most popular are Snowplow, Segment and Rudderstack.
What's Rudderstack?
Like many of its contemporaries, Rudderstack is usually defined with a set of carefully-crafted marketing terms which I've never quite got around to understanding. From a technical standpoint though, I would call it an events collector. Whenever something happens on your website or app (user click, user login, item added to cart, you name it), Rudderstack ships that event to whatever storage backend you want. You can then query that database and pull statistics on pretty much anything.
In my opinion, one of the key strengths of Rudderstack is that it is incredibly easy to integrate. Mobile developers import a library, plug it into a couple of places, and boom: events start flowing. From a developer's perspective, I've been told it's quite breezy.
The Cloud frenzy
From a sysadmin's perspective though... this thing is unmanageable. Like many such products, it has fallen prey to the all-too-common Cloud hype. Rudderstack can only be deployed on Docker and Kubernetes: that's it. It does not ship as a package, nor does it provide viable instructions for installing from source.
If a good portion of your infrastructure (the part that can scale, at least) is already on Kubernetes, you might be able to get somewhere. There's a Helm chart you can use, and configuration can be reduced to a couple of values in a YAML file. Even then though, you will face another set of cute challenges:
- The documentation provided by Rudderstack is very (very) limited indeed. A good amount of experience with Helm and K8S is a must.
- Forget troubleshooting: Rudderstack logs only contain (an insane amount of) debug-level information full of internal jargon: pretty much impossible to understand unless you're a Rudderstack dev yourself.
- Albeit self-hosted, Rudderstack requires access to an object storage bucket (AWS S3 or GCP). You'll need to mock that.
- In a similar vein: by default, Rudderstack SDKs will try and reach out to Rudderstack Cloud at initialisation, even when you're self-hosting.
Sadly, while this will understandably annoy any sysadmin, it does make a lot of sense from a business perspective. Rudderstack's main product isn't Rudderstack, but Rudderstack Cloud: a SaaS instance of their open-source solution. Unfortunately, Rudderstack is based in San Francisco, which makes it little more than an inexperienced, less legally-scrutinised alternative to Google. Back to square one we go.
Note that to this day, I haven't given Segment or Snowplow a shot. From what I gather, some features are exclusive to Rudderstack, and the analytics team I worked with on this couldn't find those anywhere else.
Introducing Stilgar
Now: I do consider myself a rather patient sysadmin. I don't mind going through countless documentation pages or configuration errors to get something to work. For a good project, I'll even comply with dependency requirements I don't like (or, like, write configuration in XML).
However, there comes a point when you just need to say (scream) enough! That's what happened with Rudderstack. Too many errors, too many unexplained pod crashes, too many failed log streams... At the end of the day, the product just wasn't worth the time and money it cost to run. And since no competitor could do what Analytics wanted, there was but one solution: build it! This is how our little project, Stilgar, came to life.
Key elements
Stilgar is best described just like that: a lightweight, no-fuss, drop-in replacement for Rudderstack. It doesn't bring anything new to the table: it exposes the exact same server API and is compatible with Rudderstack's client SDKs, meaning the migration requires zero changes from app devs. It does, however, come with some serious advantages:
- It compiles to a simple native binary. You get the code, you compile it, you have it.
- No dependencies.
- You can run as many instances of that binary as you like. Whether you use physical hosts, VMs, containers, K8S, magic: scaling strategy is up to you.
- Configuration is a single YAML file.
- No Cloud dependency: you can even use it to mock Rudderstack's control plane (although this might require changes to client app code).
Plus, it's pretty damn fast. Written in Rust on top of the Tokio runtime, it makes event drop-off a very quick operation.
At the time of this writing though, there's one major drawback you won't like... It only supports Clickhouse as a destination for your events, when Rudderstack can reach... over 150 of them. Stilgar being a work project, I was only able to spend time on my company's choice of destination. The design, however, makes it rather easy to implement others (and I may or may not eventually work on those in my spare time).
Install, configure, run!
First, you'll need to get the thing. You can download and build Stilgar from crates.io using:
$ cargo install stilgar
You may also clone the repository and build with cargo build. To save yourself the hassle of glibc linking, I recommend building for musl:
$ rustup target add x86_64-unknown-linux-musl
$ cargo build --release --target=x86_64-unknown-linux-musl
Once that's done, go ahead and deploy the binary to your soon-to-be Stilgar hosts.
Finally, configuration. Stilgar looks for a YAML file at the path given in $STILGAR_CONFIG, or at any of the following locations:
- /etc/withings/stilgar.yml
- /etc/withings/stilgar.yaml
- $XDG_CONFIG_HOME/stilgar/stilgar.yml
- $XDG_CONFIG_HOME/stilgar/stilgar.yaml
- ./stilgar.yml
- ./stilgar.yaml
A sample configuration file is available in the repository. Here are the main elements:
- The server section: defines how Stilgar listens for requests, the maximum event size, the allowed origins (CORS) and the admin credentials for the /status route
- The logging section
- The destinations list
Each destination is defined by a type (storage backend) and a set of write keys. The rest of the settings are backend-specific.
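Just to give an idea of the overall shape, here's what a configuration could look like. The key names below are illustrative only: check the sample file in the repository for the authoritative options.

server:
  # how Stilgar listens for requests
  ip: 0.0.0.0
  port: 8080
logging:
  level: info
destinations:
  - type: clickhouse
    write_keys:
      - my-mobile-app-write-key
    # backend-specific settings (host, credentials, ...) go here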
Once you've got all that, simply run stilgar and you should be all set!
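For instance, assuming the binary is on your PATH and your configuration lives outside the default locations (the path below is just a placeholder):

$ STILGAR_CONFIG=/srv/stilgar/stilgar.yml stilgar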
A closing note
We have been running a bunch of Stilgar nodes for a bit over a year now, and with some configuration tweaking, the performance has been very satisfying. We are able to process all of our events almost instantly, keeping Stilgar's internal queues at zero most of the time.
I believe Stilgar may be a good alternative for teams that like to own their analytics data, and/or need to provide strong guarantees in terms of privacy and governance. Hopefully we'll be able to grow the project to cover more destinations in the future, although for now the improvements will probably be limited to whatever our analytics team needs!
The nifty details
Beyond this point, these are just technical details taken from the repository to explain how Stilgar is designed. Note that this may not be up to date: feel free to check the doc/ directory over on GitHub for the latest version.
Design basics
For any destination, the basic lifecycle of an event is as follows:
- Events are sent to the Stilgar web service
- The payloads are queued into a Tokio channel
- A separate forwarder thread pops events from the channel and forwards them to your destinations
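Here is a minimal, self-contained sketch of that queue-and-forward pattern. This is just an illustration of the idea, not Stilgar's actual code, and it assumes Tokio with the full feature set:

// Illustration of the queue-and-forward pattern described above.
use tokio::sync::mpsc;

#[derive(Debug)]
struct Event {
    name: String,
    payload: String,
}

#[tokio::main]
async fn main() {
    // The web service pushes events into this channel...
    let (tx, mut rx) = mpsc::channel::<Event>(1024);

    // ...and the forwarder task pops them and hands them to the destinations.
    let forwarder = tokio::spawn(async move {
        while let Some(event) = rx.recv().await {
            println!("forwarding {:?}", event);
        }
    });

    // An API task drops an event off and returns immediately.
    tx.send(Event { name: "track".into(), payload: "{}".into() })
        .await
        .unwrap();

    drop(tx); // close the channel so the forwarder can finish
    forwarder.await.unwrap();
}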
Events are received via POST on the same endpoints as Rudderstack (a sample request follows the list):
- v1/batch
- v1/alias
- v1/group
- v1/identify
- v1/page
- v1/screen
- v1/track
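As an illustration, pushing a single track event could look roughly like this, assuming Stilgar listens on stilgar.example.net:8080 and that the write key goes in as the basic-auth username, Rudderstack-style (host, port and payload here are placeholders):

$ curl -X POST http://stilgar.example.net:8080/v1/track \
       -u "YOUR_WRITE_KEY:" \
       -H "Content-Type: application/json" \
       -d '{"anonymousId": "1234", "event": "item_added_to_cart", "properties": {"item": "scale"}}'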
Stilgar also exposes an extra GET endpoint at /sourceConfig. This is used to mock Rudderstack's control plane, as some SDKs (JS) will refuse to run if a control plane does not validate their write keys.
For monitoring, Stilgar exposes 2 endpoints which make it easier to integrate into high-availability (load-balanced) setups:
- A /ping route which simply replies with pong. This route does not require authentication and can be used to determine whether Stilgar is running or not (e.g. in a health check, as shown below).
- A /status route which provides some basic statistics. This route supports authentication, using admin credentials (not the write keys), as well as a network whitelist.
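A health check can therefore be as simple as this (host and port being placeholders for your own setup):

$ curl http://stilgar.example.net:8080/ping
pong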
Standard issue ASCII diagram
The following diagram explains how an event moves around before it reaches a target destination (Clickhouse). See below for even more details.
Web services logic
------------------
+-------+ +---------+ +-----------+
| event +------> /page +------> tokio |
+-------+ | /screen | | channel |
| ... | +-----+-----+
+---------+ |
API tasks |
|
|
Forwarder logic |
--------------- |
|
+-----------+ |
| forwarder <---+
+-----+-----+
|
|
|
|
+----------------------------+------------+
| |
| event error recently |
| is a returned by |
| batch a destination |
| |
| |
+-----v------+ +--------------+ +-------v-----+
| split into | | forward | | apply |
| individual +-----> to <-----+ exponential |
| events | | destinations | | backoff |
+------------+ +--------------+ +-------------+
^
|
|
Destination logic |
----------------- +-------------------+
| |
v v
+- Blackhole -+ +--- Clickhouse --+
| | | |
| +---------+ | | +-------------+ |
| | discard | | | | write event | |
| | event | | | | to cache | |
| +----+----+ | | +-------+-----+ |
| | | | ... | |
+------+------+ | +-------v-----+ |
| | | flush cache | |
v | | to server | |
x | +-------+-----+ |
| | |
+---------+-------+
|
v
destination
Destinations
When an event is sent to a destination within Stilgar, it does not necessarily mean that it has been sent out. It means that the event has been taken out of the queue and passed over to the destination logic. That code can then decide to hold on to the event in memory for a bit, or send it straight away. This is useful to improve performance when your destination would rather receive events in batches.
Blackhole
This destination can be used for testing, or as an example to write an actual destination from. It drops all events.
Clickhouse
This destination can send events over to Yandex Clickhouse, using the gRPC protocol. The logic for this target is buffered: events will only be sent out once the forwarder has provided enough of them. Here's the overall flow:
+-------+ +-------------------+
| event +------> destination logic |
+-------+ +---------------+---+
from the [mod.rs] |
forwarder |
|
+--------------v--+
| in-memory cache |
+--------------+--+
[cache.rs] |
|
|
once enough events |
have been received... |
|
+----------------v--+
| TSV event batches |
+-------------------+
sent over to Clickhouse
|
v
The in-memory cache stores events grouped by columns. That is, events with properties a, b, c go one way, those with a, c, d another, and so on. Once the cache has enough entries, each group is taken separately and sent in TSV format over to Clickhouse. This means each INSERT query always covers the same set of columns, which allows us to avoid inefficient input formats like TabSeparatedWithNames or JSONEachRow.
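As an illustration of that grouping step, here's a simplified sketch (not Stilgar's actual cache code):

use std::collections::{BTreeSet, HashMap};

type Columns = BTreeSet<String>;
type Row = HashMap<String, String>;

// Group events by the exact set of columns they carry, so that each group
// can later be flushed with a single INSERT covering identical columns.
fn group_by_columns(events: Vec<Row>) -> HashMap<Columns, Vec<Row>> {
    let mut groups: HashMap<Columns, Vec<Row>> = HashMap::new();
    for event in events {
        let columns: Columns = event.keys().cloned().collect();
        groups.entry(columns).or_default().push(event);
    }
    groups
}

fn main() {
    let events = vec![
        // properties a, b, c
        Row::from([("a".into(), "1".into()), ("b".into(), "2".into()), ("c".into(), "3".into())]),
        // properties a, c, d
        Row::from([("a".into(), "4".into()), ("c".into(), "5".into()), ("d".into(), "6".into())]),
    ];
    // Each resulting group shares an identical column set and can be flushed as one TSV batch.
    for (columns, rows) in group_by_columns(events) {
        println!("{:?}: {} event(s)", columns, rows.len());
    }
}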