From MVP to scalability - #1 The requirements

In this previous article, we explained why it was time for us to move forward and build a greater Octobat, one more focused on automating the full billing process for companies selling online. Now begins a series of technical posts following our progress along this long journey.

The current Octobat: The Rails monolith

We have loved Rails since the day we started our first app as a web agency in 2010. Those were still the days of version 2.3, obtrusive JS, and plugins. Coming from PHP, we immediately enjoyed the refreshing Ruby syntax and the Rails way of shipping faster and faster.

The MVP pattern was a perfect fit for our needs back then: we were building basic CRUD apps, meant for corporate use only and not communicating with external services.
Then one day we were asked to enhance an existing app with an API, and we started to realize that Uncle Bob was right. Decoupling business logic from models and controllers quickly became an obsession, as we could no longer afford to stray from the DRY principle.
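
To make that idea concrete, here is a minimal, purely illustrative sketch of what pulling business logic out of models and controllers can look like in a Rails app. The names are invented for this post, not taken from the Octobat codebase.

```ruby
# Hypothetical example (not actual Octobat code): a plain-Ruby service object
# that owns one business rule, so controllers, API endpoints, and background
# jobs can all reuse it instead of duplicating the logic.
class FinalizeInvoice
  def initialize(invoice)
    @invoice = invoice
  end

  def call
    return @invoice if @invoice.finalized?

    @invoice.update!(status: "finalized", finalized_at: Time.now)
    @invoice
  end
end

# Called the same way from a controller, a job, or the API layer:
#   FinalizeInvoice.new(invoice).call
```

The same object can be called from a controller, a background job, or an API endpoint, which is exactly where DRY starts paying off.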

Following these principles, and relying on great libraries, we started building the current Octobat in the fall of 2014. In its first days, it was a Rails 4.1 app with a private API listening to Stripe webhooks and handling them through an asynchronous queuing system.
We then added a web interface, export features, a PDF generation system, a public REST API, our own webhooks, experimented with Polymer, and, more recently, introduced our new frontend app: Checkout.
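
For readers less familiar with that kind of setup, the general shape looks roughly like the sketch below: the endpoint acknowledges Stripe's webhook quickly and defers the real work to a background worker. The class names and the Sidekiq choice are assumptions for illustration, not a description of our actual code.

```ruby
# Illustrative sketch only, assuming a Rails app with the stripe and sidekiq gems.
class StripeWebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    payload = JSON.parse(request.body.read)
    # Acknowledge immediately; the heavy lifting happens in the background.
    StripeEventWorker.perform_async(payload["id"])
    head :ok
  end
end

class StripeEventWorker
  include Sidekiq::Worker

  def perform(event_id)
    # Re-fetch the event from Stripe rather than trusting the raw payload,
    # then dispatch to the handler for its type (invoice generation, etc.).
    event = Stripe::Event.retrieve(event_id)
    # ... handle event.type ...
  end
end
```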

Even though the code is still reasonably well written, almost all of these parts live inside the same complex Rails app, which keeps getting slower and slower.

So, how to build the new Octobat?

This monolithic approach comes with obvious drawbacks, each of which is a challenge we have to address. As re-architecting the Octobat app is now a life-or-death question, we listed the technical issues of today - and tomorrow - to help us write better software and improve the full picture:

  • Zero downtime: The introduction of Checkout and built-in forms raises our SLA expectations, as these forms are an interface for collecting payments through Stripe and, hopefully, many other gateways soon. Those frontend components are part of a static app and rely on our API, which obviously must be available 24/7. The monolithic approach cannot meet this requirement: a crash or an overload on the dashboard, for instance, can directly impact our customers' payment processing. In other words, we have to remove every single point of failure in Octobat.

  • Vertical and horizontal scaling: Some of our atomic operations require either a lot of concurrency or heavy RAM/CPU consumption. For instance, you expect our dashboard and our GET API requests to be as fast as possible, since they are mostly database/cache reads, and our web server must serve as many of those requests as possible at the same time. It's basically the same problem for our direct integrations, as they rely on webhooks: if we cannot handle the volume of requests Stripe sends to our endpoint, we lose transactions, which is absolutely inconceivable. That's our need for horizontal scaling. On the other hand, we also need to scale some parts vertically. Our export and reporting system has to deal with more and more data to compute the numbers you need to run your business; PDF generation is a CPU killer, and so is complex JSON parsing. Scaling every component in both directions is not the answer: it's very expensive, and it doesn't solve the real issue of separating time-consuming tasks from fast, concurrent requests.

  • A need for parallelism: Even though we have a clear idea of what should be synchronous and what cannot be, we currently don't prioritize tasks: all the asynchronous work is queued and processed serially. That was an MVP-grade dummy answer to concurrency issues, since we have, for instance, a race condition on invoice sequence numbers. But some tasks can take a lot of time, as they rely on external - and often slow - services, such as validating VAT numbers or retrieving and processing a customer's whole transaction history. When we introduced our own events and webhooks, we decided to rely on our customers' implementations, which is a great service but creates technical issues, as we now depend on their performance and their crashes. Imagine that, for some reason, an I/O request takes more than 30 seconds to execute. Even with timeouts in place, blocking the whole queue for it is not an option at all (see the first sketch after this list).

  • Atomic operations, REST API and workflows: This is probably the trickiest and most sensitive question of our new architecture. We have built an API for our customers that performs the same operations as our direct integrations - creating an invoice, for instance - and, to stick to the DRY principle, we decided to use it for our own needs too. That's fine. But since Octobat's global operations - for instance, aggregating Stripe charge data and generating a tax invoice for each charge - rely on several atomic REST API calls, we need a way to serialize them and ensure each one is processed exactly once.
    For instance, creating an invoice from a Stripe charge is the sum of at least seven atomic operations: checking whether the customer exists in Octobat, creating them if not, checking whether the invoice already exists, creating it as a draft, adding one or more line items, validating it, and marking it as paid.
    Having the whole operation crash at one step and not being able to recover from exactly that step is not an option. This was not an issue in the current Octobat, where we relied on local DB transactions, but over an API it is definitely a more complex challenge (see the second sketch after this list).
    Transitional data must be figured out too. Let's say that, for some reason, the processing of a Stripe charge into an Octobat invoice gets stuck while adding line items. If we display that data in the dashboard or return it through the API, our customers won't be able to understand where it comes from - since processing hasn't ended - or, worse, may build wrong calculations on top of it.

  • Testability, productivity and team matters: As explained in this previous article, we were no longer happy with our codebase: testing it had become a nightmare, the DRY principle was infringed more and more, and developing new features had become a big pain.
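
To illustrate the parallelism point, the kind of separation we have in mind could look like the Sidekiq-flavoured sketch below: slow, external-facing work gets its own queue and strict timeouts, so it can never block fast internal tasks such as invoice numbering. The queue names, options and classes are hypothetical, not a preview of the final design.

```ruby
require "sidekiq"
require "net/http"

# Hypothetical sketch: isolate slow, external-facing work on its own queue
# so it cannot block fast, internal tasks such as invoice numbering.
class OutgoingWebhookWorker
  include Sidekiq::Worker
  sidekiq_options queue: "customer_webhooks", retry: 5

  def perform(endpoint_url, payload_json)
    # Strict timeouts keep a slow customer endpoint from hogging a worker.
    uri = URI.parse(endpoint_url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                    open_timeout: 5, read_timeout: 10) do |http|
      http.post(uri.path.empty? ? "/" : uri.path, payload_json,
                "Content-Type" => "application/json")
    end
  end
end

class InvoiceNumberingWorker
  include Sidekiq::Worker
  # A separate, fast queue: sequence numbers are assigned here, serially,
  # so webhook slowness or VAT lookups never delay them.
  sidekiq_options queue: "invoice_numbering", retry: 3

  def perform(invoice_id)
    # ... assign the next sequence number inside a DB transaction ...
  end
end
```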
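
As for the workflow question, one way to picture "resume exactly from the failed step" is a small, persisted step machine that records each completed step and skips it on retry. This is a simplified, hypothetical sketch with invented names, under the assumption that every step is an idempotent call to our own API; it is not necessarily the mechanism we will ship.

```ruby
# Hypothetical sketch of a resumable workflow: each step runs at most once,
# and a retry picks up from the first step that has not completed yet.
class ChargeToInvoiceWorkflow
  STEPS = %i[
    ensure_customer
    ensure_draft_invoice
    add_line_items
    validate_invoice
    mark_invoice_paid
  ].freeze

  def initialize(charge_id, store)
    @charge_id = charge_id
    @store = store # persists { step => done? } per charge between attempts
  end

  def run
    STEPS.each do |step|
      next if @store.done?(@charge_id, step) # already completed: skip

      send(step)                             # may raise and abort the run
      @store.mark_done!(@charge_id, step)    # persist progress step by step
    end
  end

  private

  # Each step is an idempotent call against our own REST API; bodies omitted.
  def ensure_customer; end
  def ensure_draft_invoice; end
  def add_line_items; end
  def validate_invoice; end
  def mark_invoice_paid; end
end
```

Combined with a status flag on the invoice itself (for instance, keeping it "processing" until the last step completes), this would also keep half-built, transitional data out of the dashboard and the API.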

We now have plenty of ideas on how to get out of this situation, and we have started implementing our new architecture. The next blog posts will present our engineering solutions for delivering durable and scalable software.