{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Root cause analysis (RCA) of latencies in a microservice architecture\n", "\n", "In this case study, we identify the root causes of \"unexpected\" observed latencies in cloud services that power an\n", "online shop. We focus on the process of placing an order, which involves different services to make sure that\n", "the placed order is valid, the customer is authenticated, the shipping costs are calculated correctly, and the shipping\n", "process is initiated accordingly. The dependencies of the services are shown in the graph below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "Image('microservice-architecture-dependencies.png', width=500) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This kind of dependency graph could be obtained from services like [AWS X-Ray](https://aws.amazon.com/xray/) or\n", "defined manually based on the trace structure of requests.\n", "\n", "We assume that the dependency graph above is correct and that we are able to measure the latency (in seconds) of each node for an order request. In the case of `Website`, the latency would represent the time until a confirmation of the order is shown. For simplicity, let us assume that the services are synchronous, i.e., a service has to wait for downstream services in order to proceed. Further, we assume that no two nodes are impacted by the same unobserved factor (hidden confounder), i.e., that causal sufficiency holds. Since, for instance, network traffic affects multiple services, this assumption is likely to be violated in a real-world scenario. However, weak confounders can be neglected, while stronger ones (like network traffic) could falsely render multiple nodes as root causes. 
Generally, we can only identify causes that are part of the data.\n", "\n", "Under these assumptions, the observed latency of a node is the sum of the node's own latency (intrinsic latency) and the latencies of its direct child nodes. This could also include calling a child node multiple times.\n", "\n", "Let us load data with observed latencies of each node." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "normal_data = pd.read_csv(\"rca_microservice_architecture_latencies.csv\")\n", "normal_data.head()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Let us also take a look at the pair-wise scatter plots and histograms of the variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "axes = pd.plotting.scatter_matrix(normal_data, figsize=(10, 10), c='#ff0d57', alpha=0.2, hist_kwds={'color':['#1E88E5']});\n", "for ax in axes.flatten():\n", "    ax.xaxis.label.set_rotation(90)\n", "    ax.yaxis.label.set_rotation(0)\n", "    ax.yaxis.label.set_ha('right')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "In the matrix above, the plots on the diagonal are histograms of the individual variables, whereas the off-diagonal plots are scatter plots of pairs of variables. The histograms of services without a dependency, namely `Customer DB`, `Product DB`, `Order DB` and `Shipping Cost Service`, have shapes similar to one half of a Gaussian distribution. The scatter plots of various pairs of variables (e.g., `API` and `www`, `www` and `Website`, `Order Service` and `Order DB`) show linear relations. We shall use this information shortly to assign generative causal models to nodes in the causal graph."
] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Setting up the causal graph\n", "\n", "If we look at the `Website` node, it becomes apparent that the latency we experience there depends on the latencies of\n", "all downstream nodes. In particular, if one of the downstream nodes takes a long time, `Website` will also take a\n", "long time to show an update. Seeing this, the causal graph of the latencies can be built by inverting the arrows of the\n", "service graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import networkx as nx\n", "from dowhy import gcm\n", "\n", "causal_graph = nx.DiGraph([('www', 'Website'),\n", " ('Auth Service', 'www'),\n", " ('API', 'www'),\n", " ('Customer DB', 'Auth Service'),\n", " ('Customer DB', 'API'),\n", " ('Product Service', 'API'),\n", " ('Auth Service', 'API'),\n", " ('Order Service', 'API'),\n", " ('Shipping Cost Service', 'Product Service'),\n", " ('Caching Service', 'Product Service'),\n", " ('Product DB', 'Caching Service'),\n", " ('Customer DB', 'Product Service'),\n", " ('Order DB', 'Order Service')])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams['figure.figsize'] = [13, 13] # Make plot bigger\n", "\n", "gcm.util.plot(causal_graph)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "