T-Mobile runs the Pivotal Cloud Foundry platform at a very large scale (≈100K application instances!),
with a wide range of applications from across the organization spanning finance, payments, retail,
and customer care. Monitoring such a large-scale Cloud Foundry environment is super challenging,
yet mission-critical for running business operations seamlessly. We need to monitor not only
important platform KPIs, but also the most frequently exercised workflows involving critical
platform components and services. The platform-engineering team at T-Mobile has developed
a suite of customized smoke-tests to address the
observability of critical workflows, and to ensure swift real-time identification of problems
before they impact the platform applications.
Our journey started with the open-source community. They provided smoke-tests that were aimed at
monitoring specific platform services. However, we found a few gaps that restricted us from using
those tests at the scale we need. For example, inadequate and unreliable clean-ups wasted a high
volume of resources. Executing high-privilege yet non-essential operations during
test runs posed security risks and wasted time and resources. Frequent
upstream changes caused failures in our pipelines, and compromised the reliability of the tests.
The lack of a unified framework to deploy and run the tests on multiple foundations on a
continuous basis introduced operational overhead. To address these gaps, and other use-cases
unique to our environment, we attempted to build a customized
suite of pipelines, configuration, scripts, and sample-apps
to monitor our platform components and services.
We designed a solution with reliability, accuracy, and maintainability at the
center of our focus. We wanted a failed smoke-test run to indicate a legitimate
problem in a foundation, so we worked to eliminate false-positive alerts. With our flexible plug-n-play
framework and reusable libraries, onboarding new foundations or adding new smoke-tests takes
little time and effort. To complete our observability suite, we added capabilities
around visualization, alerting, and retention of historical information by integrating our
smoke-tests suite with monitoring platforms like Splunk, InfluxDB, and Elasticsearch. The
metrics emitted by the smoke-tests, and stored in these metrics-platforms, showed all critical
operations carried out by each of the smoke-test runs in easy-to-read dashboards and actionable alerts.
The smoke-tests suite is comprised of the following components:
The following diagram depicts the deployment approach we have followed. Here, each Concourse
deployment is dedicated to a specific hardware region, and has Concourse teams dedicated to each
specific CF foundation. This hierarchy can be different for each organization. Any such deployment
approach only needs to be reflected in foundation-specific configuration files, so that the
bootstrap pipeline knows where to deploy the smoke-test pipeline for a specific foundation. The
bootstrap pipeline monitors the Git repo as a resource and, on any change to the repo, deploys the
smoke-test pipeline for each foundation. The smoke-test pipeline contains the test jobs dedicated to
specific services or components. Each of those jobs can be configured to run at a different
frequency, based on how long each run of the job takes. We used HashiCorp Vault as a
secret-management tool for both the bootstrap and smoke-test pipelines, to fetch the secret parameters
that are not defined in environment config files. All such parameters were stored at a Vault path
that is accessible to the corresponding pipelines running in Concourse. Other secret managers
that integrate with Concourse can be used instead of Vault.
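As a rough illustration of how foundation-specific configuration can drive the bootstrap step, the Python sketch below maps each foundation to a Concourse target and builds the `fly set-pipeline` command that would deploy its smoke-test pipeline. The `Foundation` fields, the target and pipeline naming schemes, and the file names are assumptions made for this example and are not taken from the actual repo; only the `fly` flags used (`-t`, `-p`, `-c`, `-l`, `--non-interactive`) are standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Foundation:
    """Hypothetical foundation-specific configuration entry."""
    name: str    # CF foundation name
    region: str  # hardware region, served by one Concourse deployment
    team: str    # Concourse team dedicated to this foundation

    @property
    def fly_target(self):
        # Assumed convention: one saved fly target per region/team pair,
        # already logged in to that team with `fly login`.
        return f"{self.region}-{self.team}"

    @property
    def vars_file(self):
        # Assumed location of the foundation's non-secret parameters.
        return f"config/{self.name}.yml"


def set_pipeline_command(foundation, pipeline_file="smoke-tests.yml"):
    """Build the fly command that deploys one foundation's pipeline."""
    return [
        "fly", "-t", foundation.fly_target,
        "set-pipeline", "--non-interactive",
        "-p", f"smoke-tests-{foundation.name}",
        "-c", pipeline_file,
        "-l", foundation.vars_file,
    ]


def bootstrap_commands(foundations):
    """One set-pipeline command per configured foundation."""
    return [set_pipeline_command(f) for f in foundations]
```

Supporting a different region/team hierarchy would then only mean changing the configuration entries, not the bootstrap logic. Secrets are deliberately absent here, since in the setup described above they are resolved from Vault at run time.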
Spring Cloud Services (SCS) are among the most frequently used services offered on the Cloud Foundry platform.
To understand the workflow of the smoke-tests in this suite, let's dive into the smoke-test for the SCS
suite. The following diagram depicts the end-to-end workflow of the SCS smoke-test. This smoke-test
executes the entire lifecycle of a typical Spring Boot application that leverages the SCS suite of
services. It starts by logging in to a foundation/org/space and creating all the service-instances.
Then, sample Spring Boot applications are pushed and bound to all the service-instances. Once the
applications start and are ready to receive traffic, the smoke-test hits various application endpoints to
validate the functionalities delivered by the app and the SCS services. Finally, the test performs
all the necessary clean-ups, and reports results of every operation as a separate metric to the
metric-store of choice. All the other smoke-tests in this suite also execute similar workflows with
the necessary customizations around the services they intend to test.
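The workflow above can be sketched as a small step runner: operations run in order, the run stops at the first failure, clean-ups always execute, and every operation yields its own result record. This is a minimal Python sketch under our own assumptions, not the suite's actual code; `run_step`, `run_smoke_test`, and the `cf` helper are hypothetical names.

```python
import subprocess
import time


def cf(*args):
    """Return a callable that runs one cf CLI command and raises on failure."""
    return lambda: subprocess.run(["cf", *args], check=True)


def run_step(name, action, results):
    """Run one operation, record success and duration, return success."""
    start = time.monotonic()
    try:
        action()
        ok = True
    except Exception:
        ok = False
    results.append({
        "operation": name,
        "success": ok,
        "seconds": time.monotonic() - start,
    })
    return ok


def run_smoke_test(steps, cleanups):
    """Run steps until one fails; always attempt every clean-up."""
    results = []
    try:
        for name, action in steps:
            if not run_step(name, action, results):
                break  # later steps depend on this one, so stop here
    finally:
        # Clean-ups run even after a failure or an interrupt, and a
        # failing clean-up does not prevent the remaining ones.
        for name, action in cleanups:
            run_step(name, action, results)
    return results
```

A test would pass steps such as `("push-app", cf("push", APP_NAME))` and clean-ups such as `("delete-app", cf("delete", APP_NAME, "-f"))` and `("delete-service", cf("delete-service", INSTANCE_NAME, "-f"))`, where `APP_NAME` and `INSTANCE_NAME` are placeholders. Recording clean-up operations as results in their own right makes leaked resources visible instead of silent.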
This suite offers library functions that report the result of every critical smoke-test
operation, on every run, as an individual metric with the necessary tags. These functions can ship
the metrics to a choice of metric stores. This gives platform operators the ability to create
dashboards and alerts for real-time monitoring of at-scale CF deployments. The alerts can be
triggered on consecutive failures of any operation that the operators want to monitor. Similarly,
dashboards can be set up to show continuously failing smoke-tests on a foundation with a link to
the most recently failed Concourse job, where the detailed logs can be viewed. All the necessary
datapoints are available for holistic dashboards that include all passed and failed smoke-tests
on each foundation.
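To make the idea of tagged per-operation metrics concrete, here is a minimal Python sketch that formats one operation result as an InfluxDB line-protocol record and checks for consecutive failures. The measurement name `smoke_test`, the tag and field names, and both function names are assumptions for illustration; the suite's own library functions may look quite different, and in practice the alert rule would normally live in the metric store rather than in test code.

```python
def _escape_tag(value):
    """Escape the separators that line protocol reserves inside tags."""
    text = str(value)
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(operation, success, seconds, tags, timestamp_ns):
    """Format one operation result as a single line-protocol record."""
    all_tags = dict(tags, operation=operation)
    tag_str = ",".join(
        f"{_escape_tag(k)}={_escape_tag(v)}"
        for k, v in sorted(all_tags.items())
    )
    # Integer fields carry an "i" suffix; floats are written as-is.
    fields = f"success={1 if success else 0}i,duration_seconds={float(seconds)}"
    return f"smoke_test,{tag_str} {fields} {timestamp_ns}"


def should_alert(history, threshold=3):
    """True when the latest `threshold` results (oldest first) all failed."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)
```

Tagging each point with the foundation, the test, and the operation is what lets a single dashboard query slice results per foundation or per operation, and requiring several consecutive failures before alerting is one simple way to keep transient blips from paging anyone.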
T-Mobile’s Platform-Engineering team uses these smoke-tests to monitor the CF deployments hosting
thousands of applications. These tests continue to help our operators get real-time health
information about platform components and services, and take action to mitigate issues before
they impact the hosted applications. They have been particularly invaluable during
platform-upgrades (when the state of the components is unstable), as well as new phone launches,
and retail seasons (when traffic to hosted applications is tremendously high). These tests have
helped identify issues with critical platform components such as auto-scaler, logging-service,
spring-cloud-services broker, cloud-cache service broker, and almost all the other components
that come under the scope of these tests. We hope that the CF community also reaps the benefits
of these smoke-tests, and helps strengthen the tests in the future through contributions to the repo.