Site Reliability Engineer, Observability Platform
Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our operating environments and the GitLab codebase. SREs specialize in systems (operating systems, storage subsystems, networking) and implement best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems.
The Observability Team's mission is to Build, Run and Own the entire lifecycle of the suite of services that enable observability of the GitLab SaaS environments. These services allow Infrastructure, Development & Product teams to observe running products and contribute to our overall reliability and scalability goals. In this role, you will maintain the metrics and logging platforms for GitLab SaaS and be responsible for Prometheus, Thanos, Grafana, and logging.
GitLab.com is a unique site and brings with it unique challenges: it is the largest GitLab instance in existence (and in fact, one of the largest single-tenancy open-source SaaS sites on the Internet). The team’s experience feeds back into other Engineering groups within the company, as well as to GitLab customers running self-managed installations.
As an SRE you will:
Be on an on-call (PagerDuty) rotation to respond to incidents that impact GitLab.com availability, and provide support for service engineers with customer incidents.
Use your on-call shift to prevent incidents from ever happening.
Run our infrastructure with Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes.
Build monitoring that alerts on symptoms rather than on outages.
Document every action so your findings turn into repeatable actions and then into automation.
Use the GitLab product to run GitLab.com as a first resort and improve the product as much as possible.
Improve operational processes (such as deployments and upgrades) to make them as boring as possible.
Design, build and maintain core infrastructure that enables GitLab scaling to support hundreds of thousands of concurrent users.
Debug production issues across services and levels of the stack.
Plan the growth of GitLab’s infrastructure.
Experience with Kubernetes deployment and management
Experience with Elastic Cloud, Loki, Fluentd, Promtail or other logging systems and tools
Experience with Prometheus and/or Thanos deployment and management
Think about systems: edge cases, failure modes, behaviors, specific implementations.
Experience with or exposure to BigQuery and general cloud providers such as GCP and AWS
Know your way around Linux and the Unix Shell.
Know what configuration management systems like Terraform, Chef and/or Ansible are used for.
Have an urge to collaborate and communicate asynchronously.
Have an urge to document all the things so you don’t need to learn the same thing twice.
Have an enthusiastic, go-for-it attitude. When you see something broken, you can’t help but fix it.
Have an urge for delivering quickly and effectively, and iterating fast.
Share our values, and work in accordance with those values.
Ability to use GitLab
Coding infrastructure automation with Chef, Ansible, Terraform, and GitLab CI/CD
Improving our Prometheus monitoring or building new metrics
Helping release managers deploy and fix new versions of GitLab-EE.
Plan, prepare for, and execute the migration of GitLab.com from virtual machines running on Google Cloud to cloud-native container-based deployments with Kubernetes using Google Kubernetes Engine.
Develop a relationship with a product group, define their SLAs, share GitLab.com data on those SLAs and improve their reliability