Site Reliability Engineer (DevOps)

GitLab · Aug 3rd 2018

Apply on StackOverflow Careers

Site Reliability Engineers are responsible for keeping GitLab.com and many other GitLab production systems running smoothly 24/7/365. They are developers who specialise in systems, whether that means networking, the Linux kernel, or a specific interest in scaling, algorithms, or distributed systems. GitLab.com is a unique site and it brings unique challenges: it’s the biggest GitLab instance in existence; in fact, it’s one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our production engineers feeds back into other engineering groups within the company, as well as to GitLab customers running on-premise installations.

Responsibilities:

  • Be on a PagerDuty rotation to respond to GitLab.com availability incidents and provide support for service engineers with customer incidents.

  • Use your on-call shift to prevent incidents from ever happening.

  • Manage our infrastructure with Chef, Terraform and Kubernetes.

  • Make monitoring and alerting fire on symptoms and not on outages.

  • Document every action so your learnings turn into repeatable actions and then into automation.

  • Improve the deployment process to make it as boring as possible.

  • Design, build and maintain core infrastructure pieces that allow GitLab to scale to support hundreds of thousands of concurrent users.

  • Debug production issues across services and levels of the stack.

  • Plan the growth of GitLab's infrastructure.

Requirements:

  • Think about systems - edge cases, failure modes, behaviors, specific implementations.

  • Know your way around Linux and the Unix Shell.

  • Know what config management systems like Chef (the one we use) are for.

  • Have strong programming skills in Ruby and/or Go.

  • Have an urge to collaborate and communicate asynchronously.

  • Have an urge to document all the things so you don't need to learn the same thing twice.

  • Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.

  • Have an urge to deliver quickly and iterate fast.

  • Share our values, and work in accordance with those values.

Projects you might work on:

  • Coding infrastructure automation with Chef

  • Improving our Prometheus monitoring or building new metrics

  • Helping release managers deploy and troubleshoot new versions of GitLab-EE.

  • Migrating GitLab.com from its current home on Azure Cloud to Google Cloud Platform.

  • Migrating GitLab.com to Kubernetes.
