Tutuka · Apr 12th 2019
What this job entails
As the Lead SRE at Tutuka, you'll work closely with the entire technical team to ensure the reliability of the enterprise-level, highly scalable, highly secure financial processing systems that power tens of millions of transactions, and to tie them to the web, mobile and API interfaces that make it easy for people to issue, redeem and reconcile prepaid cards all over the world.
We already have a team of amazing developers who work out of our local offices in Johannesburg, South Africa, as well as remotely across Europe and Southeast Asia, and now we need you to drive improvements in our reliability, scalability and efficiency.
What you will be doing
Every day will bring an exciting challenge as you help our technical team transform a monolithic enterprise processing environment with bank-level security and 99.95% uptime into a sleek, nimble, microservice-based serverless processing environment with better-than-bank-level security and 99.99% uptime.
If it were easy, we would already have done it! This role may or may not involve the following:
Work closely with software engineering teams to improve availability, latency, performance, efficiency, monitoring, emergency response, and capacity planning of services across a hybrid cloud environment of a hosted data centre and AWS
Handle upgrades of infrastructure and services through automation
Identify, gather, document and automate responses to key performance metrics, logs, and alerts
Find optimizations and other efficiencies to scale the application
Develop playbooks and tools to streamline processes and shorten problem resolution time
Maintain an infrastructure-as-code management process
Perform periodic on-call duties
Skills & requirements
We love taking on team members with a variety of skill levels, from intern to PhD. But there's no getting around the fact that we need this person to know what they're doing and to hit the ground running.
You should already be an SRE guru with:
Solid understanding of operational principles, such as capacity planning, monitoring and incident handling
Experience automating manual processes, leveraging cloud (preferably AWS) platforms
Telemetry, tracing, logging, and alerting best practices
Experience implementing monitored and seamless deployment pipelines
Internet fundamentals: HTTP/S, DNS, TCP/IP, security by design, caching
Extra kudos are awarded for:
JVM performance tuning
Experience monitoring cloud-based systems
Knowledge of automated testing frameworks and methodologies
If you have no site reliability engineering experience, your application cannot be considered.