Job Description: We are seeking an experienced Site Reliability Engineer (SRE) to ensure the high availability and reliability of our consumer-facing applications. The ideal candidate will have a strong background in supporting retail industry applications, with a preference for hands-on experience with AWS, Grafana, and Java development. Strong English communication skills are also essential for this role.
Key Responsibilities:
- Implement and maintain site reliability processes and systems.
- Provide service outage escalation response and guidance alongside software engineers.
- Review and assess the impact of monitoring metrics on current system behavior.
- Research and implement new tools and technologies to solve problems more efficiently.
- Conduct root cause analysis of production issues, including complex backend troubleshooting and debugging.
- Collaborate with cross-functional teams to achieve reliability excellence.
Preferred Experience:
- 5+ years of proven work experience as a Site Reliability Engineer or in a similar role, particularly in the retail industry.
- Hands-on experience supporting consumer-facing applications.
- In-depth knowledge of AWS services and best practices for cloud infrastructure.
- Proficiency with Grafana for monitoring and observability.
- Strong Java development skills, with experience implementing API functionality using REST, JSON, or similar technologies.
- Background with statistical or reliability software packages.
Skills:
- Expertise in troubleshooting Linux servers.
- Strong coding skills in at least one programming language, with a preference for Java.
- Familiarity with software engineering best practices, including testing, continuous integration, and continuous delivery.
- A passion for solving problems using open-source software.
- The ability to thrive in a rapidly evolving, globally distributed environment.
- Strong English communication skills, both written and verbal.