Designation: Data Tier SREs
Roles and Responsibilities:
- Engage, influence, and promote SRE practices with development, operational, and product groups to align technology service/solution delivery.
- Drive quality accountability within the organization with well-defined processes, metrics, and goals.
- Manage availability, latency, scalability, and efficiency of Shared Services development by instilling engineering reliability into our development life cycle with a focus on fault-tolerant approaches.
- Define and report progress on strategic initiatives and project-level tasks to all stakeholders, including senior executives and clients, using a communication approach suited to each audience.
- Implement metrics-driven processes to ensure service quality targets are met.
- Manage system availability, health and service levels (SLAs, SLOs) of the large-scale cloud infrastructure, running in AWS and GCP.
- Proactively monitor, diagnose, and analyze failures, and support software engineers in debugging production issues across microservices and distributed platforms. Work with the development team to resolve the issues found.
- Participate in the on-call rotation and resolve issues in a multi-cloud (AWS/GCP) environment.
- Monitor metrics and performance of applications and cloud infrastructure.
- Manage code releases, i.e., push code and patches to the cloud.
- Own the entire incident lifecycle (incident management), from reporting, analyzing, and handling incidents through to closure and writing RCAs.
Qualification:
- Bachelor’s or Master’s degree in Computer Science, Information Science, or Electronics and Communication.
- Minimum 6-7 years of DevOps/SRE experience.
- 3+ years of hands-on experience with AWS or GCP, including EC2 (GCE), IAM, S3 (GCS), Docker, Kubernetes pods, Jenkins, Prometheus, CloudWatch (Stackdriver), Linux, and Ansible.
- 3+ years of experience deploying code and infrastructure in AWS or GCP using continuous integration/continuous delivery (CI/CD) tools in production environments.
- 3+ years of automation experience using Python, Golang, and/or shell scripting.
- 4+ years of prior experience in developing metrics to monitor the health of infrastructure and applications.
- 3+ years of experience in managing SaaS application infrastructure, with REST-based test automation experience using Python.
- Thorough understanding of networking fundamentals (TCP/IP, UDP, DHCP, DNS, ICMP, ARP, routing and switching).
- General understanding of distributed systems.
- Understanding of data management technologies including relational and non-relational databases.
Additional Information:
- An AWS or similar cloud certification is a big plus.
- Knowledge of build pipelines and infrastructure such as Jenkins, GitHub, and CI/CD would be an added advantage.
- Work in an agile and highly collaborative environment with our globally distributed engineering teams, architecture, product management, and operations.
- Maintain excellent written and verbal communication with clients, employees, and the management chain, including status reports, project plans, presentations, etc.
- Basic understanding of Terraform, CloudFormation, or another IaC tool is preferred.
- Ideally, a detailed understanding of IP routing, security, and cloud services such as CGNAT, IPsec, IDP, and SD-WAN/SDN for different customer use cases.
Time zone interactions: US and Tokyo time zones
Location: Bengaluru