Senior Site Reliability Engineer I - 211515 - Electronic Arts

General Information

Location: Hyderabad, Telangana, India

Role ID

211515

Worker Type

Regular Employee

Studio/Department

CT - Infrastructure & Platform

Flexible Work Arrangement

Hybrid

Description & Requirements

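Electronic Arts creates next-level entertainment experiences that inspire players and fans around the world. Here, everyone is part of the story. It is a community that connects people across the globe, a place where creativity thrives, new perspectives are welcome, and ideas matter. It is a team where everyone helps make play happen.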

Senior SE I / Site Reliability Engineer (SRE)

Job Description

We are seeking an accomplished Senior Site Reliability Engineer (SRE) with 12–15 years of experience to lead the reliability, scalability, and performance engineering of our critical infrastructure and production systems. As a Senior SRE, you will play a strategic and technical leadership role — driving reliability practices, mentoring SRE teams, and influencing the adoption of automation, observability, and resilience engineering across the organization.

You will act as a technical thought leader and hands-on engineer, collaborating with infrastructure, application, and operations teams to build, automate, and scale reliable systems that support global business operations. This role requires deep expertise in cloud platforms, automation, monitoring, incident management, and system design for large-scale distributed environments.

Roles & Responsibilities

1. Reliability Engineering & Automation

Architect, implement, and manage resilient, scalable, and highly available infrastructure systems.
Lead initiatives to automate manual operations, deployment, and monitoring processes to improve reliability and reduce toil.
Drive the creation of observability solutions and dashboards to proactively detect and remediate potential issues.

2. Incident & Problem Management

Lead critical incident response, ensuring swift mitigation and clear communication to stakeholders.
Conduct detailed root cause analysis (RCA) and drive permanent corrective actions to prevent recurrence.
Implement and mature incident management frameworks, including runbooks, playbooks, and post-incident reviews.

3. Infrastructure Operations & Performance Optimization

Oversee system performance, capacity planning, and scalability of infrastructure across hybrid and cloud environments (AWS, Azure, GCP).
Optimize system resource utilization, latency, and reliability through performance tuning and automation.
Work closely with architecture and platform teams to accommodate growth, change, and modernization initiatives.

4. Leadership & Mentorship

Provide technical leadership and mentorship to SRE teams and cross-functional engineering groups.
Promote an SRE culture across teams — championing principles of reliability, automation, observability, and continuous improvement.
Drive collaboration between development, QA, DevOps, and release teams to embed reliability into the software development lifecycle (SDLC).

5. Service Level Management

Define, track, and continuously improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Apply the Four Golden Signals of SRE monitoring — Latency, Traffic, Errors, and Saturation — to guide system health and performance strategies.

6. Documentation & Knowledge Sharing

Establish and maintain comprehensive documentation of systems, operational procedures, and best practices.
Facilitate learning through technical sessions, blameless postmortems, and cross-team knowledge sharing.

7. Strategic Technology & Continuous Improvement

Contribute to defining the long-term SRE strategy, tooling roadmap, and automation frameworks.
Evaluate and adopt emerging technologies, tools, and methodologies to enhance system reliability and efficiency.
Partner with business and technical leaders to ensure alignment of SRE objectives with organizational goals.

8. Security & Compliance

Collaborate with security and compliance teams to ensure infrastructure, systems, and operations meet organizational and regulatory standards.
Implement secure configuration baselines, vulnerability remediation, and access control policies.
Integrate security practices into CI/CD pipelines to ensure DevSecOps alignment.

9. Strategic Leadership & Stakeholder Management

Partner with executive and business stakeholders to align SRE initiatives with enterprise objectives and risk frameworks.
Provide data-driven insights on reliability, capacity, and operational performance to influence strategic decision-making.
Represent SRE functions in technical governance forums, audits, and architecture reviews to drive reliability-focused outcomes.

Qualifications

Education: Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
Experience: 12–15 years of total IT experience, with at least 8 years in SRE, DevOps, or large-scale systems engineering.
Technical Expertise:
- Strong proficiency in Linux/Unix system administration and internals.
- Proven experience with cloud platforms (AWS, Azure, or GCP).
- Advanced scripting and automation skills using Python, Go, PowerShell, or Bash.
- Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes), and expertise in service meshes such as Istio.
- Deep understanding of monitoring and observability stacks (Prometheus, Grafana, ELK, Datadog, Splunk, Zabbix, Nagios).
- Expertise in configuration management and IaC tools (Ansible, Terraform, Chef, Puppet).
- Strong knowledge of networking, load balancing, databases, and distributed systems.
Operational Excellence:
- Hands-on experience in incident response, problem management, and capacity planning at enterprise scale.
- Proven ability to design for reliability, redundancy, and disaster recovery.
Soft Skills:
- Excellent analytical, communication, and leadership abilities.
- Proven track record of mentoring and developing high-performing engineering teams.
- Strong stakeholder management and cross-functional collaboration skills.

Nice to Have

Experience defining and implementing SRE frameworks or centers of excellence in global organizations.
Familiarity with REST API development, integration, and database query optimization.
Strong understanding of governance, risk, and compliance frameworks.
Experience with AIOps, self-healing systems, or machine learning-driven monitoring.
Demonstrated experience in driving organizational culture change toward reliability and automation.
Active participation in industry forums or open-source contributions related to DevOps or SRE practices.

About Electronic Arts

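We have an extensive portfolio of games and experiences, locations around the world, and opportunities across EA. We value adaptability, resilience, creativity, and curiosity. From leadership that brings out your potential to space for learning and experimenting, we empower you to do great work and pursue opportunities for growth.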

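We take a holistic approach to our benefits programs, emphasizing physical, emotional, financial, career, and community wellness to support a balanced life. Our packages are tailored to meet local needs and may include healthcare coverage, mental well-being support, retirement savings, paid time off, family leave, complimentary games, and more. We nurture environments where our teams can always bring their best to what they do.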

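Electronic Arts is an equal opportunity employer. All employment decisions are made without regard to race, color, national origin, ancestry, sex, gender, gender identity or expression, sexual orientation, age, genetic information, religion, disability, medical condition, pregnancy, marital status, family status, veteran status, or any other characteristic protected by law. We will also consider qualified applicants with criminal records for employment, in accordance with applicable law. EA also makes workplace accommodations for qualified individuals with disabilities as required by applicable law.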