- Location: Montreal
- Province: Quebec
- Country: Canada
Description & Requirements
This role sits at the intersection of production operations, incident response, and DevOps engineering. You will be a primary owner of production issues, handling incident triage, bug investigation, and internal feature requests, while also building automation and infrastructure improvements that reduce operational load over time.
This is not a pure infrastructure-only DevOps role. A significant portion of the work involves hands-on production support and cross-team coordination, with increasing opportunity to drive reliability, observability, and automation as capacity grows.
This is a hybrid role based in Montreal or Vancouver (3 days in office).
What You’ll Do
Production Operations & Incident Ownership (Core Responsibility)
Act as a first responder for production incidents, outages, and customer-reported issues
Investigate issues using logs, metrics, dashboards, and tracing
Determine root cause and either resolve directly or drive resolution across teams
Own issues end-to-end, from report to mitigation to follow-up improvements
Identify recurring problems and reduce them through automation or tooling
Cross-Team Collaboration
Work closely with backend, platform, infrastructure, and product teams
Translate operational pain points into actionable engineering work
Coordinate fixes when multiple teams are involved and ensure accountability
DevOps, Automation & Reliability Engineering (Growing Focus)
Build and maintain CI/CD pipelines
Write automation scripts to reduce manual or error-prone operational tasks
Improve monitoring, alerting, and dashboards
Make configuration changes that improve system stability and reliability
Help drive adoption of internal tools and best practices
Who Thrives in This Role
Engineers who enjoy debugging real production problems
People comfortable being the primary owner when something breaks
Candidates with experience in technical support, production operations, SRE, or platform teams
Engineers who want to grow deeper into DevOps by learning from real operational context
Strong communicators who can work across teams to get issues resolved
Required Qualifications
Reliability Mindset - 2+ years of experience with incident management, on-call rotations, and post-incident analysis
5+ years of experience managing distributed, scalable, and resilient high-performance systems
Comfortable working in incident-driven environments
Ability to work effectively in a collaborative team environment and communicate complex technical concepts clearly to non-technical teams.
C#/.NET experience (C++ an asset)
Familiarity with CI/CD concepts and automation scripting
Experience with container workload technologies such as Kubernetes, Helm and Docker.
Experience with monitoring/observability systems such as Prometheus, Grafana and/or Datadog.
Experience with continuous integration and delivery, using pipeline automation systems such as Jenkins, GitLab and GitHub.
Bachelor's degree in Computer Engineering, Software Engineering, Computer Science, or a related concentration, or an equivalent combination of education and work experience.