Job Role: Azure Site Reliability Engineer
Location: Toronto, ON, Canada (Hybrid)
Job Type: Contract
Job Description:
Monitoring and Alerting
- Implement and maintain monitoring systems to proactively identify potential issues and alert engineers to problems before they impact users
Incident Response
- Respond to incidents and outages, diagnose problems, and implement solutions to minimize downtime and restore service
Automation
- Automate repetitive tasks and processes to improve efficiency and reduce manual effort
Performance Optimization
- Identify and address performance bottlenecks to ensure systems run efficiently and effectively
Infrastructure Management
- Manage and maintain the underlying infrastructure including servers, networks, and cloud resources
Capacity Planning
- Plan for future capacity needs to ensure systems can handle anticipated workloads
Release Engineering
- Develop and maintain processes for deploying software updates and releases
Collaboration
- Work closely with developers, operations teams, and other stakeholders to ensure system reliability and availability
Documentation
- Maintain clear and concise documentation of systems, processes, and procedures
Continuous Improvement
- Identify areas for improvement and implement changes to enhance system reliability and performance
Skills and Qualifications
- Cloud Platform: Microsoft Azure
- Excellent knowledge of Azure Kubernetes Service (AKS)
- Monitoring tools: Dynatrace, Splunk, Grafana
- Operating Systems: Windows, Linux
- Scripting: Shell Scripting, Python, PowerShell
- Database: MySQL, Oracle, SQL database management
- Container Services: Kubernetes, Docker, Helm
- An understanding of Camunda is preferable