£100,000 - £110,000
I’m currently working with a well-established company that is looking for an experienced Principal Site Reliability Engineer (SRE). As a Principal Site Reliability Engineer, you will play a critical role in the design and development of the company's tooling, monitoring, control, self-service reporting, and analysis approach. You will also establish the policies and procedures governing its incident, change, and problem management protocols.
The successful candidate will continue to evolve the function. Primary responsibilities will be monitoring and remediating systems, security, and network issues using a range of application and network management tools.
Key Skills & Qualifications
- 10+ years in a senior technical role, such as a DevOps, Software Engineering, Systems Engineering, or Support Engineering position.
- Demonstrated experience designing, installing, and configuring monitoring solutions – ideally for mission-critical, 24x7 environments.
- Solid understanding of monitoring essentials such as SNMP, WMI, and synthetic transaction engines, plus experience with a range of commercial, open-source, and homegrown monitoring packages and methods (e.g., Splunk, Nagios, Zabbix, OneSite, Gomez, CA, HP OpenView).
- Strong scripting skills in languages such as PowerShell or Python.
- Proficiency in object-oriented languages such as C# or Java.
- Solid understanding of application-level observability tools and techniques, including OpenTracing, OpenTelemetry, and APM tools (e.g., Elastic, Datadog, New Relic).
Key Responsibilities
- Architect and develop solutions and roadmaps for monitoring the systems that make up the company's operating environment, and use the resulting telemetry for alert response and troubleshooting.
- Work with Architecture, Security, Development, Systems Engineering, and Operations teams to develop innovative solutions that achieve high availability, scalability, and reliability.
- Provide technical leadership and hands-on scripting, tooling, and automation for continuous operations.
- Identify incidents using monitoring tools, notifications, and log files.
- Create new monitors and modify existing ones as needed.
Benefits
- Annual bonus
- Remote working
- Annual pay review
- Unlimited Holidays + More