Our client is hiring an SSR/SR SRE (Site Reliability Engineer) to join the team.
- Windows Server or Linux (Red Hat and/or Debian-based distributions) administration. [required]
- Experience with at least one of the following programming languages: Python, Go. [required]
- Application monitoring, troubleshooting, log analysis, system metrics analysis. [required]
- Strong understanding of networking concepts (switching and routing, OSI model). [required]
- Experience working with version control systems (VCS) such as Git. [required]
- Handle second-level real-time alerts.
- Resolve high-impact incidents together with an incident response team.
- Feel confident learning native scripting languages (Bash, PowerShell) to implement solutions.
- Experience coordinating resources to achieve service restoration (i.e., incident management).
- Basic knowledge of Cloud infrastructure (AWS, Azure).
- Operating system monitoring.
- Read and interpret monitoring system graphs.
- Knowledge about application servers such as Red Hat JBoss/WildFly.
It's a PLUS:
- Experience working with configuration management tools such as Puppet or Ansible (Preferred).
- Experience managing on-premises and cloud-based infrastructure.
- Experience tracking problems with ticketing systems such as Jira Service Desk (Preferred).
- Experience working with containerization software such as Docker Engine.
- English certifications, a driver's license, a U.S. visa, or a European passport are a big plus.
- Availability to travel to the USA for at least 2 weeks.
- Advanced English (writing and speaking skills) is required to communicate with technical teams and customers.
- Availability to be on a passive on-call schedule.
You will be:
- Working closely with a cross-functional team of SREs, DBAs, developers, and engineers to ensure the reliability of the platform.
- Participating in an Agile team, with daily scrum meetings as well as planning and grooming meetings.
- Developing your monitoring skills by using different monitoring tools.
- Developing custom tools to automate processes as you see fit in order to reduce toil and increase engineering work.
- Monitoring metrics for overall reliability of a distributed SaaS product.
- Interacting with Azure cloud services, working mostly on Windows platforms and on some Linux platforms.
- Troubleshooting distributed systems and applications.