The AWS Fleet Telemetry Team is part of AWS Engineering, which designs the world's most innovative compute and storage platforms, enabling one of the world's largest infrastructure-as-a-service (IaaS) offerings.
As Senior Engineer of the Development Team, you will create, deploy, and maintain autonomous monitoring agents at scale. You will create software that autonomously mines big data, extracts trends from disparate data sources, identifies unhealthy hosts before they impact compute or storage capacity, issues health diagnoses, and remediates the identified problems. Good health begins from the day our products are born. You will also be responsible for leading your team to design and develop software that ensures every server in the AWS fleet is built, configured, and performing according to its design specification from the time it is placed into service until the time it is retired. Your software must be unobtrusive, efficient, and scalable. You will use one of the world's most dependable, easy-to-use, and performant big data platforms to innovate and develop your hardware immune system.
Our systems run 24/7 in the harshest environments and serve more than a million customers each day who demand performance, even under the toughest compute workloads. The health of our infrastructure is our top priority. We’re looking for technology leaders who can help us build these systems, solve really tough operational problems, and suggest new, innovative ways to keep AWS hardware in tip-top shape.
You will work with software and hardware teams across the company to build world-class software. You will be part of a growing, fast-paced team that is making history. You will own the software development roadmap for your team and will have a stake in creating the roadmap for future AWS hardware and software.
BA/BS in Computer Science or related discipline, or equivalent work experience.
5+ years of experience developing distributed services in at least one of Python, Ruby, C/C++, or Java
Obsession with innovating and building services to solve large-scale problems.
Passion to dive deep to resolve problems at their root, looking for failure patterns amenable to long-term solutions via simplification and automation.
Knowledge of the Linux operating system and user-level tools
Skilled in shell scripting
Superb troubleshooting and problem-analysis skills
Basic understanding of how commodity servers, operating systems, and networks function, perform, and scale
Basic understanding of standard networking protocols (Ethernet, ARP, IP, ICMP, UDP, TCP, SSL, DNS, HTTP, etc.)
Experience working on highly concurrent, high-throughput systems, with knowledge of distributed systems