DE Jobs


Job Information

GRAIL Staff Site Reliability Engineer #3718 in Menlo Park, California

GRAIL is seeking a Staff Software Engineer in our Site Reliability Engineering (SRE) team to help us improve the security and reliability of production systems that are critical for our mission to detect cancer early and save lives. You will contribute to the architecture, design, development, and implementation of critical cloud-based infrastructure, services, and applications, and be responsible for their secure, healthy, and reliable operation. You are someone who enjoys learning and implementing industry best practices and technology trends. You foster and contribute to a creative and collaborative culture to deliver results. You embrace ambiguity and enjoy exploring new technologies to deliver robust, scalable solutions.

This is a hybrid role that requires you to be onsite 2 days a week in Menlo Park, CA.

Responsibilities

  • Ensure High Availability: Implement and maintain resilient cloud architectures, monitor system performance, and proactively identify and resolve potential bottlenecks or points of failure.

  • Incident Management: Play an active role in production on-call, responding swiftly to troubleshoot and resolve production issues. Minimize service disruptions and downtime by conducting thorough triaging and debugging of product or system issues. Continuously optimize the on-call process for sustainability and efficiency.

  • Automation and Tooling: Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks. Your contributions will be vital in efficiently scaling cloud operations.

  • Performance Optimization: Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness.

  • Security and Compliance: Collaborate with security engineers to implement best practices and ensure compliance with security standards and policies.

  • Monitoring and Alerting: Design and configure advanced monitoring systems to gain insights into system behavior, set up alerts, and respond proactively to potential issues. Create and maintain comprehensive dashboards and playbooks for production on-call.

  • Software Development Consultation: Engage actively in the entire software development lifecycle. Participate in system design reviews and provide valuable Site Reliability Engineering (SRE) insights during launch reviews, influencing and enhancing system architecture.

Preferred Qualifications

  • Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.

  • 3+ years of professional experience maintaining production systems on cloud-based services and infrastructure.

  • 8+ years of software development experience in one or more programming languages, with a primary focus on leveraging and working on cloud-based services and infrastructure.

  • Strong knowledge of the AWS cloud platform.

  • Practical experience with containerization technologies, including Docker and Kubernetes.

  • Familiarity with Python, Bash scripting, and Ansible.

  • Familiarity with infrastructure-as-code tools such as Terraform is essential.

  • Solid understanding of databases, networking, security principles, and best practices.

  • Proficiency in using monitoring and alerting tools to detect and respond to potential issues effectively.

Desired Skills

  • AWS certifications (such as Solutions Architect, Security, etc.).

  • Experience in a regulated industry or the healthcare field.

The expected full-time annual base pay range for this position is $180,000 - $210,000. Actual base pay will depend on skills, experience, and location.
