Splunk

Principal Site Reliability Engineering - US REMOTE OK

September 14, 2021


Join us as we pursue our disruptive new vision to make data accessible, usable, and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we’re committed to our work, our customers, having fun, and most importantly to each other’s success. Learn more about Splunk careers and how you can become a part of our journey!
Role
As part of Splunk's Cloud-First mission, Site Reliability Engineering (SRE) is accountable for the overall reliability of services running in our cloud production environments. We are systems and software engineers who engage with product and infrastructure teams at every level, from directly embedding on their teams to tagging in for the gnarliest of production challenges. Our goal is to make Splunk's production environments more transparent, more predictable, and less cognitively demanding for the service owners who operate their services in them.
You Will:
  • Design and develop software to maximize the engineering velocity of teams and increase the reliability of Splunk products.
  • Mentor new software engineers to achieve more than they thought possible.
  • Work across the organization to deliver quality products that delight Splunk's passionate users.
Qualifications:
  • Proven experience as a Software Engineer.
  • A passion for automating away common tasks and processes.
  • You are dedicated to writing well-tested and maintainable code.
  • Understanding of observability libraries and ability to instrument code to expose new application metrics.
  • You enjoy making other teams successful and are fulfilled through the success of others.
  • You enjoy mentoring junior engineers to raise the skill level of your teams.
  • You enjoy designing, developing, and maintaining distributed systems at scale in production.
  • You understand the challenges and trade-offs to be made when building and deploying systems to production.
  • Knowledge of standard methodologies related to security, performance, and disaster recovery.
  • Skilled in identifying performance bottlenecks and anomalous system behavior, and in resolving the root cause of service issues.
  • Experience with Kubernetes and cloud-native environments.
  • Familiarity with, and passion for, SRE methodology.
Preferred skills:
  • Experience with Golang, GitLab CI, Qbec, OpenAPI/Swagger.
  • Understanding of multi-tenancy and security implications of a cloud-native environment.
  • Experience working with container deployment and orchestration technologies, with knowledge of fundamentals including service discovery, deployments, monitoring, scheduling, and load balancing.
  • Experience with development and deployment in a hosted cloud environment, preferably AWS & GCP.
  • Experience with distributed cloud service development, infrastructure, traffic management, and architecture.
  • Experience with optimized and scalable software that operates on a large number of nodes.

We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.
For job positions in San Francisco, CA, and other locations where required, we will consider for employment qualified applicants with arrest and conviction records.