Site Reliability Engineer - Quip
At Quip, we have built what we think is an amazing new way for teams to collaborate, communicate, and make decisions, whether you're three friends in an office, or one of the largest companies on the planet with tens of thousands of employees at all corners of the globe. Our SF-based Site Reliability Engineering (SRE) team makes sure that Quip is delightfully fast, always available for our customers whenever they need it, and well-insulated from unpleasant surprises or outages.
We never lose sight of the principle that you cannot meet these goals at the expense of the team working on them: maintaining a healthy and supportive work environment that encourages happiness, good work-life balance, and technical and personal growth is as much a core value for us as having top-notch operational standards.
Our team focuses on keeping our product running smoothly, consistently, and without manual toil, in both public cloud services and bare-metal on-premises environments. We continually improve our technology and processes to create easy-to-understand, robust scaling, observability, and automation. SRE works closely with the rest of engineering to continue shipping new features to delight our customers and spread Quip to new teams, quickly and with little risk. We work to continually reduce the complexity that comes along with running large featureful applications over the Internet, especially for enterprise customers.
We know we have a keen product, and a fantastic place to work, and we're looking for some great people to join our SRE team to keep everything that way!
You would get the opportunity to:
- Continually evolve Quip's operational reliability and simplicity, responding to changes in environment, requirements, or circumstances.
- Maintain our observability and automation at the level where we need it to be, by extending existing infrastructure, setting up open source projects, or even developing custom solutions when necessary.
- Investigate and repair bugs, mysterious occurrences, and production issues throughout the entire system, in concert with product and infrastructure engineers.
- Champion operational excellence and production quality across the entire company, via production readiness reviews, system refactoring projects, and leading by personal example.
Skills and technologies you'd use (and learn or improve!) here:
- Building efficient scalable products on top of public or privately-run cloud services.
- Understanding, modifying, and writing Python code, both in our product codebase, and in supporting infrastructure for automation.
- Using configuration and orchestration tools to create repeatable, auditable, documented-in-code systems.
- Monitoring, tuning, and administering SQL databases for scalability, reliability, and performance.
- Designing systems for the sweet spot of long-term scale and reliability, while keeping manual maintenance and complexity costs down, and still shipping at reasonable speeds.
Things we're looking for in people who we want to join us:
- Keen interest in keeping a holistic view of entire systems in mind: patterns, architectures, data flows, lifecycles, edge cases, and risks.
- Excited about continually reducing complexity, and creating systems that are easily understandable, repeatable, and observable.
- Convinced about the importance of communication (both verbal and written/online), close team collaboration, and sharing information with others (creating documentation, or in-person training).
- Eager to learn best-in-field design and engineering practices from coworkers with a wealth of skills and experience, and to add your unique mark to what we're building.
- Drawn to understand (at a rough level) the basic skeleton of the stacks on which your system operates, from both network (SSL/TLS, HTTP, DNS, TCP, IP, CIDR, local networking, global routing) and host (process/daemon, system library, process supervisor, binary packaging, UNIX distribution, kernel) perspectives.
- Committed to focusing on the priorities and needs of our customers, your coworkers, and the direction of the company in general, and aligning your strategic goals to benefit them.
Bonus points (not at all required, but would let you hit the ground running!):
- Experience running medium-to-large user-facing services on public cloud services, particularly AWS.
- Experience running, scaling, tuning, and debugging production SQL databases, particularly MySQL on AWS RDS.
- Experience with configuration and orchestration management tools, particularly Terraform and Docker.
- Experience with and opinions about modern best-practice observability/debugging/logging/monitoring stacks.
- Comfortable writing Python (specifically, Python 3) scripts and libraries from scratch, and modifying existing code.
Salesforce, the Customer Success Platform and world's #1 CRM, empowers companies to connect with their customers in a whole new way. The company was founded on three disruptive ideas: a new technology model in cloud computing, a pay-as-you-go business model, and a new integrated corporate philanthropy model. These founding principles have taken our company to great heights, including being named one of Forbes’s “World’s Most Innovative Companies” five years in a row and one of Fortune’s “100 Best Companies to Work For” eight years in a row. We are the fastest growing of the top 10 enterprise software companies, and this level of growth equals incredible opportunities to grow a career at Salesforce. Together with our whole Ohana (Hawaiian for "family"), made up of our employees, customers, partners, and communities, we are working to improve the state of the world.