About the Role
Protege is hiring a Senior Software Engineer to own the data processing layer at ingestion: the part of the platform that takes large-scale source data and turns it into clean, structured, enriched, validated, AI-ready datasets. This is a hands-on, backend- and data-heavy role with end-to-end ownership of the pipelines that move and process data at volume.
Ingestion & Processing Systems
- Design, build, and operate the ingestion systems that process large volumes of multimodal data into usable, well-structured datasets
- Own the ingestion path end to end, from how data lands to how it is validated, processed, tracked, and made available downstream
- Build modality-specific processing steps for real-world source data, such as medical imaging processing, audio and video metadata extraction, quality validation, and notes processing
- Build parsers, validators, and normalization logic that can systematically handle messy, non-standard, and high-variance source formats
- Turn repeated one-off data handling work into reusable processing patterns, internal tooling, and platform capabilities
Scale, Performance & Reliability
- Build for high volume and high throughput, optimizing systems for reliability, cost, and speed
- Work across distributed and parallel compute systems to process workloads that do not fit well on a single machine
- Choose the right execution model for the workload, including batch processing, distributed execution, and modern compute patterns for unstructured data and inference-heavy processing
- Diagnose and resolve bottlenecks across ingestion and processing systems, and keep performance from degrading as volume and modality complexity grow
Data Quality, Security & Compliance
- Build validation and quality checks that catch bad, incomplete, or malformed data before it propagates downstream
- Handle sensitive and regulated data, including protected health information (PHI), with the security and care the domain demands, applying de-identification where required
- Track provenance, metadata, and usage constraints through the ingestion path so downstream use remains compliant and auditable
- Raise the quality bar for observability, debuggability, and operational reliability across the ingestion layer
Cross-Functional Partnership
- Partner with product and Data Lab to support new modalities, new partner requirements, and non-standard source data
- Work directly with partner engineering teams when needed to translate source-system realities into robust ingestion and processing design
- Surface recurring patterns that are worth standardizing into reusable transforms, validators, and internal tooling
- Help shape how Protege handles new data types as the platform expands into more complex data environments
What You Bring
- 5+ years building and operating production backend or data systems, with real experience in data processing at scale
- Hands-on experience designing and running large-scale data pipelines
- Strong programming skills in Python
- Experience with distributed data processing
- Strong proficiency with AWS