About the Role
As our first Data/Infrastructure Advocate Engineer, you’ll bridge the gap between cutting-edge data infrastructure and the global community of data engineers, researchers, and developers. You’ll champion Xet storage on the Hugging Face Hub, empowering users to efficiently store, version, and collaborate on large-scale datasets. This role is for someone who thrives at the intersection of technical depth (storage, Parquet, deduplication) and community advocacy—helping define the future of open data workflows.
You’ll collaborate with teams like Datasets, Hub, and Infrastructure to shape how developers interact with data on our platform, and inspire a community to build better, faster, and more scalable data pipelines.
Your Main Missions:
- Grow and nurture the open-source data/infra community: launch initiatives, collaborate with data-focused groups, and organize events or challenges. Engage with communities like Apache Parquet, Open Table Formats, and data engineering forums to promote best practices and Hugging Face tools.
- Promote the Hugging Face Hub as the go-to platform for data storage, versioning, and collaboration—curate and showcase datasets, benchmarks, and tools like Xet.
- Highlight use cases like efficient large dataset updates, Parquet editing, and deduplication to demonstrate the Hub’s value for data workflows.
- Create demos, benchmarks, and tools (e.g., Colab notebooks) to illustrate best practices for data storage and versioning.
- Experiment with Xet, Parquet, and other data formats to showcase their potential for ML and data engineering.
- Produce high-quality tutorials, blog posts, and videos that make complex topics accessible.
- Share insights on storage optimization, dataset versioning, and deduplication to empower developers.
- Actively participate in online communities (Discord, GitHub, forums) to highlight contributions, answer questions, and foster collaboration.
- Ensure datasets and tools released on the Hub are well-documented, with clear examples, benchmarks, and use cases.
About You
You’re a great fit if you:
- Have strong technical skills in Python, data libraries (e.g., pandas, pyarrow, huggingface/datasets), and storage formats and systems (Parquet, Open Table Formats, S3).
- Are a hands-on builder who loves experimenting with data tools, storage optimization, and dataset versioning.
- Can clearly explain complex topics (e.g., deduplication, compression, Parquet editing) through writing, demos, or talks.
- Are active in developer communities (GitHub, Discord, forums) and passionate about open source and knowledge sharing.
- Thrive in fast-moving environments and enjoy building in public to inspire others.
If you're interested in joining us but don't tick every box above, we still encourage you to apply!