Partner DB for Breezy
1 June 2023
What is the project about?
Partner DB for Breezy is the data backbone behind Breezy’s partnership discovery: a custom crawler and big-data pipeline that surfaces affiliate and referral partnerships across the web. I scoped and led the product. The goal was to build the largest affiliate and referral partnerships database on the market, covering 100M+ websites, 15M affiliate campaigns, 6.5M affiliates, and ~600K brands, plus deep marketplace data (e.g. millions of Amazon products, sellers, and affiliate links). The system takes a hybrid approach: pre-parsed datasets from CommonCrawl are combined with live bots that target the most valuable sites, and an in-house signature database of affiliate networks and click-tracking SaaS products lets us recognize as many partnerships as possible.
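To illustrate the signature-database idea, here is a minimal sketch of matching outbound links against known tracking patterns. It is not the production code: the `NetworkSignature` shape, the `detectPartnerships` function, and the two example signatures (using reserved `.test` hostnames) are all assumptions made for illustration.

```typescript
// Hypothetical signature record. Field names and patterns are illustrative
// and are not taken from the real signature database.
interface NetworkSignature {
  network: string;      // label for the affiliate network or tracking SaaS
  hostPattern: RegExp;  // matches the hostname of a tracking link
  idParam: string;      // query parameter assumed to carry the affiliate id
}

// Made-up signatures on reserved .test domains.
const SIGNATURES: NetworkSignature[] = [
  {
    network: "example-network",
    hostPattern: /(^|\.)track\.example-network\.test$/,
    idParam: "aff_id",
  },
  {
    network: "sample-saas",
    hostPattern: /(^|\.)go\.sample-saas\.test$/,
    idParam: "ref",
  },
];

interface PartnershipHit {
  network: string;
  affiliateId: string | null; // null when the id parameter is absent
  url: string;
}

// Scan the outbound links of one page and report those that match a signature.
function detectPartnerships(links: string[]): PartnershipHit[] {
  const hits: PartnershipHit[] = [];
  for (const link of links) {
    let parsed: URL;
    try {
      parsed = new URL(link);
    } catch (_err) {
      continue; // skip strings that are not absolute URLs
    }
    for (const sig of SIGNATURES) {
      if (sig.hostPattern.test(parsed.hostname)) {
        hits.push({
          network: sig.network,
          affiliateId: parsed.searchParams.get(sig.idParam),
          url: link,
        });
        break; // first matching signature wins
      }
    }
  }
  return hits;
}
```

In this sketch, each crawled page contributes its outbound links, and every hit becomes a candidate row linking the page's site to a network and an affiliate id. A real signature set would likely also need path patterns and redirect handling, which are left out here.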
What’s unique about it?
I defined the architecture and infrastructure. The team built a custom Googlebot-style bot with geofencing, block prevention, anti-CAPTCHA, and device rotation, scaling on Kubernetes: it peaked at ~5,800 instances processing ~7M pages/hour, against a normal load of ~1,000 nodes. The stack is open source (Debian, Node.js, TypeScript, PostgreSQL, MongoDB, NSQ, ClickHouse) and runs on bare metal, with NVMe and in-memory storage where it matters, to stay cost-effective and cloud-agnostic.
We sped up CommonCrawl corpus processing by ~30× through optimized sorting and low-level direct-to-drive writes, and used NSQ for the job and results queues, which handle tens of millions of tasks with delivery confirmation. The pipeline processes a 400 TiB CommonCrawl index in ~9 days for about $600/month, a fraction of the typical cloud cost, and 3–4× faster.
I drove the design of an in-house distributed backup and of a prioritization algorithm that decides which sites to parse next, cutting cost and raising data yield. The result is the largest partnerships database in the space, built and run at a fraction of what cloud providers would charge.
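The text does not describe the prioritization algorithm itself, so the following is only a sketch of one plausible shape for it: rank candidate sites by expected yield per unit of crawl cost, then fill a fixed budget greedily. The `SiteCandidate` fields, the staleness weighting, and the 30-day cap are all assumptions, not the formula actually used.

```typescript
// Hypothetical inputs for deciding which sites to parse next.
interface SiteCandidate {
  domain: string;
  expectedNewLinks: number;   // estimated affiliate links a crawl would find
  crawlCostUnits: number;     // estimated crawl cost, in arbitrary units
  daysSinceLastCrawl: number; // staleness of the data we already hold
}

// Assumed scoring rule: expected yield, weighted up for stale sites,
// divided by cost. Staleness saturates after 30 days.
function priorityScore(site: SiteCandidate): number {
  const staleness = Math.min(site.daysSinceLastCrawl / 30, 1);
  const weightedYield = site.expectedNewLinks * (0.5 + 0.5 * staleness);
  return weightedYield / Math.max(site.crawlCostUnits, 1);
}

// Greedy selection: take sites in descending score order while they fit
// into the remaining budget.
function pickNextSites(candidates: SiteCandidate[], budgetUnits: number): string[] {
  const ranked = [...candidates].sort(
    (a, b) => priorityScore(b) - priorityScore(a),
  );
  const picked: string[] = [];
  let remaining = budgetUnits;
  for (const site of ranked) {
    if (site.crawlCostUnits <= remaining) {
      picked.push(site.domain);
      remaining -= site.crawlCostUnits;
    }
  }
  return picked;
}

// Usage with made-up numbers:
const nextBatch = pickNextSites(
  [
    { domain: "a.test", expectedNewLinks: 100, crawlCostUnits: 10, daysSinceLastCrawl: 30 },
    { domain: "b.test", expectedNewLinks: 50, crawlCostUnits: 50, daysSinceLastCrawl: 30 },
    { domain: "c.test", expectedNewLinks: 20, crawlCostUnits: 5, daysSinceLastCrawl: 0 },
  ],
  20,
);
console.log(nextBatch); // ["a.test", "c.test"]
```

A yield-per-cost ranking of this kind spends the crawl budget on sites where each parsed page is most likely to produce new partnership data, which is how a prioritization step can lower cost and raise yield at the same time.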
The role of the team
I managed the effort and owned scoping, architecture, and infrastructure. Working to that design, the team implemented the bot, the K8s scaling, the CommonCrawl pipeline, the NSQ queues, the signature and partnerships DB, and the backup solution, so Breezy could offer partnership discovery at scale without being locked into a commercial cloud provider.
Conclusions
Leading Partner DB for Breezy showed that clear scoping and the right architecture (an open-source stack, hybrid CommonCrawl + live bots, NSQ for resilience, and a prioritization model) make it possible to build the largest affiliate partnerships database on the market and run it at a fraction of cloud cost, while keeping full control over data and infrastructure.