Startup Ideas DB Engineering Teardown: pgvector + Multi-Provider Billing

Founders waste weeks scouring Reddit, Product Hunt, Hacker News, and Indie Hackers trying to identify validated problems worth solving. Existing idea databases are either shallow listicles or community-driven collections with no quality filter. Startup Ideas DB is the opposite: 12K+ structured problem statements, sub-50ms vector search, on-demand 2,000-word LLM whitepapers, and three payment providers covering readers worldwide.

This is the engineering teardown of Startup Ideas DB — the daily scraper pipeline, why pgvector at this scale, and how we synchronised three payment providers into one entitlement store.

The Problem

The idea-research market has two failure modes:

Listicle-grade content. Generic "47 SaaS ideas you can build" pages that scrape competitor lists with no validation.
Community-grade content. Indie Hackers and r/SaaS, where ideas are unfiltered, contradictory, and require hundreds of hours of manual sorting.

Founders need something in between: structured, search-grade, depth-on-demand. The kind of resource a serious researcher would build over a year — except shipped as a SaaS.

The Architecture

A TypeScript monorepo with shared types and components:

Public marketplace (Next.js) — Founder-facing idea search, filtering, whitepaper purchase
Admin portal (Next.js) — Internal team for content review, scraper monitoring, quality flagging
Python scraper service — Daily harvest from Reddit, Product Hunt, Hacker News, Indie Hackers
Supabase Postgres + pgvector — 12K+ embedded problem statements with metadata
OpenAI text-embedding-3-large — Embedding generation
BuilderAI service — Multi-step LLM chains for on-demand whitepaper generation
Three payment providers — Razorpay (INR), DodoPayments (USD/EUR + crypto), PayPal (international cards)

Hosted on Cloudflare for edge caching, with Supabase as the source-of-truth data layer.

Key Technical Decisions

Daily Scraper Pipeline with Survival Tactics

The scraper layer was the operationally hardest part. Reddit, Product Hunt, Hacker News, and Indie Hackers all change their API/HTML/auth periodically. Production scrapers fail constantly without active maintenance.

We designed for failure tolerance:

Per-source isolation. Each source has its own scraper with its own retry/backoff. One source breaking doesn't affect the others.
Schema-flexible parsing. We don't assume specific HTML structures; we extract via multiple selector fallbacks. When all fallbacks fail, we log and alert, and the source is temporarily marked as degraded.
Rate limit awareness. Each source has tracked quotas. The scraper backs off proactively rather than getting blocked.
Daily diff, not full re-scrape. We only fetch new posts since last run, with a periodic full re-scrape (weekly) to catch updated content.
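
The per-source isolation and retry/backoff ideas above can be sketched roughly as follows. This is a minimal illustration, not the production scraper: `run_source`, `run_all`, `SourceResult`, and the retry parameters are hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SourceResult:
    name: str
    posts: List[str] = field(default_factory=list)
    status: str = "ok"  # "ok" or "degraded"
    attempts: int = 0


def run_source(name: str, fetch: Callable[[], List[str]],
               max_attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> SourceResult:
    """Run one source with exponential backoff; never raise to the caller."""
    result = SourceResult(name=name)
    for attempt in range(1, max_attempts + 1):
        result.attempts = attempt
        try:
            result.posts = fetch()
            return result
        except Exception:
            # Wait 1x, 2x, 4x ... base_delay between attempts (assumed policy).
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed: mark the source degraded instead of crashing the run.
    result.status = "degraded"
    return result


def run_all(sources: Dict[str, Callable[[], List[str]]], **kwargs) -> Dict[str, SourceResult]:
    """Each source runs independently, so one failure cannot affect the others."""
    return {name: run_source(name, fetch, **kwargs) for name, fetch in sources.items()}
```

A caller would alert on any result whose `status` is `"degraded"` and carry on ingesting posts from the healthy sources.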

The result: scraper uptime above 95% per source, with ~2 hours of monthly maintenance.

pgvector at 12K+ Scale

We deliberately chose pgvector over Pinecone/Weaviate. The reasoning:

12K vectors is small. Specialised vector DBs are overkill at this scale.
pgvector lets us filter on metadata (industry, source, date) and vector similarity in one SQL query, which is awkward across separate Postgres + Pinecone.
One database to manage, monitor, back up.
Sub-50ms p95 query latency at this scale on a modest Postgres instance.

We built an HNSW index on the embedding column. Query latency at 12K rows is ~12–30ms; at 50K rows we'd expect ~25–60ms. Past 100K we'd evaluate moving to a dedicated vector DB.
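
To make the "metadata filter plus similarity in one SQL query" point concrete, here is a hedged sketch of how such a query could be built. The table name `ideas`, the columns `title`, `industry`, `source`, and `created_at`, and the helper names are assumptions for illustration. One detail worth flagging: text-embedding-3-large returns 3,072 dimensions by default, while pgvector's HNSW index on the `vector` type supports at most 2,000, so this sketch assumes the embeddings are requested at a shorter size (1,536 here, via the embeddings API's `dimensions` parameter). Storing them as `halfvec` would be another option.

```python
from typing import List, Optional, Tuple

# Assumed DDL: an `ideas` table with an `embedding vector(1536)` column.
# `<=>` is pgvector's cosine-distance operator; vector_cosine_ops matches it.
HNSW_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS ideas_embedding_hnsw
ON ideas USING hnsw (embedding vector_cosine_ops);
"""


def to_vector_literal(embedding: List[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"


def build_search_query(embedding: List[float], industry: Optional[str] = None,
                       source: Optional[str] = None, since: Optional[str] = None,
                       limit: int = 10) -> Tuple[str, list]:
    """Return (sql, params) for a filtered nearest-neighbour search."""
    clauses, filter_params = [], []
    if industry is not None:
        clauses.append("industry = %s")
        filter_params.append(industry)
    if source is not None:
        clauses.append("source = %s")
        filter_params.append(source)
    if since is not None:
        clauses.append("created_at >= %s")
        filter_params.append(since)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    vec = to_vector_literal(embedding)
    sql = f"""
        SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
        FROM ideas
        {where}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Parameter order follows placeholder order: SELECT, WHERE, ORDER BY, LIMIT.
    return sql, [vec, *filter_params, vec, limit]
```

The returned pair can be passed to any driver that uses `%s` placeholders, for example `cursor.execute(sql, params)` in psycopg. With an approximate index, pgvector applies the filter after the index scan, so a heavily filtered query may return fewer than `limit` rows.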

For the broader LLM and AI architecture context, see LLM Architecture Deep Dive and How to build an AI agent.

On-Demand 2K-Word LLM Whitepapers via Multi-Step Chains

The premium feature: founders can generate a 2,000-word technical whitepaper for any idea covering market sizing, competitor analysis, tech stack recommendations, and go-to-market strategy. The naive approach — single LLM call for "write a whitepaper" — produces shallow, formulaic output.

We use a multi-step chain:

Research call — LLM identifies the key questions a whitepaper should answer for this specific idea
Section generation — Each section (market, competitors, tech, GTM) gets a focused LLM call with relevant context
Synthesis call — A final LLM call merges sections, smooths transitions, adds an executive summary
Citation injection — We programmatically pull in references from our scraped data and adjacent ideas
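
The four steps above can be sketched as a plain function. This is an illustrative outline only: `generate_whitepaper`, the `llm` callable (any function that takes a prompt and returns text), the prompt wording, and the reference format are all assumptions, not the BuilderAI service's actual interface.

```python
from typing import Callable, List

# Section names taken from the list above.
SECTIONS = ["market sizing", "competitor analysis", "tech stack", "go-to-market"]


def generate_whitepaper(idea: str, llm: Callable[[str], str],
                        references: List[str]) -> str:
    # Step 1: research call, to find the questions specific to this idea.
    questions = llm(
        f"List the key questions a whitepaper must answer for this idea:\n{idea}"
    )

    # Step 2: one focused call per section, each given the research output.
    sections = [
        llm(
            f"Write the '{name}' section of a whitepaper.\n"
            f"Idea: {idea}\nKey questions:\n{questions}"
        )
        for name in SECTIONS
    ]

    # Step 3: synthesis call that merges sections and adds an executive summary.
    draft = llm(
        "Merge these sections into one whitepaper, smooth the transitions, "
        "and add an executive summary:\n\n" + "\n\n".join(sections)
    )

    # Step 4: citations are appended programmatically, not generated by the LLM.
    if references:
        numbered = "\n".join(f"[{i}] {ref}" for i, ref in enumerate(references, 1))
        draft += "\n\nReferences\n" + numbered
    return draft
```

In this shape a whitepaper costs six LLM calls (one research, four sections, one synthesis). The section calls do not depend on each other, so they could run concurrently to shorten the wait.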

Total latency: ~25–45 seconds per whitepaper. Total token cost: ~₹15–25. Worth it because the output is meaningfully better than single-shot generation, and users perceive the wait as "the AI is thinking carefully" rather than "this is slow".

For the broader design pattern around chained LLM calls, see How to build an AI agent. For LLM provider selection at this scale, see ChatGPT vs Claude vs Custom LLM.

Three Payment Providers, One Entitlement Store

Different markets have different payment expectations. Indian users want Razorpay (UPI, cards). International users want PayPal or DodoPayments (which supports crypto). Forcing all users through one provider would have lost ~30% conversion in either market.

The architecture: each provider's webhook handler updates the same entitlements table. The table has columns: user_id, product_id, status, granted_at, expires_at, provider, provider_event_id. The frontend checks entitlements and doesn't care which provider granted them.

Reliability comes from:

Idempotency. Each event ID is processed exactly once.
State machine. Allowed transitions are explicit (pending → active, active → cancelled, etc.) with rejected transitions logged.
Drift detection. Periodic reconciliation jobs check provider state vs our state.
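
The idempotency and state-machine rules can be illustrated with a small sketch. It uses an in-memory SQLite database so it runs anywhere; the production store is Postgres. The `processed_events` table, the `apply_event` function, and treating "no row yet → pending" as a valid first transition are assumptions made for the example. The entitlements columns mirror the ones listed earlier.

```python
import sqlite3

# Explicit allow-list of (current_status, new_status) pairs.
# (None, "pending") is an assumed initial transition for a first event.
ALLOWED_TRANSITIONS = {(None, "pending"), ("pending", "active"), ("active", "cancelled")}


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE entitlements (
            user_id TEXT, product_id TEXT, status TEXT,
            granted_at TEXT, expires_at TEXT,
            provider TEXT, provider_event_id TEXT,
            PRIMARY KEY (user_id, product_id)
        );
        CREATE TABLE processed_events (
            provider TEXT, provider_event_id TEXT,
            PRIMARY KEY (provider, provider_event_id)
        );
    """)


def apply_event(conn, provider, event_id, user_id, product_id, new_status) -> str:
    """Apply one webhook event. Returns 'applied', 'duplicate', or 'rejected'."""
    with conn:  # commits on success, rolls back if an exception escapes
        try:
            # The primary key makes a replayed event fail here.
            conn.execute("INSERT INTO processed_events VALUES (?, ?)",
                         (provider, event_id))
        except sqlite3.IntegrityError:
            return "duplicate"
        row = conn.execute(
            "SELECT status FROM entitlements WHERE user_id = ? AND product_id = ?",
            (user_id, product_id)).fetchone()
        current = row[0] if row else None
        if (current, new_status) not in ALLOWED_TRANSITIONS:
            return "rejected"  # the event stays recorded; a real handler would log it
        if row is None:
            conn.execute(
                "INSERT INTO entitlements (user_id, product_id, status, provider,"
                " provider_event_id) VALUES (?, ?, ?, ?, ?)",
                (user_id, product_id, new_status, provider, event_id))
        else:
            conn.execute(
                "UPDATE entitlements SET status = ?, provider = ?, provider_event_id = ?"
                " WHERE user_id = ? AND product_id = ?",
                (new_status, provider, event_id, user_id, product_id))
        return "applied"
```

Handling of granted_at and expires_at is omitted for brevity. The point of the sketch is that every provider's handler would call the same function, so deduplication and transition checks live in one place.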

Real-world result: $13,852 in revenue across 3 providers in the first months, with zero billing-related support tickets.

For broader perspective on Indian payment gateway architecture, see AI Automation for Indian SMBs.

Quality Filter on Ingested Content

The biggest content-quality risk: scraped content includes spam, duplicates, off-topic posts, and low-effort listicles. Pushing this into the database degrades search quality permanently.

The ingestion pipeline applies multiple filters:

Duplicate detection. A new post whose embedding has cosine similarity above 0.92 with an existing entry is flagged as a duplicate.
Length and structure thresholds. Posts under 200 characters or without clear problem statements get rejected.
Topic classifier. A small classifier model rejects posts not actually about startup ideas (rants, news, off-topic discussions).
Manual review queue. Borderline cases route to the admin portal for human approval.
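
As a rough sketch, the filters above could be chained like this. The 0.92 and 200-character thresholds come from the list; everything else is hypothetical, including the `triage` function, the `topic_score` callable standing in for the classifier, and the 0.4 / 0.7 score bands used to decide what counts as "borderline".

```python
import math
from typing import Callable, List, Sequence

DUPLICATE_THRESHOLD = 0.92  # from the text above
MIN_LENGTH = 200            # characters, from the text above
REJECT_BELOW = 0.4          # hypothetical classifier band
REVIEW_BELOW = 0.7          # hypothetical classifier band


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triage(text: str, embedding: Sequence[float],
           existing_embeddings: List[Sequence[float]],
           topic_score: Callable[[str], float]) -> str:
    """Return 'accepted', 'manual_review', or a 'rejected:<reason>' label."""
    # Cheapest check first: length.
    if len(text) < MIN_LENGTH:
        return "rejected:too_short"
    # Brute-force duplicate check; in practice the vector index would do this.
    if any(cosine_similarity(embedding, e) > DUPLICATE_THRESHOLD
           for e in existing_embeddings):
        return "rejected:duplicate"
    # The classifier is assumed to return a score in [0, 1].
    score = topic_score(text)
    if score < REJECT_BELOW:
        return "rejected:off_topic"
    if score < REVIEW_BELOW:
        return "manual_review"
    return "accepted"
```

Ordering the checks from cheapest to most expensive means most rejected posts never reach the classifier.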

About 30% of scraped content is auto-rejected; another 10% goes to manual review. The result: a curated database where every entry is genuinely a problem worth considering.

Why This Matters for AI SaaS Builders

Startup Ideas DB demonstrates the content-curation moat for AI SaaS. The barrier to entry isn't building the LLM features (anyone can wire up OpenAI). It's curating the proprietary data that the LLM features operate on. Without 12K+ filtered, structured problem statements, the same LLM whitepaper generation would produce vastly worse output.

The pattern: scrape → filter → embed → expose → augment with LLM. Each layer compounds. Competitors can copy the LLM features in days; they can't copy the curated dataset in weeks.

For broader context on AI SaaS architecture, see AI and SaaS Convergence and our blog How to build an AI agent.

What We'd Do Differently

Move the scraper to an orchestration platform like Airflow or Dagster. It currently runs on cron plus custom Python; a dedicated orchestrator would simplify monitoring.
Pre-generate and cache the top 100 whitepapers. They're requested frequently, and the on-demand cost adds up.
Build the embedding-update pipeline to handle re-embedding when we change models. This is currently a manual job and should be automated for the inevitable model upgrade cycle.

Where Nexolve Fits

We build AI-augmented SaaS platforms via our SaaS & Web Apps service and AI-Powered Automation service. For the full project context, see the Startup Ideas DB case study.
