Catalog & sitemaps

How Lumio discovers and imports products from a sitemap.

Lumio imports a product catalog by scanning the website’s sitemap — the same data path AI shopping agents take. The result is an accurate picture of what AI agents actually see.

How catalog import works

When a store URL is entered, Lumio runs four phases:

1. Sitemap discovery

Lumio fetches sitemap.xml and parses it. If the file is a sitemap index (common on large stores), Lumio recursively follows each child sitemap. Every URL in the sitemap tree ends up in a working list.
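
The recursive walk can be sketched as follows. This is a minimal illustration, not Lumio’s actual code: the function name collect_urls, the injected fetch callable, and the shop.example URLs are all assumptions made for the example. Only the XML namespace comes from the public sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],  # hypothetical: returns the raw XML body
    seen: Optional[set] = None,
) -> list:
    """Return every page URL in the sitemap tree rooted at sitemap_url."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [(el.text or "").strip() for el in root.iter(NS + "loc")]
    locs = [loc for loc in locs if loc]

    if root.tag == NS + "sitemapindex":
        # A sitemap index lists child sitemaps; recurse into each one.
        urls = []
        for child in locs:
            urls.extend(collect_urls(child, fetch, seen))
        return urls
    return locs  # a plain <urlset>: the <loc> values are page URLs


# Usage with an in-memory stand-in for HTTP (hypothetical store).
XMLNS = b'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE = {
    "https://shop.example/sitemap.xml": (
        b"<sitemapindex " + XMLNS + b">"
        b"<sitemap><loc>https://shop.example/sitemap_products.xml</loc></sitemap>"
        b"<sitemap><loc>https://shop.example/sitemap_pages.xml</loc></sitemap>"
        b"</sitemapindex>"
    ),
    "https://shop.example/sitemap_products.xml": (
        b"<urlset " + XMLNS + b">"
        b"<url><loc>https://shop.example/products/blue-mug</loc></url>"
        b"<url><loc>https://shop.example/products/red-mug</loc></url>"
        b"</urlset>"
    ),
    "https://shop.example/sitemap_pages.xml": (
        b"<urlset " + XMLNS + b">"
        b"<url><loc>https://shop.example/blog/news</loc></url>"
        b"</urlset>"
    ),
}
print(collect_urls("https://shop.example/sitemap.xml", FAKE_SITE.__getitem__))
```

Passing fetch in as a parameter keeps the traversal logic separate from networking, so the same function can be exercised against fixtures or a real HTTP client.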

2. Product page identification

Not every URL in a sitemap is a product page. Lumio classifies URLs in two passes:

  • Pattern match: URLs with obvious product paths (/products/, /product/, /p/, /item/, /dp/) are kept; URLs with obvious non-product paths (/collections/, /blog/, /policies/, /cart, /account, etc.) are dropped.
  • AI classification: Any URL that remains ambiguous is sent to a fast model, which decides whether it is a product page based on URL structure and naming.
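
The pattern-match pass might look like the sketch below. The function name, the three return labels, and the precedence rule (a product marker wins over a non-product marker, so a collection-scoped product URL is still kept) are assumptions for illustration. The path markers themselves are the ones listed above. URLs labelled "ambiguous" are the ones that would go on to the model in the second pass, which is not shown.

```python
from urllib.parse import urlparse

# Path segments taken from the lists above.
PRODUCT_SEGMENTS = {"products", "product", "p", "item", "dp"}
NON_PRODUCT_SEGMENTS = {"collections", "blog", "policies", "cart", "account"}


def classify_by_pattern(url: str) -> str:
    """Return 'product', 'not_product', or 'ambiguous' for a sitemap URL."""
    segments = [s for s in urlparse(url).path.lower().split("/") if s]
    # A product marker only counts when another segment follows it,
    # so a bare listing page such as /products is not treated as a product.
    if any(seg in PRODUCT_SEGMENTS for seg in segments[:-1]):
        return "product"
    if any(seg in NON_PRODUCT_SEGMENTS for seg in segments):
        return "not_product"
    return "ambiguous"


for url in (
    "https://shop.example/products/blue-mug",
    "https://shop.example/collections/summer",
    "https://shop.example/collections/summer/products/blue-mug",
    "https://shop.example/blue-mug",
):
    print(url, "->", classify_by_pattern(url))
```

Matching on whole path segments rather than substrings avoids false hits such as /cartoons being read as /cart.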

3. JSON-LD extraction

For each identified product page, Lumio fetches the page and extracts JSON-LD structured data — the <script type="application/ld+json"> blocks that contain machine-readable product information. Pages are fetched concurrently (5 at a time) to keep imports fast.
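
As a rough sketch of this step, the code below pulls JSON-LD blocks out of a page with Python’s standard HTML parser and processes pages five at a time with a thread pool. The names JsonLdParser, extract_json_ld, and extract_all are hypothetical, and fetch is an injected stand-in for an HTTP client. How Lumio handles malformed blocks or failed fetches is not documented here, so skipping malformed JSON is an assumption.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from typing import Callable


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def extract_json_ld(html: str) -> list:
    """Return the parsed JSON-LD payloads found in an HTML document."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    payloads = []
    for block in parser.blocks:
        try:
            payloads.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # assumption: malformed blocks are skipped
    return payloads


def extract_all(urls: list, fetch: Callable[[str], str], max_workers: int = 5) -> dict:
    """Fetch pages with at most max_workers in flight and extract JSON-LD."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(fetch, urls)  # results come back in input order
        return {url: extract_json_ld(html) for url, html in zip(urls, pages)}


PAGE = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Blue Mug"}'
    "</script><script>var x = 1;</script></head></html>"
)
print(extract_all(["https://shop.example/products/blue-mug"], lambda url: PAGE))
```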

The complete JSON-LD payload is stored alongside normalized fields (title, description, brand, price, images) for scoring and display.
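
One plausible way to derive the normalized fields from a schema.org Product node is sketched below. The mapping from schema.org properties (name, description, brand, offers.price, image) to Lumio’s field names is an assumption, as are the helper names and the choice to read the price from the first offer.

```python
from typing import Any, Optional


def _as_list(value: Any) -> list:
    """schema.org values may be a single item or a list; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_product(payload: Any) -> Optional[dict]:
    """Return the first node typed Product, looking inside @graph as well."""
    for node in _as_list(payload):
        if not isinstance(node, dict):
            continue
        for candidate in [node] + _as_list(node.get("@graph")):
            if isinstance(candidate, dict) and "Product" in _as_list(candidate.get("@type")):
                return candidate
    return None


def normalize_product(payload: Any) -> Optional[dict]:
    """Build a flat record and keep the raw payload next to it."""
    product = find_product(payload)
    if product is None:
        return None
    brand = product.get("brand")
    if isinstance(brand, dict):  # brand may be a Brand object or a plain string
        brand = brand.get("name")
    offers = _as_list(product.get("offers"))
    first_offer = offers[0] if offers and isinstance(offers[0], dict) else {}
    images = [
        img.get("url") if isinstance(img, dict) else img
        for img in _as_list(product.get("image"))
    ]
    return {
        "title": product.get("name"),
        "description": product.get("description"),
        "brand": brand,
        "price": first_offer.get("price"),  # assumption: first offer only
        "images": [img for img in images if img],
        "raw_json_ld": payload,
    }


EXAMPLE = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Blue Mug",
    "description": "A ceramic mug.",
    "brand": {"@type": "Brand", "name": "Acme"},
    "offers": {"@type": "Offer", "price": "12.50", "priceCurrency": "USD"},
    "image": ["https://shop.example/mug.jpg"],
}
print(normalize_product(EXAMPLE))
```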

4. Catalog creation

Each discovered product becomes a row in the workspace’s active catalog with:

  • Title and description
  • Brand and pricing
  • Raw JSON-LD payload (drives Schema Health later)
  • Source URL

Before the new catalog becomes active, the previous active catalog is archived (not deleted). If the import fails, the archived catalog is restored automatically.
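
The archive-then-restore behavior can be pictured with a small in-memory model. Everything here is hypothetical (the Catalog and Workspace classes, the status strings, and the run_import callable); the sketch only shows the ordering described above, not how Lumio stores catalogs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Catalog:
    products: list
    status: str = "active"  # "active" or "archived"


@dataclass
class Workspace:
    active: Optional[Catalog] = None
    archived: list = field(default_factory=list)


def replace_catalog(ws: Workspace, run_import: Callable[[], list]) -> None:
    """Archive the current catalog, import a new one, roll back on failure."""
    previous = ws.active
    if previous is not None:
        previous.status = "archived"
        ws.archived.append(previous)
        ws.active = None
    try:
        ws.active = Catalog(products=run_import())
    except Exception:
        if previous is not None:
            ws.archived.pop()  # the entry appended above
            previous.status = "active"
            ws.active = previous
        raise


def failing_import() -> list:
    raise RuntimeError("sitemap unreachable")


ws = Workspace(active=Catalog(products=["old mug"]))
replace_catalog(ws, lambda: ["new mug"])
try:
    replace_catalog(ws, failing_import)
except RuntimeError:
    pass
print(ws.active.products, len(ws.archived))
```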

What if a site doesn’t have a sitemap?

Most ecommerce platforms generate sitemaps automatically; Shopify, for example, publishes one at /sitemap.xml by default. To check, visit <store>.com/sitemap.xml.

For sites without one, the simplest path is to use whatever sitemap generator the platform offers, or generate one manually before scanning.

What if products don’t have JSON-LD?

Lumio still imports basic product information from sitemap URLs, but scoring and enrichment work best with structured data. Most modern Shopify themes include Product schema by default. Where it is missing or incomplete, Schema Health flags exactly what’s absent on each product, so the gaps are explicit.

Shopify sync

If a Shopify store is connected, syncing from the Shopify Admin API is faster and more complete than sitemap scanning. Variants, pricing, images, and metadata all come through. Open Catalog, choose the Shopify tab, and click Sync from Shopify. See Shopify Integration for setup.

Rescanning

Catalogs aren’t static. When products change, refresh the catalog in two steps:

  • Full re-import: Re-fetches the sitemap (or re-syncs Shopify) and refreshes all products. The previous catalog is archived.
  • Rescore: After re-importing, run scoring again to see how the changes moved AI readiness.

Archived catalogs preserve their scores and enrichments for historical comparison.

Background jobs and active-job guard

Sitemap scans and Shopify syncs both run as background jobs. Only one job runs per workspace at a time — starting a second one while a scan, score, or enrichment is in flight returns a clear “a job is already running” message. Stuck scan jobs (in progress for more than 30 minutes) are auto-detected and marked failed on the next poll.
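
The guard and the stuck-job check could be modeled roughly as below. The class names, the job kinds, and the exception type are assumptions for illustration; the 30-minute threshold and the rule that stuck scans are marked failed on a poll come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STUCK_AFTER = timedelta(minutes=30)


class JobAlreadyRunning(Exception):
    """Raised when a second job is started in the same workspace."""


@dataclass
class Job:
    kind: str  # hypothetical kinds: "scan", "sync", "score", "enrichment"
    started_at: datetime
    status: str = "running"


class WorkspaceJobs:
    """Tracks the single job slot of one workspace."""

    def __init__(self) -> None:
        self.current: Optional[Job] = None

    def poll(self, now: datetime) -> Optional[Job]:
        """Return the current job, marking a stuck scan as failed first."""
        job = self.current
        if (
            job is not None
            and job.kind == "scan"
            and job.status == "running"
            and now - job.started_at > STUCK_AFTER
        ):
            job.status = "failed"
        return job

    def start(self, kind: str, now: datetime) -> Job:
        if self.current is not None and self.current.status == "running":
            raise JobAlreadyRunning("a job is already running")
        self.current = Job(kind=kind, started_at=now)
        return self.current


t0 = datetime(2024, 1, 1, 12, 0)
jobs = WorkspaceJobs()
jobs.start("scan", t0)
try:
    jobs.start("score", t0 + timedelta(minutes=5))
except JobAlreadyRunning as exc:
    print(exc)
print(jobs.poll(t0 + timedelta(minutes=31)).status)
```

In this sketch the slot only frees up once a poll has marked the stuck scan as failed, which mirrors the "on the next poll" wording.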