Building a Custom Astro Content Loader for Web Scraping

Learn how to create a custom Astro content loader that scrapes external websites at build time, replacing static JSON files with dynamic data fetching.

Astro’s Content Collections are powerful for managing local markdown files, but what happens when your data lives on an external website? Instead of maintaining a static JSON file that requires manual updates, you can create a custom content loader that fetches and parses data directly at build time.

In this article, we’ll walk through building a content loader that scrapes training session data from the Human Coders platform and integrates it seamlessly into an Astro site.

The Problem: Static Data Files

Many Astro projects start with a simple approach: export data from an external source into a JSON file, then import it where needed:

// src/layouts/TrainingLayout.astro
import trainingsHc from "../content/trainings-hc.json";

// "slug" identifies the current training (e.g. from the page’s props)
const sessions = trainingsHc.filter((t) => t.slug === slug);

This works, but it has significant drawbacks:

  • Someone must remember to regenerate the JSON file
  • Information can become outdated between updates
  • The JSON structure isn’t validated against a schema
  • You need a separate script (often using Puppeteer) to scrape the data

The Solution: Custom Content Loaders

Astro 5 introduced content loaders, allowing you to define custom data sources for Content Collections. Instead of reading from files, a loader can fetch data from APIs, databases, or even scrape websites.

A content loader is a function that returns an object with:

  • name - A unique identifier for the loader
  • load - An async function that populates the content store

import type { Loader } from "astro/loaders";

export function myCustomLoader(): Loader {
  return {
    name: "my-custom-loader",
    load: async ({ store, logger }) => {
      // Fetch and process data
      store.set({
        id: "item-1",
        data: {
          /* ... */
        },
      });
    },
  };
}

Building the Human Coders Loader

Let’s create a loader that scrapes training sessions from Human Coders. We’ll use node-html-parser for lightweight HTML parsing instead of Puppeteer, which can be problematic in build environments.
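
To see the difference in practice, here’s a minimal sketch (the URL is only a placeholder): the page is fetched as a plain string with the standard fetch API and queried with node-html-parser’s CSS selectors, with no headless browser involved:

import { parse } from "node-html-parser";

// Fetch the raw HTML as a string; no browser is launched.
const html = await fetch("https://example.com").then((res) => res.text());
const root = parse(html);

// Query the parsed tree with familiar CSS selectors.
console.log(root.querySelector("h1")?.textContent);
console.log(root.querySelectorAll("a").map((a) => a.getAttribute("href")));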

Step 1: Define the Data Types

First, define TypeScript interfaces for your data structure:

interface Session {
  month: string;
  day: string;
  location: string;
  url: string;
  rawText: string;
  date: string;
}

interface Training {
  title: string;
  slug: string;
  description: string;
  image: string;
  url: string;
  program: string[];
  objectives: string[];
  duration: string;
  sessions: Session[];
  company: string;
  companyUrl: string;
}

Step 2: Create the Loader

The loader fetches the trainer’s page, extracts training links, then visits each training page to gather session data:

import type { Loader } from "astro/loaders";
import { parse } from "node-html-parser";

const TRAINER_URL = "https://www.humancoders.com/formateurs/emmanuel-demey";

async function fetchPage(url: string): Promise<string> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch ${url}: ${response.status}`);
  }
  return response.text();
}

export function scrapeHumanCodersTrainings(): Loader {
  return {
    name: "human-coders-trainings",
    load: async ({ store, logger }) => {
      logger.info("Scraping Human Coders trainings...");

      try {
        const trainerHtml = await fetchPage(TRAINER_URL);
        const trainerDoc = parse(trainerHtml);

        // Extract training links
        const trainingLinks: string[] = [];
        const anchors = trainerDoc.querySelectorAll('a[href^="/formations/"]');

        for (const anchor of anchors) {
          const href = anchor.getAttribute("href");
          if (href && !href.includes("/sessions/")) {
            trainingLinks.push(href);
          }
        }

        // Process each training
        for (const link of trainingLinks) {
          const fullUrl = `https://www.humancoders.com${link}`;
          const trainingHtml = await fetchPage(fullUrl);
          const doc = parse(trainingHtml);

          // Extract training data...
          const title = doc.querySelector("h1")?.textContent?.trim() || "";
          const slug = link.replace("/formations/", "");

          // Extract sessions
          const sessions = extractSessions(doc);

          store.set({
            id: slug,
            data: {
              title,
              slug,
              sessions,
              // ... other fields
            },
          });

          logger.info(`Loaded: ${title}`);
        }

        logger.info("Finished scraping");
      } catch (error) {
        logger.error(`Error scraping: ${error}`);
      }
    },
  };
}

Extracting Session Data

The extractSessions function parses session information from the training page. Each session link contains text like “janvier 15 Paris” that we need to parse:

function extractSessions(doc: ReturnType<typeof parse>): Session[] {
  const sessions: Session[] = [];
  const sessionLinks = doc.querySelectorAll('a[href*="/sessions/"]');

  for (const sessionLink of sessionLinks) {
    const sessionText = sessionLink.textContent?.trim() || "";
    const sessionUrl = sessionLink.getAttribute("href") || "";

    if (sessionText && sessionUrl) {
      // Parse text like "janvier 15 Paris" or "mars 20 À distance"
      const parts = sessionText.split(/\s+/);
      if (parts.length >= 2) {
        const month = parts[0];
        const day = parts[1];
        let location = parts.slice(2).join(" ") || "Online";

        // Normalize the URL
        const fullSessionUrl = sessionUrl.startsWith("http")
          ? sessionUrl
          : `https://www.humancoders.com${sessionUrl}`;

        sessions.push({
          month,
          day,
          location,
          url: fullSessionUrl,
          rawText: sessionText,
          date: "", // Computed later
        });
      }
    }
  }

  return sessions;
}

The function uses a CSS selector a[href*="/sessions/"] to find all links containing “/sessions/” in their href. Then it splits the link text to extract the month, day, and location components.
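
As a quick sanity check, here’s roughly what the function returns for a small hand-written fragment (the training slug and session IDs are made up for the example):

import { parse } from "node-html-parser";

const fragment = parse(`
  <a href="/formations/formation-typescript/sessions/1234">janvier 15 Paris</a>
  <a href="/formations/formation-typescript/sessions/5678">mars 20 À distance</a>
`);

console.log(extractSessions(fragment));
// [
//   { month: "janvier", day: "15", location: "Paris",
//     url: "https://www.humancoders.com/formations/formation-typescript/sessions/1234", ... },
//   { month: "mars", day: "20", location: "À distance",
//     url: "https://www.humancoders.com/formations/formation-typescript/sessions/5678", ... }
// ]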

Step 3: Register the Collection

In your src/content/config.ts, register the collection with the custom loader:

import { defineCollection, z } from "astro:content";
import { scrapeHumanCodersTrainings } from "./loaders/human-coders-loader";

const sessionsHc = defineCollection({
  loader: scrapeHumanCodersTrainings(),
  schema: z.object({
    title: z.string(),
    slug: z.string(),
    description: z.string(),
    sessions: z.array(
      z.object({
        month: z.string(),
        day: z.string(),
        location: z.string(),
        url: z.string(),
        date: z.string(),
      }),
    ),
    company: z.string(),
    // ... other fields
  }),
});

export const collections = {
  "sessions-hc": sessionsHc,
};

Step 4: Use the Collection

Now you can use getCollection() to access the scraped data with full type safety:

---
// src/layouts/TrainingLayout.astro
import { getCollection } from "astro:content";

const sessionsHc = await getCollection("sessions-hc");

const sessions = [];
// humanCodersSlugs: the slugs of the trainings displayed on this page, defined elsewhere in the layout
for (const slug of humanCodersSlugs) {
  const hcTraining = sessionsHc.find((t) => t.data.slug === slug);
  if (hcTraining) {
    for (const session of hcTraining.data.sessions) {
      sessions.push({
        ...session,
        trainingTitle: hcTraining.data.title,
      });
    }
  }
}
---

Data Transformation in the Loader

One advantage of custom loaders is the ability to transform data during ingestion. For example, translating French location names to English:

let location = parts.slice(2).join(" ") || "Online";

// Translate "À distance" to "Online"
if (location.toLowerCase().includes("à distance")) {
  location = location.replace(/à distance/gi, "Online");
}

Or computing dates from month names:

const MONTHS_FR: Record<string, string> = {
  janvier: "01",
  février: "02",
  mars: "03",
  // ...
};

const monthNum = MONTHS_FR[session.month.toLowerCase()];
const date = `${year}-${monthNum}-${session.day.padStart(2, "0")}`;
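
The year isn’t present in the scraped text, so it has to be inferred. Here’s a minimal, self-contained sketch with the month map filled in; computeSessionDate is a hypothetical helper name, and it assumes every scraped session falls within the next twelve months, so a month that has already passed rolls over to the following year:

const MONTHS_FR: Record<string, string> = {
  janvier: "01", février: "02", mars: "03", avril: "04",
  mai: "05", juin: "06", juillet: "07", août: "08",
  septembre: "09", octobre: "10", novembre: "11", décembre: "12",
};

// Hypothetical helper: build a YYYY-MM-DD date string from the scraped
// month name and day, assuming the session is upcoming.
function computeSessionDate(month: string, day: string): string {
  const monthNum = MONTHS_FR[month.toLowerCase()] ?? "01";
  const now = new Date();
  let year = now.getFullYear();
  // If the month has already passed this year, assume next year's session.
  if (Number(monthNum) < now.getMonth() + 1) {
    year += 1;
  }
  return `${year}-${monthNum}-${day.padStart(2, "0")}`;
}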

Conclusion

Custom content loaders transform how you handle external data in Astro. Instead of maintaining static JSON files with separate update scripts, you get a unified, type-safe approach that fetches fresh data on every build.

The combination of Astro’s content loader API, node-html-parser for lightweight scraping, and Zod for validation creates a robust pipeline for integrating external websites into your Astro project.

This pattern works well for any scenario where you need to pull structured data from external HTML pages: event listings, product catalogs, documentation sites, or any other web content you want to integrate into your Astro application.