Building a Custom Astro Content Loader for Web Scraping
Learn how to create a custom Astro content loader that scrapes external websites at build time, replacing static JSON files with dynamic data fetching.
Astro’s Content Collections are powerful for managing local markdown files, but what happens when your data lives on an external website? Instead of maintaining a static JSON file that requires manual updates, you can create a custom content loader that fetches and parses data directly at build time.
In this article, we’ll walk through building a content loader that scrapes training session data from the Human Coders platform and integrates it seamlessly into an Astro site.
The Problem: Static Data Files
Many Astro projects start with a simple approach: export data from an external source into a JSON file, then import it where needed:
// src/layouts/TrainingLayout.astro
import trainingsHc from "../content/trainings-hc.json";
const sessions = trainingsHc.filter((t) => t.slug === slug);
This works, but it has significant drawbacks: someone must remember to regenerate the JSON file, the information can become stale between updates, the JSON structure isn’t validated against any schema, and you need a separate script (often built on Puppeteer) just to scrape the data.
The Solution: Custom Content Loaders
Astro 5 introduced content loaders, allowing you to define custom data sources for Content Collections. Instead of reading from files, a loader can fetch data from APIs, databases, or even scrape websites.
A content loader is a function that returns an object with:
name - A unique identifier for the loader
load - An async function that populates the content store
import type { Loader } from "astro/loaders";
export function myCustomLoader(): Loader {
return {
name: "my-custom-loader",
load: async ({ store, logger }) => {
// Fetch and process data
store.set({
id: "item-1",
data: {
/* ... */
},
});
},
};
}
Building the Human Coders Loader
Let’s create a loader that scrapes training sessions from Human Coders. We’ll use node-html-parser for lightweight HTML parsing instead of Puppeteer, which can be problematic in build environments.
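If you haven’t used node-html-parser before, it exposes a small, DOM-like API: you parse an HTML string and query it with familiar selector methods, without launching a browser. A minimal example (the HTML fragment below is made up for illustration):

import { parse } from "node-html-parser";

// Parse an HTML string into a queryable document, no browser required.
const doc = parse('<div><h1>Formation Astro</h1><a href="/formations/astro">Voir</a></div>');

console.log(doc.querySelector("h1")?.textContent); // "Formation Astro"
console.log(doc.querySelector("a")?.getAttribute("href")); // "/formations/astro"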
Step 1: Define the Data Types
First, define TypeScript interfaces for your data structure:
interface Session {
month: string;
day: string;
location: string;
url: string;
rawText: string;
date: string;
}
interface Training {
title: string;
slug: string;
description: string;
image: string;
url: string;
program: string[];
objectives: string[];
duration: string;
sessions: Session[];
company: string;
companyUrl: string;
}
Step 2: Create the Loader
The loader fetches the trainer’s page, extracts training links, then visits each training page to gather session data:
import type { Loader } from "astro/loaders";
import { parse } from "node-html-parser";
const TRAINER_URL = "https://www.humancoders.com/formateurs/emmanuel-demey";
async function fetchPage(url: string): Promise<string> {
  const response = await fetch(url);
  // Fail loudly on non-OK responses so the loader logs a useful error
  // instead of silently parsing an error page.
  if (!response.ok) {
    throw new Error(`Failed to fetch ${url}: ${response.status}`);
  }
  return response.text();
}
export function scrapeHumanCodersTrainings(): Loader {
return {
name: "human-coders-trainings",
load: async ({ store, logger }) => {
logger.info("Scraping Human Coders trainings...");
try {
const trainerHtml = await fetchPage(TRAINER_URL);
const trainerDoc = parse(trainerHtml);
// Extract training links
const trainingLinks: string[] = [];
const anchors = trainerDoc.querySelectorAll('a[href^="/formations/"]');
for (const anchor of anchors) {
const href = anchor.getAttribute("href");
if (href && !href.includes("/sessions/")) {
trainingLinks.push(href);
}
}
// Process each training
for (const link of trainingLinks) {
const fullUrl = `https://www.humancoders.com${link}`;
const trainingHtml = await fetchPage(fullUrl);
const doc = parse(trainingHtml);
// Extract training data...
const title = doc.querySelector("h1")?.textContent?.trim() || "";
const slug = link.replace("/formations/", "");
// Extract sessions
const sessions = extractSessions(doc);
store.set({
id: slug,
data: {
title,
slug,
sessions,
// ... other fields
},
});
logger.info(`Loaded: ${title}`);
}
logger.info("Finished scraping");
} catch (error) {
logger.error(`Error scraping: ${error}`);
}
},
};
}
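The elided extraction of the remaining fields follows the same pattern. As a sketch, the description and image could be pulled from the page’s meta tags; the selectors below are assumptions about the markup, not the actual Human Coders structure:

import { parse } from "node-html-parser";

// Hypothetical helper for the fields elided above. The selectors are
// assumptions; adjust them to whatever the real page markup contains.
function extractMeta(doc: ReturnType<typeof parse>) {
  const description =
    doc.querySelector('meta[name="description"]')?.getAttribute("content")?.trim() ?? "";
  const image =
    doc.querySelector('meta[property="og:image"]')?.getAttribute("content") ?? "";
  return { description, image };
}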
Extracting Session Data
The extractSessions function parses session information from the training page. Each session link contains text like “janvier 15 Paris” that we need to parse:
function extractSessions(doc: ReturnType<typeof parse>): Session[] {
const sessions: Session[] = [];
const sessionLinks = doc.querySelectorAll('a[href*="/sessions/"]');
for (const sessionLink of sessionLinks) {
const sessionText = sessionLink.textContent?.trim() || "";
const sessionUrl = sessionLink.getAttribute("href") || "";
if (sessionText && sessionUrl) {
// Parse text like "janvier 15 Paris" or "mars 20 À distance"
const parts = sessionText.split(/\s+/);
if (parts.length >= 2) {
const month = parts[0];
const day = parts[1];
let location = parts.slice(2).join(" ") || "Online";
// Normalize the URL
const fullSessionUrl = sessionUrl.startsWith("http")
? sessionUrl
: `https://www.humancoders.com${sessionUrl}`;
sessions.push({
month,
day,
location,
url: fullSessionUrl,
rawText: sessionText,
date: "", // Computed later
});
}
}
}
return sessions;
}
The function uses a CSS selector a[href*="/sessions/"] to find all links containing “/sessions/” in their href. Then it splits the link text to extract the month, day, and location components.
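To sanity-check the parsing outside of a build, you can feed the function a small HTML fragment (the markup below is a made-up sample, not the real page):

import { parse } from "node-html-parser";

// Made-up fragment mimicking a session link on a training page.
const doc = parse('<a href="/formations/astro/sessions/42">janvier 15 Paris</a>');

console.log(extractSessions(doc));
// [{ month: "janvier", day: "15", location: "Paris",
//    url: "https://www.humancoders.com/formations/astro/sessions/42",
//    rawText: "janvier 15 Paris", date: "" }]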
Step 3: Register the Collection
In your src/content/config.ts, register the collection with the custom loader:
import { defineCollection, z } from "astro:content";
import { scrapeHumanCodersTrainings } from "./loaders/human-coders-loader";
const sessionsHc = defineCollection({
loader: scrapeHumanCodersTrainings(),
schema: z.object({
title: z.string(),
slug: z.string(),
description: z.string(),
sessions: z.array(
z.object({
month: z.string(),
day: z.string(),
location: z.string(),
url: z.string(),
date: z.string(),
}),
),
company: z.string(),
// ... other fields
}),
});
export const collections = {
"sessions-hc": sessionsHc,
};
Step 4: Use the Collection
Now you can use getCollection() to access the scraped data with full type safety:
// src/layouts/TrainingLayout.astro
---
import { getCollection } from "astro:content";
const sessionsHc = await getCollection("sessions-hc");
const sessions = [];

// humanCodersSlugs is assumed to be defined earlier in the component,
// for example a hard-coded list of the training slugs this page displays.
for (const slug of humanCodersSlugs) {
const hcTraining = sessionsHc.find((t) => t.data.slug === slug);
if (hcTraining) {
for (const session of hcTraining.data.sessions) {
sessions.push({
...session,
trainingTitle: hcTraining.data.title,
});
}
}
}
---
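Below the frontmatter, the component’s template can render those sessions directly. A minimal sketch of the markup (class names and layout are up to you):

<ul>
  {sessions.map((session) => (
    <li>
      <a href={session.url}>
        {session.trainingTitle}: {session.day} {session.month}, {session.location}
      </a>
    </li>
  ))}
</ul>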
Data Transformation in the Loader
One advantage of custom loaders is the ability to transform data during ingestion. For example, translating French location names to English:
let location = parts.slice(2).join(" ") || "Online";
// Translate "À distance" to "Online"
if (location.toLowerCase().includes("à distance")) {
location = location.replace(/à distance/gi, "Online");
}
Or computing dates from month names:
const MONTHS_FR: Record<string, string> = {
janvier: "01",
février: "02",
mars: "03",
// ...
};
const monthNum = MONTHS_FR[session.month.toLowerCase()];
const date = `${year}-${monthNum}-${day.padStart(2, "0")}`;
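Putting that together, a small helper might look like this. Note that the scraped text only contains a month and a day, so the year is an assumption the caller has to supply (for example, the current year, rolled over once the month has already passed):

const MONTHS_FR: Record<string, string> = {
  janvier: "01", février: "02", mars: "03", avril: "04",
  mai: "05", juin: "06", juillet: "07", août: "08",
  septembre: "09", octobre: "10", novembre: "11", décembre: "12",
};

// Hypothetical helper: build an ISO date string (YYYY-MM-DD) from a session.
// The year is not present in the scraped text, so it must be provided.
function toIsoDate(month: string, day: string, year: number): string {
  const monthNum = MONTHS_FR[month.toLowerCase()] ?? "01";
  return `${year}-${monthNum}-${day.padStart(2, "0")}`;
}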
Conclusion
Custom content loaders transform how you handle external data in Astro. Instead of maintaining static JSON files with separate update scripts, you get a unified, type-safe approach that fetches fresh data on every build.
The combination of Astro’s content loader API, node-html-parser for lightweight scraping, and Zod for validation creates a robust pipeline for integrating external websites into your Astro project.
This pattern works well for any scenario where you need to pull structured data from external HTML pages: event listings, product catalogs, documentation sites, or any other web content you want to integrate into your Astro application.