HTML and web content parsing: Build RAG pipelines for online sources
HTML is everywhere: documentation sites, blog articles, wikis, and archived web pages. For RAG systems that ingest web content, you face a unique challenge—HTML mixes structural markup, styling, navigation, ads, and comments with actual content. Extracting clean text without boilerplate requires semantic parsing, not just tag stripping. This article covers tools and strategies for building robust web content pipelines.
HTML parsing for RAG differs from web scraping for analysis. You're not extracting data points; you're extracting human-readable text that will be embedded and used as context by an LLM. That means preserving paragraphs, lists, code blocks, and semantic headings while aggressively removing headers, footers, sidebars, and advertising.
Boilerplate Removal: The Core Challenge
Raw HTML contains far more structural overhead than content. A typical blog post fetched as HTML might be 60 KB with only 8 KB of actual article text; the rest is navigation, ads, analytics, and CSS. Simple approaches (like strip_tags or regex) fail because they:
- Lose semantic structure (heading hierarchy becomes flat text)
- Merge adjacent paragraphs
- Include nav links, comment sections, and ads as content
The solution is semantic parsing using tools like trafilatura (purpose-built for web-to-text), BeautifulSoup (fine-grained control), or vision-based approaches with Claude's vision API for complex layouts.
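To make the failure modes above concrete, here is a minimal sketch using only the standard library. The sample page and the helper name `naive_strip_tags` are invented for illustration; they are not part of any library.

```python
import re

# Hypothetical page: navigation, an article with two paragraphs, and a footer.
SAMPLE_HTML = """
<html><body>
<nav><a href="/">Home</a><a href="/blog">Blog</a></nav>
<article>
<h1>Caching Basics</h1>
<p>Caches trade memory for speed.</p><p>Eviction keeps them bounded.</p>
</article>
<footer>Subscribe to our newsletter</footer>
</body></html>
"""


def naive_strip_tags(html: str) -> str:
    """Delete anything that looks like a tag and collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", html)
    return re.sub(r"\s+", " ", text).strip()


stripped = naive_strip_tags(SAMPLE_HTML)
print(stripped)
```

The printed string contains `HomeBlog` (navigation kept as content), `speed.Eviction` (adjacent paragraphs merged), and the footer text, and nothing marks "Caching Basics" as a heading.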
Extracting Text with Trafilatura
trafilatura is designed specifically for content extraction from web pages. It uses rule-based heuristics, with fallbacks to other extraction algorithms, to identify the main content area, remove boilerplate, and output clean text while preserving structure.
import requests
import trafilatura

def extract_html_content(url: str) -> dict:
    """Extract main content from a web page using trafilatura."""
    # Fetch the page; fail early on HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    html = response.text

    # Extract main content; extract() returns None when nothing usable is found
    extracted = trafilatura.extract(
        html,
        include_comments=False,
        include_formatting=True,  # Keep structure markers (output varies by version)
        favor_precision=True,  # Prefer clean output over completeness
    )

    # Also get metadata (may be None, and individual fields may be None)
    metadata = trafilatura.extract_metadata(html)

    return {
        "text": extracted or "",
        "title": (metadata.title or "") if metadata else "",
        "author": (metadata.author or "") if metadata else "",
        "date": (metadata.date or "") if metadata else "",
        "url": url,
        "source": "html",
    }

# Usage
chunk = extract_html_content("https://example.com/article")
print(chunk["text"])  # Main content with navigation and other boilerplate removed
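By default, trafilatura returns plain text. With include_formatting=True it keeps markdown-like markers for headings, lists, and inline code (the exact output varies by version, and recent releases also accept output_format="markdown"), which gives a RAG chunker natural split points.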
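As a rough illustration of why structured output helps chunking, the sketch below splits extracted text into sections at Markdown-style headings. It assumes the text actually contains `#` heading markers; `split_by_headings` is a hypothetical helper, not a trafilatura function.

```python
import re

# A heading is 1-6 '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(text: str) -> list:
    """Split text into sections, one per heading (plus any lead-in text)."""
    sections = []
    heading, level, body = "", 0, []

    def emit():
        content = "\n".join(body).strip()
        if heading or content:
            sections.append({"heading": heading, "level": level, "text": content})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            emit()  # Close the previous section before starting a new one
            heading, level, body = match.group(2), len(match.group(1)), []
        else:
            body.append(line)
    emit()
    return sections


sample = "Lead-in.\n\n# Title\n\nFirst paragraph.\n\n## Details\n\n- one\n- two\n"
sections = split_by_headings(sample)
for section in sections:
    print(section["level"], repr(section["heading"]), repr(section["text"]))
```

One limitation of this sketch: a line starting with `# ` inside a code sample (a shell or Python comment, for instance) would be treated as a heading, so a production splitter would need to track code regions.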
Fine-Grained Extraction with BeautifulSoup
For more control over which elements to include/exclude, use BeautifulSoup with semantic selectors.
import requests
from bs4 import BeautifulSoup

def extract_html_semantically(url: str) -> dict:
    """Extract content using BeautifulSoup, preserving semantic structure."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Remove boilerplate elements
    for tag in soup.find_all(['script', 'style', 'nav', 'footer', 'aside']):
        tag.decompose()

    # Find main content container (heuristics), falling back to the whole page
    main = (
        soup.find(['main', 'article'])
        or soup.find('div', class_=lambda c: c and 'content' in c.lower())
        or soup.body
        or soup
    )

    # Extract text with heading hierarchy
    block_tags = ['h1', 'h2', 'h3', 'p', 'li', 'code']
    content_lines = []
    for element in main.find_all(block_tags):
        # Skip elements nested in a block that is emitted anyway
        # (e.g. <code> inside <p>), so their text is not duplicated
        if element.find_parent(block_tags):
            continue
        tag = element.name
        # Join inline fragments with a space so words do not run together
        text = element.get_text(' ', strip=True)
        if not text:
            continue
        if tag in ('h1', 'h2', 'h3'):
            level = int(tag[1])
            content_lines.append('#' * level + ' ' + text)
        elif tag == 'li':
            content_lines.append('- ' + text)
        elif tag == 'code':
            # Keep line breaks and indentation inside code
            content_lines.append('`' + element.get_text().strip() + '`')
        else:
            content_lines.append(text)
    full_text = '\n\n'.join(content_lines)

    # Extract metadata
    title = soup.find('h1') or soup.find('title')
    title_text = title.get_text(' ', strip=True) if title else ""

    return {
        "text": full_text,
        "title": title_text,
        "url": url,
        "source": "html",
    }
This approach gives you direct control: you can exclude specific classes (ad-banner), preserve code blocks, and maintain heading hierarchy.
Handling Dynamic Content and JavaScript-Heavy Sites
Many modern websites render content with JavaScript. Static parsing (requests + BeautifulSoup) sees only the initial HTML, which for these sites is often a near-empty shell. For them, use browser automation (Selenium or Playwright) driving a headless browser.
import time

from bs4 import BeautifulSoup
from selenium import webdriver

def extract_dynamic_html(url: str, wait_time: int = 3) -> dict:
    """Extract content from JavaScript-heavy sites using Selenium."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run without a visible browser window
    options.add_argument('--disable-blink-features=AutomationControlled')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Fixed pause for JS rendering; an explicit wait on a known
        # element is more reliable when the page structure is known
        time.sleep(wait_time)
        # Get rendered HTML
        page_source = driver.page_source
    finally:
        driver.quit()

    soup = BeautifulSoup(page_source, 'html.parser')
    # Drop script/style text so it does not leak into the output
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    # Extract content (same as static parsing)
    main = soup.find(['main', 'article']) or soup.body or soup
    text = main.get_text(separator='\n\n', strip=True)
    return {
        "text": text,
        "url": url,
        "source": "html_dynamic",
    }
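Headless browser extraction is slower (5–20s per page) but necessary for sites that render their content client-side, as many React, Vue, and Svelte apps do. Server-rendered pages built with the same frameworks can usually be parsed statically.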
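Because rendering is expensive, one option is to try the static HTML first and fall back to a headless browser only when it looks like an empty shell. The sketch below counts visible words with the standard library's `HTMLParser`; the helper names and the 50-word threshold are assumptions to tune per corpus, not established values.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Collect text outside script/style/noscript/template elements."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_word_count(html: str) -> int:
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts).split())


def needs_js_rendering(html: str, min_words: int = 50) -> bool:
    """Guess that a page is a client-side shell if it has little visible text."""
    return visible_word_count(html) < min_words


# A typical single-page-app shell versus a server-rendered article
shell = (
    '<html><body><div id="root"></div>'
    "<noscript>You need to enable JavaScript to run this app.</noscript>"
    '<script>var config = {"many": "words in here"};</script></body></html>'
)
article = "<html><body><article><p>" + "word " * 200 + "</p></article></body></html>"
print(needs_js_rendering(shell), needs_js_rendering(article))
```

The heuristic can misfire on legitimately short pages, so it is best treated as a routing hint: pages flagged here go to the slower Selenium or Playwright path, everything else stays on the fast static path.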
Handling Relative Links and Media References
Web content often includes relative links and image references. When chunks are detached from their original URL, these become invalid. Expand relative URLs to absolute URLs for preservation.
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def expand_relative_links(html: str, base_url: str) -> str:
    """Convert relative links to absolute URLs."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(['a', 'img']):
        if tag.name == 'a':
            href = tag.get('href')
            if href:
                tag['href'] = urljoin(base_url, href)
        elif tag.name == 'img':
            src = tag.get('src')
            if src:
                tag['src'] = urljoin(base_url, src)
            # Add a generic alt text when the image has none
            if not tag.get('alt'):
                tag['alt'] = 'Image'
    return str(soup)
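To show what `urljoin` does with the different kinds of references found in real pages, here is a small standard-library sketch. The URLs are placeholders and `resolve_all` is a hypothetical helper.

```python
from urllib.parse import urljoin

BASE = "https://example.com/docs/guide/intro.html"

REFERENCES = [
    "setup.html",                      # sibling page
    "../api/index.html",               # parent directory
    "/img/logo.png",                   # root-relative
    "#install",                        # fragment on the same page
    "//cdn.example.com/lib.js",        # protocol-relative
    "https://other.example.org/page",  # already absolute
]


def resolve_all(base: str, references: list) -> dict:
    """Map each reference to its absolute form relative to `base`."""
    return {ref: urljoin(base, ref) for ref in references}


resolved = resolve_all(BASE, REFERENCES)
for ref, absolute in resolved.items():
    print(f"{ref:32} -> {absolute}")

# A base without a trailing slash is treated as a file, not a directory:
print(urljoin("https://example.com/docs/guide", "setup.html"))
```

The last line resolves to `https://example.com/docs/setup.html`, not `.../docs/guide/setup.html`. For that reason it is safer to pass the final URL after redirects as the base (with requests, `response.url`) than the URL you originally requested.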
Content Quality Heuristics
Not all HTML pages are equally valuable for RAG. Estimate content quality before ingesting.
def estimate_content_quality(text: str, min_words: int = 300) -> float:
    """Estimate whether extracted text is substantive content (0.0–1.0)."""
    word_count = len(text.split())
    # Must have minimum length
    if word_count < min_words:
        return 0.0

    # Ignore blank lines so paragraph separators do not count as duplicates
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    avg_line_length = word_count / max(len(lines), 1)
    # Very short lines suggest menus or link lists; very long lines suggest
    # an unbroken dump. Cap the score for both (thresholds are rough).
    if avg_line_length < 5 or avg_line_length > 150:
        return 0.5

    # Check for excessive repetition (common in boilerplate)
    unique_ratio = len(set(lines)) / max(len(lines), 1)
    # Rough heuristic: reaches 1.0 at 1,000 words with no repeated lines
    quality_score = (word_count / 1000) * unique_ratio
    return min(quality_score, 1.0)

# Usage: `chunk` is the dict returned by extract_html_content above.
# Note that a 0.7 threshold only passes pages of roughly 700+ words.
chunks = []
if estimate_content_quality(chunk["text"]) > 0.7:
    # Content looks substantive, proceed with chunking
    chunks.append(chunk)
Comparison of HTML Extraction Tools
| Tool | Speed | Boilerplate removal | Handles dynamic JS | Cost | Best for |
|---|---|---|---|---|---|
| trafilatura | Fast | Excellent | No | Free | News, articles, blogs |
| BeautifulSoup | Fast | Good (manual) | No | Free | Custom extraction, structured HTML |
| Selenium | Slow | None built in (renders only; pair with an extractor) | Yes | Free | JS-heavy sites, SPAs |
| Playwright | Slow | None built in (renders only; pair with an extractor) | Yes | Free | JS rendering, modern stack |
| Claude Vision | Slow | Excellent | N/A | Per-image $ | Complex layouts, visual-first |
Key Takeaways
- Use trafilatura for automatic boilerplate removal from web pages; it's purpose-built for content extraction.
- For custom extraction logic, use BeautifulSoup with semantic selectors to preserve heading hierarchy and list structure.
- For JavaScript-heavy sites (React, Vue, etc.), use headless browsers (Selenium, Playwright) to render before parsing.
- Expand relative URLs to absolute URLs so chunks remain valid when detached from their source page.
- Estimate content quality before ingesting to avoid low-value boilerplate or truncated content.
Frequently Asked Questions
Should I use trafilatura or BeautifulSoup?
Use trafilatura for automatic boilerplate removal on arbitrary web pages. Use BeautifulSoup when you need custom logic (e.g., extracting only article text, excluding sidebars with specific IDs). A common pattern is to use trafilatura as a baseline and add BeautifulSoup for site-specific cleanup.
How do I handle paywalled or login-protected content?
There is no ethical or legal way for a scraper to bypass a paywall, so don't try. For login-protected content that you are authorized to access, authenticate within your browser automation (Selenium/Playwright) by storing cookies or credentials, then parse the rendered page. Ensure you comply with the site's terms of service.
What's the best way to handle very large pages that timeout?
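Set reasonable timeouts (around 10 seconds) and skip pages that fail to load. Use a task queue (for example, Celery backed by RabbitMQ) to retry failed URLs. Alternatively, fetch only the first N KB of the page: response = requests.get(url, stream=True, timeout=10), then response.raw.read(50000, decode_content=True). Passing decode_content=True matters because the raw stream is otherwise still gzip- or deflate-compressed.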
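The capped-read idea can also be sketched with the standard library alone. This swaps `urllib.request` in for requests so the example has no third-party dependency; `fetch_prefix` and the User-Agent string are hypothetical names chosen for illustration.

```python
from urllib.request import Request, urlopen


def fetch_prefix(url: str, max_bytes: int = 50_000, timeout: float = 10.0) -> str:
    """Download at most `max_bytes` of a resource and decode it to text."""
    request = Request(url, headers={"User-Agent": "rag-ingest-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read(max_bytes)  # Stops after max_bytes or at end of body
        charset = response.headers.get_content_charset() or "utf-8"
    # A cut in the middle of a multi-byte character is replaced, not raised
    return raw.decode(charset, errors="replace")


# Demonstration with a data: URL so the example runs without network access
prefix = fetch_prefix("data:text/plain," + "a" * 1000, max_bytes=100)
print(len(prefix))
```

The result may end in the middle of a tag. Lenient parsers such as BeautifulSoup's `html.parser` backend generally cope with that, but the last paragraph of a truncated page should be treated as possibly incomplete.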
How do I preserve code blocks in HTML extraction?
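Handle <pre> and <code> elements separately from prose, and keep their whitespace and line breaks intact instead of collapsing them. (<script> elements are page logic, not content, and should be removed.) Whether trafilatura or a BeautifulSoup pipeline keeps a code block intact depends on how the page marks it up, so spot-check the output for pages that contain code.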
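A minimal standard-library sketch of that separation follows: whitespace is collapsed in prose but kept verbatim inside `<pre>`. `PreAwareExtractor` and `extract_blocks` are hypothetical names, and the set of block-level tags is an assumption, not an exhaustive list.

```python
import re
from html.parser import HTMLParser

# Closing one of these ends the current prose block (assumed, not exhaustive)
BLOCK_TAGS = {"p", "li", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class PreAwareExtractor(HTMLParser):
    """Collect ("text", ...) and ("code", ...) blocks from HTML."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._pre_depth = 0

    def _flush(self, kind):
        raw = "".join(self._buffer)
        self._buffer = []
        if kind == "code":
            cleaned = raw.strip("\n")  # Keep indentation and inner newlines
        else:
            cleaned = re.sub(r"\s+", " ", raw).strip()
        if cleaned.strip():
            self.blocks.append((kind, cleaned))

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            if self._pre_depth == 0:
                self._flush("text")  # Close any prose collected so far
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._pre_depth:
            self._pre_depth -= 1
            if self._pre_depth == 0:
                self._flush("code")
        elif tag in BLOCK_TAGS and not self._pre_depth:
            self._flush("text")

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush("code" if self._pre_depth else "text")


def extract_blocks(html: str) -> list:
    parser = PreAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = (
    "<p>Install  the\n package:</p>"
    "<pre><code>def f(x):\n    return x + 1\n</code></pre>"
    "<p>Done &amp; tested.</p>"
)
blocks = extract_blocks(sample)
for kind, text in blocks:
    print(kind, repr(text))
```

Keeping code as its own block type also lets the chunker avoid splitting a snippet across two chunks.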
Should I crawl all pages of a website or just surface pages?
Start with robots.txt and sitemap.xml. Crawl breadth-first (homepage → category → articles) rather than depth-first. Set crawl delay (1–2 seconds between requests) to avoid overloading servers. For large sites, sample (e.g., 10% of pages) if ingesting everything is infeasible.
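The robots.txt check and breadth-first ordering can be sketched with the standard library. Here `links` is an in-memory stand-in for fetching a page and extracting its anchors, and `load_policy`, `crawl_order`, and the `rag-bot` user agent are hypothetical names used only for this example.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    policy = RobotFileParser()
    policy.parse(robots_txt.splitlines())
    return policy


def crawl_order(start_url, links, policy, user_agent="rag-bot", max_pages=100):
    """Return URLs in breadth-first order, skipping disallowed ones.

    `links` maps each URL to the URLs it links to.
    """
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if not policy.can_fetch(user_agent, url):
            continue  # Disallowed by robots.txt: neither fetched nor expanded
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


ROBOTS_TXT = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
LINKS = {
    "https://example.com/": [
        "https://example.com/blog/",
        "https://example.com/private/admin",
    ],
    "https://example.com/blog/": [
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
    ],
    "https://example.com/blog/post-1": ["https://example.com/"],
}

policy = load_policy(ROBOTS_TXT)
order = crawl_order("https://example.com/", LINKS, policy)
delay = policy.crawl_delay("rag-bot")  # None when robots.txt sets no delay
print(order, delay)
```

In a real crawler, each fetch would be followed by a pause of `delay` seconds, falling back to the 1–2 second default when `crawl_delay` returns `None`.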