
Fetch Web Sources: Research Agent Data Pipeline

Fetching web sources is where a research agent actually retrieves the content it needs to analyze. It's one of the most failure-prone steps because websites vary wildly in structure, use JavaScript to render content dynamically, hide content behind paywalls, or block automated clients. A robust fetcher must gracefully handle these challenges, extracting readable text and metadata while avoiding dead ends.

When a search API returns a URL, the agent can't rely on search engine snippets alone, because they are incomplete and sometimes inaccurate. The agent must fetch the full page, extract its main text content (not ads or navigation), detect the publish date and author, and verify that the content is actually relevant before passing it to the reading/analysis step. This article shows how to build a fetcher that aims to succeed on roughly 85–95% of URLs without being blocked.

How Do You Extract Readable Text from Complex HTML?

The simplest HTML fetchers use requests and BeautifulSoup to grab and parse a page. However, this approach fails on (1) JavaScript-rendered sites (Next.js apps, React SPAs), (2) poorly structured pages with lots of noise, and (3) mobile-only designs. A production fetcher should combine multiple techniques:

import requests
from bs4 import BeautifulSoup
import trafilatura  # Specialized news/article extractor

def fetch_and_extract(url: str, timeout: int = 10) -> dict:
    """
    Fetch a URL and extract main text content.
    On success returns {"success", "title", "author", "date", "text", "url"};
    on failure returns {"success", "reason", "url"}.
    """
    try:
        # Step 1: Fetch with a realistic User-Agent
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Covers HTTP error statuses, timeouts, and connection failures
        return {
            "success": False,
            "reason": f"Request failed: {e}",
            "url": url
        }

    # Step 2: Try specialized article extractor (trafilatura)
    # This handles most news sites and blogs well
    extracted = trafilatura.extract(response.text, include_comments=False)

    if extracted:
        # Trafilatura succeeded; extract metadata (may be None if nothing is found)
        metadata = trafilatura.extract_metadata(response.text)
        date = getattr(metadata, "date", None)
        return {
            "success": True,
            "title": getattr(metadata, "title", None) or "Unknown",
            "author": getattr(metadata, "author", None) or "Unknown",
            "date": str(date) if date else "Unknown",
            "text": extracted,
            "url": url
        }

    # Step 3: Fallback to BeautifulSoup for generic pages
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove script, style, nav, footer, and aside elements
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()

    # Extract title from the first <h1>
    title = soup.find("h1")
    title_text = title.get_text(strip=True) if title else "Unknown"

    # Extract main content (common selectors, tried in order)
    main_content = (
        soup.find("main") or
        soup.find("article") or
        soup.find("div", class_=lambda x: x and "content" in x.lower()) or
        soup.find("div", id=lambda x: x and "content" in x.lower())
    )

    if main_content:
        text = main_content.get_text(separator="\n", strip=True)
    else:
        # Last resort: extract from the whole document
        text = soup.get_text(separator="\n", strip=True)

    # Drop blank lines and truncate if very long
    text = "\n".join(line.strip() for line in text.split("\n") if line.strip())
    text = text[:8000]  # Limit to 8000 chars for processing

    return {
        "success": True,
        "title": title_text,
        "author": "Unknown",
        "date": "Unknown",
        "text": text,
        "url": url
    }

# Example
result = fetch_and_extract("https://arxiv.org/abs/2406.14283")
if result["success"]:
    print(f"Title: {result['title']}")
    print(f"Text length: {len(result['text'])} chars")
else:
    print(f"Failed: {result['reason']}")
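The BeautifulSoup fallback above returns "Unknown" for author and date. One way to recover some of that metadata is to read the page's meta tags. The sketch below is a minimal illustration that uses only the standard library so it runs without extra dependencies. The function name extract_meta_hints and the sets of meta keys it checks are assumptions, not part of the original fetcher, and real sites vary in which tags they populate.

```python
from html.parser import HTMLParser

# Assumed key sets: which <meta> names/properties count as author or date
# is a guess for illustration; extend them for the sites you care about.
AUTHOR_KEYS = {"author", "article:author"}
DATE_KEYS = {"article:published_time", "date", "pubdate"}


class MetaTagParser(HTMLParser):
    """Collect the first author and publish-date hints from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.author = None
        self.date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Sites use either name="..." or property="..." for the key
        key = (attr.get("name") or attr.get("property") or "").lower()
        content = (attr.get("content") or "").strip()
        if not content:
            return
        if key in AUTHOR_KEYS and self.author is None:
            self.author = content
        elif key in DATE_KEYS and self.date is None:
            self.date = content


def extract_meta_hints(html: str) -> dict:
    """Return author/date hints, using "Unknown" when a hint is missing."""
    parser = MetaTagParser()
    parser.feed(html)
    return {
        "author": parser.author or "Unknown",
        "date": parser.date or "Unknown",
    }
```

In the fallback path, the returned values could replace the hard-coded "Unknown" entries. The date is passed through as the raw string found in the tag, so downstream code should not assume a particular format.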

Handling JavaScript-Rendered Content and Paywalls

Many modern websites render content entirely with JavaScript, making the initial HTML response mostly empty. For these sites, you need a headless browser. However, headless browsers are slow and resource-intensive. Use them selectively:

import time

import trafilatura
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

def fetch_js_rendered(url: str, timeout: int = 15) -> dict:
    """
    Fetch a JavaScript-heavy site using Selenium headless browser.
    Use only if trafilatura fails.
    """
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--start-maximized")
    chrome_options.add_argument(
        "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"
    )

    driver = None
    try:
        driver = webdriver.Chrome(options=chrome_options)
        driver.set_page_load_timeout(timeout)
        driver.get(url)

        # Wait briefly for JS to render
        time.sleep(2)

        # Check if a paywall modal exists. The "paywall" class name is only a
        # heuristic; real sites use many different markers.
        if driver.find_elements(By.CLASS_NAME, "paywall"):
            return {
                "success": False,
                "reason": "Paywalled content",
                "url": url
            }

        # Extract rendered HTML
        html = driver.page_source
    except Exception as e:
        return {
            "success": False,
            "reason": f"Selenium error: {e}",
            "url": url
        }
    finally:
        # Always release the browser, including on early returns
        if driver:
            driver.quit()

    # Use trafilatura on the rendered HTML
    extracted = trafilatura.extract(html)
    if extracted:
        return {
            "success": True,
            "title": "JavaScript-rendered page",
            "author": "Unknown",
            "date": "Unknown",
            "text": extracted,
            "url": url
        }
    return {
        "success": False,
        "reason": "JS-rendered but no content extracted",
        "url": url
    }

# Detect and route appropriately
def smart_fetch(url: str) -> dict:
    """
    Try simple fetch first; fall back to JS rendering if needed.
    """
    # First attempt: simple fetch
    result = fetch_and_extract(url)

    if not result["success"] or len(result.get("text", "")) < 200:
        # Likely JS-rendered; try headless browser
        result = fetch_js_rendered(url)

    return result

Detecting and Handling Incomplete or Corrupted Content

Not every fetch produces valid content. The agent should detect and skip problematic sources:

def is_valid_source(result: dict, min_text_length: int = 200) -> bool:
    """
    Validate that fetched content is likely useful.
    """
    if not result.get("success"):
        return False

    text = result.get("text", "")

    # Too short: likely incomplete
    if len(text) < min_text_length:
        return False

    # Check for common error indicators.
    # Note: this also rejects legitimate pages that merely quote these phrases.
    error_phrases = [
        "404 not found", "access denied", "page expired",
        "error occurred", "coming soon", "under maintenance"
    ]
    text_lower = text.lower()
    if any(phrase in text_lower for phrase in error_phrases):
        return False

    # Check for excessive repetition (sign of boilerplate/spam)
    lines = text.split("\n")
    if len(lines) > 10:
        # Reject if fewer than half of the lines are unique
        unique_lines = len(set(lines))
        if unique_lines / len(lines) < 0.5:
            return False

    return True

# Usage in agent loop
urls = ["https://arxiv.org/abs/2406.14283"]  # placeholder: URLs returned by the search step
results = [fetch_and_extract(url) for url in urls]
valid_results = [r for r in results if is_valid_source(r)]
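The validator above checks the shape of the content, not whether it is on topic, although the introduction lists relevance as something to verify before analysis. A minimal sketch of one possible check follows, assuming that keyword overlap with the research query is an acceptable proxy for relevance. The function names, the three-character token rule, and the 0.5 threshold are all illustrative assumptions; an embedding-based comparison would likely be more robust.

```python
import re


def _terms(s: str) -> set:
    """Lowercased alphanumeric tokens of 3+ characters (skips 'is', 'of', ...)."""
    return set(re.findall(r"[a-z0-9]{3,}", s.lower()))


def relevance_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that also appear in the text."""
    query_terms = _terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & _terms(text)) / len(query_terms)


def is_relevant(query: str, result: dict, threshold: float = 0.5) -> bool:
    """True if enough query terms occur in the fetched text (assumed threshold)."""
    return relevance_score(query, result.get("text", "")) >= threshold
```

This could run right after is_valid_source, so that only pages passing both checks reach the analysis step.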

Key Takeaways

  • Use specialized extractors like trafilatura for articles and news; fall back to BeautifulSoup for generic pages to handle 85–95% of cases efficiently.
  • Detect JavaScript-heavy sites early (low text in simple fetch) and use headless browsers only when necessary to avoid slowdowns.
  • Validate fetched content by checking text length, error phrases, and uniqueness; discard incomplete or corrupted pages before analysis.
  • Implement timeout and exception handling to gracefully skip unreachable, slow, or blocked URLs without crashing the agent loop.
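The last takeaway can be sketched as a small loop wrapper. This is an illustration under stated assumptions rather than part of the fetchers shown earlier: the name fetch_all, the retry count, and the exponential backoff schedule are hypothetical, and the fetcher argument stands in for any function that takes a URL and returns a result dict with a "success" key.

```python
import time
from typing import Callable, List


def fetch_all(
    urls: List[str],
    fetcher: Callable[[str], dict],
    retries: int = 2,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Fetch every URL, retrying failures and never raising to the caller."""
    results = []
    for url in urls:
        result = {"success": False, "reason": "not attempted", "url": url}
        for attempt in range(retries + 1):
            try:
                result = fetcher(url)
            except Exception as exc:
                # An unexpected fetcher error must not crash the agent loop
                result = {
                    "success": False,
                    "reason": f"Unhandled error: {exc}",
                    "url": url,
                }
            if result.get("success"):
                break
            if attempt < retries:
                # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
                sleep(backoff * 2 ** attempt)
        results.append(result)
    return results
```

For simplicity this retries every failure. In practice it may be worth skipping retries for permanent errors such as a 404, which requires the fetcher to expose the status code.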

Frequently Asked Questions

Why not use headless browsers for all fetches?

Headless browsers (Selenium, Playwright) are 10–50x slower than simple HTTP requests and consume significant memory. Use them only for known JavaScript-heavy sites or when simple fetch returns too little text.
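The routing rule in this answer can be written as a small predicate. The sketch below is a hedged illustration: JS_HEAVY_HOSTS is a hypothetical, hand-maintained set (the single entry is a placeholder), and the 200-character threshold is borrowed from the smart_fetch example earlier in the article.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical allowlist of hosts known to need a browser; fill in your own.
JS_HEAVY_HOSTS = {"app.example.com"}


def needs_browser(
    url: str,
    simple_result: Optional[dict] = None,
    min_text_length: int = 200,
) -> bool:
    """Decide whether a URL should go to the headless browser."""
    host = (urlparse(url).hostname or "").lower()
    if host in JS_HEAVY_HOSTS:
        # Known JavaScript-heavy site: skip the simple fetch entirely
        return True
    if simple_result is None:
        # No simple fetch attempted yet, so try the cheap path first
        return False
    # Simple fetch returned too little text (or failed and returned none)
    return len(simple_result.get("text", "")) < min_text_length
```

Checking the host first avoids paying for a simple fetch that is already known to come back empty.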

How do I handle sites that require login?

Most research can rely on publicly available sources, so skip login-required sites entirely. If a source is behind authentication, note it in your report but don't attempt to bypass the login, which is legally and ethically risky.

What's the maximum content size I should extract?

Keep extracted text under 10,000 words per page (roughly 50 KB). Larger pages risk token overflow in downstream LLM analysis. Truncate gracefully, keeping the first 70% or so of the page, which usually contains the key information. The fallback path in fetch_and_extract above applies a much stricter cap of 8,000 characters.
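One way to read "truncate gracefully" is to cut at a natural boundary instead of mid-word. The sketch below assumes a character budget (defaulting to the 8,000 characters used in the fetcher example) and prefers the last line break, then the last space, as long as the cut keeps at least half of the budget. The function name and the half-budget rule are illustrative assumptions.

```python
def truncate_text(text: str, max_chars: int = 8000) -> str:
    """Keep the start of the text, cutting at a line break or space if possible."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Prefer a paragraph/line boundary, then a word boundary
    for separator in ("\n", " "):
        cut = window.rfind(separator)
        # Only accept a boundary that keeps at least half of the budget
        if cut >= max_chars // 2:
            return window[:cut].rstrip()
    # No usable boundary: fall back to a hard cut
    return window
```

The result never exceeds max_chars, so it can replace a plain slice such as text[:8000] without changing the downstream size guarantee.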

Can I cache fetched content?

Yes, absolutely. Cache by URL and reuse for at least 7 days. This avoids re-fetching the same page multiple times and reduces bandwidth. Use SQLite or Redis.
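A minimal sketch of a URL-keyed cache using SQLite from the standard library is shown below. The class name FetchCache, the table layout, and the choice to treat 7 days as an expiry window are assumptions made for illustration. The optional now parameter exists only to make expiry easy to test.

```python
import json
import sqlite3
import time

SEVEN_DAYS = 7 * 24 * 3600


class FetchCache:
    """Store fetch results as JSON, keyed by URL, with a time-to-live."""

    def __init__(self, path: str = ":memory:", ttl_seconds: float = SEVEN_DAYS):
        self.conn = sqlite3.connect(path)
        self.ttl = ttl_seconds
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, fetched_at REAL NOT NULL, payload TEXT NOT NULL)"
        )

    def get(self, url: str, now: float = None):
        """Return the cached result dict, or None if missing or expired."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT fetched_at, payload FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None or now - row[0] > self.ttl:
            return None
        return json.loads(row[1])

    def put(self, url: str, result: dict, now: float = None) -> None:
        """Insert or overwrite the cached result for a URL."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, fetched_at, payload) VALUES (?, ?, ?)",
            (url, now, json.dumps(result)),
        )
        self.conn.commit()
```

A caller would check cache.get(url) before fetching and call cache.put(url, result) after a successful fetch. Passing a file path instead of ":memory:" makes the cache persist across runs.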

Further Reading