
Building a Web Research Agent with AgentScope: An Advanced Tutorial

#AgentScope #Project #RAG #Web Scraping

This tutorial shows you how to push past basic chatbot patterns and build something genuinely useful: an autonomous agent that can search the web, scrape content, synthesize findings, and persist what it learns across sessions. It assumes you have Python experience and have read the introductory AgentScope material. By the end, you will have a production-grade research assistant running on AgentScope's fully async architecture.


Prerequisites and Advanced Environment Setup

Before writing a single line of agent logic, your environment needs to be airtight. AgentScope requires Python 3.10 or higher — this is non-negotiable because the framework is now fully async-first following the v1.0.0 release.

python --version   # must be >= 3.10
pip install agentscope
pip install httpx beautifulsoup4 aiofiles

The additional packages handle HTTP requests and HTML parsing for the custom tools you will build later. httpx is the async-native HTTP client that plays well with AgentScope’s event loop; beautifulsoup4 handles HTML parsing; aiofiles handles async file I/O for session persistence.

Set your API key as an environment variable. This tutorial uses DashScope as the backing LLM:

export DASHSCOPE_API_KEY="sk-your-key-here"

Important: Never hardcode API keys in source files. If you deploy this agent to a production environment, review the secrets management practices in our guide n8n Self-Hosting: Production Deployment Guide; they apply broadly to any self-hosted agent workload.
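Reading the key defensively also pays off: a missing variable should fail with an actionable error at startup rather than a confusing authentication failure mid-run. The helper below is a hypothetical convenience, not an AgentScope API:

```python
import os

def get_api_key(var: str = "DASHSCOPE_API_KEY") -> str:
    """Fetch an API key from the environment, failing loudly if unset."""
    key = os.environ.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; export it before starting the agent.")
    return key
```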

Create your project structure:

web_research_agent/
├── main.py
├── tools.py
├── memory_store.py
└── sessions/         # persisted session files live here

Architecting the Web Research Agent in AgentScope

A web research agent has three responsibilities: gathering information from external sources, synthesizing it into coherent findings, and remembering what it has already learned to avoid redundant work. The AgentScope component model maps cleanly to these responsibilities.

The core architecture uses:

  • ReActAgent — the reasoning backbone that decides which tool to call and when to stop
  • DashScopeChatModel — the language model handling reasoning steps
  • Toolkit — the container registering search and scraping functions
  • InMemoryMemory — short-term session memory for the active conversation
  • MsgHub — optional multi-agent coordination if you later add a critic or summarizer agent

The ReAct (Reason + Act) loop is ideal for web research because the agent must decide iteratively: search → read result → decide whether to go deeper → synthesize. This is structurally similar to the tool-using agents covered in LangChain Agents and Tools: Build Agents That Take Action, but AgentScope's async-first design means every tool invocation is non-blocking, which matters when your tools involve network I/O.

Here is the high-level wiring before filling in the tool implementations:

import asyncio
import os
from agentscope.agent import ReActAgent, UserAgent
from agentscope.memory import InMemoryMemory
from agentscope.model import DashScopeChatModel
from agentscope.tool import Toolkit
from tools import web_search, scrape_page, save_finding
from memory_store import load_session, save_session

async def build_agent(session_id: str) -> ReActAgent:
    model = DashScopeChatModel(
        api_key=os.environ["DASHSCOPE_API_KEY"]
    )

    toolkit = Toolkit()
    toolkit.register_tool_function(web_search)
    toolkit.register_tool_function(scrape_page)
    toolkit.register_tool_function(save_finding)

    prior_memory = load_session(session_id)

    agent = ReActAgent(
        name="ResearchAssistant",
        model=model,
        memory=InMemoryMemory(history=prior_memory),
        toolkit=toolkit,
        sys_prompt=(
            "You are a thorough web research assistant. "
            "Use search and scraping tools to gather evidence "
            "before answering. Always save key findings."
        )
    )
    return agent

Notice that InMemoryMemory accepts a history parameter here — this is how you warm the agent with prior context when resuming a session.


Developing Custom Tools for Search and Web Scraping

AgentScope tools are plain Python functions registered on a Toolkit. The framework inspects function signatures and docstrings to expose them to the agent’s reasoning loop. Every tool function used in an async agent must itself be an async function.

Create tools.py:

import httpx
from bs4 import BeautifulSoup
import aiofiles
import json
from datetime import datetime
from urllib.parse import quote_plus

async def web_search(query: str, num_results: int = 5) -> str:
    """
    Search the web for the given query and return a list of result titles and URLs.
    
    Args:
        query: The search query string.
        num_results: Number of results to return (default 5, max 10).
    
    Returns:
        A formatted string of search results with titles and URLs.
    """
    num_results = min(num_results, 10)
    # Using DuckDuckGo's HTML endpoint as a dependency-free search option;
    # quote_plus safely encodes spaces and special characters in the query
    url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    
    async with httpx.AsyncClient(timeout=15.0, follow_redirects=True) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (research-agent/1.0)"
        })
        response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    
    for result in soup.select(".result__title")[:num_results]:
        link = result.select_one("a")
        if link:
            title = link.get_text(strip=True)
            href = link.get("href", "")
            results.append(f"- {title}: {href}")
    
    if not results:
        return "No results found for this query."
    
    return "\n".join(results)


async def scrape_page(url: str, max_chars: int = 3000) -> str:
    """
    Fetch and extract the main text content from a webpage.
    
    Args:
        url: The URL to scrape.
        max_chars: Maximum characters to return (default 3000).
    
    Returns:
        Extracted plain text from the page, truncated to max_chars.
    """
    async with httpx.AsyncClient(timeout=20.0, follow_redirects=True) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (research-agent/1.0)"
        })
        response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Remove noise elements
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()
    
    text = soup.get_text(separator=" ", strip=True)
    # Collapse whitespace
    text = " ".join(text.split())
    
    return text[:max_chars]


async def save_finding(topic: str, summary: str, source_url: str) -> str:
    """
    Save a research finding to the local knowledge store.
    
    Args:
        topic: The topic label for this finding.
        summary: A concise summary of the finding.
        source_url: The URL where this information was found.
    
    Returns:
        Confirmation message with the saved file path.
    """
    finding = {
        "topic": topic,
        "summary": summary,
        "source_url": source_url,
        "saved_at": datetime.utcnow().isoformat()
    }
    
    filename = f"sessions/finding_{topic.replace(' ', '_')}_{datetime.utcnow().strftime('%H%M%S')}.json"
    
    async with aiofiles.open(filename, "w") as f:
        await f.write(json.dumps(finding, indent=2))
    
    return f"Finding saved to {filename}"

The docstrings are critical — AgentScope uses them to describe tool capabilities to the underlying LLM. Write them as you would an API specification: precise argument descriptions and explicit return value descriptions. Vague docstrings lead to incorrect tool invocations.
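AgentScope's actual schema extraction is internal to the framework, but the mechanism can be approximated with the standard library: the function signature supplies parameter names and which arguments are required, while the docstring supplies the description the LLM sees. A simplified sketch, not the framework's real code:

```python
import inspect

def tool_schema(fn) -> dict:
    """Build a minimal JSON-schema-like description of a tool function."""
    sig = inspect.signature(fn)
    params = {}
    for name, p in sig.parameters.items():
        params[name] = {
            # Annotation class name if present, otherwise "any"
            "type": getattr(p.annotation, "__name__", "any"),
            # Parameters without defaults are required
            "required": p.default is inspect.Parameter.empty,
        }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }

async def web_search(query: str, num_results: int = 5) -> str:
    """Search the web and return result titles and URLs."""
    return ""

schema = tool_schema(web_search)
```

With this sketch, `query` comes out required while `num_results` is optional because it carries a default; a vague or missing docstring would leave the `description` field empty, which is exactly why precise docstrings matter.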

For agents that execute code, the research in this area extends further — see AutoGen Code Execution: Build Agents That Write and Run Code for a comparison of sandboxed execution patterns across frameworks.


Implementing the Research and Synthesis Logic

Now connect everything in main.py and implement the conversation loop:

import asyncio
import os
from agentscope.agent import ReActAgent, UserAgent
from agentscope.memory import InMemoryMemory
from agentscope.model import DashScopeChatModel
from agentscope.tool import Toolkit
from tools import web_search, scrape_page, save_finding
from memory_store import load_session, save_session

SESSION_ID = "research_session_001"

async def main():
    os.makedirs("sessions", exist_ok=True)
    
    model = DashScopeChatModel(
        api_key=os.environ["DASHSCOPE_API_KEY"]
    )

    toolkit = Toolkit()
    toolkit.register_tool_function(web_search)
    toolkit.register_tool_function(scrape_page)
    toolkit.register_tool_function(save_finding)

    prior_history = load_session(SESSION_ID)

    react_agent = ReActAgent(
        name="ResearchAssistant",
        model=model,
        memory=InMemoryMemory(history=prior_history),
        toolkit=toolkit,
        sys_prompt=(
            "You are a thorough web research assistant. "
            "For any research question: (1) search for relevant pages, "
            "(2) scrape the most promising results, "
            "(3) synthesize your findings into a clear answer, "
            "(4) save important findings using save_finding. "
            "Be precise about sources."
        )
    )

    user_agent = UserAgent(name="User")

    print("Web Research Agent ready. Type 'exit' to quit and save session.\n")

    msg = await user_agent.get_input("Research query: ")

    while msg.text.strip().lower() != "exit":
        # Agent reasons and acts — this may invoke tools multiple times
        response = await react_agent(msg)
        print(f"\nAssistant: {response.text}\n")
        
        msg = await user_agent(response)

    # Persist the session before exiting
    save_session(SESSION_ID, react_agent.memory.get_history())
    print("Session saved. Goodbye.")


if __name__ == "__main__":
    asyncio.run(main())

The await react_agent(msg) call is the ReAct loop in action. Internally, the agent will reason about which tool to call, invoke it asynchronously, observe the result, and continue reasoning until it produces a final answer. You do not manage this loop manually — that is the framework’s job.
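Conceptually, what the framework runs for you resembles the following schematic. This is a toy illustration with a stubbed "LLM" and tool registry, not AgentScope source:

```python
import asyncio

# Toy tool registry standing in for a Toolkit
TOOLS = {"lookup": lambda q: f"stub result for {q!r}"}

async def fake_reason(observations: list) -> dict:
    """Stand-in for the LLM: call the tool once, then finish."""
    if not observations:
        return {"action": "lookup", "input": "agentscope"}
    return {"action": "final", "answer": observations[-1]}

async def react_loop(max_steps: int = 5) -> str:
    observations: list = []
    for _ in range(max_steps):
        step = await fake_reason(observations)         # Reason
        if step["action"] == "final":
            return step["answer"]                      # Stop condition
        result = TOOLS[step["action"]](step["input"])  # Act
        observations.append(result)                    # Observe
    return "step budget exhausted"

answer = asyncio.run(react_loop())
```

The real loop differs in that the reasoning step is a model call and tool invocations are awaited coroutines, but the reason → act → observe → stop structure is the same.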


Managing State and Memory Across Sessions

InMemoryMemory is wiped when the process exits. For a research agent that builds knowledge over time, you need persistence. Create memory_store.py:

import json
import os
from typing import Any

SESSIONS_DIR = "sessions"

def load_session(session_id: str) -> list[dict[str, Any]]:
    """Load prior conversation history for a session."""
    path = os.path.join(SESSIONS_DIR, f"{session_id}.json")
    if not os.path.exists(path):
        return []
    
    with open(path, "r") as f:
        data = json.load(f)
    
    return data.get("history", [])


def save_session(session_id: str, history: list[dict[str, Any]]) -> None:
    """Persist conversation history for a session."""
    os.makedirs(SESSIONS_DIR, exist_ok=True)
    path = os.path.join(SESSIONS_DIR, f"{session_id}.json")
    
    # Keep only the last 50 turns to avoid unbounded growth
    trimmed = history[-50:] if len(history) > 50 else history
    
    with open(path, "w") as f:
        json.dump({"history": trimmed}, f, indent=2)

This gives you short-term memory (the active InMemoryMemory during a session) and medium-term persistence (JSON files between sessions). The trimmed slice prevents history files from growing indefinitely — older context naturally falls off, keeping the agent focused on recent work.
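The trimming behavior is easy to verify in isolation. This standalone round-trip mirrors the functions above but takes explicit paths and writes to a temp directory instead of sessions/:

```python
import json
import os
import tempfile

def save_session(path: str, history: list) -> None:
    """Persist history, keeping only the newest 50 turns."""
    with open(path, "w") as f:
        json.dump({"history": history[-50:]}, f)

def load_session(path: str) -> list:
    """Load history, or return an empty list for a new session."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f).get("history", [])

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "session.json")
    save_session(p, [{"turn": i} for i in range(80)])
    restored = load_session(p)
    # 80 turns in, 50 turns out; the oldest 30 fell off
```

Note that `history[-50:]` already returns the whole list when it has 50 items or fewer, so the explicit length check in memory_store.py is equivalent.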

v1.0.0 Warning: The full RAG and distribution modules were temporarily removed in v1.0.0. If you need semantic retrieval over a large corpus of saved findings, you will need to integrate an external vector store manually until AgentScope restores these modules. Track the changelog before upgrading.
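Until the RAG module returns, even a dependency-free keyword scorer over the saved finding files can stand in for retrieval. A rough sketch; a real deployment would use an embedding model and a proper vector store:

```python
import glob
import json

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the text (crude relevance proxy)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / max(len(q_terms), 1)

def retrieve_findings(query: str, pattern: str = "sessions/finding_*.json",
                      top_k: int = 3) -> list:
    """Rank saved findings by keyword overlap with the query."""
    scored = []
    for path in glob.glob(pattern):
        with open(path) as f:
            finding = json.load(f)
        text = f"{finding.get('topic', '')} {finding.get('summary', '')}"
        scored.append((keyword_score(query, text), finding))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [finding for score, finding in scored[:top_k] if score > 0]
```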

For multi-agent scenarios — for example, adding a CriticAgent that evaluates research quality — you would wrap the agents in a MsgHub, which routes messages between participants and maintains shared conversation state. The MsgHub pattern also enables parallel research branches where multiple agents investigate different subtopics simultaneously.


Frequently Asked Questions

Why does my agent fail with a coroutine error when calling tools?

AgentScope v1.0.0 is fully asynchronous. If you define tool functions as regular def instead of async def, the framework may not await them correctly, leading to coroutine objects being returned as strings. Ensure every tool function is defined with async def and every call to an agent uses await.
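The failure mode is easy to reproduce outside the framework; note how the un-awaited call yields a coroutine object instead of a string:

```python
import asyncio
import inspect

async def my_tool(x: str) -> str:
    return f"result: {x}"

# Calling without await does NOT run the body; it creates a coroutine object
pending = my_tool("query")
assert inspect.iscoroutine(pending)
pending.close()  # close it to silence the "never awaited" warning

# Only awaiting (here via asyncio.run) produces the actual return value
value = asyncio.run(my_tool("query"))
```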

Can I use a different LLM provider instead of DashScope?

Yes. AgentScope supports multiple model backends. Replace DashScopeChatModel with the appropriate model class for your provider and pass the corresponding API key. This tutorial uses DashScope as the documented example, but the Toolkit and ReActAgent interfaces are model-agnostic.

How do I prevent the agent from scraping the same URL twice in one session?

The cleanest approach is to maintain a visited_urls set in your tool module and check it at the start of scrape_page. Since Python module-level state persists for the process lifetime, a simple set works for single-session deduplication. For cross-session deduplication, persist the set to a JSON file alongside your session history.
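A minimal sketch of that pattern: a module-level set plus a guard at the top of the scraper. The fetch itself is stubbed here so the example stays self-contained:

```python
import asyncio

# Module-level state persists for the process lifetime
visited_urls: set = set()

async def scrape_page_once(url: str) -> str:
    """Scrape a page unless it was already visited this session."""
    if url in visited_urls:
        return f"Skipped {url}: already scraped this session."
    visited_urls.add(url)
    # A real implementation would fetch and parse here; stubbed for the sketch
    return f"Scraped {url}."

first = asyncio.run(scrape_page_once("https://example.com"))
second = asyncio.run(scrape_page_once("https://example.com"))
```

Returning an informative string on the duplicate case (rather than raising) lets the agent see why the call was skipped and move on to a different source.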

What replaced WebBrowser in AgentScope v1.0.0?

The WebBrowser class was deprecated in favor of an MCP-based approach for browser automation. If you need full browser interaction (JavaScript rendering, form submission), you should integrate an MCP tool server rather than using the old WebBrowser class. The custom httpx + BeautifulSoup approach in this tutorial covers the majority of research use cases that only require HTML content.

How do I add a second agent to review and critique the research output?

Instantiate a second ReActAgent (or a simpler DialogAgent equivalent — note DialogAgent is deprecated in v1.0.0, so use ReActAgent with a critic-focused system prompt). Then use a MsgHub to route messages between the researcher and critic agents. The MsgHub manages turn-taking and message routing, so you define participation rules once rather than manually wiring await agent(msg) chains.
