fix: resolve relative URLs against parent dir when base_fork_url is an HTML file by dracpet · Pull Request #5278 · 1Panel-dev/MaxKB

dracpet · 2026-05-21T21:39:51Z

Bug: Web site crawl resolves relative links incorrectly when `base_fork_url` is an HTML file

Problem

When crawling a documentation site where pages link to index.html (extremely common in breadcrumbs, logos, and "Home" links), MaxKB follows that link, and the new Fork instance gets a base_fork_url ending in .html (e.g. https://docs.dolphindb.com/en/index.html).

The reset_url function then appends /field_value/ to the base URL and calls urljoin(..., "."). Since index.html/ looks like a directory to urljoin, every relative link on the page gets index.html/ injected into its path:

Input:  base_fork_url = https://docs.dolphindb.com/en/index.html
        field_value   = about_dolphindb.html

Old code:
  urljoin('https://docs.dolphindb.com/en/index.html/about_dolphindb.html/', '.')
  → https://docs.dolphindb.com/en/index.html/about_dolphindb.html  ← WRONG, 404

Expected:
  https://docs.dolphindb.com/en/about_dolphindb.html  ← CORRECT
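
The difference can be reproduced with the standard library alone. The snippet below is a minimal sketch, not MaxKB's code: it mirrors the urljoin call quoted above and compares it with plain urljoin resolution. The trailing-slash strip is an assumption, added so the output matches the URL shown above.

```python
from urllib.parse import urljoin

base_fork_url = "https://docs.dolphindb.com/en/index.html"
field_value = "about_dolphindb.html"

# Old behaviour as described above: the link is appended to the page URL,
# so urljoin treats "index.html/" as a directory segment.
# (The rstrip is an assumption made for this illustration.)
old_result = urljoin(base_fork_url + "/" + field_value + "/", ".").rstrip("/")

# Standard resolution: the last path segment of the base (index.html) is
# dropped before the relative link is joined.
expected = urljoin(base_fork_url, field_value)

print(old_result)
print(expected)
```

The first line printed contains index.html/ in the path; the second is the sibling page under /en/.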

This cascades: every child page that links to index.html produces a poisoned crawl level where all further relative links 404, silently killing the scrape with zero content on most pages.

Reproduction

1. Create a Web Site knowledge base with URL https://docs.dolphindb.com/en/ and selector body.
2. Sync. The index page discovers index.html as a child link.
3. All child links from index.html resolve to broken paths like /en/index.html/getting_started.html → 404.
4. Result: ~2 documents with content, 130+ empty/error.

Fix

Two changes in apps/common/utils/fork.py:

1. reset_url (lines 114-124): When base_fork_url ends in .html/.htm, resolve relative links against the parent directory instead of the file path.

2. get_child_link_list (lines 95-99): Use a crawl_prefix (parent directory for HTML files) for the link filter, so correctly-resolved child URLs aren't filtered out by the startswith(base_fork_url) check.
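
The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged diff: crawl_prefix is written here as a helper function, and resolve_relative and in_crawl_scope are hypothetical names standing in for the relevant parts of reset_url and get_child_link_list. It also assumes the base URL carries no query string or fragment.

```python
from urllib.parse import urljoin, urlparse


def crawl_prefix(base_fork_url: str) -> str:
    """Return the parent directory for .html/.htm URLs, else the URL unchanged."""
    path = urlparse(base_fork_url).path.lower()
    if path.endswith((".html", ".htm")):
        # Drop the file name: .../en/index.html -> .../en
        return base_fork_url.rsplit("/", 1)[0]
    return base_fork_url


def resolve_relative(base_fork_url: str, field_value: str) -> str:
    """Hypothetical stand-in for the relative-link branch of reset_url."""
    prefix = crawl_prefix(base_fork_url).rstrip("/")
    target = field_value if field_value.endswith("/") else field_value + "/"
    resolved = urljoin(prefix + "/" + target, ".")
    return resolved[:-1] if resolved.endswith("/") else resolved


def in_crawl_scope(url: str, base_fork_url: str) -> bool:
    """Hypothetical stand-in for the link filter in get_child_link_list."""
    return url.startswith(crawl_prefix(base_fork_url))


html_base = "https://docs.dolphindb.com/en/index.html"
dir_base = "https://docs.dolphindb.com/en/"

child = resolve_relative(html_base, "about_dolphindb.html")
print(child)
print(resolve_relative(dir_base, "about_dolphindb.html"))
print(in_crawl_scope(child, html_base))
```

With this shape, a file-based base URL and a directory-based base URL resolve the same relative link to the same target, and the resolved child still passes the prefix filter even though it no longer starts with the full index.html URL.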

Verification

Applied to MaxKB v2.x, scraped https://docs.dolphindb.com/en/ with selector main → 800+ documents, 0 broken URLs
All child links resolve correctly regardless of whether the current page URL ends in .html
No regression: directory-based URLs (e.g. /en/) resolve identically to before

Important: If this PR is acceptable, please also review the depth parameter. The hardcoded depth=2 in sync_web_knowledge and sync_replace_web_knowledge limits most knowledge base crawls to 2 hops. Consider making it configurable or increasing the default: in our test case, depth=3 uncovered 6x more documents.


f2c-ci-robot · 2026-05-21T21:39:56Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.


f2c-ci-robot · 2026-05-21T21:40:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

f2c-ci-robot Bot added the do-not-merge/release-note-label-needed label May 21, 2026

liuruibin merged commit 2216002 into 1Panel-dev:v2 May 22, 2026
1 of 3 checks passed

liuruibin merged 1 commit into 1Panel-dev:v2 from dracpet:fix/web-crawl-html-base-url
