fix: resolve relative URLs against parent dir when base_fork_url is an HTML file #5278
Merged
…n HTML file

When crawling a site where pages link to `index.html`, the `Fork` instance gets `base_fork_url` ending in `.html`. The `reset_url` function then appends `/field_value/` and calls `urljoin(..., '.')`, which treats `index.html/` as a directory. Every relative link gets `index.html/` injected into its path (e.g. `.../en/index.html/about_dolphindb.html`), causing 404 cascade.

Fix: in `reset_url`, resolve against parent dir for `.html`/`.htm` base URLs. In `get_child_link_list`, use `crawl_prefix` (parent dir for HTML files) for the link filter so correctly-resolved URLs aren't filtered out.

Verified: scraped docs.dolphindb.com/en/ with selector 'main' → 800+ documents, 0 broken URLs.
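The failure mode described in the commit message can be reproduced with the standard library alone. The snippet below is a minimal sketch, not the code in `fork.py`: `resolve_like_reset_url` is a hypothetical stand-in that only mirrors the "append `/field_value/`, then `urljoin(..., '.')`" step, and the trailing-slash strip at the end is an assumption made so the results are easy to compare.

```python
from urllib.parse import urljoin


def resolve_like_reset_url(base_fork_url: str, field_value: str) -> str:
    """Hypothetical stand-in for the relative-link step described above."""
    if not field_value.endswith("/"):
        field_value += "/"
    # The base URL is treated as a directory, whatever its last segment is.
    joined = urljoin(base_fork_url + "/" + field_value, ".")
    # Assumption: drop the trailing slash so the result looks like a page URL.
    return joined[:-1] if joined.endswith("/") else joined


# Base URL is a directory: the relative link resolves as expected.
ok = resolve_like_reset_url(
    "https://docs.dolphindb.com/en", "about_dolphindb.html"
)

# Base URL is an HTML file: "index.html/" is injected into the path.
broken = resolve_like_reset_url(
    "https://docs.dolphindb.com/en/index.html", "about_dolphindb.html"
)

print(ok)
print(broken)
```

The second call yields a path containing `index.html/`, which is the shape of the broken URLs reported above.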
## Bug: Web site crawl resolves relative links incorrectly when `base_fork_url` is an HTML file

### Problem

When crawling a documentation site where pages link to `index.html` (extremely common: breadcrumbs, logos, "Home" links), MaxKB follows that link. The new `Fork` instance gets a `base_fork_url` ending in `.html` (e.g. `https://docs.dolphindb.com/en/index.html`).

The `reset_url` function then appends `/field_value/` to the base URL and calls `urljoin(..., ".")`. Since `index.html/` looks like a directory to `urljoin`, every relative link on the page gets `index.html/` injected into its path (e.g. `.../en/index.html/about_dolphindb.html`).

This cascades: every child page that links to `index.html` produces a poisoned crawl level where all further relative links return 404, silently killing the scrape with zero content on most pages.

### Reproduction
1. Crawl `https://docs.dolphindb.com/en/` with selector `body`.
2. The crawler picks up `index.html` as a child link.
3. Relative links on `index.html` resolve to broken paths like `/en/index.html/getting_started.html` → 404.

### Fix
Two changes in `apps/common/utils/fork.py`:

1. `reset_url` (lines 114-124): When `base_fork_url` ends in `.html`/`.htm`, resolve relative links against the parent directory instead of the file path.
2. `get_child_link_list` (lines 95-99): Use a `crawl_prefix` (parent directory for HTML files) for the link filter, so correctly resolved child URLs aren't filtered out by the `startswith(base_fork_url)` check.

### Verification
- Scraped `https://docs.dolphindb.com/en/` with selector `main` → 800+ documents, 0 broken URLs.
- Base URLs that do not end in `.html` (e.g. `/en/`) resolve identically to before.

**Important:** If this PR is acceptable, please also review the depth parameter. The hardcoded `depth=2` in `sync_web_knowledge` and `sync_replace_web_knowledge` limits most knowledge base crawls to 2 hops. Consider making it configurable or increasing the default: a depth of 3 uncovered 6x more documents in our test case.
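For readers who want the two changes listed under Fix in concrete form, here is a minimal sketch. It is not the merged diff: `crawl_prefix`, `resolve_relative`, and `keep_child_links` are hypothetical helper names, the sketch assumes base URLs carry no query string, and it uses a plain `urljoin` against the parent directory rather than the `urljoin(..., ".")` construction used in `reset_url`.

```python
from urllib.parse import urljoin

HTML_SUFFIXES = (".html", ".htm")


def crawl_prefix(base_fork_url: str) -> str:
    """Return the parent directory for HTML-file bases, else the base itself."""
    if base_fork_url.lower().endswith(HTML_SUFFIXES):
        return base_fork_url.rsplit("/", 1)[0]
    return base_fork_url


def resolve_relative(base_fork_url: str, field_value: str) -> str:
    """Resolve a relative href against the directory that contains the page."""
    return urljoin(crawl_prefix(base_fork_url) + "/", field_value)


def keep_child_links(base_fork_url: str, urls: list[str]) -> list[str]:
    """Filter on the crawl prefix instead of the full base URL."""
    prefix = crawl_prefix(base_fork_url)
    return [url for url in urls if url.startswith(prefix)]


base = "https://docs.dolphindb.com/en/index.html"
resolved = resolve_relative(base, "getting_started.html")
kept = keep_child_links(base, [resolved, "https://example.com/other.html"])

print(resolved)
print(kept)
```

Filtering on `startswith(base)` would drop `resolved` here, because a sibling page never starts with `.../en/index.html`; filtering on the parent directory keeps it while still rejecting off-site links.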