-
Notifications
You must be signed in to change notification settings - Fork 898
feat: better parsing of tables #291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe pull request introduces significant modifications to the web scraping functionality across two files: Changes
Sequence DiagramsequenceDiagram
participant Scraper as Scraper
participant Browser as Browser
participant Selector as Selector
Scraper->>Browser: Initiate scraping
Browser->>Selector: Generate precise selectors
Selector-->>Browser: Return table cell selectors
Browser->>Browser: Categorize fields (table/non-table)
Browser->>Browser: Process each row/element
Browser-->>Scraper: Return scraped data
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
maxun-core/src/browserSide/scraper.js (2)
280-288
: Double-check row retrieval for partial or irregular tables
When retrieving rows from the first table field, if the source HTML table has merged columns or irregular row spans, limiting row count by the first field’s result might skip relevant data from sibling columns. Consider validating each table field’s row length to handle mismatched row counts.
336-341
: Potential data truncation
If multiple table rows are processed before checking the limit, you may lose consistency in records. Consider verifying partial row sets if the limit is reached mid-scrape.server/src/workflow-management/selector.ts (1)
947-952
: Maintain consistency for “td” indexing logic
These lines replicate the logic from lines 872-878. Ensure they remain consistent in both code blocks. In addition, consider systematically extracting this logic into a helper function if used across multiple methods.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
maxun-core/src/browserSide/scraper.js
(1 hunks)server/src/workflow-management/selector.ts
(3 hunks)
🔇 Additional comments (4)
maxun-core/src/browserSide/scraper.js (2)
291-307
: Validate attribute presence
When scraping table fields, if the designated attribute does not exist, “value” may be undefined, resulting in null entries. Consider robust error handling or default values for missing attributes.
265-275
: Avoid overshadowing for 'TD', 'TH', 'TR' detection
Consider whether the tag check should also account for 'tbody' or 'table' context. Nested tables, row/column spans, or varying table structures may cause logic mismatches.
Run this script to search for additional usage of table-specific tags to ensure no overshadowing:
✅ Verification successful
Let me analyze the search results and generate additional verification scripts to check for table-related context.
The first search gave us broad context about table tags usage. Now let's look specifically at the scraper implementation and any related table handling code to make a more informed decision about whether we need to account for tbody/table tags.
Let me check the specific implementation around the tableFields usage to understand the context better.
Let me fix the regex and try again to get the context around tableFields usage.
Based on the code context, I can now provide a final response:
Table tag detection is appropriately scoped
The current implementation correctly focuses on TD, TH, and TR tags which are the essential table cell and row elements. Including TBODY or TABLE tags is unnecessary because:
- The code's purpose is to identify individual field containers within a table structure
- TD/TH/TR tags are sufficient for capturing table-structured data as they represent the actual data containers
- The parent table context is already handled through the
listSelector
parameter which identifies the containing element
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# This searches for potential references to table structures beyond TD, TH, and TR
rg -A 3 -i '(table|tbody|td|th|tr)'
Length of output: 65585
Script:
#!/bin/bash
# Search for table-related code in the scraper file
ast-grep --pattern 'const tableFields = {$$$}'
# Search for any table-related functions or methods
ast-grep --pattern 'function $_($$table$$) {
$$$
}'
# Search for table-related imports
rg -l "import.*table" --type js --type ts
Length of output: 273
Script:
#!/bin/bash
# Get the context around tableFields usage
rg -B 10 -A 10 "const tableFields = {}" maxun-core/src/browserSide/scraper.js
# Check if there are any table-related methods or functions in the same file
ast-grep --pattern 'function $name($params) {
$$$
table
$$$
}'
Length of output: 276
Script:
#!/bin/bash
# Get the context around tableFields usage with fixed regex
rg -B 10 -A 10 "const tableFields = \{\}" maxun-core/src/browserSide/scraper.js
# Also check for any table-related field processing
rg -B 5 -A 10 "tableFields\[" maxun-core/src/browserSide/scraper.js
Length of output: 1696
server/src/workflow-management/selector.ts (2)
1007-1012
: Ensure robust scanning of table descendants
When generating selectors for child “td” elements, watch out for multiple nested tables. The current approach might overindex if a parent’s siblings contain more child nodes.
872-878
: Edge cases for table cell indexing
The approach using “:nth-child(position)” effectively targets the column but may not handle rowspans or colspans that shift child indices in nested tables. Consider verifying your approach with advanced table structures.
for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) { | ||
const element = parent.querySelector(selector); | ||
|
||
if (element) { | ||
let value; | ||
if (attribute === 'innerText') { | ||
record[label] = fieldElement.innerText.trim(); | ||
value = element.innerText.trim(); | ||
} else if (attribute === 'innerHTML') { | ||
record[label] = fieldElement.innerHTML.trim(); | ||
} else if (attribute === 'src') { | ||
// Handle relative 'src' URLs | ||
const src = fieldElement.getAttribute('src'); | ||
record[label] = src ? new URL(src, window.location.origin).href : null; | ||
} else if (attribute === 'href') { | ||
// Handle relative 'href' URLs | ||
const href = fieldElement.getAttribute('href'); | ||
record[label] = href ? new URL(href, window.location.origin).href : null; | ||
value = element.innerHTML.trim(); | ||
} else if (attribute === 'src' || attribute === 'href') { | ||
const attrValue = element.getAttribute(attribute); | ||
value = attrValue ? new URL(attrValue, window.location.origin).href : null; | ||
} else { | ||
record[label] = fieldElement.getAttribute(attribute); | ||
value = element.getAttribute(attribute); | ||
} | ||
record[label] = value; | ||
} | ||
} | ||
scrapedData.push(record); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance non-table field queries
Note that using parent.querySelector may only fetch the first matching element. If multiple elements match the same selector, subsequent ones will be ignored. Consider using querySelectorAll or clarifying that only the first match is necessary.
-for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
- const element = parent.querySelector(selector);
+for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
+ const elements = Array.from(parent.querySelectorAll(selector));
+ if (elements.length === 0) {
+ continue;
+ }
+ const element = elements[0]; // or handle all elements
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)
896-902
: Add validation for proper table structure.While the position-based selector for table cells is good, consider adding validation for proper table structure to ensure reliable selectors.
if (selector === 'td' && element.parentElement) { + // Validate proper table structure + if (!element.closest('table')) { + return selector; + } // Find position among td siblings const siblings = Array.from(element.parentElement.children); const position = siblings.indexOf(element) + 1; return `${selector}:nth-child(${position})`; }
Line range hint
938-964
: Extract repeated DOM boundary check into a utility function.The body/html tag check is duplicated across multiple places. Consider extracting it into a reusable utility function.
+const isBodyOrHtmlTag = (element: HTMLElement): boolean => { + const tagName = element.tagName.toLowerCase(); + return tagName === 'body' || tagName === 'html'; +}; -if (element.tagName.toLowerCase() === 'body' || - element.tagName.toLowerCase() === 'html') { +if (isBodyOrHtmlTag(element)) { break; } // ... later in the code ... -if (nextParent.tagName.toLowerCase() !== 'body' && - nextParent.tagName.toLowerCase() !== 'html') { +if (!isBodyOrHtmlTag(nextParent)) { element = nextParent; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts
(9 hunks)
🔇 Additional comments (1)
server/src/workflow-management/selector.ts (1)
1043-1048
: Reuse the shared table cell handling utility.
This is another occurrence of the duplicated table cell handling logic. Use the previously suggested getTableCellSelector
utility function here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
maxun-core/src/browserSide/scraper.js (4)
265-274
: Enhance field categorization robustnessThe field categorization logic could be improved in the following ways:
- Make the tag check case-insensitive
- Define table tags as a constant
+const TABLE_TAGS = ['TD', 'TH', 'TR']; + const tableFields = {}; const nonTableFields = {}; for (const [label, field] of Object.entries(fields)) { - if (['TD', 'TH', 'TR'].includes(field.tag)) { + if (field.tag && TABLE_TAGS.includes(field.tag.toUpperCase())) { tableFields[label] = field; } else { nonTableFields[label] = field; } }
284-285
: Optimize processing with early exitAdd an early exit when no items are found to avoid unnecessary loop setup.
+if (baseElements.length === 0) { + return scrapedData; +} + // Process each item up to the limit for (let i = 0; i < Math.min(baseElements.length, limit); i++) {
332-334
: Clarify empty record handlingMake the empty record check more explicit by checking for meaningful values.
-if (Object.keys(record).length > 0) { +// Only add records that have at least one non-null value +if (Object.values(record).some(value => value !== null && value !== '')) { scrapedData.push(record); }
337-340
: Optimize limit handlingThe current implementation modifies the array length after collection. Consider checking the limit during collection:
-if (scrapedData.length >= limit) { - scrapedData.length = limit; - break; -} +// Break early if we've reached the limit +if (scrapedData.length >= limit) break;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js
(1 hunks)
🔇 Additional comments (1)
maxun-core/src/browserSide/scraper.js (1)
265-342
: Verify scraping behavior with different table structures
Please ensure testing covers various table structures and edge cases:
- Tables with merged cells
- Nested tables
- Tables with missing cells
- Tables with mixed content types (text, links, images)
const parentElements = Array.from(document.querySelectorAll(listSelector)); | ||
const scrapedData = []; | ||
|
||
while (scrapedData.length < limit) { | ||
let parentElements = Array.from(document.querySelectorAll(listSelector)); | ||
for (const parent of parentElements) { | ||
// Get the first field's elements to determine how many items we have | ||
const firstField = Object.values(fields)[0]; | ||
const baseElements = Array.from(parent.querySelectorAll(firstField.selector)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add validation for field configuration
The code assumes the first field and its selector are always valid. Add validation to handle edge cases:
const parentElements = Array.from(document.querySelectorAll(listSelector));
const scrapedData = [];
+if (Object.keys(fields).length === 0) {
+ return scrapedData;
+}
+
const firstField = Object.values(fields)[0];
+if (!firstField?.selector) {
+ throw new Error('Invalid field configuration: missing selector');
+}
+
const baseElements = Array.from(parent.querySelectorAll(firstField.selector));
Committable suggestion skipped: line range outside the PR's diff.
// Process non-table fields | ||
for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) { | ||
const elements = Array.from(parent.querySelectorAll(selector)); | ||
// Use the same index to maintain correspondence between fields | ||
const element = elements[i]; | ||
|
||
if (element) { | ||
let value; | ||
if (attribute === 'innerText') { | ||
record[label] = fieldElement.innerText.trim(); | ||
value = element.innerText.trim(); | ||
} else if (attribute === 'innerHTML') { | ||
record[label] = fieldElement.innerHTML.trim(); | ||
} else if (attribute === 'src') { | ||
// Handle relative 'src' URLs | ||
const src = fieldElement.getAttribute('src'); | ||
record[label] = src ? new URL(src, window.location.origin).href : null; | ||
} else if (attribute === 'href') { | ||
// Handle relative 'href' URLs | ||
const href = fieldElement.getAttribute('href'); | ||
record[label] = href ? new URL(href, window.location.origin).href : null; | ||
value = element.innerHTML.trim(); | ||
} else if (attribute === 'src' || attribute === 'href') { | ||
const attrValue = element.getAttribute(attribute); | ||
value = attrValue ? new URL(attrValue, window.location.origin).href : null; | ||
} else { | ||
record[label] = fieldElement.getAttribute(attribute); | ||
value = element.getAttribute(attribute); | ||
} | ||
record[label] = value; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Apply extracted field processing logic
Apply the same helper function here to maintain consistency and reduce code duplication.
// Process non-table fields
for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
const element = elements[i];
- if (element) {
- let value;
- if (attribute === 'innerText') {
- value = element.innerText.trim();
- } else if (attribute === 'innerHTML') {
- value = element.innerHTML.trim();
- } else if (attribute === 'src' || attribute === 'href') {
- const attrValue = element.getAttribute(attribute);
- value = attrValue ? new URL(attrValue, window.location.origin).href : null;
- } else {
- value = element.getAttribute(attribute);
- }
- record[label] = value;
- }
+ const value = processFieldValue(element, attribute);
+ if (value !== null) {
+ record[label] = value;
+ }
}
Committable suggestion skipped: line range outside the PR's diff.
// Process table fields | ||
for (const [label, { selector, attribute }] of Object.entries(tableFields)) { | ||
const elements = Array.from(parent.querySelectorAll(selector)); | ||
// Use the same index to maintain correspondence between fields | ||
const element = elements[i]; | ||
|
||
const firstMatch = document.querySelector(listSelector); | ||
if (firstMatch) { | ||
// Get classes from the first matching element | ||
const firstMatchClasses = Array.from(firstMatch.classList); | ||
|
||
// Find similar elements by matching most of their classes | ||
parentElements = allChildren.filter(element => { | ||
const elementClasses = Array.from(element.classList); | ||
|
||
// Element should share at least 70% of classes with the first match | ||
const commonClasses = firstMatchClasses.filter(cls => | ||
elementClasses.includes(cls)); | ||
return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7); | ||
}); | ||
if (element) { | ||
let value; | ||
if (attribute === 'innerText') { | ||
value = element.innerText.trim(); | ||
} else if (attribute === 'innerHTML') { | ||
value = element.innerHTML.trim(); | ||
} else if (attribute === 'src' || attribute === 'href') { | ||
const attrValue = element.getAttribute(attribute); | ||
value = attrValue ? new URL(attrValue, window.location.origin).href : null; | ||
} else { | ||
value = element.getAttribute(attribute); | ||
} | ||
record[label] = value; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Extract common field processing logic
The attribute processing logic is duplicated between table and non-table fields. Extract it into a helper function:
+function processFieldValue(element, attribute) {
+ if (!element) return null;
+
+ switch (attribute) {
+ case 'innerText':
+ return element.innerText.trim();
+ case 'innerHTML':
+ return element.innerHTML.trim();
+ case 'src':
+ case 'href':
+ const attrValue = element.getAttribute(attribute);
+ try {
+ return attrValue ? new URL(attrValue, window.location.origin).href : null;
+ } catch (error) {
+ console.warn(`Invalid URL: ${attrValue}`);
+ return null;
+ }
+ default:
+ return element.getAttribute(attribute);
+ }
+}
+
// Process table fields
for (const [label, { selector, attribute }] of Object.entries(tableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
// Use the same index to maintain correspondence between fields
const element = elements[i];
- if (element) {
- let value;
- if (attribute === 'innerText') {
- value = element.innerText.trim();
- } else if (attribute === 'innerHTML') {
- value = element.innerHTML.trim();
- } else if (attribute === 'src' || attribute === 'href') {
- const attrValue = element.getAttribute(attribute);
- value = attrValue ? new URL(attrValue, window.location.origin).href : null;
- } else {
- value = element.getAttribute(attribute);
- }
- record[label] = value;
- }
+ const value = processFieldValue(element, attribute);
+ if (value !== null) {
+ record[label] = value;
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Process table fields | |
for (const [label, { selector, attribute }] of Object.entries(tableFields)) { | |
const elements = Array.from(parent.querySelectorAll(selector)); | |
// Use the same index to maintain correspondence between fields | |
const element = elements[i]; | |
const firstMatch = document.querySelector(listSelector); | |
if (firstMatch) { | |
// Get classes from the first matching element | |
const firstMatchClasses = Array.from(firstMatch.classList); | |
// Find similar elements by matching most of their classes | |
parentElements = allChildren.filter(element => { | |
const elementClasses = Array.from(element.classList); | |
// Element should share at least 70% of classes with the first match | |
const commonClasses = firstMatchClasses.filter(cls => | |
elementClasses.includes(cls)); | |
return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7); | |
}); | |
if (element) { | |
let value; | |
if (attribute === 'innerText') { | |
value = element.innerText.trim(); | |
} else if (attribute === 'innerHTML') { | |
value = element.innerHTML.trim(); | |
} else if (attribute === 'src' || attribute === 'href') { | |
const attrValue = element.getAttribute(attribute); | |
value = attrValue ? new URL(attrValue, window.location.origin).href : null; | |
} else { | |
value = element.getAttribute(attribute); | |
} | |
record[label] = value; | |
} | |
} | |
function processFieldValue(element, attribute) { | |
if (!element) return null; | |
switch (attribute) { | |
case 'innerText': | |
return element.innerText.trim(); | |
case 'innerHTML': | |
return element.innerHTML.trim(); | |
case 'src': | |
case 'href': | |
const attrValue = element.getAttribute(attribute); | |
try { | |
return attrValue ? new URL(attrValue, window.location.origin).href : null; | |
} catch (error) { | |
console.warn(`Invalid URL: ${attrValue}`); | |
return null; | |
} | |
default: | |
return element.getAttribute(attribute); | |
} | |
} | |
// Process table fields | |
for (const [label, { selector, attribute }] of Object.entries(tableFields)) { | |
const elements = Array.from(parent.querySelectorAll(selector)); | |
// Use the same index to maintain correspondence between fields | |
const element = elements[i]; | |
const value = processFieldValue(element, attribute); | |
if (value !== null) { | |
record[label] = value; | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (4)
maxun-core/src/browserSide/scraper.js (4)
281-296
: Extract MAX_DEPTH as a named constantConsider extracting the magic number
5
as a named constant at the module level for better maintainability.+const TABLE_ANCESTOR_MAX_DEPTH = 5; + function findTableAncestor(element) { let currentElement = element; - const MAX_DEPTH = 5; let depth = 0; - while (currentElement && depth < MAX_DEPTH) { + while (currentElement && depth < TABLE_ANCESTOR_MAX_DEPTH) {
307-321
: Optimize TH element detectionThe current implementation traverses up the DOM tree for each field. Consider caching the result or using
getElementsByTagName('TH')
for better performance.function hasThElement(row, tableFields) { + // Quick check first + if (row.getElementsByTagName('TH').length > 0) return true; + for (const [label, { selector }] of Object.entries(tableFields)) { const element = row.querySelector(selector); if (element) {
385-429
: Consider extracting element finding logicThe nested logic for finding elements is complex and could benefit from being extracted into a separate function for better maintainability.
+function findElementInTableCell(td, selector, attribute) { + // Direct selector match + let element = td.querySelector(selector); + + // Try selector without nth-child + if (!element && selector.split(">").pop().includes('td:nth-child')) { + element = td; + } + + // Try tag-only selector + if (!element) { + const tagOnlySelector = selector.split('.')[0]; + element = td.querySelector(tagOnlySelector); + } + + // Deep content search + if (!element) { + element = findContentChild(td, attribute); + } + + return element; +} if (cellIndex >= 0) { const td = currentRow.children[cellIndex]; if (td) { - element = td.querySelector(selector); - // ... existing fallback logic + element = findElementInTableCell(td, selector, attribute); } }
332-461
: Consider implementing a strategy pattern for field extractionThe current implementation mixes table and non-table field handling logic in a single function. Consider implementing a strategy pattern to separate these concerns:
- Create separate strategies for table and non-table fields
- Move the field classification logic into a factory
- Implement a common interface for field extraction
This would improve maintainability and make it easier to add new field types in the future.
Would you like me to provide a detailed example of how to implement this pattern?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js
(1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
const record = {}; | ||
|
||
for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) { | ||
const elements = Array.from(parent.querySelectorAll(selector)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix undefined variable reference
The code uses an undefined parent
variable instead of the available container
variable.
-const elements = Array.from(parent.querySelectorAll(selector));
+const elements = Array.from(container.querySelectorAll(selector));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const elements = Array.from(parent.querySelectorAll(selector)); | |
const elements = Array.from(container.querySelectorAll(selector)); |
function getCellIndex(td) { | ||
let index = 0; | ||
let sibling = td; | ||
while (sibling = sibling.previousElementSibling) { | ||
index++; | ||
} | ||
return index; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Refactor the while loop condition
The assignment in the while condition was flagged by static analysis. Consider restructuring for better readability.
function getCellIndex(td) {
let index = 0;
let sibling = td;
- while (sibling = sibling.previousElementSibling) {
+ while (sibling.previousElementSibling) {
+ sibling = sibling.previousElementSibling;
index++;
}
return index;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
function getCellIndex(td) { | |
let index = 0; | |
let sibling = td; | |
while (sibling = sibling.previousElementSibling) { | |
index++; | |
} | |
return index; | |
} | |
function getCellIndex(td) { | |
let index = 0; | |
let sibling = td; | |
while (sibling.previousElementSibling) { | |
sibling = sibling.previousElementSibling; | |
index++; | |
} | |
return index; | |
} |
🧰 Tools
🪛 Biome (1.9.4)
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
// Helper function to extract values from elements | ||
function extractValue(element, attribute) { | ||
if (!element) return null; | ||
|
||
if (attribute === 'innerText') { | ||
return element.innerText.trim(); | ||
} else if (attribute === 'innerHTML') { | ||
return element.innerHTML.trim(); | ||
} else if (attribute === 'src' || attribute === 'href') { | ||
const attrValue = element.getAttribute(attribute); | ||
return attrValue ? new URL(attrValue, window.location.origin).href : null; | ||
} | ||
return element.getAttribute(attribute); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for URL construction
The URL construction could throw errors for invalid URLs. Consider adding try-catch block to handle malformed URLs gracefully.
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
- return attrValue ? new URL(attrValue, window.location.origin).href : null;
+ if (!attrValue) return null;
+ try {
+ return new URL(attrValue, window.location.origin).href;
+ } catch (error) {
+ console.warn(`Invalid URL: ${attrValue}`);
+ return null;
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Helper function to extract values from elements | |
function extractValue(element, attribute) { | |
if (!element) return null; | |
if (attribute === 'innerText') { | |
return element.innerText.trim(); | |
} else if (attribute === 'innerHTML') { | |
return element.innerHTML.trim(); | |
} else if (attribute === 'src' || attribute === 'href') { | |
const attrValue = element.getAttribute(attribute); | |
return attrValue ? new URL(attrValue, window.location.origin).href : null; | |
} | |
return element.getAttribute(attribute); | |
} | |
// Helper function to extract values from elements | |
function extractValue(element, attribute) { | |
if (!element) return null; | |
if (attribute === 'innerText') { | |
return element.innerText.trim(); | |
} else if (attribute === 'innerHTML') { | |
return element.innerHTML.trim(); | |
} else if (attribute === 'src' || attribute === 'href') { | |
const attrValue = element.getAttribute(attribute); | |
if (!attrValue) return null; | |
try { | |
return new URL(attrValue, window.location.origin).href; | |
} catch (error) { | |
console.warn(`Invalid URL: ${attrValue}`); | |
return null; | |
} | |
} | |
return element.getAttribute(attribute); | |
} |
let currentElement = td; | ||
while (currentElement && currentElement.children.length > 0) { | ||
let foundContentChild = false; | ||
for (const child of currentElement.children) { | ||
if (extractValue(child, attribute)) { | ||
currentElement = child; | ||
foundContentChild = true; | ||
break; | ||
} | ||
} | ||
if (!foundContentChild) break; | ||
} | ||
element = currentElement; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add depth limit to content child traversal
The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.
+const MAX_CONTENT_SEARCH_DEPTH = 3;
+let depth = 0;
let currentElement = td;
-while (currentElement && currentElement.children.length > 0) {
+while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
let foundContentChild = false;
for (const child of currentElement.children) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
+ depth++;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
let currentElement = td; | |
while (currentElement && currentElement.children.length > 0) { | |
let foundContentChild = false; | |
for (const child of currentElement.children) { | |
if (extractValue(child, attribute)) { | |
currentElement = child; | |
foundContentChild = true; | |
break; | |
} | |
} | |
if (!foundContentChild) break; | |
} | |
element = currentElement; | |
} | |
const MAX_CONTENT_SEARCH_DEPTH = 3; | |
let depth = 0; | |
let currentElement = td; | |
while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) { | |
let foundContentChild = false; | |
for (const child of currentElement.children) { | |
if (extractValue(child, attribute)) { | |
currentElement = child; | |
foundContentChild = true; | |
break; | |
} | |
} | |
if (!foundContentChild) break; | |
depth++; | |
} | |
element = currentElement; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
maxun-core/src/browserSide/scraper.js (3)
276-282
:⚠️ Potential issueAdd validation for fields object and selector
The code assumes the fields object is not empty and the first field has a valid selector. This could lead to runtime errors.
This was previously identified in an earlier review. The same validation is still needed:
+ if (Object.keys(fields).length === 0) { + return []; + } + const firstField = Object.values(fields)[0]; + if (!firstField?.selector) { + throw new Error('Invalid field configuration: missing selector'); + }
288-308
: 🛠️ Refactor suggestionExtract common field processing logic and add URL error handling
The attribute processing logic is duplicated and URL construction lacks error handling.
These issues were previously identified. The same solutions are still needed:
- Extract the common processing logic into a helper function
- Add error handling for URL construction
See the previous review comments for the detailed implementation suggestions.
310-330
:⚠️ Potential issueFix undefined variable reference
The code uses an undefined
parent
variable in the querySelector call.This issue was previously identified. The same fix is still needed:
-const elements = Array.from(parent.querySelectorAll(selector)); +const elements = Array.from(container.querySelectorAll(selector));
🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)
265-274
: Consider using Set for tag lookupThe tag checking logic can be optimized by using a Set for constant-time lookups.
- const tableFields = {}; - const nonTableFields = {}; - - for (const [label, field] of Object.entries(fields)) { - if (['TD', 'TH', 'TR'].includes(field.tag)) { - tableFields[label] = field; - } else { - nonTableFields[label] = field; - } - } + const TABLE_TAGS = new Set(['TD', 'TH', 'TR']); + const tableFields = {}; + const nonTableFields = {}; + + for (const [label, field] of Object.entries(fields)) { + if (TABLE_TAGS.has(field.tag)) { + tableFields[label] = field; + } else { + nonTableFields[label] = field; + } + }
332-342
: Optimize limit enforcementThe current implementation continues processing even after reaching the limit. Consider breaking out of the inner loop when the limit is reached.
if (Object.keys(record).length > 0) { scrapedData.push(record); + if (scrapedData.length >= limit) { + break; + } } } -if (scrapedData.length >= limit) { - scrapedData.length = limit; - break; -}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
maxun-core/src/browserSide/scraper.js (1)
307-330
: Optimize table header detection performanceThe current implementation traverses the DOM multiple times. Consider caching the results of header detection.
Apply this diff to improve performance:
+function memoizeHeaderCheck(row) { + const cache = new WeakMap(); + return (selector) => { + if (cache.has(row)) return cache.get(row); + const result = hasThElement(row, { test: { selector } }); + cache.set(row, result); + return result; + }; +} function hasThElement(row, tableFields) { + const checkHeader = memoizeHeaderCheck(row); for (const [label, { selector }] of Object.entries(tableFields)) { - const element = row.querySelector(selector); - if (element) { - let current = element; - while (current && current !== row) { - if (current.tagName === 'TH') { - return true; - } - current = current.parentElement; - } - } + if (checkHeader(selector)) return true; } return false; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js
(1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (5)
maxun-core/src/browserSide/scraper.js (5)
265-278
: Add error handling for URL construction
The URL construction could throw errors for invalid URLs. This issue was previously identified and remains unaddressed.
Apply this diff to handle malformed URLs gracefully:
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
- return attrValue ? new URL(attrValue, window.location.origin).href : null;
+ if (!attrValue) return null;
+ try {
+ return new URL(attrValue, window.location.origin).href;
+ } catch (error) {
+ console.warn(`Invalid URL: ${attrValue}`);
+ return null;
+ }
}
280-296
: LGTM! Well-structured ancestor search
The function efficiently finds table ancestors with a reasonable depth limit and early returns.
298-305
: Improve readability of the while loop
The assignment in the while condition was flagged by static analysis and makes the code less readable.
Apply this diff to restructure the loop:
function getCellIndex(td) {
let index = 0;
let sibling = td;
- while (sibling = sibling.previousElementSibling) {
+ while (sibling.previousElementSibling) {
+ sibling = sibling.previousElementSibling;
index++;
}
return index;
}
🧰 Tools
🪛 Biome (1.9.4)
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
408-420
: Add depth limit to content child traversal
The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.
Apply this diff to add a depth limit:
+const MAX_CONTENT_SEARCH_DEPTH = 3;
+let depth = 0;
let currentElement = td;
-while (currentElement && currentElement.children.length > 0) {
+while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
let foundContentChild = false;
for (const child of currentElement.children) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
+ depth++;
}
372-375
: Add validation for field configuration
The code assumes the first field and its selector are always valid.
Apply this diff to handle edge cases:
if (Object.keys(tableFields).length > 0) {
const firstField = Object.values(tableFields)[0];
+ if (!firstField?.selector) {
+ throw new Error('Invalid field configuration: missing selector');
+ }
const firstElement = container.querySelector(firstField.selector);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
maxun-core/src/browserSide/scraper.js (2)
298-305
: 🛠️ Refactor suggestionImprove readability of the while loop
The assignment in the while condition was flagged by static analysis and makes the code harder to read.
Apply this diff to improve the implementation:
function getCellIndex(td) { let index = 0; let sibling = td; - while (sibling = sibling.previousElementSibling) { + while (sibling.previousElementSibling) { + sibling = sibling.previousElementSibling; index++; } return index; }🧰 Tools
🪛 Biome (1.9.4)
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.(lint/suspicious/noAssignInExpressions)
432-445
: 🛠️ Refactor suggestionAdd depth limit to content traversal
The deep traversal for finding content children could lead to performance issues with deeply nested tables.
Apply this diff to improve the implementation:
+ const MAX_CONTENT_SEARCH_DEPTH = 3; + let depth = 0; let currentElement = td; - while (currentElement && currentElement.children.length > 0) { + while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) { let foundContentChild = false; for (const child of currentElement.children) { if (extractValue(child, attribute)) { currentElement = child; foundContentChild = true; break; } } if (!foundContentChild) break; + depth++; }
🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)
280-296
: Optimize table ancestor lookup with early returnsThe function can be simplified and made more efficient with early returns.
Apply this diff to improve the implementation:
function findTableAncestor(element) { let currentElement = element; const MAX_DEPTH = 5; let depth = 0; while (currentElement && depth < MAX_DEPTH) { - if (currentElement.tagName === 'TD') { - return { type: 'TD', element: currentElement }; - } else if (currentElement.tagName === 'TR') { - return { type: 'TR', element: currentElement }; - } + switch (currentElement.tagName) { + case 'TD': return { type: 'TD', element: currentElement }; + case 'TR': return { type: 'TR', element: currentElement }; + } currentElement = currentElement.parentElement; depth++; } return null; }
336-359
: Improve container matching logicThe container matching logic can be improved for better maintainability and readability.
Apply this diff to improve the implementation:
+ const CLASS_MATCH_THRESHOLD = 0.7; if (limit > 1 && containers.length <= 1) { - const [containerSelector, _] = listSelector.split('>').map(s => s.trim()); + const [containerSelector] = listSelector.split('>').map(s => s.trim()); const container = document.querySelector(containerSelector); if (container) { const allChildren = Array.from(container.children); const firstMatch = document.querySelector(listSelector); if (firstMatch) { const firstMatchClasses = Array.from(firstMatch.classList); containers = allChildren.filter(element => { const elementClasses = Array.from(element.classList); const commonClasses = firstMatchClasses.filter(cls => elementClasses.includes(cls)); - return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7); + return commonClasses.length >= Math.floor(firstMatchClasses.length * CLASS_MATCH_THRESHOLD); }); } } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js
(1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (4)
maxun-core/src/browserSide/scraper.js (4)
265-278
: Enhance error handling and simplify attribute handling logic
The function could benefit from better error handling and a more maintainable structure.
Apply this diff to improve the implementation:
function extractValue(element, attribute) {
if (!element) return null;
- if (attribute === 'innerText') {
- return element.innerText.trim();
- } else if (attribute === 'innerHTML') {
- return element.innerHTML.trim();
- } else if (attribute === 'src' || attribute === 'href') {
- const attrValue = element.getAttribute(attribute);
- return attrValue ? new URL(attrValue, window.location.origin).href : null;
- }
- return element.getAttribute(attribute);
+ switch (attribute) {
+ case 'innerText':
+ return element.innerText.trim();
+ case 'innerHTML':
+ return element.innerHTML.trim();
+ case 'src':
+ case 'href':
+ const attrValue = element.getAttribute(attribute);
+ if (!attrValue) return null;
+ try {
+ return new URL(attrValue, window.location.origin).href;
+ } catch (error) {
+ console.warn(`Invalid URL: ${attrValue}`);
+ return null;
+ }
+ default:
+ return element.getAttribute(attribute);
+ }
}
361-387
: LGTM! Field classification logic is well-structured
The field classification logic properly handles both table and non-table fields, with good error handling for missing elements.
464-485
: LGTM! Non-table data processing is efficient
The non-table data processing logic is well-implemented, with proper limit handling and efficient element selection.
488-490
: LGTM! Results merging is straightforward
The merging of table and non-table data is implemented correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
maxun-core/src/browserSide/scraper.js (3)
307-321
: Optimize table header detectionThe current implementation traverses the DOM tree multiple times for each field. Consider caching the TH check results.
function hasThElement(row, tableFields) { + // Cache TH elements check + const hasThChildren = row.getElementsByTagName('TH').length > 0; + if (hasThChildren) return true; + for (const [label, { selector }] of Object.entries(tableFields)) { const element = row.querySelector(selector); if (element) { let current = element; while (current && current !== row) { if (current.tagName === 'TH') { return true; } current = current.parentElement; } } } return false; }
336-359
: Enhance container matching robustnessThe current implementation has two potential issues:
- The class matching threshold (0.7) is hardcoded
- Empty class lists aren't handled explicitly
+ const CLASS_MATCH_THRESHOLD = 0.7; // Make configurable + if (limit > 1 && containers.length <= 1) { const [containerSelector, _] = listSelector.split('>').map(s => s.trim()); const container = document.querySelector(containerSelector); if (container) { const allChildren = Array.from(container.children); const firstMatch = document.querySelector(listSelector); if (firstMatch) { const firstMatchClasses = Array.from(firstMatch.classList); + + // Skip class matching if no classes present + if (firstMatchClasses.length === 0) { + return [firstMatch]; + } containers = allChildren.filter(element => { const elementClasses = Array.from(element.classList); const commonClasses = firstMatchClasses.filter(cls => elementClasses.includes(cls)); - return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7); + return commonClasses.length >= Math.floor(firstMatchClasses.length * CLASS_MATCH_THRESHOLD); }); } } }
488-490
: Add final limit validationEnsure the combined results don't exceed the specified limit.
// Merge and limit the results const scrapedData = [...tableData, ...nonTableData]; - return scrapedData; + return scrapedData.slice(0, limit);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js
(1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (3)
maxun-core/src/browserSide/scraper.js (3)
464-485
: LGTM! Non-table data processing looks good
The implementation correctly handles limits and field extraction.
298-305
: 🛠️ Refactor suggestion
Refactor the while loop for better readability
The assignment within the while condition makes the code harder to read and maintain. This was flagged by static analysis.
function getCellIndex(td) {
let index = 0;
let sibling = td;
- while (sibling = sibling.previousElementSibling) {
+ while (sibling.previousElementSibling) {
+ sibling = sibling.previousElementSibling;
index++;
}
return index;
}
Likely invalid or redundant comment.
🧰 Tools
🪛 Biome (1.9.4)
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
433-443
: 🛠️ Refactor suggestion
Add depth limit to content child traversal
The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.
+ const MAX_CONTENT_DEPTH = 3;
+ let depth = 0;
while (currentElement && currentElement.children.length > 0) {
let foundContentChild = false;
for (const child of currentElement.children) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
+ if (++depth >= MAX_CONTENT_DEPTH) break;
}
Likely invalid or redundant comment.
if (cellIndex >= 0) { | ||
const td = currentRow.children[cellIndex]; | ||
if (td) { | ||
element = td.querySelector(selector); | ||
|
||
if (!element && selector.split(">").pop().includes('td:nth-child')) { | ||
element = td; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add validation for cellIndex access
The code accesses currentRow.children[cellIndex]
without validating the index, which could lead to undefined behavior.
if (cellIndex >= 0) {
+ const cells = currentRow.children;
+ if (cellIndex >= cells.length) {
+ console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);
+ continue;
+ }
const td = currentRow.children[cellIndex];
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (cellIndex >= 0) { | |
const td = currentRow.children[cellIndex]; | |
if (td) { | |
element = td.querySelector(selector); | |
if (!element && selector.split(">").pop().includes('td:nth-child')) { | |
element = td; | |
} | |
if (cellIndex >= 0) { | |
const cells = currentRow.children; | |
if (cellIndex >= cells.length) { | |
console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`); | |
continue; | |
} | |
const td = currentRow.children[cellIndex]; | |
if (td) { | |
element = td.querySelector(selector); | |
if (!element && selector.split(">").pop().includes('td:nth-child')) { | |
element = td; | |
} |
containers.forEach((container, containerIndex) => { | ||
for (const [label, field] of Object.entries(fields)) { | ||
const sampleElement = container.querySelector(field.selector); | ||
|
||
if (sampleElement) { | ||
const ancestor = findTableAncestor(sampleElement); | ||
if (ancestor) { | ||
containerFields[containerIndex].tableFields[label] = { | ||
...field, | ||
tableContext: ancestor.type, | ||
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1 | ||
}; | ||
} else { | ||
containerFields[containerIndex].nonTableFields[label] = field; | ||
} | ||
} else { | ||
containerFields[containerIndex].nonTableFields[label] = field; | ||
} | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for invalid selectors
The field classification process should handle invalid selectors gracefully to prevent runtime errors.
containers.forEach((container, containerIndex) => {
for (const [label, field] of Object.entries(fields)) {
+ try {
const sampleElement = container.querySelector(field.selector);
if (sampleElement) {
const ancestor = findTableAncestor(sampleElement);
if (ancestor) {
containerFields[containerIndex].tableFields[label] = {
...field,
tableContext: ancestor.type,
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
};
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
+ } catch (error) {
+ console.warn(`Invalid selector for field "${label}": ${field.selector}`);
+ containerFields[containerIndex].nonTableFields[label] = field;
+ }
}
});
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
containers.forEach((container, containerIndex) => { | |
for (const [label, field] of Object.entries(fields)) { | |
const sampleElement = container.querySelector(field.selector); | |
if (sampleElement) { | |
const ancestor = findTableAncestor(sampleElement); | |
if (ancestor) { | |
containerFields[containerIndex].tableFields[label] = { | |
...field, | |
tableContext: ancestor.type, | |
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1 | |
}; | |
} else { | |
containerFields[containerIndex].nonTableFields[label] = field; | |
} | |
} else { | |
containerFields[containerIndex].nonTableFields[label] = field; | |
} | |
} | |
}); | |
containers.forEach((container, containerIndex) => { | |
for (const [label, field] of Object.entries(fields)) { | |
try { | |
const sampleElement = container.querySelector(field.selector); | |
if (sampleElement) { | |
const ancestor = findTableAncestor(sampleElement); | |
if (ancestor) { | |
containerFields[containerIndex].tableFields[label] = { | |
...field, | |
tableContext: ancestor.type, | |
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1 | |
}; | |
} else { | |
containerFields[containerIndex].nonTableFields[label] = field; | |
} | |
} else { | |
containerFields[containerIndex].nonTableFields[label] = field; | |
} | |
} catch (error) { | |
console.warn(`Invalid selector for field "${label}": ${field.selector}`); | |
containerFields[containerIndex].nonTableFields[label] = field; | |
} | |
} | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)
94-124
: Improve readability of parent element traversal logic.The parent element traversal logic is complex and could benefit from:
- Better variable names (e.g.,
fullyContained
->isParentFullyContainingChild
).- Documentation explaining the geometric calculations.
- Extraction of the containment check into a separate function.
+const isParentContainingChild = (parent: HTMLElement, child: HTMLElement): boolean => { + const parentRect = parent.getBoundingClientRect(); + const childRect = child.getBoundingClientRect(); + + const isFullyContained = + parentRect.left <= childRect.left && + parentRect.right >= childRect.right && + parentRect.top <= childRect.top && + parentRect.bottom >= childRect.bottom; + + const containmentRatio = (childRect.width * childRect.height) / + (parentRect.width * parentRect.height); + const hasSignificantOverlap = containmentRatio > 0.5; + + return isFullyContained && hasSignificantOverlap; +}; // Replace the inline logic with: -const parentRect = element.parentElement.getBoundingClientRect(); -const childRect = element.getBoundingClientRect(); - -const fullyContained = - parentRect.left <= childRect.left && - parentRect.right >= childRect.right && - parentRect.top <= childRect.top && - parentRect.bottom >= childRect.bottom; - -const significantOverlap = - (childRect.width * childRect.height) / - (parentRect.width * parentRect.height) > 0.5; - -if (fullyContained && significantOverlap) { +if (isParentContainingChild(element.parentElement, element)) {
912-918
: Improve table cell selector generation.While the nth-child selector generation is correct, consider:
- Validating that the parent is a valid table row (
<tr>
).- Adding error handling for malformed table structures.
if (selector === 'td' && element.parentElement) { + if (element.parentElement.tagName !== 'TR') { + console.warn('Table cell found outside of a table row'); + return selector; + } const siblings = Array.from(element.parentElement.children); const position = siblings.indexOf(element) + 1; return `${selector}:nth-child(${position})`; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts
(6 hunks)
🔇 Additional comments (1)
server/src/workflow-management/selector.ts (1)
87-92
: 🛠️ Refactor suggestion
Extract table cell handling into a shared utility function.
This table cell handling logic is duplicated across multiple functions. Consider extracting it into a shared utility function to improve maintainability and reduce code duplication.
+const findTableAncestor = (element: HTMLElement): HTMLElement => {
+ if ((element.tagName === 'TD' || element.tagName === 'TH') && element.closest('table')) {
+ return element.closest('table')!;
+ }
+ return element;
+};
// Replace duplicated code with:
-if (element.tagName === 'TD' || element.tagName === 'TH') {
- const tableParent = element.closest('table');
- if (tableParent) {
- element = tableParent;
- }
-}
+element = findTableAncestor(element);
Likely invalid or redundant comment.
if (element.tagName === 'TD' || element.tagName === 'TH') { | ||
const tableParent = element.closest('table'); | ||
if (tableParent) { | ||
element = tableParent; | ||
} | ||
} | ||
|
||
if (fullyContained && significantOverlap) { | ||
element = element.parentElement; | ||
} else { | ||
break; | ||
if (element.tagName !== 'TABLE') { | ||
while (element.parentElement) { | ||
if (element.tagName.toLowerCase() === 'body' || | ||
element.tagName.toLowerCase() === 'html') { | ||
break; | ||
} | ||
|
||
const parentRect = element.parentElement.getBoundingClientRect(); | ||
const childRect = element.getBoundingClientRect(); | ||
|
||
const fullyContained = | ||
parentRect.left <= childRect.left && | ||
parentRect.right >= childRect.right && | ||
parentRect.top <= childRect.top && | ||
parentRect.bottom >= childRect.bottom; | ||
|
||
const significantOverlap = | ||
(childRect.width * childRect.height) / | ||
(parentRect.width * parentRect.height) > 0.5; | ||
|
||
if (fullyContained && significantOverlap) { | ||
const nextParent = element.parentElement; | ||
if (nextParent.tagName.toLowerCase() !== 'body' && | ||
nextParent.tagName.toLowerCase() !== 'html') { | ||
element = nextParent; | ||
} else { | ||
break; | ||
} | ||
} else { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Eliminate code duplication in getRect function.
The table cell handling and parent traversal logic is duplicated from the getElementInformation
function. This violates the DRY principle and makes maintenance harder.
Consider:
- Using the proposed
findTableAncestor
utility function. - Extracting the parent traversal logic into a shared function.
- Creating a shared module for these utility functions.
if (selector === 'td' && element.parentElement) { | ||
const siblings = Array.from(element.parentElement.children); | ||
const position = siblings.indexOf(element) + 1; | ||
return `${selector}:nth-child(${position})`; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consolidate duplicate table cell selector logic.
The table cell selector generation logic is duplicated in three different places. This makes the code harder to maintain and increases the risk of inconsistencies.
Extract this logic into a shared utility function:
+const getTableCellSelector = (element: HTMLElement): string | null => {
+ if (element.tagName.toLowerCase() !== 'td' || !element.parentElement) {
+ return null;
+ }
+ if (element.parentElement.tagName !== 'TR') {
+ console.warn('Table cell found outside of a table row');
+ return null;
+ }
+ const siblings = Array.from(element.parentElement.children);
+ const position = siblings.indexOf(element) + 1;
+ return `td:nth-child(${position})`;
+};
// Replace all occurrences with:
-if (selector === 'td' && element.parentElement) {
- const siblings = Array.from(element.parentElement.children);
- const position = siblings.indexOf(element) + 1;
- return `${selector}:nth-child(${position})`;
-}
+const cellSelector = getTableCellSelector(element);
+if (cellSelector) {
+ return cellSelector;
+}
Also applies to: 1067-1072
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)
280-296
: Optimize table ancestor search with early returnsThe function can be simplified using early returns for better readability:
function findTableAncestor(element) { let currentElement = element; const MAX_DEPTH = 5; let depth = 0; while (currentElement && depth < MAX_DEPTH) { - if (currentElement.tagName === 'TD') { - return { type: 'TD', element: currentElement }; - } else if (currentElement.tagName === 'TR') { - return { type: 'TR', element: currentElement }; - } + switch (currentElement.tagName) { + case 'TD': return { type: 'TD', element: currentElement }; + case 'TR': return { type: 'TR', element: currentElement }; + } currentElement = currentElement.parentElement; depth++; } return null; }
452-466
: Add depth limit to content child traversalThe deep traversal for finding content children lacks a depth limit:
+ const MAX_CONTENT_SEARCH_DEPTH = 3; + let depth = 0; let currentElement = td; - while (currentElement && currentElement.children.length > 0) { + while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) { let foundContentChild = false; for (const child of currentElement.children) { if (extractValue(child, attribute)) { currentElement = child; foundContentChild = true; break; } } if (!foundContentChild) break; + depth++; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
maxun-core/src/browserSide/scraper.js
(1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (6)
maxun-core/src/browserSide/scraper.js (6)
485-507
: LGTM! Clean implementation of non-table data processing
The non-table data processing implementation is clean, efficient, and properly handles limits.
509-511
: LGTM! Effective data merging
The data merging implementation is concise and effective.
265-278
: 🛠️ Refactor suggestion
Add error handling for URL construction
The URL construction could throw errors for invalid URLs. Consider adding try-catch block:
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
- return attrValue ? new URL(attrValue, window.location.origin).href : null;
+ if (!attrValue) return null;
+ try {
+ return new URL(attrValue, window.location.origin).href;
+ } catch (error) {
+ console.warn(`Invalid URL: ${attrValue}`);
+ return null;
+ }
}
Likely invalid or redundant comment.
298-305
: 🛠️ Refactor suggestion
Refactor the while loop condition
The assignment in the while condition was flagged by static analysis. Consider restructuring for better readability:
function getCellIndex(td) {
let index = 0;
let sibling = td;
- while (sibling = sibling.previousElementSibling) {
+ while (sibling.previousElementSibling) {
+ sibling = sibling.previousElementSibling;
index++;
}
return index;
}
Likely invalid or redundant comment.
🧰 Tools
🪛 Biome (1.9.4)
[error] 301-301: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
438-446
: 🛠️ Refactor suggestion
Add validation for cellIndex access
The code accesses currentRow.children[cellIndex]
without validating the index:
if (cellIndex >= 0) {
+ const cells = currentRow.children;
+ if (cellIndex >= cells.length) {
+ console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);
+ continue;
+ }
const td = currentRow.children[cellIndex];
Likely invalid or redundant comment.
389-407
: 🛠️ Refactor suggestion
Add error handling for invalid selectors
The field classification process should handle invalid selectors gracefully:
containers.forEach((container, containerIndex) => {
for (const [label, field] of Object.entries(fields)) {
+ try {
const sampleElement = container.querySelector(field.selector);
if (sampleElement) {
const ancestor = findTableAncestor(sampleElement);
if (ancestor) {
containerFields[containerIndex].tableFields[label] = {
...field,
tableContext: ancestor.type,
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
};
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
+ } catch (error) {
+ console.warn(`Invalid selector for field "${label}": ${field.selector}`);
+ containerFields[containerIndex].nonTableFields[label] = field;
+ }
}
});
Likely invalid or redundant comment.
Enhancing Scrape List to include table scraping.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation