Browser Extraction
Extract data from web pages using headless browser automation.
Prerequisites
Browser nodes use Playwright with Chromium. Both are bundled in the Radhflow Docker image, so no extra installation is required. If you’re running outside Docker, install Playwright separately:

```shell
npx playwright install chromium
```

Config fields
| Field | Required | Default | Description |
|---|---|---|---|
| url | Yes | - | Target URL. Supports {{ }} templates. |
| selector | Per action | - | CSS selector for the target element. |
| wait_for | No | load | Page load strategy: load, domcontentloaded, or networkidle. |
| script | No | - | Custom JavaScript to run in the page context. |
| screenshot | No | - | Path to save a screenshot (relative to project root). |
| item_selector | browser.list only | - | CSS selector for repeated items on a listing page. |
| steps | Yes | - | Ordered extraction steps. |
Examples
Scrape a table
Extract structured data from a page. Each extract step pulls one field.

```yaml
scrape-profile:
  type: service
  op: browser.extract
  params:
    url: "https://example.com/profile/{{ user_id }}"
    steps:
      - action: extract
        selector: h1.profile-name
        field: name
      - action: extract
        selector: span.title
        field: title
      - action: extract
        selector: div.bio
        field: bio
  inputs:
    request: { type: Record, from: ref(lookup.user) }
  outputs:
    profile:
      type: Record
      schema:
        name: { type: string }
        title: { type: string }
        bio: { type: string }
```

When the input is a Table, extraction runs once per row.
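As a rough illustration of the per-row behavior, the scrape-profile node could be fed a Table instead of a Record. This is a sketch under assumptions: the upstream reference `ref(load-users.users)` is hypothetical, each row is assumed to carry a `user_id` column for the URL template, and the Table output type is a guess at how per-row results are collected.

```yaml
# Sketch only: the upstream node name and the output typing are hypothetical.
scrape-profiles:
  type: service
  op: browser.extract
  params:
    # {{ user_id }} is assumed to resolve against the current input row
    url: "https://example.com/profile/{{ user_id }}"
    steps:
      - action: extract
        selector: h1.profile-name
        field: name
  inputs:
    # Table input: the extraction runs once per row
    request: { type: Table, from: ref(load-users.users) }
  outputs:
    # Assumed: one output row per input row
    profiles: { type: Table }
```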
Extract text from a listing
Use browser.list to collect repeated items from a page. Each item becomes a row.

```yaml
list-products:
  type: source
  op: browser.list
  params:
    url: https://shop.example.com/catalog
    item_selector: div.product-card
    steps:
      - action: extract
        selector: h3.product-name
        field: name
      - action: extract
        selector: span.price
        field: price
      - action: extract
        selector: a.product-link
        attribute: href
        field: url
  outputs:
    products: { type: Table }
```

Run custom JavaScript
Execute arbitrary JS in the page context and capture the result.

```yaml
run-script:
  type: service
  op: browser.extract
  params:
    url: https://example.com/dashboard
    steps:
      - action: navigate
        wait_for: networkidle
      - action: script
        code: |
          return JSON.stringify(
            Array.from(document.querySelectorAll('.metric'))
              .map(el => ({ label: el.dataset.label, value: el.textContent }))
          );
        field: metrics
  outputs:
    data: { type: Record }
```

Take a screenshot
```yaml
capture-page:
  type: service
  op: browser.extract
  params:
    url: https://example.com/report
    steps:
      - action: navigate
        wait_for: networkidle
        screenshot: artifacts/report.png
  outputs:
    result: { type: Record }
```

Selectors
You can target elements with CSS selectors or XPath.

```yaml
# CSS selector
- action: extract
  selector: "div.content > h1"
  field: title

# XPath
- action: extract
  selector: "xpath=//table[@id='results']//tr[2]/td[1]"
  field: first_cell
```

CSS selectors are simpler and faster. Use XPath when you need positional logic (e.g., “the second row in the third table”).
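The positional case mentioned above ("the second row in the third table") could be written roughly as follows. This is a sketch: the field name is illustrative. In XPath, `(//table)[3]` picks the third table in document order, and `//tr[2]` then matches any `tr` inside it that is the second `tr` child of its parent, so a table with separate header and body groups may yield more than one match.

```yaml
# Sketch: second row of the third table on the page (field name is illustrative)
- action: extract
  selector: "xpath=(//table)[3]//tr[2]"
  field: second_row
```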
Semantic selectors describe elements by their visual role. At creation time, the code agent resolves these to concrete CSS selectors.
```yaml
- action: extract
  semantic: "the price displayed near the buy button"
  field: price
```

Step actions
| Action | Description |
|---|---|
| navigate | Load the page. Set wait_for: networkidle for SPAs. |
| click | Click an element. Set wait_after (ms) for dynamic content. |
| extract | Pull text or an attribute from an element into a named field. |
| script | Run custom JavaScript. Return a value to capture it. |

```yaml
steps:
  - action: navigate
    wait_for: networkidle
  - action: click
    selector: button.load-more
    wait_after: 1000
  - action: extract
    selector: h1.title
    field: title
```

Troubleshooting
Element not found. The selector doesn’t match any element on the page. Open the URL in a browser, right-click the target element, and use “Copy selector” to get the correct CSS path. If the page uses dynamic rendering, add wait_for: networkidle to the navigate step.
Timeout. The page took too long to load. Increase the timeout or switch wait_for from networkidle to domcontentloaded if you don’t need the page fully loaded.
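For the timeout case, a navigate step that waits only for the DOM to be parsed might look like the sketch below. It assumes the element you extract is present in the initial HTML rather than rendered later by scripts; the selector and field name are illustrative.

```yaml
steps:
  # Continue once the DOM is parsed instead of waiting for network activity to settle
  - action: navigate
    wait_for: domcontentloaded
  - action: extract
    selector: h1.title
    field: title
```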
Dynamic content not loading. Single-page apps render content after the initial page load. Use wait_for: networkidle on the navigate step, or add a click step followed by wait_after to trigger lazy-loaded content.