
Browser Extraction

Extract data from web pages using headless browser automation.

Browser nodes use Playwright with Chromium. Both are bundled in the Radhflow Docker image — no extra installation required. If you’re running outside Docker, install Playwright separately:

```sh
npx playwright install chromium
```
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `url` | Yes | | Target URL. Supports `{{ }}` templates. |
| `selector` | Per action | | CSS selector for the target element. |
| `wait_for` | No | `load` | Page load strategy: `load`, `domcontentloaded`, or `networkidle`. |
| `script` | No | | Custom JavaScript to run in the page context. |
| `screenshot` | No | | Path to save a screenshot (relative to the project root). |
| `item_selector` | `browser.list` only | | CSS selector for repeated items on a listing page. |
| `steps` | Yes | | Ordered extraction steps. |

Extract structured data from a page. Each extract step pulls one field.

flow.yaml

```yaml
scrape-profile:
  type: service
  op: browser.extract
  params:
    url: "https://example.com/profile/{{ user_id }}"
    steps:
      - action: extract
        selector: h1.profile-name
        field: name
      - action: extract
        selector: span.title
        field: title
      - action: extract
        selector: div.bio
        field: bio
  inputs:
    request: { type: Record, from: ref(lookup.user) }
  outputs:
    profile:
      type: Record
      schema:
        name: { type: string }
        title: { type: string }
        bio: { type: string }
```

When the input is a Table, extraction runs once per row.
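As a sketch of that fan-out, the node below receives a Table and runs one extraction per row; the `lookup.users` reference and the `user_id` column are hypothetical names, not part of the flow above:

```yaml
scrape-profiles:
  type: service
  op: browser.extract
  params:
    # url is templated per row; each input row must supply a user_id column
    url: "https://example.com/profile/{{ user_id }}"
    steps:
      - action: extract
        selector: h1.profile-name
        field: name
  inputs:
    # Table input: extraction runs once per row
    request: { type: Table, from: ref(lookup.users) }
  outputs:
    # one output row per input row
    profiles: { type: Table }
```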

Use browser.list to collect repeated items from a page. Each item becomes a row.

```yaml
list-products:
  type: source
  op: browser.list
  params:
    url: https://shop.example.com/catalog
    item_selector: div.product-card
    steps:
      - action: extract
        selector: h3.product-name
        field: name
      - action: extract
        selector: span.price
        field: price
      - action: extract
        selector: a.product-link
        attribute: href
        field: url
  outputs:
    products: { type: Table }
```

Execute arbitrary JS in the page context and capture the result.

```yaml
run-script:
  type: service
  op: browser.extract
  params:
    url: https://example.com/dashboard
    steps:
      - action: navigate
        wait_for: networkidle
      - action: script
        code: |
          return JSON.stringify(
            Array.from(document.querySelectorAll('.metric'))
              .map(el => ({ label: el.dataset.label, value: el.textContent }))
          );
        field: metrics
  outputs:
    data: { type: Record }
```
Use the `screenshot` parameter to save an image of the rendered page:

```yaml
capture-page:
  type: service
  op: browser.extract
  params:
    url: https://example.com/report
    steps:
      - action: navigate
        wait_for: networkidle
        screenshot: artifacts/report.png
  outputs:
    result: { type: Record }
```

You can target elements with CSS selectors or XPath.

```yaml
# CSS selector
- action: extract
  selector: "div.content > h1"
  field: title

# XPath
- action: extract
  selector: "xpath=//table[@id='results']//tr[2]/td[1]"
  field: first_cell
```

CSS selectors are simpler and faster. Use XPath when you need positional logic (e.g., “the second row in the third table”).

Semantic selectors describe elements by their visual role. At creation time, the code agent resolves these to concrete CSS selectors.

```yaml
- action: extract
  semantic: "the price displayed near the buy button"
  field: price
```
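For illustration, the agent might resolve a semantic description like the one above into a concrete extract step; the class names here are hypothetical, chosen only to show the shape of the resolved selector:

```yaml
# Hypothetical resolution of "the price displayed near the buy button"
- action: extract
  selector: "div.purchase-box span.price"
  field: price
```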
| Action | Description |
| --- | --- |
| `navigate` | Load the page. Set `wait_for: networkidle` for SPAs. |
| `click` | Click an element. Set `wait_after` (ms) for dynamic content. |
| `extract` | Pull text or an attribute from an element into a named field. |
| `script` | Run custom JavaScript. Return a value to capture it. |
```yaml
steps:
  - action: navigate
    wait_for: networkidle
  - action: click
    selector: button.load-more
    wait_after: 1000
  - action: extract
    selector: h1.title
    field: title
```

**Element not found.** The selector doesn’t match any element on the page. Open the URL in a browser, right-click the target element, and use “Copy selector” to get the correct CSS path. If the page uses dynamic rendering, add `wait_for: networkidle` to the navigate step.

**Timeout.** The page took too long to load. Increase the timeout, or switch `wait_for` from `networkidle` to `domcontentloaded` if you don’t need the page fully loaded.

**Dynamic content not loading.** Single-page apps render content after the initial page load. Use `wait_for: networkidle` on the navigate step, or add a click step followed by `wait_after` to trigger lazy-loaded content.
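Putting those fixes together, here is a sketch of a steps list for a slow, lazily rendered page; the selectors and the 2000 ms delay are hypothetical values to tune for your target site:

```yaml
steps:
  - action: navigate
    wait_for: domcontentloaded  # don't wait for every network request to settle
  - action: click
    selector: button.show-all   # trigger lazy-loaded content
    wait_after: 2000            # give the page time to render the new items
  - action: extract
    selector: div.results
    field: results
```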