
Browser Extraction

Extract data from web pages using headless browser automation.

Browser nodes use Playwright with Chromium. Both are bundled in the Radhflow Docker image — no extra installation required. If you’re running outside Docker, install Playwright separately:

```sh
npx playwright install chromium
```
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `url` | Yes | | Target URL. Supports `{{ }}` templates. |
| `selector` | Per action | | CSS selector for the target element. |
| `wait_for` | No | `load` | Page load strategy: `load`, `domcontentloaded`, or `networkidle`. |
| `script` | No | | Custom JavaScript to run in the page context. |
| `screenshot` | No | | Path to save a screenshot (relative to the project root). |
| `item_selector` | `browser.list` only | | CSS selector for repeated items on a listing page. |
| `steps` | Yes | | Ordered extraction steps. |

Extract structured data from a page. Each extract step pulls one field.

flow.yaml

```yaml
scrape-profile:
  type: service
  op: browser.extract
  params:
    url: "https://example.com/profile/{{ user_id }}"
    steps:
      - action: extract
        selector: h1.profile-name
        field: name
      - action: extract
        selector: span.title
        field: title
      - action: extract
        selector: div.bio
        field: bio
  inputs:
    request: { type: Record, from: ref(lookup.user) }
  outputs:
    profile:
      type: Record
      schema:
        name: { type: string }
        title: { type: string }
        bio: { type: string }
```

When the input is a Table, extraction runs once per row.
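As a sketch of that fan-out, the node below receives a Table and runs one extraction per row; the `lookup.users` reference and the `user_id` column are hypothetical names, not part of the flow above:

```yaml
scrape-profiles:
  type: service
  op: browser.extract
  params:
    # url is templated per row; each input row must supply a user_id column
    url: "https://example.com/profile/{{ user_id }}"
    steps:
      - action: extract
        selector: h1.profile-name
        field: name
  inputs:
    # Table input: extraction runs once per row
    request: { type: Table, from: ref(lookup.users) }
  outputs:
    # one output row per input row
    profiles: { type: Table }
```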

Use browser.list to collect repeated items from a page. Each item becomes a row.

```yaml
list-products:
  type: source
  op: browser.list
  params:
    url: https://shop.example.com/catalog
    item_selector: div.product-card
    steps:
      - action: extract
        selector: h3.product-name
        field: name
      - action: extract
        selector: span.price
        field: price
      - action: extract
        selector: a.product-link
        attribute: href
        field: url
  outputs:
    products: { type: Table }
```

Execute arbitrary JS in the page context and capture the result.

```yaml
run-script:
  type: service
  op: browser.extract
  params:
    url: https://example.com/dashboard
    steps:
      - action: navigate
        wait_for: networkidle
      - action: script
        code: |
          return JSON.stringify(
            Array.from(document.querySelectorAll('.metric'))
              .map(el => ({ label: el.dataset.label, value: el.textContent }))
          );
        field: metrics
  outputs:
    data: { type: Record }
```
Use the `screenshot` parameter to save an image of the rendered page:

```yaml
capture-page:
  type: service
  op: browser.extract
  params:
    url: https://example.com/report
    steps:
      - action: navigate
        wait_for: networkidle
        screenshot: artifacts/report.png
  outputs:
    result: { type: Record }
```

You can target elements with CSS selectors or XPath.

```yaml
# CSS selector
- action: extract
  selector: "div.content > h1"
  field: title

# XPath
- action: extract
  selector: "xpath=//table[@id='results']//tr[2]/td[1]"
  field: first_cell
```

CSS selectors are simpler and faster. Use XPath when you need positional logic (e.g., “the second row in the third table”).

Semantic selectors describe elements by their visual role. At creation time, the code agent resolves these to concrete CSS selectors.

```yaml
- action: extract
  semantic: "the price displayed near the buy button"
  field: price
```
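For illustration, the agent might resolve a semantic description like the one above into a concrete extract step; the class names here are hypothetical, chosen only to show the shape of the resolved selector:

```yaml
# Hypothetical resolution of "the price displayed near the buy button"
- action: extract
  selector: "div.purchase-box span.price"
  field: price
```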
| Action | Description |
| --- | --- |
| `navigate` | Load the page. Set `wait_for: networkidle` for SPAs. |
| `click` | Click an element. Set `wait_after` (ms) for dynamic content. |
| `extract` | Pull text or an attribute from an element into a named field. |
| `script` | Run custom JavaScript. Return a value to capture it. |
```yaml
steps:
  - action: navigate
    wait_for: networkidle
  - action: click
    selector: button.load-more
    wait_after: 1000
  - action: extract
    selector: h1.title
    field: title
```

**Element not found.** The selector doesn’t match any element on the page. Open the URL in a browser, right-click the target element, and use “Copy selector” to get the correct CSS path. If the page uses dynamic rendering, add `wait_for: networkidle` to the navigate step.

**Timeout.** The page took too long to load. Increase the timeout, or switch `wait_for` from `networkidle` to `domcontentloaded` if you don’t need the page fully loaded.

**Dynamic content not loading.** Single-page apps render content after the initial page load. Use `wait_for: networkidle` on the navigate step, or add a click step followed by `wait_after` to trigger lazy-loaded content.
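Putting those fixes together, here is a sketch of a steps list for a slow, lazily rendered page; the selectors and the 2000 ms delay are hypothetical values to tune for your target site:

```yaml
steps:
  - action: navigate
    wait_for: domcontentloaded  # don't wait for every network request to settle
  - action: click
    selector: button.show-all   # trigger lazy-loaded content
    wait_after: 2000            # give the page time to render the new items
  - action: extract
    selector: div.results
    field: results
```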