SQL Transforms

Write raw SQL when built-in ops are not enough.

transform.sql gives you the full DuckDB SQL dialect — window functions, CTEs, JSON extraction, regex, date arithmetic — while staying connected to the pipeline graph through ref() references.

Input: One or more Tables (via ref()) | Output: Table

Minimal example

nodes:
  top-customers:
    type: transform.sql
    config:
      query: |
        SELECT
          customer_id,
          sum(amount) AS total_spend,
          count(*) AS order_count
        FROM ref('load-orders')
        GROUP BY customer_id
        HAVING sum(amount) > 1000
        ORDER BY total_spend DESC

Config reference

Field	Type	Required	Description
`query`	string	yes	Full DuckDB SQL query using `ref()` to reference upstream nodes

The ref() function

ref('node_id') references the output of an upstream node. Radhflow resolves it to the node’s NDJSON output file at execution time. Edges are created automatically from ref() calls — you do not need to declare them manually.

-- Single source
SELECT * FROM ref('clean-data') WHERE status = 'active'

-- Multiple sources
SELECT a.*, b.category_name
FROM ref('orders') a
JOIN ref('categories') b ON a.category_id = b.id

Node IDs in ref() must match existing node IDs in the pipeline. The type checker validates these references at parse time.

DuckDB SQL dialect notes

DuckDB supports most PostgreSQL syntax plus extensions for analytics. A few things to know:

String literals use single quotes: 'active'
Identifiers with special characters use double quotes: "my-column"
ILIKE for case-insensitive LIKE
STRFTIME for date formatting
LIST() aggregate for collecting values into arrays
UNNEST() for expanding arrays into rows
STRUCT for nested objects
Full window function support: ROW_NUMBER(), RANK(), LAG(), LEAD(), etc.

Progressive examples

Window functions

nodes:
  ranked-sales:
    type: transform.sql
    config:
      query: |
        SELECT
          *,
          ROW_NUMBER() OVER (
            PARTITION BY region
            ORDER BY revenue DESC
          ) AS rank_in_region,
          revenue / SUM(revenue) OVER (
            PARTITION BY region
          ) AS pct_of_region
        FROM ref('monthly-sales')

Common Table Expressions (CTEs)

nodes:
  cohort-analysis:
    type: transform.sql
    config:
      query: |
        WITH first_purchase AS (
          SELECT
            customer_id,
            MIN(order_date) AS cohort_date
          FROM ref('orders')
          GROUP BY customer_id
        ),
        labeled AS (
          SELECT
            o.*,
            fp.cohort_date,
            DATE_DIFF('month', fp.cohort_date, o.order_date) AS months_since
          FROM ref('orders') o
          JOIN first_purchase fp ON o.customer_id = fp.customer_id
        )
        SELECT
          cohort_date,
          months_since,
          COUNT(DISTINCT customer_id) AS customers,
          SUM(amount) AS revenue
        FROM labeled
        GROUP BY cohort_date, months_since
        ORDER BY cohort_date, months_since

JSON extraction

nodes:
  parse-metadata:
    type: transform.sql
    config:
      query: |
        SELECT
          id,
          metadata->>'$.name' AS name,
          CAST(metadata->>'$.score' AS DOUBLE) AS score,
          json_array_length(metadata->'$.tags') AS tag_count
        FROM ref('raw-events')

Regex and string functions

nodes:
  extract-domains:
    type: transform.sql
    config:
      query: |
        SELECT
          email,
          regexp_extract(email, '@(.+)$', 1) AS domain,
          CASE
            WHEN email LIKE '%@gmail.com' THEN 'personal'
            WHEN email LIKE '%@company.com' THEN 'internal'
            ELSE 'other'
          END AS email_type
        FROM ref('contacts')

Date functions

nodes:
  daily-metrics:
    type: transform.sql
    config:
      query: |
        SELECT
          DATE_TRUNC('day', created_at) AS day,
          COUNT(*) AS events,
          COUNT(DISTINCT user_id) AS unique_users,
          DATE_DIFF('hour', MIN(created_at), MAX(created_at)) AS active_hours
        FROM ref('event-log')
        GROUP BY DATE_TRUNC('day', created_at)
        ORDER BY day

Multi-source join

nodes:
  full-report:
    type: transform.sql
    config:
      query: |
        SELECT
          o.id AS order_id,
          c.name AS customer_name,
          p.name AS product_name,
          o.quantity,
          o.quantity * p.unit_price AS line_total
        FROM ref('orders') o
        LEFT JOIN ref('customers') c ON o.customer_id = c.id
        LEFT JOIN ref('products') p ON o.product_id = p.id
        WHERE o.status != 'cancelled'
        ORDER BY o.created_at DESC

When to use SQL transforms

Scenario	Recommended approach
Filter rows by a condition	`data.filter`
Sort and take top N	`data.sort` + `data.limit`
Join two tables	`data.join`
Group and aggregate	`data.group`
Window ranking within groups	`transform.sql`
CTE-based multi-step logic	`transform.sql`
Multi-table join with conditions	`transform.sql`
Regex extraction + conditional logic	`transform.sql`
Anything with `HAVING`, `QUALIFY`, or `LATERAL`	`transform.sql`