Data Operations

Built-in ops for known problems. Custom code only for novel logic.

Radhflow ships twelve declarative data operations, plus a freeform transform.sql escape hatch. Each is a node type you drop into flow.yaml and configure with YAML. No code runs; Radhflow translates the config into DuckDB SQL at execution time.

| Operation | Type | Description | SQL equivalent |
| --- | --- | --- | --- |
| filter | data.filter | Keep rows matching conditions | WHERE |
| map | data.map | Compute, rename, or transform fields | SELECT expr AS name |
| sort | data.sort | Order rows by one or more fields | ORDER BY |
| limit | data.limit | Take the first N rows, with optional offset | LIMIT / OFFSET |
| dedup | data.dedup | Remove duplicates on key fields | ROW_NUMBER() OVER (PARTITION BY ...) |
| join | data.join | Combine two tables on matching keys | JOIN |
| group | data.group | Group rows and aggregate | GROUP BY |
| SQL transform | transform.sql | Freeform DuckDB SQL | Any SQL |
| select | data.select | Keep only named fields | SELECT col1, col2 |
| concat | data.concat | Stack multiple tables vertically | UNION ALL BY NAME |
| partition | data.partition | Split rows into two groups | WHERE / WHERE NOT |
| pull | data.pull | Extract a single field from the first row as a Value | First-row field access |
| collect | data.collect | Gather multiple Value inputs into a list | Array aggregation |

Every data operation follows the same lifecycle:

  1. Parse. Radhflow reads your YAML config and validates it.
  2. Translate. The config becomes a DuckDB SQL query.
  3. Execute. DuckDB runs the query against NDJSON input files.
  4. Write. Results go to an NDJSON output file with a companion .schema.json.
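To make the translate step concrete, here is a sketch of how a data.filter config (the same shape as the example below) might map to SQL. The generated query in the comment is an approximation, not Radhflow's literal output:

```yaml
# The config shape is the documented data.filter form; the SQL in the
# trailing comment approximates the translated query — Radhflow's
# actual generated SQL may differ.
type: data.filter
config:
  conditions:
    all:
      - field: email
        op: is_not_null
# Translates to roughly:
#   SELECT * FROM input WHERE email IS NOT NULL
```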

All operations handle nulls, type coercion, and large datasets automatically. DuckDB processes data in-memory with columnar compression — millions of rows complete in seconds.

Most operations take a single Table input and produce a single Table output:

```yaml
nodes:
  clean-emails:
    type: data.filter
    config:
      conditions:
        all:
          - field: email
            op: is_not_null
```

Exceptions:

| Operation | Input ports | Output ports |
| --- | --- | --- |
| data.join | left, right | output |
| data.concat | inputs[0..N] | output |
| data.partition | input | matching, not_matching |
| data.pull | input | value (Value type) |
| data.collect | value_0, value_1 | list (Value type) |
| all others | input | output |
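For example, a data.join node is wired through its left and right ports. The port names come from the table above; the join config keys in this sketch are assumptions, not the documented schema:

```yaml
# Hypothetical join wiring. The left/right port names are documented;
# the config keys (on, how) and the load-orders/load-users source
# nodes are assumed for illustration.
nodes:
  orders-with-users:
    type: data.join
    config:
      on:
        - left: user_id
          right: id
      how: inner
edges:
  - "load-orders.output -> orders-with-users.left"
  - "load-users.output -> orders-with-users.right"
```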

Operations compose naturally. Connect the output of one to the input of the next:

```yaml
name: top-active-leads
version: 1
nodes:
  load-csv:
    type: file.source
    path: leads.csv
    format: csv
  active-only:
    type: data.filter
    config:
      conditions:
        all:
          - field: status
            op: equals
            value: active
  by-score:
    type: data.sort
    config:
      by:
        - field: score
          direction: desc
  top-100:
    type: data.limit
    config:
      count: 100
edges:
  - "load-csv.data -> active-only.input"
  - "active-only.output -> by-score.input"
  - "by-score.output -> top-100.input"
```

This pipeline loads a CSV, keeps active leads, sorts by score descending, and takes the top 100. Four nodes, zero code.

Data operations propagate schemas automatically. If the input table has fields name, email, score, the output schema reflects exactly what the operation produces — same fields for filter and sort, new fields for map, aggregation columns for group.

The type checker validates field references at parse time, before any data flows.
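A sketch of schema propagation and parse-time checking together: after a data.select keeps only name and score, a downstream node referencing email no longer type-checks. The select config key here is an assumption; the sort config shape matches the pipeline above:

```yaml
# Hypothetical illustration of parse-time schema checking. The
# data.select config key (fields) is assumed.
nodes:
  narrow:
    type: data.select
    config:
      fields: [name, score]
  by-email:
    type: data.sort
    config:
      by:
        - field: email      # not in narrow's output schema:
          direction: asc    # rejected at parse time, before any data flows
edges:
  - "narrow.output -> by-email.input"
```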

Use built-in ops when the transformation maps to standard SQL. Use a custom node when:

  • The logic requires external API calls or side effects.
  • The transformation needs a library not available in DuckDB (e.g., ML inference).
  • The operation is domain-specific and not generalizable.

For complex SQL that combines multiple operations in a single query, see SQL Transforms.
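As a preview, a transform.sql node might look like the sketch below. The query config key and the convention of referring to the input table as input are assumptions here; the SQL Transforms page documents the real contract:

```yaml
# Hypothetical transform.sql node. The config key (query) and the
# input table name are assumed — see SQL Transforms for the actual API.
nodes:
  status-counts:
    type: transform.sql
    config:
      query: |
        SELECT status, COUNT(*) AS n
        FROM input
        GROUP BY status
        ORDER BY n DESC
```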