Data Operations

Built-in ops for known problems. Custom code only for novel logic.

Radhflow ships twelve declarative data operations, plus a freeform transform.sql escape hatch. Each is a node type you drop into flow.yaml and configure with YAML. No code runs; Radhflow translates the config into DuckDB SQL at execution time.

| Operation | Type | Description | SQL equivalent |
| --- | --- | --- | --- |
| filter | data.filter | Keep rows matching conditions | WHERE |
| map | data.map | Compute, rename, or transform fields | SELECT expr AS name |
| sort | data.sort | Order rows by one or more fields | ORDER BY |
| limit | data.limit | Take the first N rows, with optional offset | LIMIT / OFFSET |
| dedup | data.dedup | Remove duplicates on key fields | ROW_NUMBER() OVER (PARTITION BY ...) |
| join | data.join | Combine two tables on matching keys | JOIN |
| group | data.group | Group rows and aggregate | GROUP BY |
| SQL transform | transform.sql | Freeform DuckDB SQL | Any SQL |
| select | data.select | Keep only named fields | SELECT col1, col2 |
| concat | data.concat | Stack multiple tables vertically | UNION ALL BY NAME |
| partition | data.partition | Split rows into two groups | WHERE / WHERE NOT |
| pull | data.pull | Extract a single field from the first row as a Value | First-row field access |
| collect | data.collect | Gather multiple Value inputs into a list | Array aggregation |

Every data operation follows the same lifecycle:

  1. Parse. Radhflow reads your YAML config and validates it.
  2. Translate. The config becomes a DuckDB SQL query.
  3. Execute. DuckDB runs the query against NDJSON input files.
  4. Write. Results go to an NDJSON output file with a companion .schema.json.
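To make the translate step concrete, here is a sketch of how a data.filter config (the same shape as the example below) might map to SQL. The generated query in the comment is an approximation, not Radhflow's literal output:

```yaml
# The config shape is the documented data.filter form; the SQL in the
# trailing comment approximates the translated query — Radhflow's
# actual generated SQL may differ.
type: data.filter
config:
  conditions:
    all:
      - field: email
        op: is_not_null
# Translates to roughly:
#   SELECT * FROM input WHERE email IS NOT NULL
```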

All operations handle nulls, type coercion, and large datasets automatically. DuckDB processes data in-memory with columnar compression — millions of rows complete in seconds.

Most operations take a single Table input and produce a single Table output:

```yaml
nodes:
  clean-emails:
    type: data.filter
    config:
      conditions:
        all:
          - field: email
            op: is_not_null
```

Exceptions:

| Operation | Input ports | Output ports |
| --- | --- | --- |
| data.join | left, right | output |
| data.concat | inputs[0..N] | output |
| data.partition | input | matching, not_matching |
| data.pull | input | value (Value type) |
| data.collect | value_0, value_1 | list (Value type) |
| all others | input | output |
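For example, a data.join node is wired through its left and right ports. The port names come from the table above; the join config keys in this sketch are assumptions, not the documented schema:

```yaml
# Hypothetical join wiring. The left/right port names are documented;
# the config keys (on, how) and the load-orders/load-users source
# nodes are assumed for illustration.
nodes:
  orders-with-users:
    type: data.join
    config:
      on:
        - left: user_id
          right: id
      how: inner
edges:
  - "load-orders.output -> orders-with-users.left"
  - "load-users.output -> orders-with-users.right"
```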

Operations compose naturally. Connect the output of one to the input of the next:

```yaml
name: top-active-leads
version: 1
nodes:
  load-csv:
    type: file.source
    path: leads.csv
    format: csv
  active-only:
    type: data.filter
    config:
      conditions:
        all:
          - field: status
            op: equals
            value: active
  by-score:
    type: data.sort
    config:
      by:
        - field: score
          direction: desc
  top-100:
    type: data.limit
    config:
      count: 100
edges:
  - "load-csv.data -> active-only.input"
  - "active-only.output -> by-score.input"
  - "by-score.output -> top-100.input"
```

This pipeline loads a CSV, keeps active leads, sorts by score descending, and takes the top 100. Four nodes, zero code.

Data operations propagate schemas automatically. If the input table has fields name, email, score, the output schema reflects exactly what the operation produces — same fields for filter and sort, new fields for map, aggregation columns for group.

The type checker validates field references at parse time, before any data flows.
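A sketch of schema propagation and parse-time checking together: after a data.select keeps only name and score, a downstream node referencing email no longer type-checks. The select config key here is an assumption; the sort config shape matches the pipeline above:

```yaml
# Hypothetical illustration of parse-time schema checking. The
# data.select config key (fields) is assumed.
nodes:
  narrow:
    type: data.select
    config:
      fields: [name, score]
  by-email:
    type: data.sort
    config:
      by:
        - field: email      # not in narrow's output schema:
          direction: asc    # rejected at parse time, before any data flows
edges:
  - "narrow.output -> by-email.input"
```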

Use built-in ops when the transformation maps to standard SQL. Use a custom node when:

  • The logic requires external API calls or side effects.
  • The transformation needs a library not available in DuckDB (e.g., ML inference).
  • The operation is domain-specific and not generalizable.

For complex SQL that combines multiple operations in a single query, see SQL Transforms.
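As a preview, a transform.sql node might look like the sketch below. The query config key and the convention of referring to the input table as input are assumptions here; the SQL Transforms page documents the real contract:

```yaml
# Hypothetical transform.sql node. The config key (query) and the
# input table name are assumed — see SQL Transforms for the actual API.
nodes:
  status-counts:
    type: transform.sql
    config:
      query: |
        SELECT status, COUNT(*) AS n
        FROM input
        GROUP BY status
        ORDER BY n DESC
```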