sort / limit / dedup

Sort rows, cap the count, remove duplicates — in one step or chained together.

Three operations for shaping result sets. Each does one thing. They chain together cleanly.

data.sort orders rows by one or more fields.

Input: Table | Output: Table (same schema, reordered)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| by | array | yes | Sort fields, each with field and optional direction |
| null_handling | string | no | Where nulls appear: first or last (default: last) |

Each entry in by:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| field | string | (required) | Column to sort on |
| direction | string | asc | Sort direction: asc or desc |

```yaml
nodes:
  by-date:
    type: data.sort
    config:
      by:
        - field: created_at
          direction: desc
```

```yaml
# Multi-field sort with nulls first
nodes:
  ranked:
    type: data.sort
    config:
      null_handling: first
      by:
        - field: priority
          direction: asc
        - field: created_at
          direction: desc
```

Stable sort. Rows with equal sort keys retain their original relative order.

NULL ordering. By default, NULLs sort last. Set null_handling: first to put them at the top.
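
To make the stable-sort and null-ordering rules concrete, here is a minimal Python sketch of the documented behavior. It is a model, not the engine's implementation: `sort_rows` is a hypothetical helper, rows are assumed to be plain dictionaries, and null placement is assumed to be independent of sort direction.

```python
def sort_rows(rows, by, null_handling="last"):
    """Hypothetical model of data.sort; not the engine's actual code."""
    result = list(rows)
    # Apply keys from least to most significant. Each pass is stable,
    # so the first entry in `by` ends up as the primary key.
    for spec in reversed(by):
        field = spec["field"]
        descending = spec.get("direction", "asc") == "desc"
        nulls = [r for r in result if r.get(field) is None]
        present = [r for r in result if r.get(field) is not None]
        # list.sort is stable, also when reverse=True.
        present.sort(key=lambda r: r[field], reverse=descending)
        # Assumption: nulls go first or last regardless of direction.
        result = nulls + present if null_handling == "first" else present + nulls
    return result


rows = [
    {"id": 1, "priority": 2, "created_at": "2024-01-01"},
    {"id": 2, "priority": 1, "created_at": "2024-01-02"},
    {"id": 3, "priority": None, "created_at": "2024-01-03"},
    {"id": 4, "priority": 1, "created_at": "2024-01-05"},
]
by = [
    {"field": "priority", "direction": "asc"},
    {"field": "created_at", "direction": "desc"},
]
ranked = sort_rows(rows, by, null_handling="first")
print([r["id"] for r in ranked])  # [3, 4, 2, 1]
```

With the default null_handling of last, the same input would come out as ids 4, 2, 1, 3: the row with the null priority moves to the end and everything else keeps its order.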

data.limit takes the first N rows, optionally skipping an offset.

Input: Table | Output: Table (same schema, at most N rows)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| count | number | yes | Maximum rows to return |
| offset | number | no | Rows to skip before taking (default: 0) |

```yaml
nodes:
  top-10:
    type: data.limit
    config:
      count: 10
```

```yaml
# Pagination: skip 20, take 10
nodes:
  page-3:
    type: data.limit
    config:
      count: 10
      offset: 20
```

Fewer rows than count. If the input has fewer rows than count, all rows are returned.

Offset beyond data. If offset exceeds the row count, the output is empty.
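
Both edge cases behave like list slicing. The sketch below models them in Python; `limit_rows` is a hypothetical name used only for illustration, and the input is assumed to be an in-memory list.

```python
def limit_rows(rows, count, offset=0):
    """Hypothetical model of data.limit; not the engine's actual code."""
    # Slicing never raises on out-of-range bounds; it simply returns
    # fewer rows (or none) when the input runs out.
    return rows[offset:offset + count]


rows = list(range(1, 26))  # 25 rows, numbered 1..25

print(limit_rows(rows, count=10, offset=20))  # [21, 22, 23, 24, 25]
print(limit_rows(rows, count=10, offset=30))  # []
```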

data.dedup removes duplicate rows based on key fields. It keeps either the first or the last occurrence of each key, in input order.

Input: Table | Output: Table (same schema, duplicates removed)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| on | array | yes | Fields to deduplicate on |
| keep | string | no | Which duplicate to keep: first or last (default: first) |

```yaml
nodes:
  unique-emails:
    type: data.dedup
    config:
      on: [email]
```

```yaml
# Keep the most recent entry per user
nodes:
  latest-per-user:
    type: data.dedup
    config:
      on: [user_id]
      keep: last
```

Multi-field dedup. When on lists multiple fields, rows are considered duplicates only if all listed fields match.

NULL keys. Two rows with NULL in a dedup key field are considered duplicates of each other.
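
The key-matching rules can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than the engine's code: `dedup_rows` is a hypothetical helper, rows are plain dictionaries, and surviving rows are assumed to keep their input order.

```python
def dedup_rows(rows, on, keep="first"):
    """Hypothetical model of data.dedup; not the engine's actual code."""
    chosen = {}  # key tuple -> index of the row to keep
    for index, row in enumerate(rows):
        # All listed fields form the key, so rows match only if every
        # field matches. None == None, so NULL keys collapse together.
        key = tuple(row.get(field) for field in on)
        if keep == "last" or key not in chosen:
            chosen[key] = index
    # Assumption: kept rows are emitted in their original input order.
    return [rows[i] for i in sorted(chosen.values())]


rows = [
    {"user_id": 1, "note": "a"},
    {"user_id": None, "note": "b"},
    {"user_id": 1, "note": "c"},
    {"user_id": None, "note": "d"},
]
print([r["note"] for r in dedup_rows(rows, on=["user_id"])])               # ['a', 'b']
print([r["note"] for r in dedup_rows(rows, on=["user_id"], keep="last")])  # ['c', 'd']
```

Deduplicating the same rows on both user_id and note would return all four, since no two rows agree on every listed field.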

These three operations compose into a common pattern: sort to establish order, deduplicate on a key (keeping the desired occurrence based on that order), then cap the result size.

```yaml
name: recent-unique-signups
version: 1
nodes:
  load-signups:
    type: file.source
    path: signups.csv
    format: csv
  by-date:
    type: data.sort
    config:
      by:
        - field: signed_up_at
          direction: desc
  one-per-email:
    type: data.dedup
    config:
      on: [email]
      keep: first
  top-100:
    type: data.limit
    config:
      count: 100
edges:
  - "load-signups.data -> by-date.input"
  - "by-date.output -> one-per-email.input"
  - "one-per-email.output -> top-100.input"
```

This pipeline loads signups, sorts newest first, keeps only the most recent signup per email address, and returns the top 100.
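
For readers who want to check the expected result by hand, the same sort, dedup, limit sequence can be sketched in plain Python. The function name and sample rows are made up for illustration, and timestamps are assumed to be ISO-8601 strings so that string comparison matches chronological order.

```python
def recent_unique(rows, key_field, time_field, count):
    """Hypothetical model of the sort -> dedup -> limit pipeline."""
    # Step 1: sort newest first (sorted() is stable).
    newest_first = sorted(rows, key=lambda r: r[time_field], reverse=True)
    # Step 2: keep the first occurrence per key, which is now the newest.
    seen = set()
    unique = []
    for row in newest_first:
        if row[key_field] not in seen:
            seen.add(row[key_field])
            unique.append(row)
    # Step 3: cap the result size.
    return unique[:count]


signups = [
    {"email": "a@example.com", "signed_up_at": "2024-03-01"},
    {"email": "b@example.com", "signed_up_at": "2024-03-05"},
    {"email": "a@example.com", "signed_up_at": "2024-03-09"},
]
for row in recent_unique(signups, "email", "signed_up_at", count=100):
    print(row["email"], row["signed_up_at"])
# a@example.com 2024-03-09
# b@example.com 2024-03-05
```

The order of the steps matters: running the dedup before the sort would keep whichever signup happened to appear first in the file, not the most recent one.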