
sort / limit / dedup

Three operations for shaping result sets. They do one thing each and chain together cleanly.

data.sort orders rows by one or more fields.

Input: Table. Output: Table (same schema, reordered).

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| by | array | yes | Sort fields, each with field and optional direction |
| null_handling | string | no | Where nulls appear: first or last (default: last) |

Each entry in by has:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| field | string | (required) | Column to sort on |
| direction | string | asc | Sort direction: asc or desc |

```yaml
nodes:
  by-date:
    type: data.sort
    config:
      by:
        - field: created_at
          direction: desc
```

```yaml
# Multi-field sort with nulls first
nodes:
  ranked:
    type: data.sort
    config:
      null_handling: first
      by:
        - field: priority
          direction: asc
        - field: created_at
          direction: desc
```
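
To make the multi-field ordering and null placement concrete, here is a minimal Python sketch of the semantics described above. It is a model for illustration, not the tool's implementation: `sort_rows` is a hypothetical name, rows are assumed to be dicts, `None` stands in for null, and the sketch assumes null placement is independent of the sort direction, which the reference above does not specify.

```python
from functools import cmp_to_key


def sort_rows(rows, by, null_handling="last"):
    """Order rows by several fields; earlier entries in `by` take precedence."""

    def compare(a, b):
        for spec in by:
            field = spec["field"]
            descending = spec.get("direction", "asc") == "desc"
            x, y = a.get(field), b.get(field)
            if x is None or y is None:
                if x is None and y is None:
                    continue  # both null: fall through to the next field
                # Assumption: nulls go first/last regardless of direction.
                x_goes_first = (x is None) == (null_handling == "first")
                return -1 if x_goes_first else 1
            if x != y:
                result = -1 if x < y else 1
                return -result if descending else result
        return 0  # equal on every sort field

    return sorted(rows, key=cmp_to_key(compare))


# Made-up sample rows mirroring the "ranked" node above.
rows = [
    {"id": 1, "priority": 2, "created_at": "2024-01-05"},
    {"id": 2, "priority": None, "created_at": "2024-01-03"},
    {"id": 3, "priority": 1, "created_at": "2024-01-01"},
    {"id": 4, "priority": 1, "created_at": "2024-01-09"},
]
by = [
    {"field": "priority", "direction": "asc"},
    {"field": "created_at", "direction": "desc"},
]

ranked = sort_rows(rows, by, null_handling="first")
print([r["id"] for r in ranked])  # [2, 4, 3, 1]
```

The null-priority row comes first, and the two priority-1 rows are ordered by the second field, newest first.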

data.limit takes the first N rows, optionally skipping offset rows first.

Input: Table. Output: Table (same schema, at most N rows).

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| count | number | yes | Maximum rows to return |
| offset | number | no | Rows to skip before taking (default: 0) |

```yaml
nodes:
  top-10:
    type: data.limit
    config:
      count: 10
```

```yaml
# Pagination: skip 20, take 10
nodes:
  page-3:
    type: data.limit
    config:
      count: 10
      offset: 20
```
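
The skip-then-take behavior and the pagination arithmetic can be sketched in a few lines of Python. This is an illustrative model only: `limit_rows` and `page_offset` are hypothetical helpers, and pages are assumed to be numbered from 1.

```python
def limit_rows(rows, count, offset=0):
    """Skip `offset` rows, then return at most `count` rows."""
    return rows[offset:offset + count]


def page_offset(page, page_size):
    """Offset for a 1-based page number (hypothetical helper)."""
    return (page - 1) * page_size


rows = list(range(1, 26))  # 25 made-up rows, numbered 1..25

top_10 = limit_rows(rows, count=10)
page_3 = limit_rows(rows, count=10, offset=page_offset(3, 10))

print(page_offset(3, 10))  # 20
print(page_3)              # [21, 22, 23, 24, 25]
```

Page 3 holds only five rows here because the input runs out, which is why the output is described as "at most N rows".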

data.dedup removes duplicate rows based on key fields, keeping either the first or the last occurrence in input order.

Input: Table. Output: Table (same schema, duplicates removed).

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| on | array | yes | Fields to deduplicate on |
| keep | string | no | Which duplicate to keep: first or last (default: first) |

```yaml
nodes:
  unique-emails:
    type: data.dedup
    config:
      on: [email]
```

```yaml
# Keep the most recent entry per user (assumes input is ordered oldest to newest)
nodes:
  latest-per-user:
    type: data.dedup
    config:
      on: [user_id]
      keep: last
```
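
The difference between keeping the first and the last occurrence is easiest to see on a small input. The Python sketch below models the described behavior under stated assumptions: `dedup_rows` is a hypothetical name, rows are dicts, and surviving rows are assumed to stay in their original input order, which the reference above does not spell out.

```python
def dedup_rows(rows, on, keep="first"):
    """Keep one row per distinct key, where the key is the tuple of `on` fields."""
    kept = {}  # key -> index of the row to keep
    for index, row in enumerate(rows):
        key = tuple(row.get(field) for field in on)
        # "first": record a key only once; "last": overwrite with each later row.
        if keep == "last" or key not in kept:
            kept[key] = index
    # Assumption: survivors are emitted in input order.
    return [rows[i] for i in sorted(kept.values())]


# Made-up sample rows, ordered oldest to newest.
events = [
    {"user_id": 1, "event": "signup"},
    {"user_id": 2, "event": "signup"},
    {"user_id": 1, "event": "login"},
]

print(dedup_rows(events, on=["user_id"]))
# [{'user_id': 1, 'event': 'signup'}, {'user_id': 2, 'event': 'signup'}]

print(dedup_rows(events, on=["user_id"], keep="last"))
# [{'user_id': 2, 'event': 'signup'}, {'user_id': 1, 'event': 'login'}]
```

With keep set to last, user 1's later login row replaces the earlier signup row, so which row counts as "most recent" depends entirely on the order of the input.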

These three operations compose into a common pattern: sort to establish order, deduplicate on a key (keeping the desired occurrence based on that order), then cap the result size.

```yaml
nodes:
  load-signups:
    type: file.csv
    config:
      path: signups.csv
  by-date:
    type: data.sort
    config:
      by:
        - field: signed_up_at
          direction: desc
  one-per-email:
    type: data.dedup
    config:
      on: [email]
      keep: first
  top-100:
    type: data.limit
    config:
      count: 100
edges:
  - load-signups.output -> by-date.input
  - by-date.output -> one-per-email.input
  - one-per-email.output -> top-100.input
```

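This pipeline loads signups and sorts them newest first. Because the rows are already in descending date order, keeping the first occurrence per email address keeps the most recent signup. The final step returns at most 100 rows.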
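
The same sort, dedup, limit chain can be traced by hand in plain Python. This is a sketch of the pattern on made-up sample data, not a description of how the pipeline engine executes it; the dates are assumed to be ISO-formatted strings so that string comparison matches chronological order.

```python
# Made-up stand-in for the contents of signups.csv.
signups = [
    {"email": "a@example.com", "signed_up_at": "2024-03-01"},
    {"email": "b@example.com", "signed_up_at": "2024-03-05"},
    {"email": "a@example.com", "signed_up_at": "2024-03-09"},
    {"email": "c@example.com", "signed_up_at": "2024-02-11"},
]

# by-date: newest first.
by_date = sorted(signups, key=lambda row: row["signed_up_at"], reverse=True)

# one-per-email: keep the first occurrence, which is now the newest.
seen = set()
one_per_email = []
for row in by_date:
    if row["email"] not in seen:
        seen.add(row["email"])
        one_per_email.append(row)

# top-100: cap the result size.
top_100 = one_per_email[:100]

for row in top_100:
    print(row["email"], row["signed_up_at"])
# a@example.com 2024-03-09
# b@example.com 2024-03-05
# c@example.com 2024-02-11
```

Swapping the sort direction to asc while leaving keep at first would instead retain each address's oldest signup, so the two settings need to be chosen together.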