sort / limit / dedup

Sort rows, cap the row count, and remove duplicates, individually or chained together.

Three operations for shaping result sets. Each does one thing. They chain together cleanly.

data.sort

Orders rows by one or more fields.

Input: Table | Output: Table (same schema, reordered)

Config reference

Field	Type	Required	Description
`by`	array	yes	Sort fields, each with `field` and optional `direction`
`null_handling`	string	no	Where nulls appear: `first` or `last` (default: `last`)

Each entry in `by`:

Field	Type	Default	Description
`field`	string	(required)	Column to sort on
`direction`	string	`asc`	Sort direction: `asc` or `desc`

Examples

# Single-field sort, newest first
nodes:
  by-date:
    type: data.sort
    config:
      by:
        - field: created_at
          direction: desc

# Multi-field sort with nulls first
nodes:
  ranked:
    type: data.sort
    config:
      null_handling: first
      by:
        - field: priority
          direction: asc
        - field: created_at
          direction: desc

Edge cases

Stable sort. Rows with equal sort keys retain their original relative order.

NULL ordering. By default, NULLs sort last. Set `null_handling: first` to put them at the top.
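
Stability also means a multi-field sort can be expressed as a chain of single-field sorts, applied from the least significant key to the most significant. The sketch below is one way to do that, assuming `data.sort` nodes connect through the same `input` and `output` ports used in the pipeline example on this page; the node names are made up for illustration. It should give the same order as a single `data.sort` that lists `priority` (asc) and then `created_at` (desc).

```yaml
nodes:
  # Least significant key first: newest rows on top.
  newest-first:
    type: data.sort
    config:
      by:
        - field: created_at
          direction: desc

  # Most significant key last. Rows with equal priority keep the
  # newest-first order from the previous node because the sort is stable.
  by-priority:
    type: data.sort
    config:
      by:
        - field: priority
          direction: asc

edges:
  - "newest-first.output -> by-priority.input"
```

A single multi-field node is simpler when both keys are known up front; the chained form is mainly useful when the two sorts live in different parts of a pipeline.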

data.limit

Takes the first N rows (`count`), optionally after skipping the first `offset` rows.

Input: Table | Output: Table (same schema, at most N rows)

Config reference

Field	Type	Required	Description
`count`	number	yes	Maximum rows to return
`offset`	number	no	Rows to skip before taking (default: `0`)

Examples

# First 10 rows
nodes:
  top-10:
    type: data.limit
    config:
      count: 10

# Pagination: skip 20, take 10
nodes:
  page-3:
    type: data.limit
    config:
      count: 10
      offset: 20

Edge cases

Fewer rows than `count`. If fewer than `count` rows remain after applying `offset`, all remaining rows are returned.

Offset beyond data. If `offset` is greater than or equal to the row count, the output is empty.
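
`data.limit` takes rows in whatever order they arrive, so paginated output is only repeatable when the order is fixed upstream. A minimal sketch, assuming a hypothetical `id` column and the `input`/`output` port names from the pipeline example on this page: sort first, then take page k of size 10 with `offset` = (k - 1) × 10.

```yaml
nodes:
  # Fix the row order so every page is cut from the same sequence.
  ordered:
    type: data.sort
    config:
      by:
        - field: id        # hypothetical column name
          direction: asc

  # Page 2 of size 10: skip (2 - 1) * 10 = 10 rows, then take 10.
  page-2:
    type: data.limit
    config:
      count: 10
      offset: 10

edges:
  - "ordered.output -> page-2.input"
```

If the input holds only 14 rows, this page returns the remaining 4; with 10 rows or fewer it returns an empty table.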

data.dedup

Removes duplicate rows based on key fields. Keeps either the first or last occurrence in input order.

Input: Table | Output: Table (same schema, duplicates removed)

Config reference

Field	Type	Required	Description
`on`	array	yes	Fields to deduplicate on
`keep`	string	no	Which duplicate to keep: `first` or `last` (default: `first`)

Examples

# One row per email address
nodes:
  unique-emails:
    type: data.dedup
    config:
      on: [email]

# Keep the last entry per user in input order (the most recent one if the input is sorted oldest first)
nodes:
  latest-per-user:
    type: data.dedup
    config:
      on: [user_id]
      keep: last

Edge cases

Multi-field dedup. When `on` lists multiple fields, rows are considered duplicates only if all listed fields match.

NULL keys. NULL is treated as equal to NULL for dedup purposes: two rows with NULL in a dedup key field are considered duplicates of each other, provided any other key fields also match.
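
As an illustration of a multi-field key, the sketch below deduplicates on two hypothetical columns, `tenant_id` and `email`. Rows collapse only when both values match, so the same email under two different tenants is kept twice.

```yaml
nodes:
  unique-per-tenant:
    type: data.dedup
    config:
      # Both columns are hypothetical; a row is a duplicate only if
      # tenant_id AND email both match an earlier row.
      on: [tenant_id, email]
      keep: first
```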

Pipeline: sort + dedup + limit

These three operations compose into a common pattern: sort to establish order, deduplicate on a key (keeping the desired occurrence based on that order), then cap the result size.

name: recent-unique-signups
version: 1

nodes:
  load-signups:
    type: file.source
    path: signups.csv
    format: csv

  by-date:
    type: data.sort
    config:
      by:
        - field: signed_up_at
          direction: desc

  one-per-email:
    type: data.dedup
    config:
      on: [email]
      keep: first

  top-100:
    type: data.limit
    config:
      count: 100

edges:
  - "load-signups.data -> by-date.input"
  - "by-date.output -> one-per-email.input"
  - "one-per-email.output -> top-100.input"

This pipeline loads signups, sorts them newest first, keeps only the most recent signup per email address (the first occurrence after the descending sort), and returns at most 100 rows.
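
The order of the last two steps matters. For contrast, here is a sketch of the same nodes wired with the limit ahead of the dedup. It takes the 100 newest signups first and only then removes repeated emails, so it can return fewer than 100 rows even when more than 100 distinct emails exist.

```yaml
# Contrast wiring: same node definitions as above, different edges.
edges:
  - "load-signups.data -> by-date.input"
  - "by-date.output -> top-100.input"
  - "top-100.output -> one-per-email.input"
```

Deduplicating before limiting, as in the original wiring, is the form to use when the goal is up to 100 distinct emails.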