sort / limit / dedup

Three operations for shaping result sets. They do one thing each and chain together cleanly.

data.sort

Orders rows by one or more fields.

Input: Table. Output: Table (same schema, reordered).

Config reference

Field	Type	Required	Description
`by`	array	yes	Sort fields, each with `field` and optional `direction`
`null_handling`	string	no	Where nulls appear: `first` or `last` (default: `last`)

Each entry in `by` has:

Field	Type	Default	Description
`field`	string	(required)	Column to sort on
`direction`	string	`asc`	Sort direction: `asc` or `desc`

Examples

nodes:
  by-date:
    type: data.sort
    config:
      by:
        - field: created_at
          direction: desc

# Multi-field sort with nulls first
nodes:
  ranked:
    type: data.sort
    config:
      null_handling: first
      by:
        - field: priority
          direction: asc
        - field: created_at
          direction: desc
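
To make the multi-field semantics concrete, here is a minimal Python sketch of how such a sort could behave on a list of dict rows. It is not the node's actual implementation. The function name `sort_rows` is hypothetical, and two behaviors are assumptions the reference above does not state: the sort is stable, and `null_handling` places nulls at the chosen end regardless of `direction`.

```python
def sort_rows(rows, by, null_handling="last"):
    """Sort dict rows by several fields (hypothetical helper, not the node's code).

    Applies the sort fields from least to most significant. Each pass is
    stable, so earlier fields in `by` end up taking precedence.
    """
    result = list(rows)
    for spec in reversed(by):
        field = spec["field"]
        reverse = spec.get("direction", "asc") == "desc"
        # Assumption: nulls are grouped at one end, independent of direction.
        nulls = [r for r in result if r.get(field) is None]
        non_null = [r for r in result if r.get(field) is not None]
        non_null.sort(key=lambda r: r[field], reverse=reverse)
        if null_handling == "first":
            result = nulls + non_null
        else:
            result = non_null + nulls
    return result


rows = [
    {"id": 1, "priority": 2, "created_at": "2024-01-01"},
    {"id": 2, "priority": 1, "created_at": "2024-01-02"},
    {"id": 3, "priority": None, "created_at": "2024-01-03"},
    {"id": 4, "priority": 1, "created_at": "2024-01-05"},
]
by = [
    {"field": "priority", "direction": "asc"},
    {"field": "created_at", "direction": "desc"},
]
ranked = sort_rows(rows, by, null_handling="first")
print([r["id"] for r in ranked])  # [3, 4, 2, 1]
```

Within the two rows that share `priority: 1`, the newer `created_at` comes first, which is the tie-breaking role of the second entry in `by`.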

data.limit

Returns at most `count` rows, optionally skipping the first `offset` rows.

Input: Table. Output: Table (same schema, at most N rows).

Config reference

Field	Type	Required	Description
`count`	number	yes	Maximum rows to return
`offset`	number	no	Rows to skip before taking (default: `0`)

Examples

nodes:
  top-10:
    type: data.limit
    config:
      count: 10

# Pagination: skip 20, take 10
nodes:
  page-3:
    type: data.limit
    config:
      count: 10
      offset: 20
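
The pagination arithmetic can be sketched in Python as follows. Both helper names (`limit_rows`, `page_config`) are hypothetical and only model the documented `count` and `offset` fields. The sketch assumes 1-based page numbers and non-negative values; how the node treats negative or non-integer input is not covered here.

```python
def limit_rows(rows, count, offset=0):
    """Skip `offset` rows, then take at most `count` rows (hypothetical helper)."""
    return rows[offset:offset + count]


def page_config(page, page_size):
    """Build a data.limit-style config for a 1-based page number (assumption)."""
    return {"count": page_size, "offset": (page - 1) * page_size}


rows = list(range(35))
print(page_config(3, 10))                     # {'count': 10, 'offset': 20}
print(limit_rows(rows, **page_config(3, 10))) # rows 20 through 29
print(limit_rows(rows, **page_config(4, 10))) # rows 30 through 34 (short last page)
```

The last page can hold fewer than `count` rows, which matches the "at most N rows" output contract.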

data.dedup

Removes duplicate rows based on key fields. Keeps either the first or last occurrence in input order.

Input: Table. Output: Table (same schema, duplicates removed).

Config reference

Field	Type	Required	Description
`on`	array	yes	Fields to deduplicate on
`keep`	string	no	Which duplicate to keep: `first` or `last` (default: `first`)

Examples

nodes:
  unique-emails:
    type: data.dedup
    config:
      on: [email]

# Keep the last entry per user in input order (the most recent one if the input is sorted oldest first)
nodes:
  latest-per-user:
    type: data.dedup
    config:
      on: [user_id]
      keep: last
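
The difference between `keep: first` and `keep: last` is easiest to see on a small input. The Python sketch below uses a hypothetical `dedup_rows` helper, not the node's implementation, and assumes the surviving rows keep their relative input order, which the reference above does not explicitly guarantee.

```python
def dedup_rows(rows, on, keep="first"):
    """Keep one row per key, chosen by input order (hypothetical helper)."""
    chosen = {}  # key tuple -> index of the row to keep
    for index, row in enumerate(rows):
        key = tuple(row.get(field) for field in on)
        if keep == "first":
            chosen.setdefault(key, index)  # first occurrence wins
        else:
            chosen[key] = index            # later occurrences overwrite
    # Assumption: survivors are emitted in their original input order.
    return [rows[i] for i in sorted(chosen.values())]


rows = [
    {"user_id": "a", "visit": 1},
    {"user_id": "b", "visit": 2},
    {"user_id": "a", "visit": 3},
]
print(dedup_rows(rows, on=["user_id"], keep="first"))  # visits 1 and 2
print(dedup_rows(rows, on=["user_id"], keep="last"))   # visits 2 and 3
```

Because the choice depends only on input order, `keep: last` returns the most recent entry only when the input arrives oldest first.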

Pipeline: sort, dedup, limit

These three operations compose into a common pattern: sort to establish order, deduplicate on a key (keeping the desired occurrence based on that order), then cap the result size.

nodes:
  load-signups:
    type: file.csv
    config:
      path: signups.csv

  by-date:
    type: data.sort
    config:
      by:
        - field: signed_up_at
          direction: desc

  one-per-email:
    type: data.dedup
    config:
      on: [email]
      keep: first

  top-100:
    type: data.limit
    config:
      count: 100

edges:
  - load-signups.output -> by-date.input
  - by-date.output -> one-per-email.input
  - one-per-email.output -> top-100.input

This pipeline loads signups, sorts them newest first, keeps the first (and therefore most recent) row per email address, and returns at most 100 rows: the 100 most recent unique signups.
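
For readers who want to check the logic outside the pipeline, the same sort, dedup, limit sequence can be sketched in plain Python. The function name `latest_unique_signups` is hypothetical. The sketch assumes `signed_up_at` holds ISO 8601 strings in a single timezone, so that string comparison matches chronological order, and that deduplication preserves the sorted order.

```python
def latest_unique_signups(rows, cap=100):
    """Sort newest first, keep the first row per email, cap the result (sketch)."""
    # Step 1 (data.sort): newest first. Assumes ISO 8601 timestamp strings.
    newest_first = sorted(rows, key=lambda r: r["signed_up_at"], reverse=True)

    # Step 2 (data.dedup, keep: first): the first row seen per email is the newest.
    seen = set()
    unique = []
    for row in newest_first:
        if row["email"] not in seen:
            seen.add(row["email"])
            unique.append(row)

    # Step 3 (data.limit): at most `cap` rows.
    return unique[:cap]


signups = [
    {"email": "a@example.com", "signed_up_at": "2024-01-01T00:00:00"},
    {"email": "b@example.com", "signed_up_at": "2024-02-01T00:00:00"},
    {"email": "a@example.com", "signed_up_at": "2024-03-01T00:00:00"},
]
for row in latest_unique_signups(signups):
    print(row["email"], row["signed_up_at"])
```

Swapping the order of the first two steps would change the result: deduplicating before sorting keeps whichever row happened to come first in the file, not the most recent one.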