
Referential integrity

A GTFS feed is a small relational database with roughly 20 foreign-key relations. Editing it naively (dropping a stop, renaming a route) is a fast way to produce a feed that validates structurally but fails in Google Maps because rows in stop_times.txt now reference records that no longer exist. Referential integrity is the property that prevents this: every foreign key points to an existing row.

gapline enforces integrity on every write. No edit reaches disk until the integrity model has confirmed that the result is consistent.

Internally, gapline maintains a reverse-index for every primary key used as a foreign key elsewhere. When you load a feed, the index maps:

stops.stop_id → rows in stop_times, transfers, pathways that reference it
routes.route_id → rows in trips, fare_rules that reference it
trips.trip_id → rows in stop_times, frequencies
calendar.service_id → rows in trips, calendar_dates

Lookups in the index are O(1) on average per primary key, so building the cascade plan for a delete on a mid-sized feed takes milliseconds.

Deliberately, the index is just hash maps: no graph library, no path-finding, no cycle detection. GTFS foreign-key chains are shallow (at most 2–3 hops) so this is enough.
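To make the "just hash maps" point concrete, here is a minimal Python sketch of such a reverse index. It is not gapline's actual code: the `RELATIONS` table, the `build_reverse_index` function, and the in-memory feed layout (a dict of lists of row dicts) are all assumptions made for illustration, and only two of the relations are modeled.

```python
from collections import defaultdict

# Hypothetical subset of the relations listed above:
# (parent file, key column) -> [(child file, foreign-key column), ...]
RELATIONS = {
    ("stops", "stop_id"): [
        ("stop_times", "stop_id"),
        ("transfers", "from_stop_id"),
        ("transfers", "to_stop_id"),
    ],
    ("trips", "trip_id"): [("stop_times", "trip_id")],
}


def build_reverse_index(feed):
    """Map (parent file, key column, key value) -> {(child file, row number), ...}."""
    index = defaultdict(set)
    for (parent, key), children in RELATIONS.items():
        for child, fk_column in children:
            for row_number, row in enumerate(feed.get(child, [])):
                value = row.get(fk_column)
                if value:
                    # A set keeps a row from being listed twice, e.g. a transfer
                    # whose from_stop_id and to_stop_id are the same stop.
                    index[(parent, key, value)].add((child, row_number))
    return index


feed = {
    "stops": [{"stop_id": "S01"}, {"stop_id": "S02"}],
    "stop_times": [
        {"trip_id": "T1", "stop_id": "S01"},
        {"trip_id": "T1", "stop_id": "S02"},
    ],
    "transfers": [{"from_stop_id": "S01", "to_stop_id": "S02"}],
}

index = build_reverse_index(feed)
# One dictionary lookup answers "who references stop S01?"
dependents = index.get(("stops", "stop_id", "S01"), set())
print(sorted(dependents))
```

Building the index costs one pass over each dependent file. After that, each "who references this key?" question is a single dictionary lookup.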

delete cannot orphan dependents. When you run:

gapline delete stops --where "stop_id=S01"

gapline:

  1. Computes the set of stop_id = S01 matches in stops.txt.
  2. For each match, walks the reverse index to find every row in every dependent file that references the match transitively.
  3. Prints a preview:
    Records to delete from stops.txt:
    S01
    Deleting would also delete:
    - 83 records in stop_times.txt
    - 2 records in transfers.txt
    Proceed with cascade delete? [y/N]
  4. Applies the plan only after you confirm (or if you passed --confirm).

There is no --cascade flag on delete because cascading is the only safe behavior, so it is always on. If the target has no dependents (for example, calendar_dates.txt is a leaf), the preview simply lists the matched rows.
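The planning and preview steps can be sketched as a breadth-first walk over a reverse index. This is an illustrative Python sketch, not gapline's implementation. The hand-built `REVERSE_INDEX`, the row identifiers such as `T1#1`, and the functions `plan_cascade` and `format_preview` are hypothetical. The example deletes a route rather than a stop, so that the walk is transitive (routes → trips → stop_times).

```python
from collections import Counter, deque

# Hypothetical reverse index: (file, row id) -> rows that reference it.
REVERSE_INDEX = {
    ("routes", "R1"): [("trips", "T1"), ("trips", "T2")],
    ("trips", "T1"): [("stop_times", "T1#1"), ("stop_times", "T1#2")],
    ("trips", "T2"): [("stop_times", "T2#1")],
}


def plan_cascade(targets, reverse_index):
    """Return every row that references the targets, directly or transitively."""
    seen = set(targets)  # avoids counting a row twice if two parents reach it
    queue = deque(targets)
    dependents = []
    while queue:
        row = queue.popleft()
        for child in reverse_index.get(row, []):
            if child not in seen:
                seen.add(child)
                dependents.append(child)
                queue.append(child)
    return dependents


def format_preview(targets, dependents):
    """Render the plan in the same shape as the preview shown above."""
    lines = [f"Records to delete from {targets[0][0]}.txt:"]
    lines += [row_id for _, row_id in targets]
    if dependents:
        lines.append("Deleting would also delete:")
        counts = Counter(file for file, _ in dependents)
        lines += [f"- {n} records in {file}.txt" for file, n in counts.items()]
    return "\n".join(lines)


targets = [("routes", "R1")]
plan = plan_cascade(targets, REVERSE_INDEX)
print(format_preview(targets, plan))
```

Nothing is modified while the plan is computed. Planning and applying are separate steps, so the preview can be shown and declined without touching the feed.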

A non-PK update (say, changing a stop_name) touches only the target file. No cascade is needed, no cascade is computed.

A PK update (say, renaming stop_id=S01 to stop_id=STOP_MAIN) is different: every row in every dependent file that references the old PK needs to be rewritten to reference the new one. --cascade opts into this rewrite:

gapline update stops \
--where "stop_id=S01" \
--set stop_id=STOP_MAIN \
--cascade --confirm

Without --cascade, the PK rewrite is refused before it starts — the command would otherwise orphan every stop_times row that references S01.
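As a rough illustration of what a cascading primary-key rename involves, the Python sketch below rewrites every reference to the old stop_id along with the stop itself. It is not gapline's code. The function `rename_stop_id`, the `STOP_ID_REFERENCES` list (which leaves out other referencing files such as pathways), and the in-memory feed layout are assumptions.

```python
# Hypothetical list of columns that reference stops.stop_id.
STOP_ID_REFERENCES = [
    ("stop_times", "stop_id"),
    ("transfers", "from_stop_id"),
    ("transfers", "to_stop_id"),
]


def rename_stop_id(feed, old, new, cascade=False):
    """Rename a stop_id and every reference to it; return the number of references rewritten."""
    if not cascade:
        # Mirrors the rule above: a PK rewrite without cascade is refused up front.
        raise ValueError(f"refusing to rename stop_id={old} without cascade")

    # Collect every (column, row) pair that points at the old key before changing anything.
    referencing = [
        (column, row)
        for file, column in STOP_ID_REFERENCES
        for row in feed.get(file, [])
        if row.get(column) == old
    ]

    for stop in feed["stops"]:
        if stop["stop_id"] == old:
            stop["stop_id"] = new
    for column, row in referencing:
        row[column] = new
    return len(referencing)


feed = {
    "stops": [{"stop_id": "S01", "stop_name": "Main St"}],
    "stop_times": [
        {"trip_id": "T1", "stop_id": "S01"},
        {"trip_id": "T2", "stop_id": "S01"},
    ],
    "transfers": [{"from_stop_id": "S01", "to_stop_id": "S02"}],
}

rewritten = rename_stop_id(feed, "S01", "STOP_MAIN", cascade=True)
print(rewritten, feed["stop_times"])
```

Collecting the references before mutating anything keeps the rewrite all-or-nothing at the planning stage: either the full set of affected fields is known, or nothing has been changed yet.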

create refuses to insert a record whose foreign-key fields do not point to existing rows. For example:

gapline create stop-times --set trip_id=UNKNOWN stop_id=S01 ...

fails immediately with an fk_violation error — trip_id=UNKNOWN is not in trips.txt.
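The check behind this refusal can be sketched in a few lines of Python. This is an assumption-laden illustration rather than gapline's implementation: `FkViolation`, `STOP_TIMES_FOREIGN_KEYS`, `create_stop_time`, and the error message wording are all hypothetical.

```python
class FkViolation(Exception):
    """Hypothetical error type standing in for the fk_violation error."""


# Hypothetical: foreign-key column in stop_times -> (parent file, parent key column).
STOP_TIMES_FOREIGN_KEYS = {
    "trip_id": ("trips", "trip_id"),
    "stop_id": ("stops", "stop_id"),
}


def create_stop_time(feed, record):
    """Append the record to stop_times only if every foreign key resolves."""
    for column, (parent, key) in STOP_TIMES_FOREIGN_KEYS.items():
        known = {row[key] for row in feed[parent]}
        if record.get(column) not in known:
            raise FkViolation(
                f"fk_violation: {column}={record.get(column)} is not in {parent}.txt"
            )
    # All checks passed, so the write can go ahead.
    feed["stop_times"].append(record)


feed = {
    "trips": [{"trip_id": "T1"}],
    "stops": [{"stop_id": "S01"}],
    "stop_times": [],
}

try:
    create_stop_time(feed, {"trip_id": "UNKNOWN", "stop_id": "S01"})
except FkViolation as error:
    print(error)

create_stop_time(feed, {"trip_id": "T1", "stop_id": "S01"})
print(len(feed["stop_times"]))
```

Every foreign key is validated before the append, so a rejected record leaves the feed exactly as it was.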

A feed that passes structural validation but has orphaned references is the worst class of broken: it looks fine in a quick check but fails silently in production. Consumers handle orphans inconsistently — some skip the affected rows, some reject the feed entirely, some render partial data and never surface the error.

By enforcing integrity at write time, gapline makes this class of bug impossible to create through the CLI. The trade-off is that delete and update --cascade need to plan the full cascade before applying it — usually a few milliseconds, occasionally a few seconds on very large feeds.