Methodology
How we collect, normalize, deduplicate, and serve six federal labor datasets. Open, audit-friendly, and explicit about what we cannot do.
Last updated: 2026-05-12. Material methodology changes are noted in the changelog at the bottom of this page.
Sources at a glance
| Dataset | Source | Cadence | Coverage | License |
|---|---|---|---|---|
| WARN Act notices | 50 state workforce agencies | Daily | 1988–present, 47/50 states | Public records / CC0 |
| H-1B / LCA petitions | USCIS + DOL Office of Foreign Labor Certification | Monthly | FY2009–present (LCA back to FY2012) | Public records / CC0 |
| SEC 8-K filings | SEC EDGAR | Daily | 2014–present (items 1.03 & 2.05) | Public records / CC0 |
| Bankruptcy filings | PACER, SEC, FJC IDB | Daily | 2015–present | Public records / CC0 |
| DOL unemployment claims | U.S. Department of Labor | Weekly | 1984–present | Public records / CC0 |
| JOLTS labor turnover | BLS Job Openings & Labor Turnover Survey | Monthly | 2000–present | Public records / CC0 |
WARN Act data — 50 state scrapers
The Worker Adjustment and Retraining Notification (WARN) Act requires U.S. employers with 100 or more employees to give 60 days' advance notice of mass layoffs and plant closures. Each state collects these filings and publishes them through its labor or workforce agency; there is no central federal repository.
WARN Firehose operates 50 independent state-specific scrapers, each tailored to that agency's publishing format (PDF, HTML table, Excel, or CSV). Every scraper runs daily at 03:00 UTC and:
- Fetches the latest filings from the agency website (or PDF, where the state still publishes that way)
- Parses structured fields: company name, city, county, state, employees affected, notice date, effective date, layoff type
- Generates a deterministic record ID, `{STATE}-{YEAR}-{md5[:8]}`, so the same filing produces the same ID on re-scrape (idempotent ingest)
- Validates against a schema and writes to SQLite with WAL mode
- Fires webhooks to active subscribers if matched on watched companies
Source URLs for every record are preserved in the database. Each public record page on this site links back to the originating state agency filing.
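The ID scheme above can be sketched in a few lines. This is a minimal illustration, not the production code: the page specifies only the `{STATE}-{YEAR}-{md5[:8]}` shape, so the exact fields hashed here (company, city, notice date) are assumptions.

```python
import hashlib

def record_id(state: str, year: int, company: str, city: str, notice_date: str) -> str:
    """Build a {STATE}-{YEAR}-{md5[:8]} ID that is stable across re-scrapes.

    The hash input is an assumption for illustration; any stable set of
    identifying fields works, as long as it is normalized the same way
    on every scrape.
    """
    key = "|".join([company.strip().lower(), city.strip().lower(), notice_date])
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()[:8]
    return f"{state.upper()}-{year}-{digest}"

# Same filing, same ID — re-ingesting is a no-op (idempotent ingest).
a = record_id("CA", 2026, "Acme Corp", "Fresno", "2026-04-01")
b = record_id("CA", 2026, "Acme Corp ", "Fresno", "2026-04-01")
assert a == b
```

Because the ID is derived from the record's own fields rather than an auto-increment counter, a daily re-scrape can simply upsert on the ID without creating duplicates.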
Cross-dataset joins
Public WARN notices alone don't tell the whole story. We cross-reference each WARN record against five other federal datasets to surface signals that individual sources miss:
- SEC 8-K item 2.05 — "Costs Associated With Exit or Disposal Activities," typically filed before public WARN notices and often the earliest legally required disclosure of restructuring
- SEC 8-K item 1.03 — "Bankruptcy or Receivership" notices, frequently filed days to weeks before WARN
- Chapter 11 / Chapter 7 bankruptcy filings — cross-matched on employer name + state
- H-1B / LCA petitions — the same employer's recent visa sponsorship volume, surfaced on company pages so H-1B holders can see if their sponsor's hiring has frozen before a WARN drops
- DOL initial unemployment claims by state — the leading macroeconomic indicator that contextualizes a WARN filing
These joins power the Risk Signal API and the cross-referenced data tables on every company page.
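A cross-dataset join of this kind reduces to matching on a canonical employer key plus state. The sketch below uses hypothetical `warn` and `bankruptcy` tables with invented column names; the production schema is not published on this page.

```python
import sqlite3

# Two toy tables keyed on a canonical employer slug (see the
# normalization rules below for how variants collapse to one slug).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warn (employer_slug TEXT, state TEXT, notice_date TEXT, employees INTEGER);
CREATE TABLE bankruptcy (employer_slug TEXT, state TEXT, filed_date TEXT, chapter TEXT);
INSERT INTO warn VALUES ('acme-corp', 'CA', '2026-04-01', 310);
INSERT INTO bankruptcy VALUES ('acme-corp', 'CA', '2026-03-20', '11');
""")

# Join on employer name + state, as described for bankruptcy matching.
rows = conn.execute("""
    SELECT w.employer_slug, w.notice_date, b.filed_date, b.chapter
    FROM warn w
    JOIN bankruptcy b
      ON b.employer_slug = w.employer_slug AND b.state = w.state
""").fetchall()
# → [('acme-corp', '2026-04-01', '2026-03-20', '11')]
```

Here the Chapter 11 filing precedes the WARN notice by twelve days, which is exactly the kind of early signal the joins are meant to surface.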
Normalization rules
Public datasets are messy. WARN Firehose applies the following normalization in the ingest pipeline:
- Company name normalization — "Amazon.com, Inc.", "Amazon Inc", "AMAZON.COM SERVICES LLC", and "Amazon Web Services" are unified under a canonical employer slug. Variants are preserved in a `display_name` column.
- NAICS industry codes — back-filled from LCA petition records, sibling company records, known-employer mapping, and keyword classification.
- County geocoding — when a record has a city but no county, we infer from a US Census place-to-county mapping.
- Date parsing — multiple formats (ISO, US slashes, European slashes, "Jan 5, 2026") collapsed to ISO 8601.
- Slug generation — URL slugs are deterministic, lowercase, hyphen-separated, ASCII-only.
- Deduplication — records matching on company name + city + notice date within a 30-day window are merged. The earliest-scraped version wins; the others are kept for audit.
Documented gaps
We do not generate or estimate data we don't have. The gaps quantified below are real and disclosed.
Data quality status
Current quality metrics (refreshed 2026-03-08):
- WARN records: 3.4% missing city, 1.1% missing county, 6.7% missing industry (continuing to back-fill from LCA + keyword classifiers)
- LCA petitions: 24.1% missing wage data (older fiscal years), 47.7% missing SOC code (older fiscal years)
- H-1B petitions: 0.3% missing state (clean)
- SEC 8-K filings: 100% parsed for items 1.03 & 2.05
- Bankruptcies: 77.2% WARN-matched, 71.1% missing chapter (source-data limitation)
Accuracy guarantees
WARN Firehose surfaces public records as filed. We do not guarantee that the underlying agencies are correct. If a state agency publishes an error (wrong employer name, wrong employee count, wrong effective date), we mirror that error until they correct it. We are a data aggregator, not a primary source.
If you find a record we ingested incorrectly — for example, a parsing failure that misattributed a row — email [email protected] with the record ID and source URL and we will investigate within one business day.
Open source
The scraper pipeline, normalization code, and SEO page generators are publicly auditable at github.com/sendkamal. Issues, PRs, and reports of parsing failures are welcome.
Changelog
- 2026-05-12: Initial publication of this methodology page.
- 2026-03-08: NAICS back-fill pipeline reduced missing-industry rate from 13.2% to 6.7%.
- 2026-03-05: LCA schema expanded from 22 to 40 columns (employer address, worksite county, H1B-dependent flags, etc.).
- 2026-02-22: Quarterly federal data import workflow added to GitHub Actions (LCA + H-1B + JOLTS).
- 2024-12-01: Site launched with initial 14 state scrapers; expanded to 50 by mid-2025.