Methodology
How we collect, normalize, deduplicate, and serve six federal labor datasets. Open, audit-friendly, and explicit about what we cannot do.
Last updated: 2026-05-12. Material methodology changes are noted in the changelog at the bottom of this page.
Sources at a glance
| Dataset | Source | Cadence | Coverage | License |
|---|---|---|---|---|
| WARN Act notices | 50 state workforce agencies | Daily | 1988–present, 47/50 states | Public records / CC0 |
| H-1B / LCA petitions | USCIS + DOL Office of Foreign Labor Certification | Monthly | FY2009–present (LCA back to FY2012) | Public records / CC0 |
| SEC 8-K filings | SEC EDGAR | Daily | 2014–present (items 1.03 & 2.05) | Public records / CC0 |
| Bankruptcy filings | PACER, SEC, FJC IDB | Daily | 2015–present | Public records / CC0 |
| DOL unemployment claims | U.S. Department of Labor | Weekly | 1984–present | Public records / CC0 |
| JOLTS labor turnover | BLS Job Openings & Labor Turnover Survey | Monthly | 2000–present | Public records / CC0 |
WARN Act data — 50 state scrapers
The Worker Adjustment and Retraining Notification (WARN) Act requires U.S. employers with 100 or more employees to give 60 days' advance notice of mass layoffs and plant closures. Each state collects these filings and publishes them through its labor or workforce agency; there is no central federal repository.
WARN Firehose operates 50 independent state-specific scrapers, each tailored to that agency's publishing format (PDF, HTML table, Excel, or CSV). Every scraper runs daily at 03:00 UTC and:
- Fetches the latest filings from the agency website (or PDF, where the state still publishes that way)
- Parses structured fields: company name, city, county, state, employees affected, notice date, effective date, layoff type
- Generates a deterministic record ID, `{STATE}-{YEAR}-{md5[:8]}`, so the same filing produces the same ID on re-scrape (idempotent ingest)
- Validates against a schema and writes to SQLite with WAL mode
- Fires webhooks to active subscribers if matched on watched companies
Source URLs for every record are preserved in the database. Each public record page on this site links back to the originating state agency filing.
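The ID scheme above can be sketched in a few lines. This is a minimal illustration, not the production code: the page specifies only the `{STATE}-{YEAR}-{md5[:8]}` shape, so the exact fields hashed here (company, city, notice date) are assumptions.

```python
import hashlib

def record_id(state: str, year: int, company: str, city: str, notice_date: str) -> str:
    """Build a {STATE}-{YEAR}-{md5[:8]} ID that is stable across re-scrapes.

    The hash input is an assumption for illustration; any stable set of
    identifying fields works, as long as it is normalized the same way
    on every scrape.
    """
    key = "|".join([company.strip().lower(), city.strip().lower(), notice_date])
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()[:8]
    return f"{state.upper()}-{year}-{digest}"

# Same filing, same ID — re-ingesting is a no-op (idempotent ingest).
a = record_id("CA", 2026, "Acme Corp", "Fresno", "2026-04-01")
b = record_id("CA", 2026, "Acme Corp ", "Fresno", "2026-04-01")
assert a == b
```

Because the ID is derived from the record's own fields rather than an auto-increment counter, a daily re-scrape can simply upsert on the ID without creating duplicates.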
Cross-dataset joins
Public WARN notices alone don't tell the whole story. We cross-reference each WARN record against five other federal datasets to surface signals that individual sources miss:
- SEC 8-K item 2.05 — "Costs Associated With Exit or Disposal Activities," typically filed before public WARN notices and often the earliest legally required disclosure of restructuring
- SEC 8-K item 1.03 — "Bankruptcy or Receivership" notices, frequently filed days to weeks before WARN
- Chapter 11 / Chapter 7 bankruptcy filings — cross-matched on employer name + state
- H-1B / LCA petitions — the same employer's recent visa sponsorship volume, surfaced on company pages so H-1B holders can see if their sponsor's hiring has frozen before a WARN drops
- DOL initial unemployment claims by state — the leading macroeconomic indicator that contextualizes a WARN filing
These joins power the Risk Signal API and the cross-referenced data tables on every company page.
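A cross-dataset join of this kind reduces to matching on a canonical employer key plus state. The sketch below uses hypothetical `warn` and `bankruptcy` tables with invented column names; the production schema is not published on this page.

```python
import sqlite3

# Two toy tables keyed on a canonical employer slug (see the
# normalization rules below for how variants collapse to one slug).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warn (employer_slug TEXT, state TEXT, notice_date TEXT, employees INTEGER);
CREATE TABLE bankruptcy (employer_slug TEXT, state TEXT, filed_date TEXT, chapter TEXT);
INSERT INTO warn VALUES ('acme-corp', 'CA', '2026-04-01', 310);
INSERT INTO bankruptcy VALUES ('acme-corp', 'CA', '2026-03-20', '11');
""")

# Join on employer name + state, as described for bankruptcy matching.
rows = conn.execute("""
    SELECT w.employer_slug, w.notice_date, b.filed_date, b.chapter
    FROM warn w
    JOIN bankruptcy b
      ON b.employer_slug = w.employer_slug AND b.state = w.state
""").fetchall()
# → [('acme-corp', '2026-04-01', '2026-03-20', '11')]
```

Here the Chapter 11 filing precedes the WARN notice by twelve days, which is exactly the kind of early signal the joins are meant to surface.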
Normalization rules
Public datasets are messy. WARN Firehose applies the following normalization in the ingest pipeline:
- Company name normalization — "Amazon.com, Inc.", "Amazon Inc", "AMAZON.COM SERVICES LLC", and "Amazon Web Services" are unified under a canonical employer slug. Variants are preserved in a `display_name` column.
- NAICS industry codes — back-filled from LCA petition records, sibling company records, known-employer mapping, and keyword classification.
- County geocoding — when a record has a city but no county, we infer from a US Census place-to-county mapping.
- Date parsing — multiple formats (ISO, US slashes, European slashes, "Jan 5, 2026") collapsed to ISO 8601.
- Slug generation — URL slugs are deterministic, lowercase, hyphen-separated, ASCII-only.
- Deduplication — records matching on company name + city + notice date within a 30-day window are merged. The earliest-scraped version wins; the others are kept for audit.
Documented gaps
We do not generate or estimate data we don't have. The gaps quantified below are real and disclosed.
Data quality status
Current quality metrics (refreshed 2026-03-08):
- WARN records: 3.4% missing city, 1.1% missing county, 6.7% missing industry (continuing to back-fill from LCA + keyword classifiers)
- LCA petitions: 24.1% missing wage data (older fiscal years), 47.7% missing SOC code (older fiscal years)
- H-1B petitions: 0.3% missing state (clean)
- SEC 8-K filings: 100% parsed for items 1.03 & 2.05
- Bankruptcies: 77.2% WARN-matched, 71.1% missing chapter (source-data limitation)
Accuracy guarantees
WARN Firehose surfaces public records as filed. We do not guarantee that the underlying agencies are correct. If a state agency publishes an error (wrong employer name, wrong employee count, wrong effective date), we mirror that error until they correct it. We are a data aggregator, not a primary source.
If you find a record we ingested incorrectly — for example, a parsing failure that misattributed a row — email [email protected] with the record ID and source URL and we will investigate within one business day.
Open source
The scraper pipeline, normalization code, and SEO page generators are publicly auditable at github.com/sendkamal. Issues, PRs, and reports of parsing failures are welcome.
Changelog
- 2026-05-12: Initial publication of this methodology page.
- 2026-03-08: NAICS back-fill pipeline reduced missing-industry rate from 13.2% to 6.7%.
- 2026-03-05: LCA schema expanded from 22 to 40 columns (employer address, worksite county, H1B-dependent flags, etc.).
- 2026-02-22: Quarterly federal data import workflow added to GitHub Actions (LCA + H-1B + JOLTS).
- 2024-12-01: Site launched with initial 14 state scrapers; expanded to 50 by mid-2025.