Population Core Data Product
The Population CDP is the unified source of truth for the synthesized population. It produces household-level and person-level records with demographic attributes, derived from multiple census and commercial data sources.
What is a synthetic population?
Replica creates a "synthetic population" that is statistically representative of the actual population. It is built by sampling and recombining Census demographic microdata (primarily PUMS records) and then calibrating the totals against ACS population estimates. The result is a full set of household and person records — not real individuals, but a statistically faithful representation of who lives, works, and travels in a region.
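As a rough illustration of the sample-and-calibrate idea, the sketch below draws seed households with replacement, in proportion to their survey weights, until a target person total is reached. The seed records, weights, field names, and target are invented for the example, and the target merely stands in for an ACS estimate; this document does not describe the production synthesis at this level of detail.

```python
import random

# Hypothetical PUMS-like seed households; sizes and weights are invented.
SEED_HOUSEHOLDS = [
    {"id": "A", "persons": 1, "weight": 30.0},
    {"id": "B", "persons": 2, "weight": 50.0},
    {"id": "C", "persons": 4, "weight": 20.0},
]


def synthesize(seed, target_persons, rng):
    """Draw seed households with replacement until the person total reaches the target."""
    weights = [h["weight"] for h in seed]
    drawn, total = [], 0
    while total < target_persons:
        household = rng.choices(seed, weights=weights, k=1)[0]
        drawn.append(household)
        total += household["persons"]
    return drawn, total


rng = random.Random(0)  # fixed seed so the sketch is reproducible
households, persons = synthesize(SEED_HOUSEHOLDS, target_persons=1000, rng=rng)
print(len(households), persons)
```

Because the last household drawn can push the total past the target, the result may overshoot by up to the largest household size minus one.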
Details about which version of Census data is used for a given megaregion can be found in a Places Study by clicking the megaregion/season/day filter button and selecting "View Data Source Details." Weekly (Trends) datasets use 2019 ACS/PUMS data; Places seasons vary by megaregion.
Source Data
| Source | Description |
|---|---|
| ACS | American Community Survey 5-year population estimates — used as the calibration target for total population counts |
| CTPP | Census Transportation Planning Product — home-to-work commute flows |
| LEHD | Longitudinal Employer-Household Dynamics — workplace location and employment data |
| PUMS | Public Use Microdata Sample — household and person microdata used as the seed for synthesis |
| BLS | Bureau of Labor Statistics employment data |
| Epsilon | Commercial demographic and consumer marketing data |
| EV registrations | Electric vehicle registration and sales data (including Atlas EV Hub) used to calibrate state-level BEV totals |
Person Attributes
Each synthetic person carries the following attributes:
| Attribute | Notes |
|---|---|
| Age | |
| Sex | |
| Race and ethnicity | |
| Primary language | |
| Employment status | Binary employed/unemployed derived from ACS (see note on WORK trips below) |
| Industry of employment | 2017 NAICS codes; granularity varies between 4 and 6 digits (see Industry section) |
| Income | Individual and household income |
| Work-from-home status | Whether the person primarily works from home |
| Vehicle ownership | Whether the household owns a vehicle |
| BEV ownership | Whether the household owns a battery electric vehicle (see BEV section) |
| Home location | Census block/tract |
| Work location | |
| Resident or visitor status | See Visitors section for important caveats |
Household Attributes
Household records include:
- Location (census block/tract)
- Household size
- Income bracket
- Vehicle ownership count
- Housing type
Industry of Employment
Industry of employment uses 2017 NAICS codes. The level of granularity is variable:
- The primary source for detailed industry data is point-of-interest and parcel-level land use data.
- When multiple NAICS codes are inferred from a single parcel, Replica calculates the proportion of codes at different levels and selects the level with the highest confidence. For example, if three unique 6-digit NAICS codes share the same 5-digit prefix, the 5-digit code is assigned.
- Some industries have codes at only 4 digits instead of the full 6.
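The roll-up described above can be sketched as a longest-shared-prefix search. This is only an illustration: the function name and the `min_share` threshold are assumptions for the example, not Replica's actual confidence rule.

```python
from collections import Counter


def assign_naics(codes, min_share=0.8, min_digits=2):
    """Return the longest NAICS prefix shared by at least `min_share` of the codes.

    `min_share` is a hypothetical confidence threshold chosen for this sketch.
    Returns None if no prefix of at least `min_digits` digits qualifies.
    """
    if not codes:
        return None
    max_digits = min(len(code) for code in codes)
    # Walk from the most detailed level down to the coarsest one.
    for digits in range(max_digits, min_digits - 1, -1):
        prefix, count = Counter(code[:digits] for code in codes).most_common(1)[0]
        if count / len(codes) >= min_share:
            return prefix
    return None


# Three unique 6-digit codes that share the 5-digit prefix 72251.
print(assign_naics(["722511", "722513", "722515"]))  # 72251
# Full agreement keeps the 6-digit code.
print(assign_naics(["722511", "722511", "722511"]))  # 722511
```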
Battery Electric Vehicle (BEV) Data
BEV ownership is modeled using two sources:
- Third-party consumer marketing data — provides the geospatial distribution of BEV owners and the relationship between BEV ownership and sociodemographic attributes.
- Registration and sales data (e.g., Atlas EV Hub) — used to calibrate state-level BEV totals.
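One simple way to combine the two sources is to treat the marketing data as relative household propensities and rescale them so they sum to the registration-based state total. The sketch below assumes exactly that; the function name, household IDs, and numbers are hypothetical, and the real calibration may work differently.

```python
def calibrate_bev(propensities, state_bev_total):
    """Scale household BEV propensities so their sum equals the state total.

    Proportional scaling keeps the relative ranking of households
    (the geospatial and demographic pattern) while matching the total.
    """
    raw_total = sum(propensities.values())
    if raw_total <= 0:
        raise ValueError("propensities must sum to a positive value")
    factor = state_bev_total / raw_total
    scaled = {hh: p * factor for hh, p in propensities.items()}
    if any(p > 1.0 for p in scaled.values()):
        # Not handled in this sketch: probabilities above 1 need a capping step.
        raise ValueError("scaled propensity exceeds 1")
    return scaled


# Hypothetical propensities for four households in a tiny "state" with 1 BEV.
raw = {"hh1": 0.10, "hh2": 0.30, "hh3": 0.05, "hh4": 0.05}
calibrated = calibrate_bev(raw, state_bev_total=1.0)
print(calibrated)
```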
Visitors
Visitors are people who do not normally live or work in the megaregion. A visitor either:
- stayed overnight in the megaregion, or
- entered and exited via a port of entry (usually an airport) on the same day.
Visitor movements can be viewed using the "Resident/Visitor Status" module in Places Studies.
Important: Replica does not include the demographic characteristics (age, income, etc.) of visitors, so filtering to visitors only shows no data in demographic modules. This is also one reason some trips have no associated person data (see "Why do some trips have no person data?" below).
Modeling Minors
Data suppliers are legally prohibited from collecting data on minors, so minors are modeled with simplifications:
- Minors are assigned a school based on their age and the school's proximity to their home location.
- Overall enrollment totals are based on publicly available enrollment data.
- Home locations are derived from census data.
- Additional factors that may contribute to school choice (district boundaries, private school selection, etc.) are not accounted for.
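The age-and-proximity rule above can be sketched as a nearest-eligible-school lookup. The school records, coordinates, and age ranges are invented, and the sketch ignores the enrollment totals that the real model also respects.

```python
import math

# Hypothetical schools with the ages they serve and a planar (x, y) location.
SCHOOLS = [
    {"name": "Elm Elementary", "min_age": 5, "max_age": 10, "xy": (0.0, 0.0)},
    {"name": "Oak Elementary", "min_age": 5, "max_age": 10, "xy": (5.0, 5.0)},
    {"name": "Pine High", "min_age": 14, "max_age": 18, "xy": (1.0, 1.0)},
]


def assign_school(age, home_xy, schools):
    """Return the name of the closest school that serves `age`, or None."""
    eligible = [s for s in schools if s["min_age"] <= age <= s["max_age"]]
    if not eligible:
        return None
    nearest = min(eligible, key=lambda s: math.dist(home_xy, s["xy"]))
    return nearest["name"]


print(assign_school(7, (4.0, 4.0), SCHOOLS))   # Oak Elementary
print(assign_school(16, (4.0, 4.0), SCHOOLS))  # Pine High
```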
Underrepresented Populations
The synthetic population is calibrated against recent Census ACS estimates. Because ACS itself has coverage gaps, calibration includes small margins of error. Replica does not need to observe every mobile device in a given area: scaling samples to represent the movements of all people does not depend on every person having a mobile device.
Rural and Sparsely Populated Areas
Replica uses a composite of data sources with national coverage. Rural areas with lower population density do have coverage, but the data will reflect that lower density. Transit data coverage is more limited in rural areas because Replica currently models only transit agencies that meet a minimum ridership threshold of 500 daily boardings per route, so smaller agencies in smaller cities are often not covered.
Common Query Gotchas
Why do some trips have no person data?
Trips with no person data (age, home location, etc.) are either:
- Commercial (freight) trips — these are vehicle movements, not person trips.
- Visitor trips — Replica does not attach demographic attributes to visitors.
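In practice this shows up when trips are joined to person records: commercial and visitor trips have nothing to join to. The sketch below separates the two groups before any demographic analysis; the field names (`person_id`, `kind`) and sample rows are hypothetical and do not reflect the actual table schema.

```python
# Hypothetical person records keyed by person ID.
persons = {"p1": {"age": 34}, "p2": {"age": 51}}

# Hypothetical trips; commercial and visitor trips carry no person ID.
trips = [
    {"trip_id": 1, "person_id": "p1", "kind": "resident"},
    {"trip_id": 2, "person_id": None, "kind": "commercial"},
    {"trip_id": 3, "person_id": None, "kind": "visitor"},
    {"trip_id": 4, "person_id": "p2", "kind": "resident"},
]


def split_by_person_data(trips, persons):
    """Return (trips enriched with person attributes, trips with no person match)."""
    with_person, without_person = [], []
    for trip in trips:
        person = persons.get(trip["person_id"])
        if person is None:
            without_person.append(trip)
        else:
            with_person.append({**trip, **person})
    return with_person, without_person


matched, unmatched = split_by_person_data(trips, persons)
print([t["trip_id"] for t in matched])    # [1, 4]
print([t["kind"] for t in unmatched])     # ['commercial', 'visitor']
```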
Why do "WORK" trips appear for unemployed people?
A small percentage of unemployed people taking WORK trips is expected. The synthetic population uses ACS data that covers an entire year and classifies employment as a binary (employed/unemployed). An "unemployed" person may have been employed occasionally, may have held a part-time or temporary job, or may fall into categories such as:
- Unemployed workers going to job interviews
- Day laborers
- Students categorized as not in the labor force but still commuting to part-time jobs
Data Configs
Located in populationgen/configs/:
- usa.yaml and mini_kc.yaml — current season configs
- {season}/ — archived configs for past seasons
- ingestion/ — unmanaged source ingestion configs
Building
- Manual: arti build <path>
- Preferred for USA: use the run-population GitHub Action workflow
Publishing
Published to GCS: gs://core_data_products/population/{season}/{version}/{region}.yaml
The mini_kc config must build successfully before merging to master (enforced by CI).
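For scripts that need to locate a published artifact, the GCS path template above can be filled in with a small helper. The helper name and the example season and version values are placeholders invented for this sketch; only the template itself comes from this page.

```python
def population_gcs_path(season, version, region):
    """Build the published population path from the documented template."""
    for name, value in (("season", season), ("version", version), ("region", region)):
        # Reject empty parts or embedded slashes, which would change the path depth.
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"gs://core_data_products/population/{season}/{version}/{region}.yaml"


# "example_season" and "v1" are placeholders, not real published values.
print(population_gcs_path("example_season", "v1", "mini_kc"))
```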