Methodology and Data Quality
Technical Approach Overview
Replica is an activity-based travel demand model. The pipeline works in three stages:
- Synthetic population -- Census demographic data (PUMS, ACS, CTPP, LEHD) is used to create a population that is statistically representative of actual residents at the block group level.
- Personas and mode choice -- Machine learning assigns travel personas to the synthetic population using a composite of mobile location data, consumer data, built environment data, and economic activity data. A mode choice model estimates each trip's likelihood of taking a given mode-route combination.
- Agent-based simulation -- Movement is simulated with an agent-based approach that accounts for congestion and interactions between individual travel itineraries. Trips are routed on road, bike, and transit networks.
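The source does not state the functional form of the mode choice model. As a minimal sketch, assuming a multinomial logit (a common choice in travel demand modeling, not confirmed as Replica's method), the likelihood of each mode-route combination could be derived from utility scores. The option names and utility values below are hypothetical.

```python
import math

def choice_probabilities(utilities):
    """Multinomial logit: P(i) = exp(U_i) / sum_j exp(U_j)."""
    # Subtracting the max utility improves numerical stability
    # and does not change the resulting probabilities.
    u_max = max(utilities.values())
    exp_u = {option: math.exp(u - u_max) for option, u in utilities.items()}
    total = sum(exp_u.values())
    return {option: value / total for option, value in exp_u.items()}

# Hypothetical utilities for one trip (illustrative values only).
utilities = {
    "drive_route_a": -1.0,
    "drive_route_b": -1.5,
    "transit": -2.0,
    "bike": -3.0,
}
probs = choice_probabilities(utilities)
for option, p in probs.items():
    print(f"{option}: {p:.3f}")
```

The probabilities sum to 1, and the option with the highest utility receives the largest share.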
After simulation, modeled outputs are compared against ground truth (observed counts) and calibration algorithms iteratively adjust model parameters. The calibration objective is to fit observed ground truth while avoiding overfitting to outliers and noise in the calibration data.
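The calibration loop described above might look roughly like the following. This is a sketch under stated assumptions, not Replica's algorithm: it tunes a single hypothetical parameter (`scale`), uses the median observed-to-modeled ratio so that one misfiring sensor has little influence, and applies a damped update so the fit does not chase noise. `toy_simulate` stands in for the real simulation.

```python
from statistics import median

def calibrate(simulate, observed, scale=1.0, damping=0.5, tol=0.01, max_iter=50):
    """Adjust one hypothetical parameter until modeled counts match
    observed counts at the median station."""
    for _ in range(max_iter):
        modeled = simulate(scale)
        # The median ratio is less sensitive to outliers than the mean.
        ratio = median(observed[k] / modeled[k] for k in observed if modeled[k] > 0)
        if abs(ratio - 1.0) < tol:
            break
        # Damped update: move only part of the way toward the target.
        scale *= 1.0 + damping * (ratio - 1.0)
    return scale

# Toy stand-in for the simulation: volume is proportional to the parameter.
base = {"station_1": 900.0, "station_2": 1800.0, "station_3": 450.0}

def toy_simulate(scale):
    return {k: v * scale for k, v in base.items()}

# Observed counts; station_3 is an outlier (for example, a misfiring sensor).
observed = {"station_1": 1000.0, "station_2": 2000.0, "station_3": 5000.0}
fitted = calibrate(toy_simulate, observed)
print(f"fitted scale: {fitted:.3f}")
```

In this toy case the fitted scale settles near 1.11, the ratio implied by the two well-behaved stations, and the outlier station is effectively ignored.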
Places produces a complete trip and population table for a typical Thursday (weekday) and Saturday (weekend) over a 13-week modeling season.
Trends produces a weekly nationwide origin-destination table at the census tract level, with mode and purpose breakdowns.
How Replica Differs from Other Mobility Data Providers
Most mobility data providers take a single data source (typically location-based services data), apply scaling factors, and compare the result to annual average daily traffic (AADT) counts. Replica takes a fundamentally different approach:
- Uses a composite of five data source categories rather than a single source
- Builds a full synthetic population rather than scaling a sample
- Runs an agent-based simulation that models congestion and route interactions
- Calibrates to ground truth rather than simply scaling to match traffic counts
This means Replica data is a modeled representation of travel patterns, not a scaled-up sample. The strengths and limitations differ accordingly.
Data Sources
Replica uses third-party data from public and private sources in five categories:
| Category | Examples |
|---|---|
| Mobile location data | Location-based services (LBS), vehicle in-dash GPS (telematics), point-of-interest (POI) data |
| Consumer/resident data | Demographic data from Census and private sources (PUMS, ACS, CTPP, LEHD) |
| Built environment data | Land use, building footprints, road/bike/transit networks (OpenStreetMap, GTFS) |
| Economic activity data | Credit, debit, and cash transactions |
| Ground truth data | Auto/freight traffic volumes, transit ridership counts |
To see which data sources were used for a specific model, click the megaregion/season/day filter button in Replica and select "View Data Sources Details."
Calibration and Validation
Places
Each completed Places model includes a Quality Report that compares modeled outputs to ground truth data (traffic counts, transit ridership). If the comparison yields an unacceptable gap, model parameters are iteratively adjusted. The Quality Report is published so users can assess statistical confidence for their geography.
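To run a similar comparison against your own counts, a minimal sketch follows. The metrics here (total percent error and percent RMSE) are common fit statistics chosen for illustration; the source does not say which statistics the Quality Report uses. The link IDs and volumes are made up.

```python
import math

def fit_metrics(modeled, observed):
    """Return simple fit statistics for paired modeled/observed counts."""
    keys = [k for k in observed if k in modeled]
    errors = [modeled[k] - observed[k] for k in keys]
    total_obs = sum(observed[k] for k in keys)
    rmse = math.sqrt(sum(e * e for e in errors) / len(keys))
    return {
        "n": len(keys),
        # Net over/under-estimation across all matched locations.
        "total_pct_error": 100.0 * sum(errors) / total_obs,
        # RMSE expressed as a percentage of the mean observed count.
        "pct_rmse": 100.0 * rmse / (total_obs / len(keys)),
    }

modeled = {"link_1": 1100.0, "link_2": 1900.0, "link_3": 3000.0}
observed = {"link_1": 1000.0, "link_2": 2000.0, "link_3": 3000.0}
metrics = fit_metrics(modeled, observed)
print(metrics)
```

In this example the total error is zero because the link-level errors cancel, while the percent RMSE is still about 4%. Reporting both helps separate regional fit from link-level fit.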
Trends
Trends data is released in near real time, so direct calibration against sensor data is not possible. Instead:
- Metrics are calibrated based on historical regression to ground truth (e.g., Bureau of Transportation Statistics for mobility, Census Bureau for economic activity).
- Certain metrics labeled "Quantified" Trends Estimates (including total trips) include a quantified margin of error for each day.
- For many metrics, no week-to-week ground truth exists, so margin of error cannot be calculated.
Freight
Freight uses a tour-based model: tours are identified from truck GPS data and scaled using ground truth counts. Scaling coefficients vary by geography and time.
BEV
Battery electric vehicle ownership is modeled using third-party consumer marketing data for geospatial distribution. State-level totals are calibrated against BEV registration and sales data.
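One simple way to reconcile a geospatial distribution with a known state total is proportional scaling to a control total. The sketch below illustrates that general technique only; the source does not describe the exact procedure, and the tract names and values are hypothetical.

```python
def scale_to_control_total(raw_by_area, control_total):
    """Rescale area-level values so they sum to a known control total,
    preserving each area's share of the raw distribution."""
    raw_total = sum(raw_by_area.values())
    if raw_total <= 0:
        raise ValueError("raw values must sum to a positive number")
    factor = control_total / raw_total
    return {area: value * factor for area, value in raw_by_area.items()}

# Hypothetical raw ownership signal by tract, and a hypothetical state total.
raw = {"tract_a": 30.0, "tract_b": 50.0, "tract_c": 20.0}
scaled = scale_to_control_total(raw, control_total=250.0)
print(scaled)
```

Each tract keeps its relative share (30%, 50%, 20%), while the sum matches the control total.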
Sources of Uncertainty
Data inputs
Each data source has strengths and weaknesses. Misfiring sensors appear as outliers. Sensor data is not available on all streets. Using multiple data types and vendors mitigates individual source weaknesses, but no input is perfect.
Sample size
Larger aggregates produce higher certainty. Filtering to small numbers of trips -- many demographic/purpose filters in Places, or sparsely populated tracts in Trends -- reduces certainty because noise and outliers have more influence.
Certainty for a given geography also depends on the proportion of the population for which Replica has data. Observed data is scaled to match the true population at the block group level, but areas or population groups with sparse data may see more variance.
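As a rough intuition for why small aggregates are noisier, the sketch below assumes trip counts behave like independent Poisson counts, so relative error shrinks with the square root of the count. This is a simplifying assumption for illustration, not Replica's published error model, and modeled outputs are not a simple random sample.

```python
import math

def approx_relative_error(trip_count):
    """Rough relative standard error, assuming Poisson-like counts."""
    if trip_count <= 0:
        return float("inf")
    return 1.0 / math.sqrt(trip_count)

for n in (25, 400, 10_000):
    print(f"{n:>6} trips -> roughly +/-{approx_relative_error(n):.0%}")
```

Under this assumption, a query returning 25 trips carries about 20% relative error, while one returning 10,000 trips carries about 1%.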
Calibration data availability
Confidence in trip estimates increases with road size and importance, because larger roads have better sensor coverage. Replica is typically less certain about trips on smaller residential roads.
Spatial data input quality
Replica depends on OpenStreetMap for road and bike networks and on agency GTFS for transit. When these inputs are inaccurate or incomplete (wrong turn restrictions, missing stops, missing weekday/weekend schedule variations), model outputs are directly affected.
Confidence Guidelines
Higher confidence:
- Larger geographic areas (megaregion, county, large city)
- Higher-volume roads and corridors
- Aggregate trip counts with few filters applied
- Mode split and purpose distributions at regional scale

Lower confidence -- interpret with caution:
- Individual street segments, especially residential roads
- Small geographies (individual block groups, rural tracts)
- Heavily filtered queries (many demographic + purpose + mode filters stacked)
- Areas with limited sensor coverage or sparse population
- Trail usage volumes (may undercount due to recreational trip exclusion)
Rule of thumb: if your analysis depends on a small number of trips or a single road segment, treat the result as directional rather than precise. Zoom out to a larger aggregate to validate the pattern.
AADT Comparison Caveats
Discrepancies between Replica network volumes and AADT counts are common and usually explainable:
Seasonal vs. annual averaging
Replica models a typical mid-week or weekend day over a 13-week season. AADT is an annual average that blends weekdays and weekends across all seasons. These are structurally different measurements.
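To see how much of a gap the day-type mix alone can create, the sketch below blends typical-day volumes into a rough annual-average-style figure. It assumes the typical weekday stands in for all five weekdays and the typical weekend day for both weekend days. `seasonal_factor` is a hypothetical adjustment you would have to derive from local count data. The volumes are made up, and this is not how AADT is formally computed.

```python
def approx_annual_average(weekday_volume, weekend_volume, seasonal_factor=1.0):
    """Blend typical-day volumes into a rough annual-average-style figure.

    Assumes the typical weekday represents all five weekdays and the typical
    weekend day represents both weekend days. `seasonal_factor` is a
    hypothetical ratio of annual-average to season-average traffic.
    """
    weekly_average = (5 * weekday_volume + 2 * weekend_volume) / 7
    return weekly_average * seasonal_factor

# Hypothetical typical-day volumes on one link.
thursday, saturday = 12_000, 8_500
blended = approx_annual_average(thursday, saturday)
gap = (thursday - blended) / thursday
print(f"blended: {blended:.0f}, weekday exceeds blend by {gap:.1%}")
```

With these illustrative numbers the blended figure is 11,000, about 8% below the weekday volume, before any seasonal or methodological differences are considered.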
Different methodology
AADT datasets apply their own scaling factors, which depend on where and when the counts were collected. Replica builds a full simulation from composite data. Direct numeric comparison requires understanding both methodologies.
Isolated link issues
Because Replica models entire megaregions, occasional links may not match ground truth due to OpenStreetMap errors, router assignment issues, or volume estimation gaps on specific segments. The overall regional pattern remains reliable.
Different data vintage
Each Replica model uses the most recent Census dataset available for that season. If you are comparing against AADT from a different year, population and travel pattern changes may explain gaps.
AADT vs. free flow speed coverage
Replica often has greater AADT coverage than free flow speed coverage. Some local streets have AADT estimates but insufficient sample data for free flow speeds.
Known Limitations
| Limitation | Detail |
|---|---|
| No 2020 data | No Places seasons were developed for 2020. Pre-pandemic baseline uses 2019; post-pandemic starts at 2021. |
| No pre-2019 data | Trends mobility data starts January 2019. Places megaregions start Fall 2019. |
| Minors simplified | Data suppliers cannot collect data on minors. Minors are assigned schools based on age and proximity; enrollment comes from public data, and home locations come from Census. Factors like school choice preferences are not modeled. |
| Transit boardings not calibrated to customer data | Line-level ridership is generally reliable, but boarding/alighting stop assignment may differ from actual rider behavior (e.g., riders walking farther to a preferred stop). |
| Recreational walking/biking excluded | Only purposeful trips with a clear origin and destination are modeled. Jogging, dog walking, and looping recreational trips are not included. Trail volumes may undercount observed traffic. |
| Spend data discontinued | Due to upstream source changes, spend data was discontinued as of December 28, 2024. |
| Rural transit coverage gaps | Transit agencies below 500 daily boardings per route are not modeled, which affects smaller agencies in smaller cities. |
| Rural land use gaps | Some very rural areas lack parcel and aggregate land use data, resulting in "Unknown" land use classification. |
| E-bike not differentiated | The model does not currently distinguish between traditional bike and e-bike trips. |
| Transit access modes limited | All transit access is modeled as walking or driving. Bike access, passenger drop-off, and for-hire vehicle access to transit are not modeled. |
| Ferry trips less reliable | Lower sample sizes and idiosyncratic rider preferences make ferry volumes less reliable than other transit modes. Cross-check against operator ridership counts. |
When to Contact Support
Reach out to the Replica team when:
- Ground truth discrepancies -- You have trusted local count data that significantly diverges from Replica outputs. Replica will review and may ingest the data for future calibration.
- Land use mismatches -- Your local land use data conflicts with Replica's classification. Share the dataset for review.
- Unexpected zero or extreme values -- Isolated links or areas showing volumes that seem implausible.
- Data source questions -- You need to understand which specific sources were used for a particular model season.
Replica improves with each release. Sharing ground truth data and flagging discrepancies directly contributes to future model quality.