Accuracy Matters: Imputing Zip-Level Employment

Posted on June 3, 2022 by Greg Chmura

JobsEQ incorporates a collection of sources for estimating industry employment at the zip-code level.[1] Two of the prime sources used are LODES[2] and Zip Code Business Patterns (ZBP),[3] both made available by the US Census Bureau. We use these data sets because of their superior accuracy, while avoiding less reliable alternatives such as those sold by private-sector firms that compile company profiles.

Data Set Comparisons

Though we are discussing data sets for use at the zip-code level, we evaluate the accuracy of these data summed to the county level so we can use the Quarterly Census of Employment and Wages (QCEW) as our benchmark. The QCEW data are produced by the Bureau of Labor Statistics and are widely viewed as the most accurate industry employment data for the United States. QCEW data are available for all counties, but are not available at the zip-code level.

Using county data for the whole nation (specifically, private sector employment at the 2-digit NAICS level[4]), we first compare the disclosed[5] QCEW employment figures to those from CBP (the county-level analog of ZBP) and LODES. While both data sets offer value, the results of this test show that LODES is closer on average to the QCEW benchmark at this NAICS level: the CBP data had about two-and-a-half times the overall error of LODES.[6]
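The exact error metric behind this comparison is not spelled out here, so the following is only a minimal sketch of one common choice: the sum of absolute differences from the benchmark, divided by total benchmark employment. The function name, the county identifiers, and all employment figures are hypothetical and chosen purely for illustration.

```python
def overall_error(estimates, benchmark):
    """Sum of absolute differences from the benchmark, as a share of
    total benchmark employment. Only cells present in both are compared,
    mirroring the idea of using disclosed benchmark cells only."""
    keys = [k for k in benchmark if k in estimates]
    abs_err = sum(abs(estimates[k] - benchmark[k]) for k in keys)
    total = sum(benchmark[k] for k in keys)
    return abs_err / total

# Hypothetical figures keyed by (county, 2-digit NAICS); not real data.
benchmark = {("county_1", "23"): 1000, ("county_1", "62"): 2000, ("county_2", "23"): 500}
source_a = {("county_1", "23"): 1040, ("county_1", "62"): 1950, ("county_2", "23"): 510}
source_b = {("county_1", "23"): 1100, ("county_1", "62"): 1875, ("county_2", "23"): 525}

err_a = overall_error(source_a, benchmark)
err_b = overall_error(source_b, benchmark)
print(f"source_a error: {err_a:.4f}")
print(f"source_b error: {err_b:.4f}")
print(f"source_b has {err_b / err_a:.1f}x the error of source_a")
```

Other metrics (for example, a mean absolute percentage error per cell) would weight small industries differently, so the ratio between two sources can depend on the metric chosen.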

Another possible data set to consider would be the aggregated employment estimates derived from a collection of company profiles. Such lists, which can be purchased from private-sector vendors, are more typically used for their contact information, but these databases often also include NAICS and employment estimates for individual companies. While we appreciate the difficulty in developing and maintaining such databases, and while we use such databases ourselves for projects such as conducting business surveys, we would not use these data for industry employment estimates over the more accurate data from the US Census.

While this may seem to be an obvious choice, we nevertheless tested the accuracy of these data sets using a sample from two popular vendors, which we’ll refer to as Vendor A and Vendor B.[7] When evaluating covered employment at the 2-digit NAICS level, there was really no comparison: both Vendor A and Vendor B data had about five times the overall error of the LODES data. Put another way, LODES averaged an approximately 80% reduction in error from the company-list estimates of 2-digit NAICS employment. When comparing private sector data at the 4-digit NAICS level, our processed ZBP dataset handily surpassed the company-list job estimates, offering a 61% reduction in error compared to Vendor A and a 51% reduction for the Vendor B sample.
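The two ways of stating these results (an error ratio and a percentage reduction) are related by simple arithmetic, sketched below. The helper names are hypothetical; the inputs are the figures quoted above.

```python
def reduction_from_ratio(ratio):
    """If source X has `ratio` times the error of source Y,
    Y's error is lower than X's by this fraction."""
    return 1 - 1 / ratio

def ratio_from_reduction(reduction):
    """Inverse: a fractional error reduction expressed as an error ratio."""
    return 1 / (1 - reduction)

# Five times the error corresponds to an 80% reduction.
print(round(reduction_from_ratio(5), 2))

# The 61% and 51% reductions at the 4-digit level, expressed as ratios.
print(round(ratio_from_reduction(0.61), 2))
print(round(ratio_from_reduction(0.51), 2))
```

By this arithmetic, the 4-digit results correspond to the vendor samples having roughly two to two-and-a-half times the error of the processed ZBP data.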

Additional Considerations

Within JobsEQ, we utilize both LODES and ZBP data, as well as other supplemental datasets. Since LODES offers no industry detail beyond the 2-digit NAICS level, the ZBP dataset is required for the more granular industry-level employment imputations. There are also some industries that neither LODES nor ZBP covers well, so for those we use additional sources to improve the overall accuracy of our employment imputations; one such exception is Department of Defense civilian employment.

Beginning in 2017, ZBP was released with non-disclosures for region-industry combinations with fewer than three establishments, so we must implement additional processing to fill in the missing data points. Additionally, the ZBP dataset provides establishment counts within bucketed ranges of employment rather than exact employment figures, so employment levels need to be estimated from these ranges. Even with these additional processing steps, the resulting employment estimates are quite valuable, as illustrated in the above assessments.
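As a rough illustration of turning bucketed establishment counts into an employment figure, the sketch below multiplies each count by the midpoint of its size class. This midpoint approach is an assumption made for illustration, not a description of the JobsEQ processing; the size-class labels, the value assumed for the open-ended top class, and the sample counts are likewise illustrative.

```python
# Employment-size classes as (lower, upper) bounds. Illustrative only.
SIZE_CLASSES = {
    "1-4": (1, 4),
    "5-9": (5, 9),
    "10-19": (10, 19),
    "20-49": (20, 49),
    "50-99": (50, 99),
    "100-249": (100, 249),
    "250-499": (250, 499),
    "500-999": (500, 999),
}
TOP_CLASS = "1000+"
TOP_CLASS_ASSUMED = 1500  # assumption: the open-ended class has no midpoint

def estimate_employment(establishment_counts):
    """Estimate employment as the sum of (establishments x class midpoint)."""
    total = 0.0
    for label, n_establishments in establishment_counts.items():
        if label == TOP_CLASS:
            total += n_establishments * TOP_CLASS_ASSUMED
        else:
            lower, upper = SIZE_CLASSES[label]
            total += n_establishments * (lower + upper) / 2
    return total

# Hypothetical zip-industry cell: 10 small, 4 mid-small, 2 mid-sized establishments.
counts = {"1-4": 10, "5-9": 4, "20-49": 2}
print(estimate_employment(counts))  # 10*2.5 + 4*7 + 2*34.5 = 122.0
```

Midpoints tend to overstate employment in the smaller classes, where establishments cluster near the lower bound, which is one reason a benchmark is useful for correcting the raw result.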

A further important point is that we do not use the LODES and ZBP data sets in isolation within JobsEQ; rather, we use them in conjunction with each other and with our processed QCEW data as a benchmark. We thus can effectively leverage the more accurate and comprehensive QCEW data to benchmark our supplementary data sets. In addition, this means our employment data are consistent across all geographic levels, so our clients do not have to contend with incompatible or contradictory data sets when performing regional analyses.
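One simple way to picture benchmarking is proportional scaling: raw zip-level estimates within a county are multiplied by a common factor so that they sum to the county control total. The sketch below assumes that approach for illustration only; the function name, zip labels, and figures are hypothetical, and the actual JobsEQ procedure may differ.

```python
def benchmark_to_control(zip_estimates, control_total):
    """Scale zip-level estimates proportionally so they sum to the
    county-level control total (for one county-industry cell)."""
    raw_total = sum(zip_estimates.values())
    if raw_total == 0:
        # Nothing to scale; return the estimates unchanged.
        return dict(zip_estimates)
    factor = control_total / raw_total
    return {zip_code: value * factor for zip_code, value in zip_estimates.items()}

# Hypothetical raw estimates for three zip codes and a county benchmark of 1,100.
raw = {"zip_a": 300.0, "zip_b": 500.0, "zip_c": 200.0}
scaled = benchmark_to_control(raw, 1100.0)
for zip_code, value in scaled.items():
    print(zip_code, round(value, 1))
```

Scaling of this kind preserves each zip code's share of the county while guaranteeing that the zip figures add up to the county figure, which is what makes the data consistent across geographic levels.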

Please don’t hesitate to reach out to us to learn more about our labor market data or economic consulting services.

Research assistance for this blog was provided by Christopher Uchtman.

[1] In JobsEQ, we impute employment down to the block level, but for purposes of this blog we focus on methods at the zip-code level.

[2] LODES stands for Longitudinal Employer-Household Dynamics Origin-Destination Employment Statistics and is part of the LEHD data set.

[3] More about the Census’ ZBP and County Business Patterns data can be read here:

[4] In this comparison, the private sector only is used (as opposed to inclusion of government employment) since the ZBP/CBP data are largely focused on private sector employment. Two-digit NAICS is used as that is the most granular industry level available in LODES.

[5] In some cases, the QCEW suppresses employment estimates to protect the privacy of the businesses providing data in this census. For our comparison, we only utilized QCEW data points that were disclosed.

[6] In this test, we looked at all counties in the nation using data for the 2019 calendar year.

[7] For Vendor A, our sample was four counties, arbitrarily chosen among counties where the zip codes matched or nearly matched the county aggregate (so that we could use our processed ZBP data in aggregate rather than CBP data directly). To keep costs contained, we chose relatively small counties for this exercise (counties having covered employment ranging roughly from 40,000 to 100,000). One of these four counties, where Vendor A had approximately average accuracy, was chosen as the test county for Vendor B.

This blog reflects Chmura staff assessments and opinions with the information available at the time the blog was written.