Understanding our new Resume Forensics Data




Chmura operates in a spirit of continuous improvement, and that includes our data quality. We are passionate about ensuring that our JobsEQ users continue to have the best, most accurate, and most relevant data available.

To that end, we recently enriched the source data used in the JobsEQ Resume Forensics analytic with data that we have found to be more current, more consistent, and of higher quality. This update in source data brings some large changes in the data classification and in the results seen in the Resume Forensics analytic. This blog post discusses those changes.

Why replace sources?

Providing the best possible data to our users is of utmost importance in the work we do at Chmura, and one of the keys to doing so is using the highest quality source data. With traditional labor market information (LMI), such as employment statistics and demographic data, selecting source data is relatively easy because there are “benchmark” sources that have become the standard; examples include the Bureau of Labor Statistics, the Census Bureau, and the Bureau of Economic Analysis. For non-traditional LMI – job postings and resume data – there are no “benchmark” sources. Instead, many sources are available, with widely varying degrees of quality. We are constantly testing, analyzing, and validating these different data sources to ensure the best data are used in JobsEQ.

With this most recent update to the Resume Forensics analytic, we are incorporating new source data that we have determined to be of higher quality, based on three primary factors:

  • consistency,
  • recency, and
  • relevancy.

Consistency: Anyone who has ever reviewed a resume for hiring purposes knows there is no standardized format or structure for a resume; while the overall contents of a resume tend to be similar – experience, education, skills, etc. – the presentation of those contents varies greatly. This variation can make the data difficult to analyze and requires more complexity in modeling, all of which increases the potential for error and the likelihood of a particular resume being deemed unusable. Our updated source data follows a fixed and standardized format, minimizing the potential for error by eliminating some of the model complexities.

Recency: A common issue with resume data is that it tends to be out of date. Often a resume is updated only when an individual is searching for a job and is not updated again to include the new position once a job has been found. As a result, the “most recent” version of a resume available actually represents the individual’s previous occupation rather than the current one. Our new source data has been found to be updated more frequently and is much more likely to include an individual’s current occupation and skill set.

Relevancy: Closely related to recency, the relevancy of resume data decreases when the most recent available data is outdated. Along with occupation information, outdated location, education, and skill information misrepresents the real state of a regional labor market. By using source data that is updated more frequently, we are ensuring that the data stays current and relevant.

How did the data change?

Overall, switching to the new source allowed us to massively expand our dataset, going from fewer than 20 million resumes to over 37 million. This increase is largely due to the consistency and higher quality of the data, which enabled us to bring in over 20 million resumes with “unknown” SOC classifications that might otherwise have been unusable. These unknown SOC resumes do not have enough detail in the work experience section for our SOC classification model to confidently classify the resume to an occupation, but they otherwise contain good information – such as job titles, education history, skills, and locations. For example, 87% of the unknown occupation resumes have a school identified, 76% have a degree program, and 53% have at least one skill or certification. Common job titles for the unknown occupation resumes include “Project Manager”, “Teacher”, “Manager”, and “Consultant”. These titles may not be detailed enough to classify to a specific occupation (e.g. at the 6-digit SOC level), but they provide valuable information on their own, particularly when used with other information from the resumes.


Among classified occupations, by far the largest change is an increased representation of self-employed business owners, who are classified in these data as Chief Executives (11-1011). This is now the largest occupation in the data set, accounting for 15% of all occupations compared with only about 2% in the previous data. White-collar and professional occupations are also more represented in the updated data, while blue-collar and service occupations make up a relatively smaller proportion of the mix.[1] After excluding the outlier of chief executives, white-collar and professional occupations still account for about 68.4% of occupations in the updated data while blue-collar and service occupations make up about 31.6%. By contrast, white-collar and professional occupations make up only about 42% of the occupations in the previous data, while blue-collar and service occupations account for about 58%.
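For readers who want to reproduce this kind of comparison on their own data, the following is a minimal Python sketch of how an occupation mix could be computed from a list of SOC codes. It groups codes by SOC major group (the first two digits) using the definitions in footnote [1] and leaves out Chief Executives (11-1011) before calculating shares. The function names and the sample codes are hypothetical and purely illustrative; this is not the method used inside JobsEQ.

```python
from collections import Counter

# SOC major groups (first two digits of the code), as listed in footnote [1].
WHITE_COLLAR_GROUPS = {"11", "13", "15", "17", "19", "21", "23", "25", "27", "29", "31"}
BLUE_COLLAR_GROUPS = {"33", "35", "37", "39", "41", "43", "45", "47", "49", "51", "53"}
CHIEF_EXECUTIVES = "11-1011"


def occupation_group(soc_code):
    """Return the broad group for a SOC code, or None if it falls in neither."""
    major_group = soc_code[:2]
    if major_group in WHITE_COLLAR_GROUPS:
        return "white-collar and professional"
    if major_group in BLUE_COLLAR_GROUPS:
        return "blue-collar and service"
    return None  # codes outside the footnote's lists are left out of the mix


def occupation_mix(soc_codes, exclude=(CHIEF_EXECUTIVES,)):
    """Share of each broad group among the classified codes, skipping `exclude`."""
    groups = (occupation_group(code) for code in soc_codes if code not in exclude)
    counts = Counter(group for group in groups if group is not None)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()} if total else {}


# Hypothetical sample of classified resumes (illustrative only, not real data).
sample_codes = ["11-1011", "11-1011", "15-1252", "29-1141", "25-2021", "41-2031", "53-3032"]
sample_mix = occupation_mix(sample_codes)
for group, share in sorted(sample_mix.items()):
    print(f"{group}: {share:.1%}")
```

Passing `exclude=()` keeps chief executives in the calculation, which makes it easy to see how much a single large occupation shifts the overall mix.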

One possible explanation for the difference in occupation mix between the data sources goes back to the recency and relevancy issues with resume data. The previous data consisted of more resumes from individuals who were actively searching for a job, resulting in a higher proportion of resumes where the last occupation was one with relatively higher turnover – primarily blue-collar and service occupations. Occupations with relatively low turnover are less represented because those resumes are updated less frequently and may age out of the sample. Interestingly, compared with the resume data, the job postings data have a much more evenly split occupation mix: of all jobs posted in 2021 on RTI,[2] 48% are white-collar and professional occupations while 52% are blue-collar and service occupations.

As with the occupation mix, the educational attainment mix differs considerably between the resume sources, with the updated data showing higher educational attainment than the previous data: 64.9% of resumes in the updated data have a bachelor’s degree or higher, compared to only 33.7% of resumes in the previous data. One important caveat is that about 38% of education entries in the previous data could not be successfully classified to a degree level, compared to about 25% of education entries in the updated data. This decrease in unclassified educational attainment data is a direct result of the standardized format in the updated data.[3]

Are the old data wrong?

Because this update results in such large changes to the data in Resume Forensics, it is understandable that one might question the old data, especially if it was used in prior analysis. As previously stated, there is no “benchmark” data source for non-traditional LMI, and the use and analysis of these data are still relatively new. It is an ongoing process of discovering data sources and then testing and validating the data to ensure the best data are being used. Both the previous and updated data are of high quality, and many of the differences in classification are due to different sources representing different samples of the population, rather than one source having incorrect data. However, for the reasons discussed earlier in this post, there is a higher degree of confidence in the updated data, which is ultimately why the change was made. Much like the RTI job postings data set, the Resume Forensics resume data will continue to evolve and be adjusted over time to ensure we are providing our users with the best data available.

[1] For purposes of this comparison, white-collar and professional occupations include: Management Occupations (11-0000), Business and Financial Operations Occupations (13-0000), Computer and Mathematical Occupations (15-0000), Architecture and Engineering Occupations (17-0000), Life, Physical, and Social Science Occupations (19-0000), Community and Social Service Occupations (21-0000), Legal Occupations (23-0000), Educational Instruction and Library Occupations (25-0000), Arts, Design, Entertainment, Sports, and Media Occupations (27-0000), Healthcare Practitioners and Technical Occupations (29-0000), and Healthcare Support Occupations (31-0000). Blue-collar and service occupations include: Protective Service Occupations (33-0000), Food Preparation and Serving Related Occupations (35-0000), Building and Grounds Cleaning and Maintenance Occupations (37-0000), Personal Care and Service Occupations (39-0000), Sales and Related Occupations (41-0000), Office and Administrative Support Occupations (43-0000), Farming, Fishing, and Forestry Occupations (45-0000), Construction and Extraction Occupations (47-0000), Installation, Maintenance, and Repair Occupations (49-0000), Production Occupations (51-0000), and Transportation and Material Moving Occupations (53-0000)

[2] Real-Time Intelligence, the job ads data set available in JobsEQ.

[3] Education data represents the most recent education entry in a resume.
