Case Study
August 20, 2021

How To Create and Clean Data: A Case Study from Datacie

Preparing data for use is a complex, work-intensive task. Take a walk with Datacie through the process of cleaning and QAing employee headcount data.


Robin Slomian

We often get asked why procuring a dataset requires so much time and effort, particularly if public information is available that one could hypothetically scrape.

Datacie’s process for producing employee headcount data is a good example of why sourcing data and preparing it for use is no simple task. It’s a work-intensive process that requires attention to detail, industry knowledge, and technical know-how.

In this blog post, we take you through the steps that Datacie went through to prepare employee headcount data for use.

Acquiring Unstructured Data

The journey to build the employee count database starts with acquiring all the documents and other unstructured data sources that contain relevant information.

For U.S. companies, employee count information can be found in SEC filings, thanks to Regulation S-K prescribed under the U.S. Securities Act of 1933. For international companies, it is usually found in annual reports, CSR documents, press releases, and other documents that public companies disclose to investors.

To acquire these documents, Datacie has developed proprietary scraping and website monitoring capabilities that track and acquire the raw data from the web within minutes of the documents being made publicly available by reporting entities or regulatory agencies.

Data Extraction

After raw data acquisition comes the data extraction process. Datacie leverages state-of-the-art technologies to identify and extract precise data points from terabytes of unstructured content. The process essentially boils down to a series of interdependent deep learning models rooted in Computer Vision and Natural Language Processing.

Each algorithm in the data extraction pipeline is trained for one specific task, such as language detection, layout segmentation, tabular data detection, or information retrieval.

The last stage of the extraction pipeline consists of encoder-decoder neural networks trained to tag and classify employee-related data points from relevant sentences, data tables, and figures.

Human-in-the-loop Quality Assurance

What makes Datacie's technology distinctive is its human-in-the-loop architecture: every prediction made at any step of the data extraction pipeline carries quality indicators such as confidence scores. Whenever a score falls below a set threshold, the observation is automatically sent to qualified human annotators, who validate or reject the algorithm's prediction. This feedback loop lets Datacie continuously improve its algorithms' performance and, most importantly, supports the company's commitment to creating error-free data products.

Datacie's technological goal is not to achieve 100% automation; instead, it is to detect and differentiate where human intelligence is needed from where it is not.
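
To make the routing idea concrete, here is a minimal Python sketch of confidence-based triage. It is an illustration only, not Datacie's implementation: the `Prediction` fields, the `route` function, and the 0.90 threshold are all assumptions, since the post does not disclose actual indicators or threshold values.

```python
from dataclasses import dataclass

# Hypothetical threshold; the post does not state the real values.
REVIEW_THRESHOLD = 0.90


@dataclass
class Prediction:
    document_id: str
    field_name: str
    value: int
    confidence: float  # assumed to be a score in [0, 1]


def route(predictions, threshold=REVIEW_THRESHOLD):
    """Split predictions into auto-accepted ones and ones needing human review."""
    auto_accepted, needs_review = [], []
    for p in predictions:
        if p.confidence < threshold:
            needs_review.append(p)   # below threshold: send to an annotator
        else:
            auto_accepted.append(p)  # confident enough to pass through
    return auto_accepted, needs_review


predictions = [
    Prediction("doc-1", "total_employees", 12500, 0.98),
    Prediction("doc-2", "total_employees", 430, 0.62),
]
accepted, review_queue = route(predictions)
```

In this sketch only `doc-2` lands in the review queue. Raising the threshold sends more observations to annotators, which is the trade-off between automation and human effort described above.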

Employee count information is particularly challenging to automate because the reporting process lacks standardization. Around 60 to 70% of companies report their total number of employees, their total number of full-time and part-time employees, or their total number of full-time equivalents; for these companies, deriving the average full-time equivalents figure is straightforward. However, the remaining 30 to 40% present a wide variety of edge cases that need additional consideration. To name a few:

  • How to account for contractors, seasonal, temporary, at-will, hourly, or leased employees?
  • How to detect companies that report their employee count under each of their reporting segments?
  • How to handle companies that report having no employees?
  • How to consider subsidiaries, affiliates, parents, or joint venture-related personnel?

Datacie's team keeps tight quality assurance thresholds. As a result, tens of thousands of documents had to be reviewed by human annotators, who were asked to follow strict annotation guidelines to the letter.

To achieve consistent data entries across annotators, the team ensured that each employee count data point was reviewed by at least two different people before continuing its journey through the data extraction pipeline.
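
The two-reviewer rule can be sketched as a simple agreement check. This is a hedged illustration, not Datacie's actual workflow: the `resolve_reviews` function, the tuple layout, and the choice to escalate disagreements are assumptions made for the example.

```python
from collections import defaultdict


def resolve_reviews(reviews, min_reviewers=2):
    """Accept a data point only when enough distinct annotators agree.

    reviews: iterable of (data_point_id, annotator_id, value) tuples
             (a hypothetical layout chosen for this sketch).
    Returns (accepted, escalate): a dict of agreed values and a list of
    data point ids that need further review.
    """
    votes_by_point = defaultdict(dict)
    for point_id, annotator_id, value in reviews:
        # Keyed by annotator so the same person cannot count twice.
        votes_by_point[point_id][annotator_id] = value

    accepted, escalate = {}, []
    for point_id, votes in votes_by_point.items():
        distinct_values = set(votes.values())
        if len(votes) >= min_reviewers and len(distinct_values) == 1:
            accepted[point_id] = distinct_values.pop()
        else:
            escalate.append(point_id)  # too few reviewers or a disagreement
    return accepted, escalate


reviews = [
    ("p1", "ann-a", 100), ("p1", "ann-b", 100),  # two reviewers agree
    ("p2", "ann-a", 50), ("p2", "ann-b", 55),    # two reviewers disagree
    ("p3", "ann-a", 7),                          # only one reviewer so far
]
accepted_points, to_escalate = resolve_reviews(reviews)
```

Here `p1` is accepted, while `p2` and `p3` are held back until the disagreement is resolved or a second annotator has reviewed them.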

Full Database Audit

After the initial build, the entire dataset was audited and evaluated for trustworthiness. Every employee count passed through hundreds of quality checks that automatically identify outliers and potentially misreported observations. Additional manual checks were performed on low-confidence data points, ensuring that the final employee count database is free from poor-quality observations.
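
As one example of what an automated outlier check might look like, the Python sketch below flags implausible year-over-year jumps in a company's reported headcount. The post does not describe the actual checks, so the `flag_yoy_outliers` function and the 3x ratio limit are assumptions chosen purely for illustration.

```python
def flag_yoy_outliers(history, max_ratio=3.0):
    """Flag years whose headcount changed implausibly versus the prior year.

    history: list of (year, employee_count) tuples sorted by year.
    max_ratio: hypothetical limit on the year-over-year change factor.
    Returns the list of years that should get a manual look.
    """
    flagged = []
    for (_, previous), (year, current) in zip(history, history[1:]):
        if previous <= 0 or current <= 0:
            # Zero or negative counts cannot be compared as a ratio; review them.
            flagged.append(year)
            continue
        change = max(current / previous, previous / current)
        if change > max_ratio:
            flagged.append(year)
    return flagged


history = [(2017, 1000), (2018, 1100), (2019, 11000), (2020, 1150)]
suspicious_years = flag_yoy_outliers(history)
```

With this input, both 2019 and 2020 are flagged: the 2019 figure looks like an extra digit, and the drop back in 2020 trips the same rule. A check like this only surfaces candidates; deciding whether the figure is a misreport or a real event such as a merger is left to a human reviewer.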

Get Started with Employee Headcount Data from Datacie

And there you have it – an overview of what goes into sourcing, extracting, cleaning, auditing, and delivering datasets.

This final step is where IEX Cloud comes into play. The platform now provides employee headcount data from Datacie with its Core, and soon, its Premium Data offering. To learn more about how to get started, click below.

We hope you found this blog post insightful!

Get employee headcount data from Datacie

About Datacie

Datacie is an information company that crafts unique alternative and traditional databases from the world’s most trusted content sources. The company develops state-of-the-art AI models and an ergonomic human-in-the-loop data validation platform to extract precise data points from terabytes of unstructured data, resulting in actionable and fully customized data products.

Datacie counts some of the world’s leading exchanges as customers, and its ESG data offering is available to more than 150,000 users worldwide. The firm is led by three founders who believe that everyone can benefit from data to make better decisions, improve efficiency, and even help tackle some of the world’s most pressing societal challenges. Learn more at

IEX Cloud Services LLC makes no promises or guarantees herein regarding results from particular products and services, and neither the information, nor any opinion expressed here, constitutes a solicitation or offer to buy or sell any securities or provide any investment advice or service.
