Data scraping: online data vs. Infobel Pro data

Last Update: 27.05.24 • Publication: 13.03.23

Data scraping has become a popular method of collecting data from the web. The process involves using software tools to extract information from websites and storing it in a structured format for later analysis or use.

Scraping has many benefits, and there are tools that make data harvesting relatively easy. But it is not without risks and hidden costs for companies that improvise this kind of data collection.

In this article, Marc Wahba (co-founder and CTO of Infobel) shares his experience with data scraping. He shows you:

  • The flaws in the data you can find on the web.
  • The hidden costs of data scraping.
  • The potential risks of using scraped data.

Ultimately, you will understand the fundamental differences between data that can be found online and data that is available on specialized platforms like Infobel Pro.

What data can be found on the web (and why we should be wary of User Generated Content)?

There is a major difference between the data that Infobel Pro offers and what is usually found on the Internet.

On the web, there is a lot of user-generated content or content that is automatically generated by certain platforms.

For example, on LinkedIn or on Google My Business, anyone can create content to be visible. Users create it when they need to make their activity or their company known. But if the company goes bankrupt, this content will remain online because nobody is able to delete it (or thinks of doing so).

It's a bit like the opt-in principle: someone gives consent, but if they die, they can't opt out.

On LinkedIn, about 25% of current business listings are for companies that no longer exist or never existed. On services such as Google My Business or Google Maps, you can find companies that have been out of business for several years. You can even find companies that have been closed for more than 10 years on the platform of one of the leading review sites in Europe.

In order to avoid these problems, Infobel verifies its data using data feeds based on registrations with the chambers of commerce.

For example, in Belgium, at the Crossroads Bank for Enterprises, it is possible to know immediately when a company is liquidated or goes bankrupt. This information is also certain and irrevocable.

So, on its own, User Generated Content does not indicate whether the information is still valid. This is why it is important to compare this content with official data. By linking these two sources, you get very powerful data.
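To make the idea of linking the two sources concrete, here is a minimal Python sketch. It is only an illustration, not a description of how Infobel actually processes its data: the `Listing` shape, the `REGISTRY` snapshot, the company numbers, and the status labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """A user-generated listing collected from a web platform (hypothetical shape)."""
    company_number: str
    name: str


# Hypothetical snapshot of an official register: company number -> legal status.
REGISTRY = {
    "0123456789": "active",
    "0987654321": "bankrupt",
}


def check_listing(listing: Listing, registry: dict[str, str]) -> str:
    """Classify a listing by comparing it with the official register."""
    status = registry.get(listing.company_number)
    if status is None:
        return "unverified"  # never registered, or the number is wrong
    return "valid" if status == "active" else "stale"


listings = [
    Listing("0123456789", "Acme SA"),
    Listing("0987654321", "Old Bakery SPRL"),
    Listing("0000000000", "Ghost Consulting"),
]

report = {item.name: check_listing(item, REGISTRY) for item in listings}
print(report)
```

In this sketch, a listing that matches an active registration is kept, one that matches a bankrupt company is flagged as stale, and one with no match at all is flagged as unverified.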

Scraping may seem free, but in reality it requires the development of expensive techniques

There is a lot of scrapable and collectable data available on the market.

For example, you can find data on millions of companies on a site like Infobel. But that does not mean you are allowed to scrape it (indeed, harvesting this data is against the terms and conditions).

On the other hand, there is also data collected from the DNS to obtain the list of domain names registered in Belgium or France, but this data is often incomplete and difficult to exploit as is.

So, scraping may seem free, but in reality, it requires the development of expensive techniques such as:

  • Rental of scraping capabilities - i.e. servers and software that automate data collection. This rental can be quite expensive.
  • Changing IP addresses - to avoid being detected as a bot by the targeted website, it is often necessary to change IP addresses regularly. This can be done by renting IP addresses from specialized providers.
  • Managing scraped data - after scraping, data is often raw and unstructured. To make it usable, sometimes complex processes are required, such as an ETL (extract, transform, load) pipeline.
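As an illustration of the last point, here is a minimal ETL sketch in Python. It rests on assumptions: the raw rows, the field names, and the `company` table are hypothetical, and a real pipeline would need far more cleaning rules than the two shown here.

```python
import re
import sqlite3

# Extract: hypothetical raw rows as they might come out of a scraper.
RAW_ROWS = [
    {"name": "  Acme  SA ", "phone": "+32 (0)2 555 01 00"},
    {"name": "ACME SA", "phone": "+32 2 555 01 00"},
    {"name": "Globex", "phone": ""},
]


def transform(row: dict) -> tuple:
    """Normalize whitespace and case in the name; keep only digits in the phone."""
    name = " ".join(row["name"].split()).upper()
    # Drop the optional "(0)" trunk prefix, then every non-digit character.
    phone = re.sub(r"\(0\)|\D", "", row["phone"])
    return name, phone or None


def load(rows: list, conn: sqlite3.Connection) -> None:
    """Store rows; the primary key on name silently drops duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS company (name TEXT PRIMARY KEY, phone TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO company VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load([transform(row) for row in RAW_ROWS], conn)
stored = conn.execute("SELECT name, phone FROM company ORDER BY name").fetchall()
print(stored)
```

Even in this toy case, two of the three raw rows describe the same company, and one has no phone number at all. This is the kind of cleaning work that adds to the real cost of scraping.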

To get large volumes of data, the cost of scraping is often higher than using a provider (like Infobel Pro) to get good data.

Scraped data is often of poor quality

As explained above, scraped data is often incomplete, outdated, or even wrong, and still needs to be enriched afterwards.

People who scrape often think they are saving money, but in reality they are wasting time and money producing poor-quality data. In the end, they often turn to service providers to get quality data.

The legality of scraped data can be questionable

For example, mobility data covering the entire world, including Belgium, is available in India, and it identifies the movements of any device.

However, this poses a problem in terms of privacy, because even if the data is anonymized, there is still the phone ID floating around. If that ID can be associated with a person or a phone number, the information is no longer confidential and it is possible to track that person's movements.
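The short Python sketch below shows why a persistent device ID undermines anonymization. All data in it is invented for illustration: the device IDs, timestamps, places, and the phone number are hypothetical, as is the assumption that a second dataset linking IDs to phone numbers exists.

```python
from collections import defaultdict

# Hypothetical "anonymized" mobility pings: (device ID, timestamp, place).
PINGS = [
    ("device-a", "2024-01-01T08:00", "Brussels"),
    ("device-b", "2024-01-01T08:05", "Ghent"),
    ("device-a", "2024-01-01T18:30", "Antwerp"),
]

# Hypothetical second dataset that links a device ID to a phone number.
ID_TO_PHONE = {"device-a": "+32 000 00 00 01"}


def trace_by_phone(pings, id_to_phone):
    """Group pings by phone number for every device ID that can be linked."""
    trace = defaultdict(list)
    for device_id, timestamp, place in pings:
        phone = id_to_phone.get(device_id)
        if phone is not None:
            trace[phone].append((timestamp, place))
    return dict(trace)


movements = trace_by_phone(PINGS, ID_TO_PHONE)
print(movements)
```

A single lookup table is enough to turn "anonymous" pings into the movements of an identifiable person, which is why the persistent ID is the real privacy problem.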

Compliance with regulations is one of the biggest problems with data scraping. Retrieving data and running a campaign targeting individual profiles is a violation of GDPR, even for data that is considered public on LinkedIn.

Conclusion: use Infobel Pro data

While data scraping is a popular method of collecting data from the web, it is important to understand the risks and hidden costs associated with this practice.

More and more people are keen to buy or access data coming from trusted, reliable sources.

Data available on specialized platforms like ours has a quality level of 95% or more and is updated in real time, and you can access it or buy it at affordable prices.

Since 1994, we have been collecting data thanks to a team of specialists who are continuously trained on the evolution of techniques and legislation related to online data collection. This allows us to offer you complete, high-quality and up-to-date databases.

Author: Marc Wahba

Meet Marc, the co-founder and CTO of Infobel. He is in charge of software development. In 1991, he obtained a degree in civil electromechanical engineering from the Polytechnic Faculty and later earned a master's degree in management from the Solvay School of Brussels. Along with his brother, he founded Infobel in 1995, which was the first to offer an online white pages directory. Marc's innovative mindset has led to the launch of new data products and services that have become a global success, serving clients all over the world.