Web scraping remains something of an enigma in the public eye. That's not entirely surprising: every innovative solution takes time to be understood and accepted, and widespread acceptance generally follows closely behind industry-specific legislation or regulation.
However, web scraping is immensely useful for the public good, and anything that slows its adoption among the wider public causes harm in the long term. In this article, I'll argue that dedicated regulation, or at the very least industry self-regulation, would benefit the world at large.
Someone’s money versus everyone’s money
In any free-market economy, private capital spending is fairly unrestricted. Public money, on the other hand, is subject to numerous rules and regulations. While these may differ across countries, governments, and political or economic ideologies, public spending is generally project-based.
Project-based work usually means someone has to convince someone else to provide the required sum of money. In other words, public spending rests on two factors: persuasion and law. Clearly, or at least hopefully, no government entity would put its signature on a project that breaks the law.
Persuasion, meanwhile, is needed because those who sign off on public spending are people. Their perceptions, ideas, hopes, and dreams all influence the decision-making process. Presented with a new and unfamiliar process, people are wary, especially if there's no legislation surrounding it.
Private capital, on the other hand, has few masters and wardens. It can be spent according to the desires of whoever controls it, often with no need to argue for why it should be spent one way or another. Business-minded individuals will, however, often reinvest that money in innovation, which over time leads to the creation of new industries.
These differences in spending mean that private capital creates industries, some of which bring innovation with them. New industries exist in a legal limbo for a while but are eventually regulated. The associated innovations then gain legitimacy in the public consciousness, and public spending can begin.
Of course, this is slightly reductionist, as there are cases where web scraping is already being used in science and for the public good (for example, to catch “copied and pasted” laws pushed by lobbyists). When COVID-19 first hit the world, we supported CoronaMapper, a project powered by web scraping.
As another example, we are now helping the Lithuanian government, pro bono, to find illicit visual content on the internet through the power of web scraping. Other examples range from economists acquiring additional data sources for research to discovering inefficiencies in property tax allocation.
However, all of these examples remain the exception rather than the rule. If web scraping were recognized as legitimate by the world at large, the possibilities outlined here would be just the tip of the iceberg.
Role of sectoral laws
Even when industries exist in this limbo state, courts still have to resolve disputes between participants. Although no laws directly regulate web scraping, courts follow previous practice and interpret existing legislation that may be indirectly related to the case at hand.
Web scraping is going through exactly such a stage. Many court cases have allowed our industry to interpret what could be considered a good way of scraping. Most of this interpretation in case law is performed through sectoral laws, usually those that define or regulate the use of data in general (e.g., GDPR, CCPA, and others).
When sectoral laws are applied, you can arrive at some… unusual conclusions. For example, most social media websites argue that the personal data stored on them belongs to the people, and that the company merely protects user privacy. Yet Facebook recently banned NYU researchers who were collecting data on the platform. The researchers used a browser extension that collected data only from the people who had installed it, and every extension user had given explicit permission for their personal data to be collected in this way.
Such an action essentially contradicts the argument raised by many social media giants. If the data belongs to the users, they can grant consent for it to be collected. If, as in Facebook's case, they cannot grant consent, then the platform has to argue that the data is its own private asset rather than one that belongs to the people.
Facebook can only strongarm researchers through its Terms of Service in this manner because industry regulation is lacking. These gaps allow Terms of Service to be devised in a self-interested manner without breaking the law, even when web scraping or other means of automated data collection are used for the public good.
Over time, web scraping industry players have developed a common understanding that scraping publicly accessible data is acceptable, while data accessible only after a log-in is off limits. However, a case can also be made that the widespread public adoption of some social media platforms, such as Facebook, should be factored into these considerations.
According to Statista, Facebook has 2.85 billion monthly active users. With such a large percentage of the global population participating in the platform, it may be argued that the data shared there is no longer genuinely non-public, and that such data should not fall under the hegemony of a single company.
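To make the informal norm described above a little more concrete, here is a minimal, purely illustrative sketch of the kind of self-imposed limits a scraper can enforce today: fetch only publicly reachable pages, respect robots.txt, identify yourself honestly, and throttle requests. The domain, user-agent string, delay value, and helper names here are placeholder assumptions of mine, not a prescribed or industry-standard implementation.

```python
# Illustrative sketch only: a scraper that honours the informal norms described
# above. Fetch only publicly reachable pages, respect robots.txt, identify the
# bot, and throttle requests. All names and values below are placeholders.
import time
import urllib.robotparser
from typing import Optional

import requests

TARGET_SITE = "https://example.com"          # placeholder domain
USER_AGENT = "polite-research-bot/0.1"       # identify the bot honestly
REQUEST_DELAY_SECONDS = 2                    # arbitrary politeness delay


def allowed_by_robots(url: str) -> bool:
    """Check the site's robots.txt before fetching a page."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{TARGET_SITE}/robots.txt")
    parser.read()
    return parser.can_fetch(USER_AGENT, url)


def fetch_public_page(url: str) -> Optional[str]:
    """Fetch a page only if robots.txt permits it and no log-in is required."""
    if not allowed_by_robots(url):
        return None
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(REQUEST_DELAY_SECONDS)
    # Pages that demand authentication (401/403) are treated as non-public
    # and skipped, matching the "no data behind a log-in" norm.
    if response.status_code in (401, 403):
        return None
    return response.text


if __name__ == "__main__":
    page = fetch_public_page(f"{TARGET_SITE}/public-page")
    print("fetched" if page else "skipped")
```

The point of the sketch is simply that these limits are voluntary: nothing in law currently requires them, which is precisely why a shared, self-regulatory standard matters.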
Self-regulation until regulation
Some time ago, the UK's Joint Industry Committee for Web Standards (JICWEBS) adopted the Trustworthy Accountability Group's (TAG) Certified Against Fraud program, which is intended to reduce ad fraud. However, JICWEBS was late to the party: many digital advertising companies had already been using TAG's program independently. That independent adoption, which is essentially self-regulation, has had tremendous results.
Back when GDPR and CCPA were being rolled out, industry leaders reached out to policymakers, not to lobby for softer laws (although I'm sure that happened as well), but to help bring greater clarity to certain definitions.
There are many other industries where self-regulation has led to impressive results (e.g., the American Bar Association). Clearly, business can work hand in hand with governments and lawmakers. Most importantly, self-regulation can provide a foundation for eventual government regulation.
Web scraping is a perfect candidate for self-regulation. Because it's a highly technical and complex topic, legislation will take a while to catch up: we first need to explain the entire process to the world at large and educate people from the ground up. Until then, it's hard to expect any proper legislation to be enacted.
Self-regulation is not only about getting to the stage where legislation happens. It also reduces the influence of bad actors on the market and improves the overall perception of the industry. Finally, it's far easier to spread awareness about something when an association for it exists.
Conclusion
Web scraping can benefit the world at large. However, it will struggle to do so if it remains an enigma that exists in some legal limbo. While it’s easy to just sit back and wait until someone else does the work for us, we shouldn’t aim for easy. Web scraping can self-regulate so that the world may reap its benefits.
The opinions expressed in this post belong to the individual contributors and do not necessarily reflect the views of Information Security Buzz.