Scraping, data mining and data harvesting

Many websites incorporate data obtained from other websites. It is sometimes thought that, where the data obtained is not protected by copyright (e.g. data consisting of postal addresses arranged alphabetically), there are no legal problems. This is, however, a mistake: the collection and re-use of such data can present significant legal risks.

As a matter of English law, the key risks arise under:

the database right legislation;
the law of contract; and
the Computer Misuse Act.

Database right

A database is defined in the Copyright, Designs and Patents Act 1988 as “a collection of independent works, data or other materials which – (a) are arranged in a systematic or methodical way, and (b) are individually accessible by electronic or other means.” In general terms, databases falling within this definition will be protected by database right if there has been “a substantial investment in obtaining, verifying or presenting the contents of the database.” Database right will be infringed where a person extracts or reutilises all or a substantial part of a protected database without the consent of the database owner.

The law on database right is in a state of flux, and unfortunately the scope of the right is not entirely clear. Nonetheless, it is clear that the harvesting of data from other sites can in some circumstances constitute an infringement of this right.

Law of contract

Website terms of use sometimes expressly prohibit the collection and republication of data from websites. If you are considering extracting data from another website for use on your own website, you should check the terms of use of that other site. If they expressly prohibit what you intend to do – and if the website owner can establish that the terms are enforceable against you – then you may be found liable for breach of contract (or licence) if you go ahead.

Computer Misuse Act

The Computer Misuse Act provides for a specific offence in the case of unauthorised access to a computer: “(1) A person is guilty of an offence if— (a) he causes a computer to perform any function with intent to secure access to any program or data held in any computer; (b) the access he intends to secure is unauthorised; and (c) he knows at the time when he causes the computer to perform the function that that is the case. (2) The intent a person has to have to commit an offence under this section need not be directed at— (a) any particular program or data; (b) a program or data of any particular kind; or (c) a program or data held in any particular computer. (3) A person guilty of an offence under this section shall be liable on summary conviction to imprisonment for a term not exceeding six months or to a fine not exceeding level 5 on the standard scale or to both.”

It may be argued that, where a website’s terms of use prohibit data mining, such activities could fall within the Computer Misuse Act. This could lead to civil liability (under the tort of breach of statutory duty) as well as criminal liability. To the best of my knowledge, this kind of argument has not yet been tested in the UK courts.

Comments

Hi Alasdair, Great article, the most authoritative I have found, but here goes with my thoughts, and I would be really keen to know your own. You are careful to say “can present significant risks”, as opposed to “you are breaking the law”, and I think I see why.

Risk 1 is a VERY high bar. Look at the formidable case that the Jockey Club lost: (1) British Horseracing Board Ltd (2) The Jockey Club & (3) Weatherbys Group Ltd v William Hill Organization Ltd. If they could not get over the bar, I don’t really see it as a significant risk.

Risk 2 Law of contract – If a site has this protection, then this is a risk, but let’s say that in most cases of scraping it doesn’t.

Risk 3 Criminal. The type of cases that are brought under this legislation, and the level of proof required, might account for the fact that there were no cases in 2009. Have there been any since?

It’s likely that many have felt aggrieved by scraping, which means there are likely to have been many legal attempts to prosecute – yet we don’t see this.

Do you still see the risk as significant, given that any prosecution would seemingly need to be ground-breaking? The longer we go without one, isn’t it increasingly unlikely?

Hi Chris, I would need to look at the current law in some detail to respond usefully here, and unfortunately I don’t have the time atm.

Hi, What if you are scraping data to come to a result while not publishing the content you are scraping but holding all the content in your database?

For example: Scrape Yelp to find how many people wrote a review on McDonald’s, but then only publish the count and not the actual reviews, while holding all the reviews in non-published databases for future counts?

TIA

I don’t think there is much of a difference here in principle. Eg the database right infringement does not require publication.

Hi, I have a few questions regarding data extraction of factual information (taking university courses as an example, assuming university course names and UCAS codes would constitute factual material?)

(1) Is it legally permitted to compile (e.g.) a database of all university courses in the UK?

(2) Would it make a difference how that information is obtained, e.g. through word of mouth, newspaper university guides, university websites or Wikipedia?

(3) What if the total course list is compiled from a range of sources so that no single source contributes a “substantial” amount of the total list?

There’s no rule of English law that prohibits the compilation of a list of UK university courses as such – but it very much depends how it is compiled.

Ensuring that no single source contributes a “substantial part” of the resulting database won’t necessarily help you. In both database law and copyright law, the substantiality test is applied to the thing that is copied etc, not the thing that results from the copying.

If a ‘harvester’ is used to collect email addresses, with the express purpose of further research (recipient’s name, job title, dept, etc) by a person, can this data legally be used for direct marketing by email? (Provided all relevant laws are followed in regard to direct email marketing.)

There are a number of legal problems with this proposal. The biggest problem is data protection law. You can only “process” (i.e. do anything with) personal data (which includes personal names etc) if you meet one of the requirements of Schedule 2 to the DPA 1998:

http://www.legislation.gov.uk/ukpga/1998/29/schedule/2

Reading Schedule 2, you might think that your activity would be covered by 6(1), but the received wisdom is that that provision would be construed narrowly, and that direct marketing without consent will prejudice the rights of the data subject.

In other words, you need consent if your activities involve any personal data processing.

How come websites like indeed.com / careerjet.com / simplyhired.com / linkup.com harvest job postings from thousands of job boards, newspapers or career pages…? How legal is that? They harvest job postings and they do mention the source page however.

Please explain whether it is enough to mention the source of the job posting in order to be considered legally OK.

Regards,

Christian

I don’t know how those particular sites operate. They may get express consent. Alternatively they may be relying upon implied consent. A website operator can specify in a robots.txt file whether spidering/scraping is permitted. A failure to prohibit could (?) be interpreted as a consent of sorts.

If scraping data from a given source infringes copyright, database right or any other legal right, then a source citation won’t constitute a defence in these sorts of circumstances. It may however reduce the chances of a complaint.

Is the ‘manual’ harvesting of email addresses visible on websites, with the intention of later contacting them via a targeted email campaign, a legally allowed practice in the UK? If it is not, how would a court distinguish between someone obtaining an email address from a company website to contact a company employee with a general enquiry and doing this on a large scale to hundreds of contacts?

Even where the collection of such email addresses isn’t legally problematic, their use for marketing will be governed by the Privacy and Electronic Communications (EC Directive) Regulations 2003, as well as data protection legislation. The courts would distinguish the different activities in the usual way: on the basis of the evidence put before them.

Are these risks equally relevant if you are not re-publishing the scraped data i.e. if it is for internal use only? If yes, how do screen-scraping sites like Twenga get away with it?

Publication or re-publication isn’t usually relevant. I don’t know how Twenga works, but all sites/applications that copy content from websites need to be carefully designed. For these kinds of sites/applications, there is also the issue of copyright infringement. In Google’s case, it has been found to infringe copyright in a number of cases around the world. However, the courts have in a number of other cases found that this sort of activity is “fair use” or subject to an implied licence. The directives in a robots.txt file could be said to constitute an express licence.
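By way of illustration only, here is a minimal Python sketch of how a scraper might check robots.txt directives before fetching a page, using the standard library's urllib.robotparser module. The robots.txt content, the "ExampleBot" user agent, the example.com URLs and the is_fetch_permitted helper are all hypothetical, invented for this example. Passing such a check shows only what the site operator's robots.txt says; whether that amounts to consent or a licence is a legal question this code does not answer, and it says nothing about the site's terms of use, database right or data protection law.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_fetch_permitted(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    This reflects only the robots.txt directives; it is not a legal test.
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/jobs/123", "/private/data"):
        url = "https://example.com" + path
        print(url, is_fetch_permitted(EXAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

In practice a scraper would download the live robots.txt of the target site (RobotFileParser also offers set_url() and read() for that) and run this check for every URL before requesting it.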
