Browser fingerprinting techniques

May 06, 2019

In-depth article about browser fingerprinting techniques and ways to protect against them

Browser fingerprinting is a big topic in the scraping community and a way to identify bots and individuals regardless of their connection information (IP address, geolocation etc.). While most programmers and most available software tend to ignore this issue, I consider it (and I'm not alone) to be the most important aspect of any sort of scraping and proxy activity, so we're going to raise the bar here and discuss everything in detail.

Browser fingerprinting is ultimately a defense mechanism against abusers. It's not an active defense system, but it helps detect automated or abusive activity by identifying individuals who hide behind proxies in order to gain access to limited resources (limited-edition sneakers, concert tickets etc.) or to abuse various services (voting, fake visits, data scraping, automated account creation etc.).

As a programmer who builds all sorts of scrapers, works with proxies and struggles to stay anonymous all the time, I consider this period to be revolutionary. Deep learning and machine learning are accessible to all of us, and things are getting quite competitive and hard to keep track of. The proxy business is always on the edge; it's the type of activity where you either lose sleep or risk working for nothing. There's a constant "catch me if you can" game going on and the winners are those prepared to work harder and smarter.

The majority of people I speak with consider the IP address and geolocation to be the most important aspect of becoming anonymous. It certainly is "up there" but I don't see it the same way. Proxies are plentiful and easy to obtain, so that part of the problem is within everyone's reach. Big players don't even look at your IP address the same way any more and are instead turning to other, technologically advanced, techniques of identifying their users. One such technique is discussed in this article: browser fingerprinting.

What is browser fingerprinting

When a page is accessed using a browser, certain scripts can be executed in the background to collect information and send it to the backend server for storage and analysis.

A browser fingerprint is a set of features and other pieces of information related to the browser. Each piece is a building block and, taken as a whole, they result in a fingerprint that is, in many cases, unique. Much of the information is drawn from the underlying hardware, which is considered harder to spoof than other information such as the request headers.
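
To make the "building blocks" idea concrete, here is a minimal sketch of how a backend might combine collected attributes into one identifier. The attribute names and values are hypothetical, and hashing a canonical JSON string is just one simple way to do it, not necessarily what any particular tracker uses.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Combine individual browser attributes into a single identifier."""
    # Serialize with sorted keys so the same attributes always produce
    # the same string, regardless of the order they were collected in.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values, for illustration only.
visitor_a = {
    "screen": "1920x1080",
    "timezone": "Europe/Berlin",
    "fonts": ["Arial", "Calibri", "Verdana"],
    "has_flash": False,
}
# Same profile except for a single attribute.
visitor_b = dict(visitor_a, timezone="Europe/Paris")

print(fingerprint(visitor_a) == fingerprint(visitor_b))  # False
```

A single differing attribute is enough to produce a different identifier, which is why every extra detail a browser leaks makes the result more distinctive.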

This technique is nothing new. Microsoft used it (and still does) in an attempt to detect license key abuse. It used to be the case that if you changed more than 60% of the hardware in your computer, Windows would complain and ask for a new license key. The sum of the hardware parts, together with their particular details, would build this fingerprint. The same principle is applied to browsers.

How is the information collected

The information is leaked by the browser in various ways. Many of the commands executed inside a browser can leak a lot of data about the machine. The enrichment of browser features over the years has brought forward exciting new ways of consuming content. Video, audio, drawing, dragging and dropping are just some examples. All these features are made possible by various plugins that a browser uses. I'm sure you see where this is going…

For a plugin to be able to do its job, it needs to collect some hardware information in order to see whether the system allows for it or has the necessary resources. You can't play YouTube videos without having the Flash player installed or without an HTML5-capable browser. The collected data is available to the plugin and to the browser. The fingerprint is starting to emerge:

has Flash installed: yes/no
has Java installed: yes/no
has HTML5: yes/no
supports ES5: yes/no
can render PDF: yes/no
…

What technologies are leaking

The browser itself is one of the leaking technologies, but it is also the most obvious one, so I'm going to skip it and focus on the other parts instead.

Java, ActiveX, Flash, WebRTC

Java runs in many browsers but, more importantly, it runs on the system; the browser just communicates with it. Because it runs on the system, it has access to all resources and information, not just to data from the browser.

The same principles apply to ActiveX and Flash. As a test, you can open a browser, set a proxy and visit any page that detects proxies using Java. Since Java runs on the system and not inside the browser, its traffic is not routed through the proxy set inside the browser. A Java app can access an external URL to see the IP address, or it can even "have a look at your proxy settings".
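
The article's example concerns Java, but the principle holds for any code that runs on the system rather than inside the browser. As a rough analogue in Python (not the original sample), the sketch below reads the system-wide proxy configuration and compares two IP addresses; `proxy_bypassed` and its arguments are hypothetical names, and the addresses are documentation placeholders.

```python
import urllib.request


def system_proxy_settings() -> dict:
    """Return the proxy configuration visible to code running on the system."""
    # getproxies() reads the <scheme>_proxy environment variables and,
    # on Windows and macOS, falls back to the system proxy settings.
    return urllib.request.getproxies()


def proxy_bypassed(ip_seen_from_browser: str, ip_seen_from_system_code: str) -> bool:
    """True when system-level code reached the server from a different IP
    than the browser did, i.e. the browser's proxy was not used."""
    return ip_seen_from_browser != ip_seen_from_system_code


print(system_proxy_settings())
print(proxy_bypassed("203.0.113.7", "198.51.100.23"))  # True
```

If the address seen through the browser differs from the one seen by the system-level code, the server knows a proxy is in use and also knows the address behind it.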

I keep Java disabled in all my browsers and tend to stay away from websites that require it in order to function. On or off, the fingerprint is still being computed.

Canvas

The Canvas API provides a means of manipulating or drawing graphics on a drawing surface. You can search for "online paint" or similar tools to see it in action. What a canvas renders depends on system properties such as the operating system, the browser and its version, the GPU (graphics card), system fonts, sub-pixel hinting and antialiasing. All of these influence how the canvas draws its renderings, so reading the rendered result back reveals information about them.

WebGL

The WebGL API is related to the Canvas, but it allows for 3D rendering and manipulation of objects. A study performed recently concluded that WebGL can be used for fingerprinting. Part of the study was to create a 3D surface and apply a very specific image to it in many individual browsers. The researchers observed 50 distinct renders from 270 samples. That is roughly one distinct render for every five samples (50/270 ≈ 18.5%), so the WebGL method of fingerprinting can be powerful. Maybe not on its own but, in tandem with other methods, it is reliable.

Audio API

Browsers have an Audio API that allows websites and applications to play back or record audio. Certain features and modules can be distinctive, such as filtering, channels or compression. Websites can create audio signals using an OscillatorNode. The playback of the generated audio can vary from device to device, based on the hardware underneath. It's the same principle that sets audio amplifiers apart, for example: some of them add noise or too much bass, or cut the highs or mids. The variables discussed here vary greatly, and the Audio API makes for a reliable method of fingerprinting.

Browser extensions

Almost everyone has extensions installed. They allow users to customize the browsing experience: to block ads, take screenshots, pick colors, post on Pinterest, manage passwords, and the list goes on. The number of extensions and the extensions themselves form a fingerprint. Having none at all forms a fingerprint too.

Extensions are harder for websites to detect. Browsers offer no API to manage or detect them, but we can look for their traces. For example, a few lines of code that detect ad blockers are enough for me to know for a fact that you're blocking ads.

Another way of detecting installed extensions is by looking for their logo. Each extension has a logo, and that file is placed in a well-known location. Knowing just the extension ID allows you to check whether the logo (and therefore the extension) exists by verifying the path: extension://<extensionID>/<pathToFile>. Many extensions also manipulate the DOM directly. One such extension adds buttons to YouTube; simply searching the webpage for that button allows one to identify its presence.

Just remember, having no extensions at all raises more flags than having the same extensions in multiple sessions. Deleting everything or spawning a pristine session each time can be detrimental. If you're scraping, I recommend running all your sessions with some common extensions and a blocker for WebRTC (which is known to leak information). Spawning a browser session with its own extensions can be challenging but, oftentimes, rewarding.

JavaScript and CSS engine versions

JavaScript and CSS engines are both changing fast, with new additions and capabilities added monthly. By running tests and looking for certain functionality, we can differentiate even between minor versions of a browser. For example, if you send me a header saying that your browser is IE6 and I find that the said browser supports CSS3, I call BS on your identity straight away. My advice is NOT to lie about the browser version or try to mess with the headers, and to always use a real browser when scraping.
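
The IE6 example can be sketched as a simple server-side consistency check: compare what the header claims with what feature tests actually observed. The capability table below is a deliberately tiny, simplified illustration (real detection uses far finer-grained tests), and all names are hypothetical.

```python
# Simplified, illustrative capability table: which features a browser
# claiming this identity is expected to support.
EXPECTED_FEATURES = {
    "IE6": {"css3": False, "es5": False, "html5": False},
    "Firefox 60": {"css3": True, "es5": True, "html5": True},
}


def inconsistencies(claimed_browser: str, observed: dict) -> list:
    """Return the features whose observed support contradicts the claim."""
    expected = EXPECTED_FEATURES.get(claimed_browser)
    if expected is None:
        return []  # unknown claim: nothing to compare against
    return sorted(
        name
        for name, supported in observed.items()
        if name in expected and supported != expected[name]
    )


# The header says IE6, but the feature tests found CSS3 support.
print(inconsistencies("IE6", {"css3": True, "es5": False}))  # ['css3']
```

Any non-empty result means the header and the observed behavior disagree, which is exactly the situation a spoofed header creates.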

Benchmarking

When we benchmark, we're usually interested in the time required to complete a task. That's how we compare processors, graphics cards, memory and other hardware. As you would expect, the resulting time differs from one machine to another, so the result can be used for fingerprinting.
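
A minimal sketch of the idea, written in Python for illustration (a website would run an equivalent loop as a script in the browser): time a fixed workload several times and keep the median. The workload and the default numbers are arbitrary assumptions.

```python
import time


def benchmark(iterations: int = 200_000) -> float:
    """Return the seconds needed to complete a fixed CPU-bound workload."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += (i * i) % 7  # arbitrary arithmetic to keep the CPU busy
    return time.perf_counter() - start


def median_benchmark(runs: int = 5, iterations: int = 200_000) -> float:
    """Repeat the benchmark and return the median timing to reduce noise."""
    timings = sorted(benchmark(iterations) for _ in range(runs))
    return timings[len(timings) // 2]


if __name__ == "__main__":
    print(f"median run: {median_benchmark():.4f} s")
```

The absolute number means little on its own; it becomes useful once it is compared across visits or combined with the other attributes.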

Battery status indicator

With the increase in mobile usage over the years, the Battery Status API was introduced, and it allows developers to detect the battery status of a device. Knowing the battery status also allows us to calculate a discharge rate, given a small benchmark that is used to stress things out a bit. Battery status plus discharge rate form a good fingerprint. Batteries lose their properties over time and the discharge rate increases. That's why your phone tends to stay on less and less: the battery is losing "juice".
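
The discharge rate is simple arithmetic on two readings. The sketch below assumes the level is reported as a fraction between 0.0 and 1.0 (as the Battery Status API does) and that the two readings were taken before and after a short stress benchmark; the function name and sample values are hypothetical.

```python
def discharge_rate(level_start: float, level_end: float, elapsed_seconds: float) -> float:
    """Battery fraction lost per hour between two readings (levels in 0.0-1.0)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (level_start - level_end) / elapsed_seconds * 3600


# Hypothetical readings: 80% before the benchmark, 79% two minutes later.
rate = discharge_rate(0.80, 0.79, 120)
print(f"{rate:.2f} of a full charge per hour")
```

With these sample numbers the device would lose about 0.3 of a full charge per hour under load; an older battery running the same benchmark would show a higher figure.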

Other fingerprinting methods

Apart from what we've already discussed, you should also consider the following parameters when scraping.

screen resolution, browser dimensions (avoid using the same resolutions/dimensions)
color depth
browser headers
list of plugins (PDF, Flash etc.)
cookies enabled/disabled
local/session storage enabled/disabled
timezone
list of available fonts
do not track setting on/off
mouse movement to execute events (mouse should move and not point directly to targets)
usage of onwheel event (mouse scroll - the majority scroll pages using their mouse)
use of random pause to emulate human behavior (pause to read, pause to find on page)
mouse speed
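
The behavioral items at the end of this list (random pauses, a mouse that moves instead of jumping to its target) can be sketched as follows. This is only an illustration under stated assumptions: the pause range, the step count and the jitter are arbitrary, and the resulting points would still have to be fed to whatever automation tool drives the browser.

```python
import random


def human_pause(min_s: float = 0.8, max_s: float = 4.0, rng=random) -> float:
    """Pick a random pause length (seconds), e.g. to emulate reading time."""
    return rng.uniform(min_s, max_s)


def mouse_path(start, end, steps: int = 20, jitter: float = 3.0, rng=random):
    """Return points from start to end that ease in and out and wobble slightly,
    instead of jumping straight to the target."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps + 1):
        t = i / steps
        eased = t * t * (3 - 2 * t)  # smoothstep: slow start, slow finish
        x = x0 + (x1 - x0) * eased
        y = y0 + (y1 - y0) * eased
        if 0 < i < steps:  # keep the two endpoints exact
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((x, y))
    return points


path = mouse_path((10, 20), (400, 300))
print(len(path), path[0], path[-1])
print(f"pause for {human_pause():.2f} s")
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible for debugging, while the default keeps every session different.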

You might say I'm paranoid, but I can assure you this is not the case. Mouse speed and acceleration, for example, can be calculated from nothing more than the positions and timestamps of mouse events (the timestamps are in milliseconds, so the result is multiplied by 1000 to get a speed in pixels/second).
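
A minimal sketch of such a calculation, in Python rather than in-browser script, assuming each mouse sample arrives as an (x, y, timestamp in milliseconds) tuple. The function name and the sample data are hypothetical.

```python
import math


def mouse_kinematics(samples):
    """Compute (speed, acceleration) for consecutive mouse samples.

    samples: list of (x, y, timestamp_ms) tuples in chronological order.
    Speed is in pixels/second, acceleration in pixels/second^2; the
    acceleration of the first segment is None because there is no
    earlier speed to compare with.
    """
    results = []
    previous_speed = None
    for (x0, y0, t0), (x1, y1, t1) in zip(samples, samples[1:]):
        dt_ms = t1 - t0
        if dt_ms <= 0:
            continue  # skip duplicate or out-of-order timestamps
        distance = math.hypot(x1 - x0, y1 - y0)
        speed = distance / dt_ms * 1000  # * 1000: milliseconds to seconds
        if previous_speed is None:
            acceleration = None
        else:
            acceleration = (speed - previous_speed) / dt_ms * 1000
        results.append((speed, acceleration))
        previous_speed = speed
    return results


# Hypothetical samples: 50 px in the first 100 ms, 100 px in the next 100 ms.
print(mouse_kinematics([(0, 0, 0), (30, 40, 100), (90, 120, 200)]))
```

A bot that teleports the pointer or moves it at a perfectly constant speed produces values that look nothing like those of a human hand.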

How to avoid browser fingerprinting

OK, so what have we learned so far? A fingerprint cannot be avoided. Enabling or disabling things does not help, as every choice or measure creates a fingerprint. Every little identifiable detail adds entropy. The higher the entropy, the more identifiable the fingerprint.
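
Entropy can be put into numbers. As a rough sketch: an attribute value shared by a fraction p of visitors contributes -log2(p) bits, and if the attributes are assumed to be independent (a simplification that rarely holds exactly), the bits add up. The shares below are made-up values for illustration.

```python
import math


def surprisal_bits(share_of_visitors: float) -> float:
    """Bits of identifying information carried by an attribute value that
    the given fraction of visitors (0 < share <= 1) has in common."""
    return -math.log2(share_of_visitors)


def total_bits(shares) -> float:
    """Sum the bits of several attributes, assuming they are independent."""
    return sum(surprisal_bits(share) for share in shares)


# Hypothetical shares: timezone (1 in 2), screen size (1 in 4), font list (1 in 8).
bits = total_bits([1 / 2, 1 / 4, 1 / 8])
print(bits)       # 6.0
print(2 ** bits)  # 64.0
```

Three unremarkable attributes already narrow a visitor down to one in 64; a few dozen bits are enough to single out one browser among millions.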

Disabling Java, Flash, Canvas and other features in your browser might be the first thing that comes to mind, but this is a fingerprint in itself, and it might be strikingly similar to that of other visitors who are trying to stay anonymous, so there's a red flag. A machine learning algorithm can feed on past activity and learn the identifiable particularities of users with "shady" activity, so having all of these disabled can attract unwanted attention.

One of the first things that I do is load these browser extensions for every instance:

WebRTC Control
uBlock Origin
Disconnect me
Canvas Defender

These extensions are good because they block WebRTC and also do their best to prevent tracking scripts from loading, not to mention that they also block ads, which makes your bots run faster.

The best approach is to spoof the information whenever you can. I know it's easier said than done, but it isn't supposed to be easy; I warned you from the very beginning. Use real browsers (I recommend Selenium automation) and don't mess with changing headers or versions via headers. Use more than one browser and frequently change/upgrade/downgrade them. Use extensions, even dummy extensions that serve no purpose for your job. Your goal is to blend in, and you can't blend in if you stand out. Proxies play an important role but, depending on the target and its capabilities and determination, they might not be the most important aspect.

Table of contents

What is browser fingerprinting
How is the information collected
What technologies are leaking
Java, ActiveX, Flash, WebRTC
Canvas
WebGL
Audio API
Browser extensions
JavaScript and CSS engine versions
Benchmarking
Battery status indicator
Other fingerprinting methods
How to avoid browser fingerprinting