Proxy scraping the right way

April 15, 2019
Scraping websites is not hard at all. You don't even have to be a programmer: there are many tools that will do the job for you quite reliably. The tricky part is remaining undetected, because bigger and more important websites usually have algorithms in place to prevent activities that eat away their bandwidth and resources, or simply to keep things fair for everybody else (buying all the sneakers, the limited tickets, the best seats to a concert etc.).

All scripts/tools that use proxies should record in a database the last time each proxy was used on a particular website, because you want to avoid using the same one too often. Keep in mind that the target website also has a database and is probably analyzing logs, matching IP addresses, counting access times and even doing more advanced things that I'll talk about later in this article.
I've been working with proxies and making good money with them, and I can tell you from experience that what seems like an easy job gets truly complicated fast.
In this article we will discuss some of the things to keep in mind when scraping or botting with proxies. Some of them are easy to remember and implement, while others are quite hard. The goal is to bypass restrictions and continue your job undetected, which in turn will let you burn fewer proxies and, of course, spend less money.
The following is a short list that requires limited technical knowledge: things that should always be done when scraping to ensure a healthy list of proxies.
Poor proxy management will leave a big hole in your pocket. Proxies can get costly, and the majority of providers (especially for dedicated proxies) will only allow you to change your list once or twice a month; many won't let you change them at all. This means you have to make sure good care is always taken.
People read pages; they don't consume them like a bot, jumping from one to the next in a matter of seconds. We occasionally do that when searching for something specific, but not across tens or hundreds of pages. This kind of scraping gets picked up early and banned straight away.
When I build a scraper I always leave between 2 and 30 seconds between pages, and the number is random (never the same). 30 may seem like too much, but I'm using parallelism and many bots, each with its own proxy, so it doesn't really slow me down much.
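A minimal sketch of that randomized delay (the 2-30 second range comes from the article; the function names are my own):

```python
import random
import time

def human_delay(min_s: float = 2.0, max_s: float = 30.0) -> float:
    """Return a random pause length in seconds; never the same interval twice."""
    return random.uniform(min_s, max_s)

def pause_between_pages() -> None:
    """Sleep for a human-looking, randomized interval before the next request."""
    time.sleep(human_delay())
```

Call `pause_between_pages()` between page fetches; with many bots running in parallel, each on its own proxy, the per-bot wait barely affects total throughput.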
I don't hit the same website more than 3 times a day with the same proxy. Sometimes I've gone higher, but the starting value was still 3 and my increments were low and carefully monitored.
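The per-proxy tracking database mentioned earlier can be as simple as one SQLite table. This is a sketch under my own assumptions (schema and function names are hypothetical; the article only specifies the 3-per-day cap):

```python
import sqlite3
import datetime

# One row per (proxy, site) access; use a file path instead of :memory: in real use.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS proxy_use (
    proxy TEXT, site TEXT, used_on TEXT)""")

def record_use(proxy: str, site: str) -> None:
    """Log that this proxy hit this site today."""
    conn.execute("INSERT INTO proxy_use VALUES (?, ?, ?)",
                 (proxy, site, datetime.date.today().isoformat()))
    conn.commit()

def can_use(proxy: str, site: str, daily_cap: int = 3) -> bool:
    """True if the proxy has hit the site fewer than daily_cap times today."""
    count = conn.execute(
        "SELECT COUNT(*) FROM proxy_use WHERE proxy=? AND site=? AND used_on=?",
        (proxy, site, datetime.date.today().isoformat())).fetchone()[0]
    return count < daily_cap
```

Check `can_use()` before each request and fall back to another proxy from the pool when it returns `False`.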
This is the whole point, right? Doing what you do without revealing that you're using proxies. This is a list of more advanced/challenging things that will require technical knowledge.
Almost everyone you talk with will recommend deleting cookies every time, but think about it: do you ever delete cookies in your real browser? If I were running a very popular website and encountered an IP without any cookies or prior visits, I would immediately set a red flag next to it. Nothing big, just a red flag… 3 of them will usually result in a ban.
Depending on the website being scraped, keeping cookies between sessions can be a good thing, especially for social websites such as Instagram, Facebook, Twitter etc.
If you're using proxies to do social media activity or anything that requires an active session with a website, your tool should be able to lock a proxy to an account. This means that any future sessions/logins on a given account should be done using the same proxy and/or (in case you lost it) one from the same area/city. Being careless about these sessions and the way your accounts are locked can trigger more red flags.
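The account-to-proxy lock can reuse the same tracking database. A minimal sketch, assuming a simple pinning table (names are mine, not from the article):

```python
import sqlite3

# Hypothetical table pinning each account to the proxy it first logged in with.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS account_proxy "
           "(account TEXT PRIMARY KEY, proxy TEXT)")

def proxy_for(account: str, fresh_proxy: str) -> str:
    """Return the proxy locked to this account, pinning fresh_proxy on first use."""
    row = db.execute("SELECT proxy FROM account_proxy WHERE account=?",
                     (account,)).fetchone()
    if row:
        return row[0]  # account already locked: always reuse this proxy
    db.execute("INSERT INTO account_proxy VALUES (?, ?)", (account, fresh_proxy))
    db.commit()
    return fresh_proxy
```

If the pinned proxy dies, the fallback (per the article) should be a replacement from the same area or city rather than a random one.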
Wait, are you not using a browser? Then this one is not for you, but keep in mind: I never scraped seriously without automating real browsers. "Oh, I can change the header and lie about my browser; no one will know I'm not using a real one." I will, in less than a second. Use a real browser, please.
Always go with a browser that has high standards when it comes to privacy; I suggest Firefox here. Don't be that guy using SeaMonkey just because it's cool. It may be, but not when scraping, because you're attracting attention and that's the last thing you need. Go for the most common browsers and versions.
When it comes to trust… just don't! Your bot should disable Java, ActiveX, Flash and WebRTC for every instance. That's the first thing I do, because these technologies may be triggered from the browser but they run on your computer, where there is no proxy (unless you're using a VPN or something similar). Not blocking them can lead to data leaks, and then you're spotted right away. Most can be disabled in the browser's preferences, while others require a plugin or extension, in which case things get complicated because browser automation with extensions is hard.
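For Firefox the preference route is straightforward. A sketch of the relevant preferences (`media.peerconnection.enabled` is Firefox's real WebRTC switch; the `plugin.state.*` keys controlled Flash/Java back when NPAPI plugins still existed):

```python
# Firefox preferences that switch off the leaky technologies from the browser side.
LEAK_PREFS = {
    "media.peerconnection.enabled": False,  # WebRTC: prevents local/real IP leaks
    "plugin.state.flash": 0,                # Flash plugin: 0 = never activate
    "plugin.state.java": 0,                 # Java plugin: 0 = never activate
}

def apply_prefs(options) -> None:
    """Apply the prefs to a Selenium FirefoxOptions-like object."""
    for key, value in LEAK_PREFS.items():
        options.set_preference(key, value)

# With selenium installed, roughly:
# from selenium import webdriver
# from selenium.webdriver.firefox.options import Options
# opts = Options()
# apply_prefs(opts)
# driver = webdriver.Firefox(options=opts)
```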
Let's talk about latency. This one is more in the hands of the proxy provider than yourself, but that doesn't mean you shouldn't keep note of it and try to cross it off your list.
You see, most proxies are backed by a server. The server runs proxy software that uses a class (or multiple classes) of public IP addresses allocated to it. Pair an IP address with a communication port and you have a proxy.
The IP class used by the server is announced to organizations such as RIPE and ARIN, where the owner states ownership info, an address, a business and… a location. Each IP address therefore points to a location. If the distance between the two locations (your proxy and the website being scraped) is small (New York to New York, for example) but the latency is big, the proxy server is probably physically located somewhere else, with the IP addresses merely announced in that location.
I recommend finding ways to ping your proxy IPs and see what the latency is. If I buy New York proxies, I usually look for a tool that lets me ping from somewhere close and check the latency; it should not be bigger than 90-100 ms. A simple search for "latency check" should bring back enough results you can use.
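If ICMP ping is blocked or unavailable, timing a plain TCP connect to the proxy port is a reasonable stand-in. A small sketch (the function name is mine; this measures latency from wherever you run it, so remember the article's point about pinging from somewhere near the proxy's claimed location):

```python
import socket
import time

def tcp_latency_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Time a plain TCP connect to host:port and return it in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; we only care how long it took
    return (time.monotonic() - start) * 1000.0
```

Anything well above the 90-100 ms threshold for a supposedly nearby proxy suggests the server sits somewhere other than its announced location.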
Remember to change the screen resolution and the browser version (via headers) from time to time. A typical user agent header looks like this:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
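A sketch of rotating these per session, assuming Selenium with Firefox (`general.useragent.override` is the Firefox preference for overriding the user agent; the pool values below are illustrative and should be kept common and current):

```python
import random

# A small pool of common, real-looking (user agent, window size) pairs.
PROFILES = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36", (1920, 1080)),
    ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36", (1366, 768)),
]

def pick_profile():
    """Pick a user agent string and window size for this browser session."""
    return random.choice(PROFILES)

# With selenium, roughly:
# ua, (w, h) = pick_profile()
# opts.set_preference("general.useragent.override", ua)  # Firefox pref
# driver.set_window_size(w, h)
```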
This subject takes us into much deeper waters. It's for those who stand out from the rest and want to go the extra mile to make sure everything is done properly to retain anonymity.
Each browser has some unique features that are harder to fake, and some of the fingerprinting techniques used to identify users are quite a challenge. This is the kind of thing that allows the big boys such as Google or Facebook to know who you are even if you change computers.
Some of the issues are easy to fix while others are near impossible. I'll skip the cookie part because it was already discussed earlier, and everyone should know a thing or two about cookies before paying real money for proxies, IMHO.
The HTML5 <canvas> element can be used to draw out extra information from the browser. Nothing unique by itself but, combined with enough other details, a fingerprint starts taking shape, getting closer and closer to a unique one: font family, font size, default background color, the number of installed extensions and a great deal of other details.
When working with proxies you need to understand that the proxy is not always the one to blame. The big players don't even rank the IP address that high.
- Use real browsers. Always use real browsers and not simpler libraries such as PhantomJS. I suggest going for Selenium if you're a programmer.
- Don't play with browser headers. As I said, certain versions have new or missing functionality, and that can be tested.
- Keep or dump cookies when needed. It's odd for a returning client to have no cookies. Think about it. Cookies are simple text files that can be saved and used.
- Find good proxies. Test for latency, as I explained, and check that the proxies aren't already listed in public proxy databases.
- Rotate or reuse proxies when needed. Be aware of when you should rotate and when you should stick with the same proxy. This involves a database, but it's 100% worth it.
- Don't abuse the proxies. Keep track of last use in that database and don't use the same proxy on the same website more than 2-3 times per day.
- Avoid fingerprints. Change resolutions, default settings and other identifiable parts.
I almost forgot about this one. I use virtual machines and Docker containers that I limit in terms of RAM and CPU. Docker is a fantastic tool because containers can be instantiated super fast, without having to boot a virtual machine. Docker is a bit harder to use for launching GUI apps, but it can be done.