There are many services that provide a large quantity of proxies that you can use when scraping websites to avoid getting timed out. But what if you want to buy a single proxy package and use it from multiple computers on different networks?
Types of security
The main types of security that proxy services use in order to avoid abuse of their system are:
- Username and password authentication
- IP whitelisting (one or more allowed addresses)
- Maximum number of threads
or any combination of these three.
While the first is easy to bypass (just use the same username and password on all your devices), the other two can be harder to work around. There is probably no way around a thread limit, but we can find a way to evade the IP whitelisting. For obvious reasons, I will not reveal which services are vulnerable to the workarounds I’m going to illustrate.
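To illustrate the first point, here is a minimal Python sketch of credential-based proxy usage; the host, port and credentials are placeholders, not a real provider:

```python
# Build an authenticated proxy URL. The provider validates the
# username/password on each request rather than the source IP,
# which is why the same credentials work from every device.
def proxy_url(user, password, host, port):
    return 'http://%s:%s@%s:%d' % (user, password, host, port)

# Placeholder credentials and endpoint, for illustration only.
url = proxy_url('myuser', 'mypass', 'proxy.example.com', 8080)
proxies = {'http': url, 'https': url}
# With the requests library installed you would then do:
# requests.get('http://example.com', proxies=proxies)
print(url)
```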
The “easy” way
The easiest way is to use a VPN. When you send an HTTP request to one of the provider’s proxies through the VPN, the provider sees the VPN server’s IP instead of yours, so you can simply whitelist that single IP and connect all your devices to the same server with a VPN tunnel.
Doing this free of charge, though, can be difficult.
- OpenVPN Access Server, which can be self-hosted, only allows up to 2 simultaneous connections unless you buy a license
- Paid VPN services hosted on third-party servers give you a variety of IPs to choose from, but since you need a static IP you have to connect to the same server every time
- Finally, you could host an open source VPN yourself
The hacky way
Since I wasn’t happy with the solution above, I started investigating how to build a two-step proxy that takes all incoming traffic from my devices and forwards it to one of the provider’s proxies, making it look like it all came from a single IP.
I first tried a Node.js solution and found the proxy-chain module, which sounded very appealing at first, but it performed very badly when I tested it with around 1000 proxies.
Finally I realized I could simply use nginx and have it redirect all incoming traffic to the addresses I specify. This system works pretty well, and after testing it for a few days I feel confident saying it’s a decent zero-cost solution to the problem (provided you can host a Linux server, of course).
First of all we want to install nginx. I prefer to use the mainline branch instead of the stable one because it offers more features. You can find instructions here. Then add the nginx user
useradd nginx
After that we need to edit the configuration file (check where it’s located on your installation)
vi /etc/nginx/nginx.conf
then delete all the content using
:1,$d
paste this configuration
load_module /usr/lib/nginx/modules/ngx_stream_module.so;
worker_rlimit_nofile 65536;
user nginx;
worker_processes 1;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events { worker_connections 65536; }
http { }
include /etc/nginx/tcpconf.d/*;
and finally save and quit.
This configuration loads the stream module, which is what we need in order to take an incoming TCP stream on a local port and redirect it to another host and port. We also set a high file limit, since we need to make sure nginx can keep enough files open at the same time. It’s also important to note that the worker connection limit MUST be a power of 2 or it will be disregarded! Yeah, that cost me almost an hour of debugging…
We should check if the file limits for the nginx user are high enough or if we should increase them. To test it, you can type
su - nginx
ulimit -Hn
ulimit -Sn
If the limits are too low, you can edit this file
vi /etc/sysctl.conf
and append this line
fs.file-max = 70000
After saving and closing, edit this other file
vi /etc/security/limits.conf
and add these two lines
nginx soft nofile 10000
nginx hard nofile 30000
Make sure to use tabs instead of spaces, as spaces can cause issues for some people. Finally, edit the file
vi /etc/pam.d/common-session
and append this line
session required pam_limits.so
Now that we’ve edited all the required files, we can reload sysctl and nginx (although fully restarting nginx would be a better way to ensure the new limits are applied)
sysctl -p
nginx -s reload
We’re almost good to go! We just need a few more steps. First of all, create a folder called tcpconf.d inside your nginx root folder (where the nginx.conf file is)
mkdir tcpconf.d
cd tcpconf.d
Inside this folder we need to create a file with this syntax (you can give it any name you want)
stream {
server { listen 30000; proxy_pass 1.2.3.4:8080; }
server { listen 30001; proxy_pass 5.6.7.8:8080; }
}
You should be able to generate this file pretty easily with this Python script that I wrote for the occasion
output = open('output.txt', 'w+')
p = open('proxies.txt', 'r')
proxies = p.read().splitlines()
p.close()
output.write('stream {\n')
i = 30000
for proxy in proxies:
    output.write('    server { listen %d; proxy_pass %s; }\n' % (i, proxy))
    i += 1
output.write('}\n')
output.close()
This script takes your proxy list in the format of one proxy per line (e.g. 12.34.56.78:8080) and generates a file called output.txt that is ready to be placed inside the tcpconf.d folder. By default the script uses ports from 30000 onwards; if you want other ports you can change that value, but do not choose a low one, since many low ports are reserved for other processes (like SSH on port 22) and nginx will give you errors on startup.
Finally we need to generate a proxy list that we can use in our favourite scraping software (OpenBullet, for example). Run this Python script
output = open('output.txt', 'w+')
ip = 'YOUR_SERVER_IP'
for i in range(30000, 55000):
    output.write('%s:%d\r\n' % (ip, i))
output.close()
You have to replace the IP with your own and define the port range based on how many proxies you used (I had 25000, so I needed to output 25000 ports starting from port 30000, which is the beginning of the default port range of the previous script).
Finally, it’s very important that you import these proxies as SOCKS5 or they will not work, since we’re redirecting TCP streams and not HTTP requests (they are on two different layers of the protocol stack).
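Here is a hedged sketch of what that looks like from Python, assuming the requests library is installed with SOCKS support (pip install requests[socks]); the server IP and port are the placeholders from the scripts above:

```python
# Point HTTP(S) traffic at one of the nginx-forwarded ports,
# speaking SOCKS5 as noted above.
# 'YOUR_SERVER_IP' and 30000 are placeholders, not real values.
def socks5_proxies(server_ip, port):
    url = 'socks5://%s:%d' % (server_ip, port)
    return {'http': url, 'https': url}

proxies = socks5_proxies('YOUR_SERVER_IP', 30000)
# With requests[socks] installed you would then do:
# requests.get('http://example.com', proxies=proxies)
print(proxies['http'])
```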
Congratulations, we are officially done and we can finally start nginx
service nginx start
If you have a large number of proxies this might take some time, but if you wait patiently you should see your middle proxies listening on all the configured ports, ready to accept some traffic!
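Once nginx is up, you can sanity-check that the forwarded ports are actually listening before pointing your scraper at them. A minimal sketch, where the server IP and port range are the placeholders used earlier:

```python
import socket

def port_open(host, port, timeout=2.0):
    # A plain TCP connect is enough: nginx accepts the connection
    # on every forwarded port before any proxy traffic is sent.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 'YOUR_SERVER_IP' is a placeholder; 30000 is the default start port.
for port in range(30000, 30005):
    print(port, 'open' if port_open('YOUR_SERVER_IP', port) else 'closed')
```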
Conclusion
This was a great learning experience: I used nginx for something I didn’t know it could do, via the stream module, which is optional and needs to be activated with a directive in the configuration file. I also learned how Linux handles the maximum number of open files allowed for each process and messed around with various configuration files to get it all working. Next time I will try one of the open source VPN solutions and report my findings on this blog.