Randomized and approximation algorithms
Date
2022
Abstract
As the number of websites grows, so does the volume of data they hold. Automated applications that can crawl pages, extract the essential information, process it, and store it in a user-friendly format are in high demand; such programs are called web scrapers. Their popularity compels online service owners to take additional measures to reduce the share of bots in their traffic: if the service's bot protection identifies a user as a bot, the server blocks that user entirely. The primary objective of this master's thesis is to build a Node.js parser using the Selenium tool. To replicate human activity, the parser additionally employs randomized algorithms. The thesis examines server-side protection in detail: four common indicators by which a server identifies a user as a parser, and the methods servers use to prevent bots. We analyze how using the Selenium tool and introducing randomized algorithms help bypass blocking by the server. To obtain the results, we parse five subcategories of the chosen website and evaluate the stability of the software against four parameters: the number of pages parsed, the amount of data processed, the total number of errors and blocks, and the rate of data parsing per unit of time. For a qualitative analysis, we compare these indicators with the same parser running without randomized algorithms.
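As an illustration of the randomized-delay idea mentioned in the abstract, the sketch below shows how a Node.js scraper built on the selenium-webdriver package might insert human-like random pauses between page loads. The browser choice, the URL, and the CSS selector are placeholders assumed for illustration; they are not the thesis's actual configuration.

// Minimal sketch (assumed setup): random, human-like pauses between page loads
// using the selenium-webdriver npm package.
const { Builder, By } = require('selenium-webdriver');

// Return a random integer delay in [min, max] milliseconds.
function randomDelay(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

async function scrapeCategory(urls) {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    for (const url of urls) {
      await driver.get(url);
      // Wait 2-7 seconds, roughly as a human reader might, before extracting data.
      await driver.sleep(randomDelay(2000, 7000));
      // Placeholder selector: collect and print the text of matched elements.
      const items = await driver.findElements(By.css('.product-title'));
      for (const item of items) {
        console.log(await item.getText());
      }
    }
  } finally {
    await driver.quit();
  }
}

scrapeCategory(['https://example.com/category/page/1']); // placeholder URL

Varying the pause length (rather than sleeping a fixed interval) is what makes the request timing less regular and therefore harder for timing-based bot detection to flag.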
Keywords
web scrapers, applications, bots, Internet service traffic, Web Services Protection, algorithms