Unblocked Web
This project maintains a suite of tools for protecting the web's open knowledge. Its primary function is to create a web-scraping engine that mimics a human interacting with a website, from both a user-behavior and a "browser" perspective.
Using this Repository
This is a monorepo for working on the Browser Detect + Evade workflow of building an automated engine. It requires Yarn workspaces.
You can work with the project by:
- Cloning the repository and installing the git submodules (you can add --recursive to your initial clone command).
- Running yarn build:all.
Projects
This repository is home to several of the projects needed to create an "unblocked" automated browser engine. We imagine a world where many participants share evasions and emulations for every web feature in a single repository. These will live right next to an advanced bot-blocking detection engine that can analyze every facet of a web scraping session (TCP, TLS, HTTP, DOM, User Interactions, etc). A vault contains all the information we have discovered about how to profile and analyze browser behavior. An implementation of an agent is also provided that can apply all the evasions and run unblocked.
- Specifications. This contains generic specifications for what an automated browser needs to expose so that it can be hooked into to emulate a normal, headed browser engine. To properly mask the differences between headless Chrome on a Linux machine and headed Chrome running on a home operating system, a series of "hooks" needs to be exposed. These include hook points such as before browsers start, web pages launch, and web workers have a JavaScript environment. This specification will be the minimum spec needed to open up the browser to plugin authors.
- JsPath. A specification for serializing DOM nodes, properties, and visibility information so that they can be queried remotely.
- Agent. A basic automated engine that implements the full reference Specifications.
- Plugins. Unblocked community plugins that enhance a browser to mask Browser, Network, User Interaction and Operating System "markers" that can be used to block web scrapers.
- DoubleAgent. A series of tests that can be run to analyze real Browsers on real machines, and then compare all the detected markers to an automated setup.
- Browser Vault. A data repository containing every version of the Chrome browser with auto-updating removed, along with data Profiles of all measurable attributes of each Browser. (To be imported.)
- Emulator Builder. A library that uses the data collected in Browser Vault to "patch" runtime headless Chrome to match headed Chrome on a home Operating System. (To be imported.)
- Mission Impossible. Real-world measurement of which DOM APIs are being analyzed on the top websites, and how many of those sites are detecting and blocking the Unblocked Agent + Community Plugins. (To be imported.)
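To make the DoubleAgent idea of comparing markers from a real browser against an automated setup more concrete, here is a minimal sketch in TypeScript. It is not DoubleAgent's actual data model or API: the MarkerProfile and MarkerMismatch types, the diffMarkers function, and the sample marker names and values are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical shapes for illustration only; not DoubleAgent's real data model.
type MarkerValue = string | number | boolean;
type MarkerProfile = Record<string, MarkerValue>;

interface MarkerMismatch {
  marker: string;
  expected: MarkerValue | undefined; // value observed in the real browser
  actual: MarkerValue | undefined; // value observed in the automated setup
}

// Return every marker whose value differs between the two profiles,
// including markers that are present in only one of them.
function diffMarkers(real: MarkerProfile, automated: MarkerProfile): MarkerMismatch[] {
  const extraNames = Object.keys(automated).filter((name) => !(name in real));
  const names = Object.keys(real).concat(extraNames).sort();

  const mismatches: MarkerMismatch[] = [];
  for (const marker of names) {
    if (real[marker] !== automated[marker]) {
      mismatches.push({ marker, expected: real[marker], actual: automated[marker] });
    }
  }
  return mismatches;
}

// Made-up sample values: a headed browser profile versus a headless one.
const headedProfile: MarkerProfile = {
  "navigator.webdriver": false,
  "navigator.plugins.length": 3,
  "window.chrome": true,
};
const headlessProfile: MarkerProfile = {
  "navigator.webdriver": true,
  "navigator.plugins.length": 0,
  "window.chrome": true,
};

const mismatches = diffMarkers(headedProfile, headlessProfile);
console.log(mismatches.map((m) => m.marker));
```

In this sketch, each mismatch is a candidate for an evasion plugin: the plugin's job would be to make the automated value match the one recorded from the real browser.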
Contributing
We'd love your help improving Unblocked tools. Please don't hesitate to send a pull request. The best starting place is to add an evasion to the Unblocked Plugins or to add detections to DoubleAgent.
All Unblocked projects use eslint for code standards and ensure lint + test are run before allowing any pushes.
This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.