A suite of tools for protecting the web's open knowledge.

Overview

Unblocked Web

This project maintains a suite of tools for protecting the web's open knowledge. Its primary function is to create a web-scraping engine that mimics a human interacting with a website, both in user behavior and from a "browser" perspective.

Using this Repository

This is a monorepo for working on the Browser Detect + Evade workflow of building an automated engine. It requires Yarn workspaces.

You can work with the project by:

  1. Cloning the repository and installing git submodules (you can add --recursive to your initial clone request).
  2. Running yarn build:all.

Projects

This repository is home to several of the projects needed to create an "unblocked" automated browser engine. We imagine a world where many participants share evasions and emulations for all web features in a single repository. They will live right next to an advanced bot-blocking detection engine that can analyze every facet of a web scraping session (TCP, TLS, HTTP, DOM, user interactions, etc.). A vault contains all the information we have discovered about how to profile and analyze browser behavior. An implementation of an agent is also provided that can run all the evasions and run unblocked.

  • Specifications. This contains generic specifications for what an automated browser needs to expose so that it can be hooked into to emulate a normal, headed browser engine. To properly mask the differences between headless Chrome on a Linux machine and headed Chrome running on a home operating system, a series of "hooks" needs to be exposed. These include moments such as before browsers start, web pages launch, and web workers get a JavaScript environment. This specification will be the minimum spec needed to open up the browser to plugin authors.
  • JsPath. A specification for serializing DOM nodes, properties and visibility information so they can be remotely queried.
  • Agent. A basic automated engine that implements the full reference Specifications.
  • Plugins. Unblocked community plugins that enhance a browser to mask Browser, Network, User Interaction and Operating System "markers" that can be used to block web scrapers.
  • DoubleAgent. A series of tests that can be run to analyze real Browsers on real machines, and then compare all the detected markers to an automated setup.
  • Browser Vault. A data repository containing every version of the Chrome browser with auto-updating removed, along with data profiles of all measurable attributes of each browser. (To be imported.)
  • Emulator Builder. A library that uses the data collected in Browser Vault to "patch" runtime headless Chrome to match headed Chrome on a home operating system. (To be imported.)
  • Mission Impossible. Real-world measurement of which DOM APIs are being analyzed on the top websites, and how many are detecting and blocking the Unblocked Agent + Community Plugins. (To be imported.)
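
The hook points described under Specifications can be pictured with a small sketch. It is only an illustration: `EmulatorPlugin`, `onBrowserLaunch`, `onNewPage` and `applyLaunchHooks` are hypothetical names chosen here, not the interfaces defined by the Specifications project.

```typescript
// Hypothetical plugin shape: these names are assumptions for illustration,
// not the interfaces defined by the Specifications project.
interface EmulatorPlugin {
  name: string;
  // Assumed hook: called before the browser process starts; may adjust launch arguments.
  onBrowserLaunch?(launchArgs: string[]): void;
  // Assumed hook: called when a new page is created.
  onNewPage?(pageUrl: string): void;
}

// Gives every plugin that implements the launch hook a chance to edit a copy
// of the launch arguments, in registration order.
function applyLaunchHooks(plugins: EmulatorPlugin[], launchArgs: string[]): string[] {
  const args = [...launchArgs];
  for (const plugin of plugins) {
    plugin.onBrowserLaunch?.(args);
  }
  return args;
}

const windowSizePlugin: EmulatorPlugin = {
  name: 'window-size',
  onBrowserLaunch(launchArgs) {
    launchArgs.push('--window-size=1920,1080');
  },
};

const pageOnlyPlugin: EmulatorPlugin = {
  name: 'page-only',
  onNewPage() {
    // A plugin may implement only the hooks it needs.
  },
};

const finalArgs = applyLaunchHooks([windowSizePlugin, pageOnlyPlugin], ['--headless']);
console.log(finalArgs);
```

Keeping every hook optional lets a plugin author implement only the moments they care about, which is one plausible way to keep a minimum spec small.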

Contributing

We'd love your help improving Unblocked tools. Please don't hesitate to send a pull request. The best starting place is to add an evasion to the Unblocked Plugins or to add detections to DoubleAgent.

All Unblocked projects use eslint for code standards and ensure lint + test are run before allowing any pushes.

This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.

License

MIT

Comments
  • collaborate on a PoC mobile browser emulator plugin

    To facilitate this collaboration, I have already created a repository to which we will push our work for this Proof of Concept (PoC): https://github.com/OTA-Insight/default-mobile-browser-emulator

    opened by GlenDC 1
  • SqliteError: attempt to write a readonly database

    Since upgrading Hero and the like to version alpha-15, I get the following SQL errors at runtime (in production):

    [Screenshot of the error, 2022-11-18 at 03:56:29]

    Not always, but when I do get it, it breaks the session.

    opened by GlenDC 1
  • fix(plugins): toString leaking proxy; fix(navigator): fix user agent data

    1. toString on a proxy of a proxy was not correctly logging an error originating from an Object
    2. UserAgent data was not being properly retrieved

    Closes #5

    opened by blakebyrnes 0
  • Lint fixes + OOPIF

    Disables out-of-process iframes (OOPIF), since they create a weird navigation swap in/out that we don't know how to track properly.

    We also fixed lint errors, which required changes to bring the code up to standard.

    opened by blakebyrnes 0
  • hero.reload: wrong sec-fetch-site

    How to reproduce:

    // Reload the page and inspect the request headers of the resulting resource.
    const { request: { headers } } = await hero.reload();
    console.log(headers['sec-fetch-site']);

    same-origin is expected, but the browser sends none.
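
The expectation above can be turned into a small check. This is a sketch under assumptions: `getHeader` and `checkSecFetchSite` are hypothetical helpers written for this example, and the header map is assumed to be a plain object whose key casing is not guaranteed.

```typescript
type HeaderMap = Record<string, string | string[] | undefined>;

// Case-insensitive header lookup; returns the first value when a header repeats.
function getHeader(headers: HeaderMap, name: string): string | undefined {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() !== wanted) continue;
    return Array.isArray(value) ? value[0] : value;
  }
  return undefined;
}

// Returns a readable mismatch message, or null when the header matches.
function checkSecFetchSite(headers: HeaderMap, expected: string): string | null {
  const actual = getHeader(headers, 'sec-fetch-site');
  if (actual === expected) return null;
  return `expected sec-fetch-site "${expected}", got "${actual ?? 'missing'}"`;
}

// Example with the values reported in this issue.
const reported = checkSecFetchSite({ 'Sec-Fetch-Site': 'none' }, 'same-origin');
console.log(reported);
```

In a regression test, the headers returned by the reload call could be passed to `checkSecFetchSite(headers, 'same-origin')` and the test failed on any non-null result.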

    opened by wireguard-dev 0
  • Chrome on Android Tracking Issue

    • [x] fail default-browser-emulator for android (mobile) targets:
      • As part of the effort to support mobile browsers natively in Unblocked, a good first step would be to make sure the default browser emulator does not accept mobile targets (e.g. Android Chrome).
    • [ ] collaborate on a PoC mobile browser emulator plugin:
      • To facilitate this collaboration, I have already created a repository to which we will push our work for this Proof of Concept (PoC): https://github.com/OTA-Insight/default-mobile-browser-emulator
      • [ ] publish PoC project to NPM
      • [ ] have a (private) project make use of this plugin and report on its usage + iterate while using it
    • [ ] copy portions of the default emulator into a Chrome for Android emulator to emulate the user agent and activate touch
      • This is where an emulator says it can handle a user agent: https://github.com/ulixee/unblocked/blob/d468b4aba38b0f3cd800b1dc241586265ff13a38/plugins/default-browser-emulator/index.ts#L226
    • [ ] activate mobile user agent strings in unblocked/real-user-agents
    • [ ] study the headers/tls/tcp/etc in double-agent profiles
    • [ ] automatically generate double-agent profiles based on our findings
    • [ ] add missing evasion techniques
    • [ ] refactor common evasion technique functionality
    opened by GlenDC 0
  • Support screenshot outside viewport

    The end goal of this pull request is to use the captureBeyondViewport flag, added in Chrome 87, and migrate all screenshot logic to this approach. This would solve #13 and #21, and in general make the screenshot logic much simpler.

    This initial commit is in no way meant to be merged, but is here to show that this is now possible using the default/latest Chrome in the unblocked repo.

    What still needs to be done:

    • refactor code
    • test if scale still works as expected
    • extend tests, and remove saving images to /tmp
    • test which versions of Chrome this works in (find the oldest working version)

    These are the images created in tests, using the old and new logic (left and right respectively):

    • Should be able to take a full page screenshot [image: full-merged]
    • Should screenshot only the visible page [image: visible-merged]
    • Should be able to take a viewport screenshot [image: viewport-merged]
    • Should be able to take a clipped rect screenshot [image: clipped-merged]
    • New test, should be able to take a clipped rect screenshot outside of viewport [image: clipped-outside-new]

    opened by soundofspace 1
  • Fullscreen screenshot has side effects

    Taking a fullscreen screenshot does not always create the intended image. Sometimes side effects can be noticed, either in the image or when running with a headful Chrome instance. These effects are caused by this code:

    await this.devtoolsSession.send('Emulation.setDeviceMetricsOverride', {
      ...contentSize,
      deviceScaleFactor: scale,
      mobile: false,
    });

    Changing the viewport of the device is a clever way to take fullscreen images, but it has side effects on websites where this change is detected. Some issues that I saw:

    • shift of content, probably related to the aspect ratio change
    • extra context that was opened by hero (querySelector.click...) is closed again
    • random weird behaviour in some components

    This is partially linked to #13, and can be solved in newer Chrome versions by using captureBeyondViewport: true.
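
A minimal sketch of how the parameters for the DevTools `Page.captureScreenshot` command could be assembled so that no viewport override is needed. `ScreenshotOptions` and `buildCaptureParams` are hypothetical names for this example, and the Chrome 87 cutoff is the version this document cites for the flag, not something verified here.

```typescript
// Subset of the Chrome DevTools Protocol Page.captureScreenshot parameters.
interface CaptureScreenshotParams {
  format: 'jpeg' | 'png';
  quality?: number;
  clip?: { x: number; y: number; width: number; height: number; scale: number };
  captureBeyondViewport?: boolean;
}

// Hypothetical options object for this sketch.
interface ScreenshotOptions {
  format: 'jpeg' | 'png';
  jpegQuality?: number;
  rectangle?: { x: number; y: number; width: number; height: number; scale?: number };
}

function buildCaptureParams(
  options: ScreenshotOptions,
  chromeMajorVersion: number,
): CaptureScreenshotParams {
  const params: CaptureScreenshotParams = { format: options.format };
  // Quality only applies to JPEG output.
  if (options.format === 'jpeg' && options.jpegQuality !== undefined) {
    params.quality = options.jpegQuality;
  }
  if (options.rectangle) {
    const { x, y, width, height, scale = 1 } = options.rectangle;
    params.clip = { x, y, width, height, scale };
  }
  // Assumption: the flag is available from Chrome 87 onward, as stated in this document.
  if (chromeMajorVersion >= 87) {
    params.captureBeyondViewport = true;
  }
  return params;
}

const belowTheFold = buildCaptureParams(
  { format: 'png', rectangle: { x: 0, y: 1200, width: 300, height: 200 } },
  100,
);
console.log(belowTheFold);
```

With the flag set, the clip can describe an area below the fold without changing device metrics, which is what avoids the side effects listed above. Older versions would still need a fallback.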

    opened by soundofspace 1
  • takeScreenshot fails to work when passing in a rectangle outside the viewport

    I can take a screenshot as follows:

    const mySel = await hero.querySelector('#myId');
    const screenshotRectangle =
       // eslint-disable-next-line @typescript-eslint/no-explicit-any
       (await mySel.getBoundingClientRect()) as any;
    screenshotRectangle.scale = 1;
    const buffer = await hero.takeScreenshot({
       format: 'jpeg',
       rectangle: screenshotRectangle,
       jpegQuality: 100,
       fullPage: true,
    });
    

    However, this returns the full page, ignoring the given rectangle. If I leave off fullPage (which is what I tried first), I get an error stating that the given rectangle is outside the viewport.

    The assert I hit appears to be https://github.com/unblocked-web/agent/blob/main/core/lib/Page.ts#L770-L773. Two observations there:

    • for a full page screenshot, you adapt the viewport size to the full page size
    • in https://github.com/unblocked-web/agent/blob/main/core/lib/Page.ts#L487 I read that you can apparently capture beyond the viewport

    Given the second observation, I think we can simply drop that assert. But if you do want to support Chrome < 87, then I suppose you would also want to support overriding the screen dimensions based on a given rectangle, rather than only doing it for fullPage?


    According to @blakebyrnes, we might need to resize to fit the element size if it's that big. See his original message on Discord: https://discord.com/channels/966293220780806174/966293220780806177/1032432492822667305
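
The fallback discussed here (resizing so that a given rectangle fits) can be sketched as two pure functions. `isInsideViewport` and `viewportToFit` are hypothetical helpers for this example, not code from the agent. The sketch also assumes the rectangle uses non-negative coordinates measured from the top-left of the viewport.

```typescript
interface Rect { x: number; y: number; width: number; height: number }
interface Size { width: number; height: number }

// True when the whole rectangle lies within the current viewport.
function isInsideViewport(rect: Rect, viewport: Size): boolean {
  return (
    rect.x >= 0 &&
    rect.y >= 0 &&
    rect.x + rect.width <= viewport.width &&
    rect.y + rect.height <= viewport.height
  );
}

// Smallest viewport size, never smaller than the current one, that contains the rectangle.
function viewportToFit(rect: Rect, viewport: Size): Size {
  return {
    width: Math.max(viewport.width, Math.ceil(rect.x + rect.width)),
    height: Math.max(viewport.height, Math.ceil(rect.y + rect.height)),
  };
}

const viewport: Size = { width: 800, height: 600 };
const tallElement: Rect = { x: 0, y: 500, width: 100, height: 300 };

console.log(isInsideViewport(tallElement, viewport));
console.log(viewportToFit(tallElement, viewport));
```

The resulting size could then be used for a temporary device metrics override, in the same way the fullPage path overrides the dimensions with the content size.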

    opened by GlenDC 4
  • Double Agent - Future Plugins Omnibus

    This list can serve as a central index as we create issues to tackle each sub-item.

    • [ ] browser/dom-worker

      • check full scope of js env in various workers (service, dedicated, shared)
      • [x] collect worker dom environments (NOTE: built collect in browser-dom-environment, but turned off)
      • [ ] analyze dom environments + dom-bridges
    • [ ] browser/dom-frame

      • check full scope of js env in various iframes (sandbox, cross-domain, etc)
      • [x] collect in browser-dom-environment (NOTE: turned off and not in browser-profiler-data)
      • [ ] analyze dom environments + dom-bridges
    • [ ] browser/tampering

      • look for JavaScript proxies and DOM mismatches (CreepJS is great at this)
    • [ ] browser/vm

      • look for red pills

    • [ ] browser/webgl

    • [ ] browser/render

      • browser renders dimensions differently per platform/browser
    • [ ] browser/voice

      • [x] Collect: look for available voices (some are OS-specific)
      • [ ] needs Analyze plugin
    • [ ] http/referrers

    • [ ] http/cache

      • measure cache header responses to various server headers
    • [ ] http/favicon

      • favicons are requested differently in headed than in headless
    • [ ] tls/clienthello-ws

      • websockets don't have an alpn for http2 in Chrome
      • [x] this exists in a wss-tls branch, but branch is outdated
    • [ ] user/interaction

      • measure interaction with forms/typing/etc for automated patterns. Might need a human alternative
    • [ ] user/mouse

      • measure use of mouse to get from one place to another. Some good research papers about this
    • [ ] http/headers + http/cookies:

      • HTTP delete/update (trigger from forms?)
      • Direct loads without referrers
      • Prefetch
      • Signed Exchange
      • HTTP/2 headers/trailers
      • [x] Sec CH-UA headers
    • [ ] ip/address:

      • Socket reuse
    • [ ] Audio context (new plugin?):

      • https://github.com/cozylife/audio-fingerprint
    Labels: enhancement, good first issue, help wanted
    opened by blakebyrnes 0
Releases(v2.0.0-alpha.17)
Owner
Unblocked
The Web Scraper Protection Bureau