The highly efficient browser driver on top of puppeteer, ready for production scenarios.

Last update: Jan 6, 2023

Overview

browserless is an efficient driver for controlling headless browsers, built on top of puppeteer and developed for scenarios where performance matters.

Highlights

Puppeteer-like API for common tasks (text, screenshot, html, pdf).
Built-in evasion techniques to prevent being blocked.
Built-in adblocker for canceling unnecessary requests.
Shell interaction via Browserless CLI.
Easy Google Lighthouse integration.
Automatic retry & error handling.
Sensible defaults.

Installation

You can install it via npm:

$ npm install browserless puppeteer --save

browserless is backed by puppeteer, so you need to install it as well.

You can use it next to puppeteer, puppeteer-core or puppeteer-firefox, interchangeably.

Usage

This is a full example showcasing the browserless capabilities:

const createBrowserless = require('browserless')
const termImg = require('term-img')

// First, create a browserless factory
// that keeps a singleton browser process running
const browserlessFactory = createBrowserless()

// After that, you can create as many browser contexts
// as you need. The browser contexts won't share cookies/cache
// with other browser contexts.
const browserless = await browserlessFactory.createContext()

// Perform the action you want, e.g., taking a screenshot
const buffer = await browserless.screenshot('http://example.com', {
  device: 'iPhone 6'
})

console.log(termImg(buffer))

// After your task is done, destroy your browser context
await browserless.destroyContext()

// At the end, gracefully shut down the browser process
await browserlessFactory.close()

As you can see, browserless is implemented using a single browser process and creating/destroying specific browser contexts.

You can read more about that in the technical details section.

If you're already using puppeteer, you can upgrade to browserless with almost no effort.

Additionally, you can use some specific packages in your codebase, interacting with them from puppeteer.

Initialization

All methods follow the same interface:

<url>: The target URL. It's required.
[options]: Specific settings for the method. It's optional.

The methods follow an async interface, returning a Promise.

.constructor(options)

It initializes a singleton browserless process, returning a factory that will be used for creating browser contexts:

const browserlessFactory = require('browserless')

const { createContext } = browserlessFactory({
  timeout: 25000,
  lossyDeviceName: true,
  ignoreHTTPSErrors: true 
})

// Now every time you call `createContext`
// it will create a new browser context.
const browserless = await createContext({ retry: 2 })

These are some proprietary browserless options; the rest of the options will be passed to puppeteer.launch.

options

See puppeteer.launch#options.

Additionally, you can set up:

defaultDevice

type: string
default: 'Macbook Pro 13'

Sets a consistent device viewport for each page.

lossyDeviceName

type: boolean
default: false

It enables lossy detection over the device descriptor input.

const browserless = require('browserless')({ lossyDeviceName: true })

browserless.getDevice({ device: 'macbook pro 13' })
browserless.getDevice({ device: 'MACBOOK PRO 13' })
browserless.getDevice({ device: 'macbook pro' })
browserless.getDevice({ device: 'macboo pro' })

This setting is intended to find the device even if the descriptor name doesn't match exactly.
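
One way to picture lossy matching is a case-insensitive lookup that falls back to the closest name by edit distance. The sketch below only illustrates that idea and is not the browserless implementation; the `devices` list, `levenshtein` and `findDevice` are hypothetical names introduced for the example.

```javascript
// Hypothetical device list; browserless ships its own descriptors.
const devices = ['Macbook Pro 13', 'iPhone 6', 'iPad']

// Classic dynamic-programming edit distance, using a single row.
const levenshtein = (a, b) => {
  const row = Array.from({ length: b.length + 1 }, (_, i) => i)
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]
      row[j] = Math.min(
        row[j] + 1, // deletion
        row[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      )
      prev = tmp
    }
  }
  return row[b.length]
}

// Strict lookup by default; with `lossy`, return the closest descriptor.
const findDevice = (name, { lossy = false } = {}) => {
  const exact = devices.find(device => device === name)
  if (exact || !lossy) return exact
  const input = name.toLowerCase()
  return devices
    .map(device => ({ device, score: levenshtein(input, device.toLowerCase()) }))
    .sort((a, b) => a.score - b.score)[0].device
}
```

A production version would probably also reject matches whose distance is too large, so unrelated input doesn't resolve to an arbitrary device.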

mode

type: string
default: 'launch'
values: 'launch' | 'connect'

It defines whether the browser should be spawned using puppeteer.launch or puppeteer.connect.

timeout

type: number
default: 30000

This setting will change the default maximum navigation time.

puppeteer

type: Puppeteer
default: puppeteer|puppeteer-core|puppeteer-firefox

It's automatically detected based on your dependencies; puppeteer, puppeteer-core and puppeteer-firefox are supported.

.createContext(options)

Now that you have your browserless factory instantiated, you can create browser contexts on demand:

const browserless = await browserlessFactory.createContext({
  retry: 2 
})

Every browser context is isolated. They won't share cookies/cache with other browser contexts. They also can contain specific options.
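
Because every context should be destroyed once the work is done, a small wrapper built on try/finally can make that harder to forget. This is a sketch and not part of browserless: `withContext` is a hypothetical helper that only assumes the `createContext` and `destroyContext` methods described in this document.

```javascript
// Hypothetical helper: runs `fn` with a fresh browser context and always
// destroys the context afterwards, even if `fn` throws.
const withContext = async (browserlessFactory, fn, options = {}) => {
  const browserless = await browserlessFactory.createContext(options)
  try {
    return await fn(browserless)
  } finally {
    await browserless.destroyContext()
  }
}
```

With a real factory, usage could look like `await withContext(browserlessFactory, browserless => browserless.html('https://example.com'))`.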

options

retry

type: number
default: 2

The number of retries that can be performed before considering a navigation as failed.
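
The retry semantics can be pictured as a loop that re-runs a failing navigation up to `retry` extra times before giving up. The following is a minimal sketch under that reading, not the browserless internals; `withRetry` is a hypothetical name.

```javascript
// Hypothetical helper: calls `fn` once, then up to `retry` more times
// while it keeps rejecting. The last error is rethrown if every attempt fails.
const withRetry = async (fn, { retry = 2 } = {}) => {
  let lastError
  for (let attempt = 0; attempt <= retry; attempt++) {
    try {
      return await fn(attempt)
    } catch (error) {
      lastError = error
    }
  }
  throw lastError
}
```

With `retry: 0` the function is called exactly once, which matches the idea of disabling retries for a context.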

.browser

It returns the Browser instance associated with your browserless factory.

const browser = await browserlessFactory.browser()
console.log('My browser PID is', browser.process().pid)

.respawn

It will respawn the singleton browser associated with your browserless factory.

const getPID = async promise => (await promise).process().pid

console.log('Process PID:', await getPID(browserlessFactory.browser()))

await browserlessFactory.respawn()

console.log('Process PID:', await getPID(browserlessFactory.browser()))

This method is an implementation detail; normally you don't need to call it.

.close

It will close the singleton browser associated with your browserless factory.

const onExit = require('signal-exit')

onExit(async (code, signal) => {
  console.log('shutting down all the things')
  await browserlessFactory.close()
  console.log(`exit with code ${code} (${signal})`)
})

It should be used to gracefully shut down your resources.

Methods

.html(url, options)

It serializes the content from the target url into HTML.

const html = await browserless.html('https://example.com')
console.log(html)

options

See browserless.goto to know all the options and values supported.

.text(url, options)

It serializes the content from the target url into plain text.

const text = await browserless.text('https://example.com')
console.log(text)

options

See browserless.goto to know all the options and values supported.

.pdf(url, options)

It generates the PDF version of a website behind a URL.

const buffer = await browserless.pdf('https://example.com')
console.log(`PDF generated in ${buffer.byteLength} bytes`)

options

This method uses the following options by default:

{
  margin: '0.35cm',
  printBackground: true,
  scale: 0.65
}

See browserless.goto to know all the options and values supported.

Also, any page.pdf option is supported.

Additionally, you can set up:

margin

type: string | string[]
default: '0.35cm'

It sets paper margins. All possible units are:

px for pixel.
in for inches.
cm for centimeters.
mm for millimeters.

You can pass an object specifying each side of the paper:

const buffer = await browserless.pdf(url.toString(), {
  margin: {
    top: '0.35cm',
    bottom: '0.35cm',
    left: '0.35cm',
    right: '0.35cm'
  }
})

Or, in case you pass a string, it will be used for all the sides:

const buffer = await browserless.pdf(url.toString(), {
  margin: '0.35cm'
})
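
The two accepted shapes can be reduced to one by expanding the string shorthand into a per-side object, which is the form Puppeteer's `page.pdf` takes for `margin`. This sketch also checks the units listed above; `normalizeMargin` and `UNIT` are hypothetical names and not necessarily how browserless handles the option.

```javascript
// Accepts values such as '0.35cm', '10px', '1in' or '5mm'.
const UNIT = /^\d+(\.\d+)?(px|in|cm|mm)$/

// Hypothetical helper: expands a margin shorthand into a per-side object
// and rejects values that use an unsupported unit.
const normalizeMargin = (margin = '0.35cm') => {
  const sides =
    typeof margin === 'string'
      ? { top: margin, right: margin, bottom: margin, left: margin }
      : margin
  for (const [side, value] of Object.entries(sides)) {
    if (!UNIT.test(value)) {
      throw new TypeError(`Invalid margin for ${side}: ${value}`)
    }
  }
  return sides
}
```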

.screenshot(url, options)

It takes a screenshot from the target url.

const buffer = await browserless.screenshot('https://example.com')
console.log(`Screenshot taken in ${buffer.byteLength} bytes`)

options

This method uses the following options by default:

{
  device: 'macbook pro 13'
}

See browserless.goto to know all the options and values supported.

Also, any page.screenshot option is supported.

Additionally, you can set up:

codeScheme

type: string
default: 'atom-dark'

When this value is present and the response 'Content-Type' header is 'json', it beautifies HTML markup using Prism.

The syntax highlighting theme can be customized. You can set:

A prism-themes identifier (e.g., 'dracula').
A remote URL (e.g., 'https://unpkg.com/prism-theme-night-owl').

element

type: string

Capture the DOM element matching the given CSS selector. It will wait for the element to appear in the page and to be visible.

overlay

type: object

After the screenshot has been taken, this option allows you to place the screenshot into a fancy overlay.

You can configure the overlay by specifying:

browser: It sets the browser image overlay to use; 'light' and 'dark' are the supported values.
background: It sets the background to use. You can pass:
- A hexadecimal/rgb/rgba color code, e.g., #c1c1c1.
- A CSS gradient, e.g., linear-gradient(225deg, #FF057C 0%, #8D0B93 50%, #321575 100%).
- An image URL, e.g., https://source.unsplash.com/random/1920x1080.

const buffer = await browserless.screenshot(url.toString(), {
  hide: ['.crisp-client', '#cookies-policy'],
  overlay: {
    browser: 'dark',
    background:
      'linear-gradient(45deg, rgba(255,18,223,1) 0%, rgba(69,59,128,1) 66%, rgba(69,59,128,1) 100%)'
  }
})

.destroyContext

It will destroy the current browser context.

const browserless = await browserlessFactory.createContext({ retry: 0 })

const content = await browserless.html('https://example.com')

await browserless.destroyContext()

.getDevice(options)

Given a specific device descriptor, this method returns the device settings for it.

browserless.getDevice({ device: 'Macbook Pro 15' })
// {
//   userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
//   viewport: {
//     width: 1440,
//     height: 900,
//     deviceScaleFactor: 2,
//     isMobile: false,
//     hasTouch: false,
//     isLandscape: false
//   }
// }

It extends from puppeteer.devices, adding some missing devices there.

options

device

type: string

The device descriptor name. It's used to find the rest of the presets associated with it.

When lossyDeviceName is enabled, a fuzzy search rather than a strict search will be performed in order to maximize getting a result back.

viewport

type: object

Extra viewport settings that will be merged with the device presets.

browserless.getDevice({ 
  device: 'iPad', 
  viewport: {
    isLandscape: true
  } 
})

headers

type: object

Extra headers that will be merged with the device presets.

browserless.getDevice({ 
  device: 'iPad', 
  headers: {
    'user-agent': 'googlebot'
  } 
})
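
The merge described for `viewport` and `headers` can be sketched as a shallow spread over the preset. Everything here is an assumption for illustration: `ipadPreset` is a made-up descriptor, `mergeDevice` is a hypothetical name, and the rule that a `user-agent` header overrides the preset's `userAgent` is a guess at the intent rather than confirmed behavior.

```javascript
// Hypothetical preset; real descriptors come from browserless.
const ipadPreset = {
  userAgent: 'example-ipad-user-agent',
  viewport: { width: 768, height: 1024, isMobile: true, isLandscape: false }
}

// Shallow-merges extra viewport settings and headers over a preset.
// Assumption: a `user-agent` header, when given, wins over the preset one.
const mergeDevice = (preset, { viewport = {}, headers = {} } = {}) => ({
  userAgent: headers['user-agent'] || preset.userAgent,
  viewport: { ...preset.viewport, ...viewport },
  headers: { ...headers }
})
```

Only the keys passed in `viewport` change; the remaining preset values, such as `width` and `height`, are kept.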

.evaluate(fn, gotoOpts)

It exposes an interface for creating your own evaluate function, passing you the page and response.

The fn will receive page and response as arguments:

const ping = browserless.evaluate((page, response) => ({
  statusCode: response.status(),
  url: response.url(),
  redirectUrls: response.request().redirectChain()
}))

await ping('https://example.com')
// {
//   "statusCode": 200,
//   "url": "https://example.com/",
//   "redirectUrls": []
// }

You don't need to close the page; it will be closed automatically.

Internally, the method performs a browserless.goto, so you can pass extra options as the second parameter:

const serialize = browserless.evaluate(
  page => page.evaluate(() => document.body.innerText), {
  waitUntil: 'domcontentloaded'
})

await serialize('https://example.com')
// => the visible text of the page body

.goto(page, options)

It performs a page.goto with a lot of extra capabilities.

const browserless = require('browserless')

const page = await browserless.page()
const { response, device } = await browserless.goto(page, { url: 'http://example.com' })

options

Any option passed here will be passed through to page.goto.

Additionally, you can set up:

abortTypes

type: array
default: []

It sets the ability to abort requests based on the resource type.
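
With Puppeteer's request interception, aborting by resource type could look like the sketch below. It is not necessarily how browserless implements `abortTypes`; `createRequestHandler` is a hypothetical name, and the `request` argument only needs the `resourceType()`, `abort()` and `continue()` methods of Puppeteer's `HTTPRequest`.

```javascript
// Hypothetical helper: builds a handler that aborts requests whose
// resource type is listed in `abortTypes` and lets the rest continue.
const createRequestHandler = (abortTypes = []) => {
  const blocked = new Set(abortTypes)
  return request =>
    blocked.has(request.resourceType()) ? request.abort() : request.continue()
}

// With a real Puppeteer page, it would be wired like this:
// await page.setRequestInterception(true)
// page.on('request', createRequestHandler(['image', 'font']))
```

With the default empty array nothing is aborted, so every request continues as usual.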

adblock

type: boolean
default: true

It enables the built-in adblocker by Cliqz, which aborts unnecessary third-party requests associated with ad services.

animations

type: boolean
default: false

When it's false, it disables CSS animations and transitions, and it sets prefers-reduced-motion accordingly.

click

type: string | string[]

Click the DOM element matching the given CSS selector.

device

type: string
default: 'macbook pro 13'

It specifies the device descriptor to use in order to retrieve userAgent and viewport.

evasions

type: string[]
default: require('@browserless/goto').evasions

It makes your headless browser harder to detect, reducing the chance of being blocked.

Antibot systems run checks to determine whether you are a real browser and block any kind of automated access. The evasion techniques implemented to counter those checks are:

Evasion	Description
`chromeRuntime`	Ensure `window.chrome` is defined.
`stackTraces`	Prevent detecting Puppeteer via variable name.
`mediaCodecs`	Ensure media codecs are defined.
`navigatorPermissions`	Mock over `Notification.permissions`.
`navigatorPlugins`	Ensure your browser has `NavigatorPlugins` defined.
`navigatorWebdriver`	Ensure `Navigator.webdriver` exists.
`randomizeUserAgent`	Use a different `User-Agent` every time.
`webglVendor`	Ensure `WebGLRenderingContext` & `WebGL2RenderingContext` are defined.

The evasion techniques are enabled by default. You can omit techniques by filtering them out:

const createBrowserless = require('browserless')

const evasions = require('@browserless/goto').evasions.filter(
  (evasion) => evasion !== 'randomizeUserAgent'
)

const browserlessFactory = createBrowserless({ evasions });

headers

type: object

An object containing additional HTTP headers to be sent with every request.

const browserless = require('browserless')

const page = await browserless.page()
await browserless.goto(page, {
  url: 'http://example.com',
  headers: {
    'user-agent': 'googlebot',
    cookie: 'foo=bar; hello=world'
  }
})

hide

type: string | string[]

Hide DOM elements matching the given CSS selectors.

const buffer = await browserless.screenshot(url.toString(), {
  hide: ['.crisp-client', '#cookies-policy']
})

This sets visibility: hidden on the matched elements.

html

type: string

When you provide HTML markup, it is loaded with page.setContent instead of fetching the content from the target URL.

javascript

type: boolean
default: true

When it's false, it disables JavaScript on the current page.

mediaType

type: string
default: 'screen'

Changes the CSS media type of the page using page.emulateMediaType.

modules

type: string | string[]

Injects <script type="module"> into the browser page.

It can accept:

Absolute URLs (e.g., 'https://cdn.jsdelivr.net/npm/@microlink/[email protected]/src/browser.js').
Local file (e.g., 'local-file.js').
Inline code (e.g., "document.body.style.backgroundColor = 'red'").

const buffer = await browserless.screenshot(url.toString(), {
  modules: [
    'https://cdn.jsdelivr.net/npm/@microlink/[email protected]/src/browser.js',
    'local-file.js',
    "document.body.style.backgroundColor = 'red'"
  ]
})
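
Since one option accepts URLs, local files and inline code, each value has to be told apart before it is injected. The heuristic below is a sketch and an assumption, not the browserless logic; `toTagOptions` is a hypothetical name. It maps each value to the `url`, `path` or `content` option that Puppeteer's `page.addScriptTag` and `page.addStyleTag` accept.

```javascript
// Hypothetical classifier for values passed to `modules`, `scripts` or `styles`.
const toTagOptions = value => {
  // Absolute http(s) URLs are loaded remotely.
  if (/^https?:\/\//i.test(value)) return { url: value }
  // A bare path ending in a known extension is treated as a local file.
  if (/^[\w./-]+\.(m?js|css)$/.test(value)) return { path: value }
  // Anything else is assumed to be inline code.
  return { content: value }
}

// With a real Puppeteer page, a module could then be injected with:
// await page.addScriptTag({ ...toTagOptions(value), type: 'module' })
```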

remove

type: string | string[]

Remove DOM elements matching the given CSS selectors.

const buffer = await browserless.screenshot(url.toString(), {
  remove: ['.crisp-client', '#cookies-policy']
})

This sets display: none on the matched elements, so it could potentially break the website layout.
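
The difference from hide matters for layout: visibility: hidden keeps the space the element occupied, while display: none takes the element out of the flow. A sketch of how both options could be turned into a stylesheet follows; `toArray` and `buildCss` are hypothetical names, and the use of `!important` is an assumption made so the rules win over site styles.

```javascript
// Accepts a single selector, an array of selectors, or nothing.
const toArray = value => (value === undefined ? [] : [].concat(value))

// Hypothetical helper: turns `hide` and `remove` selectors into CSS rules.
const buildCss = ({ hide, remove } = {}) =>
  [
    ...toArray(hide).map(selector => `${selector} { visibility: hidden !important; }`),
    ...toArray(remove).map(selector => `${selector} { display: none !important; }`)
  ].join('\n')

// With a real Puppeteer page, the result could be injected with:
// await page.addStyleTag({ content: buildCss({ hide, remove }) })
```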

colorScheme

type: string
default: 'no-preference'

Sets prefers-color-scheme CSS media feature, used to detect if the user has requested the system use a 'light' or 'dark' color theme.

scripts

type: string | string[]

Injects <script> into the browser page.

It can accept:

Absolute URLs (e.g., 'https://cdn.jsdelivr.net/npm/@microlink/[email protected]/src/browser.js').
Local file (e.g., 'local-file.js').
Inline code (e.g., "document.body.style.backgroundColor = 'red'").

const buffer = await browserless.screenshot(url.toString(), {
  scripts: [
    'https://cdn.jsdelivr.net/npm/[email protected]/dist/jquery.min.js',
    'local-file.js',
    "document.body.style.backgroundColor = 'red'"
  ]
})

Prefer to use modules whenever possible.

scroll

type: string

Scroll to the DOM element matching the given CSS selector.

styles

type: string | string[]

Injects <style> into the browser page.

It can accept:

Absolute URLs (e.g., 'https://cdn.jsdelivr.net/npm/[email protected]/dist/dark.css').
Local file (e.g., 'local-file.css').
Inline code (e.g., "body { background: red; }").

const buffer = await browserless.screenshot(url.toString(), {
  styles: [
    'https://cdn.jsdelivr.net/npm/[email protected]/dist/dark.css',
    'local-file.css',
    'body { background: red; }'
  ]
})

timezone

type: string

It changes the timezone of the page.

url

type: string

The target URL.

viewport

It will set up a custom viewport, using the page.setViewport method.

waitForSelector

type: string

Wait for the given selector to appear using page.waitForSelector.

waitForTimeout

type: number

Wait for the given amount of time, in milliseconds, using page.waitForTimeout.

waitUntil

When to consider navigation succeeded.

If you provide an array of event strings, navigation is considered to be successful after all events have been fired.

Events can be either:

'auto': A combination of 'load' and 'networkidle2' in a smart way to wait the minimum time necessary.
'load': Consider navigation to be finished when the load event is fired.
'domcontentloaded': Consider navigation to be finished when the DOMContentLoaded event is fired.
'networkidle0': Consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
'networkidle2': Consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.

.context

It returns the BrowserContext associated with your instance.

const browserContext = await browserless.context()

console.log({ isIncognito: browserContext.isIncognito() })
// => { isIncognito: true }

.page

It returns a standalone Page associated with the current browser context.

const page = await browserless.page()
await page.content()
// => '<html><head></head><body></body></html>'

Command Line Interface

You can perform any browserless action from your terminal.

You just need to install @browserless/cli globally:

npm install @browserless/cli --global

Alternatively, you can run it on demand using npx:

npx @browserless/cli --help

That's the preferred way to interact with the CLI in CI/CD scenarios.

Lighthouse

browserless has a Lighthouse integration that connects to a Puppeteer instance in a simple way.

const lighthouse = require('@browserless/lighthouse')
const { writeFile } = require('fs/promises')

const report = await lighthouse('https://example.com')

await writeFile('report.json', JSON.stringify(report, null, 2))

The report will be generated for the given URL, extending the lighthouse:default settings, which are the same settings used by the Google Chrome Audits reports in Developer Tools.

options

The second argument can contain Lighthouse-specific settings. The following options are used by default:

{
  logLevel: 'error',
  output: 'json',
  device: 'desktop',
  onlyCategories: ['performance', 'best-practices', 'accessibility', 'seo']
}

See Lighthouse configuration to know all the options and values supported.

Additionally, you can set up:

getBrowserless

type: function
default: require('browserless')

The browserless instance to use for getting the browser.

logLevel

type: string
default: 'error'
values: 'silent' | 'error' | 'info' | 'verbose'

The level of logging to enable.

output

type: string | string[]
default: 'json'
values: 'json' | 'csv' | 'html'

The type(s) of report output to be produced.

device

type: string
default: 'desktop'
values: 'desktop' | 'mobile' | 'none'

How emulation (useragent, device screen metrics, touch) should be applied. 'none' indicates Lighthouse should leave the host browser as-is.

onlyCategories

Includes only the specified categories in the final report.

Packages

browserless is internally divided into multiple packages, so you only need to use the minimum amount of code necessary for your use case.

Package
`browserless`
`@browserless/benchmark`
`@browserless/cli`
`@browserless/devices`
`@browserless/examples`
`@browserless/errors`
`@browserless/function`
`@browserless/goto`
`@browserless/pdf`
`@browserless/screenshot`
`@browserless/lighthouse`

FAQ

Q: Why use browserless over puppeteer?

browserless does not replace puppeteer; it complements it. It's just a syntactic sugar layer over the official Headless Chrome, oriented toward production scenarios.

Q: Why do you block ads scripts by default?

Headless navigation is expensive compared with just fetching the content from a website.

In order to speed up the process, we block ad scripts by default because they are bloated.

Q: My output is different from the expected

Probably browserless was too smart and it blocked a request that you need.

You can activate debug mode using the DEBUG=browserless environment variable to see what is happening behind the scenes.

Consider opening an issue with the debug trace.

Q: I want to use browserless with my AWS Lambda like project

Yes, check chrome-aws-lambda to set up AWS Lambda with a compatible binary.

License

browserless © Microlink, Released under the MIT License.
Authored and maintained by Microlink with help from contributors.

The logo has been designed by xinh studio.

microlink.io · GitHub @MicrolinkHQ · Twitter @microlinkhq

Comments

Evasion techniques
Libraries

https://github.com/paulirish/headless-cat-n-mouse

https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth

URLs to test

[x] https://www.stuff.co.nz/national/politics/108051628/poll-shows-labour-overtake-national

[x] https://www.zomato.com/bangalore/bold-marathahalli

[x] https://www.reddit.com/r/news/comments/8vebjp/lebron_james_takes_154_million_4year_deal_with/?st=JJ41AUKB&sh=be6868cb

[x] https://www.reddit.com/r/funny/comments/8ltsck/this_sea_dog_is_the_ultimate_prankster/

[ ] https://www.zillow.com/homedetails/4611-Cardinal-Ridge-Way-Flowery-Branch-GA-30542/83350172_zpid

[ ] https://www.kmart.com.au/product/portable-charger-15000mah/2168305

[ ] https://www.bloomberg.com/news/articles/2019-01-15/here-are-five-volatility-charts-keeping-wall-street-up-at-night?srnd=premium-canada

[ ] https://www.scmp.com/week-asia/opinion/article/2112486/inconvenient-truths-murder-journalism-india

[ ] https://startse.com/noticia/netflix-do-esporte-planeja-chegada-ao-brasil-ma-noticia-para-globo

[ ] https://www.coches.net/segunda-mano/

[ ] https://www.ouest-france.fr/

[ ] https://www.washingtonpost.com/nation/2020/06/25/coronavirus-live-updates-us/

Related

https://timvanscherpenzeel.github.io/detect-gpu/
opened by Kikobeats 11
Lighthouse: images for desktop reports returning mobile interface
Bug Report

Current Behavior

When I use the following MQL API call, result.data.insights.lighthouse.audits['final-screenshot'] is returned as a base64 encoded image. However, this image is of the mobile view and not of the desktop view of the website.

const url = 'https://anywebsitehere.com';
const payload = {
  meta: false,
  insights: {
    lighthouse: {
      device: 'desktop',
      onlyCategories: ['performance', 'best-practices', 'accessibility', 'seo'],
    },
    technologies: false,
  },
};
const result = await mql(url, payload);

Expected behavior/code

I'd expect the above to return the desktop variation of the image and not a mobile version.

Additional context/Screenshots

Can be provided upon request.
enhancement
opened by dustinsgoodman 6
chore: update adblocker and use pre-built engine from CDN
Closes https://github.com/microlinkhq/browserless/pull/133

Fix fetch abstraction on top of 'got'

Update 'got' to latest to get 'text', 'json', and 'buffer' helpers

Make use of prebuilt engine from CDN whenever possible
opened by remusao 6
build: only use compatible rules

After using https://github.com/StevenBlack/hosts on production scenarios, I noted it makes the execution slow.

I'm not sure if this is happening based on the number of rules or because I'm doing an adaptation from the original rule definition.

opened by Kikobeats 5
update adblocker + make use of puppeteer helpers
Hi there,

We've published a new release of @cliqz/adblocker. Among some small bug fixes, it contains a new blocker helper to ease the use of the library in the context of Puppeteer projects. I took the liberty of updating browserless with this change; let me know if this is acceptable or if you'd like me to make extra modifications. Alternatively, feel free to cherry-pick the commit; as I was not sure what the best way to make the PR was.

Here are a few improvements this PR would bring:

Enabling adblocking for Puppeteer takes only one line now and there is no need to deal with requests explicitly (the use of tldts for parsing has also been internalized): await engine.enableBlockingInPage(page);

More ads are now blocked (you get the same full-blown experience as with a WebExtension in the browser). This is the result of a few extra capabilities added in the context of Puppeteer:

requests can be redirected to data URLs (e.g.: google analytics would not be blocked, but instead a fake response would be injected so that the site does not break, but there is no tracking and no performance cost since the real request did not happen)

full cosmetic filtering is applied; which means that more ads will be blocked (in fact some of them might now be hidden whereas it was not possible before with only network filtering). Check Google ads for example (well in fact you should not see them anymore...). This feature should also reduce the likelihood of breakage as well as defuse "paywalls" (where a site asks you to disable the adblocker to proceed = no more!).

I changed postinstall.js from goto package so that the full adblocker is created and dumped on disk in its binary form at install time. Initializing PuppeteerBlocker from this binary blob is extremely fast (i.e.: at least 3 orders of magnitude faster than parsing the lists from scratch, we're talking about less than 10ms on cold start). Since postinstall.js is done once, but goto(...) can be called lots of times, this means that warm-up time will be drastically reduced when using goto.

Removed this particular list 'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/recipes/recipes_en.txt' which is not supported. There should be no visible difference in terms of blocking.

Caveat: the debug logs for adblocker have been removed as there is currently no way to know which requests were blocked (everything happens internally in PuppeteerBlocker). If that is a desired capability, we could easily add a hook/callback to get some statistics about blocked requests.

Best, Rémi
opened by remusao 5

feat: page numbers

Why

Puppeteer (like every HTML to PDF tool out there) still lacks support for generated content. This usually comes as a surprise to most people who require some "advanced" features like creating a table of contents (TOC) or being able to refer to content by its page number.

An alternative to puppeteer is wkhtmltopdf, which does support TOC generation. However, it fails to supply an HTML template API, so it's not customisable. It also does not support page number references within a PDF document. On top of that, the browser it uses is dated, which makes it difficult to use modern charting libraries and other advanced CSS and JavaScript features.

I needed those features for some client work, so ended up implementing it and port to this library.

How

It's simple to use:

const browserless = require('browserless')

;(async () => {
  const url = 'https://example.com'
  const buffer = await browserless.pdf(url, { page_numbers: true })
  console.log(`PDF generated!`)
})()

On the HTML part use elements <span class="pageNumber"> or <span class="pageNumber" rel="someElementId">. The resulting PDF will have both elements replaced by current page number or page number that corresponds to the referred someElementId.

Full HTML Example

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
        <!-- <link rel="shortcut icon" href="" type="image/x-icon"> -->
        <title>Example</title>

        <style>
         section {page-break-after: always;}
        </style>        
    </head>
    <body>
        <section id="toc">
            <h1>Table of Contents:</h1>
            <ul>
                <li>Section 1 -- Page <span class="pageNumber" rel="section1">X</span></li>
                <li>Section 2 -- Page <a href="#section2" class="pageNumber" rel="section2">Y</a> with navigation</li>
                <li>Section 3 -- Page <span class="pageNumber" rel="section3"></span></li>
                <li>Section 4 -- Page <span class="pageNumber" rel="section4">Z</span></li>                
            </ul>
        </section>

        <section id="section1">
            <h1>Section 1</h1>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
            <p>This should be page: <span class="pageNumber"></span></p>
        </section>

        <section id="section2">
            <h1>Section 2</h1>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
            <p>This should be page: <span class="pageNumber"></span></p>
        </section>

        <section id="section3">
            <h1>Section 3</h1>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
            <p>This should be page: <span class="pageNumber"></span></p>
            <p>And <b>Section 4</b> should be on page: <span class="pageNumber" rel="section4">PLACEHOLDER</span></p>
        </section>        

        <section id="section4">
            <h1>Section 4</h1>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
            <p>This should be page: <span class="pageNumber"></span></p>
            <p>And <b>Section 2</b> should be on page: <span class="pageNumber" rel="section2"></span></p>            
        </section>
        
    </body>
</html>

Implementation Details

I added some code to browserless in order to make extension features easier to implement. It could be more polished or allow some kind of dependency injection, making features pluggable.

The page numbers implementation uses pdf-extract, which has some dependencies that must be previously installed in the OS. OCR support is not required.

This implementation requires an extra PDF to be generated, so it will make the whole PDF processing and generation slower when using page_numbers option. This might have an impact for more processing intense production environments.

Thanks!

opened by josemf 5

Cannot find module 'puppeteer'

Saw this linked on echo.js, decided to give a couple of the examples a shot.

Copy/pasted the screenshot example from the docs into a js file, added a package.json, installed and saved browserless, then ran node on the js file. I'm assuming that would be a standard use-case for the lib.

Here's the error. I will spend some time chasing it down when I get home from work later.

    throw err;
    ^

Error: Cannot find module 'puppeteer'
    at Function.Module._resolveFilename (module.js:557:15)
    at Function.Module._load (module.js:484:25)
    at Module.require (module.js:606:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/home/mike/dev/test/node_modules/browserless/index.js:6:19)
    at Module._compile (module.js:662:30)
    at Object.Module._extensions..js (module.js:673:10)
    at Module.load (module.js:575:32)
    at tryModuleLoad (module.js:515:12)
    at Function.Module._load (module.js:507:3)
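
The trace shows browserless failing to require puppeteer, which the installation notes say must be installed alongside it. As a small sketch, a script could check for the dependency up front and print a clearer hint; the `isInstalled` helper below is illustrative and not part of browserless.

```javascript
// Report whether a module can be resolved from the current project.
function isInstalled (name) {
  try {
    require.resolve(name)
    return true
  } catch (error) {
    // Only swallow the "not found" case; rethrow anything unexpected.
    if (error.code === 'MODULE_NOT_FOUND') return false
    throw error
  }
}

if (!isInstalled('puppeteer')) {
  console.error('puppeteer is missing; run: npm install puppeteer --save')
}
```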

question

opened by maximumdata 5

build(deps-dev): bump p-all from 3.0.0 to 4.0.0
Bumps p-all from 3.0.0 to 4.0.0.

Release notes

Sourced from p-all's releases.

v4.0.0

Breaking

Require Node.js 12.20 d2abd1e

This package is now pure ESM. Please read this.

Improvements

Improve TypeScript types by using variadic tuple instead of overloads (#9) ea9c277

This means the strongly-typed return type is no longer limited to 10 elements.

https://github.com/sindresorhus/p-all/compare/v3.0.0...v4.0.0
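
Since 4.0.0 is pure ESM, CommonJS code that loaded the package with `require` would typically need dynamic `import()` instead, at least on Node.js versions that cannot `require` ES modules. A minimal sketch, with `loadDefault` as an illustrative helper name:

```javascript
// Load the default export of an ESM-only package from CommonJS code.
// Dynamic import() is available in CommonJS modules, unlike static `import`.
async function loadDefault (specifier) {
  const mod = await import(specifier)
  return mod.default
}

// 'path' is a built-in stand-in so the sketch runs without extra installs;
// in the scenario above the specifier would be 'p-all'.
loadDefault('path').then(path => {
  console.log(typeof path.join)
})
```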

Commits

df11988 4.0.0

d2abd1e Require Node.js 12 and move to ESM

ea9c277 Improve TypeScript types by using variadic tuple instead of overloads (#9)

d464beb Move to GitHub Actions (#8)

See full diff in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies javascript
opened by dependabot[bot] 4
build(deps): bump meow from 9.0.0 to 10.1.3
Bumps meow from 9.0.0 to 10.1.3.

Release notes

Sourced from meow's releases.

v10.1.3

Fix return type for .showHelp() (#213) db55316

https://github.com/sindresorhus/meow/compare/v10.1.2...v10.1.3

v10.1.2

Fix engines field (#203) 1368ae0

https://github.com/sindresorhus/meow/compare/v10.1.1...v10.1.2

v10.1.1

Fix failure with isMultiple when isRequired function returns false (#194) e1f0e24

https://github.com/sindresorhus/meow/compare/v10.1.0...v10.1.1

v10.1.0

Upgrade dependencies 829aab0

Allow default property of Flag types to accept arrays (#190) ae73466

https://github.com/sindresorhus/meow/compare/v10.0.1...v10.1.0

v10.0.1

Upgrade dependencies (#185) a0daf20

https://github.com/sindresorhus/meow/compare/v10.0.0...v10.0.1

v10.0.0

Breaking

Require Node.js 12 (#181) 05320ac

This package is now pure ESM. Please read this.

You must now pass in the importMeta option so meow can find your package.json:

const cli = meow(…, {
+  importMeta: import.meta
});

Previously, meow used some tricks to infer the location of your package.json, but this no longer works in ESM.

https://github.com/sindresorhus/meow/compare/v9.0.0...v10.0.0

Commits

85dc1e9 10.1.3

db55316 Fix return type for .showHelp() (#213)

bc489b4 Bump dev dependencies (#207)

aede62e Fix readme typo

6d60ae3 10.1.2

1368ae0 Fix engines field (#203)

075f96f 10.1.1

79c7039 Fix CI

e1f0e24 Fix failure with isMultiple when isRequired function returns false (#194)

5d3bb31 10.1.0

Additional commits viewable in compare view


dependencies javascript
opened by dependabot[bot] 4
Add types
Prerequisites

[x] I'm using the latest version.

[x] My node version is the same as declared in package.json.

Subject of the issue

When you import browserless in a TypeScript-based project, an error reports that it has no types. I also don't see them in the repo or in @types.

Steps to reproduce

Just create a new TypeScript project and import browserless.

import createBrowserless from 'browserless';


Expected behaviour

Browserless should have types so that it can be used easily in TypeScript and help the user with IntelliSense.

Actual behaviour

It ships no built-in types, and there is no information about installing them separately.
enhancement
opened by spa5k 4
build(deps): bump @cliqz/adblocker-puppeteer from 1.4.24 to 1.5.0
Bumps @cliqz/adblocker-puppeteer from 1.4.24 to 1.5.0.

Release notes

Sourced from @cliqz/adblocker-puppeteer's releases.

v1.5.0

:nail_care: Polish

adblocker

#414 Implement retry mechanism while fetching resources (@remusao)

adblocker-webextension

#413 webextension: handler for runtime messages now returns a promise (@remusao)

:house: Internal

adblocker-benchmarks, adblocker-circumvention, adblocker-content, adblocker-electron-example, adblocker-electron, adblocker-puppeteer-example, adblocker-puppeteer, adblocker-webextension-cosmetics, adblocker-webextension-example, adblocker-webextension, adblocker

#415 Clean-up tooling (@remusao)

Committers: 1

Rémi (@remusao)

Changelog

Sourced from @cliqz/adblocker-puppeteer's changelog.

v1.5.0 (2020-01-16)

:nail_care: Polish

adblocker

#414 Implement retry mechanism while fetching resources (@remusao)

adblocker-webextension

#413 webextension: handler for runtime messages now returns a promise (@remusao)

:house: Internal

adblocker-benchmarks, adblocker-circumvention, adblocker-content, adblocker-electron-example, adblocker-electron, adblocker-puppeteer-example, adblocker-puppeteer, adblocker-webextension-cosmetics, adblocker-webextension-example, adblocker-webextension, adblocker

#415 Clean-up tooling (@remusao)

Committers: 1

Rémi (@remusao)

v1.4.20 (2020-01-15)

:house: Internal

#412 Migrate local GitHub actions to TypeScript (@remusao)

Committers: 1

Rémi (@remusao)

v1.4.19 (2020-01-15)

:house: Internal

#410 Add dependabot config into repository (@remusao)

Committers: 1

Rémi (@remusao)

v1.4.12 (2020-01-15)

:house: Internal

#409 Add action to release on NPM (@remusao)

Committers: 1

Rémi (@remusao)

v1.4.2 (2020-01-15)

:memo: Documentation

#404 docs: add support for PR labels (@remusao)

:house: Internal

#407 Add GitHub actions for releasing on GitHub (@remusao)

... (truncated)

Commits

d7c16b3 chore(release): publish v1.5.0

af3bf3c Bump engine internal version

8d13b4d Clean-up tooling (#415)

4ea3f9e Implement retry mechanism while fetching resources (#414)

8bb8dc4 webextension: handler for runtime messages now returns a promise (#413)

74de503 Update CHANGELOG.md [skip ci]

See full diff in compare view


dependencies
opened by dependabot-preview[bot] 4
Add `@browserless/screencast`
It could be really cool to do something similar to page.screenshot, but oriented for video content.

Since this feature is not fully implemented as part of puppeteer API, we can ship a standalone package under the browserless umbrella.

The main difference between a screenshot and a screencast is how to specify the actions to be performed as part of the video content (like scroll, click, etc.).

The browserless way could be something like:

const { createScreencast } = require('@browserless/screencast')

/* let's assume you have `page` as precondition */
const screencast = await createScreencast(page, { path: '/my/video/path.mp4' })

/* actions that will be recorded */
await screencast.goto('http://example.org')
await screencast.scrollTo({ selector: '#footer', duration: 1000 }) // fancy smooth animation
await screencast.click('a')
await screencast.waitForTimeout(3000)

/* serialize actions into video */
await screencast.stop()

Another approach could be specifying the actions as configuration:

const screencast = require('@browserless/screencast')

/* let's assume you have `page` as precondition */
await screencast(page, {
  path: '/my/video/path.mp4',
  actions: [
    ['goto', 'http://example.org'],
    ['scrollTo', { selector: '#footer', duration: 1000 }],
    ['click', 'a'],
    ['waitForTimeout', '3000']
  ]
})
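
The configuration-style proposal can be reduced to the method-call style with a small dispatcher. The sketch below is only an illustration of that idea: `runActions` is a hypothetical name, and a stand-in object records calls instead of driving a real browser, since `@browserless/screencast` does not exist yet.

```javascript
// Hypothetical runner: replays [method, ...args] tuples against any object
// exposing matching async methods (e.g. the proposed screencast instance).
async function runActions (target, actions) {
  for (const [method, ...args] of actions) {
    if (typeof target[method] !== 'function') {
      throw new TypeError(`Unknown action: ${method}`)
    }
    await target[method](...args)
  }
}

// Stand-in object that records calls instead of controlling a page.
const calls = []
const fakeScreencast = {
  goto: async url => { calls.push(['goto', url]) },
  click: async selector => { calls.push(['click', selector]) }
}

runActions(fakeScreencast, [
  ['goto', 'http://example.org'],
  ['click', 'a']
]).then(() => console.log(calls))
```

Keeping actions as plain tuples means they can be stored as JSON, while unknown method names fail fast instead of being silently skipped.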

Related

https://github.com/puppeteer/puppeteer/issues/478

Inspiration

https://github.com/prasanaworld/puppeteer-screen-recorder ⭐️⭐️ – pretty close to the goal.

https://github.com/qawolf/playwright-video ⭐ – A solution created back when Playwright didn't have an official API.

https://playwright.dev/docs/videos ⭐ – The Playwright official API.

https://github.com/browserless/chrome/blob/master/src/apis/screencast.ts ⭐ – A solution using canvas for recording.

https://github.com/Flam3rboy/puppeteer-stream ⭐ – An implementation using MediaRecorder API.

https://github.com/clipisode/puppeteer-recorder ⭐️ – frame-to-frame solution using ffmpeg.

https://github.com/muralikg/puppetcam

https://github.com/tungs/timesnap

https://github.com/anishkny/webgif

https://gist.github.com/muralikg/23cfed0b099b3df812bb2b27ba1be6a4

https://github.com/transitive-bullshit/puppeteer-lottie

https://github.com/tungs/timecut

https://developer.chrome.com/docs/devtools/recorder/#open

enhancement
opened by Kikobeats 0
[screenshot] mobile overlay
Similar to

https://github.com/sindresorhus/capture-website/pull/27/files?short_path=f1d7f01#diff-f1d7f01715e29ea2a7cbaf4f2f8117cc

Related

https://github.com/microlinkhq/browserless/tree/master/packages/screenshot/media

https://browserframe.com/
opened by Kikobeats 0

Releases(v9.8.0)

v9.8.0(Dec 9, 2022)
What's Changed

feat(goto): add onPageRequest by @Kikobeats in https://github.com/microlinkhq/browserless/pull/430

Full Changelog: https://github.com/microlinkhq/browserless/compare/v9.7.3...v9.8.0
v9.7.3(Dec 1, 2022)

Full Changelog: https://github.com/microlinkhq/browserless/compare/v9.7.2...v9.7.3
v9.7.1(Dec 1, 2022)
9.7.1 (2022-12-01)

Bug Fixes

don't reconnect under connect mode (#429) (e4f07a9)

v9.7.0(Nov 1, 2022)

9.7.0 (2022-11-01)

Note: Version bump only for package browserless
v9.7.0-beta.0(Oct 26, 2022)
What's Changed

refactor(function): use thread instead of process by @Kikobeats in https://github.com/microlinkhq/browserless/pull/425

Full Changelog: https://github.com/microlinkhq/browserless/compare/v9.6.12...v9.7.0-beta.0
v9.6.12(Oct 18, 2022)
9.6.12 (2022-10-18)

Bug Fixes

browserless: associate a context id (#424) (f8137f2)

v9.6.11(Oct 6, 2022)
9.6.11 (2022-10-06)

Bug Fixes

goto: pass waitUntil (#420) (7b0d2c8)

v9.6.10(Sep 18, 2022)

9.6.10 (2022-09-18)

Note: Version bump only for package browserless
v9.6.9(Sep 1, 2022)

9.6.9 (2022-09-01)

Note: Version bump only for package browserless
v9.6.8(Aug 28, 2022)

9.6.8 (2022-08-28)

Note: Version bump only for package browserless
v9.6.7(Aug 28, 2022)

9.6.7 (2022-08-28)

Note: Version bump only for package browserless
v9.6.6(Aug 26, 2022)

9.6.6 (2022-08-26)

Note: Version bump only for package browserless
v9.6.5(Aug 24, 2022)

9.6.5 (2022-08-24)

Note: Version bump only for package browserless
v9.6.4(Aug 13, 2022)

9.6.4 (2022-08-13)

Note: Version bump only for package browserless
v9.6.3(Aug 12, 2022)
9.6.3 (2022-08-12)

Bug Fixes

browserless: pass context options (4082929)

v9.6.2(Aug 12, 2022)
9.6.2 (2022-08-12)

Bug Fixes

goto: bypass CSP when is necessary (#402) (d44cdfe)

goto: setup module type properly (#401) (b677379)

v9.6.1(Aug 8, 2022)

9.6.1 (2022-08-08)

Note: Version bump only for package browserless
v9.6.0(Aug 2, 2022)

9.6.0 (2022-08-02)

Note: Version bump only for package browserless
v9.5.4(Jul 23, 2022)
9.5.4 (2022-07-23)

Bug Fixes

goto: truncate excessive long URLs (#390) (491d5eb)

v9.5.4-alpha.0(Jul 23, 2022)
9.5.4-alpha.0 (2022-07-23)

Bug Fixes

screenshot: bring page to front before screenshot (a615192)

v9.5.3(Jul 13, 2022)
9.5.3 (2022-07-13)

Bug Fixes

function: trim code before remove semicolon (c9b78d7)

v9.5.2(Jul 10, 2022)

9.5.2 (2022-07-10)

Note: Version bump only for package browserless
v9.5.1(Jun 30, 2022)

9.5.1 (2022-06-30)

Note: Version bump only for package browserless
v9.5.0(Jun 30, 2022)
9.5.0 (2022-06-30)

Features

delete hide and remove options (ac85313)

v9.4.1(Jun 29, 2022)

9.4.1 (2022-06-29)

Note: Version bump only for package browserless
v9.4.0(Jun 9, 2022)
9.4.0 (2022-06-09)

Features

driver: rename getPid → pid (#381) (dc96ca7)

v9.3.21(Jun 9, 2022)

9.3.21 (2022-06-09)

Note: Version bump only for package browserless
v9.3.20(May 17, 2022)

9.3.20 (2022-05-17)

Note: Version bump only for package browserless
v9.3.19(May 16, 2022)

9.3.19 (2022-05-16)

Note: Version bump only for package browserless
v9.3.18(May 12, 2022)
9.3.18 (2022-05-12)

Bug Fixes

driver: remove unnecessary flag (0e5e3b1)
