Discovers and parses news, blog and podcast posts from any website

Overview

post-feed-reader

A library to fetch news, blog or podcast posts from any site. It works by auto-discovering a post source, which can be an RSS/Atom/JSON feed or the WordPress REST API, and then fetching and parsing the list of posts.

It's meant for Node.js, but because it is written as isomorphic JavaScript, it also works in browsers if the target website allows cross-origin requests.

It was originally built for apps that need to list posts in their own UI but don't manage the blog themselves, and therefore need automatic fallbacks when the blog changes.

Features

Getting Started

Install it with NPM or Yarn:

npm install post-feed-reader # or yarn add post-feed-reader

You first need to discover the post source, which will return an object containing a URL to the RSS/Atom/JSON Feed or the Wordpress REST API.

Then you can pass the discovered source to getPostList, which will fetch and parse it.

import { discoverPostSource, getPostList } from 'post-feed-reader';

// Looks for metadata pointing to the Wordpress REST API or Atom/RSS Feeds
const source = await discoverPostSource('https://www.nytimes.com');

// Retrieves the posts from the given source
const list = await getPostList(source);

// Logs all post titles
console.log(list.posts.map(post => post.title));

Simple enough, eh?

Output

See an example of the post list based on the Mozilla blog.

Options

const source = await discoverPostSource('https://techcrunch.com', {
  // Custom axios instance
  axios: axios.create(...),

  // Whether it will prioritize feeds over the wordpress api
  preferFeeds: false,

  // Custom data source filtering
  canUseSource: (source: DiscoveredSource) => true,

  // Whether it will try to guess wordpress api and feed urls if auto-discovery doesn't work
  tryToGuessPaths: false,
  
  // The paths that it will try to guess for both the Wordpress API and the RSS/Atom/JSON feed
  wpApiPaths: ['./wp-json', '?rest_route=/'],
  feedPaths: ['./feed', './atom', './rss', './feed.json', './feed.xml', '?feed=atom'],
});

const posts = await getPostList(source, {
  // Custom axios instance
  axios: axios.create(...),

  // Whether missing plain text contents will be filled automatically from html contents
  fillTextContents: false,

  // Wordpress REST API only options
  wordpress: {
    // Whether it will include author, taxonomy and media data from the wordpress api
    includeEmbedded: true,

    // Whether it will fetch the blog info, such as the title, description, url and images
    // Setting this to true adds one extra http request
    fetchBlogInfo: false,

    // The amount of items to return
    limit: 10,

    // The search string filter
    search: '',

    // The author id filter
    authors: [...],

    // The category id filter
    categories: [...],

    // The tag id filter
    tags: [...],

    // Any additional querystring parameter for the wordpress api you may want to include
    additionalParams: { ... },
  },
});
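
The guess paths above are relative, so where they end up depends on the URL you pass in. The sketch below uses the standard URL constructor to show how such paths resolve. It assumes the library resolves them against the site URL in this standard way, which this README does not state explicitly, and resolveGuessPaths is a hypothetical helper, not part of post-feed-reader.

```typescript
// Hypothetical helper (not a library function): resolves relative guess paths
// against a site URL with the standard URL constructor.
function resolveGuessPaths(siteUrl: string, paths: string[]): string[] {
  return paths.map((path) => new URL(path, siteUrl).toString());
}

const wpApiPaths = ['./wp-json', '?rest_route=/'];

// With a trailing slash, './wp-json' stays inside the /blog/ directory:
// https://example.com/blog/wp-json and https://example.com/blog/?rest_route=/
console.log(resolveGuessPaths('https://example.com/blog/', wpApiPaths));

// Without it, './wp-json' replaces the last path segment:
// https://example.com/wp-json and https://example.com/blog?rest_route=/
console.log(resolveGuessPaths('https://example.com/blog', wpApiPaths));
```

If guessing matters for a blog hosted in a subdirectory, it may be worth passing the URL with its trailing slash.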

Skip the auto-discovery

If you already have the URL of an Atom/RSS/JSON Feed or of the WordPress REST API on hand, you can fetch the posts directly:

import { getFeedPostList, getWordpressPostList } from 'post-feed-reader';

// RSS, Atom or JSON Feed
const feedPosts = await getFeedPostList('https://news.google.com/atom');

// Wordpress API
const wpApiPosts = await getWordpressPostList('https://blog.mozilla.org/en/wp-json/');

Pagination

The post list may have pagination metadata attached. You can use it to navigate through pages. Here's an example:

const result = await getPostList(...);

if (result.pagination.next) {
  // There is a next page!
  
  const nextResult = await getPostList(result.pagination.next);
  
  // ...
}

// You can also check for result.pagination.previous, result.pagination.first and result.pagination.last
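
To walk through several pages in a row, the check above can be turned into a loop. This is a minimal sketch, not a library feature: collectPosts and the Page shape are hypothetical names introduced here, and the fetcher is passed in as a parameter so the loop does not depend on anything beyond a posts array and a pagination.next value.

```typescript
// Minimal shapes assumed for this sketch; the library's real types are richer.
interface Page<Source, Post> {
  posts: Post[];
  pagination: { next?: Source };
}

// Follows `pagination.next` until it runs out or `maxPages` pages were fetched.
// `fetchPage` is injected so the loop works with any fetcher.
async function collectPosts<Source, Post>(
  first: Source,
  fetchPage: (source: Source) => Promise<Page<Source, Post>>,
  maxPages = 5,
): Promise<Post[]> {
  const posts: Post[] = [];
  let source: Source | undefined = first;

  for (let page = 0; source !== undefined && page < maxPages; page++) {
    const result: Page<Source, Post> = await fetchPage(source);
    posts.push(...result.posts);
    source = result.pagination.next;
  }

  return posts;
}
```

With the library, the fetcher would presumably be getPostList itself, for example collectPosts(source, (s) => getPostList(s)), assuming its result matches the shape above. The maxPages cap keeps a long archive from triggering an unbounded number of requests.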

Why support other sources, isn't RSS enough?

RSS is the most widely used feed format on the web, but it has two problems. It lacks information that may be essential to your application, and its specification is a mess: many properties are loosely defined and left to the implementation, so the way information is formatted differs from feed to feed. For instance, the description can be the full post as HTML, just an excerpt, plain text, or even just an HTML link to the post page.

Atom's specification is much stricter and more robust, which makes its data more trustworthy. It's definitely the way to go when it comes to feeds. But it still lacks some properties that can only be fetched through the WordPress REST API.

Since WordPress is by far the most used CMS, supporting its API is a great alternative. The WordPress REST API offers the following beyond what RSS and Atom feeds provide:

  • Filtering by category, tag and/or author
  • Searching
  • Pagination
  • Featured media
  • Author profile

The JSON Feed format is also just as good as the Atom format, but at the moment very few websites produce it.

How does the auto-discovery work?

  1. Fetches the site's main page
  2. Looks for Wordpress Link headers
  3. Looks for RSS, Atom and JSON Feed <link> metatags
  4. If tryToGuessPaths is set to true, it will try the configured wpApiPaths and feedPaths to find a feed or the WP API.
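
Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the library's actual code: the function names and the Candidate shape are hypothetical, and the HTML is scanned with a regular expression only to keep the example short. The Link header relation https://api.w.org/ is the one WordPress uses to advertise its REST API.

```typescript
// Candidate source found during discovery (hypothetical shape for this sketch)
interface Candidate {
  kind: 'wordpress' | 'feed';
  url: string;
}

// Feed MIME types commonly advertised through <link rel="alternate">
const FEED_TYPES = [
  'application/rss+xml',
  'application/atom+xml',
  'application/feed+json',
  'application/json',
];

// Step 2: WordPress advertises its REST API in a header such as
//   Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
function findWordpressApi(linkHeader: string): string | undefined {
  for (const part of linkHeader.split(',')) {
    const match = part.match(/<([^>]+)>\s*;\s*rel="?https:\/\/api\.w\.org\/"?/i);
    if (match) return match[1];
  }
  return undefined;
}

// Step 3: collect <link rel="alternate" type="..." href="..."> tags.
// A regex is enough for a sketch; real code should use an HTML parser.
function findFeedLinks(html: string, baseUrl: string): string[] {
  const urls: string[] = [];
  for (const tag of html.match(/<link\b[^>]*>/gi) ?? []) {
    const attr = (name: string): string | undefined =>
      tag.match(new RegExp(`\\b${name}\\s*=\\s*["']([^"']*)["']`, 'i'))?.[1];
    const rel = attr('rel');
    const type = attr('type');
    const href = attr('href');
    if (rel?.toLowerCase() === 'alternate' && type && href && FEED_TYPES.includes(type.toLowerCase())) {
      urls.push(new URL(href, baseUrl).toString());
    }
  }
  return urls;
}

// Orders the candidates the way a preferFeeds-style option would
function discover(html: string, linkHeader: string, baseUrl: string, preferFeeds = false): Candidate[] {
  const wp = findWordpressApi(linkHeader);
  const wpCandidates: Candidate[] = wp ? [{ kind: 'wordpress', url: wp }] : [];
  const feedCandidates = findFeedLinks(html, baseUrl).map((url): Candidate => ({ kind: 'feed', url }));
  return preferFeeds ? [...feedCandidates, ...wpCandidates] : [...wpCandidates, ...feedCandidates];
}
```

Resolving each href against the page URL matters because many sites publish relative feed links such as /feed/.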

Most properties are optional, what am I guaranteed to have?

Nothing.

Yeah, there's no property that is required in all specs, so we can't guarantee that any of them will be present.

But! The most basic properties are very likely to be present, such as guid, title and link.

For all the other properties, it's highly recommended to implement your own fallbacks. For instance, you can show a substring of the content when the summary isn't available.
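
A fallback along those lines might look like the sketch below. The PostContent and PostLike shapes are assumptions made for this example (content carrying optional text and html fields), and summaryOf, plainText and stripTags are hypothetical helpers rather than library functions.

```typescript
// Assumed shapes for this sketch: content may carry plain text, HTML, or neither
interface PostContent {
  text?: string;
  html?: string;
}

interface PostLike {
  title?: string;
  summary?: PostContent;
  content?: PostContent;
}

// Crude tag stripper; use a proper HTML-to-text library for real content
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Returns the plain text of a content object, deriving it from HTML if needed
function plainText(content?: PostContent): string | undefined {
  if (content?.text) return content.text;
  if (content?.html) return stripTags(content.html);
  return undefined;
}

// Prefers the summary, then falls back to a truncated slice of the content
function summaryOf(post: PostLike, maxLength = 140): string {
  const summary = plainText(post.summary);
  if (summary) return summary;

  const content = plainText(post.content) ?? '';
  return content.length > maxLength ? content.slice(0, maxLength).trim() + '…' : content;
}
```

Returning an empty string when neither field exists lets the UI decide whether to hide the summary area or show a placeholder.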

The library will try its best to fetch the most data available.

Comments
  • Support for h-feed from Microformats2

    Implement a parser for h-feed, part of the microformats2 spec.

    The format could be automatically discovered by looking for an h-feed or hfeed class in the HTML body.

    Although this should be relatively easy to support, it is an old format that doesn't appear to be widely used. Usage statistics from both sites and readers, and how it compares to already supported formats (such as RSS), should be taken into consideration before implementing it.

    enhancement 
    opened by Guichaguri 1
  • Fix support for content on RSS feeds

    I found an inconsistency: when parsing an RSS feed, the content property of a post contains its <description> instead of the actual content.

    Here is my proposal:

    interface PostItem {
       ...
      /**
       * The item content
       *
       * `content` from Atom 1.0 and WP API
       * `content` from RSS 0.91, RSS 1.0 and RSS 2.0
       * `content_html` and `content_text` from JSON Feed 1.1
       */
      content?: PostContent;
    
      /**
       * The item summary or excerpt
       *
       * `summary` from Atom 1.0 and JSON Feed 1.1
       * `description` from RSS 0.91, RSS 1.0 and RSS 2.0
       * `excerpt` from WP API
       */
      summary?: PostContent;
       ...
    }
    
    bug 
    opened by zaosoula 1
  • Add support for pagination

    It seems that pagination is supported by all formats:

    • WordPress API supports pagination as a core feature
    • Atom and RSS feeds may support pagination as described by RFC5005
    • JSON feeds may support pagination as described by the original spec (next_url)

    It's definitely possible to fetch "the next page" where available. The only thing to think about is how the API should be designed. Here is my proposal:

    // Add a property to the PostList object, which tells information about the pagination
    interface PostList {
       ...
       pagination: PostListPagination;
       ...
    }
    
    interface PostListPagination {
       currentPage?: number; // Supported only by WP API
       totalPages?: number; // Supported only by WP API
       totalPosts?: number; // Supported only by WP API
       next?: DiscoveredSource; // Supported by all formats
       previous?: DiscoveredSource; // Supported by Atom/RSS and WP API
       first?: DiscoveredSource; // Supported by all formats
       last?: DiscoveredSource; // Supported by Atom/RSS and WP API
    }
    
    // Fetching the next one is as simple as fetching the first
    const nextList = await getPostList(list.pagination.next);
    
    enhancement 
    opened by Guichaguri 1
Releases(v1.2.1)
  • v1.2.1(Oct 28, 2022)

    • Added RSS-in-JSON support
    • Fixed "summary" and "content" from RSS feeds
    • Fixed 400 errors when passing an empty categories array to the WordPress API
  • v1.2.0(Mar 21, 2022)

    • Implemented an option for fetching the site info for wordpress sites
      • This makes the site title, description, url and images properties available.

    Here's an example:

    const discovered = await discoverPostSource('https://blog.mozilla.org/en/');
    
    const result = await getPostList(discovered, {
      wordpress: {
        fetchBlogInfo: true,
      },
    });
    
    console.log(result);
    

    Result JSON:

    {
      "title": "The Mozilla Blog",
      "url": "https://blog.mozilla.org/en/",
      "description": {
        "text": "News and Updates about Mozilla"
      },
      // ...
    }
    

    Note: The Mozilla blog has no icon or logo images configured to be returned by the API, which is why the images field is not present.

  • v1.1.0(Jan 29, 2022)

  • v1.0.1(Jan 10, 2022)

  • v1.0.0(Jan 10, 2022)
