Create a Newsletter Sourcing Data using MongoDB

priya raj
7 min read · May 25, 2021

Many news websites are available, such as ndtv.com. In this article, let us look at a useful and interesting technique: scraping ndtv.com, i.e. extracting its contents, and storing them in MongoDB. MongoDB is a NoSQL document-model database.

Using Mongoose, Node.js, and Cheerio, the NDTV news website is scraped and the data is loaded into a MongoDB database. This is a full-stack JavaScript app built using MongoDB, Mongoose, Node.js, Express.js, Handlebars.js, HTML, and CSS. It scrapes the [NDTV](https://ndtv.com/) homepage and stores article titles and links.

Module Installation: Install the required modules using the following commands.

npm install body-parser
npm install cheerio
npm install express
npm install express-handlebars
npm install mongoose
npm install request

Project Structure: It will look like this.

Implementation:

ReadMe :

# Scrape news headlines into MongoDB

This is a full-stack JavaScript app built using MongoDB, Mongoose, Node.js, Express.js, Handlebars.js, HTML, and CSS. It scrapes the [NDTV](https://ndtv.com/) homepage and stores article titles and links.

## Get new articles, save favorites, and write notes

To see updated news stories, click `Get new articles` at the top of the `Home` page. To view the full article, click the `View article on NDTV` link.

When you see an article in the list that you want to mark as a favorite, click `Save article`, and read it again by clicking `Saved articles` in the top menu bar. After you have an article in your saved list, you can comment on it by clicking `Add note`.

_Note: There is no login system, so all saved articles and comments are visible to and can be deleted by all users._

## Local set up for development purposes

These must be installed to run the app locally. They are prerequisites for any Node.js project:

- [Node.js](https://nodejs.org/en/)

- [npm](https://www.npmjs.com/get-npm)


## IDE

Visual Studio Code is a good choice of IDE for Node.js and Mongoose projects.

## Important instructions

1. You first need to make a local MongoDB database named `ndtvnews`.

2. Then, in a terminal window, navigate into the folder where you downloaded this app and type `npm install`. This mandatory step installs all the dependencies listed in `package.json`.

3. To start the app, type `node server.js` and open your browser to `localhost:3000`.

4. If port 3000 is already used by another process on your system, set the `PORT` environment variable or change the port number in `server.js`.
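
The port and database settings from the steps above can be resolved in one place. The following is a minimal sketch, assuming the same `PORT` and `MONGODB_URI` environment variables that `server.js` reads; `resolveConfig` is a hypothetical helper name and is not part of the project.

```javascript
// Hypothetical helper: resolve runtime settings from environment variables,
// falling back to the defaults used in this article.
function resolveConfig(env) {
  var port = Number.parseInt(env.PORT, 10);
  return {
    // fall back to 3000 when PORT is missing or not a number
    port: Number.isNaN(port) ? 3000 : port,
    // fall back to the local ndtvnews database
    mongoUri: env.MONGODB_URI || "mongodb://localhost/ndtvnews"
  };
}

var config = resolveConfig(process.env);
console.log("port:", config.port, "database:", config.mongoUri);
```

Keeping the fallbacks in a single function makes it easier to see which values apply when nothing is set in the environment.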

## Technology

- HTML, CSS, jQuery, Bootstrap, [Handlebars.js](https://handlebarsjs.com/)

- JavaScript

- Node.js

- MongoDB and [Mongoose](http://mongoosejs.com/)

- [Express.js](https://expressjs.com/)

- npm, including [express](https://www.npmjs.com/package/express) and [body-parser](https://www.npmjs.com/package/body-parser) packages.

- [cheerio](https://cheerio.js.org/) for scraping the website

Filename: server.js: This is the main file that starts the app. It calls the NDTV site, scrapes the data, and stores it in the MongoDB database.

Steps to run the application: Run the server.js file using the following command.

node server.js

Output: We will see the following output on the terminal screen.

App is running

Now open any browser and go to http://localhost:3000/. We will see a page similar to the one below.

To get the news from ndtv.com, click Get New Articles. This internally calls our /scrape route. Once the call completes, the articles collection in the ndtvnews database is filled with data, as shown below:

Here, the saved attribute is initially false. The _id attribute is created automatically by MongoDB and uniquely identifies a document within a collection. It is this attribute that lets the app view a document, save a document, and so on.

Clicking View article on NDTV navigates to the respective article. This is possible because of the _id attribute present in the articles collection. View article on NDTV is a hyperlink, so when we click it, that document's _id value is picked up internally and the link is displayed. When Save article is clicked, the _id value identifies the article to be saved.
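
Before scraped headlines reach the articles collection, it helps to drop incomplete or repeated entries. This is a minimal sketch in plain JavaScript, assuming each scraped item is an object with `title` and `link` strings; `toArticleDocs` is a hypothetical helper, not part of the project's code, and the sample data is made up.

```javascript
// Hypothetical helper: turn raw scraped items into clean article objects.
function toArticleDocs(rawItems) {
  var seenLinks = new Set();
  var docs = [];
  rawItems.forEach(function(item) {
    // scraped text often carries surrounding newlines and tabs
    var title = (item.title || "").trim();
    var link = (item.link || "").trim();
    // skip headlines without text or without a link, and repeated links
    if (!title || !link || seenLinks.has(link)) {
      return;
    }
    seenLinks.add(link);
    docs.push({ title: title, link: link });
  });
  return docs;
}

// Made-up sample input for illustration only
var sample = [
  { title: "\n\tFirst headline\n", link: "https://ndtv.com/a" },
  { title: "No link here" },
  { title: "First headline again", link: "https://ndtv.com/a" }
];
console.log(toArticleDocs(sample));
```

Filtering by link rather than by title is a design choice here: two stories may share a headline, but a repeated link almost always means the same article was picked up twice.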
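
To make these attributes concrete, here is a sketch of what one stored article could look like, written as a plain JavaScript object. The field names `title`, `link`, `saved`, and `notes` are the ones `server.js` reads and writes. The project's actual Mongoose schema lives in its `models` folder and is not shown in this article, so the defaults below are assumptions. `newArticle` is a hypothetical helper, and `_id` is left out because MongoDB assigns it on insert.

```javascript
// Hypothetical helper: build the object that would be inserted for one headline.
function newArticle(title, link) {
  return {
    title: title,
    link: link,
    saved: false, // assumed default: not marked as a favorite yet
    notes: []     // ids of notes attached later
  };
}

// Made-up sample values for illustration only
var article = newArticle("Sample headline", "https://ndtv.com/sample");
console.log(article);
```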
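
The role of `_id` when saving can be sketched as follows. `server.js` exposes a `PUT /saved/:id` route that applies the request body with `$set`, so a client would place the article's `_id` in the URL. The project's front-end code is not shown in this article, so this is only an assumption about how such a request could be assembled, and both function names are hypothetical. The check relies on the fact that a MongoDB ObjectId is written as 24 hexadecimal characters.

```javascript
// Hypothetical helper: true when the value looks like an ObjectId hex string.
function isObjectIdString(id) {
  return typeof id === "string" && /^[0-9a-fA-F]{24}$/.test(id);
}

// Hypothetical helper: describe the request that marks an article as saved.
function buildSaveRequest(id) {
  if (!isObjectIdString(id)) {
    throw new Error("not a valid article id: " + id);
  }
  return {
    method: "PUT",
    url: "/saved/" + id,
    body: { saved: true } // applied on the server through $set
  };
}

// Made-up id for illustration only
console.log(buildSaveRequest("60acb0c2f1d2a43b5c8e9f10"));
```

Rejecting malformed ids before sending the request avoids a round trip that the server could only answer with an error.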

The important source code is shown here. The entire working code is available at https://github.com/raj123raj/NdtvNewsScraperUsingMongoDB

// require dependencies

// Express is a minimal and flexible Node.js web application framework that provides a robust set of features for web and mobile applications.
var express = require("express");

// To communicate with MongoDB
var mongoose = require("mongoose");

// We need to call the NDTV website and fetch its pages, which is an HTTP request
var request = require("request");

// Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure
var cheerio = require("cheerio");

// Node.js body parsing middleware.
// Parses incoming request bodies in a middleware before your handlers, available under the req.body property.
var bodyParser = require("body-parser");

var exphbs = require("express-handlebars");

// We can explicitly set the port number, provided no other process is running on that port
var PORT = process.env.PORT || 3000;

// initialize Express
var app = express();

// use body-parser for handling form submissions
app.use(bodyParser.urlencoded({
  extended: false
}));

// also parse incoming request bodies sent as application/json
app.use(bodyParser.json({
  type: "application/json"
}));

// serve the public directory
app.use(express.static("public"));

// use promises with Mongo and connect to the database
// Let us have our MongoDB database name be ndtvnews
// By using Promise, Mongoose async operations, like .save() and queries, return thenables.
mongoose.Promise = Promise;

var MONGODB_URI = process.env.MONGODB_URI || "mongodb://localhost/ndtvnews";

mongoose.connect(MONGODB_URI);

// use handlebars
app.engine("handlebars", exphbs({
  defaultLayout: "main"
}));

app.set("view engine", "handlebars");

// Load the Mongoose models (Article and Note) into the db variable
var db = require("./models");

// Find the articles in the database that are not saved yet
// This runs when the home URL is opened
app.get("/", function(req, res) {
  db.Article.find({
    saved: false
  },
  function(error, dbArticle) {
    if (error) {
      console.log(error);
      // answer the request so the browser is not left waiting
      res.sendStatus(500);
    } else {
      // We are passing the contents to index.handlebars
      res.render("index", {
        articles: dbArticle
      });
    }
  });
});

// use cheerio to scrape stories from NDTV and store them
// We need to do this once each day
app.get("/scrape", function(req, res) {
  request("https://ndtv.com/", function(error, response, html) {
    // stop early if the page could not be fetched
    if (error) {
      console.log(error);
      return res.sendStatus(500);
    }

    // Load the html body from request into cheerio
    var $ = cheerio.load(html);

    // one pending insert per headline, so we can respond once all are done
    var inserts = [];

    // By inspecting the web page we know how to get the title, i.e. the news headlines.
    // View page source shows the same thing. The selector differs for each web page.
    $("h2").each(function(i, element) {
      // trim() removes whitespace because the items return \n and \t before and after the text
      var title = $(element).find("a").text().trim();
      console.log("title", title);

      var link = $(element).find("a").attr("href");
      console.log("link", link);

      // if these are present in the scraped data, create an article in the database collection
      if (title && link) {
        inserts.push(
          db.Article.create({
            title: title,
            link: link
          })
          .then(function(inserted) {
            // log the inserted data
            console.log(inserted);
          })
          .catch(function(err) {
            // log the error if one is encountered during the insert
            console.log(err);
          })
        );
      }
    });

    // respond to the frontend exactly once, after every insert has finished
    Promise.all(inserts).then(function() {
      res.sendStatus(200);
    });
  });
});

// route for retrieving all the saved articles. User has the option to save the article.
// Once it is saved, the "saved" field in the collection is set to true. The routine below finds the articles that are saved
app.get("/saved", function(req, res) {
  db.Article.find({
    saved: true
  })
  .then(function(dbArticle) {
    // if successful, then render with the handlebars saved page
    // this time saved.handlebars is called and that page is rendered
    res.render("saved", {
      articles: dbArticle
    });
  })
  .catch(function(err) {
    // If an error occurs, send the error back to the client
    res.json(err);
  });
});

// route for setting an article to saved
// In order to save an article, this routine is used.
// The _id field in the collection is unique and identifies the news article
app.put("/saved/:id", function(req, res) {
  db.Article.findByIdAndUpdate(
    req.params.id, {
      $set: req.body
    }, {
      new: true
    })
  .then(function(dbArticle) {
    // this time saved.handlebars is called and that page is rendered
    res.render("saved", {
      articles: dbArticle
    });
  })
  .catch(function(err) {
    res.json(err);
  });
});

// route for saving a new note to the db and associating it with an article
app.post("/submit/:id", function(req, res) {
  db.Note.create(req.body)
  .then(function(dbNote) {
    var articleIdFromString = mongoose.Types.ObjectId(req.params.id);
    return db.Article.findByIdAndUpdate(articleIdFromString, {
      $push: {
        notes: dbNote._id
      }
    })
    .then(function(dbArticle) {
      // dbNote is still in scope here, so the new note can be sent back
      res.json(dbNote);
    });
  })
  .catch(function(err) {
    // If an error occurs, send it back to the client
    res.json(err);
  });
});

// route to find an article by ID together with its notes
app.get("/notes/article/:id", function(req, res) {
  db.Article.findOne({"_id": req.params.id})
  .populate("notes")
  .exec(function(error, data) {
    if (error) {
      console.log(error);
      res.sendStatus(500);
    } else {
      res.json(data);
    }
  });
});

// route to delete a note by ID
app.get("/notes/:id", function(req, res) {
  db.Note.findOneAndRemove({_id: req.params.id}, function(error, data) {
    if (error) {
      console.log(error);
      res.sendStatus(500);
    } else {
      res.json(data);
    }
  });
});

// listen for the routes
app.listen(PORT, function() {
  console.log("App is running");
});
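
The /submit/:id and /notes/article/:id routes in the listing above store only note ids on an article and later swap them for the full notes with `populate`. The following in-memory sketch shows that idea without a database. `attachNote` and `populateNotes` are hypothetical names, the note field `body` is an assumption, and the objects are made-up sample data, so this illustrates the pattern rather than how Mongoose implements it.

```javascript
// Hypothetical helper: record a note on an article by id only,
// similar in spirit to $push: { notes: note._id }.
function attachNote(article, note) {
  article.notes.push(note._id);
  return article;
}

// Hypothetical helper: return a copy of the article whose note ids
// are replaced by the full note objects, similar in spirit to populate("notes").
function populateNotes(article, notesById) {
  return Object.assign({}, article, {
    notes: article.notes
      .map(function(id) { return notesById[id]; })
      // ids that no longer point at a note are dropped
      .filter(Boolean)
  });
}

// Made-up sample data for illustration only
var notesById = { n1: { _id: "n1", body: "Worth reading again" } };
var article = attachNote({ _id: "a1", title: "Sample headline", notes: [] }, notesById.n1);
var populated = populateNotes(article, notesById);
console.log(populated.notes);
```

Storing ids instead of embedding the notes keeps each note in one place, so deleting a note does not require rewriting the article's text fields.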

Working: The entire working model of the project is explained in the video:

Conclusion: With this approach it is straightforward to scrape a news website and display each headline together with a link to the full story. We can also save articles and check the saved articles easily.

Reference: https://github.com/raj123raj/NdtvNewsScraperUsingMongoDB

Originally published at https://www.geeksforgeeks.org on May 25, 2021.


Written by priya raj

Freelancer and software consultant with 12+ years of industry experience in Java, Android, SQL, and MongoDB.
