A data pipeline to generate stats from logs.

Overview

Logs Data Pipeline

This project implements a data pipeline whose goal is to extract insights from the logs of deployed microservices. Since we didn't actually implement or deploy any microservices, we simulate these logs using the following Kaggle dataset.

Architecture

architecture

The process is simple: each log line is sent to a Kafka topic. A Spark Streaming process takes each log line from Kafka and cleans it - extracting the country from the IP address, trimming strings, dropping useless fields, etc. The cleaned data is then stored in MongoDB. A Spark Batch process reads all of the data from MongoDB and generates stats - the possibilities are endless. In our case, we extracted two stats: one related to the APIs' response times and the other to the distribution of the countries the requests come from. These stats are stored in MongoDB to be consumed by a Flask API when the Angular dashboard fetches the data.
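To make the cleaning step concrete, here is an illustrative sketch of what the per-line transformation could look like. The field layout and the country lookup are assumptions for the example (a real pipeline would use a GeoIP library rather than this toy table, and the actual code runs inside Spark Streaming):

```python
# Toy IP-to-country lookup; a real implementation would use a GeoIP database.
COUNTRY_BY_IP = {"10.0.0.1": "Tunisia", "10.0.0.2": "France"}

def clean_log_line(line):
    """Parse a hypothetical 'ip method path status response_ms' log line
    into a cleaned record ready to be stored in MongoDB."""
    ip, method, path, status, response_ms = line.split()
    return {
        "country": COUNTRY_BY_IP.get(ip.strip(), "unknown"),  # IP -> country
        "api": path.strip(),                                  # trimmed path
        "method": method.strip().upper(),                     # normalized verb
        "status": int(status),
        "response_ms": float(response_ms),
    }
```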

How to run the project

First, you will need to run the containers, so start by running the following:

docker-compose up

While waiting forever for the containers to start - yes, it does sadly take time - let's install the libraries you'll need in order to run the Python scripts.

python3 -m pip install -r requirements.txt

I use Linux, but if you're a Windows user, you should probably use either py -m pip install -r requirements.txt or simply pip install -r requirements.txt.

Once your containers are ready, the next thing you'll need to do is download the Kaggle dataset and put it in the root of the project.

Now we have our containers running, our libraries installed, and our dataset - the next step is to run the producer, which sends the logs to Kafka line by line.

python3 producer.py
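A minimal sketch of what such a producer might look like, assuming the kafka-python client, a broker on localhost:9092, a topic named "logs", and a file named logfiles.log (the file name and topic are assumptions, not the project's actual values):

```python
import time

def read_log_lines(path):
    """Yield stripped, non-empty log lines from the dataset file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

def main():
    # Imported lazily so the helper above is usable without a Kafka install.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for line in read_log_lines("logfiles.log"):
        producer.send("logs", line.encode("utf-8"))
        time.sleep(0.01)  # throttle a little to simulate a live service
    producer.flush()

if __name__ == "__main__":
    main()
```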

Now, while that process is running, run the Spark Streaming process.

python3 preprocessing.py

Spark is now processing the logs line by line and saving the cleaned data in MongoDB.

To run the Spark Batch process:

python3 processing.py
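Conceptually, the two stats amount to a group-by over the cleaned records. Here is a plain-Python sketch of that aggregation (the field names "api", "response_ms", and "country" are assumptions; the real processing.py runs this kind of computation in Spark over the data in MongoDB):

```python
from collections import Counter, defaultdict

def compute_stats(records):
    """Compute average response time per API and request count per country."""
    totals = defaultdict(lambda: [0.0, 0])  # api -> [sum of times, count]
    countries = Counter()
    for r in records:
        t = totals[r["api"]]
        t[0] += r["response_ms"]
        t[1] += 1
        countries[r["country"]] += 1
    avg_response = {api: s / n for api, (s, n) in totals.items()}
    return {
        "avg_response_ms": avg_response,
        "requests_by_country": dict(countries),
    }
```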

Now that we have our stats in Mongo, let's run the Flask app:

cd api

export FLASK_APP=api

flask run
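For reference, a minimal Flask app serving such stats could look like the sketch below. The /stats route and the in-memory STATS dict are assumptions for illustration; the project's actual api package reads the stats documents from MongoDB instead:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the stats documents written to MongoDB by the batch job.
STATS = {
    "avg_response_ms": {"/users": 12.5},
    "requests_by_country": {"Tunisia": 3},
}

@app.route("/stats")
def stats():
    """Return the precomputed stats as JSON for the dashboard to consume."""
    return jsonify(STATS)
```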

And your Flask app is running! The next step is to run the Angular dashboard. Open another terminal and:

cd dashboard

npm install

npm start

You should be met with the following interface:

dashboard

Now I know that it's not the best UI in the world, but it's honest work...

To purge all of the data in MongoDB, just run:

python3 purge.py

And that's all folks, I hope you enjoyed the project!

Owner: Amine Haj Ali - Software Engineering student at INSAT