A data pipeline to generate stats from logs.

Overview

Logs Data Pipeline

This project implements a data pipeline whose goal is to extract insights from the logs of deployed microservices. Since we didn't actually implement or deploy any microservices, we simulate these logs using the following Kaggle dataset.

Architecture

architecture

The process is simple: each log line is sent to a Kafka topic. A Spark Streaming process takes each log line from Kafka and cleans it - extracting the country from the IP address, trimming strings, dropping useless fields, etc. The cleaned data is then stored in MongoDB. A Spark Batch process reads all of the data from MongoDB and generates stats - the possibilities are endless. In our case, we extracted two stats: one related to the APIs' response times and the other to the distribution of the countries the requests come from. These stats are stored in MongoDB to be consumed by a Flask API when the Angular dashboard fetches the data.
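To make the cleaning step concrete, here is an illustrative sketch of what the per-line transformation could look like. The field layout and the country lookup are assumptions for the example (a real pipeline would use a GeoIP library rather than this toy table, and the actual code runs inside Spark Streaming):

```python
# Toy IP-to-country lookup; a real implementation would use a GeoIP database.
COUNTRY_BY_IP = {"10.0.0.1": "Tunisia", "10.0.0.2": "France"}

def clean_log_line(line):
    """Parse a hypothetical 'ip method path status response_ms' log line
    into a cleaned record ready to be stored in MongoDB."""
    ip, method, path, status, response_ms = line.split()
    return {
        "country": COUNTRY_BY_IP.get(ip.strip(), "unknown"),  # IP -> country
        "api": path.strip(),                                  # trimmed path
        "method": method.strip().upper(),                     # normalized verb
        "status": int(status),
        "response_ms": float(response_ms),
    }
```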

How to run the project

First, you will need to run the containers, so start by running the following:

docker-compose up

While waiting forever for the containers to start - yes, it does sadly take time - let's install the libraries you'll need in order to run the Python scripts.

python3 -m pip install -r requirements.txt

I use Linux, but if you're a Windows user, you should probably use either py -m pip install -r requirements.txt or simply pip install -r requirements.txt.

Once your containers are ready, the next thing you'll need to do is download the Kaggle dataset and put it in the root of the project.

Now we have our containers running, our libraries installed, and our dataset - the next step is to run the producer, which sends the logs to Kafka line by line.

python3 producer.py
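A minimal sketch of what such a producer might look like, assuming the kafka-python client, a broker on localhost:9092, a topic named "logs", and a file named logfiles.log (the file name and topic are assumptions, not the project's actual values):

```python
import time

def read_log_lines(path):
    """Yield stripped, non-empty log lines from the dataset file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

def main():
    # Imported lazily so the helper above is usable without a Kafka install.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for line in read_log_lines("logfiles.log"):
        producer.send("logs", line.encode("utf-8"))
        time.sleep(0.01)  # throttle a little to simulate a live service
    producer.flush()

if __name__ == "__main__":
    main()
```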

Now, while that process is running, run the Spark Streaming process.

python3 preprocessing.py

Spark is now processing the logs line by line and saving the cleaned data in MongoDB.

To run the Spark Batch process:

python3 processing.py
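Conceptually, the two stats amount to a group-by over the cleaned records. Here is a plain-Python sketch of that aggregation (the field names "api", "response_ms", and "country" are assumptions; the real processing.py runs this kind of computation in Spark over the data in MongoDB):

```python
from collections import Counter, defaultdict

def compute_stats(records):
    """Compute average response time per API and request count per country."""
    totals = defaultdict(lambda: [0.0, 0])  # api -> [sum of times, count]
    countries = Counter()
    for r in records:
        t = totals[r["api"]]
        t[0] += r["response_ms"]
        t[1] += 1
        countries[r["country"]] += 1
    avg_response = {api: s / n for api, (s, n) in totals.items()}
    return {
        "avg_response_ms": avg_response,
        "requests_by_country": dict(countries),
    }
```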

Now that we have our stats in Mongo, let's run the Flask app:

cd api

export FLASK_APP=api

flask run
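For reference, a minimal Flask app serving such stats could look like the sketch below. The /stats route and the in-memory STATS dict are assumptions for illustration; the project's actual api package reads the stats documents from MongoDB instead:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the stats documents written to MongoDB by the batch job.
STATS = {
    "avg_response_ms": {"/users": 12.5},
    "requests_by_country": {"Tunisia": 3},
}

@app.route("/stats")
def stats():
    """Return the precomputed stats as JSON for the dashboard to consume."""
    return jsonify(STATS)
```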

And your Flask app is running! The next step is to run the Angular dashboard. Open another terminal and:

cd dashboard

npm install

npm start

You should be met with the following interface:

dashboard

Now I know that it's not the best UI in the world, but it's honest work...

To purge all of the data in MongoDB, just run:

python3 purge.py

And that's all folks, I hope you enjoyed the project!

Owner: Amine Haj Ali - Software Engineering student at INSAT