I'm Eddie.

I transform raw data into actionable insights and robust cloud solutions.
Expert in Data Engineering, BI Development, and Product Analytics.

Featured Data Engineering Projects

Here are some key projects where I've tackled challenging data problems and delivered impactful solutions using a modern tech stack.

βš™οΈ BigQuery Cost Optimization

Problem:

Escalating Google BigQuery costs due to inefficient querying of large raw data dumps, impacting budget and query performance.

Solution:

Developed and implemented strategies to optimize BigQuery usage, including query restructuring, table partitioning, and clustering. This resulted in a 35% reduction in query costs and improved data processing efficiency by 50%.

Tech Stack:
Google BigQuery SQL Python
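
Illustrative sketch of the partitioning-plus-clustering side of the optimization, using the google-cloud-bigquery client. The table name, schema, and clustering columns below are placeholders, not the production setup.

```python
from google.cloud import bigquery

client = bigquery.Client()  # relies on GOOGLE_APPLICATION_CREDENTIALS

table_id = "my-project.analytics.events_partitioned"  # hypothetical table
schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by date so queries filtering on event_date scan only the
# relevant partitions instead of the whole raw dump.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
# Cluster within each partition to prune scanned bytes further for the
# most common filter columns.
table.clustering_fields = ["customer_id", "event_type"]

table = client.create_table(table)
print(f"Created {table.full_table_id}, partitioned on event_date")
```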

πŸ’Ύ Tableau Backup Tool

Problem:

Manual and inconsistent backups of critical Tableau workbooks and datasources, leading to a high risk of data loss and lack of version control.

Solution:

Engineered an automated Python tool that backs up Tableau assets to Git, incorporating parallel processing, progress tracking, and intelligent file handling. This ensured reliable versioning, reduced manual backup effort by approximately 5 hours per week, and significantly minimized potential data recovery time.

Tech Stack:
Python Tableau API Git
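
Simplified sketch of the backup flow, assuming the tableauserverclient library and a local Git repository. Server URL, token names, and directory layout are placeholders; the real tool adds progress tracking and smarter file handling.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import tableauserverclient as TSC

SERVER_URL = "https://tableau.example.com"   # hypothetical server
BACKUP_DIR = Path("tableau_backup/workbooks")

def download_workbook(server, workbook):
    """Download a single workbook into the backup directory."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    path = server.workbooks.download(workbook.id, filepath=str(BACKUP_DIR / workbook.name))
    print(f"Saved {workbook.name} -> {path}")
    return path

def main():
    auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="my-site")
    server = TSC.Server(SERVER_URL, use_server_version=True)
    with server.auth.sign_in(auth):
        all_workbooks = list(TSC.Pager(server.workbooks))
        # Parallel downloads keep the nightly backup window short.
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(lambda wb: download_workbook(server, wb), all_workbooks))
    # Commit the snapshot so every backup becomes a versioned point in history.
    subprocess.run(["git", "add", "-A"], cwd="tableau_backup", check=True)
    # The commit may be a no-op if nothing changed since the last run.
    subprocess.run(["git", "commit", "-m", "Automated Tableau backup"], cwd="tableau_backup")

if __name__ == "__main__":
    main()
```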

☁️ Azure Medallion API

Problem:

Lack of a structured and scalable way to process raw customer purchase data from CSVs into curated datasets and serve them for analytical or application use.

Solution:

Designed and built an Azure-based Medallion data pipeline (Bronze/Silver/Gold layers) and a FastAPI endpoint to serve processed customer data. This improved data quality by an estimated 20%, enabled consistent data access for 3 downstream analytical systems, and served as a foundation for enhanced sales reporting.

Tech Stack:
Azure (Blob Storage, Functions, etc.) FastAPI Python Pandas
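
Minimal sketch of the serving layer, assuming a Gold-layer Parquet file and a customer_id column (both hypothetical names); the production pipeline reads from Azure Blob Storage rather than local disk.

```python
import json

import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Customer Purchases API")

# The Gold layer would normally live in Azure Blob Storage; a local Parquet
# file keeps this sketch self-contained.
GOLD_PATH = "gold/customer_purchases.parquet"  # hypothetical path

@app.get("/customers/{customer_id}/purchases")
def get_purchases(customer_id: str):
    df = pd.read_parquet(GOLD_PATH)
    purchases = df[df["customer_id"] == customer_id]
    if purchases.empty:
        raise HTTPException(status_code=404, detail="Customer not found")
    # to_json/json.loads converts NumPy dtypes into plain JSON-serializable types.
    return json.loads(purchases.to_json(orient="records"))
```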

✨ PySpark ML Features

Problem:

Difficulty in scaling Scikit-Learn based machine learning feature engineering pipelines for large datasets on distributed computing platforms like Azure Databricks.

Solution:

Implemented PySpark equivalents of 6 common Scikit-Learn feature-engineering transformations, optimized for distributed execution on Azure Databricks. This enabled scalable ML feature generation, reducing processing time for large datasets by up to 60% and facilitating more complex model development for customer churn prediction.

Tech Stack:
PySpark Azure Databricks Scikit-Learn (concept)
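
One hedged example of such a port: a distributed equivalent of Scikit-Learn's StandardScaler, expressed as plain DataFrame aggregations. Column names and sample data are illustrative, not the project's actual features.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ml-features").getOrCreate()

df = spark.createDataFrame(
    [(1, 120.0), (2, 80.0), (3, 200.0), (4, 95.0)],
    ["customer_id", "monthly_spend"],  # hypothetical churn-model input
)

def standard_scale(df, col):
    """Centre and scale a numeric column: (x - mean) / stddev, computed distributedly."""
    stats = df.agg(
        F.mean(col).alias("mu"),
        F.stddev(col).alias("sigma"),
    ).first()
    return df.withColumn(f"{col}_scaled", (F.col(col) - stats["mu"]) / stats["sigma"])

scaled = standard_scale(df, "monthly_spend")
scaled.show()
```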

πŸ”— Airflow-dbt-DuckDB-Medallion

Problem:

Need for a modern, cost-effective, and robust data platform for orchestrating complex data transformations and enabling fast analytical queries on evolving datasets.

Solution:

Architected and implemented a data platform leveraging Airflow for orchestration, dbt for SQL-based transformations (Medallion architecture), and DuckDB for high-performance local analytics. This solution streamlined data pipeline development by 25%, improved data quality through versioned transformations, and reduced query latency for analytical workloads by 70%.

Tech Stack:
Airflow dbt DuckDB PySpark SQL Python
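
Minimal Airflow 2.x DAG sketch of the orchestration layer: dbt builds the Medallion models against DuckDB, then runs its tests. The project path, DAG id, and schedule are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/dbt/medallion_project"  # hypothetical dbt project path

with DAG(
    dag_id="medallion_dbt_duckdb",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_DIR} && dbt run --profiles-dir .",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_DIR} && dbt test --profiles-dir .",
    )
    # Transformations must succeed before data-quality tests run.
    dbt_run >> dbt_test
```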

πŸ“ Text Summarizer

Problem:

Information overload from lengthy documents and articles, requiring significant time for manual summarization to extract key insights.

Solution:

Developed a BART-powered text summarizer with batch processing and quantization for efficiency. The tool processes multiple documents, reducing reading and analysis time by an average of 75% while retaining crucial information.

Tech Stack:
Python PyTorch Hugging Face Transformers NLTK
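
Hedged sketch of the core pieces: a BART summarization pipeline, dynamic quantization for lighter CPU inference, and batched processing of several documents. The checkpoint and length limits are assumptions.

```python
import torch
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Dynamic quantization of the linear layers reduces memory use and speeds up
# CPU inference.
summarizer.model = torch.quantization.quantize_dynamic(
    summarizer.model, {torch.nn.Linear}, dtype=torch.qint8
)

documents = [
    "First long article text ...",   # replace with real documents
    "Second long article text ...",
]

# Passing a list lets the pipeline summarize all documents in one batch.
summaries = summarizer(documents, max_length=130, min_length=30, do_sample=False)
for result in summaries:
    print(result["summary_text"])
```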

😊 Sentiment Analysis Tool

Problem:

Need to quickly gauge public opinion or customer feedback from large volumes of text data (e.g., reviews, social media) without time-consuming manual analysis.

Solution:

Created a user-friendly CLI tool for text sentiment analysis using state-of-the-art transformer models. It accurately classifies sentiment and provides confidence scores, enabling analysis of over 1,000 text entries per minute for rapid insights.

Tech Stack:
Python PyTorch Hugging Face Transformers Pandas
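
Minimal sketch of the CLI, assuming one text entry per input line; the checkpoint and output format are illustrative.

```python
import argparse

from transformers import pipeline

def main():
    parser = argparse.ArgumentParser(description="Sentiment analysis over a text file")
    parser.add_argument("input_file", help="Text file with one entry per line")
    args = parser.parse_args()

    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )

    with open(args.input_file, encoding="utf-8") as fh:
        texts = [line.strip() for line in fh if line.strip()]

    # Batched inference keeps throughput high on large review dumps.
    for text, result in zip(texts, classifier(texts, batch_size=32)):
        print(f"{result['label']:<8} {result['score']:.3f}  {text[:60]}")

if __name__ == "__main__":
    main()
```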

πŸ“ Geocoding Automation

Problem:

Manually finding latitude and longitude for large lists of addresses is time-consuming and error-prone, hindering location-based analysis and services.

Solution:

Developed a Python script to automate geocoding of addresses from CSV files using ArcGIS and Komoot APIs. The script efficiently processes batches, appends coordinates, and handles API errors, saving an estimated 8 hours of manual work per 1,000 addresses processed.

Tech Stack:
Python ArcGIS API Komoot API Pandas
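
Hedged sketch of the batch loop against Komoot's public Photon endpoint, assuming an address column in the input CSV; the production script also uses the ArcGIS geocoder and more robust error handling.

```python
import pandas as pd
import requests

PHOTON_URL = "https://photon.komoot.io/api/"

def geocode(address: str):
    """Return (lat, lon) for an address, or (None, None) on failure."""
    try:
        resp = requests.get(PHOTON_URL, params={"q": address, "limit": 1}, timeout=10)
        resp.raise_for_status()
        features = resp.json().get("features", [])
        if not features:
            return None, None
        lon, lat = features[0]["geometry"]["coordinates"]  # GeoJSON order: lon, lat
        return lat, lon
    except requests.RequestException:
        return None, None

df = pd.read_csv("addresses.csv")  # expects an 'address' column (assumption)
df[["lat", "lon"]] = df["address"].apply(lambda a: pd.Series(geocode(a)))
df.to_csv("addresses_geocoded.csv", index=False)
```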

πŸ—ΊοΈ Geo Profile Generator & Visualizer

Problem:

Difficulty in generating realistic synthetic datasets with geographical attributes for testing location-based applications or demographic analysis simulations.

Solution:

Created a Python Jupyter Notebook that generates fictional German profiles with plausible addresses and visualizes their geo-distribution on an interactive Folium map. This accelerated the testing cycle for geo-fencing algorithms by 40%.

Tech Stack:
Python Jupyter Notebook Folium Pandas Faker
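
Condensed sketch of the notebook's idea, assuming Faker's de_DE locale and its local_latlng provider for plausible German coordinates.

```python
import folium
from faker import Faker

fake = Faker("de_DE")

profiles = []
for _ in range(50):
    # local_latlng returns (lat, lon, place, country_code, timezone) as strings.
    lat, lon, city, _, _ = fake.local_latlng(country_code="DE")
    profiles.append({
        "name": fake.name(),
        "address": fake.address().replace("\n", ", "),
        "city": city,
        "lat": float(lat),
        "lon": float(lon),
    })

# Centre the map roughly on Germany and drop one marker per profile.
m = folium.Map(location=[51.0, 10.0], zoom_start=6)
for p in profiles:
    folium.Marker([p["lat"], p["lon"]], popup=f"{p['name']} ({p['city']})").add_to(m)
m.save("geo_profiles.html")
```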

πŸ“ˆ Sales Performance Dashboard

Problem:

Lack of a centralized and interactive view of key sales metrics, making it difficult to track performance, identify trends, and make timely data-driven decisions.

Solution:

Developed a web-based dashboard providing in-depth insights into sales performance (revenue, growth, regional sales). Interactive visualizations improved the sales team's access to critical data, contributing to a 10% increase in the effectiveness of targeted sales strategies.

Tech Stack:
HTML CSS JavaScript D3.js
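
The dashboard front end is plain HTML/CSS/JavaScript with D3.js; purely as a hedged illustration of the kind of aggregated feed such a page would load (e.g. via d3.json), here is a small pandas sketch with made-up file and column names.

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # expects: date, region, revenue (assumption)

# Aggregate revenue per month and region, then compute month-over-month growth.
monthly = (
    sales.assign(month=pd.to_datetime(sales["date"]).dt.to_period("M").astype(str))
         .groupby(["month", "region"], as_index=False)["revenue"].sum()
)
monthly["growth_pct"] = (
    monthly.groupby("region")["revenue"].pct_change() * 100
).fillna(0).round(1)

# Write the feed the D3.js front end would fetch.
monthly.to_json("dashboard_data.json", orient="records", indent=2)
```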

⚑️ Energy Data Collector & Analyzer

Problem:

Fragmented energy data sources (consumption, carbon emissions, pricing) making comprehensive analysis and identification of optimization opportunities difficult.

Solution:

Built a Python script to aggregate energy data from various APIs. The system stores and processes this data, enabling analysis that identified potential annual energy cost savings of 15% through optimized usage patterns and tariff selection.

Tech Stack:
Python Pandas Requests Grafana (for visualization)
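
Hedged sketch of the aggregation loop; the endpoints and field names are placeholders, not the real APIs.

```python
import pandas as pd
import requests

SOURCES = {
    "consumption": "https://api.example.com/energy/consumption",   # hypothetical
    "carbon": "https://api.example.com/energy/carbon-intensity",   # hypothetical
    "price": "https://api.example.com/energy/prices",              # hypothetical
}

def fetch(name: str, url: str) -> pd.DataFrame:
    """Fetch one source and tag each row with its metric name."""
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())  # assumes [{"timestamp": ..., "value": ...}, ...]
    df["metric"] = name
    return df

frames = [fetch(name, url) for name, url in SOURCES.items()]
combined = pd.concat(frames, ignore_index=True)

# One tidy file that Grafana (or further pandas analysis) can consume.
combined.to_csv("energy_data.csv", index=False)
print(combined.groupby("metric")["value"].describe())
```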

πŸ¦† DuckDB Analyzer

Problem:

Analyzing large CSV datasets (multi-GB) locally using traditional tools like Pandas alone can be slow and memory-intensive for complex queries and aggregations.

Solution:

Developed a Python-based tool leveraging DuckDB for high-performance analysis of large CSV files. This approach reduced query times for complex aggregations by up to 70% and lowered memory consumption by 50% compared to purely Pandas-based methods on datasets over 5GB.

Tech Stack:
Python DuckDB Pandas SQL
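
Minimal sketch of the approach: DuckDB scans the CSV out-of-core and only the small aggregated result is handed to pandas. File and column names are assumptions.

```python
import duckdb

con = duckdb.connect()  # in-memory database

# DuckDB streams the CSV from disk, so the full multi-GB file never has to fit
# into a pandas DataFrame; only the aggregated result does.
result = con.execute(
    """
    SELECT category,
           COUNT(*)    AS orders,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM read_csv_auto('large_dataset.csv')
    GROUP BY category
    ORDER BY total_amount DESC
    """
).df()  # .df() returns a pandas DataFrame for downstream analysis

print(result.head())
```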