Welcome to my portfolio! Here, you'll find a showcase of projects that blend my professional expertise with academic insights. Each project represents a unique idea and practical application, crafted to demonstrate the skills and knowledge I've developed throughout my journey. For each project, I've included links to explore further on GitHub. In addition to GitHub, I've also recorded walkthroughs and demos on YouTube so you can see the projects in action. I hope you find these projects inspiring, insightful, and reflective of my passion for data science and technology.
Gradio, FastAPI, Uvicorn, Whisper
LlamaIndex, PyMuPDF, MLflow, Docker
I am an ML Engineer with a strong foundation in NLP, machine learning, and document processing. My experience spans a range of impactful projects, including fine-tuning large language models (LLMs), building Retrieval-Augmented Generation (RAG) systems, and developing semantic similarity search engines using vector databases. I have expertise in document clustering, OCR solutions, and creating advanced Named Entity Recognition (NER) models for information extraction. Additionally, I've implemented Explainable AI to make model outputs accessible to stakeholders and used visualization tools like Tableau to drive data insights. My technical skills extend to deploying scalable microservices with Docker and FastAPI, and managing projects with GitLab and Confluence. My thesis focused on enhancing text simplification in LegalTech using curriculum learning and the T5 model, showcasing my dedication to pushing the boundaries of NLP applications.
• Fine-tuned LLMs (GPT-4, LLaMA 2, Deepseek, BERT) for various NLP tasks
• Developed RAG and GraphRAG systems using vector databases (Qdrant/FAISS) for document search and retrieval
• Built a document clustering pipeline with OCR, layout analysis, and embedding-based similarity
• Designed an NLP microservice pipeline for chatbot deployment with retrieval and synthetic text generation
• Developed NLP microservices with FastAPI, Docker, and Streamlit for scalable model deployment
• Implemented semantic similarity search using vector databases
• Applied unsupervised clustering to organize and group related documents
• Benchmarked tokenization strategies (WordPiece, Byte-Pair Encoding, SentencePiece) for structured document processing
• Developed NER models for extracting information from unstructured text
• Implemented OCR solutions for structured text extraction
• Applied Explainable AI techniques to interpret model decisions
• Collaborated with cross-functional teams to integrate AI-driven solutions
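The semantic similarity search listed above can be illustrated with a minimal sketch: rank documents by cosine similarity between embedding vectors. The 3-dimensional vectors and document names below are invented for illustration; a production system would use real transformer embeddings indexed in Qdrant or FAISS.

```python
import math

# Toy in-memory "vector database": invented 3-d embeddings for illustration.
DOCS = {
    "invoice":  [0.9, 0.1, 0.0],
    "contract": [0.8, 0.3, 0.1],
    "recipe":   [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, k=2):
    """Return the ids of the k documents most similar to the query embedding."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

print(search([0.85, 0.2, 0.05]))  # -> ['invoice', 'contract']
```

Vector databases like Qdrant and FAISS implement exactly this ranking, but with approximate nearest-neighbor indexes so it scales to millions of documents.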
• Fine-tuned transformers (BERT, RoBERTa, T5) for classification and emotion detection
• Applied Explainable AI techniques to interpret model insights for stakeholders and clients
• Created visualizations and dashboards with Tableau
• Analyzed dataset features for model optimization
• Scraped automotive data across countries using Playwright, API calls, Selenium, and Beautiful Soup
• Processed PDF documents for text extraction and analysis
• Managed projects with GitLab, Confluence, Anaconda, Poetry, Pre-commit
• Integrated AWS S3 and Lambda for data storage and processing
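The scraping work above relied on Playwright, Selenium, and Beautiful Soup; the core idea of extracting structured data from HTML can be sketched with just the standard library's `html.parser` (swapped in here for Beautiful Soup so the example is self-contained). The HTML snippet and link paths are invented for illustration.

```python
from html.parser import HTMLParser

# Invented static page standing in for a scraped automotive listing.
SAMPLE_HTML = """
<html><body>
  <a href="/models/ev-suv">EV SUV</a>
  <a href="/models/sedan">Sedan</a>
  <p>No link here</p>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)  # -> ['/models/ev-suv', '/models/sedan']
```

Beautiful Soup offers the same extraction with a friendlier API (`soup.find_all("a")`), while Playwright and Selenium drive a real browser for pages that render content with JavaScript.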
• Contributed to an AI-based persona project for the automotive sector using German and English datasets
• Fine-tuned transformer models (BERT, RoBERTa, T5) for text classification and emotion detection
• Prepared and processed complex datasets with Pandas for analysis
• Benchmarked ML (sklearn) and DL (PyTorch) models for customer review insights
• Analyzed customer interviews using spaCy to support persona development
• Documented code and presented findings to stakeholders
• Completed a Master's degree in Politics & Technology at the Technical University of Munich.
• Focused on data science and machine learning modules with an emphasis on:
  • Advanced programming skills
  • Hands-on, practical applications
• Gained a broad and interdisciplinary foundation, enhancing expertise in:
  • Natural Language Processing (NLP)
  • Machine Learning
• Developed the skills and knowledge needed to address complex, real-world challenges across multiple domains.
Thesis: Enhancing Text Simplification through Baby-Step Curriculum Learning: A Case Study on Privacy Policies Using the T5 Model
• NLP for Text Simplification: Addressed the challenges of simplifying complex texts, specifically privacy policies, using advanced NLP techniques.
• Curriculum Learning Approach: Applied a baby-step curriculum learning approach to improve text simplification, a novel technique that structures training in progressive stages.
• Model and Tools: Leveraged the T5 Transformer model, with evaluations based on BLEU, SARI, and FKGL metrics.
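The baby-step schedule described above can be sketched in a few lines: bucket training examples by a difficulty proxy and feed the model progressively larger, harder cumulative subsets. The sentences below are invented stand-ins for privacy-policy text, and sentence length is used as a simple difficulty proxy; the actual thesis fine-tuned T5 and evaluated with BLEU, SARI, and FKGL.

```python
import math

# Invented examples standing in for privacy-policy sentences.
SENTENCES = [
    "Data is stored.",
    "We may share data with partners.",
    "Personal information collected from users can be disclosed to third parties.",
    "Short note.",
]

def curriculum_stages(examples, n_stages=3):
    """Yield cumulative training subsets, easiest (shortest) first.

    Each stage is a superset of the previous one, so the model revisits
    easy examples while gradually encountering harder ones.
    """
    ordered = sorted(examples, key=len)  # length as a difficulty proxy
    for stage in range(1, n_stages + 1):
        end = math.ceil(stage * len(ordered) / n_stages)
        yield ordered[:end]

stages = list(curriculum_stages(SENTENCES, n_stages=3))
# Stage 1 holds only the shortest sentences; the final stage holds them all.
```

In the thesis setting, each stage corresponds to a fine-tuning phase of the T5 model, with later phases mixing in the longer, more complex policy sentences.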
A curated list of tools and technologies I frequently use, categorized based on their application areas.
Transformers, SentenceTransformers, OpenAI, LangChain, TRL, Ollama, PEFT/QLoRA, DeepSeek-R1, CohereForAI
Groq LLM Inference Engine, CLIP Models, FastText, Gensim, Mixedbread Tokenizers (German Language Processing)
Qdrant, FAISS, ChromaDB, Haystack, Weaviate, JinaAI
PyTorch, PyTorch Lightning, Scikit-learn, MLflow, DVC, ONNX
Mistral OCR, Tesseract, HURIDOCS/pdf-document-layout-analysis, LayoutLMv3, PDFMiner, PyMuPDF, Textract, PaddleOCR
spaCy, NLTK, Doccano
Tableau, Matplotlib, Seaborn, Plotly
Pandas, NumPy
BeautifulSoup, Selenium, Playwright, Scrapy
AWS-EC2, AWS-Lambda, AWS-S3, Docker, Prefect
FastAPI, Celery, Redis, Flower
Chainlit, Chainforge, Streamlit, Gradio
SQL, PostgreSQL, MongoDB
Linux, Agile-Scrum
In addition to my professional work, I am passionate about giving back to the community and helping others on their career paths.
Volunteer teacher in Machine Learning at ReDI School, helping to bridge the gap in technology education.
If you would like a one-on-one session for career guidance, feel free to arrange a free session by sending an email to mustafa.gencc94@gmail.com.