MCA · AKTU Ghaziabad · NLP & Systems Research

Tathagat
Builds Language
From Scratch.

I pretrained a GPT-style transformer on consumer hardware, published empirical findings on context window degradation, and build full-stack AI systems end-to-end.

Ask me anything → · Read the paper · GitHub ↗
57.5M
Parameters — trained from scratch
150M
Tokens on a single RTX 3050
5
Novel empirical findings published
IJIRT
Published · March 2026
01 — Publication

Research Paper

IJIRT Vol.12 Issue 10 · March 2026 · ISSN 2349-6002
Context Window Degradation in a Resource-Constrained Language Model: Empirical Evidence from MiniLLM

An empirical study of context window degradation in MiniLLM, a 57.5M-parameter GPT-style transformer trained from scratch on ~150M tokens using a single NVIDIA RTX 3050 (6GB). Four fine-tuned checkpoints were evaluated under identical protocols, using positional recall probes and multi-turn perplexity measurements, yielding five novel findings about small-scale LLM behaviour.

Lost-in-the-middle at 57.5M — middle-position facts degrade 50% at 350 tokens. Smallest scale ever demonstrated.
Positional recall is QA-specific, not architectural. Non-QA checkpoints score 0% across all 180 probes.
Two-phase collapse: a perplexity spike, then a retreat to a fluency attractor. Repetition rate must also be tracked.
Effective context ≈ 100–150 tokens across all checkpoints despite 512-token nominal window.
NLP · Context Window · LLM from Scratch · PyTorch · Perplexity Evaluation · RAG Design · Transformers
Read full paper ↗
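
The two metrics named in the findings above, perplexity and repetition rate, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's evaluation code: `perplexity` and `repetition_rate` are hypothetical helper names, perplexity is taken as the exponential of the mean negative log-likelihood per token, and repetition rate is taken as the share of n-grams that duplicate an earlier n-gram. The paper's exact definitions may differ.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def repetition_rate(tokens: Sequence[str], n: int = 3) -> float:
    """Share of n-grams that repeat an earlier n-gram (0.0 means no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: nothing to repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A model that assigns probability 0.25 to every token has perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))

# A looping continuation can be fluent, yet its repetition rate is high.
looping = "the cat sat the cat sat the cat sat".split()
print(repetition_rate(looping, n=3))
```

Reporting both numbers side by side is one way to catch output that reads fluently but has started to loop.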
02 — Engineering

Selected Projects

PROJECT — 01 · 2024–2025
MiniLLM

GPT-style language model pretrained from scratch on 150M tokens, deployed as a production web app with real-time streaming.

  • 57.5M parameter transformer, fp16 + gradient accumulation, single RTX 3050
  • 4 fine-tuned personalities: QA, farming, storytelling, poetry
  • FastAPI + PostgreSQL + JWT auth + email verification + SSE streaming
  • React frontend: live token streaming, guest mode, Wikipedia context injection
PyTorch · FastAPI · React · PostgreSQL · HuggingFace
github.com/tathagat-git/MiniLLM ↗
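
Gradient accumulation, listed in the MiniLLM bullets above, is what makes a large effective batch fit on a small GPU. The sketch below shows only the arithmetic, on a toy one-parameter model in plain Python, and is not the MiniLLM training loop: `grad_mse` and `accumulated_step` are hypothetical names, and fp16 is left out. In PyTorch the same idea is usually written by dividing each micro-batch loss by the number of accumulation steps, calling `backward()` per micro-batch, and calling `optimizer.step()` once at the end.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def grad_mse(w: float, batch: Batch) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w: float, data: Batch, micro_batch_size: int, lr: float) -> float:
    """One optimizer step built from several micro-batch gradients."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    accum_steps = len(micro_batches)
    grad = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        grad += grad_mse(w, micro_batch) / accum_steps
    return w - lr * grad  # single weight update after accumulation


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
full_batch = 0.0 - 0.001 * grad_mse(0.0, data)
accumulated = accumulated_step(0.0, data, micro_batch_size=2, lr=0.001)
print(full_batch, accumulated)
```

The two updates match only when every micro-batch has the same size; with a ragged final micro-batch, the per-batch means would need weighting by sample count.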
PROJECT — 02 · 2024
RAG Pipeline

End-to-end retrieval-augmented generation system supporting PDF, DOCX, and TXT ingestion with fully local inference.

  • Recursive chunking, deduplication, all-MiniLM-L6-v2 embeddings
  • FAISS vector store with top-k similarity retrieval
  • Fully local — no external API required
LangChain · FAISS · HuggingFace · sentence-transformers
github.com/tathagat-git/RAG-Pipeline ↗
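
The deduplication and top-k retrieval steps listed above can be sketched without any dependencies. Everything here is a stand-in chosen for illustration: `embed` uses word counts instead of all-MiniLM-L6-v2 embeddings, a brute-force cosine scan replaces the FAISS index, and the function names are hypothetical. Recursive chunking is not shown.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(chunks: List[str]) -> List[str]:
    """Drop chunks that match an earlier chunk after case/whitespace folding."""
    seen, unique = set(), []
    for chunk in chunks:
        key = " ".join(chunk.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def top_k(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = dedupe([
    "FAISS stores dense vectors",
    "faiss  stores dense vectors",  # near-duplicate, removed by dedupe
    "Bananas are yellow",
    "Vectors are compared by cosine similarity",
])
print(top_k("how are vectors compared", chunks, k=1))
```

A vector index such as FAISS serves the same role as `top_k` here, but avoids re-scoring every chunk on each query.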
PROJECT — 03 · Aug 2025
Hand2Math

Deep learning app that converts handwritten mathematical expressions into LaTeX — real-time inference via FastAPI.

  • Custom CNN pipeline for handwritten math symbol classification
  • EasyOCR for image-to-text extraction
  • FastAPI REST endpoint for real-time inference
Deep Learning · OCR · FastAPI · Python
github.com/tathagat-git ↗
PROJECT — 04 · Jul 2025
Contact Extractor

Web app that auto-extracts structured contact info from PDFs, DOCX, and images using NLP + OCR.

  • spaCy NER for named entity extraction
  • EasyOCR for image-based text recognition
  • JSON output with timestamps via FastAPI
spaCy · EasyOCR · FastAPI · NLP
github.com/tathagat-git/Contact-Extractor ↗
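
To give a sense of what "JSON output with timestamps" could look like, here is a small sketch. It is not the project's code: regular expressions are swapped in for spaCy NER and EasyOCR so the example runs with the standard library alone, the patterns are deliberately simplistic, and the field names (`emails`, `phones`, `extracted_at`) are assumptions.

```python
import json
import re
from datetime import datetime, timezone

# Simplistic patterns for illustration; real-world contact data needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def extract_contacts(text: str) -> dict:
    """Return emails and phone numbers found in text, with a UTC timestamp."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }


sample = "Reach Asha at asha@example.com or +91 98765 43210."
print(json.dumps(extract_contacts(sample), indent=2))
```

An NER model adds what regular expressions cannot easily recover, such as person and organisation names.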
03 — Capabilities

Technical Skills

Generative AI & LLMs
LLM Pretraining (from scratch) · Fine-tuning & RLHF concepts · RAG Pipeline Design · Prompt Engineering · Perplexity Evaluation
AI / ML
Deep Learning · Transformers (GPT-style) · NLP & Text Processing · CNN / Computer Vision · Model Training & Evaluation
Libraries
PyTorch · HuggingFace Transformers · LangChain · FAISS · sentence-transformers · spaCy · EasyOCR · scikit-learn
Backend & Infrastructure
FastAPI · PostgreSQL · JWT Auth · SSE Streaming · Docker · AWS Cloud · React · GitHub · Python · SQL · JavaScript
04 — Background

Experience & Education

Feb 2026 – Mar 2026
Published Researcher
IJIRT — International Journal of Innovative Research in Technology
First-authored empirical paper on context window degradation in small-scale LLMs. Designed and ran full evaluation suite across 4 model checkpoints. Published in IJIRT Vol.12 Issue 10, ISSN 2349-6002.
May 2024 – Jun 2024
Data Analyst Intern
PUCHO Online · Patna, Bihar
Cleaned and structured 50,000+ records using Python and Excel. Built automated dashboards reducing manual reporting effort. Conducted EDA to improve data collection processes.
Sep 2024 – Aug 2026
MCA — Master of Computer Applications
Dr. A.P.J. Abdul Kalam Technical University (AKTU), Lucknow
Focus on AI/ML systems. Research group MCA25-26MNP1 at RD Engineering College, Ghaziabad.
Aug 2019 – Aug 2023
BCA — Bachelor of Computer Applications
Veer Kunwar Singh University (DK College) · 72.24%
Foundations in computer science, databases, and programming.

Ask My Portfolio AI

RAG-Powered · Gemini 1.5 Flash · Cloudflare Proxy

This chatbot has my full profile embedded as a knowledge base — resume, research paper findings, all projects. Ask it anything.

Hey! I'm Tathagat's portfolio AI. I know everything about his research, projects, and background. What would you like to know?

// Powered by Gemini · Secured via Cloudflare Worker · Knowledge base: resume + paper + projects

06 — Connect

Get in Touch

Actively looking for research internships at AI labs. If you're working on language models, NLP systems, or AI infrastructure — let's talk.

// WHAT I BRING
01 Hands-on LLM experience — actually pretrained a GPT model, not just fine-tuned a wrapper
02 Published first author — empirical research with novel findings on context degradation
03 Full-stack AI deployment — FastAPI, React, PostgreSQL, SSE, auth — production-ready systems
04 Resource-constrained mindset — built research-quality experiments on a 6GB consumer GPU