MCA · AKTU Ghaziabad · NLP & Systems Research

Tathagat
Builds Language
From Scratch.

I have pretrained a GPT-style transformer on consumer hardware and published empirical findings on context window degradation, and I build full-stack AI systems end to end.

Ask me anything → · Read the paper · GitHub ↗

57.5M

Parameters — trained from scratch

150M

Tokens on single RTX 3050

Novel empirical findings published

IJIRT

Published · March 2026

01 — Publication

Research Paper

● IJIRT Vol.12 Issue 10 · March 2026 · ISSN 2349-6002

Context Window Degradation in a Resource-Constrained Language Model: Empirical Evidence from MiniLLM

An empirical study of context window degradation in MiniLLM, a 57.5M-parameter GPT-style transformer trained from scratch on ~150M tokens using a single NVIDIA RTX 3050 (6GB). Four fine-tuned checkpoints were evaluated under an identical protocol covering positional recall probes and multi-turn perplexity measurements, yielding five novel findings about small-scale LLM behaviour.
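As an illustration of what a positional recall probe can look like, here is a minimal sketch. It is not the harness from the paper: the helper names `build_probe` and `recalled`, the filler text, the prompt layout, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for this example.

```python
def build_probe(fact: str, question: str, filler: str, n_words: int, position: str) -> str:
    """Embed `fact` at the start, middle, or end of roughly n_words of filler text."""
    base = filler.split()
    # Repeat the filler until it is long enough, then trim to n_words.
    repeats = n_words // max(len(base), 1) + 1
    words = (base * repeats)[:n_words]
    offsets = {"start": 0, "middle": len(words) // 2, "end": len(words)}
    idx = offsets[position]
    context = words[:idx] + fact.split() + words[idx:]
    # Hypothetical prompt layout: context, then a question, then an answer cue.
    return " ".join(context) + "\nQ: " + question + "\nA:"


def recalled(answer: str, expected: str) -> bool:
    """Count a probe as a hit if the expected string appears in the model's answer."""
    return expected.lower() in answer.lower()
```

Running the same fact through all three positions at several context lengths, then averaging `recalled` per position, gives a position-by-length recall grid of the kind the findings below refer to.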

◈

Lost-in-the-middle at 57.5M parameters: middle-position facts degrade 50% at 350 tokens. To my knowledge, the smallest scale at which this effect has been demonstrated.

◈

Positional recall is QA-specific, not architectural. Non-QA checkpoints score 0% across all 180 probes.

◈

Two-phase collapse: perplexity spike then fluency attractor retreat. Repetition rate must also be tracked.

◈

Effective context ≈ 100–150 tokens across all checkpoints despite 512-token nominal window.
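The two metrics named in these findings can be sketched as follows. These are the standard textbook formulations, not necessarily the exact definitions used in the paper: perplexity as the exponential of the mean negative log-likelihood, and repetition rate as the share of n-grams that repeat an earlier n-gram. The function names and the default `n=3` are assumptions for this example.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def repetition_rate(tokens: list[str], n: int = 3) -> float:
    """Fraction of n-grams in `tokens` that duplicate an n-gram seen earlier."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

Tracking both matters because a model that falls back into a fluent loop can score a low perplexity while producing almost no new content, which only the repetition rate exposes.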

NLP · Context Window · LLM from Scratch · PyTorch · Perplexity Evaluation · RAG Design · Transformers

Read full paper ↗

02 — Engineering

Selected Projects

PROJECT — 01 · 2024–2025

MiniLLM

GPT-style language model pretrained from scratch on 150M tokens, deployed as a production web app with real-time streaming.

57.5M parameter transformer, fp16 + gradient accumulation, single RTX 3050
4 fine-tuned personalities: QA, farming, storytelling, poetry
FastAPI + PostgreSQL + JWT auth + email verification + SSE streaming
React frontend: live token streaming, guest mode, Wikipedia context injection
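Gradient accumulation is what makes a larger effective batch fit in 6GB of memory. The sketch below shows the underlying arithmetic in plain Python on a toy one-parameter model; it is not the MiniLLM training loop, it leaves out fp16 entirely, and the helper names and data are made up for illustration.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: list[tuple[float, float]], micro_batch_size: int) -> float:
    """Sum micro-batch gradients, each scaled by 1 / number_of_micro_batches.

    Assumes len(batch) is divisible by micro_batch_size, so every
    micro-batch carries equal weight in the average.
    """
    chunks = [batch[i:i + micro_batch_size] for i in range(0, len(batch), micro_batch_size)]
    return sum(grad_mse(w, chunk) / len(chunks) for chunk in chunks)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
```

Here `full` and `accumulated` agree up to floating-point error. In a PyTorch loop the same scaling is usually done by dividing each micro-batch loss by the number of accumulation steps before calling `backward()`, and calling `optimizer.step()` only once per accumulated batch.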

PyTorch · FastAPI · React · PostgreSQL · HuggingFace

github.com/tathagat-git/MiniLLM ↗

PROJECT — 02 · 2024

RAG Pipeline

End-to-end retrieval-augmented generation system supporting PDF, DOCX, and TXT ingestion with fully local inference.

Recursive chunking, deduplication, all-MiniLM-L6-v2 embeddings
FAISS vector store with top-k similarity retrieval
Fully local — no external API required
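The deduplication and top-k retrieval steps can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: hand-written toy vectors stand in for all-MiniLM-L6-v2 embeddings, a brute-force scan stands in for the FAISS index, cosine similarity is assumed as the metric, and the function names are hypothetical.

```python
import hashlib
import math


def dedupe(chunks: list[str]) -> list[str]:
    """Drop chunks whose case- and whitespace-normalised text was already seen."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha256(" ".join(chunk.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k chunk vectors most similar to the query, best first."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

A vector index such as FAISS replaces the linear scan in `top_k` once the number of chunks grows; the retrieved chunk indices are then used to assemble the context passed to the local model.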

LangChain · FAISS · HuggingFace · sentence-transformers

github.com/tathagat-git/RAG-Pipeline ↗

PROJECT — 03 · Aug 2025

Hand2Math

Deep learning app that converts handwritten mathematical expressions into LaTeX — real-time inference via FastAPI.

Custom CNN pipeline for handwritten math symbol classification
EasyOCR for image-to-text extraction
FastAPI REST endpoint for real-time inference

Deep Learning · OCR · FastAPI · Python

github.com/tathagat-git ↗

PROJECT — 04 · Jul 2025

Contact Extractor

Web app that auto-extracts structured contact info from PDFs, DOCX, and images using NLP + OCR.

spaCy NER for named entity extraction
EasyOCR for image-based text recognition
JSON output with timestamps via FastAPI

spaCy · EasyOCR · FastAPI · NLP

github.com/tathagat-git/Contact-Extractor ↗

03 — Capabilities

Technical Skills

Generative AI & LLMs

LLM Pretraining (from scratch) · Fine-tuning & RLHF concepts · RAG Pipeline Design · Prompt Engineering · Perplexity Evaluation

AI / ML

Deep Learning · Transformers (GPT-style) · NLP & Text Processing · CNN / Computer Vision · Model Training & Evaluation

Libraries

PyTorch · HuggingFace Transformers · LangChain · FAISS · sentence-transformers · spaCy · EasyOCR · scikit-learn

Backend & Infrastructure

FastAPI · PostgreSQL · JWT Auth · SSE Streaming · Docker · AWS Cloud · React · GitHub · Python · SQL · JavaScript

04 — Background

Experience & Education

Feb 2026 – Mar 2026

Published Researcher

IJIRT — International Journal of Innovative Research in Technology

First-authored empirical paper on context window degradation in small-scale LLMs. Designed and ran full evaluation suite across 4 model checkpoints. Published in IJIRT Vol.12 Issue 10, ISSN 2349-6002.

May 2024 – Jun 2024

Data Analyst Intern

PUCHO Online · Patna, Bihar

Cleaned and structured 50,000+ records using Python and Excel. Built automated dashboards reducing manual reporting effort. Conducted EDA to improve data collection processes.

Sep 2024 – Aug 2026

MCA — Master of Computer Applications

Dr. A.P.J. Abdul Kalam Technical University (AKTU), Lucknow

Focus on AI/ML systems. Research group MCA25-26MNP1 at RD Engineering College, Ghaziabad.

Aug 2019 – Aug 2023

BCA — Bachelor of Computer Applications

Veer Kunwar Singh University (DK College) · 72.24%

Foundations in computer science, databases, and programming.

05 — Interactive

Ask My Portfolio AI

RAG-Powered · Gemini 1.5 Flash · Cloudflare Proxy

This chatbot has my full profile embedded as a knowledge base — resume, research paper findings, all projects. Ask it anything.

Hey! I'm Tathagat's portfolio AI. I know everything about his research, projects, and background. What would you like to know?

// Powered by Gemini · Secured via Cloudflare Worker · Knowledge base: resume + paper + projects

06 — Connect

Get in Touch

Actively looking for research internships at AI labs. If you're working on language models, NLP systems, or AI infrastructure — let's talk.

✉

tathagat615@gmail.com · Primary email

⌥

github.com/tathagat-git · All projects & code

↗

linkedin.com/in/tathagat-tathagat3460 · LinkedIn profile

◈

IJIRT Publication · Context Window Degradation paper

// WHAT I BRING

01 Hands-on LLM experience — actually pretrained a GPT model, not just fine-tuned a wrapper

02 Published first author — empirical research with novel findings on context degradation

03 Full-stack AI deployment — FastAPI, React, PostgreSQL, SSE, auth — production-ready systems

04 Resource-constrained mindset — built research-quality experiments on a 6GB consumer GPU