Paper Trail
I built this RAG agent as a fact checker for myself. arXiv is a huge repository of preprints and a great source for reading about computer science, math, and physics. I've included a hardcoded demo of a real generation (since API calls aren't free).
Highlights
- Breaks claims into sub-claims, each with its own verdict and confidence score.
- When a sub-claim (e.g. too recent, commercial) can’t be verified directly, an analogous search is done to find similar ideas.
- Opinions are also categorized independently.
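The decomposition described above can be sketched as a small data model. This is a hypothetical sketch; the field names are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical data model for the sub-claim decomposition;
# names and fields are illustrative, not the project's real schema.
@dataclass
class SubClaim:
    text: str
    kind: Literal["factual", "opinion"] = "factual"   # opinions categorized separately
    verdict: Literal["supported", "refuted", "unverified"] = "unverified"
    confidence: float = 0.0                 # 0.0-1.0, assigned per sub-claim
    used_analogous_search: bool = False     # set when no direct evidence was found

@dataclass
class ClaimReport:
    claim: str
    sub_claims: list[SubClaim] = field(default_factory=list)

report = ClaimReport(
    claim="Transformers struggle with symbolic reasoning.",
    sub_claims=[
        SubClaim(text="LLMs show brittleness on symbolic tasks",
                 verdict="supported", confidence=0.8),
    ],
)
print(report.sub_claims[0].verdict)  # supported
```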
Tech Stack
- Python + FastAPI - Serves the pipeline
- LangGraph - Orchestration (categorizes the sub-claims non-linearly)
- Gemini 2.5 Flash - Synthesizes the output
- Chroma - Vector database
- all-MiniLM-L6-v2 - Sentence transformer (embeddings)
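To illustrate what the Chroma + all-MiniLM-L6-v2 pair does, here is a dependency-free sketch of embedding retrieval by cosine similarity. The toy 4-dimensional vectors stand in for the 384-dimensional sentence embeddings the real model produces:

```python
import math

# Toy "embeddings" stand in for the 384-dim vectors all-MiniLM-L6-v2
# produces; Chroma performs this nearest-neighbour search at scale.
corpus = {
    "paper_a": [0.9, 0.1, 0.0, 0.2],   # e.g. a symbolic-reasoning paper
    "paper_b": [0.1, 0.8, 0.3, 0.0],   # e.g. a chain-of-thought paper
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

query = [0.85, 0.15, 0.05, 0.1]        # embedding of one sub-claim
best = max(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]))
print(best)  # paper_a
```

In the actual pipeline, Chroma handles the storage and search and the sentence transformer handles the embedding; the math above is just what those two pieces amount to.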
Notes
- First project dealing with NLP.
- During testing, I tried the same claim worded differently, and I was surprised to see different papers getting pulled every time.
- Results could be more consistent between runs.
- Checking claims about AI I see on social media always produces interesting results.
- I wonder how I can speed up the runtime; a run takes 60-90s on average…
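One likely speedup (an assumption on my part, not something the project does yet) is verifying sub-claims concurrently rather than sequentially, since each verification is mostly waiting on network I/O to the retrieval and LLM APIs. A minimal sketch with asyncio, where `verify_subclaim` is a hypothetical stand-in for the real retrieval + Gemini call:

```python
import asyncio

# Hypothetical stand-in for the real per-sub-claim retrieval + LLM call.
# asyncio.gather overlaps the network waits, so total latency approaches
# the slowest single sub-claim instead of the sum of all of them.
async def verify_subclaim(text: str) -> str:
    await asyncio.sleep(0.1)  # simulated retrieval + LLM latency
    return f"{text}: unverified"

async def verify_all(sub_claims: list[str]) -> list[str]:
    return await asyncio.gather(*(verify_subclaim(s) for s in sub_claims))

results = asyncio.run(verify_all(["claim A", "claim B", "claim C"]))
print(len(results))  # 3
```

Whether this helps in practice depends on how much of the 60-90s is API latency versus local embedding work.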
Demo
Sub-claim analysis
Academic literature indicates that transformer-based models, including leading MLLMs and LLMs, exhibit brittleness on tasks requiring stable symbolic manipulation and demonstrate struggles with visual-mathematical reasoning, recursive program synthesis, and certain mathematical reasoning benchmarks. These models show characteristic failure modes, such as variable confusion and inconsistency, on tasks that involve logical relations and mathematical abstraction.
Sources (3)
Chain-of-thought prompting is described as significantly enhancing the reasoning capability and improving multi-step problem-solving in large language models. These models are commonly understood to be Transformer-based, and the demonstrated improvements in reasoning potential confirm its purpose in addressing multi-step problems.