Memory Sparse Attention Breaks the 100M Token Barrier with Hybrid VRAM/RAM KV Cache Architecture
More to read
Llama 8B Rivals 70B on Multi-Hop QA Using Structured Prompting — No Fine-Tuning Required
Researcher Greedy-Teach1533 demonstrated that Llama 3.1 8B can match or exceed Llama 3.3 70B on multi-hop question answering benchmarks using two inference-time techniques: structured chain-of-thought prompting and 60% context compression via graph traversal. The experiments, conducted using Graph RAG (KET-RAG), revealed that retrieval is largely a solved problem — the answer is in the context 77–91% of the time — but reasoning remains the critical bottleneck, accounting for 73–84% of failures. The approach was validated across HotpotQA, MuSiQue, and 2WikiMultiHopQA (500 questions each) at roughly 12x lower cost than running the 70B model.
Reddit r/LocalLLaMA
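The post does not publish its exact template, but a minimal sketch of what structured chain-of-thought prompting for multi-hop QA can look like (the prompt wording and function names here are illustrative, not the author's):

```python
# Minimal sketch of a structured multi-hop CoT prompt. Illustrative only:
# the post's exact template and KET-RAG compression pipeline are not shown.

STRUCTURED_COT_TEMPLATE = """You are answering a multi-hop question.
Context:
{context}

Question: {question}

Work in explicit steps:
1. List the sub-questions this question decomposes into.
2. For each sub-question, quote the supporting sentence from the context.
3. Chain the sub-answers together.
4. Final answer (one short phrase): """

def build_prompt(question: str, passages: list[str]) -> str:
    # In the post, the context is first compressed ~60% via graph
    # traversal (KET-RAG); here we simply concatenate retrieved passages.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return STRUCTURED_COT_TEMPLATE.format(context=context, question=question)
```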
Chinese AI Model MiniMax M2.7 Actively Participated in Its Own Development
Chinese AI company MiniMax has released M2.7, a model that played an active role in its own development through autonomous optimization loops — updating its knowledge stores, building capabilities, and refining its own reward-based training. Over 100 autonomous rounds, M2.7 independently analyzed failures, adjusted code, and tested results, achieving a reported 30% performance boost on internal evaluations. The model competes closely with leading Western models like GPT-5.4 and Gemini 3.1 Pro across multiple benchmarks, and MiniMax envisions future AI self-evolution progressing toward full autonomy without human involvement.
The Decoder
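The article describes the optimization loop only at a high level. Purely as illustration, an analyze-adjust-test cycle of that shape could be sketched as follows (every name below is an invented placeholder, not MiniMax's pipeline):

```python
# Hypothetical sketch of the analyze -> adjust -> test loop the article
# describes. MiniMax has not published M2.7's actual pipeline; every
# object and method here is an invented placeholder.

def self_improvement_loop(model, eval_suite, rounds: int = 100):
    best_score = eval_suite.score(model)
    for _ in range(rounds):
        failures = eval_suite.failing_cases(model)   # analyze failures
        candidate = model.propose_patch(failures)    # adjust code/training
        score = eval_suite.score(candidate)          # test the results
        if score > best_score:                       # keep only improvements
            model, best_score = candidate, score
    return model, best_score
```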
Medical AI Bias Exposed: Automated Training Labels Make Models 66% Worse — And Benchmarks Miss It
A new study on fairness in medical AI for breast cancer tumor segmentation reveals that models perform significantly worse for younger patients — not simply because of higher breast density, but because younger patients' tumors are larger, more variable, and qualitatively harder to learn from. The research also uncovers that training on automated labels can amplify model bias by up to 40%, while standard benchmarks fail to detect this degradation due to a "biased ruler" effect — where biased labels used for evaluation mask true model performance. The paper has been accepted as an oral presentation at ISBI 2026.
Reddit r/MachineLearning
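A toy example makes the "biased ruler" effect concrete: a model that reproduces the automated labels' bias scores perfectly against those same labels, while expert labels reveal the true gap (the masks and metric below are illustrative, not the study's data):

```python
import numpy as np

def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice overlap between two binary segmentation masks."""
    inter = np.logical_and(pred, ref).sum()
    return 2 * inter / (pred.sum() + ref.sum())

# Toy 1-D "masks": an expert label vs. an automated label that
# systematically under-segments the tumor.
expert = np.array([0, 1, 1, 1, 1, 1, 0, 0], dtype=bool)
auto   = np.array([0, 0, 1, 1, 1, 0, 0, 0], dtype=bool)

# A model trained on automated labels tends to reproduce their bias.
model_pred = auto.copy()

print(dice(model_pred, auto))    # 1.00 -- biased ruler: looks perfect
print(dice(model_pred, expert))  # 0.75 -- true performance vs. expert label
```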
AI-Powered Smart Wheelchairs Aim to Autonomously Navigate Real-World Obstacles
Researchers at the German Research Center for Artificial Intelligence (DFKI) have developed prototype sensor-equipped smart wheelchairs capable of both semi-autonomous and fully autonomous navigation, using natural language commands, SLAM mapping, and drone-based cameras for obstacle detection. Presented at the CSUN Assistive Technology Conference in Anaheim, the project — called REXASI-PRO — integrates LiDAR, 3D cameras, and open-source navigation systems to guide users safely through complex environments. Experts in the field caution that cost, reliability in real-world conditions, and diverse user needs remain significant barriers to mainstream adoption.
IEEE Spectrum AI
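The article gives no implementation details. Purely as illustration, the semi-autonomous mode reduces to a sense-plan-act loop (everything below is an invented placeholder, not REXASI-PRO code):

```python
# Invented sketch of a sense-plan-act step for a semi-autonomous wheelchair.
# Not REXASI-PRO code: the project's actual stack (SLAM, LiDAR, 3D cameras,
# open-source navigation) is far more involved.

from dataclasses import dataclass

@dataclass
class Scan:
    min_obstacle_distance_m: float  # closest return from LiDAR / 3D camera

def plan_step(command: str, scan: Scan, safety_margin_m: float = 0.5) -> str:
    """Map a parsed natural-language command plus sensor data to an action."""
    if scan.min_obstacle_distance_m < safety_margin_m:
        return "stop"                 # obstacle inside the safety margin
    if command == "go to the kitchen":
        return "follow_planned_path"  # path would come from the SLAM map
    return "idle"

print(plan_step("go to the kitchen", Scan(min_obstacle_distance_m=2.0)))
```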
Qwen3.5-9B on MacBook M5 Pro Scores 93.8% on Home Security AI Benchmark — Just 4 Points Behind GPT-5.4
A new domain-specific benchmark called HomeSec-Bench v1 pits local LLMs against cloud models on 96 real-world home security AI tasks across 15 test suites. Running on a MacBook Pro M5 Pro with 64GB unified memory via llama.cpp, the Qwen3.5-9B model achieved a 93.8% pass rate — only 4.1 points behind GPT-5.4 and surpassing GPT-5.4-nano — all with zero API costs and full data privacy. The benchmark covers critical security workflows including threat classification, tool use, prompt injection resistance, and multi-camera event deduplication.
Hacker News Front Page
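A minimal harness for this kind of pass/fail benchmark over a local model could look like the sketch below, using the llama-cpp-python bindings (the model filename, task schema, and grading are assumptions, not HomeSec-Bench's actual code):

```python
# Minimal sketch of a pass/fail harness over a local llama.cpp model.
# Model path, task schema, and grading are assumptions, not HomeSec-Bench.
from llama_cpp import Llama

llm = Llama(model_path="qwen3.5-9b-q4_k_m.gguf", n_ctx=8192, verbose=False)

tasks = [
    {"prompt": "Classify this camera event: person climbing a backyard "
               "fence at 3am. Answer 'threat' or 'benign'.",
     "expect": "threat"},
]

passed = 0
for task in tasks:
    out = llm(task["prompt"], max_tokens=16, temperature=0.0)
    answer = out["choices"][0]["text"].strip().lower()
    passed += task["expect"] in answer  # simple substring grading

print(f"pass rate: {passed / len(tasks):.1%}")
```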
Qualcomm Shrinks AI Reasoning Chains by 2.4x to Run Thinking Models Directly on Smartphones
Qualcomm AI Research has developed a modular framework that enables reasoning-capable language models to run locally on smartphones by compressing verbose chain-of-thought outputs by an average factor of 2.4x — and up to 8x in some cases — using reinforcement learning. The system builds on a base model (Qwen2.5-7B-Instruct) enhanced with LoRA adapters and 4-bit weight compression, allowing it to switch between fast chat and deep reasoning modes while keeping sensitive data on-device. Despite the impressive technical achievement, deep system integration with apps like email and calendars still relies on cloud-based models in practice.
The Decoder
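Conceptually, compressing chain-of-thought with reinforcement learning means rewarding correct answers while penalizing reasoning length. A toy reward of that shape is sketched below (the coefficient and formula are illustrative, not Qualcomm's published objective):

```python
# Toy length-penalized RL reward for chain-of-thought compression.
# Qualcomm's exact objective is not published; the shape and coefficient
# below are illustrative only.

def cot_reward(answer_correct: bool, n_reasoning_tokens: int,
               baseline_tokens: int, length_coeff: float = 0.5) -> float:
    correctness = 1.0 if answer_correct else 0.0
    # Reward shrinking the chain relative to the uncompressed baseline,
    # but only when the answer stays correct.
    compression = 1.0 - n_reasoning_tokens / baseline_tokens
    return correctness + length_coeff * compression * correctness

# A correct answer using 250 of an originally 600-token reasoning chain:
print(cot_reward(True, 250, 600))  # ~1.29
```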