SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 14, 2026

Neural Architecture Search (NAS) has a runtime problem. A comprehensive 2023 survey of 1,000 NAS papers (arXiv:2301.08727) documents the dominant pattern: grid search or evolutionary search over a predefined architecture space, training each candidate to partial completion, measuring proxy metrics, and selecting the best. The compute bill for a competitive NAS run on ImageNet is measured in GPU-days to GPU-weeks. The search space itself requires expert definition. And the output is a single architecture, not a growing corpus of diverse alternatives.

NNGPT (ABrain-One/nn-gpt, MIT, arXiv:2511.20333) is an open-source framework that reimagines this pipeline: instead of searching over an architecture space, an LLM generates complete executable training specifications from a single prompt. Instead of discarding the evaluation result, every generated network that runs successfully gets incorporated back into the system's training data. The loop is closed: generation, execution, and self-improvement run continuously.

The key numbers from the paper: NN-RAG (retrieval-augmented architecture synthesis) achieves 73% executability on 1,289 generation targets. The code-aware accuracy predictor reaches RMSE 0.14 with Pearson r=0.78, accurate enough to stand in for full training runs when making short-horizon go/no-go decisions. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna's 0.64. One-shot prediction matches search-based AutoML on common datasets, reducing the number of required trials from hundreds to one. Over 10,000 validated models have been generated (5,000+ through the self-improving LLM loop), all incorporated into LEMUR with verified outcomes.

This newsletter dissects NNGPT as a systems document: what the five integrated pipelines do and how they compose, how the LEMUR dataset grounds generation in runnable PyTorch code, what hash-based deduplication prevents in a continuous generation loop, and why the code-aware predictor matters more than the generation accuracy.

Scope: NNGPT architecture (five pipelines, LEMUR dataset, NN-RAG, HPO, accuracy predictor, RL loop), DeepSeek Coder 7B as the base model, LoRA fine-tuning on LEMUR, and the LangGraph multi-agent orchestration mode. Not covered: NNGPT's broader ABrain ecosystem beyond LEMUR, or the extended comparisons in the Supplementary Material.

What It Actually Does

NNGPT is an AutoML engine that uses a fine-tuned LLM to generate complete PyTorch training specifications (model architecture, data transforms, metrics, optimizer, schedule) from a single natural language prompt, executes them, and uses the execution logs to continuously improve the underlying model.

Five integrated pipelines (all operating within one closed loop):

| Pipeline | Function | Key result |
| --- | --- | --- |
| Zero-shot architecture synthesis | LLM generates complete training spec from prompt | Runnable PyTorch code, not abstract templates |
| Hyperparameter optimization (HPO) | LLM recommends hyperparameters | RMSE 0.60 vs. Optuna's 0.64 |
| Code-aware accuracy predictor | Predicts final accuracy + early-stop epoch | RMSE 0.14, Pearson r=0.78 |
| NN-RAG (retrieval-augmented synthesis) | RAG over LEMUR corpus for scope-closed generation | 73% executability on 1,289 targets |
| RL-based improvement loop | LoRA/RL updates from execution logs | 5,000+ validated models from self-improvement |

Base model: DeepSeek Coder 7B, a code generation model pretrained on 2 trillion tokens. Fine-tuned on LEMUR with LoRA (rank r=32, alpha=32, ~35M trainable parameters, ~0.5% of base model).
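
For readers who want to see what that adapter budget looks like in code, here is a minimal configuration sketch using the peft library. The paper confirms only the rank and alpha; the target module list and dropout value below are assumptions for illustration.

# Minimal LoRA setup sketch matching the reported r=32, alpha=32 budget.
# ASSUMPTION: which modules are adapted is not stated here; attention projections
# are a common default and are used purely for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-7b-instruct-v1.5", torch_dtype="auto"
)
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,  # assumed, not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # on the order of tens of millions, ~0.5% of 7B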

LEMUR dataset: An audited corpus of executable neural programs with unified preprocessing and reproducible metrics. All generated architectures that run successfully are incorporated back into LEMUR, growing the corpus continuously.
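
The exact record schema is not reproduced in this write-up; below is a minimal sketch of what a LEMUR entry plausibly carries, inferred from the fields referenced elsewhere in the article (name, model_code, transforms, metrics, accuracy, hash). The real schema may differ.

# Hypothetical LEMUR entry, inferred from fields mentioned in the paper summary.
from dataclasses import dataclass

@dataclass
class LemurEntry:
    name: str                 # e.g. "generated_cnn_d4f7a92b"
    model_code: str           # runnable PyTorch source for the architecture
    transforms: list[str]     # data augmentation / preprocessing pipeline
    optimizer: dict           # e.g. {"type": "AdamW", "lr": 3e-4, "weight_decay": 1e-4}
    metrics: dict             # per-epoch train/test curves with unified evaluation
    accuracy: float           # final verified test accuracy
    code_hash: str            # hash of normalized model_code, used for deduplication
    parameter_count: int = 0  # used by the efficiency bonus in the RL reward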

The system has generated and trained over 10,000 distinct models, all with verified outcomes. This is not a benchmark paper. The corpus exists.

The Architecture, Unpacked

The element to focus on is the closed loop: the LLM that generates architectures is fine-tuned on the outcomes of those architectures. Every successful run makes future generation slightly more accurate. Every failure is a negative reward signal. At 10,000+ models generated, the corpus feedback has become the dominant training signal.
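
A minimal sketch of that loop, written against the functions introduced in the snippets below; the orchestration details (function names, control flow) are illustrative assumptions, not the repository's implementation.

# Illustrative outer loop: generate -> deduplicate -> execute -> predict -> update.
# All called functions are placeholders defined (or sketched) later in this article.
def closed_loop_iteration(task: str) -> None:
    spec = generate_training_spec(task)          # Snippet One: LLM + NN-RAG + dedup
    if spec is None:
        return                                   # schema failure or duplicate: no run

    result = execute_training(spec)              # run the spec, collect logs and metrics
    reward = rl_reward(spec, result)             # Snippet Two: reward from the outcome

    if result["status"] != "error" and not result["is_duplicate"]:
        lemur_api.add_entry(spec, result)        # grow the corpus with a verified outcome

    apply_lora_update(spec, reward)              # reward-weighted fine-tuning step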

The Code, Annotated

Snippet One: Training Spec Generation with NN-RAG and Schema Validation

# Reconstructed from NNGPT architecture documentation and paper description
# Source: github.com/ABrain-One/nn-gpt, arXiv:2511.20333

import json
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
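
# NOTE: lemur_api, extract_json, and hash_model_code used below are placeholder
# interfaces for LEMUR retrieval and small local helpers; they are illustrative
# stand-ins, not verbatim identifiers from the nn-gpt repository.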

# DeepSeek Coder 7B fine-tuned on LEMUR via LoRA
# r=32, alpha=32 → ~35M trainable params out of ~7B = 0.5% of base model
# ← small LoRA rank is sufficient because LEMUR is a domain-specific corpus:
#   the model only needs to learn PyTorch architecture patterns, not general coding

BASE_MODEL = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"
LORA_ADAPTER = "abrain-one/nngpt-lora-v2"  # LEMUR-fine-tuned adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto"),
    LORA_ADAPTER,
)

# LEMUR API retrieval: fetch similar architectures for NN-RAG context
def retrieve_similar_architectures(task_description: str, k: int = 3) -> list[dict]:
    """
    Query the LEMUR corpus for k architectures most similar to the task.
    Used to construct the few-shot portion of the prompt.

    ← THIS is the trick for NN-RAG: "scope-closed" generation
    Instead of generating arbitrary PyTorch code, the LLM generates code
    that is grounded in real, validated LEMUR architectures.
    This is why executability reaches 73% — the model generates within
    a known-good code vocabulary rather than inventing new patterns.
    """
    # LEMUR API returns: {name, model_code, transforms, metrics, accuracy}
    return lemur_api.search(task_description, limit=k)

def generate_training_spec(
    task: str,
    k_shot: int = 3,
    temperature: float = 0.6,
) -> dict | None:
    """
    Generate a complete training specification from a natural language task.
    Returns None if deduplication check fails (duplicate of existing LEMUR entry).
    """
    # Retrieve similar architectures for few-shot context
    similar = retrieve_similar_architectures(task, k=k_shot)

    # Construct prompt: task description + k few-shot examples
    few_shot_block = "\n\n".join([
        f"# Example architecture for: {ex['name']}\n{ex['model_code']}"
        for ex in similar
    ])

    prompt = f"""
# Task: {task}

# Reference architectures from LEMUR corpus:
{few_shot_block}

# Generate a complete PyTorch training specification for the above task.
# Output must be a valid JSON training spec with keys:
# model_code, transforms, optimizer, lr_schedule, epochs, metrics
# The model_code must be runnable PyTorch, not pseudocode.
"""

    # Generation parameters from paper:
    # temperature=0.6 (lower than default: code needs syntactic correctness)
    # top_k=50, top_p=0.95 (nucleus sampling for diversity)
    # max_tokens=65,536 reported in the paper (long context for full model definitions);
    # capped at 4096 new tokens here, enough for a single training spec
    outputs = model.generate(
        tokenizer(prompt, return_tensors="pt").input_ids,
        temperature=0.6,
        top_k=50,
        top_p=0.95,
        max_new_tokens=4096,
        do_sample=True,
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Parse and validate JSON schema
    try:
        spec = json.loads(extract_json(generated))
    except json.JSONDecodeError:
        return None  # generation failed schema validation

    # ← Deduplication: hash the model_code and check against LEMUR
    # Without this, the model generates small variations of the same architecture
    # (whitespace changes, comment variations) that waste training compute
    # The paper notes this "saves hundreds of redundant training runs"
    code_hash = hash_model_code(spec['model_code'])
    if lemur_api.hash_exists(code_hash):
        return None  # duplicate: skip execution, save GPU time

    return spec

The temperature=0.6 choice is significant. Standard creative generation uses 0.7-1.0. Code generation needs lower temperature because syntactic errors make the output unusable regardless of semantic quality. At 0.6, the model balances architectural diversity (via nucleus sampling) with syntactic correctness. The paper's 73% executability rate is partly a consequence of this temperature calibration.

Snippet Two: Code-Aware Accuracy Predictor and RL Reward Loop

# Code-aware predictor: predicts final accuracy from code + early training metrics
# This is the component that makes "one-shot AutoML" practical:
# instead of training every candidate to completion, the predictor
# stops training early for candidates that won't reach target accuracy

import torch
import torch.nn as nn
from transformers import AutoModel

class CodeAwareAccuracyPredictor(nn.Module):
    """
    Predicts final model accuracy from:
    - model_code: the PyTorch source code of the architecture
    - early_metrics: training loss/accuracy at epochs 1-5

    ← THIS is the architectural insight: code structure contains information
    about final accuracy that is not captured by early training metrics alone.
    A Transformer with attention can see: "this architecture has a bottleneck
    at the penultimate layer that will cause gradient vanishing at epoch 20."
    A purely metric-based predictor cannot see this from epoch-5 curves.

    Result: RMSE 0.14, Pearson r=0.78
    Compare: random predictor RMSE ~0.3, early-metric-only predictor RMSE ~0.19
    """

    def __init__(self, code_encoder_model: str = "microsoft/codebert-base"):
        super().__init__()
        # CodeBERT encodes the PyTorch source code
        self.code_encoder = AutoModel.from_pretrained(code_encoder_model)
        # Simple MLP head for metric encoding
        self.metric_encoder = nn.Sequential(
            nn.Linear(10, 64),  # 10 early metrics: loss + acc × 5 epochs
            nn.ReLU(),
            nn.Linear(64, 128),
        )
        # Fusion: code embedding (768-dim) + metric embedding (128-dim) → prediction
        self.predictor = nn.Sequential(
            nn.Linear(768 + 128, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 2),  # output: [predicted_final_accuracy, stop_epoch]
        )

    def forward(self, model_code: str, early_metrics: list[float]) -> tuple[float, int]:
        code_tokens = tokenize_code(model_code)
        code_emb = self.code_encoder(**code_tokens).last_hidden_state[:, 0]  # CLS
        metric_emb = self.metric_encoder(torch.tensor(early_metrics).unsqueeze(0))
        fused = torch.cat([code_emb, metric_emb], dim=-1)
        pred = self.predictor(fused)
        return float(pred[0, 0]), int(pred[0, 1])


def rl_reward(spec: dict, execution_result: dict) -> float:
    """
    Compute reward signal for the RL-based improvement loop.

    The reward function encodes what "good architecture generation" means:
    - Executability is the floor: code that errors gets negative reward
    - Accuracy is the ceiling: better accuracy gets stronger positive signal
    - Novelty matters: duplicate architectures get zero reward
    - Efficiency is a soft bonus: lower parameter count for same accuracy

    ← The reward function IS the research philosophy of the system.
    It defines what counts as a "better" architecture generator.
    """
    if execution_result['status'] == 'error':
        return -1.0  # execution failure: strong negative signal

    if execution_result['is_duplicate']:
        return 0.0   # duplicate: no learning signal (already in LEMUR)

    # Accuracy bonus: sigmoid-scaled to prevent reward hacking at high accuracy
    accuracy = execution_result.get('test_accuracy', 0.0)
    accuracy_reward = torch.sigmoid(torch.tensor(accuracy * 10 - 5)).item()

    # Efficiency bonus: reward architectures that are accurate AND small
    param_count = execution_result.get('parameter_count', float('inf'))
    target_params = spec.get('target_parameters', 1e6)
    efficiency_bonus = 0.1 if param_count <= target_params else 0.0

    return accuracy_reward + efficiency_bonus

The code-aware predictor's RMSE of 0.14 vs. a metric-only predictor's ~0.19 shows that architectural structure (the code itself) carries predictive signal beyond what early training curves reveal. This is the technical contribution that justifies the code encoder: for deciding whether to spend compute on a full training run, reading the source code is valuable information.
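
The snippet above shows inference only; training the predictor is a straightforward regression. A minimal sketch, assuming LEMUR exposes completed runs as (model_code, early_metrics, final_accuracy, best_epoch) records and reusing the tokenize_code placeholder:

# Illustrative regression training for the code-aware predictor.
# ASSUMPTION: the record layout below is inferred, not the paper's exact format.
import torch
import torch.nn as nn

def forward_raw(predictor, model_code, early_metrics):
    # Same computation as CodeAwareAccuracyPredictor.forward, but keeps the raw tensor
    # so gradients can flow (the snippet's forward casts to Python scalars for inference).
    code_tokens = tokenize_code(model_code)
    code_emb = predictor.code_encoder(**code_tokens).last_hidden_state[:, 0]
    metric_emb = predictor.metric_encoder(torch.tensor(early_metrics).unsqueeze(0))
    return predictor.predictor(torch.cat([code_emb, metric_emb], dim=-1))

def train_predictor(predictor, lemur_runs, epochs=10, lr=1e-4):
    opt = torch.optim.AdamW(predictor.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for run in lemur_runs:
            target = torch.tensor([[run["final_accuracy"], float(run["best_epoch"])]])
            pred = forward_raw(predictor, run["model_code"], run["early_metrics"])
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()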

NNGPT in Action: End-to-End Worked Example

Task: Generate a lightweight CNN for CIFAR-10 classification with under 2 million parameters.

Step 1: LEMUR retrieval (NN-RAG, k=3)

Retrieved from LEMUR corpus:
  1. MobileNetV2-lite: 1.4M params, 91.2% CIFAR-10 test accuracy
  2. ShuffleNetV2: 1.8M params, 90.7% CIFAR-10 test accuracy
  3. EfficientNet-B0-mini: 1.6M params, 91.8% CIFAR-10 test accuracy

Retrieval time: ~0.3 seconds
Context tokens: ~2,800 (3 architecture examples)

Step 2: LLM generation (DeepSeek Coder 7B + LEMUR LoRA)

Prompt tokens: ~3,200 (task + 3 few-shot examples)
Generation time: ~4.2 seconds on A100
Output: complete training spec (JSON, ~800 tokens)
# Generated model code (representative output):
class GeneratedCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            # Depthwise separable conv (learned from MobileNetV2 context)
            nn.Conv2d(128, 128, 3, padding=1, groups=128), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)
        # ~1.87M parameters in the full generated spec (remaining blocks elided above) ← within constraint

# Generated HPO:
# optimizer: AdamW, lr=3e-4, weight_decay=1e-4
# schedule: cosine annealing, T_max=100 epochs
# transforms: RandomCrop(32, padding=4), RandomHorizontalFlip, Normalize

Step 3: Hash deduplication check

Code hash: sha256("class GeneratedCNN...") → d4f7a92b...
LEMUR check: hash NOT in corpus → proceed to execution
(If hash matched: skip, generate new → saves ~4 hours of training)

Step 4: Execution

Training on CIFAR-10 (RTX 3090, batch_size=128):
  Epoch 1:  train_loss=1.847, train_acc=0.341, test_acc=0.412
  Epoch 3:  train_loss=1.124, train_acc=0.621, test_acc=0.683
  Epoch 5:  train_loss=0.891, train_acc=0.741, test_acc=0.779

Step 5: Code-aware accuracy predictor

Inputs: model_code + early_metrics[0:5]
Prediction: final_test_accuracy ≈ 0.887 ± 0.02
             recommended_stop_epoch = 67 (of 100)
Actual (if trained to completion): 0.891 at epoch 72
Predictor error: |0.887 - 0.891| = 0.004  ← within RMSE 0.14 bound
Decision: train to completion (prediction above 0.85 target threshold)

Step 6: LEMUR update and RL reward

Result: test_accuracy=0.891, parameters=1.87M
Hash stored in LEMUR corpus
RL reward: sigmoid(0.891 × 10 - 5) + 0.1 (efficiency bonus) ≈ 0.980 + 0.1 = 1.08

LoRA update applied to DeepSeek Coder 7B:
  Loss = -log_prob(generated_spec) × reward
  This increases likelihood of similar architecture choices for similar prompts.

New LEMUR entry: {name: "generated_cnn_d4f7a92b", accuracy: 0.891, params: 1.87M}
Available for future NN-RAG retrievals.
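
The loss line above can be read as a single REINFORCE-style step over the LoRA parameters. A minimal sketch, assuming the update is a reward-weighted negative log-likelihood over the generated spec tokens; the actual RL procedure (batching, baselines, any clipping) is not specified here.

# Illustrative reward-weighted update of the LoRA adapter.
# ASSUMPTION: this shows only the loss form quoted above, not the full training recipe.
import torch

def apply_lora_update(model, tokenizer, prompt, generated_spec_text, reward, optimizer):
    full = tokenizer(prompt + generated_spec_text, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    labels = full.input_ids.clone()
    labels[:, :prompt_len] = -100          # score only the generated spec tokens

    out = model(**full, labels=labels)     # out.loss = mean NLL of the spec tokens
    loss = out.loss * reward               # reward-weighted -log_prob(generated_spec)

    optimizer.zero_grad()
    loss.backward()                        # gradients flow only into the LoRA parameters
    optimizer.step()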

Total wall clock time:

Retrieval:          0.3s
LLM generation:     4.2s
Deduplication:      0.1s
Training (full):    3.4 hours
Predictor:          0.2s

With predictor early-stopping (if accuracy below threshold):
  Training: 67 epochs × 2.1 min/epoch = 2.4 hours (saves 1 hour)
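
The early-stopping decision behind that saving reduces to a small rule over the predictor's two outputs. A minimal sketch, mirroring Step 5's 0.85 threshold; the exact policy is an assumption.

# Illustrative early-stop policy using the predictor from Snippet Two.
def training_budget(predictor, model_code, early_metrics,
                    target_accuracy=0.85, max_epochs=100):
    pred_acc, stop_epoch = predictor(model_code, early_metrics)
    if pred_acc >= target_accuracy:
        return max_epochs                  # promising: train to completion (Step 5)
    return min(stop_epoch, max_epochs)     # below target: cap at the predicted best epoch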

Why This Design Works, and What It Trades Away

The LEMUR dataset is the foundation that makes everything else tractable. Prior LLM-based AutoML work (LLMatic, arXiv:2306.01102; AutoML-GPT, arXiv:2309.01125) generates architectures as abstract descriptions or pseudocode. NNGPT generates runnable PyTorch code grounded in LEMUR's corpus of verified implementations. The distinction is executability: abstract generation fails silently (syntactically valid but semantically wrong code), while LEMUR-grounded generation fails loudly (syntax error, caught in 5 seconds rather than 4 hours into training). The 73% executability rate is the direct result of this grounding.
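
The "fails loudly" property corresponds to a cheap executability check before any GPU time is committed. A minimal sketch of such a check, assuming a conventional constructor signature for the generated class; the paper's actual validation harness may differ.

# Illustrative executability check: compile the code and run one forward pass on dummy data.
# Catches syntax errors and shape mismatches in seconds instead of hours into training.
import torch

def is_executable(model_code: str, input_shape=(1, 3, 32, 32), num_classes=10) -> bool:
    namespace: dict = {"torch": torch, "nn": torch.nn}
    try:
        exec(compile(model_code, "<generated>", "exec"), namespace)   # syntax + class definition
        model_cls = next(v for v in namespace.values()
                         if isinstance(v, type) and issubclass(v, torch.nn.Module)
                         and v is not torch.nn.Module)
        model = model_cls(num_classes=num_classes)   # assumed constructor signature
        out = model(torch.randn(*input_shape))       # smoke-test forward pass
        return out.shape[-1] == num_classes
    except Exception:
        return False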

The hash-based deduplication addresses a failure mode specific to closed-loop LLM generation that the community has not fully confronted: LLMs trained on a corpus tend to generate small variations of the same high-probability outputs. Without deduplication, the growing LEMUR corpus would increasingly contain near-identical architectures, wasting compute and producing a biased fine-tuning signal. The hash check is applied at the code level (not the specification level), which means architecturally equivalent networks with different formatting are caught as duplicates. The paper notes this "saves hundreds of redundant training runs."
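
A minimal sketch of a code-level hash that catches formatting-only variants; the paper does not specify its normalization, so AST round-tripping is used here as one reasonable choice.

# Illustrative hash_model_code: normalize away comments and whitespace, then hash.
# ASSUMPTION: AST round-tripping is this sketch's normalization, not the paper's.
import ast
import hashlib

def hash_model_code(model_code: str) -> str:
    tree = ast.parse(model_code)              # comments are dropped by the parser
    normalized = ast.unparse(tree)            # canonical formatting (Python 3.9+)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()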

The LoRA fine-tuning (r=32, 0.5% of parameters) over DeepSeek Coder 7B is the correct choice for domain adaptation. Full fine-tuning of a 7B model on a domain-specific corpus risks catastrophic forgetting of general code generation capability. LoRA preserves the base model's general abilities while specializing the adapter for LEMUR's architecture vocabulary. At rank 32, the adapter has enough capacity to learn LEMUR's specific patterns (depthwise separable convolutions, squeeze-and-excitation blocks, specific normalization choices) without overfitting.

What NNGPT trades away:

Coverage across domains. The current system is primarily designed for computer vision tasks on CIFAR-10, CIFAR-100, ImageNet subsets, and similar datasets. LEMUR is a computer vision corpus. Extending to NLP architectures, time-series models, or graph networks requires building a comparable domain-specific corpus with verified metrics.

Architecture search in novel spaces. NNGPT generates architectures that are similar to LEMUR's existing entries (by design via NN-RAG). This is a strength for executability but a weakness for discovering truly novel architecture patterns. A network that requires architectural concepts absent from LEMUR is unlikely to be generated correctly.

Interpretability of the generation process. When the system generates an architecture that achieves 92% accuracy, the reason for that architecture choice is not traceable. The LoRA adapter and the DeepSeek Coder base model produce the output jointly, and the contribution of each architectural decision (depthwise separable conv vs. standard conv, skip connections, width) to the final accuracy is not attributable.

Technical Moats

The LEMUR corpus with 10,000+ verified outcomes. The LEMUR dataset is not just a training corpus: it is an audited benchmark with reproducible metrics. Every entry has been trained to completion with standardized preprocessing and evaluation. Building a comparable corpus from scratch requires training 10,000+ models to completion, which is the compute moat that makes NNGPT's results credible. The paper reports this has already happened: "The system has already generated over 5K validated models" (on top of the existing LEMUR baseline), all with verified outcomes. Competing systems that cannot ground generation in a comparable corpus will not reproduce the 73% executability rate.

The code-aware predictor's Pearson r=0.78. A predictor that can read PyTorch source code and early training metrics to forecast final accuracy with RMSE 0.14 is a non-trivial ML engineering result. Most NAS accuracy predictors operate on architecture graph representations (DAGs, cell-based descriptions). A predictor that operates on source code captures information that graph-based representations miss: specific initialization choices, optimizer coupling with architecture, batch normalization placement relative to activations. Replicating this requires both the code encoder training data and the architectural diversity of LEMUR.

The closed RL loop over 10,000 training runs. The dataset of (prompt, generated_spec, execution_result, accuracy) tuples accumulated over 10,000 training runs is itself a moat. Each LoRA update makes future generation slightly better calibrated to what actually works in LEMUR. A competitor starting from zero would need to generate and train thousands of architectures before the RL signal becomes meaningful. The system that has been running longest generates best.

Insights

Insight One: NNGPT is not AutoML. It is a neural architecture corpus generator with a prediction layer, and that distinction changes what it is useful for.

Traditional AutoML optimizes one architecture for one task. NNGPT generates a diverse population of architectures for a class of tasks. The 10,000 generated models are not 10,000 attempts to find the best CIFAR-10 model. They are 10,000 points in architecture space, each with verified metrics, expanding the LEMUR corpus. The correct use case for NNGPT is not "find me the best architecture for my task" but "grow the population of architectures I can choose from and reason about." The distinction matters for evaluation: NNGPT should not be compared to Bayesian optimization on a fixed task. It should be compared to NAS methods that explore architecture diversity across many tasks.

Insight Two: The hash-based deduplication is the most important engineering decision in the paper, and it receives the least attention. Without it, the closed-loop self-improvement collapses into mode collapse.

The natural tendency of a fine-tuned LLM generating architectures is to converge toward high-probability outputs: the architectures most represented in LEMUR, with small perturbations. In a closed loop, generated architectures get added to LEMUR, which increases their probability in future generation, which generates more similar architectures. This is mode collapse in a generative system. Hash-based deduplication breaks this feedback: if a generated architecture already exists (or is a near-duplicate), it is discarded. This forces the system to explore. The consequence: the growing LEMUR corpus contains genuinely diverse architectures, not a cluster of similar high-accuracy networks. The RL reward for "novelty" (zero reward for duplicates) encodes the same principle at the signal level.

Takeaway

NNGPT's HPO pipeline was compared against Optuna and outperformed it (RMSE 0.60 vs. 0.64 on LEMUR). This means a fine-tuned LLM reading hyperparameter configurations in natural language produces better HPO recommendations than a purpose-built Bayesian optimization library. The comparison is not cherry-picked: a prior study (Kochnev et al., ICCVW 2025) pitting Optuna against Code Llama established this result before NNGPT was published.

The implication is counterintuitive: hyperparameter optimization does not require iterative evaluation and Bayesian updating. An LLM that has read thousands of training logs can look at a model architecture and a dataset description and recommend learning rate, batch size, weight decay, and schedule without running a single trial. The recommendation quality, measured by RMSE against optimal hyperparameters found by exhaustive search, is better than Optuna's model-based approach. For practitioners spending days on HPO, this is the finding that most directly affects their workflow.
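
A minimal sketch of what one-shot HPO looks like in practice, assuming the fine-tuned model from Snippet One is prompted for a JSON hyperparameter block; the prompt wording, JSON keys, and parsing helper are illustrative assumptions.

# Illustrative one-shot HPO query: no trials, just a single generation plus parse.
import json

HPO_PROMPT = """# Architecture:
{model_code}

# Dataset: {dataset}
# Recommend hyperparameters as JSON with keys:
# optimizer, lr, weight_decay, batch_size, lr_schedule, epochs
"""

def recommend_hyperparameters(model, tokenizer, model_code: str, dataset: str) -> dict:
    prompt = HPO_PROMPT.format(model_code=model_code, dataset=dataset)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.6, do_sample=True)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return json.loads(extract_json(text))  # one generation replaces hundreds of trials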

TL;DR For Engineers

  • NNGPT (ABrain-One/nn-gpt, MIT, arXiv:2511.20333) is a closed-loop AutoML engine: DeepSeek Coder 7B fine-tuned on LEMUR via LoRA (r=32, ~35M trainable params) generates complete executable PyTorch training specs from one prompt, executes them, and fine-tunes itself on the results. Over 10,000 validated models generated.

  • Five integrated pipelines: zero-shot architecture synthesis, HPO (RMSE 0.60 vs. Optuna's 0.64), code-aware accuracy predictor (RMSE 0.14, Pearson r=0.78), NN-RAG (73% executability on 1,289 targets), and RL-based self-improvement. All running in one closed loop.

  • LEMUR corpus is the foundation: an audited database of executable PyTorch programs with reproducible metrics. NN-RAG retrieves similar architectures for few-shot grounding, preventing the abstract-template failure mode of prior LLM-based AutoML.

  • Hash-based deduplication prevents mode collapse in the self-improvement loop. Without it, the system converges on near-duplicate architectures. The hash check is applied at the code level, catching formatting variations of the same architecture.

  • HPO finding: a fine-tuned LLM reading hyperparameter configurations outperforms Optuna on LEMUR. One-shot prediction reduces the number of required trials from hundreds to one on common datasets.

The Loop Is the Architecture

NNGPT's central engineering contribution is not any individual component. It is the closed loop: generate, execute, predict, update. Every generated architecture that runs improves the generator. Every failure provides a negative signal. Every successful run expands the retrieval corpus for future generation. The system does not need to be told which architectures work. It finds out by running them and remembers what it learned.

At 10,000 validated models, the corpus feedback has become meaningful. At 100,000, it will be the dominant signal. The system that has been running longest and has generated the most validated models will have the strongest generation capabilities. That is the moat, and it compounds.

References

NNGPT (arXiv:2511.20333, ABrain-One, November 2025) is a closed-loop AutoML engine that fine-tunes DeepSeek Coder 7B on the LEMUR dataset via LoRA (r=32, ~35M trainable parameters), generating complete executable PyTorch training specifications from natural language prompts, executing them, and using execution logs for RL-based self-improvement. Five integrated pipelines compose into one system: NN-RAG achieves 73% executability on 1,289 generation targets through LEMUR-grounded few-shot retrieval, the code-aware accuracy predictor reaches RMSE 0.14 with Pearson r=0.78 by reading source code alongside early training metrics, and HPO outperforms Optuna (RMSE 0.60 vs. 0.64). Hash-based deduplication at the code level prevents mode collapse in the self-improvement loop. The system has generated over 10,000 validated models, all incorporated into LEMUR with verified outcomes.

Sponsored Ad

If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀

Your prompts are leaving out 80% of what you're thinking.

When you type a prompt, you summarize. When you speak one, you explain. Wispr Flow captures your full reasoning — constraints, edge cases, examples, tone — and turns it into clean, structured text you paste into ChatGPT, Claude, or any AI tool. The difference shows up immediately. More context in, fewer follow-ups out.

89% of messages sent with zero edits. Used by teams at OpenAI, Vercel, and Clay. Try Wispr Flow free — works on Mac, Windows, and iPhone.
