Research Approach Summary — AI × Soybean Genomics

Glossary — Every Term Explained in Plain English

All technical terms used in this document — explained without jargon so the science speaks for itself.

Artificial Intelligence & Machine Learning

Agentic AI

AI that acts autonomously — like a researcher who works independently. It plans what to do next, executes the plan, evaluates the result, and decides the next step. It does not wait to be told what to do at each stage.

Multi-Agent System

A team of specialized AI programs that each handle a different task and share findings with each other. Like a research team where one person does statistics, another reads literature, and another designs experiments — except they're all AI and work simultaneously.

Large Language Model (LLM)

An AI trained on very large amounts of text, including scientific literature. It can read research papers, summarize their content, and reason about what they mean. Think of it as an AI that has read millions of scientific papers and can synthesize them in seconds. Used here for interpretation, not computation.

Machine Learning (ML)

A type of AI that learns patterns from data rather than following fixed rules. Feed it genotype data and yield measurements from 10,000 plants, and it learns to predict yield from genotype — without being explicitly programmed with rules for how genes affect yield.

Reinforcement Learning (RL)

AI that learns by trial and error, guided by a reward signal. Like training a dog — good decisions get rewarded, bad ones don't. Here, the AI learns which experiments to recommend next (which plants to phenotype) to maximize what is learned from each field season.

Deep Learning

A powerful type of machine learning using layered neural networks (loosely inspired by the brain). It can detect complex non-linear patterns. For example, it can learn that Gene A only affects yield when Gene B is also present, an interaction that purely additive models are not built to capture.

Random Forest

A machine learning algorithm that builds hundreds of decision trees and combines their votes. Because each tree splits on one variable after another, it can pick up variables (SNPs) whose effects depend on each other. Used here to scan for gene-gene interactions (epistasis) at genome-wide scale.

XGBoost / LightGBM

High-performance machine learning algorithms that build prediction models by iteratively correcting their own mistakes (gradient boosting). They can outperform traditional linear models on genomic data, particularly when traits involve non-additive effects. Used here to predict crop phenotype (yield, protein, oil) from genotype data.

Self-Improving Loop

Every time the system makes a prediction that turns out to be wrong, it automatically learns from that error. The incorrect assumption is identified and corrected, so the next prediction is more accurate. The system accumulates knowledge permanently — it never forgets what it learned.

Knowledge Base (RAG)

A curated digital library of relevant research papers that AI agents can search instantly. When an agent needs to know "what is known about this gene?" it searches the knowledge base and retrieves relevant passages from published studies — with citations. RAG = Retrieval Augmented Generation.

Genomics & Genetics

GWAS (Genome-Wide Association Study)

A method that scans millions of positions across the entire genome of thousands of plants and asks: "Is this specific DNA position statistically linked to the trait we care about (e.g., yield)?" It produces a "Manhattan plot" showing which genomic locations are significantly associated with the trait.
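The height of each point on a Manhattan plot is simply the negative log of that SNP's p-value, so stronger associations stand taller. The sketch below shows that conversion and one common, conservative way to draw the significance line (a Bonferroni correction). The function name, the SNP labels, and the p-values are made up for illustration; real studies often use other thresholds.

```python
import math

def manhattan_heights(p_values, alpha=0.05):
    """Return (-log10(p) per SNP, height of the Bonferroni line, significant SNPs)."""
    n_tests = len(p_values)
    # Bonferroni: divide the overall error rate by the number of tests.
    # This is an assumed choice; it is one of several thresholds in use.
    threshold = alpha / n_tests
    heights = {snp: -math.log10(p) for snp, p in p_values.items()}
    significant = sorted(snp for snp, p in p_values.items() if p < threshold)
    return heights, -math.log10(threshold), significant

# Hypothetical p-values for four SNPs (illustrative numbers only).
p_values = {"snp_1": 0.2, "snp_2": 1e-9, "snp_3": 0.02, "snp_4": 0.04}
heights, line, hits = manhattan_heights(p_values)
print(hits)
```

With millions of SNPs the same arithmetic gives a far stricter line, which is why only very small p-values count as genome-wide significant.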

SNP (Single Nucleotide Polymorphism)

A single-letter difference in DNA between individuals. Like a spelling variation — one plant has 'A' at a position where another has 'G'. The soybean genome has millions of SNPs. GWAS tests which SNPs are statistically associated with traits of interest.

QTL (Quantitative Trait Locus)

A region of the genome statistically linked to a measurable trait (like yield or protein content). Unlike single-gene traits, most agricultural traits are controlled by many QTLs each with small effects. Think of it as a "hot zone" on the chromosome map worth investigating.

Epistasis

When two or more genes interact, so that the combined effect is different from (often much larger than) what each gene does alone. Like a lock that needs two keys simultaneously. Gene A alone does nothing; Gene B alone does nothing; but Gene A + Gene B together dramatically increase yield. Standard GWAS tests one SNP at a time, so it largely misses effects that appear only in combination.
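The "two keys" idea can be put into numbers. A minimal sketch, assuming we already know the average yield for plants with neither gene, only A, only B, and both: if the genes simply added up, the gain from A would be the same with or without B, and the contrast below would be zero. The function name and the yield figures are hypothetical.

```python
def interaction_contrast(mean_yield):
    """Epistasis on an additive scale.

    mean_yield maps (has_A, has_B) -> average yield.
    Returns (gain from A when B is present) - (gain from A when B is absent).
    Zero means the two genes simply add up.
    """
    gain_with_b = mean_yield[(1, 1)] - mean_yield[(0, 1)]
    gain_without_b = mean_yield[(1, 0)] - mean_yield[(0, 0)]
    return gain_with_b - gain_without_b

# Hypothetical averages in tonnes per hectare (illustrative only).
lock_and_key = {(0, 0): 3.0, (1, 0): 3.0, (0, 1): 3.0, (1, 1): 4.0}
purely_additive = {(0, 0): 3.0, (1, 0): 3.5, (0, 1): 3.25, (1, 1): 3.75}

print(interaction_contrast(lock_and_key))     # 1.0
print(interaction_contrast(purely_additive))  # 0.0
```

In the first case neither gene shows any effect on its own, which is exactly why a one-SNP-at-a-time scan would pass over both.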

Genomic Selection / Genomic Prediction

Using a plant's full DNA profile to predict how it will perform in the field — before it's ever grown. Instead of waiting years to evaluate offspring, breeders can select the best plants based on their DNA alone. Dramatically speeds up the breeding cycle.

GBLUP (Genomic Best Linear Unbiased Prediction)

The current standard method for genomic prediction. It assumes all genetic effects are additive (genes act independently and their effects simply add up). This works well when genetics is simple but misses the complex interactions (epistasis) that our ML models can capture.
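The additive assumption is easy to show in a few lines. The sketch below is not GBLUP itself (which estimates effects from a relationship matrix across all plants); it only illustrates what "effects simply add up" means once per-marker effects are known. The function name, effect sizes, and genotype are hypothetical.

```python
def additive_prediction(dosages, effects, population_mean):
    """Predict a trait as the population mean plus a sum of per-marker effects.

    dosages: copies of the alternative allele at each marker (0, 1 or 2).
    effects: assumed effect of one allele copy at each marker.
    """
    if len(dosages) != len(effects):
        raise ValueError("need one effect per marker")
    # No interaction terms: each marker contributes independently.
    return population_mean + sum(d * e for d, e in zip(dosages, effects))

# Hypothetical marker effects (tonnes per hectare per allele copy).
effects = [0.5, -0.25, 0.125]
plant = [2, 0, 1]
print(additive_prediction(plant, effects, population_mean=3.0))  # 4.125
```

Nothing in this formula lets one marker's contribution depend on another, which is the gap the interaction-aware ML models are meant to fill.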

Fine-Mapping

GWAS identifies a region (e.g., 1 million base pairs) associated with a trait. Fine-mapping narrows this down to the specific variant most likely causing the effect — from thousands of candidates to a handful. Like finding which single brick in a wall is cracked rather than just knowing which section is unstable.

Heritability (h²)

The proportion of trait variation between plants that is due to genetic differences (vs. environmental differences). If heritability = 0.6 for yield, about 60% of the yield variation across the plants in that study is attributable to their DNA. It is a property of a particular population and set of environments, not a fixed constant. Higher heritability means genetic approaches will be more effective.
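As a rough sketch, heritability is a ratio of variances. The version below assumes the simplest possible split, where total variation is just a genetic part plus an environmental part (ignoring G×E and other terms that a real analysis would estimate). The function name and the variance values are illustrative.

```python
def heritability(var_genetic, var_environment):
    """h2 = genetic variance / (genetic variance + environmental variance)."""
    total = var_genetic + var_environment
    if var_genetic < 0 or var_environment < 0 or total == 0:
        raise ValueError("variances must be non-negative and not both zero")
    return var_genetic / total

# Hypothetical variance components for yield (illustrative only).
print(heritability(var_genetic=3.0, var_environment=2.0))  # 0.6
```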

Linkage Disequilibrium (LD)

When two DNA positions are inherited together more often than expected by chance — because they're physically close on the chromosome. It means a GWAS hit might not be the causal variant but rather a "hitchhiker" travelling alongside it. Fine-mapping resolves LD to find the true cause.
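LD between two positions is usually reported as r², which runs from 0 (inherited independently) to 1 (always inherited together). A minimal sketch, assuming we have phased data where each chromosome copy is recorded as a pair of 0/1 alleles; the function name and the example data are made up.

```python
def ld_r2(haplotypes):
    """r2 between two sites from a list of (allele_site1, allele_site2) pairs.

    Each allele is 0 or 1. D is how much more often the two '1' alleles
    travel together than expected if the sites were independent.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("a site with no variation has no defined r2")
    return d * d / denom

hitchhikers = [(1, 1), (1, 1), (0, 0), (0, 0)]   # always inherited together
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]   # every combination equally common
print(ld_r2(hitchhikers), ld_r2(independent))    # 1.0 0.0
```

When r² is close to 1, the association signal at one site cannot be told apart from the other, which is the problem fine-mapping tries to resolve.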

Population Structure

Genetic relatedness patterns within a study population. If two groups of plants are genetically distinct (e.g., different geographic origins) AND differ in a trait, GWAS can falsely conclude there's a genetic association when it's just ancestry confounding. Mixed models in GWAS correct for this.

Multi-Omics & Molecular Biology

Multi-Omics

Combining multiple layers of biological data simultaneously: Genomics (DNA sequence), Transcriptomics (which genes are active), Proteomics (which proteins are present), Metabolomics (which metabolites are produced). Each layer tells a different part of the story; together they reveal mechanism.

RNA-seq (Transcriptomics)

A method that measures which genes are currently "switched on" and how strongly. DNA is the instruction manual; RNA-seq tells you which instructions are being actively read right now — in a specific tissue, at a specific time, under specific conditions. Shows which genes respond to drought, heat, disease, etc.

eQTL (Expression Quantitative Trait Locus)

A DNA position that controls how much a gene is expressed (switched on/off). When a GWAS hit is also an eQTL for a nearby gene, it strongly suggests the variant works by changing that gene's activity — this is the bridge between statistical association and biological mechanism.

Co-expression Network (WGCNA)

Genes that are consistently switched on and off together across many samples. Like friends who always show up to the same events — they're probably connected. WGCNA (Weighted Gene Co-expression Network Analysis) identifies these gene "communities" to reveal which biological pathways are involved in a trait.
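The "friends who show up together" idea comes down to correlation. The sketch below follows the first step of an unsigned WGCNA-style network: correlate every pair of genes across samples, then raise the absolute correlation to a power so that weak links fade and strong links stand out. The power of 6 is an assumed, commonly used setting, and the gene names and expression values are invented; the real package also performs module detection, which is not shown.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(expression, beta=6):
    """Connection strength |correlation| ** beta for every gene pair."""
    return {
        (g1, g2): abs(pearson(expression[g1], expression[g2])) ** beta
        for g1, g2 in combinations(sorted(expression), 2)
    }

# Hypothetical expression levels across four samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],    # rises and falls with gene_a
    "gene_c": [1.0, -1.0, -1.0, 1.0],  # unrelated to gene_a
}
network = adjacency(expression)
print(network)
```

Here gene_a and gene_b end up strongly connected while gene_c is left on its own.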

Phenotype vs. Genotype

Genotype = what's in the DNA (the blueprint). Phenotype = what you actually observe — yield in tonnes per hectare, protein percentage, height, disease score. The goal of this research is to predict phenotype from genotype accurately — so breeders can select the best plants from DNA alone.

High-Throughput Phenotyping

Using cameras, drones, sensors, or imaging systems to measure plant traits automatically — instead of manual measurement. A camera can measure seed size, shape, and color for 10,000 plants in hours. These image-derived measurements become additional data that improves genomic prediction.

Statistical & Analytical Methods

Mixed Linear Model (MLM)

A statistical model used in GWAS that accounts for the fact that plants in the study are genetically related. Without correction, related plants sharing a trait would look like a genetic association. MLM adjusts for this confounding, which greatly reduces false-positive associations.

SuSiE / FINEMAP (Fine-Mapping Methods)

Statistical algorithms that take a GWAS-identified region and calculate the probability that each variant within it is the true causal one. They output a "credible set" — a small list of variants (sometimes just 1-3) that together have a 95% chance of containing the causal variant.
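Building the credible set is the easy last step once each variant has a probability. A simplified sketch, assuming a single causal variant in the region and probabilities that are already computed (estimating them, and handling several causal variants at once, is the hard part that SuSiE and FINEMAP do). The function name, variant labels, and probabilities are hypothetical.

```python
def credible_set(probabilities, coverage=0.95):
    """Smallest list of top-ranked variants whose probabilities reach `coverage`.

    probabilities: variant id -> probability of being the causal variant.
    Returns (chosen variants, their combined probability).
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        total += prob
        if total >= coverage:
            break
    return chosen, total

# Hypothetical probabilities for four variants in one GWAS region.
probabilities = {"v1": 0.70, "v2": 0.20, "v3": 0.06, "v4": 0.04}
chosen, total = credible_set(probabilities)
print(chosen)  # ['v1', 'v2', 'v3']
```

The sharper the probabilities, the shorter the list; a single variant with probability above 0.95 would form a credible set on its own.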

G×E (Genotype × Environment Interaction)

The same plant variety performs differently in different environments (drought vs. irrigated, hot vs. cool climate). G×E means a variety that ranks #1 in Punjab might rank #5 in Maharashtra. The AI system models this — identifying which genetic variants are stable across environments vs. which are environment-specific.
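A change in ranking between environments is the simplest visible sign of G×E. The sketch below ranks varieties within each environment and reports which ones change position. The variety names, environment labels, and yields are invented for illustration; the actual system would model this statistically rather than by comparing ranks.

```python
def rank_by_environment(yields):
    """yields: {environment: {variety: yield}} -> {environment: varieties, best first}."""
    return {
        env: sorted(by_variety, key=by_variety.get, reverse=True)
        for env, by_variety in yields.items()
    }

def varieties_that_change_rank(yields):
    """Varieties whose rank position is not the same in every environment."""
    rankings = rank_by_environment(yields)
    positions = {}
    for order in rankings.values():
        for position, variety in enumerate(order, start=1):
            positions.setdefault(variety, set()).add(position)
    return sorted(v for v, seen in positions.items() if len(seen) > 1)

# Hypothetical yields in tonnes per hectare (illustrative only).
yields = {
    "irrigated": {"variety_1": 3.2, "variety_2": 3.0, "variety_3": 2.8},
    "drought":   {"variety_1": 2.1, "variety_2": 2.6, "variety_3": 2.4},
}
print(rank_by_environment(yields))
print(varieties_that_change_rank(yields))
```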

PLINK / GAPIT / GEMMA

Standard, widely-used software tools for GWAS analysis in plant and animal genetics. PLINK handles data management and basic QC; GAPIT runs mixed model association tests; GEMMA handles large-scale mixed models efficiently. These tools are trusted by the global genetics community and form the computational backbone of the Discovery Agent.

MAF (Minor Allele Frequency)

How common the less-frequent version (allele) of a SNP is in the study population. A MAF of 0.05 means 5% of all allele copies at that position are the less common version; in largely inbred soybean lines this is roughly 5% of plants. Standard GWAS filters out very rare variants (typically MAF below 1-5%) because there are too few carriers to test statistically, whereas our ML system is designed to look for patterns in the rare variants that GWAS discards.
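MAF is counted over allele copies, not plants. A minimal sketch, assuming each diploid plant is recorded as 0, 1, or 2 copies of the alternative allele (a common genotype coding) and missing calls are recorded as None; the function name and the example data are illustrative.

```python
def minor_allele_frequency(dosages):
    """MAF from alt-allele copies per diploid plant (0, 1 or 2; None = missing)."""
    called = [d for d in dosages if d is not None]
    if not called:
        raise ValueError("no genotype calls at this SNP")
    # Each plant carries two allele copies, hence the factor of 2.
    alt_freq = sum(called) / (2 * len(called))
    # The "minor" allele is whichever of the two is rarer.
    return min(alt_freq, 1 - alt_freq)

# Twenty inbred lines, one of which carries the alternative allele on both copies.
dosages = [0] * 19 + [2]
maf = minor_allele_frequency(dosages)
print(maf)          # 0.05
print(maf >= 0.05)  # passes a 5% MAF filter
```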

DNABERT (DNA Language Model)

A pre-trained AI model that processes DNA sequences the way LLMs process text. It reads DNA as sequences of "k-mers" (short overlapping DNA words like ATCGGT) and learns patterns in how they occur. It can be used to predict whether a genetic variant is likely to disrupt a gene's function, including variants in regulatory regions some distance from the gene.
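Splitting DNA into overlapping k-mers can be sketched in a couple of lines. This shows only the "cutting into words" step with a window that slides one letter at a time; the function name is illustrative, and the model's own tokenizer adds further details not shown here.

```python
def kmer_tokens(sequence, k=6):
    """Slide a window of length k along the sequence, one base at a time."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokens("ATCGGTAC"))  # ['ATCGGT', 'TCGGTA', 'CGGTAC']
```

Because neighbouring words overlap, a single-letter change in the DNA alters up to k words at once, which gives the model several chances to notice it.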

Cloud Infrastructure & Computing

AWS (Amazon Web Services)

The cloud computing platform used to run all analyses. Instead of buying expensive servers, we rent computing power on demand and pay for compute only while analyses are running (stored data still incurs a small ongoing charge). It provides processing power far beyond what a single lab would own, is accessible from any internet connection, and is hosted in the AWS Mumbai region for data sovereignty.

Serverless Architecture

Computing infrastructure that automatically scales up when needed and costs nothing when idle. There are no permanent servers running and waiting — the system activates on-demand when an analysis is triggered and shuts down when done. Like a taxi that appears when called vs. owning a car that sits unused 90% of the time.

Amazon Bedrock

AWS's service for accessing powerful AI language models (like Claude) through an API. The AI agents use Bedrock to do their reasoning — reading results, synthesising literature, forming hypotheses. You pay only per query, with no need to manage AI infrastructure.

AgentCore Runtime

AWS's dedicated environment for running autonomous AI agents. It allows agents to work for hours (up to 8 hours per session), maintain memory across a long analysis, collaborate with other agents, and access files and databases — all without any server management by the researcher.

Amazon Omics

AWS's specialized service for storing and querying genomic data at scale (since renamed AWS HealthOmics). Instead of downloading multi-terabyte VCF files and parsing them manually, researchers query variants for specific chromosomal regions across thousands of accessions in seconds. Purpose-built for genomics workflows.

SageMaker

AWS's machine learning platform. Used to train custom ML models (genomic prediction, epistasis detection) on GPU hardware, then deploy them as low-latency prediction services. Training instances shut down automatically when training is complete (no idle GPU cost). Spot instances can reduce training costs by around 70%, depending on instance type and availability.

What If an AI System Could Think Like a Genomicist?
