umma.dev

What's Bioinformatics?

A reference guide to the core concepts in bioinformatics, bridging biology and computation.

The Biology

Cell Biology

Cell structure

  • Nucleus: The control center of the cell, containing its genetic material (DNA).
  • Mitochondria: Generate chemical energy for the cell.
  • Ribosomes: The site of protein synthesis.

Cell cycle

The series of events that take place in a cell that cause it to divide into two daughter cells.

Cell signalling pathways

The ability of cells to receive, process, and transmit signals, both to and from their environment and within themselves.

Cell differentiation

The process where a cell changes from one cell type to another (e.g., a stem cell becoming a muscle cell).

Biochemistry

Protein structure

Proteins are large biomolecules made up of one or more long chains of amino acid residues. Their function is largely determined by their 3D shape.

Enzymes and binding

Enzymes are proteins that act as biological catalysts. Binding refers to the interaction between the enzyme and its substrate.

Protein-protein interactions

The physical contacts with high specificity established between two or more protein molecules.

Folding and misfolding

Folding is the physical process by which a protein chain acquires its native 3-dimensional structure. Misfolding can lead to inactive or toxic proteins.

Molecular Biology

DNA, RNA, proteins

  • DNA: Stores genetic information.
  • RNA: Carries genetic information from DNA to the ribosomes (as messenger RNA), among other roles.
  • Proteins: Perform the functions encoded by the DNA.

Central Dogma

The flow of genetic information: DNA is transcribed into RNA, and RNA is translated into protein.

Genes, alleles, loci

  • Gene: A sequence of nucleotides in DNA or RNA that encodes the synthesis of a gene product.
  • Allele: A variant form of a gene.
  • Locus: A specific, fixed position on a chromosome where a particular gene or genetic marker is located.

Exons vs introns

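  • Exons: Regions of a gene that are kept in the mature RNA after splicing (they include the protein-coding sequence).
  • Introns: Non-coding regions between exons that are spliced out of the RNA transcript.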
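
Splicing can be pictured as cutting the exon intervals out of a transcript and joining them. The sketch below is only an illustration: the `Exon` type, the `splice` helper, and the coordinates are hypothetical, and real splicing is driven by sequence signals rather than known coordinates.

```typescript
// Hypothetical exon coordinates: 0-based, half-open [start, end) intervals.
type Exon = { start: number; end: number };

const splice = (preMrna: string, exons: Exon[]): string =>
  [...exons]
    .sort((a, b) => a.start - b.start) // keep exons in transcript order
    .map(({ start, end }) => preMrna.slice(start, end))
    .join("");

// Example: two exons separated by an intron (written in lowercase for readability)
const preMrna = "AUGGCCguaaguuuagGAAUAA";
console.log(splice(preMrna, [{ start: 0, end: 6 }, { start: 16, end: 22 }])); // AUGGCCGAAUAA
```

The exons are sorted before joining so the result does not depend on the order in which they are listed.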

Mutations (SNPs, insertions, deletions)

  • SNP (Single Nucleotide Polymorphism): A variation in a single nucleotide.
  • Insertion: Adding one or more nucleotides.
  • Deletion: Removing one or more nucleotides.
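
As a rough illustration, these three mutation types can be told apart by comparing the lengths of a reference allele and an alternate allele, in the style of the REF and ALT fields of a VCF record. The `classifyVariant` helper and its labels are assumptions made for this sketch, and it ignores the population-frequency aspect of the term "polymorphism".

```typescript
type VariantType = "SNP" | "insertion" | "deletion" | "other";

// ref and alt are the reference and alternate alleles at one position.
const classifyVariant = (ref: string, alt: string): VariantType => {
  if (ref.length === 1 && alt.length === 1) {
    return ref !== alt ? "SNP" : "other"; // identical alleles are not a variant
  }
  if (alt.length > ref.length) return "insertion";
  if (alt.length < ref.length) return "deletion";
  return "other"; // same-length multi-base changes
};

// Example
console.log(classifyVariant("A", "G")); // SNP
console.log(classifyVariant("A", "ATG")); // insertion
console.log(classifyVariant("ATG", "A")); // deletion
```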

Codons and genetic code

A codon is a sequence of three DNA or RNA nucleotides that corresponds to a specific amino acid or to a stop signal.
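
A minimal sketch of reading codons, assuming the standard genetic code and an RNA string that starts in frame. The `translate` helper and its one-letter output are choices made for this example; it does not search for a start codon.

```typescript
// Standard genetic code: one-letter amino acids for all 64 codons, listed
// with bases in UCAG order (UUU, UUC, UUA, UUG, UCU, ...). "*" marks a stop codon.
const BASES = "UCAG";
const AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

const translate = (rna: string): string => {
  let protein = "";
  // Read three bases at a time; an incomplete trailing codon is ignored.
  for (let i = 0; i + 3 <= rna.length; i += 3) {
    const a = BASES.indexOf(rna.charAt(i));
    const b = BASES.indexOf(rna.charAt(i + 1));
    const c = BASES.indexOf(rna.charAt(i + 2));
    if (a < 0 || b < 0 || c < 0) throw new Error(`Invalid codon at position ${i}`);
    const aminoAcid = AMINO_ACIDS.charAt(16 * a + 4 * b + c);
    if (aminoAcid === "*") break; // a stop codon ends translation
    protein += aminoAcid;
  }
  return protein;
};

// Example: AUG (M), GCC (A), UAA (stop)
console.log(translate("AUGGCCUAA")); // MA
```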

Gene expression

The process by which information from a gene is used in the synthesis of a functional gene product.

Immunology

Antibodies

Proteins produced by the immune system to identify and neutralize foreign objects like bacteria and viruses.

T-cells and B-cells

  • T-cells: Directly kill infected host cells or activate other immune cells.
  • B-cells: Produce antibodies.

Immune repertoire diversity

The immense variety of antibodies and T-cell receptors produced by the immune system to recognize a wide range of pathogens.

Genetics and Genomics

Genome structure

The complete set of genes or genetic material present in a cell or organism.

Sequencing technologies

Methods used to determine the exact sequence of bases in a DNA molecule (e.g., Illumina, Nanopore).

Variants

Differences in DNA sequence between individuals or populations.

Genotype vs phenotype

  • Genotype: The genetic constitution of an individual organism.
  • Phenotype: The set of observable characteristics resulting from the interaction of its genotype with the environment.

Heritability

A statistic used in breeding and genetics that estimates the degree of variation in a phenotypic trait in a population that is due to genetic variation between individuals in that population.
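
As a worked example, broad-sense heritability is the ratio of genetic variance to total phenotypic variance. The sketch below assumes the simplest model, in which phenotypic variance is just genetic plus environmental variance (no interaction or covariance terms); the function name and the numbers are illustrative.

```typescript
const broadSenseHeritability = (
  geneticVariance: number,
  environmentalVariance: number
): number => {
  // Simplifying assumption: total variance = genetic + environmental
  const phenotypicVariance = geneticVariance + environmentalVariance;
  if (phenotypicVariance <= 0) throw new Error("Total variance must be positive");
  return geneticVariance / phenotypicVariance;
};

// Example
console.log(broadSenseHeritability(6, 2)); // 0.75
```

A value of 0.75 here means that, under this model, three quarters of the trait's variation in the population is attributed to genetic differences.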

GWAS (Genome-Wide Association Studies)

An observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait.
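
One common way to test a single variant is a chi-square test on a 2x2 table of allele counts in cases versus controls. The sketch below shows only that one statistic, with made-up counts and a hypothetical `AlleleCounts` type; real studies also correct for multiple testing and confounders such as population structure.

```typescript
// Allele counts for one variant: copies of the alternate and reference
// allele observed in cases and in controls.
type AlleleCounts = {
  caseAlt: number;
  caseRef: number;
  controlAlt: number;
  controlRef: number;
};

// Pearson chi-square statistic for a 2x2 table (1 degree of freedom).
const chiSquare = (counts: AlleleCounts): number => {
  const { caseAlt: a, caseRef: b, controlAlt: c, controlRef: d } = counts;
  const n = a + b + c + d;
  const denominator = (a + b) * (c + d) * (a + c) * (b + d);
  if (denominator === 0) return 0; // an empty row or column carries no evidence
  const diff = a * d - b * c;
  return (n * diff * diff) / denominator;
};

// Example with hypothetical counts
console.log(chiSquare({ caseAlt: 30, caseRef: 70, controlAlt: 10, controlRef: 90 })); // 12.5
```

A larger statistic means the allele frequencies differ more between the two groups than chance alone would suggest.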

Systems biology and pathways

Gene regulatory networks

A collection of molecular regulators that interact with each other and with other substances in the cell to govern the gene expression levels of mRNA and proteins.

Metabolic pathways

A linked series of chemical reactions occurring within a cell.

Feedback loops

A system structure that causes output from one node to eventually influence input to that same node.
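
If a network is written as a directed graph, a node sits in a feedback loop when following its outgoing edges eventually leads back to it. The sketch below checks this with a depth-first search; the `Network` type and the example graph are hypothetical, and the check says nothing about whether the feedback is positive or negative.

```typescript
// Directed graph as an adjacency list: node -> nodes it directly influences.
type Network = Record<string, string[]>;

const isInFeedbackLoop = (network: Network, node: string): boolean => {
  const visited = new Set<string>();
  const stack: string[] = [...(network[node] ?? [])];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (current === node) return true; // the influence has come back around
    if (visited.has(current)) continue;
    visited.add(current);
    stack.push(...(network[current] ?? []));
  }
  return false;
};

// Example: A -> B -> C -> A forms a loop; D feeds into it but is not part of it
const network: Network = { A: ["B"], B: ["C"], C: ["A"], D: ["A"] };
console.log(isInFeedbackLoop(network, "A")); // true
console.log(isInFeedbackLoop(network, "D")); // false
```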

Evolution and phylogenetics

Homology vs analogy

  • Homology: Similarity due to shared ancestry.
  • Analogy: Similarity due to convergent evolution (not shared ancestry).

Sequence alignment

A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity.
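
One classic approach to global alignment is the Needleman-Wunsch dynamic programming algorithm. The sketch below computes only the optimal score, not the alignment itself, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary defaults chosen for illustration.

```typescript
const globalAlignmentScore = (
  a: string,
  b: string,
  match = 1,
  mismatch = -1,
  gap = -1
): number => {
  // Row 0: aligning an empty prefix of `a` with the first j characters of `b`
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j * gap);
  for (let i = 1; i <= a.length; i++) {
    // Column 0: aligning the first i characters of `a` with an empty prefix of `b`
    const curr: number[] = [i * gap];
    for (let j = 1; j <= b.length; j++) {
      const diagonal = prev[j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch);
      const gapInB = prev[j] + gap; // a[i-1] aligned against a gap
      const gapInA = curr[j - 1] + gap; // b[j-1] aligned against a gap
      curr.push(Math.max(diagonal, gapInB, gapInA));
    }
    prev = curr;
  }
  return prev[b.length];
};

// Example: best alignment is ACGT / AC-T (3 matches, 1 gap)
console.log(globalAlignmentScore("ACGT", "ACT")); // 2
```

Only two rows of the scoring table are kept at a time, which is enough for the score but not for recovering the alignment.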

Phylogenetic trees

A diagram that represents the evolutionary relationships among organisms.
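
Trees are often exchanged as text in the Newick format, where nested parentheses encode the branching. The sketch below serializes a small tree without branch lengths; the `PhyloNode` type is a hypothetical structure for this example, and the species grouping is illustrative.

```typescript
type PhyloNode = { label?: string; children?: PhyloNode[] };

// Serialize one subtree: leaves print their label, internal nodes wrap
// their children in parentheses.
const subtreeToNewick = (node: PhyloNode): string => {
  const label = node.label ?? "";
  if (!node.children || node.children.length === 0) return label;
  return `(${node.children.map((child) => subtreeToNewick(child)).join(",")})${label}`;
};

// A Newick string ends with a semicolon.
const toNewick = (root: PhyloNode): string => `${subtreeToNewick(root)};`;

// Example: human and chimp share a more recent ancestor than either does with mouse
const tree: PhyloNode = {
  children: [
    { children: [{ label: "Human" }, { label: "Chimp" }] },
    { label: "Mouse" },
  ],
};
console.log(toNewick(tree)); // ((Human,Chimp),Mouse);
```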

Disease Biology

Cancer biology basics

Cancer is a disease caused by uncontrolled division of abnormal cells in a part of the body.

Genetic diseases

A disorder caused by one or more abnormalities in the genome.

Complex diseases

Diseases caused by a combination of genetic, environmental, and lifestyle factors.

The Computation

Bioinformatics relies on algorithms to process biological data. Here are examples of standard operations using TypeScript.

DNA Transcription

Transcription produces an RNA copy of a gene. Working from the coding strand, this amounts to replacing Thymine (T) with Uracil (U).

const transcribe = (dna: string): string => dna.replace(/T/g, "U");

// Example
const dna = "ATCGATCG";
console.log(transcribe(dna)); // AUCGAUCG

Reverse Complement

DNA is double-stranded. The reverse complement gives the sequence of the opposite strand, read in its own 5' to 3' direction. Base pairs: A-T, C-G.

const reverseComplement = (dna: string): string => {
  const pairs: Record<string, string> = { A: "T", T: "A", C: "G", G: "C" };
  return dna
    .split("")
    .reverse()
    .map((base) => pairs[base] || base)
    .join("");
};

// Example
console.log(reverseComplement("ATCG")); // CGAT

GC Content

The percentage of nitrogenous bases in a DNA or RNA molecule that are either Guanine (G) or Cytosine (C). G-C pairs form three hydrogen bonds versus two for A-T, so higher GC content generally means a more thermally stable duplex.

const getGCContent = (sequence: string): number => {
  if (sequence.length === 0) return 0; // avoid dividing by zero on empty input
  const matches = sequence.match(/[GCgc]/g) || [];
  return (matches.length / sequence.length) * 100;
};

// Example
console.log(getGCContent("ATCG")); // 50
console.log(getGCContent("GGCC")); // 100

Hamming Distance

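Counts the positions at which two equal-length sequences differ, i.e. the minimum number of substitutions needed to change one into the other. Useful for counting point mutations, such as Single Nucleotide Polymorphisms (SNPs), between aligned sequences.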

const hammingDistance = (seq1: string, seq2: string): number => {
  if (seq1.length !== seq2.length) throw new Error("Sequences must be equal length");
  
  return seq1.split("").reduce((acc, base, i) => 
    base !== seq2[i] ? acc + 1 : acc, 0
  );
};

// Example
console.log(hammingDistance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT")); // 7
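
To see where two sequences differ rather than just how often, the same comparison can collect the mismatching positions. This is a small variation on the function above; `mismatchPositions` is a name chosen for this sketch and the indices are 0-based.

```typescript
const mismatchPositions = (seq1: string, seq2: string): number[] => {
  if (seq1.length !== seq2.length) throw new Error("Sequences must be equal length");

  const positions: number[] = [];
  for (let i = 0; i < seq1.length; i++) {
    if (seq1[i] !== seq2[i]) positions.push(i); // record each differing index
  }
  return positions;
};

// Example
console.log(mismatchPositions("GATTACA", "GACTATA")); // [2, 5]
```

The length of the returned array equals the Hamming distance between the two sequences.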

The Data

  • Formats: Standard file formats include FASTA (sequences), BAM (alignments), and VCF (variants).
  • Volume: The human genome is about 3 billion base pairs long, roughly 3 GB as plain text at one byte per base. Population studies reach petabyte scales.
  • Databases: Major repositories include GenBank (nucleotide sequences) and UniProt (protein sequences).
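
FASTA is the simplest of these formats: each record is a header line starting with ">" followed by one or more lines of sequence. A minimal parser might look like the sketch below; the `FastaRecord` type and `parseFasta` name are assumptions for this example, and it does no validation of the sequence characters.

```typescript
type FastaRecord = { header: string; sequence: string };

const parseFasta = (text: string): FastaRecord[] => {
  const records: FastaRecord[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    if (line.charAt(0) === ">") {
      // A header line starts a new record
      records.push({ header: line.slice(1), sequence: "" });
    } else if (records.length > 0) {
      // Sequence lines are appended to the most recent record
      records[records.length - 1].sequence += line;
    }
  }
  return records;
};

// Example
const fasta = ">seq1 example\nATCG\nGGCC\n>seq2\nTTAA\n";
console.log(parseFasta(fasta));
// two records: "seq1 example" with ATCGGGCC, and "seq2" with TTAA
```

Sequence lines that appear before any header are ignored in this sketch.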

The Applications

  • Personalized Medicine: Tailoring medical treatment to the individual characteristics of each patient.
  • Drug Discovery: Using computational methods to predict how drugs interact with biological targets.
  • Agriculture: Genetic engineering for crop improvement (e.g., drought resistance).
  • Forensics: DNA profiling for identification.

The Future

  • Single-cell Analysis: Analyzing genomic information at the individual cell level rather than bulk tissue.
  • Multi-omics: Integrating data from genomics, transcriptomics, proteomics, and metabolomics.