There are four major organizational units for next-generation DNA sequencing processes that we use throughout the GATK documentation:
Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of a high-throughput sequencing machine. For Illumina machines, this is the physical sequencing lane.
Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes. If working with RNAseq, the library preparation process involves reverse transcription into cDNA.
Sample: A biological sample coming from a single individual. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand what is the best way to group or partition your data for that particular process.