Add comprehensive tests for BLEU and lexical metrics, including edge cases, batch scoring, and aggregate statistics.