Benchmarks

Use-Case-Specific Benchmarks

GenSeC develops three benchmarks aligned with representative security-related applications. Each benchmark reflects domain-specific requirements while adhering to shared quality criteria for validity, reproducibility, and comparability.
1. The benchmark for security-oriented text generation focuses on summarization and report generation from heterogeneous, multilingual sources. It evaluates coherence, contextual appropriateness, and factual reliability, with additional emphasis on robustness under misleading or contradictory inputs (a sketch of such a contradiction probe follows this list).
2. The geospatial vector data benchmark addresses the generation and generalization of geographic data. It evaluates compliance with established geospatial standards, correctness of object representation, and suitability for downstream analytical tasks. Particular attention is given to completeness, topological consistency, and behavior across different levels of spatial abstraction (see the topology-check sketch after this list).
3. The multimodal, map-based question answering benchmark evaluates spatial reasoning over visual map representations. It examines the interpretation of geographic relationships, the handling of vague spatial expressions, and resilience against adversarial queries that aim to distort geospatial understanding (a scoring sketch appears after this list).
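
To make the robustness dimension of the text-generation benchmark concrete, the following is a minimal sketch of a contradiction probe: a deliberately false claim is planted in one source document, and the evaluation checks whether it leaks into the generated summary. The callable generate_summary and the naive string check are illustrative assumptions, not part of the benchmark specification.

```python
# Illustrative sketch only: probing a summarization model's robustness against a
# deliberately contradictory source document. "generate_summary" is a hypothetical
# stand-in for whichever model is under evaluation.
def contradiction_probe(generate_summary, documents, planted_claim):
    """Return summaries with and without the planted claim, and whether it leaked."""
    # Baseline summary from the clean sources.
    clean_summary = generate_summary(documents)

    # Re-run with one additional source containing the planted contradiction.
    poisoned = documents + [f"Contrary to other reports, {planted_claim}"]
    poisoned_summary = generate_summary(poisoned)

    # Naive substring check; a real benchmark would use entailment or fact-checking.
    leaked = planted_claim.lower() in poisoned_summary.lower()
    return {"clean": clean_summary, "poisoned": poisoned_summary, "claim_leaked": leaked}
```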
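
For the geospatial vector data benchmark, the sketch below illustrates one way topological consistency and completeness could be checked, assuming GeoJSON-like features and the shapely library. The object-matching step is a crude proxy; feature names and the reporting format are hypothetical.

```python
# Illustrative sketch only: a minimal topological-consistency and completeness check
# for generated geospatial vector data, assuming GeoJSON-like feature dictionaries.
from shapely.geometry import shape


def topology_report(generated_features, reference_features):
    """Count invalid geometries and estimate completeness against a reference set."""
    geometries = [shape(f["geometry"]) for f in generated_features]

    # Topological consistency: self-intersections, unclosed rings, etc.
    invalid = [g for g in geometries if not g.is_valid]

    # Completeness: fraction of reference objects touched by at least one generated
    # geometry (a crude proxy; a real benchmark would match objects explicitly).
    reference = [shape(f["geometry"]) for f in reference_features]
    covered = sum(1 for r in reference if any(g.intersects(r) for g in geometries))

    return {
        "n_generated": len(geometries),
        "n_invalid": len(invalid),
        "completeness": covered / len(reference) if reference else 1.0,
    }
```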
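
For the map-based question answering benchmark, the following sketch shows how accuracy and adversarial robustness could be scored side by side. The answer_question callable, the item fields, and the exact-match comparison are assumptions made for illustration.

```python
# Illustrative sketch only: scoring a multimodal model on map-based questions,
# including adversarial rephrasings that try to distort geospatial relationships.
# "answer_question" is a hypothetical callable taking (map_image_path, question).
def evaluate_map_qa(answer_question, items):
    """Each item: {'map': path, 'question': str, 'adversarial': str, 'gold': str}."""
    correct, robust = 0, 0
    for item in items:
        gold = item["gold"].strip().lower()
        if answer_question(item["map"], item["question"]).strip().lower() == gold:
            correct += 1
        # The adversarial variant must still yield the gold answer to count as robust.
        if answer_question(item["map"], item["adversarial"]).strip().lower() == gold:
            robust += 1
    n = len(items)
    return {"accuracy": correct / n, "adversarial_robustness": robust / n}
```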

Holistic Benchmarking Framework

In addition to the use-case-specific benchmarks, GenSeC develops a holistic benchmarking framework that addresses cross-cutting evaluation dimensions. The framework integrates factuality, transparency, and security resilience independently of any single application scenario. The holistic benchmark supports adversarial stress testing and enables comparative analysis across models and use cases. By combining automated metrics with expert-informed evaluation procedures, it addresses known limitations of accuracy-centered benchmarking approaches.
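
As a rough illustration of how automated metrics and expert judgments could be combined into a comparable, per-dimension profile, consider the sketch below. The dimension names mirror those named above, but the blending weights and score format are hypothetical assumptions, not the framework's actual aggregation scheme.

```python
# Illustrative sketch only: blending automated metric scores with expert ratings into
# a per-dimension profile, so models can be compared across use cases rather than by
# a single accuracy number. Weights and score ranges are hypothetical.
from statistics import mean

DIMENSIONS = ("factuality", "transparency", "security_resilience")


def holistic_profile(automated_scores, expert_scores, expert_weight=0.5):
    """Blend automated and expert scores (lists of values in [0, 1]) per dimension."""
    profile = {}
    for dim in DIMENSIONS:
        auto = mean(automated_scores.get(dim, [0.0]))
        expert = mean(expert_scores.get(dim, [0.0]))
        profile[dim] = (1 - expert_weight) * auto + expert_weight * expert
    return profile
```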
As part of the holistic benchmark, we will release an accompanying white paper that analyzes the theoretical limits of benchmarking foundation models in security-critical contexts. It reviews key challenges for benchmark design, assesses existing benchmarks with respect to data quality, safety, and security risks, and outlines core properties of high-quality, holistic benchmarks, including factual reliability, robustness, transparency, and security compliance.