ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
arXiv:2506.13792v2 Announce Type: replace Abstract: We introduce \textbf{ICE-ID}, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703--1920), with 226,864 expert-curated person identifiers. ICE-ID combines hierarchical geography (farm$\to$parish$...
arXiv:2506.13792v2 Announce Type: replace
Abstract: We introduce \textbf{ICE-ID}, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703--1920), with 226,864 expert-curated person identifiers. ICE-ID combines hierarchical geography (farm$\to$parish$\to$district$\to$county), patronymic naming conventions, sparse kinship links (partner, father, mother), and multi-decadal temporal drift -- challenges not captured by standard product-matching or citation datasets. This paper presents an artifact-backed analysis of temporal coverage, missingness, identifier ambiguity, candidate-generation efficiency, and cluster distributions, and situates ICE-ID against classical ER benchmarks (Abt--Buy, Amazon--Google, DBLP--ACM, DBLP--Scholar, Walmart--Amazon, iTunes--Amazon, Beer, Fodors--Zagats). We also define a deployment-faithful temporal OOD protocol and release the dataset, splits, regeneration scripts, analysis artifacts, and a dashboard for interactive exploration. Baseline model comparisons and end-to-end ER results are reported in the companion methods paper.