A Streamlined Pipeline to Generate Synthetic Identity Documents

Nag, Soham

The proliferation of fake identity documents poses a serious threat across many areas of society, and advances in artificial intelligence and printing techniques have made the problem considerably worse. The consequences of counterfeit identity documents extend well beyond the legal infractions and financial losses suffered by victims of identity theft: they also threaten public safety, national security, and societal trust. Detecting and preventing fraudulent identity documents has therefore become essential. The effectiveness of fraud detection tools depends on the availability of large identity document datasets for training. However, existing benchmark datasets such as MIDV-500, MIDV-2020, and FMIDV have notable deficiencies: a limited number of samples, insufficient coverage of fraud patterns, and occasional alterations in critical personal identifier fields, particularly portrait images. These limitations constrain their usefulness for training models that detect realistic fraud while also safeguarding privacy. This thesis addresses this gap by proposing a streamlined pipeline for generating synthetic identity documents and by introducing the resulting benchmark dataset, IDNet. IDNet is designed to advance privacy-preserving fraud detection and comprises 597,900 images of synthetically generated identity documents, amounting to approximately 350 gigabytes of data. The documents are categorized into 20 types, covering identity documents from 10 U.S. states and 10 European countries. The dataset also includes documents containing either a single fraud pattern or multiple fraud patterns, to support a range of model training requirements.

Copyright Statement