Generative AI is a branch of artificial intelligence that aims to create novel and realistic data, such as images, text, audio, or video, from existing data or latent variables. Generative AI has many potential applications, such as data augmentation, content creation, anomaly detection, and privacy preservation. However, generative AI also poses significant challenges for data authenticity and integrity, especially in the context of sensor data collected at the edge of the network. Edge computing is a paradigm that enables data processing and analysis near the source of the data, rather than in the cloud or a centralized server. Edge computing can offer benefits such as low latency, high bandwidth, reduced cost, and enhanced privacy. However, edge devices, such as smartphones, cameras, or IoT sensors, are often constrained by limited resources, such as memory, battery, and computing power. Moreover, edge devices are more vulnerable to physical attacks, tampering, or spoofing, which can compromise the quality and reliability of the sensor data. Therefore, it is crucial to develop methods and techniques to ensure data authenticity and integrity at the edge, in the presence of generative AI.
Data authenticity ensures that data is genuine, original, and not fabricated or manipulated by unauthorized parties. It matters for many reasons: establishing trust, accountability, and transparency among data producers, consumers, and intermediaries; protecting intellectual property rights and privacy; and preventing fraud, misinformation, and cyberattacks. Authenticity becomes even more important in the context of generative AI, which can create realistic, convincing data that is hard to distinguish from human-generated data. Such synthetic data can be used for malicious purposes: impersonating or deceiving people, spreading false or misleading information, or compromising the security or performance of systems and networks. It is therefore essential to develop methods and techniques to verify and validate the source, origin, and quality of data, and to detect and reject any generative AI data that is not authorized or intended. This is the main motivation and challenge of this PhD project, which focuses on sensor data collected at the edge of the network, where the risk of generative AI data is higher and the resources for ensuring data authenticity are lower, yet where the compute resides closest to the sensor and thus marks the beginning of a data point's life cycle. After the inception of the data point, the question becomes how to track the computations that have been executed on it, so that the lineage and provenance of the data are observable and transparent at the time of its use. Techniques that improve the transparency of this flow while retaining a verifiable trace of the data's authenticity are at the center of this PhD topic.
Research Objectives and Questions
The main objective of this PhD project is to investigate how to distinguish synthetic, generative AI sensor data from real sensor data, and how to build a data governance framework that keeps this authenticity verifiable even when derivations of the data are created. For example, data from a location sensor that measures longitude and latitude may be derived into a statement such as “Person X is in Ghent at this moment”.
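One way to make the derivation described above traceable is to have each derived statement carry a content hash of the source data point it was computed from. The following is a minimal sketch, not a proposed design; the record schema, field names, and the reverse-geocoding step are all hypothetical:

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """Hash a canonical JSON serialisation of a data point."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Raw sensor observation (hypothetical schema).
raw = {"sensor": "gps-42", "lat": 51.0543, "lon": 3.7174, "ts": 1718000000}

# The derived statement keeps a provenance link to its source via the hash.
derived = {
    "statement": "Person X is in Ghent at this moment",
    "derived_from": content_hash(raw),
    "derivation": "reverse-geocode",
}

# A consumer holding the raw record can verify the provenance link.
assert derived["derived_from"] == content_hash(raw)
```

A content hash on its own only binds the derivation to a specific input; it does not prove the input was authentic, which is where the hardware-rooted verification from the first research question would come in.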
The project will be supervised by Prof. Pieter Colpaert and Dr. Tanguy Coenen, and will be conducted in collaboration with the imec research groups in the “AI and Algorithms” department. The project will address the following research questions:
How can we adapt hardware design so that the software level has a means to verify the authenticity of sensor output?
Can we build data provenance chains in which the authenticity of derived data can still be verified?
How could this be implemented on edge devices, i.e., close to the sensor? Edge devices have limited resources and capabilities, which pose challenges for implementing data authenticity and integrity mechanisms. The project will study how to optimize the performance, scalability, and security of the proposed methods and techniques, and how to leverage existing edge computing frameworks and platforms.
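One direction for the second and third questions is a tamper-evident provenance chain: each computation on the data point appends a record whose authentication tag also covers the tag of the previous record, so reordering or altering any step is detectable. The sketch below uses an HMAC with a symmetric key as a simplification; a real design would more likely use asymmetric signatures rooted in hardware attestation, and all names here are illustrative:

```python
import hashlib
import hmac

def append_step(chain: list, key: bytes, step: str, payload: bytes) -> None:
    """Append a processing step whose tag binds the payload, the step
    name, and the previous step's tag (b"" for the genesis entry)."""
    prev_tag = chain[-1]["tag"] if chain else b""
    msg = prev_tag + step.encode() + payload
    chain.append({"step": step, "payload": payload,
                  "tag": hmac.new(key, msg, hashlib.sha256).digest()})

def verify_chain(chain: list, key: bytes) -> bool:
    """Recompute every tag; tampering with any payload or the order fails."""
    prev_tag = b""
    for entry in chain:
        msg = prev_tag + entry["step"].encode() + entry["payload"]
        if not hmac.compare_digest(
                hmac.new(key, msg, hashlib.sha256).digest(), entry["tag"]):
            return False
        prev_tag = entry["tag"]
    return True

key = b"device-held secret"        # stand-in for a hardware-protected key
chain: list = []
append_step(chain, key, "capture", b'{"lat":51.05,"lon":3.72}')
append_step(chain, key, "geocode", b'{"city":"Ghent"}')
assert verify_chain(chain, key)

chain[0]["payload"] = b'{"lat":0.0,"lon":0.0}'   # tampering breaks the chain
assert not verify_chain(chain, key)
```

On a constrained edge device, each `append_step` costs one hash computation and a small constant amount of storage per step, which is one reason hash-chained constructions are a plausible fit for this setting.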
Expected Outcomes and Contributions
The expected outcomes of this PhD project are:
A comprehensive literature review and state-of-the-art analysis of the existing methods and techniques for data authenticity and integrity at the edge, in the context of generative AI.
A novel and robust framework for data labeling, fingerprinting, and watermarking.
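To make the watermarking outcome concrete: one classical building block is a fragile least-significant-bit watermark applied to raw integer sensor samples at capture time, so that any later modification (including replacement by synthetic data) invalidates the mark. The following is a minimal, illustrative sketch only; key management and robustness against re-quantisation or lossy processing are deliberately out of scope:

```python
import hashlib

def keyed_bits(key: bytes, n: int) -> list:
    """Derive n pseudo-random watermark bits from a secret key."""
    bits, counter = [], 0
    while len(bits) < n:
        block = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        bits.extend((byte >> i) & 1 for byte in block for i in range(8))
        counter += 1
    return bits[:n]

def embed(samples: list, key: bytes) -> list:
    """Overwrite the least-significant bit of each integer sample
    with a key-derived bit (a fragile watermark)."""
    wm = keyed_bits(key, len(samples))
    return [(s & ~1) | b for s, b in zip(samples, wm)]

def verify(samples: list, key: bytes) -> bool:
    """The watermark survives only if no sample was altered."""
    wm = keyed_bits(key, len(samples))
    return all((s & 1) == b for s, b in zip(samples, wm))

key = b"sensor-unit-key"
marked = embed([1024, 1031, 1017, 1040], key)
assert verify(marked, key)
marked[2] += 5                     # any edit breaks the fragile mark
assert not verify(marked, key)
```

This illustrates the fragile end of the design space; the framework envisaged in this project would also need robust marks that survive legitimate derivations, which is a substantially harder problem.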
A prototype implementation and evaluation of the proposed framework on real-world edge devices and sensors, and on synthetic and real data sets.
Dissemination of the research results through publications in peer-reviewed journals and conferences, and through presentations and demonstrations at academic and industrial events.
The expected contributions of this PhD project are: