Author: Luisa Verdoliva
University Federico II of Naples, Italy
The tutorial will focus on deepfake detection. First, it will present the most reliable supervised approaches based both on deep learning and on handcrafted features (e.g., corneal specular highlights, heart variations, landmark locations) together with the main datasets used in this field. The most interesting directions for gaining generalization and robustness will be described, such as one-class learning, few-shot learning and incremental learning. In this context, the concepts of camera fingerprints and artificial fingerprints will be introduced. Then, identity-based methods, aimed at detecting both face swapping and facial reenactment, will be presented. Multimodal approaches that detect audio-visual inconsistencies will also be considered. Results on challenging datasets and realistic scenarios, such as the spreading of manipulated images and videos over social networks, will be presented. In addition, the robustness of such methods to adversarial attacks will be analyzed. The tutorial will consider mostly deepfake videos, but will also include examples of fully generated images using generative adversarial networks (GANs).
Author: Petros Maragos
School of E.C.E., National Technical University of Athens, Athens 15773, Greece
Athena Research Center, Robot Perception and Interaction Unit, Greece
Tropical geometry is a relatively recent field in mathematics and computer science combining elements of algebraic geometry and polyhedral geometry. The scalar arithmetic of its analytic part pre-existed in the form of max-plus and min-plus semiring arithmetic used in finite automata, nonlinear image processing, convex analysis, nonlinear control, optimization, and idempotent mathematics.
Tropical geometry recently emerged successfully in the analysis and extension of several classes of problems and systems in both classical machine learning and deep learning. Such areas include (1) Deep Neural Networks (DNNs) with piecewise-linear (PWL) activation functions, (2) Morphological Neural Networks, (3) Neural Network Minimization, (4) Probabilistic Graphical Models, and (5) Nonlinear regression with PWL functions. Areas (1), (2) and (3) have been applied to image classification problems.
The proposed tutorial will cover the following topics:
• Elements from Tropical Geometry and Max-Plus Algebra. We will first summarize introductory ideas and objects of tropical geometry, including tropical curves and surfaces and Newton polytopes. We will also provide a brief introduction to the max-plus algebra that underlies tropical geometry. This will involve scalar and vector/signal operations defined over a class of nonlinear spaces and optimal solutions of systems of max-plus equations. Tropical polynomials will be defined and related to classical polynomials through Maslov dequantization. Then, the above introductory concepts and tools will be applied to analyzing and/or providing solutions for problems in the following broad areas of machine learning.
• Neural Networks with Piecewise-linear (PWL) Activations. Tropical geometry recently emerged in the study of deep neural networks (DNNs) and variations of the perceptron operating in the max-plus semiring. Standard activation functions employed in DNNs, including the ReLU activation and its “leaky” variants, induce neural network layers which are PWL convex functions of their inputs and create a partition of space well-described by concepts from tropical geometry. We will illustrate a purely geometric approach for studying the representation power of DNNs — measured via the concept of a network’s “linear regions” — under the lens of tropical geometry.
• Morphological Neural Networks. Recently there has been a resurgence of networks whose layers operate with max-plus arithmetic (inspired by the fundamental operators of morphological image processing). Such networks enjoy several promising aspects including faster training and capability of being pruned to a large degree without severe degradation of their performance. We will present several aspects from this emerging class of neural networks from some modern perspectives by using ideas from tropical geometry and mathematical morphology. Subtopics include methods for their training and pruning resulting in sparse representations.
• Neural Network Minimization. The field of tropical algebra is closely linked with the domain of neural networks with PWL activations, since their output can be described via tropical polynomials in the max-plus semiring. In this tutorial, we will briefly present methods stemming from a form of approximate division of such polynomials, which relies on the approximation of their Newton polytopes, in order to minimize networks trained for multiclass classification problems. We will also present experimental evaluations on known datasets, which demonstrate a significant reduction in network size, while retaining adequate performance.
• Probabilistic Graphical Models and Algorithms. A novel application of tropical geometry is its usage for analyzing parametric statistical models, including hidden Markov models and restricted Boltzmann machines. Further, among the max-sum and max-product algorithms used in graphical models for statistical inference a prime representative is the Viterbi algorithm. This can also be viewed in the general setting of Weighted Finite State Transducers which have found extensive use in speech recognition and other decoding schemes. We shall present a tropical modeling of such algorithms which leads to a compact and elegant representation, while highlighting geometric properties.
• Piecewise-linear (PWL) Regression. Fitting PWL functions to data is a fundamental regression problem in multidimensional signal modeling and machine learning, since approximations with PWL functions have proven analytically and computationally very useful in many fields of science and engineering. We focus on functions that admit a convex representation as the maximum of affine functions (e.g. lines, planes), represented with max-plus tropical polynomials. This allows us to use concepts and tools from tropical geometry and max-plus algebra to optimally approximate the shape of curves and surfaces by fitting tropical polynomials to data, possibly in the presence of noise; this yields polygonal or polyhedral shape approximations. For this convex PWL regression problem we present optimal solutions w.r.t. $\ell_p$ error norms and efficient algorithms.
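As a small illustration of the convex PWL regression item above, the following sketch evaluates a max-plus tropical polynomial p(x) = max_k (a_k · x + b_k) and, for slopes a_k that are assumed to be given, computes the largest intercepts b_k for which p stays below the data at every sample. This is the classical greatest lower estimate for a system of max-plus equations; it is not claimed to be the tutorial's algorithm, and the function names, the fixed slopes and the toy data are assumptions made for the example.

```python
from typing import List, Sequence


def tropical_poly(x: Sequence[float],
                  slopes: List[Sequence[float]],
                  intercepts: Sequence[float]) -> float:
    """Evaluate the max-plus polynomial p(x) = max_k (a_k . x + b_k)."""
    return max(
        sum(a_j * x_j for a_j, x_j in zip(a, x)) + b
        for a, b in zip(slopes, intercepts)
    )


def greatest_lower_fit(points: List[Sequence[float]],
                       values: Sequence[float],
                       slopes: List[Sequence[float]]) -> List[float]:
    """For fixed slopes a_k, return the largest intercepts b_k such that
    p(x_i) <= f_i at every data point: b_k = min_i (f_i - a_k . x_i)."""
    return [
        min(f - sum(a_j * x_j for a_j, x_j in zip(a, x))
            for x, f in zip(points, values))
        for a in slopes
    ]


# Noise-free samples of the convex function |x| on a 1-D grid (toy data).
points = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
values = [abs(x[0]) for x in points]
slopes = [(-1.0,), (1.0,)]  # assumed known for this illustration
intercepts = greatest_lower_fit(points, values, slopes)
fitted = [tropical_poly(x, slopes, intercepts) for x in points]
```

With noisy data the same formula still returns a PWL function lying on or below all samples, so the fitting error is one-sided; choosing the slopes themselves is the harder part of the problem and is not addressed by this sketch.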
Throughout this tutorial we also outline problems and future directions in machine learning that can benefit from the tropical-geometric point of view.
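The max-plus reading of the Viterbi algorithm mentioned in the list above can be sketched in a few lines: in the log domain, each step of the recursion is a max-plus matrix-vector product followed by adding the emission scores. The function names and the toy scores below are assumptions chosen for illustration, and the scores are arbitrary log-domain weights rather than normalised log-probabilities.

```python
from typing import List


def maxplus_matvec(A: List[List[float]], x: List[float]) -> List[float]:
    """Max-plus product: (A (x) x)_i = max_j (A[i][j] + x[j])."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


def viterbi_score(log_init: List[float],
                  log_trans: List[List[float]],
                  log_emit: List[List[float]],
                  observations: List[int]) -> float:
    """Best-path score of an HMM via max-plus recursions.
    log_trans[i][j] is the score of moving from state i to state j."""
    n = len(log_init)
    # Transpose so that row j lists the scores of entering state j.
    trans_T = [[log_trans[i][j] for i in range(n)] for j in range(n)]
    delta = [log_init[s] + log_emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        best_prev = maxplus_matvec(trans_T, delta)
        delta = [best_prev[s] + log_emit[s][obs] for s in range(n)]
    return max(delta)


# Toy two-state model with hypothetical log-domain weights.
log_init = [0.0, -1.0]
log_trans = [[-1.0, -2.0], [-3.0, -1.0]]
log_emit = [[0.0, -2.0], [-2.0, 0.0]]
best = viterbi_score(log_init, log_trans, log_emit, [0, 1, 1])
```

For this toy model the best path is state 0, then state 1 twice, with total score -3.0. Replacing (max, +) by (+, ×) in the same recursion gives the forward algorithm, which is one way to see why the semiring formulation is compact.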
T3: Stochastic Bayesian methods for imaging inverse problems: from Monte Carlo to score-matching and deep learning
Authors: Valentin De Bortoli (1), Julie Delon (2), Marcelo Pereyra (3)
(1) CNRS and ENS Paris, France
(2) Université de Paris, France
(3) Heriot-Watt University and the Maxwell Institute for Mathematical Sciences, Edinburgh, UK
The tutorial is structured in three parts of two hours each, organised as follows:
• The first part of this tutorial will introduce the Bayesian statistical framework and key concepts of Bayesian analysis and computation in the context of imaging. We first introduce the Bayesian modelling paradigm and then quickly progress to fundamental concepts of Bayesian decision theory that are relevant to imaging sciences, such as point estimation and uncertainty quantification analyses, hierarchical and empirical approaches to calibrate unknown model parameters, and model selection. This is then followed by an introduction to efficient Bayesian computation approaches. We pay special attention to methods based on the overdamped Langevin stochastic differential equation, to proximal Markov chain Monte Carlo algorithms, and to stochastic approximation methods that intimately combine ideas from stochastic optimisation and Langevin sampling. These computation techniques are illustrated with a series of imaging experiments where they are used to perform some of the advanced Bayesian analyses previously introduced.
• The second part of this tutorial is devoted to Plug-and-Play methods and Tweedie-based approaches. In the Bayesian framework introduced in the first part, image models are used as priors or regularisers and combined with explicit likelihood functions to define posterior distributions. These posterior distributions can be used to derive Maximum A Posteriori (MAP) estimators, leading to optimization problems that may be convex or not, but are well studied and understood. Sampling schemes can also be used to explore these posterior distributions, to derive Minimum Mean Square Error (MMSE) estimators, quantify uncertainty or perform other advanced inferences. While research on inverse problems has focused for many years on explicit image models (either directly in the image space, or in a transformed space), an important trend nowadays is to use implicit image models encoded by denoising neural networks. These denoising networks can be seen in particular as approximating the gradient or the proximal operator of the log-prior on natural images, and can therefore be used in many classical optimization or sampling schemes. These methods, commonly known as Plug & Play (PnP), open the way to restoration algorithms that exploit more powerful and accurate prior models for natural images, but raise novel challenges and questions on the corresponding posterior distributions and their resulting estimators. The goal of this part is to introduce these Plug & Play approaches, to provide some perspectives, and to present recent developments on these questions.
• The third part of this tutorial is devoted to score-based generative modelling for inverse problems. The models used here to sample from a given posterior distribution are adapted from methods for generative modelling, i.e. the task of generating new samples from a data distribution. Score-based generative modelling is a recently developed approach to solve this problem and exhibits state-of-the-art performance on several image synthesis problems. These methods can be roughly described as follows. First, noise is incrementally added to the data to obtain an easy-to-sample distribution. Then, we learn the time-reversed denoising dynamics using a neural network. When these dynamics are initialized at the easy-to-sample distribution, we obtain a generative model. These dynamics can be analyzed through the lens of stochastic analysis. In particular, it is useful to describe these processes as Stochastic Differential Equations (SDEs). The time-reversed SDE is a diffusion whose drift depends on the logarithmic gradients of the perturbed data distributions, i.e. the Stein scores. These scores are computed leveraging score-matching methods and in particular the Tweedie identity as well as neural network approximations. These generative models can be conditioned on observed data and give rise to efficient solvers for inverse problems. We will draw connections between these machine-learning models and the PnP methods introduced in image processing and present applications to some classical inverse problems in image processing.
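Two ingredients named in the three parts above, the Tweedie identity and the overdamped Langevin dynamics, can be illustrated together on a scalar toy problem. The sketch below assumes a Gaussian prior so that the MMSE denoiser is known in closed form (in a Plug & Play setting it would be a trained network), obtains the score of the smoothed prior from the denoiser via Tweedie, and runs an unadjusted Langevin iteration on the posterior of a denoising problem. All names and parameter values are assumptions for the example, not the presenters' code.

```python
import math
import random
from typing import Callable, List


def gaussian_mmse_denoiser(z: float, tau2: float, eps2: float) -> float:
    """MMSE denoiser for x ~ N(0, tau2) observed as z = x + N(0, eps2)."""
    return tau2 / (tau2 + eps2) * z


def tweedie_score(z: float, denoiser: Callable[[float], float], eps2: float) -> float:
    """Tweedie's identity: grad log p_eps(z) = (D(z) - z) / eps2."""
    return (denoiser(z) - z) / eps2


def langevin_posterior_samples(y: float, sigma2: float,
                               prior_score: Callable[[float], float],
                               step: float, n_iter: int, burn_in: int,
                               seed: int = 0) -> List[float]:
    """Unadjusted Langevin iterations targeting p(x | y) for y = x + N(0, sigma2)."""
    rng = random.Random(seed)
    x, samples = y, []
    for k in range(n_iter):
        grad_log_post = (y - x) / sigma2 + prior_score(x)
        x = x + step * grad_log_post + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(x)
    return samples


tau2, eps2, sigma2, y = 1.0, 0.25, 1.0, 2.0  # hypothetical toy values
denoiser = lambda z: gaussian_mmse_denoiser(z, tau2, eps2)
score = lambda z: tweedie_score(z, denoiser, eps2)
samples = langevin_posterior_samples(y, sigma2, score,
                                     step=0.05, n_iter=25000, burn_in=5000)
posterior_mean_estimate = sum(samples) / len(samples)
# Closed form for comparison: the smoothed prior is N(0, tau2 + eps2).
posterior_mean_exact = y * (tau2 + eps2) / (tau2 + eps2 + sigma2)
```

Because the score comes from a denoiser at noise level eps2, the chain targets the posterior built on the smoothed prior rather than the original one; the average of the samples approximates the corresponding MMSE estimate, up to Monte Carlo error and the discretisation bias of the unadjusted scheme.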
T4: Soft Video Delivery: Getting seamless quality adaptation in mobile and latency-critical applications
Authors: Anthony Trioux (1), François-Xavier Coudoux (1), Marco Cagnazzo (2,3), Michel Kieffer (4)
(1) IEMN-DOAE Laboratory, Univ. Polytechnique Hauts-de-France, CNRS, Univ. Lille, YNCREA, Centrale Lille, France
(2) LTCI, Télécom ParisTech, Institut Polytechnique de Paris, France
(3) University of Padua, Department of Information Engineering, Italy
(4) Univ. Paris-Saclay, CNRS, CentraleSupélec, L2S, 91192 Gif-sur-Yvette, France
Conventional video coding and transmission systems are currently based on digital video compression (e.g., HEVC) over a suitable network protocol (802.11, 4G, or 5G) and rely on Shannon's separation theorem. However, they suffer from some inherent limitations when the video content is transmitted over error-prone wireless networks. First, the coding choices (compression rate, channel coding rate) are decided a priori at the transmitter and are the same for all potential receivers. They may not match the actual channel conditions. Users with degraded channels may experience the digital cliff effect (glitches or freezing of the video), while users with a very good channel may not fully benefit from it, since the design choices are based on more pessimistic hypotheses. Second, the traditional techniques require a permanent adaptation of the coding parameters by the transmitter, relying on an estimate of the rate-distortion characteristic of the source and on an estimate of the channel characteristics, which implies additional delay to perform this adaptation. Third, delay is introduced by the various buffers present at the encoder, within the network, and at the receiver. These buffers are either required to smooth out variations of the encoding rate and of the channel characteristics, or are due to the shared network infrastructure.
Soft Video Delivery (SVD) architectures, pioneered by the SoftCast scheme, have demonstrated over the last decade a high potential to address/mitigate these issues. SVD architectures are joint source-channel video coding and transmission schemes that process pixels by successive linear operations (spatio-temporal decorrelation transform, power allocation, analog modulation) and directly transmit the information without quantization or coding. SVD architectures deliver a single data stream that can be decoded by any receiver, even those experiencing bad channel quality. This data stream allows each receiver to decode a video quality commensurate with its channel quality, without requiring any feedback information, while avoiding the complex adaptation mechanisms of conventional schemes. Moreover, SVD architectures offer a relatively low and controlled latency that can be adjusted through the size of the temporal transform. This is a paradigm break with respect to traditional video transmission architectures, which has the potential of dramatically improving the quality of experience in wireless and latency-constrained scenarios.
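The linear power allocation and decoding steps described above can be sketched as follows. The scaling rule used here, g_i proportional to the chunk variance raised to the power -1/4, is the one commonly reported for the SoftCast scheme; the sketch assumes zero-mean transform chunks, an additive white Gaussian noise channel and a linear least-squares decoder, and the chunk variances are hypothetical values chosen for the example.

```python
import math
from typing import List


def softcast_gains(variances: List[float], total_power: float) -> List[float]:
    """Per-chunk scaling g_i = lambda_i^(-1/4) * sqrt(P / sum_j sqrt(lambda_j))."""
    norm = math.sqrt(total_power / sum(math.sqrt(v) for v in variances))
    return [norm * v ** -0.25 for v in variances]


def llse_decode(received: List[float], gains: List[float],
                variances: List[float], noise_var: float) -> List[float]:
    """Linear least-squares estimate of each coefficient from y_i = g_i x_i + n_i."""
    return [
        g * v / (g * g * v + noise_var) * y
        for y, g, v in zip(received, gains, variances)
    ]


variances = [100.0, 25.0, 4.0, 1.0]  # hypothetical chunk variances
gains = softcast_gains(variances, total_power=4.0)
transmit_power = sum(g * g * v for g, v in zip(gains, variances))


def expected_mse(noise_var: float) -> float:
    """Total decoder distortion: sum_i lambda_i * noise / (g_i^2 lambda_i + noise)."""
    return sum(v * noise_var / (g * g * v + noise_var)
               for g, v in zip(gains, variances))
```

The same gains are used whatever the receiver's channel, and `expected_mse` grows smoothly with the noise variance, which is the graceful degradation that distinguishes this kind of scheme from the cliff effect of digital transmission.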
This tutorial will first introduce use cases where SVD architectures can make a difference compared to traditional schemes relying on conventional encoded video streams (e.g., HEVC) over a suitable network protocol (802.11, 4G, or 5G). Issues with conventional digital schemes will also be discussed (e.g., complex adaptation, cliff effect), justifying the SVD approaches. Then, a block-by-block description of the components of the baseline SoftCast SVD scheme will be presented, with visual examples provided to facilitate understanding. A third part will be devoted to real implementations of SVD architectures, in which the dense modulation process and bandwidth computation will be detailed. Recent technical innovations and results from the literature will be presented and discussed. Finally, current research challenges related to the development of SVD architectures will be presented.
Author: Ivan V. Bajić
Simon Fraser University, Canada
Visual content is increasingly being used for more than human viewing. For example, traffic video is automatically analyzed to count vehicles, detect traffic violations, estimate traffic intensity, and recognize license plates; images uploaded to social media are automatically analyzed to detect and recognize people, organize images into thematic collections, and so on; visual sensors on autonomous vehicles analyze captured signals to help the vehicle navigate, avoid obstacles and collisions, and optimize its movement. The sheer amount of visual content used for purposes other than human viewing demands rethinking the traditional approaches for image and video compression.
This tutorial is about techniques for compressing images and video for multiple purposes, besides human viewing. We will start the first part of the tutorial by reviewing early attempts at tackling multi-task usage of compressed visual content. We will discuss several representative problems in “compressed-domain” image and video analysis, such as interest-point detection, face and person detection, saliency detection, and object tracking. We will briefly mention several MPEG standards for encoding features related to image and video analysis, such as Compact Descriptors for Visual Search (CDVS) and Compact Descriptors for Video Analysis (CDVA).
The second part of the tutorial is devoted to the recent learning-based image and video compression methods, which offer much more flexibility for multi-task compression. We will review some basic concepts from information theory that will help appreciate subsequent material. We will then present several recent Deep Neural Network (DNN) models for image and video compression and how they might be used in multi-task compression. We will also discuss task-scalability and privacy in the context of multi-task compression. Finally, recent standardization activities related to multi-task compression, such as JPEG AI and MPEG Video Coding for Machines (VCM), will be reviewed.
Author: C.-C. Jay Kuo
University of Southern California, USA
There has been a rapid development of artificial intelligence and machine learning technologies in the last decade. The core lies in a large amount of annotated training data and deep learning networks. Representative deep learning networks include the convolutional neural network, the recurrent neural network, the long short-term memory network, the transformer, etc. Although deep learning networks have made great impacts in various application domains such as computer vision, natural language processing, autonomous driving, robotics navigation, etc., they have several inherent shortcomings. They are mathematically intractable, vulnerable to adversarial attacks, and demand a huge amount of annotated training data. Furthermore, their training is computationally intensive because of the use of backpropagation for end-to-end network optimization.
There is an emerging concern that deep learning technologies are not friendly to the environment, since their carbon footprint contributes to global warming and climate change. As sustainability has become critical to human civilization, one priority in science and engineering is to preserve our environment for future generations. In the field of artificial intelligence, it is urgent to investigate new learning paradigms that are competitive with deep learning in performance yet have a significantly lower carbon footprint. Professor C.-C. Jay Kuo has worked towards this goal since 2014. He has published a sequence of influential papers along this direction (see the recent publication list) and coined the term “green learning” for this emerging field. By definition, green learning demands low power consumption in both training and inference. Besides, it has several attractive characteristics: small model sizes, fewer training samples, mathematical transparency, ease of incremental learning, etc. It is particularly attractive for mobile/edge computing.
I organized two tutorials on this topic at ICIP 2020 and ICIP 2021, respectively, to promote the importance of this emerging area. It has received more attention recently. I focused on the evolution of convolution layers to the unsupervised feature learning module in green learning at ICIP 2020. I presented the unsupervised feature learning module from the filter bank theory viewpoint and some application examples such as face biometrics and point cloud classification, segmentation and registration. For ICIP 2022, I will add two new learning modules and introduce new applications.
Authors: Jiayi Ma (1) and Xiao-Ping Zhang (2)
(1) Wuhan University, China
(2) Ryerson University, Canada
As the most extensive information carrier, images drive current artificial intelligence to better understand the world. However, a single type of image can rarely describe the imaged scene completely, which hinders deep learning technology in high-level semantic inference. In this context, many engineering, medical, remote sensing, environmental, national defense, and civilian applications need to combine information from various types of images to make more precise decisions. As a result, image fusion technology came into being. According to the differences among source images, typical image fusion scenarios can be divided into multi-modality image fusion, digital photography image fusion, and remote sensing image fusion. The diversity of source images and the complexity of the fusion scenario both pose new challenges to the development of algorithms. This tutorial will provide a basic understanding of image fusion as well as a comprehensive analysis of state-of-the-art solutions.
In the first part of the tutorial, a comprehensive overview of the problem will be given for three major categories: multi-modal image fusion, digital photography image fusion, and remote sensing image fusion. We will discuss different aspects of the above categories by considering the imaging principles, application areas, basic technology pipeline, datasets, and evaluation criteria. In the second, third, and fourth parts of the tutorial, we will focus on details of representative state-of-the-art solutions in each category for a deeper understanding of how to design successful image fusion systems. Moreover, we will also present comparative analyses of state-of-the-art solutions based on various pipelines to show intuitively the relative strengths of the different pipelines. In the last part of the tutorial, current challenges and future work in image fusion will be considered, such as non-registered image fusion, task-oriented image fusion, cross-resolution image fusion, real-time image fusion, and fusion quality assessment.
Author: Antonin Chambolle
Université Paris-Dauphine, CNRS, France
The goal of this tutorial is to review saddle-point methods for convex problems in optimization, which have been developed over almost 15 years. Most of the material which will be presented is not very new (except maybe some results on the computation of optimal transportation and some recent applications of the non-linear setting).
The tutorial will start by describing a few examples of (basic) optimization tasks for image reconstruction (based on elementary Bayesian models, such as segmentation, deblurring, medical imaging, Wasserstein distances or barycenters). These problems will be modeled as non-smooth convex minimization problems, involving for instance l1 norms or the Total Variation.
Then, we will introduce, first in a Euclidean setting, the proximal map of a convex function, and introduce standard elementary splitting methods for solving composite minimization problems. This will lead to the introduction of the “PDHG” or stabilized Arrow-Hurwicz method as in . It will also be related to the proximal-point algorithm, following a remark of He and Yuan (2012). Before this, a very quick introduction to convex conjugacy will be necessary.
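To make the ingredients above concrete, here is a minimal sketch of the PDHG iteration on one of the simplest non-smooth problems, one-dimensional total variation denoising, min_u 0.5 ||u - f||^2 + lam ||Du||_1, where D takes forward differences. The dual step uses the proximal map of the convex conjugate of lam |.|, which is a projection onto [-lam, lam], and the primal step uses the closed-form proximal map of the quadratic term. The function name, step sizes and test signal are assumptions for the example.

```python
from typing import List


def pdhg_tv_denoise_1d(f: List[float], lam: float, n_iter: int = 500) -> List[float]:
    """Minimise 0.5 * ||u - f||^2 + lam * sum_i |u[i+1] - u[i]| with PDHG."""
    n = len(f)
    tau = sigma = 0.5  # tau * sigma * ||D||^2 <= 1 because ||D||^2 <= 4
    u = list(f)
    u_bar = list(f)
    y = [0.0] * (n - 1)  # dual variable, one entry per finite difference
    for _ in range(n_iter):
        # Dual step: prox of the conjugate of lam*|.| projects onto [-lam, lam].
        y = [
            max(-lam, min(lam, y[i] + sigma * (u_bar[i + 1] - u_bar[i])))
            for i in range(n - 1)
        ]
        # Apply D^T: (D^T y)[j] = y[j-1] - y[j], with zero padding at both ends.
        dty = [
            (y[j - 1] if j > 0 else 0.0) - (y[j] if j < n - 1 else 0.0)
            for j in range(n)
        ]
        # Primal step: prox of tau * 0.5 * ||. - f||^2 has a closed form.
        u_new = [(u[j] - tau * dty[j] + tau * f[j]) / (1.0 + tau) for j in range(n)]
        # Over-relaxation with theta = 1.
        u_bar = [2.0 * a - b for a, b in zip(u_new, u)]
        u = u_new
    return u


denoised = pdhg_tv_denoise_1d([0.0, 1.0], lam=0.2)
```

For the two-sample signal [0, 1] with lam = 0.2 the minimiser can be computed by hand as [0.2, 0.8], which gives a simple check that the iteration converges to the right point.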
In a second part, we will describe some extensions. First, we will try to explain how an O(1/N^2) acceleration can be obtained using varying steps and relaxation. This is the trickiest part, as it is a bit too technical for a 3-hour tutorial in this context, and one will need to find a simple way to introduce the main tricks which make the acceleration work without losing the audience. One will also focus on explaining the meaning of the rates which are obtained in terms of primal-dual gap or energies (a common error being to substitute in such estimates the test point (x,y) with a saddle-point (x∗,y∗), which in non-smooth problems gives an absolutely irrelevant criterion of optimality).
The other improvements and extensions we will recall are the explicit schemes of Condat (2013) and Vũ (2013), the step adaptation of Goldstein et al. (2013), the linesearch variant of Malitsky and Pock (2016), and the generalization to smooth/non-smooth convex-concave coupling (Boţ et al., 2021). We will very quickly mention some stochastic extensions such as “APPROX” of Fercoq and Richtárik (2013), and maybe also , yet without details.
In the third section, we will introduce the non-linear setting for optimization in Banach spaces (or simply, finite-dimensional optimisation with non-Euclidean norms). The idea is to review the definition of Bregman distances and the proximal Bregman algorithms. We will then show (without too many details, as it is identical to the Euclidean setting) that the theoretical results on the algorithm transfer to this setting with almost no difference, as shown in  (including acceleration in case a function is relatively strongly convex, yet this seems not widely useful). This will be illustrated at the end of the lecture with two applications:
• a comparison between the rates of convergence for solving (approximately) optimal transportation (assignment) problems, using the Euclidean and the Entropy settings;
• the extension of primal-dual algorithms to problems min_u F(Ku) + G(u) with F smooth or G strongly convex, using instead of the “prox” the gradient of F (or G∗), as suggested by Lan and Zhou (2017). In that case, the notion of relative strong convexity is essential and one recovers in this way variants of the Nesterov/Tseng accelerated methods. A possibility is also, if time permits, to address the interesting issue of differentiating a loss with respect to the parameters of the algorithm, and in particular the coupling operator K between the primal and dual variables. A Piggyback method has been analysed in [6, 1], based on results on inexact algorithms , and it is quite practical for problems where these parameters need to be learned.
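For the non-linear setting described above, the simplest instance of a Bregman proximal step is the one associated with the entropy on the probability simplex, where the Bregman distance is the Kullback-Leibler divergence and the step has a closed multiplicative form. The sketch below is a generic illustration of that step on a linear cost, not the accelerated primal-dual scheme of the tutorial; the function name, step size and cost vector are assumptions.

```python
import math
from typing import List


def entropic_prox_step(x: List[float], grad: List[float], tau: float) -> List[float]:
    """argmin over the simplex of <grad, z> + KL(z, x) / tau, which has the
    closed form z_i proportional to x_i * exp(-tau * grad_i)."""
    shift = min(grad)  # shifting grad by a constant leaves z unchanged
    w = [xi * math.exp(-tau * (g - shift)) for xi, g in zip(x, grad)]
    total = sum(w)
    return [wi / total for wi in w]


x = [0.25, 0.25, 0.25, 0.25]
cost = [3.0, 1.0, 2.0, 4.0]  # hypothetical linear cost on the simplex
for _ in range(50):
    x = entropic_prox_step(x, cost, tau=0.5)
```

The iterates stay strictly inside the simplex without any explicit projection and concentrate on the coordinate of smallest cost, which is the practical appeal of the entropy setting for transport-type constraints.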
We hope to have time to discuss some numerical experiments towards the end, and at least to consider one example in particular and explain in detail how it is solved.
[1] Lea Bogensperger, Antonin Chambolle, and Thomas Pock. Convergence of a Piggyback-style method for the differentiation of solutions of standard saddle-point problems. Working paper or preprint, January 2022.
[2] Antonin Chambolle and Juan Pablo Contreras. Computational optimal transport using accelerated Bregman primal-dual algorithms. Preprint, 2022.
[3] Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schönlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim., 28(4):2783–2808, 2018.
[4] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
[5] Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Math. Program., 159(1-2, Ser. A):253–287, 2016.
[6] Antonin Chambolle and Thomas Pock. Learning consistent discretizations of the total variation. SIAM J. Imaging Sci., 14(2):778–813, 2021.
[7] Julian Rasch and Antonin Chambolle. Inexact first-order primal-dual algorithms. Comput. Optim. Appl., 76(2):381–430, 2020.
Authors: Stanley Chan and Nicholas Chimitt
Purdue University, USA
Imaging through atmospheric turbulence is one of the fastest-growing topics in computational photography, image processing, and computer vision. The challenge of doing research in this field is the steep learning curve of optics, which beginners often find difficult to manage. As the community grows, a tutorial on the subject presented in the context of image processing is not only timely, but also serves a pressing demand given the lack of alternatives. The proposed tutorial will be taught by researchers in computational photography with a strong track record in image processing and optics journals. The objective of the tutorial is to bridge the knowledge gap for participants in a number of upcoming major research programs such as IARPA’s BRIAR (launched) and CVPR 2022’s UG2+ challenge on turbulence.
The proposed tutorial aims at providing a working knowledge of the simulation and principles of imaging through turbulence, with the only requirement being familiarity with basic Electrical Engineering principles. The tutorial uses an appropriate balance of theory and programming to suit the ICIP audience, using live Python demos to give the audience some familiarity with the concepts. Python code will be available for download and will contain multiple tunable parameters, with suggested inputs, so that those in attendance may change parameters and become accustomed to these concepts through experience while following along.
The course is designed for three hours. Each hour will cover one sub-topic: Fourier optics, atmospheric turbulence simulation, and reconstruction.
Authors: Zhu Li (1), Zhan Ma (2), Shan Liu (3), Xiaozhong Xu (3), Xiang Zhang (3)
(1) University of Missouri, Kansas City, USA
(2) Nanjing University, Jiangsu, China
(3) Tencent Media Lab, USA
Point Cloud Compression: Point cloud data arises from 3D sensing and capture for autonomous driving, navigation, and smart-city applications, as well as from VR/AR playback and immersive visual communication. Recent advances in sensor technologies and algorithms, especially LiDAR and 77 GHz mmWave radar systems and ultra-high-resolution RGB camera arrays, have brought point cloud acquisition and processing closer to wide adoption in real-world applications. Given that point cloud data often consists of a very large number of unstructured points in 3D space, efficient point cloud compression is highly desirable, especially for networked services. In this tutorial we will review the latest advances in point cloud compression, for both standard-based and learning-based frameworks, including advanced 3D motion models, deep-learning-based deblocking, end-to-end learning-based compression of point clouds, as well as QoE metrics. The tutorial is based on a series of recent publications listed in the reference.
Volumetric Visual Data Compression: The popularity of volumetric visual data, both in applications and in technologies, has increased rapidly. The common forms of volumetric visual data include point clouds, meshes and light fields, all of which consume huge communication bandwidth and storage space due to their volumetric nature. Hence, volumetric data compression techniques are critical to real-world applications and products. This tutorial will introduce existing and ongoing research as well as standardization activities on volumetric data compression, with discussions about challenges and applications.
Conventional image and video codecs are typically designed to compress sensor-captured 2D video. The quality assessment of such content (image or video) has also been well studied. However, these technologies cannot be used directly to handle data in 3D space. To address the need for efficient coding and representation of 3D media content, a number of coding solutions have been studied in the literature. Typically, the volumetric visual data is either converted into one or more types of 2D video data (so that video compression can be applied), or handled directly in 3D space by taking advantage of the geometric correlations among data points. Further, quality evaluation technology for volumetric visual data has been explored to better capture the relationship between compression/processing and the new distortion behaviours.
In this tutorial, we will first introduce the fundamental concept of volumetric visual data, typical data formats and their related applications. Then the challenges of efficiently representing such high-data-rate 3D visual data will be discussed. Later, the compression technologies for point clouds and meshes will be reviewed in detail. Lastly, the issue of using conventional methods for 3D visual data assessment and more advanced developments in the field of quality evaluation will be addressed. In addition, several related compression standards will be reviewed, which are specifically developed to improve the coding efficiency of volumetric visual data or to better assess the quality.