Local Relation Network
Adapt filter according to the appearance affinity
Meaningful and adaptive spatial aggregation
Humans have a remarkable ability to “see the infinite world with finite means” [26, 2].
- I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 1987.
- W. von Humboldt. On Language: On the Diversity of Human Language Construction and Its Influence on the Mental Development of the Human Species. Cambridge Texts in the History of Philosophy. Cambridge University Press, 1999/1836.
Hierarchical features -> different levels of features
Rather than recognizing how elements can be meaningfully joined together, convolutional layers act as templates
1 filter -> 1 channel
it's a waste of channels.
local relation layer
locality
& geometric priors
accuracy-efficiency trade-off
enlarge receptive field
deformable convolution
relax the requirement for sharing weights (this is too rigid)
locally connected layers (DeepFace)
Capsule Networks
self-enhancement
filter bubble
Given that we prefer to eschew negative experiences, it comes as no surprise that people avoid the immediate psychological discomfort from cognitive dissonance by simply not reading or listening to differing opinions.
long-range context
This work
feature extractor
compositionality
directly into representation
Some concepts
bottom-up
& top-down
aggregation
geometric prior
locality
Local Relation Networks are abbreviated as LR-Nets
Suppose
\(C = 24, m = 8, k = 7,C/m = 3\)
We observe no accuracy drop with up to 8 channels (default) sharing the same aggregation (for k)
\(H = 160,W = 160\)
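The numbers above can be turned into a quick count of how many aggregation weights the layer has to produce. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and the "unshared" case (one aggregation map per channel) is an assumed baseline for comparison, not something stated in the notes.

```python
# Values taken from the notes above.
C, m, k = 24, 8, 7
H, W = 160, 160

groups = C // m                        # distinct aggregation maps per position
weights_shared = groups * k * k        # weights per position when m channels share one map
weights_unshared = C * k * k           # assumed baseline: every channel has its own map
total_shared = weights_shared * H * W  # over the whole H x W feature map

print(groups, weights_shared, weights_unshared, total_shared)
# 3 147 1176 3763200
```

With m = 8, each position needs 147 aggregation weights instead of 1176, which is the saving that channel sharing buys.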
In this architecture, the receptive field is related to the concept of the Geometry Prior.
Or rather, the learned Geometry Prior is applied over a neighborhood (similar to a receptive field); k is the neighborhood size.
The Geometry Prior is analogous to a conventional convolution filter.
However, the geometry prior is considered together with appearance composability, which makes the aggregation adapt to the input.
In other words, the geometry prior is conditioned on the input pixels' correlation.
Input feature map: 24x160x160
Query map: 3x160x160 (compress #channels from 24 to 3; the query is in C/m = 3 channels)
Key map: 3x160x160
With k = 7, there are many 3x7x7 regions (neighborhoods) in the key maps
\(W_{neighbour} = \text{SoftMax}(\text{Geo.}+\text{App.})\) (Geometry and Appearance)
\(\text{pixel}_{x,y} = W_{neighbour}\text{Input}_{neighbour}\)
The neighborhood of size k is centered at x,y (the source and target pixel position).
All of the aggregation is performed in a receptive field of k x k.
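The two formulas above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the paper's implementation: the function name `local_relation_aggregate` is made up, the appearance term is taken to be the product of a scalar query and scalar key, the geometry prior is passed in as a plain array of shape (C/m, k, k), the query and key maps are given as inputs rather than computed from the feature map, and borders are zero-padded (so padded positions also take part in the softmax).

```python
import numpy as np

def local_relation_aggregate(x, query, key, geo, m):
    """x: (C, H, W); query, key: (C//m, H, W); geo: (C//m, k, k)."""
    C, H, W = x.shape
    G, k, _ = geo.shape
    assert C == G * m, "m channels share each aggregation map"
    pad = k // 2
    # Zero-pad so that every position has a full k x k neighborhood (assumption).
    x_p = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    key_p = np.pad(key, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C, H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            key_nb = key_p[:, i:i + k, j:j + k]             # (G, k, k) keys around (i, j)
            app = query[:, i, j][:, None, None] * key_nb    # appearance term (assumed product)
            logits = (geo + app).reshape(G, -1)             # Geo. + App.
            logits -= logits.max(axis=1, keepdims=True)     # numerical stability
            w = np.exp(logits)
            w /= w.sum(axis=1, keepdims=True)               # softmax over the k*k neighbors
            w_full = np.repeat(w.reshape(G, k, k), m, axis=0)  # share weights across m channels
            x_nb = x_p[:, i:i + k, j:j + k]                 # (C, k, k) input neighborhood
            out[:, i, j] = (w_full * x_nb).sum(axis=(1, 2)) # weighted sum = output pixel
    return out

# Small spatial size so the explicit loops stay fast.
rng = np.random.default_rng(0)
C, m, k, H, W = 24, 8, 7, 10, 10
x = rng.standard_normal((C, H, W))
query = rng.standard_normal((C // m, H, W))
key = rng.standard_normal((C // m, H, W))
geo = rng.standard_normal((C // m, k, k))
out = local_relation_aggregate(x, query, key, geo, m)
print(out.shape)  # (24, 10, 10)
```

The explicit double loop is chosen for readability; a practical version would gather all neighborhoods at once instead of looping over positions.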
They claim that LR (i.e. the Local-Relation Layer) can utilize large kernels more effectively
This difference may be due to the representation power of convolution layer being bottlenecked by the number of fixed filters, hence there is no benefit from a larger kernel size.
Weight Sharing across different positions in an image limits the utilization of the representation power of large kernels.
While in previous works the query and key are vectors, in the local relation layer, we use scalars to represent them so that the computation and representation are lightweight.
What is that?
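One possible reading of the scalar query/key remark: in vector-based attention each position carries a d-dimensional query and key, compared with a dot product, whereas here each position carries a single number per channel group, so the comparison is a scalar operation. The sketch below contrasts the two. The function names and the three scalar forms (product, negative squared difference, negative absolute difference) are assumptions chosen for illustration.

```python
import numpy as np

def appearance_vector(q_vec, k_vec):
    """Dot-product similarity between a vector query and a vector key."""
    return float(np.dot(q_vec, k_vec))

def appearance_scalar(q_val, k_val, form="product"):
    """Similarity between a scalar query and a scalar key (assumed forms)."""
    if form == "product":
        return q_val * k_val
    if form == "sq_diff":
        return -(q_val - k_val) ** 2
    if form == "abs_diff":
        return -abs(q_val - k_val)
    raise ValueError(f"unknown form: {form}")

# Vector case: d multiplications and additions per query-key pair.
print(appearance_vector(np.array([1.0, 2.0]), np.array([3.0, 4.0])))  # 11.0
# Scalar case: one arithmetic operation per query-key pair.
print(appearance_scalar(2.0, 3.0, "product"))   # 6.0
print(appearance_scalar(2.0, 3.0, "sq_diff"))   # -1.0
print(appearance_scalar(2.0, 3.0, "abs_diff"))  # -1.0
```

Since the comparison is repeated for every position and every one of the k x k neighbors, reducing it to a scalar operation keeps both computation and stored query/key maps small.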