I plan to read through the papers on "solving simple geometry problems" in the near future and write up a summary.

Diagram Understanding in Geometry Questions

  • AAAI 2014 paper link
  • The paper proposes a diagram-parsing tool called G-ALIGNER and proves that the objective function it uses is submodular.
  • The objective function
    • $T=\{T_1,T_2,\dots,T_m\}$ denotes the set of elements mentioned in the problem text, and $L$ is the set of all primitives initially extracted from the diagram. The paper assumes $L$ is already given; the task is to find an optimal subset $\hat{L}$ that accurately represents the diagram.
    • Introduce $W \in \{0,1\}^{|L| \times |T|}$ indicating whether each primitive in the diagram corresponds to a given text element. Since $\hat{L}=L \times (W \times 1_{|T| \times 1})$, finding an optimal subset $\hat{L}$ reduces to finding an optimal matrix $\hat{W}$.
    • For a candidate set $\hat L$, the objective function $\mathcal{F}$ has three parts:
      • How well the selected primitives cover the actual diagram pixels: $\mathcal{P}=\frac{|D_{\hat L}|}{|D|}$
      • How well the selected primitives form more coherent, higher-level visual elements: $\mathcal{C}=\frac{|C_{\hat L}|}{|C|}$
      • How well the selected primitives correspond to the text: $\mathcal{S}=\frac{|T_{\hat L}|}{|T|}-r_{\hat W}$, where $r_{\hat W}$ penalizes redundancy.
    • It can be proved that $\mathcal{F}=\mathcal{P}+\mathcal{C}+\mathcal{S}$ is submodular.
  • Solution procedure
    • Primitives are extracted with the Hough transform, and corners are detected with the Harris corner detector.
    • The optimization is a simple greedy procedure: start with $\hat{L}=\varnothing$; at each step enumerate $l \in L$ and add the $l$ maximizing the marginal gain $F(\hat L \cup l)-F(\hat L)$ to $\hat L$, stopping once the gain becomes $< 0$.
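A minimal sketch of this greedy loop, assuming a black-box objective `score` standing in for $\mathcal{F}=\mathcal{P}+\mathcal{C}+\mathcal{S}$ (the real objective is computed from the diagram pixels, detected corners, and the text-alignment matrix $W$):

```python
# Greedy maximization sketch for a submodular objective (hypothetical `score`
# stands in for F = P + C + S).
from typing import Callable, FrozenSet, Iterable, TypeVar

Primitive = TypeVar("Primitive")

def greedy_select(primitives: Iterable[Primitive],
                  score: Callable[[FrozenSet[Primitive]], float]) -> FrozenSet[Primitive]:
    remaining = set(primitives)
    selected: FrozenSet[Primitive] = frozenset()
    while remaining:
        # Primitive with the largest marginal gain F(L ∪ {l}) - F(L).
        best = max(remaining, key=lambda l: score(selected | {l}) - score(selected))
        if score(selected | {best}) - score(selected) < 0:
            break                      # stop once the marginal gain is negative
        selected = selected | {best}
        remaining.remove(best)
    return selected
```

Submodularity is what makes this simple greedy loop a reasonable optimizer: marginal gains can only shrink as $\hat L$ grows, so the greedy choice does not lock in bad decisions the way it could for an arbitrary set function.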

Solving Geometry Problems: Combining Text and Diagram Interpretation

  • EMNLP 2015 paper link
  • Proposes the GEOS system, which converts both the text and the diagram into a formal language (logical expressions) and then reasons over it.
  • Definition of the logical language $\Omega$
    • It is a first-order logic language with four kinds of symbols:
      • constants: known quantities or entities, e.g. 5
      • variables: unknown quantities or entities, e.g. x, CE
      • predicates: geometric or arithmetic relations, e.g. Equals, IsDiameter, IsTangent
      • functions: properties of geometric objects or arithmetic operations, e.g. LengthOf, SumOf
    • A literal is an application of a predicate, e.g. isTriangle(ABC)
    • Logical formulas are combinations of the above with existential quantifiers and conjunctions, for example:
      ∃x, IsTriangle(x) ∧ IsIsosceles(x)
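Purely as an illustration (not GEOS's actual implementation), the logical language could be represented with a few small dataclasses:

```python
# Illustrative sketch (not GEOS's code) of representing the logical language:
# constants, variables, predicate/function applications, and quantified formulas.
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Constant:              # known quantity or entity, e.g. 5
    value: object

@dataclass(frozen=True)
class Variable:              # unknown quantity or entity, e.g. x, CE
    name: str

Term = Union[Constant, Variable, "Apply"]

@dataclass(frozen=True)
class Apply:                 # a predicate or function applied to terms
    symbol: str              # e.g. "IsTriangle", "Equals", "LengthOf"
    args: Tuple[Term, ...]

@dataclass(frozen=True)
class Exists:                # ∃x, conjunction of literals
    var: Variable
    body: Tuple[Apply, ...]

# ∃x, IsTriangle(x) ∧ IsIsosceles(x)
x = Variable("x")
formula = Exists(x, (Apply("IsTriangle", (x,)), Apply("IsIsosceles", (x,))))
```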

Transformers as Soft Reasoners over Language

A Generalized Earley Parser for Human Activity Parsing and Prediction

  • This paper is an extension of the authors' previous ICCV and ICML papers. TPAMI 2020 Address

  • The Earley parser is a classical method for parsing context-free grammars (CFGs), but it cannot handle the non-Markovian and compositional properties of high-level semantics. This paper generalizes the Earley parser to sequence data, requiring only an arbitrary probabilistic classifier.

  • The Generalized Earley Parser combines detection, parsing, and future prediction, which makes it generic and widely applicable. It finds the segmentation and label sentence that maximize the joint probability according to both a symbolic grammar and the classifier's per-frame label probabilities.

  • Brief introduction to the EARLEY PARSER

    • The Earley parser takes a sentence of length $T$ as input and determines whether the given CFG can generate it.
    • It is similar to an LR parser. There are three main operations (a minimal recognizer sketch follows this subsection):
      • Prediction: for each state $(A \to \alpha \cdot B\beta)$, add $(B \to \cdot \gamma)$ for every production $B \to \gamma$ in the grammar.
      • Scanning: if the state is $(A \to \alpha \cdot a\beta)$ and the next input symbol is $a$, add $(A \to \alpha a \cdot \beta)$.
      • Completion: if a completed state $(A \to \gamma \cdot)$ appears, advance every waiting state $(B \to \alpha \cdot A\beta)$ to $(B \to \alpha A \cdot \beta)$.
    • An example
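A compact recognizer illustrating the three operations (a generic textbook-style sketch, not the paper's code; the grammar format below is an assumption: a dict mapping each nonterminal to a list of right-hand sides, with terminals given as plain tokens):

```python
# Textbook-style Earley recognizer (sketch; nullable productions not handled).
# A state is (lhs, rhs, dot, origin).
def earley_recognize(grammar, start, tokens):
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))

    for i in range(len(tokens) + 1):
        queue = list(chart[i])
        while queue:
            lhs, rhs, dot, origin = queue.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                           # Prediction
                    for prod in grammar[sym]:
                        new = (sym, tuple(prod), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); queue.append(new)
                elif i < len(tokens) and tokens[i] == sym:   # Scanning
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                            # Completion
                for (l2, r2, d2, o2) in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new); queue.append(new)

    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[len(tokens)])

grammar = {"S": [["S", "+", "S"], ["1"]]}
print(earley_recognize(grammar, "S", ["1", "+", "1"]))       # True
```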
  • GENERALIZED EARLEY PARSER

    • Consider a sequence of length $T$. The Generalized Earley Parser takes a $T \times K$ probability matrix as input and outputs the label sentence $l$ (with $|l| \le T$) that best explains the data according to a grammar $G$, where each label $k \in \{0, 1, \dots, K-1\}$ corresponds to a segment of the sequence.
    • Parsing Probability
      • The parsing probability $p(l|x_{0:T})$ is computed in a dynamic-programming fashion (a small numeric sketch of the recursions is given after the example below).
      • The current frame $t$ carries the last label $k$ of $l$; it either continues the current segment (the previous frames are labeled $l$) or starts a new one (the previous frames are labeled $l^-$, i.e. $l$ with its last label removed): $p(l|x_{0:t}) = y^k_t \left(p(l|x_{0:t-1}) + p(l^-|x_{0:t-1})\right)$
      • (Illustration: the frames $x_{0:t}$ are segmented into a label sentence such as $l =$ "1+1", whose prefix after removing the last label is $l^- =$ "1+".)
    • Prefix Probability
      • Next we compute the prefix probability $p(l \ldots|x_{0:T})$: once the last label $k$ is observed (the transition into it happens at some frame), $l$ becomes a prefix and the remaining frames can be labeled arbitrarily: $p\left(l \ldots \mid x_{0: T}\right)=p\left(l \mid x_{0}\right)+\sum_{t=1}^{T} y_{t}^{k}\, p\left(l^{-} \mid x_{0: t-1}\right)$
      • In practice, $p(l|x_{0:t})$ decreases exponentially as $t$ increases and soon underflows numerically, so the computation should be carried out in $\log$ space.
    • Posterior Probability
      • The transition probability $p(k|l^-, G)$ can be calculated by $$p\left(k \mid l^{-}, G\right)=\frac{p(l \ldots \mid G)}{p\left(l^{-} \ldots \mid G\right)}$$
      • So the posterior version of the parsing probability is: $p\left(l \mid x_{0:t}, G\right) \propto y_{t}^{k}\left(p\left(l \mid x_{0: t-1}, G\right)+p\left(k \mid l^{-}, G\right) p\left(l^{-} \mid x_{0: t-1}, G\right)\right)$
      • And the posterior version of the prefix probability is: $p\left(l \ldots \mid x_{0: T}, G\right)=p\left(l \mid x_{0}, G\right)+\sum_{t=1}^{T} p\left(k \mid l^{-}, G\right) y_{t}^{k}\, p\left(l^{-} \mid x_{0: t-1}, G\right)$
    • An example
  • (Figure: Generalized_Earley_Parser_Sample)
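A small numeric sketch of the parsing and prefix recursions above, assuming the classifier output `y` is a $T \times K$ NumPy array and a candidate label sentence `l` is given. It works directly in probability space for readability; as noted above, a real implementation should work in log space:

```python
import numpy as np

def parsing_and_prefix_prob(y, l):
    """Return (p(l | x_{0:T}), p(l... | x_{0:T})) for a label sentence l.

    y : (T, K) array, y[t, k] = classifier probability of label k at frame t
    l : list of labels [k_1, ..., k_m] (no repeated adjacent labels)
    """
    T, _ = y.shape
    m = len(l)
    # p[j, t] = p(l_{1..j} | first t frames); p[0, 0] = 1 is the convention
    # that the empty sentence explains zero frames.
    p = np.zeros((m + 1, T + 1))
    p[0, 0] = 1.0
    for t in range(1, T + 1):
        for j in range(1, m + 1):
            k = l[j - 1]
            # frame t-1 carries label k_j; earlier frames are explained by
            # either l_{1..j} (same segment continues) or l_{1..j-1} (new segment)
            p[j, t] = y[t - 1, k] * (p[j, t - 1] + p[j - 1, t - 1])
    parsing = p[m, T]
    # prefix probability: the transition into the last label can happen at any frame
    k_last = l[-1]
    prefix = sum(y[t - 1, k_last] * p[m - 1, t - 1] for t in range(1, T + 1))
    return parsing, prefix

y = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])   # T = 3 frames, K = 2 labels
print(parsing_and_prefix_prob(y, [0, 1, 0]))
```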

Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning

  • ICML 2020. Proposes a Neural-Grammar-Symbolic (NGS) model.
  • Learning neural-symbolic models with RL usually converges slowly because of sparse rewards. The authors propose a novel top-down back-search, which can also serve as a Metropolis-Hastings sampler (an MCMC sampling method).
  • Model Inference
    • Let $x$ be the input (e.g. an image or a question), $z$ the hidden symbolic representation, and $y$ the desired output inferred from $z$.
    1. Neural Perception: map $x$ to a normalized probability distribution over the hidden symbolic representation $z$: $p_\theta(z|x) = \mathrm{softmax}(\phi_\theta(z, x))$
    2. Grammar Parsing: find the most probable $z$ that is accepted by the grammar: $\hat z = \arg \max \limits_{z \in L(G)} p_\theta(z|x)$. The Generalized Earley Parser is used for this step.
    3. Symbolic Reasoning: given the parsed symbolic representation $\hat z$, perform deterministic inference with $\hat z$ and the domain-specific knowledge $\Delta$ to predict $y$: $\hat y = f(\hat z; \Delta)$
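A schematic of these three inference steps, with hypothetical `phi`, `grammar_parse`, and `reason` callables standing in for the neural network, the Generalized Earley Parser, and the domain-specific executor:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ngs_inference(x, phi, grammar_parse, reason):
    """Hypothetical NGS forward pass: perception -> parsing -> reasoning."""
    # 1. Neural perception: p_theta(z | x) = softmax(phi(z, x))
    p_z_given_x = softmax(phi(x))
    # 2. Grammar parsing: the most probable z accepted by the grammar G
    #    (the paper uses the Generalized Earley Parser here)
    z_hat = grammar_parse(p_z_given_x)
    # 3. Symbolic reasoning: deterministic execution with domain knowledge Δ
    y_hat = reason(z_hat)
    return z_hat, y_hat
```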
  • Learning Details
    • 1-STEP BACK-SEARCH

      • Previous methods that learn the model with policy gradients discard all samples with zero reward; 1-step back-search (1-BS) can instead correct wrong samples and use the corrections as pseudo labels for training.
      • 1-BS assumes that only one symbol can be replaced at a time. Given $(\hat z, S, y)$, it searches down the reasoning tree from the root node $S$ toward the leaves, looking for the most probable correction $z^{\ast}$ at one-step distance.
      • During the search, maintain a priority queue of 3-tuples $(A, \alpha_A, p)$: the current node, its expected value (if the value of $A$ were changed to $\alpha_A$, $\hat z$ would execute to the ground-truth answer $y$), and the visiting priority (a probability).
      • When expanding a node, enumerate each child $B$ of $A$ and compute the expected value $\alpha_B$ such that $f(\hat z(B \to \alpha_B); \Delta) = f(\hat z(A \to \alpha_A); \Delta) = y$, then push the new tuple $(B, \alpha_B, p(B \to \alpha_B))$.
      • $$p\left(A \rightarrow \alpha_{A}\right)=\left\{\begin{array}{ll} \frac{1-p(A)}{p(A)}, & \text { if } A \notin \Sigma \\ \frac{p\left(\alpha_{A}\right)}{p(A)}, & \text { if } A \in \Sigma \text { and } \alpha_{A} \in \Sigma \end{array}\right.$$
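A hedged sketch of this priority-queue search. The tree node interface (`children`), the `expected_value` solver, and `is_leaf` are assumptions for illustration; the visiting priority is the $p(A \to \alpha_A)$ defined above:

```python
import heapq

def one_step_back_search(root, y, priority, expected_value, is_leaf):
    """Illustrative 1-BS: search down the reasoning tree of z_hat for the single
    leaf change A -> alpha_A that makes the execution output the ground truth y.

    priority(A, a)           : p(A -> alpha_A), the visiting priority
    expected_value(B, A, a)  : the value alpha_B child B must take so that the
                               parent A evaluates to alpha_A (assumed solver;
                               returns None if no consistent value exists)
    is_leaf(A)               : whether A is a terminal symbol that can be edited
    """
    heap = [(-priority(root, y), 0, root, y)]
    counter = 1                               # tie-breaker so nodes never compare
    while heap:
        _, _, node, alpha = heapq.heappop(heap)
        if is_leaf(node):
            return node, alpha                # most probable one-step correction
        for child in node.children:           # assumed attribute
            alpha_child = expected_value(child, node, alpha)
            if alpha_child is None:
                continue
            heapq.heappush(heap, (-priority(child, alpha_child), counter,
                                  child, alpha_child))
            counter += 1
    return None                               # no one-step correction exists
```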

    • MAXIMUM LIKELIHOOD ESTIMATION

      • For a pair $(x,y)$, we know $p(y \mid x)=\sum_{z} p(y, z \mid x)=\sum_{z} p(y \mid z)\, p_{\theta}(z \mid x)$
      • The learning goal is to maximize $L(x, y) = \log p(y|x)$. Taking the derivative:

        $$\begin{aligned} \nabla_{\theta} L(x, y) &=\nabla_{\theta} \log p(y \mid x) \\ &=\frac{1}{p(y \mid x)} \nabla_{\theta} p(y \mid x) \\ &=\sum_{z} \frac{p(y \mid z) p_{\theta}(z \mid x)}{\sum_{z^{\prime}} p\left(y \mid z^{\prime}\right) p_{\theta}\left(z^{\prime} \mid x\right)} \nabla_{\theta} \log p_{\theta}(z \mid x) \\ &=\mathbb{E}_{z \sim p(z \mid x, y)}\left[\nabla_{\theta} \log p_{\theta}(z \mid x)\right] \end{aligned}$$

      • Computing the posterior distribution $p(z|x, y)$ exactly is not efficient.
      • We only need to consider $Q=\{z : p(y|z)=1\}$, which is usually a small part of the whole space.
      • So we want to sample from $Q$ efficiently using Markov chain Monte Carlo.
    • m-BS AS METROPOLIS-HASTINGS SAMPLER

      • Use the Metropolis-Hastings algorithm to sample from the target distribution.
      • Detailed balance $\pi(s)Q(s'\mid s)=\pi(s')Q(s\mid s')$ is a sufficient condition for a stationary distribution. To construct a valid transition kernel we use an acceptance-rejection scheme: choose an acceptance probability $A(s,s') \propto \pi(s')Q(s\mid s')$ so that $\pi(s)Q(s'\mid s)A(s,s')=\pi(s')Q(s\mid s')A(s',s)$.
      • $A\left(s, s^{\prime}\right)=\min \left\{1, \frac{\pi\left(s^{\prime}\right) Q\left(s \mid s^{\prime}\right)}{\pi(s) Q\left(s^{\prime} \mid s\right)}\right\}$

      • Assume that the proposal distribution of the random walk is Poisson in the edit distance: $g(z_1|z_2) = \mathrm{Poisson}(d(z_1, z_2); \beta)$, where $\beta$ is usually set to $1$.
      • Since $d(\cdot,\cdot)$ is symmetric, the proposal terms cancel, giving $A(z,z')=\min \left\{1, \frac{p_{\theta}\left(z' \mid x\right)}{p_{\theta}\left(z \mid x\right)}\right\}$
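A minimal sketch of this acceptance step, assuming a user-supplied `propose` move (the random-walk edit whose distance follows the Poisson above) and a `log_p` function returning $\log p_\theta(z \mid x)$; only the ratio matters, so it is computed in log space:

```python
import math
import random

def mh_accept(log_p_current, log_p_proposal):
    """Metropolis-Hastings acceptance for a symmetric proposal:
    A(z, z') = min(1, p_theta(z'|x) / p_theta(z|x)), evaluated in log space."""
    accept_prob = math.exp(min(0.0, log_p_proposal - log_p_current))
    return random.random() < accept_prob

def m_bs_step(z, log_p, propose):
    """One sampling step of the random-walk sampler (sketch)."""
    z_new = propose(z)                 # assumed random-walk edit of z
    return z_new if mh_accept(log_p(z), log_p(z_new)) else z
```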
    • Algorithm for 1-BS and m-BS

    • COMPARISON WITH POLICY GRADIENT

      • Treat $p_\theta(z|x)$ as the policy function; the reward given $z, y$ can be written as

        $$r(z, y)=\left\{\begin{array}{l} 0, \text { if } f(z ; \Delta) \neq y \\ 1, \text { if } f(z ; \Delta)=y \end{array}\right.$$

      • Similarly, $\nabla_{\theta} R(x, y) =\mathbb{E}_{z \sim p_{\theta}(z \mid x)}\left[r(z, y) \nabla_{\theta} \log p_{\theta}(z \mid x)\right]$

      • $\nabla_\theta=\left\{\begin{array}{ll} 0, & \text { if } f(z ; \Delta) \neq y \\ \nabla_{\theta} \log p_{\theta}(z \mid x), & \text { if } f(z ; \Delta)=y \end{array}\right.$

      • Among the whole space of $z$, only a very small portion can generate the desired $y$, which implies that REINFORCE gets zero gradients from most samples.
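For contrast, a tiny framework-free sketch of this 0/1-reward REINFORCE estimator, assuming `log_prob_grad(z)` returns $\nabla_\theta \log p_\theta(z\mid x)$ as a NumPy vector and `f` is the symbolic executor:

```python
import numpy as np

def reinforce_gradient(samples, y, f, log_prob_grad):
    """Monte-Carlo REINFORCE estimate with the 0/1 reward above: samples whose
    execution result differs from y contribute nothing, which is exactly why
    the gradient signal is sparse when few z execute to y."""
    grads = [log_prob_grad(z) for z in samples if f(z) == y]
    if not grads:
        return None                        # all rewards are zero: no signal
    return np.mean(grads, axis=0) * (len(grads) / len(samples))
```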