Chain-of-Sketch: Enabling Global Visual Reasoning

Modern vision models have achieved remarkable success on benchmarks where local features provide critical information about the target. There is now growing interest in tackling tasks that require more global reasoning, where local features do not provide significant information. Minsky and Papert put forward such tasks in 1969 with their connectivity study, exposing the limitations of the perceptron model. In this paper, we introduce an expanded set of global visual datasets involving graphs, strings, mazes, and image grids. We show that large vision models still struggle to learn these tasks efficiently. Similarly, state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain this learning inefficiency by means of the 'globality degree' measure. To mitigate this, we propose a method called Chain-of-Sketch (CoS). Similar to the chain-of-thought and scratchpad techniques used in language models, CoS breaks the original task into intermediate visual steps to help learn a complex task. In addition, we show that not all CoS strategies perform equally well. Our key insight is to impose a Markovian structure on the CoS frames. This leads to the introduction of 'inductive CoS', which achieves better out-of-distribution generalization and performs well even with smaller models compared to non-inductive variants.
- † Microsoft AI
- ** Work done while at Apple
- ‡ Equal contribution
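
The Markovian constraint on the intermediate frames mentioned in the abstract can be illustrated with a toy connectivity task. The following is a minimal sketch under stated assumptions, not the method of this paper: the grid encoding, the hand-written `step` rule (standing in for a learned model), and the names `step`, `rollout`, and `connected` are all hypothetical. It shows only the structural idea that each frame is computed from the previous frame alone, and that the same update is applied repeatedly until the sketch stops changing.

```python
from typing import List, Tuple

# Assumed encoding (hypothetical): 0 = wall, 1 = free cell, 2 = colored (reached) cell.
Frame = List[List[int]]


def step(frame: Frame) -> Frame:
    """One Markovian update: the next frame depends only on the current frame.

    A free cell becomes colored if any of its four neighbors is already colored.
    This hand-written rule is a stand-in for a learned frame-to-frame model.
    """
    rows, cols = len(frame), len(frame[0])
    nxt = [row[:] for row in frame]
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] != 1:
                continue
            neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if any(0 <= i < rows and 0 <= j < cols and frame[i][j] == 2
                   for i, j in neighbors):
                nxt[r][c] = 2
    return nxt


def rollout(frame: Frame, max_steps: int = 100) -> List[Frame]:
    """Apply `step` repeatedly, stopping once a frame no longer changes."""
    frames = [frame]
    for _ in range(max_steps):
        nxt = step(frames[-1])
        if nxt == frames[-1]:
            break
        frames.append(nxt)
    return frames


def connected(frame: Frame, target: Tuple[int, int]) -> bool:
    """Read the answer off the final frame: was the target cell reached?"""
    final = rollout(frame)[-1]
    r, c = target
    return final[r][c] == 2


maze = [
    [2, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
print(connected(maze, (1, 1)))  # target reachable from the colored start cell
print(connected(maze, (2, 2)))  # target walled off from the start cell
```

Because `step` never looks at earlier frames, the number of rollout steps can grow with the size of the input without changing the update itself, which is the intuition behind applying a single local update many times to solve a global task.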



