Know What You Don't Know: A Dataset of Unanswerable Questions

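Do visual tasks have a relationship, or are they unrelated? For instance, could surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying that a structure exists among visual tasks. Knowing this structure has notable value: it is the concept underlying transfer learning, and it provides a principled way to identify redundancies across tasks, e.g. to seamlessly reuse supervision among related tasks or to solve many tasks in one system without piling up complexity. We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done by finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. the nontrivial relationships that emerge, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed to solve a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure, including a solver that users can employ to devise efficient supervision policies for their use cases.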

Taskonomy: Disentangling Task Transfer Learning

Amir R. Zamir1,2, Alexander Sax1∗, William Shen1∗, Leonidas Guibas1, Jitendra Malik2, Silvio Savarese1
1 Stanford University, 2 University of California, Berkeley
http://taskonomy.vision/
arXiv:1804.08328v1 [cs.CV] 23 Apr 2018
∗ Equal.

Abstract

Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity.

We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.

Figure 1: A sample task structure discovered by the computational task taxonomy (taskonomy). It found that, for instance, by combining the learned features of a surface normal estimator and occlusion edge detector, good networks for reshading and point matching can be rapidly trained with little labeled data.

1. Introduction

Object recognition, depth estimation, edge detection, pose estimation, etc. are examples of common vision tasks deemed useful and tackled by the research community. Some of them have rather clear relationships: we understand that surface normals and depth are related (one is a derivative of the other), or that vanishing points in a room are useful for orientation. Other relationships are less clear, e.g. how keypoint detection and the shading in a room can, together, perform pose estimation.

The field of computer vision has indeed gone far without explicitly using these relationships. We have made remarkable progress by developing advanced learning machinery (e.g. ConvNets) capable of finding complex mappings from $X$ to $Y$ when many pairs of $(x, y)$ s.t. $x \in X$, $y \in Y$ are given as training data. This is usually referred to as fully supervised learning and often leads to problems being solved in isolation. Siloing tasks makes training a new task or a comprehensive perception system a Sisyphean challenge, whereby each task needs to be learned individually from scratch. Doing so ignores their quantifiably useful relationships, leading to a massive labeled data requirement.

Alternatively, a model aware of the relationships among tasks demands less supervision, uses less computation, and behaves in more predictable ways. Incorporating such a structure is the first stepping stone towards developing provably efficient comprehensive/universal perception models [34, 4], i.e. ones that can solve a large set of tasks before becoming intractable in supervision or computation demands. However, this task space structure and its effects are still largely unknown. The relationships are non-trivial, and finding them is complicated by the fact that we have imperfect learning models and optimizers. In this paper, we attempt to shed light on this underlying structure and present a framework for mapping the space of visual tasks. Here what we mean by “structure” is a collection of computationally found relations specifying which tasks supply useful information to another, and by how much (see Fig. 1).

We employ a fully computational approach for this purpose, with neural networks as the adopted computational function class. In a feedforward network, each layer successively forms more abstract representations of the input containing the information needed for mapping the input to the output. These representations, however, can transmit statistics useful for solving other outputs (tasks), presumably if the tasks are related in some form [83, 19, 58, 46]. This is the basis of our approach: we compute an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily read out of the representation trained for another task. Such transfers are exhaustively sampled, and a Binary Integer Programming formulation extracts a globally efficient transfer policy from them. We show this model leads to solving tasks with far less data than learning them independently and the resulting structure holds on common datasets (ImageNet [78] and Places [104]).

Being fully computational and representation-based, the proposed approach avoids imposing prior (possibly incorrect) assumptions on the task space. This is crucial because the priors about task relations are often derived from either human intuition or analytical knowledge, while neural networks need not operate on the same principles [63, 33, 40, 45, 102, 88]. For instance, although we might expect depth to transfer to surface normals better (derivatives are easy), the opposite is found to be the better direction in a computational framework (i.e. it suited neural networks better).

An interactive taxonomy solver which uses our model to suggest data-efficient curricula, a live demo, dataset, and code are available at http://taskonomy.vision/.

2. Related Work

Assertions of existence of a structure among tasks date back to the early years of modern computer science, e.g. with Turing arguing for using learning elements [95, 98] rather than the final outcome or Jean Piaget’s works on developmental stages using previously learned stages as sources [74, 39, 38], and have extended to recent works [76, 73, 50, 18, 97, 61, 11, 66]. Here we make an attempt to actually find this structure. We acknowledge that this is related to a breadth of topics, e.g. compositional modeling [35, 10, 13, 23, 55, 92, 90], homomorphic cryptography [42], lifelong learning [93, 15, 85, 84], functional maps [71], certain aspects of Bayesian inference and Dirichlet processes [54, 91, 90, 89, 37, 39], few-shot learning [81, 25, 24, 70, 86], transfer learning [75, 84, 29, 64, 67, 59], and un/semi/self-supervised learning [22, 8, 17, 103, 19, 83], which are studied across various fields [73, 94, 12]. We review the topics most pertinent to vision within the constraints of space:

Self-supervised learning methods leverage the inherent relationships between tasks to learn a desired expensive one (e.g. object detection) via a cheap surrogate (e.g. colorization) [68, 72, 17, 103, 100, 69]. Specifically, they use a manually-entered local part of the structure in the task space (as the surrogate task is manually defined). In contrast, our approach models this large space of tasks in a computational manner and can discover obscure relationships.

Unsupervised learning is concerned with the redundancies in the input domain and leveraging them for forming compact representations, which are usually agnostic to the downstream task [8, 49, 20, 9, 32, 77]. Our approach is not unsupervised by definition as it is not agnostic to the tasks. Instead, it models the space tasks belong to and in a way utilizes the functional redundancies among tasks.

Meta-learning generally seeks performing the learning at a level higher than where conventional learning occurs, e.g. as employed in reinforcement learning [21, 31, 28], optimization [2, 82, 48], or certain architectural mechanisms [27, 30, 87, 65]. The motivation behind meta-learning has similarities to ours and our outcome can be seen as a computational meta-structure of the space of tasks.

Multi-task learning targets developing systems that can provide multiple outputs for an input in one run [50, 18]. Multi-task learning has experienced recent progress and the reported advantages are another support for existence of a useful structure among tasks [93, 100, 50, 76, 73, 50, 18, 97, 61, 11, 66]. Unlike multi-task learning, we explicitly model the relations among tasks and extract a meta-structure. The large number of tasks we consider also makes developing one multi-task network for all infeasible.

Domain adaptation seeks to render a function that is developed on a certain domain applicable to another [44, 99, 5, 80, 52, 26, 36]. It often addresses a shift in the input domain, e.g. webcam images to D-SLR [47], while the task is kept the same. In contrast, our framework is concerned with output (task) space, hence can be viewed as task/output adaptation. We also perform the adaptation in a larger space among many elements, rather than two or a few.

In the context of our approach to modeling transfer learning across tasks:

Learning Theoretic approaches may overlap with any of the above topics and usually focus on providing generalization guarantees. They vary in their approach: e.g. by modeling transferability with the transfer family required to map a hypothesis for one task onto a hypothesis for another [7], through information-based approaches [60], or through modeling inductive bias [6].

For these guarantees, learning theoretic approaches usually rely on intractable computations, or avoid such computations by restricting the model or task. Our method draws inspiration from theoretical approaches but eschews (for now) theoretical guarantees in order to use modern neural machinery.

3. Method

We define the problem as follows: we want to maximize the collective performance on a set of tasks $T = \{t_1, \dots, t_n\}$, subject to the constraint that we have a limited supervision budget $\gamma$ (due to financial, computational, or time constraints). We define our supervision budget $\gamma$ to be the maximum allowable number of tasks that we are willing to train from scratch (i.e. source tasks). The task dictionary is defined as $V = T \cup S$ where $T$ is the set of tasks which we want solved (target), and $S$ is the set of tasks that can be trained (source). Therefore, $T - T \cap S$ are the tasks that we want solved but cannot train (“target-only”), $T \cap S$ are the tasks that we want solved but could play as source too, and $S - T \cap S$ are the “source-only” tasks which we may not directly care about solving (e.g. jigsaw puzzle) but can be optionally used if they increase the performance on $T$.

The task taxonomy (taskonomy) is a computationally found directed hypergraph that captures the notion of task transferability over any given task dictionary. An edge between a group of source tasks and a target task represents a feasible transfer case and its weight is the prediction of its performance. We use these edges to estimate the globally optimal transfer policy to solve $T$. Taxonomy produces a family of such graphs, parameterized by the available supervision budget, chosen tasks, transfer orders, and transfer functions’ expressiveness.

Taxonomy is built using a four-step process depicted in Fig. 2. In stage I, a task-specific network for each task in $S$ is trained. In stage II, all feasible transfers between sources and targets are trained. We include higher-order transfers which use multiple input tasks to transfer to one target. In stage III, the task affinities acquired from transfer function performances are normalized, and in stage IV, we synthesize a hypergraph which can predict the performance of any transfer policy and optimize for the optimal one.

Figure 2: Computational modeling of task relations and creating the taxonomy. From left to right: I. Train task-specific networks. II. Train (first order and higher) transfer functions among tasks in a latent space. III. Get normalized transfer affinities using AHP (Analytic Hierarchy Process). IV. Find global transfer taxonomy using BIP (Binary Integer Program).

Figure 3: Task Dictionary. Outputs of 24 (of 26) task-specific networks for a query (top left). A video with the networks applied frame-wise is linked in the original paper.

A vision task is an abstraction read from a raw image. We denote a task $t$ more formally as a function $f_t$ which maps image $I$ to $f_t(I)$. Our dataset, $D$, contains for each task $t$ a set of training pairs $(I, f_t(I))$, e.g. (image, depth).
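The partition of the task dictionary described above can be made concrete with plain set operations. The following is a minimal sketch, not the authors' code: the task names and the function `partition_dictionary` are hypothetical and chosen only for illustration; the set relations ($V = T \cup S$, target-only, source-only) are the ones stated in the text.

```python
# Hypothetical example tasks; only the set relations come from the text.
targets = {"normals", "depth", "reshading", "layout"}                  # T: tasks we want solved
sources = {"normals", "depth", "reshading", "jigsaw", "colorization"}  # S: tasks we can train


def partition_dictionary(targets, sources):
    """Split the dictionary V = T ∪ S into the groups named in the text."""
    both = targets & sources
    return {
        "dictionary": targets | sources,   # V = T ∪ S
        "target_only": targets - both,     # T − T∩S: wanted, but cannot be trained
        "source_and_target": both,         # T∩S: wanted, and usable as sources
        "source_only": sources - both,     # S − T∩S: only useful as sources
    }


groups = partition_dictionary(targets, sources)
for name in sorted(groups):
    print(name, sorted(groups[name]))
```

In this toy dictionary the supervision budget $\gamma$ would bound how many members of `sources` may be trained from scratch; the tasks in `target_only` must always be reached through a transfer.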

Task Dictionary: Our mapping of task space is done via (26) tasks included in the dictionary, so we ensure they cover common themes in computer vision (2D, 3D, semantics, etc.) to elucidate the fine-grained structures of task space. See Fig. 3 for some of the tasks, with a detailed definition of each task provided in the supplementary material. We include tasks with various levels of abstraction, ranging from solvable by a simple kernel convolved over the image (e.g. edge detection) to tasks requiring basic understanding of scene geometry (e.g. vanishing points) and more abstract ones involving semantics (e.g. scene classification).

It is critical to note the task dictionary is meant to be a sampled set, not an exhaustive list, from a denser space of all conceivable visual tasks. Sampling gives us a tractable way to sparsely model a dense space, and the hypothesis is that (subject to a proper sampling) the derived model should generalize to out-of-dictionary tasks. The more regular / better sampled the space, the better the generalization. We evaluate this in Sec. 4.2 with supportive results. For evaluation of the robustness of results w.r.t. the choice of dictionary, see the supplementary material.

Dataset: We need a dataset that has annotations for every task on every image. Training all of our tasks on exactly the same pixels eliminates the possibility that the observed transferabilities are affected by different input data peculiarities rather than only task intrinsics. There has not been such a dataset of scale made of real images, so we created a dataset of 4 million images of indoor scenes from about 600 buildings; every image has an annotation for every task. The images are registered on and aligned with building-wide meshes similar to [3, 101, 14], enabling us to programmatically compute the ground truth for many tasks without human labeling. For the tasks that still require labels (e.g. scene classes), we generate them using Knowledge Distillation [43] from known methods [104, 57, 56, 78]. See the supplementary material for full details of the process and a user study on the final quality of labels generated using Knowledge Distillation (showing < 7% error).

3.1. Step I: Task-Specific Modeling

We train a fully supervised task-specific network for each task in $S$. Task-specific networks have an encoder-decoder architecture homogeneous across all tasks, where the encoder is large enough to extract powerful representations, and the decoder is large enough to achieve a good performance but is much smaller than the encoder.

3.2. Step II: Transfer Modeling

Given a source task $s$ and a target task $t$, where $s \in S$ and $t \in T$, a transfer network learns a small readout function for $t$ given a statistic computed for $s$ (see Fig. 4). The statistic is the representation for image $I$ from the encoder of $s$: $E_s(I)$. The readout function ($D_{s \to t}$) is parameterized by $\theta_{s \to t}$ minimizing the loss $L_t$:

$$D_{s \to t} := \arg\min_{\theta} \; \mathbb{E}_{I \in D} \left[ L_t\left( D_{\theta}\left(E_s(I)\right),\, f_t(I) \right) \right], \qquad (1)$$

where $f_t(I)$ is the ground truth of $t$ for image $I$. $E_s(I)$ may or may not be sufficient for solving $t$ depending on the relation between $t$ and $s$ (examples in Fig. 5). Thus, the performance of $D_{s \to t}$ is a useful metric as task affinity. We train transfer functions for all feasible source-target combinations.

Figure 4: Transfer Function. A small readout function is trained to map representations of the source task’s frozen encoder to the target task’s labels. If order > 1, the transfer function receives representations from multiple sources.

Figure 5: Transfer results to normals (upper) and 2.5D Segmentation (lower) from 5 different source tasks, using 2k training images. The spread in transferability among different sources is apparent, with reshading among the top-performing ones in this case. Task-specific networks were trained on 60x more data. “Scratch” was trained from scratch without transfer learning.

Accessibility: For a transfer to be successful, the latent representation of the source should both be inclusive of sufficient information for solving the target and have the information accessible, i.e. easily extractable (otherwise, the raw image or its compression-based representations would be optimal). Thus, it is crucial for us to adopt a low-capacity (small) architecture as transfer function trained with a small amount of data, in order to measure transferability conditioned on being highly accessible. We use a shallow fully convolutional network and train it with little data (8x to 120x less than task-specific networks).

Higher-Order Transfers: Multiple source tasks can contain complementary information for solving a target task (see examples in Fig. 6). We include higher-order transfers which are the same as first order but receive multiple representations in the input. Thus, our transfers are functions $D : \wp(S) \to T$, where $\wp$ is the powerset operator.

As there is a combinatorial explosion in the number of feasible higher-order transfers ($|T| \times \binom{|S|}{k}$ for $k$-th order), we employ a sampling procedure with the goal of filtering out higher-order transfers that are less likely to yield good results, without training them.

We use a beam search: for transfers of order $k \le 5$ to a target, we select its 5 best sources (according to 1st order performances) and include all of their order-$k$ combinations. For $k \ge 5$, we use a beam of size 1 and compute the transfer from the top $k$ sources.

Figure 6: Higher-Order Transfers. Representations can contain complementary information. E.g. by transferring simultaneously from 3D Edges and Curvature, individual stairs were brought out. Panels: Image, GT (Normals), Fully Supervised, {Occlusion Edges + Curvature} = 2nd order transfer; Image, GT (Reshade), Fully Supervised, {3D Keypoints + Surface Normals} = 2nd order transfer. See our publicly available interactive transfer visualization page for more examples.

Transitive Transfers: We examined if transitive task transfers ($s \to t_1 \to t_2$) could improve the performance over their direct counterpart ($s \to t_2$), but found that the two had equal performance in almost all cases in both high-data and low-data scenarios. The experiment is provided in the supplementary material. Therefore, we need not consider the cases where branching would be more than one level deep when searching for the optimal transfer path.

3.3. Step III: Ordinal Normalization using Analytic Hierarchy Process (AHP)

We want to have an affinity matrix of transferabilities across tasks. Aggregating the raw losses/evaluations $L_{s \to t}$ from transfer functions into a matrix is obviously problematic as they have vastly different scales and live in different spaces (see Fig. 7, left). Hence, a proper normalization is needed. A naive solution would be to linearly rescale each row of the matrix to the range [0, 1]. This approach fails when the actual output quality increases at different speeds w.r.t. the loss. As the loss-quality curve is generally unknown, such approaches to normalization are ineffective.

Instead, we use an ordinal approach in which the output quality and loss are only assumed to change monotonically. For each $t$, we construct $W_t$, a pairwise tournament matrix between all feasible sources for transferring to $t$. The element at $(i, j)$ is the percentage of images in a held-out test set, $D_{test}$, on which $s_i$ transferred to $t$ better than $s_j$ did (i.e. $D_{s_i \to t}(I) > D_{s_j \to t}(I)$).

We clip this intermediate pairwise matrix $W_t$ to be in [0.001, 0.999] as a form of Laplace smoothing. Then we divide $W_t = W_t / W_t^T$ (element-wise) so that the matrix shows how many times better $s_i$ is compared to $s_j$. The final tournament ratio matrix is positive reciprocal with each element $w_{i,j}$ of $W_t$:

$$w_{i,j} = \frac{\mathbb{E}_{I \in D_{test}}\left[ D_{s_i \to t}(I) > D_{s_j \to t}(I) \right]}{\mathbb{E}_{I \in D_{test}}\left[ D_{s_i \to t}(I) < D_{s_j \to t}(I) \right]}. \qquad (2)$$

We quantify the final transferability of $s_i$ to $t$ as the corresponding ($i$-th) component of the principal eigenvector of $W_t$ (normalized to sum to 1). The elements of the principal eigenvector are a measure of centrality, and are proportional to the amount of time that an infinite-length random walk on $W_t$ will spend at any given source [62]. We stack the principal eigenvectors of $W_t$ for all $t \in T$, to get an affinity matrix $P$ (‘p’ for performance); see Fig. 7, right.

This approach is derived from Analytic Hierarchy Process [79], a method widely used in operations research to create a total order based on multiple pairwise comparisons.

Figure 7: First-order task affinity matrix before (left) and after (right) Analytic Hierarchy Process (AHP) normalization. Lower means better transferred. For visualization, we use the standard affinity-distance method $dist = e^{-\beta \cdot P}$ (where $\beta = 20$ and $e$ is the element-wise matrix exponential). See supplementary material for the full matrix with higher-order transfers.

3.4. Step IV: Computing the Global Taxonomy

Given the normalized task affinity matrix, we need to devise a global transfer policy which maximizes collective performance across all tasks, while minimizing the used supervision. This problem can be formulated as subgraph selection where tasks are nodes and transfers are edges. The optimal subgraph picks the ideal source nodes and the best edges from these sources to targets while satisfying that the number of source nodes does not exceed the supervision budget. We solve this subgraph selection problem using Boolean Integer Programming (BIP), described below, which can be solved optimally and efficiently [41, 16].

Our transfers (edges), $E$, are indexed by $i$ with the form $(\{s^i_1, \dots, s^i_{m_i}\}, t^i)$ where $\{s^i_1, \dots, s^i_{m_i}\} \subset S$ and $t^i \in T$. We define operators returning target and sources of an edge:

$$\left(\{s^i_1, \dots, s^i_{m_i}\}, t^i\right) \xrightarrow{\;sources\;} \{s^i_1, \dots, s^i_{m_i}\}$$
$$\left(\{s^i_1, \dots, s^i_{m_i}\}, t^i\right) \xrightarrow{\;target\;} t^i.$$

Solving a task $t$ by fully supervising it is denoted as $(\{t\}, t)$. We also index the targets $T$ with $j$ so that in this section, $i$ is an edge and $j$ is a target.

The parameters of the problem are: the supervision budget ($\gamma$) and a measure of performance on a target from each of its transfers ($p_i$), i.e. the affinities from $P$. We can also optionally include additional parameters of: $r_j$ specifying the relative importance of each target task and $\ell_i$ specifying the relative cost of acquiring labels for each task.
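The ordinal normalization of Sec. 3.3 can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `ahp_affinities`, the toy win fractions, the diagonal value 0.5, and the use of power iteration to approximate the principal eigenvector are all choices made here. The clipping range and the element-wise division by the transpose follow the text.

```python
def clip(value, low=0.001, high=0.999):
    """Clip a win fraction into [0.001, 0.999], as described in the text."""
    return max(low, min(high, value))


def ahp_affinities(win_fraction, iterations=200):
    """Turn a pairwise tournament matrix for one target into source affinities.

    win_fraction[i][j] is the fraction of held-out images on which source i
    transferred to the target better than source j did.
    """
    n = len(win_fraction)
    clipped = [[clip(v) for v in row] for row in win_fraction]
    # Element-wise W / W^T: how many times better source i is than source j.
    ratio = [[clipped[i][j] / clipped[j][i] for j in range(n)] for i in range(n)]
    # Power iteration (an assumption of this sketch) converges to the
    # principal eigenvector because every entry of `ratio` is positive.
    vector = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(ratio[i][j] * vector[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        vector = [p / total for p in product]  # keep it normalized to sum to 1
    return vector


# Toy numbers for three hypothetical sources; the diagonal is set to 0.5
# (an assumption) so that every source is "1x as good" as itself.
win_fraction = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
affinities = ahp_affinities(win_fraction)
print([round(a, 3) for a in affinities])
```

Stacking one such vector per target gives a matrix that plays the role of $P$ in the text. Because only the ordering of per-image results enters `win_fraction`, the outcome does not depend on the scale of each task's loss.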

6. The BIP is parameterized by a vector x where each trans- Task avg rand Task avg rand Task avg rand fer and each task is represented by a binary variable; x indi- Denoising 100 99.9 Layout 99.6 89.1 Scene Class. 97.0 93.4 Autoenc. 100 99.8 2D Edges 100 99.9 Occ. Edges 100 95.4 cates which nodes are picked to be source and which trans- Reshading 94.9 95.2 Pose (fix) 76.3 79.5 Pose (nonfix) 60.2 61.9 fers are selected. The canonical form for a BIP is: Inpainting 99.9 - 2D Segm. 97.7 95.7 2.5D Segm. 94.2 89.4 maximize cT x , Curvature 78.7 93.4 Matching 86.8 84.6 Egomotion 67.5 72.3 Normals 99.4 99.5 Vanishing 99.5 96.4 2D Keypnt. 99.8 99.4 subject to Ax b Z-Depth 92.3 91.1 Distance 92.4 92.1 3D Keypnt. 96.0 96.9 and x ∈ {0, 1}|E|+|V| . Mean 92.4 90.9 Each element ci for a transfer is the product of the im- Table 1: Task-Specific Networks’ Sanity: Win rates vs. random (Gaus- portance of its target task and its transfer performance: sian) network representation readout and statistically informed guess avg. ci := rtarget(i) · pi . (3) Hence, the collective performance on all targets is the sum- website allows the user to specify any desired partition. mation of their individual AHP performance, pi , weighted Network Architectures: We preserved the architectural by the user specified importance, ri . and training details across tasks as homogeneously as possi- Now we add three types of constraints via matrix A to ble to avoid injecting any bias. The encoder architecture is enforce each feasible solution of the BIP instance corre- identical across all task-specific networks and is a fully con- sponds to a valid subgraph for our transfer learning prob- volutional ResNet-50 without pooling. All transfer func- lem: Constraint I: if a transfer is included in the subgraph, tions include identical shallow networks with 2 conv layers all of its source nodes/tasks must be included too, Con- (concatenated channel-wise if higher-order). 
The loss (Lt ) straint II: each target task has exactly one transfer in, Con- and decoder’s architecture, though, have to depend on the straint III: supervision budget is not exceeded. task as the output structures of different tasks vary; for all Constraint I: For each row ai in A we require ai · x ≤ bi , pixel-to-pixel tasks, e.g. normal estimation, the decoder is a where  15-layer fully convolutional network; for low dimensional |sources(i)| if k = i  tasks, e.g. vanishing points, it consists of 2-3 FC layers. ai,k = −1 if (k − |E|) ∈ sources(i) (4) All networks are trained using the same hyperparameters regardless of task and on exactly the same input images.  0 otherwise  Tasks with more than one input, e.g. relative camera pose, bi = 0. (5) share weights between the encoder towers. Transfer net- Constraint II: Via the row a|E|+j , we enforce that each works are all trained using the same hyperparameters as the target has exactly one transfer: task-specific networks, except that we anneal the learning a|E|+j,i := 2 · ✶{target(i)=j} , b|E|+j := −1. (6) rate earlier since they train much faster. Detailed definitions Constraint III: the solution is enforced to not exceed the of architectures, training process, and experiments with dif- budget. Each transfer i is assigned a label cost i , so ferent encoders can be found in the supplementary material. a|E|+|V|+1,i := i , b|E|+|V|+1 := γ. (7) Data Splits: Our dataset includes 4 million images. We The elements of A not defined above are set to 0. The made publicly available the models trained on full dataset, problem is now a valid BIP and can be optimally solved in but for the experiments reported in the main paper, we a fraction of a second [41]. The BIP solution x ˆ corresponds used a subset of the dataset as the extracted structure stabi- to the optimal subgraph, which is our taxonomy. lized and did not change when using more data (explained in Sec. 5.2). 
4. Experiments

With 26 tasks in the dictionary (4 source-only tasks), our approach leads to training 26 fully supervised task-specific networks, 22 × 25 transfer networks in 1st order, and $22 \times \binom{25}{k}$ for $k$-th order, from which we sample according to the procedure in Sec. 3. The total number of transfer functions trained for the taxonomy was ∼3,000, which took 47,886 GPU hours on the cloud.

Out of 26 tasks, we usually use the following 4 as source-only tasks (described in Sec. 3) in the experiments: colorization, jigsaw puzzle, in-painting, random projection. However, the method is applicable to an arbitrary partitioning of the dictionary into T and S. The interactive solver

The used subset is partitioned into training (120k), validation (16k), and test (17k) images, each from non-overlapping sets of buildings. Our task-specific networks are trained on the training set and the transfer networks are trained on a subset of the validation set, ranging from 1k images to 16k, in order to model the transfer patterns under different data regimes. In the main paper, we report all results under the 16k transfer supervision regime (∼10% of the split) and defer the additional sizes to the supplementary material and website (see Sec. 5.2). Transfer functions are evaluated on the test set.

[Figure 8 panels: Supervision Budget 2, 8, 15, and 26 (columns); Transfer Order 1, 2, and 4 (rows); the Supervision Budget 8, Order 4 panel is shown zoomed.]

Figure 8: Computed taxonomies for solving 22 tasks given various supervision budgets (x-axes), and maximum allowed transfer orders (y-axes). One is magnified for better visibility. Nodes with incoming edges are target tasks, and the number of their incoming edges is the order of their chosen transfer function. Still transferring to some targets when the budget is 26 (full budget) means certain transfers started performing better than their fully supervised task-specific counterpart. See the interactive solver website for color coding of the nodes by Gain and Quality metrics. Dimmed nodes are the source-only tasks, and thus only participate in the taxonomy if the BIP optimization finds them worthwhile as sources.

How good are the trained task-specific networks? Win rate (%) is the proportion of test set images for which a baseline is beaten. Table 1 provides win rates of the task-specific networks vs. two baselines. Visual outputs for a random test sample are in Fig. 3. The high win rates in Table 1 and qualitative results show the networks are well trained and stable and can be relied upon for modeling the task space. See results of applying the networks on a YouTube video frame-by-frame here. A live demo for user uploaded queries is available here. To get a sense of the quality of our networks vs. state-of-the-art task-specific methods, we compared our depth estimator vs. released models of [53], which led to outperforming [53] with a win rate of 88% and losses of 0.35 vs. 0.47 (further details in the supplementary material).
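The win rate used above can be made concrete with a small sketch. The Python snippet below is illustrative only and is not the evaluation code behind the reported numbers: the function name `win_rate`, the use of per-image losses as the comparison signal, and the choice to count ties as non-wins are assumptions, and the loss values are made up.

```python
from typing import Sequence


def win_rate(candidate_losses: Sequence[float], baseline_losses: Sequence[float]) -> float:
    """Percentage of test images on which the candidate has a strictly lower loss.

    Assumption: both sequences are aligned per test image, and a tie is not a win.
    """
    if len(candidate_losses) != len(baseline_losses):
        raise ValueError("both models must be scored on the same test images")
    if not candidate_losses:
        raise ValueError("need at least one test image")
    wins = sum(c < b for c, b in zip(candidate_losses, baseline_losses))
    return 100.0 * wins / len(candidate_losses)


# Toy per-image losses (hypothetical values, not results from the paper).
ours = [0.30, 0.41, 0.25, 0.52, 0.33]
baseline = [0.45, 0.40, 0.47, 0.60, 0.50]
print(win_rate(ours, baseline))  # 80.0
```

Under this reading, a win rate of 50% means the two models are indistinguishable on a per-image basis, which is why a paired per-image comparison can be easier to interpret across tasks than raw loss values on different scales.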
In general, we found the task-specific networks to perform on par or better than state-of-the-art for many of the tasks, though we do not formally benchmark or claim this.

4.1. Evaluation of Computed Taxonomies

Fig. 8 shows the computed taxonomies optimized to solve the full dictionary, i.e. all tasks are placed in T and S (except for 4 source-only tasks that are in S only). This was done for various supervision budgets (columns) and maximum allowed order (rows) constraints. Still seeing transfers to some targets when the budget is 26 (full dictionary) means certain transfers became better than their fully supervised task-specific counterpart.

[Figure 9 panels: Gain (left) and Quality (right), each shown for max transfer order = 1 and max transfer order = 4; horizontal axis: supervision budget increase (→).]

Figure 9: Evaluation of taxonomy computed for solving the full task dictionary. Gain (left) and Quality (right) values for each task using the policy suggested by the computed taxonomy, as the supervision budget increases (→). Shown for transfer orders 1 and 4.

While Fig. 8 shows the structure and connectivity, Fig. 9 quantifies the results of taxonomy recommended transfer policies by two metrics of Gain and Quality, defined as:

Gain: win rate (%) against a network trained from scratch using the same training data as the transfer networks. That is, the best that could be done if transfer learning was not utilized. This quantifies the value gained by transferring.

Quality: win rate (%) against a fully supervised network trained with 120k images (gold standard).

Red (0) and Blue (1) represent outperforming the reference method on none and all of the test set images, respectively (so the transition Red→White→Blue is desirable. White (.5) represents equal performance to reference). Each column in Fig. 9 shows a supervision budget. As apparent, good results can be achieved even when the supervision budget is notably smaller than the number of solved tasks, and as the budget increases, results improve (expected). Results are shown for 2 maximum allowed orders.

4.2. Generalization to Novel Tasks

The taxonomies in Sec. 4.1 were optimized for solving all tasks in the dictionary. In many situations, a practitioner is interested in a single task which may not even be in the dictionary. Here we evaluate how the taxonomy transfers to a novel out-of-dictionary task with little data.

This is done in an all-for-one scenario where we put one task in T and all others in S. The task in T is target-only and has no task-specific network. Its limited data (16k) is used to train small transfer networks from the sources. This basically localizes where the target would be in the taxonomy.

Task         Metric        scratch  ImageNet[51]  Wang.[96]  Agrawal.[1]  Zamir.[100]  Zhang.[103]  Noroozi.[68]  full sup.  Taxonomy
Depth        win rate (%)  88       88            93         89           88           84           86            43         -
             loss          .03      .04           .04        .03          .04          .03          .03           .02        .02
Scene Cls.   win rate (%)  80       52            83         74           74           71           75            15         -
             loss          3.30     2.76          3.56       3.15         3.17         3.09         3.19          2.23       2.63
Sem. Segm.   win rate (%)  78       79            82         85           76           78           84            21         -
             loss          1.74     1.88          1.92       1.80         1.85         1.74         1.71          1.42       1.53
Object Cls.  win rate (%)  79       54            82         76           75           76           76            34         -
             loss          4.08     3.57          4.27       3.99         3.98         4.00         3.97          3.26       3.46
Normals      win rate (%)  97       98            98         98           98           97           97            6          -
             loss          .22      .30           .34        .28          .28          .23          .24           .12        .15
2.5D Segm.   win rate (%)  80       93            92         89           90           84           87            40         -
             loss          .21      .34           .34        .26          .29          .22          .24           .16        .17
Occ. Edges   win rate (%)  93       96            95         93           94           93           94            42         -
             loss          .16      .19           .18        .17          .18          .16          .17           .12        .13
Curvature    win rate (%)  88       94            89         85           88           92           88            29         -
             loss          .25      .28           .26        .25          .26          .26          .25           .21        .22
Egomotion    win rate (%)  79       78            83         77           76           74           71            59         -
             loss          8.60     8.58          9.26       8.41         8.34         8.15         7.94          7.32       6.85
Layout       win rate (%)  80       76            85         79           77           78           70            36         -
             loss          .66      .66           .85        .65          .65          .62          .54           .37        .41

Figure 10: Generalization to Novel Tasks. Each row shows a novel test task. Left: Gain and Quality values using the devised "all-for-one" transfer policies for novel tasks for orders 1-4. Right: Win rates (%) of the transfer policy over various self-supervised methods, ImageNet features, and scratch are shown in the colored rows. Note the large margin of win by taxonomy. The uncolored rows show corresponding loss values.

Fig. 10 (left) shows the Gain and Quality of the transfer policy found by the BIP for each task. Fig. 10 (right) compares the taxonomy suggested policy against some of the best existing self-supervised methods [96, 103, 68, 100, 1], ImageNet FC7 features [51], training from scratch, and a fully supervised network (gold standard).

The results in Fig. 10 (right) are noteworthy. The large win margin for taxonomy shows that carefully selecting transfer policies depending on the target is superior to fixed transfers, such as the ones employed by self-supervised methods. ImageNet features, which are the most popular off-the-shelf features in vision, are also outperformed by those policies. Additionally, though the taxonomy transfer policies lose to fully supervised networks (gold standard) in most cases, the results often get close, with win rates in the 40% range. These observations suggest the space has a rather predictable and strong structure. For graph visualization of the all-for-one taxonomy policies please see the supplementary material. The solver website allows generating the taxonomy for arbitrary sets of target-only tasks.

5. Significance Test of the Structure

The previous evaluations showed good transfer results in terms of Quality and Gain, but how crucial is it to use our taxonomy to choose smart transfers over just choosing any transfer? In other words, how significant/strong is the discovered structure of task space? Fig. 11 quantifies this by showing the performance of our taxonomy versus a large set of taxonomies with random connectivities. Our taxonomy outperformed all other connectivities by a large margin, signifying both the existence of a strong structure in the space and a good modeling of it by our approach. Complete experimental details are available in the supplementary material.

[Figure 11 plot: Taxonomy Significance Test; one panel per transfer order (order increases →); x-axis: Supervision Budget.]

Figure 11: Structure Significance. Our taxonomy compared with random transfer policies (random feasible taxonomies that use the maximum allowable supervision budget). Y-axis shows Quality or Gain, and X-axis is the supervision budget. Green and gray represent our taxonomy and random connectivities, respectively. Error bars denote 5th–95th percentiles.

5.1. Evaluation on MIT Places & ImageNet

To what extent are our findings dataset dependent, and would the taxonomy change if done on another dataset? We examined this by finding the ranking of all tasks for transferring to two target tasks of object classification and scene classification on our dataset. We then fine-tuned our task-specific networks on other datasets (MIT Places [104] for scene classification, ImageNet [78] for object classification) and evaluated them on their respective test sets and metrics. Fig. 12 shows how the results correlate with the taxonomy's ranking from our dataset. The Spearman's rho between the taxonomy ranking and the Top-1 ranking is 0.857 on Places and 0.823 on ImageNet, showing a notable correlation. See supplementary material for complete experimental details.

[Figure 12 panels: Transferring to ImageNet (Spearman's correlation = 0.823); Transferring to MIT Places (Spearman's correlation = 0.857); y-axis: Accuracy (Top-1 and Top-5).]

Figure 12: Evaluating the discovered structure on other datasets: ImageNet [78] (left) for object classification and MIT Places [104] (right) for scene classification. Y-axis shows accuracy on the external benchmark while bars on x-axis are ordered by taxonomy's predicted performance based on our dataset. A monotonically decreasing plot corresponds to preserving identical orders and perfect generalization.

5.2. Universality of the Structure

We employed a computational approach with various design choices. It is important to investigate how specific to those the discovered structure is. We did stability tests by computing the variance in our output when making changes in one of the following system choices: I. architecture of task-specific networks, II. architecture of transfer function networks, III. amount of data available for training transfer networks, IV. datasets, V. data splits, VI. choice of dictionary. Overall, despite injecting large changes (e.g. varying the size of training data of transfer functions by 16x, size and architecture of task-specific networks and transfer networks by 4x), we found the outputs to be remarkably stable, leading to almost no change in the output taxonomy computed on top. Detailed results and the experimental setup of each test are reported in the supplementary material.

5.3. Task Similarity Tree

Thus far we showed the task space has a structure, measured this structure, and presented its utility for transfer learning via devising transfer policies. This structure can be presented in other manners as well, e.g. via a metric of similarity across tasks. Figure 13 shows a similarity tree for the tasks in our dictionary. This is acquired from agglomerative clustering of the tasks based on their transferring-out behavior, i.e. using columns of normalized affinity matrix P as feature vectors for tasks. The tree shows how tasks would be hierarchically positioned w.r.t. each other when measured based on providing information for solving other tasks; the closer two tasks, the more similar their role in transferring to other tasks. Notice that the 3D, 2D, low dimensional geometric, and semantic tasks are found to cluster together using a fully computational approach, which matches the intuitive expectations from the structure of task space. The transfer taxonomies devised by BIP are consistent with this tree, as BIP picks the sources in a way that all of these modes are quantitatively best covered, subject to the given budget and desired target set.

Figure 13: Task Similarity Tree. Agglomerative clustering of tasks based on their transferring-out patterns (i.e. using columns of normalized affinity matrix as task features). 3D, 2D, low dimensional geometric, and semantic tasks clustered together using a fully computational approach.

6. Limitations and Discussion

We presented a method for modeling the space of visual tasks by way of transfer learning and showed its utility in reducing the need for supervision. The space of tasks is an interesting object of study in its own right and we have only scratched the surface in this regard. We also made a number of assumptions in the framework which should be noted.

Model Dependence: We used a computational approach and adopted neural networks as our function class. Though we validated the stability of the findings w.r.t. various architectures and datasets, it should be noted that the findings are in principle model and data specific.

Compositionality: We performed the modeling via a set of common human-defined visual tasks. It is natural to consider a further compositional approach in which such common tasks are viewed as observed samples which are composed of computationally found latent subtasks.

Space Regularity: We performed modeling of a dense space via a sampled dictionary. Though we showed a good tolerance w.r.t. the choice of dictionary and transferring to out-of-dictionary tasks, this outcome holds upon a proper sampling of the space as a function of its regularity. More formal studies on properties of the computed space are required for this to be provably guaranteed for a general case.

Transferring to Non-visual and Robotic Tasks: Given the structure of the space of visual tasks and demonstrated transferabilities to novel tasks, it is worthwhile to question how this can be employed to develop a perception module for solving downstream tasks which are not entirely visual, e.g. robotic manipulation, but entail solving a set of (a priori unknown) visual tasks.

Lifelong Learning: We performed the modeling in one go. In many cases, e.g. lifelong learning, the system is evolving and the number of mastered tasks constantly increases. Such scenarios require augmentation of the structure with expansion mechanisms based on new beliefs.

Acknowledgement: We acknowledge the support of NSF (DMS-1521608), MURI (1186514-1-TBCJE), ONR MURI (N00014-14-1-0671), Toyota (1191689-1-UDAWF), ONR MURI (N00014-13-1-0341), Nvidia, Tencent, a gift by Amazon Web Services, and a Google Focused Research Award.
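The rank agreement reported in Sec. 5.1 is measured with Spearman's rho. As a rough illustration of what that statistic computes, the sketch below implements it with the Python standard library as the Pearson correlation of average ranks. The function names and the toy numbers are hypothetical and are not taken from the experiments; a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (sx * sy)


# Hypothetical example: taxonomy-predicted transfer scores for five source tasks
# versus the accuracy measured after fine-tuning on an external benchmark.
predicted_score = [0.9, 0.8, 0.6, 0.5, 0.2]
measured_accuracy = [71.0, 69.5, 70.2, 60.1, 55.3]
print(round(spearman_rho(predicted_score, measured_accuracy), 3))  # 0.9
```

Because only ranks enter the computation, the statistic is insensitive to the scale of either quantity, which suits a comparison between a predicted ordering and accuracies on a different dataset.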

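The similarity tree of Sec. 5.3 comes from agglomerative clustering over per-task feature vectors (columns of the normalized affinity matrix). The following is a minimal standard-library sketch of that general technique, not the procedure used to produce Figure 13: the linkage rule (average linkage), the distance (Euclidean), the function name, and the toy feature values are all assumptions made for illustration.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Cluster = Tuple[str, ...]


def agglomerative_merges(features: Dict[str, Sequence[float]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Average-linkage agglomerative clustering; returns the merge history.

    Each entry is (cluster_a, cluster_b, linkage_distance), in merge order.
    """
    clusters: List[Cluster] = [(name,) for name in features]

    def linkage(a: Cluster, b: Cluster) -> float:
        # Mean pairwise Euclidean distance between the members of two clusters.
        total = sum(dist(features[p], features[q]) for p in a for q in b)
        return total / (len(a) * len(b))

    merges: List[Tuple[Cluster, Cluster, float]] = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        a, b = clusters[i], clusters[j]
        merges.append((a, b, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [a + b]
    return merges


# Hypothetical "transferring-out" feature vectors (made-up numbers): each task is
# described by how well it transfers to three target tasks.
toy_features = {
    "normals": [0.90, 0.80, 0.10],
    "depth": [0.85, 0.75, 0.15],
    "object_class": [0.10, 0.20, 0.90],
    "scene_class": [0.15, 0.25, 0.80],
}
for a, b, d in agglomerative_merges(toy_features):
    print(a, "+", b, f"at distance {d:.3f}")
```

On this toy input the two geometric tasks merge first and the two semantic tasks second, mirroring the kind of grouping the section describes; with real affinity columns the outcome would depend on the chosen linkage and distance.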
References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
[2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[3] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[4] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
[5] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.
[6] J. Baxter. A bayesian/information theoretic model of learning to learn via multiple task sampling. Mach. Learn., 28(1):7–39, July 1997.
[7] S. Ben-David and R. S. Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, Dec 2008.
[8] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
[9] P. Berkhin et al. A survey of clustering data mining techniques. Grouping multidimensional data, 25:71, 2006.
[10] E. Bienenstock, S. Geman, and D. Potter. Compositionality, mdl priors, and object recognition. In Advances in neural information processing systems, pages 838–844, 1997.
[11] H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances in neural information processing systems, pages 235–243, 2016.
[12] J. Bingel and A. Søgaard. Identifying beneficial task relations for multi-task learning in deep neural networks. arXiv preprint arXiv:1702.08303, 2017.
[13] O. Boiman and M. Irani. Similarity by composition. In Advances in neural information processing systems, pages 177–184, 2007.
[14] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[15] Z. Chen and B. Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, 2016.
[16] I. I. CPLEX. V12. 1: Users manual for cplex. International Business Machines Corporation, 46(53):157, 2009.
[17] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[18] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. arXiv preprint arXiv:1708.07860, 2017.
[19] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
[20] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[21] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[22] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[23] A. Faktor and M. Irani. Clustering by composition–unsupervised discovery of image categories. In European Conference on Computer Vision, pages 474–487. Springer, 2012.
[24] L. Fei-Fei et al. A bayesian approach to unsupervised one-shot learning of object categories. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1134–1141. IEEE, 2003.
[25] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
[26] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision, pages 2960–2967, 2013.
[27] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[28] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. CoRR, abs/1603.00448, 2016.
[29] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 512–519. IEEE, 2016.
[30] C. Finn, T. Yu, J. Fu, P. Abbeel, and S. Levine. Generalizing skills with semi-supervised reinforcement learning. CoRR, abs/1612.00429, 2016.
[31] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. CoRR, abs/1709.04905, 2017.
[32] I. K. Fodor. A survey of dimension reduction techniques. Technical report, Lawrence Livermore National Lab., CA (US), 2002.
[33] R. M. French. Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[34] R. Ge. Provable algorithms for machine learning problems. PhD thesis, Princeton University, 2013.
[35] S. Geman, D. F. Potter, and Z. Chi. Composition systems. Quarterly of Applied Mathematics, 60(4):707–736, 2002.

[36] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
[37] A. Gopnik, C. Glymour, D. Sobel, L. Schulz, T. Kushnir, and D. Danks. A theory of causal learning in children: Causal maps and bayes nets. 111:3–32, 02 2004.
[38] A. Gopnik, C. Glymour, D. M. Sobel, L. E. Schulz, T. Kushnir, and D. Danks. A theory of causal learning in children: causal maps and bayes nets. Psychological review, 111(1):3, 2004.
[39] A. Gopnik, A. N. Meltzoff, and P. K. Kuhl. The scientist in the crib: Minds, brains, and how children learn. William Morrow & Co, 1999.
[40] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
[41] I. Gurobi Optimization. Gurobi optimizer reference manual, 2016.
[42] K. Henry. The theory and applications of homomorphic cryptography. Master's thesis, University of Waterloo, 2008.
[43] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[44] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867–874, 2014.
[45] Y. Hoshen and S. Peleg. Visual learning of arithmetic operations. CoRR, abs/1506.02264, 2015.
[46] F. Hu, G.-S. Xia, J. Hu, and L. Zhang. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7(11):14680–14707, 2015.
[47] I.-H. Jhuo, D. Liu, D. Lee, and S.-F. Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012.
[48] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[49] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[50] I. Kokkinos. Ubernet: Training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. arXiv preprint arXiv:1609.02132, 2016.
[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[52] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1785–1792. IEEE, 2011.
[53] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[54] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[55] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, pages 1–101, 2016.
[56] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709, 2016.
[57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[58] F. Liu, G. Lin, and C. Shen. CRF learning with CNN features for image segmentation. CoRR, abs/1503.08263, 2015.
[59] Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In Advances in Neural Information Processing Systems, pages 164–176, 2017.
[60] M. M. H. Mahmud. On Universal Transfer Learning, pages 135–149. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[61] J. Malik, P. Arbeláez, J. Carreira, K. Fragkiadaki, R. Girshick, G. Gkioxari, S. Gupta, B. Hariharan, A. Kar, and S. Tulsiani. The three rs of computer vision: Recognition, reconstruction and reorganization. Pattern Recognition Letters, 72:4–14, 2016.
[62] N. Masuda, M. A. Porter, and R. Lambiotte. Random walks and diffusion on networks. Physics Reports, 716-717:1–58, 2017.
[63] M. Mccloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24, 1989.
[64] L. Mihalkova, T. Huynh, and R. J. Mooney. Mapping and revising markov logic networks for transfer learning. In AAAI, volume 7, pages 608–614, 2007.
[65] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.
[66] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
[67] A. Niculescu-Mizil and R. Caruana. Inductive transfer for bayesian network structure learning. In Artificial Intelligence and Statistics, pages 339–346, 2007.
[68] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[69] M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. arXiv preprint arXiv:1708.06734, 2017.
[70] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.

[71] M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):30, 2012.
[72] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[73] A. Pentina and C. H. Lampert. Multi-task learning with labeled and unlabeled tasks. stat, 1050:1, 2017.
[74] J. Piaget and M. Cook. The origins of intelligence in children, volume 8. International Universities Press New York, 1952.
[75] L. Y. Pratt. Discriminability-based transfer between neural networks. In Advances in neural information processing systems, pages 204–211, 1993.
[76] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In International Conference on Computer Vision (ICCV), 2017.
[77] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[78] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[79] R. W. Saaty. The analytic hierarchy process – what it is and how it is used. Mathematical Modeling, 9(3-5):161–176, 1987.
[80] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. Computer Vision–ECCV 2010, pages 213–226, 2010.
[81] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. One-shot learning with a hierarchical nonparametric bayesian model. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 195–206, 2012.
[82] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[83] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
[84] D. L. Silver and K. P. Bennett. Guest editors introduction: special issue on inductive transfer learning. Machine Learning, 73(3):215–220, 2008.
[85] D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium Series, 2013.
[86] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943, 2013.
[87] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[88] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
[89] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and bayesian inference. Behavioral and Brain Sciences, 24(4):629–640, 2001.
[90] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.
[91] J. B. Tenenbaum, C. Kemp, and P. Shafto. Theory-based bayesian models of inductive learning and reasoning. In Trends in Cognitive Sciences, pages 309–318, 2006.
[92] D. G. R. Tervo, J. B. Tenenbaum, and S. J. Gershman. Toward the neural implementation of structure learning. Current opinion in neurobiology, 37:99–105, 2016.
[93] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in minecraft. In AAAI, pages 1553–1561, 2017.
[94] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
[95] A. M. Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.
[96] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
[97] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. arXiv preprint arXiv:1708.02901, 2017.
[98] T. Winograd. Thinking machines: Can there be? Are we, volume 200. University of California Press, Berkeley, 1991.
[99] J. Yang, R. Yan, and A. G. Hauptmann. Adapting svm classifiers to data with shifted distributions. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 69–76. IEEE, 2007.
[100] A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3d representation via pose estimation and matching. In European Conference on Computer Vision, pages 535–553. Springer, 2016.
[101] A. R. Zamir, F. Xia, J. He, A. Sax, J. Malik, and S. Savarese. Gibson Env: Real-world perception for embodied agents. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[102] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.
[103] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[104] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in neural information processing systems, pages 487–495, 2014.