Semantic Tooling at Twitter (ScalaDays Copenhagen 2017)

This talk introduces semantic databases, the cornerstone of the Scalameta semantic API, and explains how they can be used to integrate with Kythe, a language-agnostic ecosystem for developer tools. It also presents our vision of next-generation semantic tooling for the Scala ecosystem.



3.Agenda ● State of developer tools at Twitter ● Vision of nextgen semantic tooling ● Proposed technology stack


5.State of Source ● Monorepo ● Consistent build ○ Now: retain agility! ● Persistent rumor: “Twitter is writing less Scala” ○ False. ○ JDK8 landed in Source about 1 year ago. In that period: ■ Scala codebase grew by 35% ■ Java codebase grew by 19%

6.Rewind: Monorepos? Monorepos. ● No diamonds ● Atomic cross-project changes ● Top-to-bottom continuous integration testing ● Linear change history ● No binary incompatibilities except at the boundary ○ ...although really just an argument for source distributions...?

7.Achieving the promise of a monorepo ● Requires tooling! ● Previous talk: Pants (ref). ● Previous talk: dependency hygiene (ref). ● Today: semantic tooling!

8.A day in the life of a core lib dev ● Not a bad environment! ○ Pre-commit unit and integration testing of all dependents ○ Atomic commit of changes to libraries and their consumers ○ Thousands of examples of usage of most APIs ○ Users sit right down the hall ● But not perfect. ○ How do I remove an API?

9.“Avoid deprecations in the common case” ● Dead code in a monorepo is not like dead code in polyrepos! ● Rewriting `Future.get` to `Await.result` (last year) required a custom compiler plugin
    0899f3e util-core: Remove deprecated method Future.get(Duration)
        28 files changed, 293 insertions(+), 210 deletions(-)
    60b8b21 util-core: Remove deprecated Future.get
        53 files changed, 403 insertions(+), 299 deletions(-)
    6ed301d Replace calls to Future.get with Await.result
        116 files changed, 1113 insertions(+), 956 deletions(-)
    7deee17 Replace calls to Future.get with Await.result
        131 files changed, 923 insertions(+), 760 deletions(-)
    2855fa4 Replace calls to Future.get with Await.result
        174 files changed, 1476 insertions(+), 1222 deletions(-)
    dfe0002 Replace calls to Future.get with Await.result
        51 files changed, 991 insertions(+), 688 deletions(-)
    da6f09c Replace calls to Future.get with Await.result
        80 files changed, 815 insertions(+), 535 deletions(-)

10.State of semantic tooling ● Very coarse via target level dependencies: ○ ~2^16 targets, ~2^14 roots (tests+binaries) ● Slightly finer (class-level) semantic information via zinc analysis ○ ~2^22 class files (post codegen) ● Very fast text/regex based indexes ○ ~2^25 loc (pre codegen)

11.State of semantic tooling (continued) ● Symbol level information available only in IDEs ● Very old Sourcegraph install recently deprecated ○ Legacy code for both companies: missing features, fragile integration ■ Compiler plugin specific to 1) Sourcegraph, 2) a compiler version ● *But Sourcegraph is moving toward LSP extensions (ref): great direction! ○ Not ruling out future open source collaboration. ● Pants support for scalafix (new!) and scalafmt ○ Not yet widely used internally ○ Big bang rewrite likely coming soon.

13.Code comprehension ● Table stakes; must be: ○ Orders of magnitude faster than grep ○ Find references-to ○ Find definition-of a symbol ● Going further toward understanding with: ○ Inheritance relationships ○ Documentation ○ Type awareness
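
The two table-stakes queries above can be sketched as lookups in a precomputed index, which is what makes them so much faster than scanning the source tree with grep. This is a minimal illustration only: `Position`, `Occurrence`, `SymbolIndex`, the symbol strings, the file names and the offsets are all hypothetical and do not come from the talk or from any existing tool.

```scala
object SymbolIndexSketch {
  // A location in a source file: path plus [start, end) character offsets.
  final case class Position(path: String, start: Int, end: Int)

  // One occurrence of a symbol; isDefinition marks the defining occurrence.
  final case class Occurrence(symbol: String, pos: Position, isDefinition: Boolean)

  // Occurrences are grouped by symbol once, so each query is a map lookup
  // instead of a scan over every file.
  final class SymbolIndex(occurrences: Seq[Occurrence]) {
    private val bySymbol: Map[String, Seq[Occurrence]] =
      occurrences.groupBy(_.symbol)

    def definitionOf(symbol: String): Option[Position] =
      bySymbol.getOrElse(symbol, Nil).find(_.isDefinition).map(_.pos)

    def referencesTo(symbol: String): Seq[Position] =
      bySymbol.getOrElse(symbol, Nil).filterNot(_.isDefinition).map(_.pos)
  }

  // Hypothetical data: one definition and two references (offsets are made up).
  val index = new SymbolIndex(Seq(
    Occurrence("com.example.Printer.print", Position("Printer.scala", 40, 45), isDefinition = true),
    Occurrence("com.example.Printer.print", Position("Example.scala", 120, 125), isDefinition = false),
    Occurrence("com.example.Printer.print", Position("Other.scala", 10, 15), isDefinition = false)
  ))
}
```

With this in place, `index.definitionOf(...)` and `index.referencesTo(...)` answer the two queries directly. A real index would also need to be distributed and kept up to date, which this sketch ignores.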

14.Code review ● Context available for a patch ○ Warnings/errors from the compiler ○ Definitions/references/types on hover

15.Code evolution ● Deprecations should be completely unnecessary for code that doesn’t escape the closed world! ● Decide whether to refactor by... ○ ...exploring class/trait relationships ○ ...filtering calls by scope or the call graph ● Then execute. ○ Scalafix! ○ Generic rewrite tools possible?
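
To illustrate why resolved names matter for the "then execute" step, here is a minimal sketch of a position-based rename. It is not Scalafix's API: `ResolvedName`, `rename` and the symbol strings are hypothetical, and the offsets are assumed to come from some semantic extraction step. The point is that only occurrences resolving to the target symbol are patched, which a text or regex rewrite cannot guarantee.

```scala
object RewriteSketch {
  // A resolved name: [start, end) offsets in the source plus the symbol it refers to.
  final case class ResolvedName(start: Int, end: Int, symbol: String)

  // Replace every occurrence that resolves to `symbol` with `replacement`,
  // leaving same-named but unrelated identifiers untouched.
  def rename(source: String, names: Seq[ResolvedName], symbol: String, replacement: String): String =
    names
      .filter(_.symbol == symbol)
      .sortBy(n => -n.start) // patch back to front so earlier offsets stay valid
      .foldLeft(source) { (text, n) =>
        text.substring(0, n.start) + replacement + text.substring(n.end)
      }

  // Two calls named `get`; only the first resolves to the (hypothetical) deprecated method.
  val source = "a.get(); b.get()"
  val names = Seq(
    ResolvedName(2, 5, "lib.Future.get"),
    ResolvedName(11, 14, "other.Cache.get")
  )

  val rewritten = rename(source, names, "lib.Future.get", "result")
}
```

Here `rewritten` is `a.result(); b.get()`: the second `get` is left alone because it resolves to a different symbol. A rewrite such as `Future.get` to `Await.result` also changes the shape of the expression, so it needs tree-level patches rather than a plain rename.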

16.Executing the vision ● High resolution, antifragile semantic extraction... ● Distributed, language-agnostic* semantic index... ● Integration with language-agnostic tools...


18.Nextgen metaprogramming library for Scala ● Syntactic API (2014-) ○ Tokens ○ Abstract syntax trees ○ Parsers ○ Quasiquotes ● Semantic API (2017-) ○ An independent open-source foundation for semantic tools ○ Already used at Twitter and at the Scala Center ○ Recently published technology preview within scalameta 1.8.0

19.Old-school semantic tooling for Scala ● Write a compiler plugin that runs after typer ● import global._ ● Fight with compiler internals ● Rewrite your tool when a new minor version of Scala is released

20.Why old school didn’t work: huge surface of the compiler API ● Tens of thousands of LOC ● Dozens of different modules ● Thousands of different methods

21.First attempt (scalareflect, 2011) ● Reduce the API surface to several hundred most popular methods ● Guarantee stability across minor and even major Scala releases

22.Second attempt (scalameta, 2014) ● Further “compress” the API surface to several dozen most popular methods ● New data structures to enable new “compressed” APIs ● Convert back and forth between compiler and new data structures

23.Why these attempts didn’t work: still using compiler data structures ● Immense language-version-specific schema ● Very involved pre- and postconditions ● Require a running compiler ● Not serializable

24.Third attempt (scalameta, 2017) ● Dumb data schema to represent semantic information ● Give up on bidirectional interop with compiler data structures ● Still use the significantly reduced API surface from the second attempt

25.Semantic database ● Extremely simple data schema ● ~50 lines of protobuf code ● Supports resolved names, compiler messages, symbol denotations and sugars ● Technology preview for Scala 2.11.11 and Scala 2.12.2
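
To give a feel for how small such a schema can be, here is a sketch that models the four kinds of data listed above as plain Scala case classes. This is an assumption-laden illustration, not the actual protobuf schema: every type name, field name and symbol string below is hypothetical.

```scala
object SemanticDbSketch {
  // [start, end) character offsets within one source file.
  final case class Span(start: Int, end: Int) {
    def contains(offset: Int): Boolean = start <= offset && offset < end
  }

  // The four kinds of data the slide lists (field names are assumptions).
  final case class ResolvedName(span: Span, symbol: String)
  final case class Message(span: Span, severity: String, text: String)
  final case class Denotation(symbol: String, flags: Set[String], name: String, info: String)
  final case class Sugar(span: Span, syntax: String)

  // Everything known about one source file; a database is a list of these.
  final case class Attributes(
    path: String,
    names: Seq[ResolvedName],
    messages: Seq[Message],
    denotations: Seq[Denotation],
    sugars: Seq[Sugar]
  )
  final case class Database(entries: Seq[Attributes])

  // Plain data means plain queries: no running compiler is needed.
  def symbolAt(attrs: Attributes, offset: Int): Option[String] =
    attrs.names.find(_.span.contains(offset)).map(_.symbol)

  val source = "class Printer { def print(msg: String): Unit = println(msg) }"
  val printStart: Int = source.indexOf("print(") // the name in `def print(`

  val attrs = Attributes(
    path = "Printer.scala",
    names = Seq(ResolvedName(Span(printStart, printStart + "print".length), "com.example.Printer.print")),
    messages = Nil,
    denotations = Seq(Denotation("com.example.Printer.print", Set("def"), "print", "(msg: String): Unit")),
    sugars = Nil
  )
}
```

Because the data is immutable and compiler-independent, it can be serialized, shipped to other tools and queried with ordinary collection operations, for example `symbolAt(attrs, printStart)`.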


27.Live demo: semantic db for an example Scala file
    package com.example
    class Printer {
      def print(msg: String): Unit = println(msg)
    }
    object Example {
      def main(args: Array[String]): Unit = {
        val msg = "Hello World" // Comment.
        new Printer().print(msg)
      }
    }

28.Early feedback ● Semantic databases are extremely hackable ● Spawned a family of semantic tools that run outside the compiler ● Great potential for portability ● Great potential for scalability ● Simplicity of data schemas is seriously underrated

29.language-agnostic* semantic index?