DocArray: A Scalable Data Structure Built for Machine Learning
1 .DocArray: A Data Structure Built for Machine Learning felix.wang@jina.ai
2 .About Me
2021-now, Senior AI Engineer, Jina AI
2020-21, Senior Researcher, Huya AI
2018-19, Senior Researcher, Tencent AI
2011-18, Ph.D., Hong Kong Baptist University
Wang Feng (王峰) is a core contributor to Jina, the open-source neural search framework. He focuses on applying machine learning and deep learning algorithms to NLP, multimodal representation learning, and information retrieval.
3 .About Jina AI
👼 Founded in February 2020
50 members
🌎 4 offices: Berlin (HQ), Beijing, San Jose, Shenzhen
💵 Raised $38M+
🏆 Top-tier AI company
4 .Fundamental What is DocArray?
5 .What is Neural Search?
6 .Datatypes What can neural search search? Search any* kind of data: text, images, audio, video, 3D meshes, proteins.
7 .[Figure: searchable data types (PDF, video, audio, apps, images, text). Text example: "weather forecast"]
8 .[Figure: chatbot example. "Hi there, what can I do for you?" "How much is this pair of shoes?" The reply shows a product card (Bluewing, special offer, $38.00) and "You may also like" items at $38.00, $48.00, $36.90, $43.00]
9 .[Figure: PDF example. Machine Learning.pdf, Ambiguity…Wasserstein.pdf, Autoencoding topology.pdf, Jelly bean world…learning.pdf]
10 .[Figure: audio example. "All the Single Ladies" by Beyoncé, "Say My Name" by Destiny's Child]
11 .[Figure: video example. soccer.mp4]
12 .[Figure: image example. sneakers.jpg]
13 .Fundamental Keyword Search vs Neural Search
Keyword Search Example
Query: What is Jina?
Result: Jina is a cloud-native neural search framework to build SOTA and scalable deep learning search applications in minutes.
Neural Search Example
Query: a happy potato
Query vector: [0.01, 0.07, 0.35, ….]
Result vectors: [[0.01, 0.06, 0.34, ….], [0.02, 0.06, 0.38, ..], …]
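The contrast on this slide can be sketched in plain Python. This is a minimal illustration, not how Jina or DocArray implement search: `keyword_search`, `cosine`, and `neural_search` are hypothetical helpers, the vectors use only the three components visible on the slide (the elided components are left out), and the `unrelated` vector is made up for contrast.

```python
import math
import re


def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(query, docs):
    """Keyword search: keep documents that share at least one token with the query."""
    q = tokens(query)
    return [d for d in docs if q & tokens(d)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def neural_search(query_vec, index):
    """Neural search: rank (name, vector) pairs by similarity to the query vector."""
    return sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)


docs = [
    "Jina is a cloud-native neural search framework",
    "Mashed potato recipe",
]
hits = keyword_search("What is Jina?", docs)

# In a real system the vectors come from an embedding model; here they are typed in.
index = [
    ("unrelated", [0.90, 0.10, 0.00]),   # hypothetical, for contrast only
    ("result_1", [0.01, 0.06, 0.34]),
    ("result_2", [0.02, 0.06, 0.38]),
]
ranked = [name for name, _ in neural_search([0.01, 0.07, 0.35], index)]
```

Keyword search depends on shared words, while neural search compares vectors, so a query such as "a happy potato" can match content that shares no words with it.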
14 .What is DocArray? Who will benefit?
- If you are a data scientist who works with image, text, video, and audio data in Python all day, you should use DocArray: it can greatly accelerate representing, embedding, matching, visualizing, evaluating, and sharing data, while staying close to your favorite toolkits, e.g. Torch, TensorFlow, ONNX, PaddlePaddle, JupyterLab, Google Colab.
- If you are a deep learning engineer who works on scalable deep learning services, you should use DocArray: it can be the basic building block of your system. Its portable data structure can be sent over the wire as Protobuf, compressed bytes, or JSON, allowing your engineer friends to happily integrate it into the production system.
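The idea of a portable data structure can be illustrated with a round trip through JSON and compressed bytes. This sketch uses only the standard library and a plain dict standing in for a document. The helper names are hypothetical and are not DocArray's API, and Protobuf is left out because it needs a schema and a third-party package.

```python
import json
import zlib

# A plain dict stands in for a document; field names follow the slides.
doc = {"id": "doc-1", "text": "hello world", "tags": {"lang": "en"}}


def to_json(d):
    """Serialize to a JSON string with a stable key order."""
    return json.dumps(d, sort_keys=True)


def to_compressed_bytes(d):
    """Serialize to JSON, then compress the UTF-8 bytes with zlib."""
    return zlib.compress(to_json(d).encode("utf-8"))


def from_compressed_bytes(payload):
    """Invert to_compressed_bytes."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


payload = to_compressed_bytes(doc)
restored = from_compressed_bytes(payload)
```

Both sides of a service only need to agree on the format, so the same object can move from a notebook to a production system unchanged.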
15 .What is DocArray? Bridging the gap between prototyping and production [Figure: the path from prototyping to production, with deployment on the cloud]
16 .What is DocArray? Design Objectives
- DocArray is designed to be extremely intuitive for Python users, with no new syntax to learn. If you know how to Python, you know how to DocArray.
- DocArray is designed to maximize the local experience while staying cloud-ready at any time.
- DocArray is designed to represent multimodal data intuitively, to keep up with the growing number of multimodal and cross-modal applications.
17 .What is DocArray? Three Concepts
Document: a data structure for easily representing nested, unstructured data.
DocumentArray: a container for efficiently accessing, processing, and understanding multiple Documents.
Dataclass: a high-level API for intuitively representing multimodal data.
18 .What is Document? Simple Document
● Content related: uri, text, tensor, blob;
● Common side information or metadata: id, modality, mime_type, offset, location, weight;
○ Further information: tags;
19 .What is Document? Nested Document
● Nested-structure related: chunks, matches, granularity, adjacency, parent_id;
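A rough picture of these attributes, written as a standard-library dataclass. `MiniDocument` and `add_chunk` are hypothetical names for illustration and not DocArray's `Document` class. Only some of the listed attributes are modeled, and treating `granularity` as the chunk nesting depth is an assumption of this sketch.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MiniDocument:
    # Content-related attributes
    text: Optional[str] = None
    uri: Optional[str] = None
    # Metadata
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    modality: Optional[str] = None
    tags: Dict[str, Any] = field(default_factory=dict)
    # Nested-structure attributes
    parent_id: Optional[str] = None
    granularity: int = 0  # assumption: depth in the chunk tree
    chunks: List["MiniDocument"] = field(default_factory=list)
    matches: List["MiniDocument"] = field(default_factory=list)

    def add_chunk(self, chunk: "MiniDocument") -> "MiniDocument":
        """Attach a sub-document and record where it sits in the tree."""
        chunk.parent_id = self.id
        chunk.granularity = self.granularity + 1
        self.chunks.append(chunk)
        return chunk


page = MiniDocument(text="A page with two sentences.", tags={"lang": "en"})
sentence = page.add_chunk(MiniDocument(text="First sentence."))
word = sentence.add_chunk(MiniDocument(text="First"))
```

Chunks split a document into smaller parts (page, sentence, word), while matches would hold other documents returned by a search.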
21 .What is DocumentArray? DocumentArray
DocumentArray is a list-like container of Document objects.
- It behaves like a Python list, as it implements all list interfaces.
- It is also as powerful as a NumPy ndarray or a Pandas DataFrame, allowing you to efficiently access elements and attributes of the contained Documents.
- It greatly accelerates data scientists' work on accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching, etc.
22 .What is DocumentArray? Interface hierarchy, from most to least specific: MutableSequence, Sequence, Collection, Iterable
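The interfaces named on this slide are the abstract base classes in Python's `collections.abc`. The sketch below shows how a container becomes list-like by subclassing `MutableSequence` and implementing five methods. `MiniDocumentArray` and its `texts` method are hypothetical and much simpler than the real DocumentArray; documents are plain dicts here.

```python
from collections.abc import Collection, Iterable, MutableSequence, Sequence


class MiniDocumentArray(MutableSequence):
    """A list-like container; MutableSequence supplies append, extend, pop, etc."""

    def __init__(self, docs=None):
        self._docs = list(docs or [])

    # The five abstract methods required by MutableSequence
    def __getitem__(self, index):
        if isinstance(index, slice):
            return MiniDocumentArray(self._docs[index])
        return self._docs[index]

    def __setitem__(self, index, value):
        self._docs[index] = value

    def __delitem__(self, index):
        del self._docs[index]

    def __len__(self):
        return len(self._docs)

    def insert(self, index, value):
        self._docs.insert(index, value)

    # Column-style attribute access across all contained documents
    def texts(self):
        return [d.get("text") for d in self._docs]


da = MiniDocumentArray([{"text": "hello"}])
da.append({"text": "world"})      # mixin method from MutableSequence
da.extend([{"text": "!"}])        # mixin method from MutableSequence
first_two = da[0:2]
```

Because the class satisfies `MutableSequence`, it is also a `Sequence`, a `Collection`, and an `Iterable`, so code written for lists works on it without changes.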
23 .What is DocArray? Index by nested structure
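One way to picture indexing by nested structure is a path of steps, each of which descends into the chunks or matches of every document selected so far. The `select` helper and its single-letter path notation are hypothetical. DocArray has its own selector syntax, which is not reproduced here.

```python
def select(docs, path):
    """Follow a path of 'c' (chunks) and 'm' (matches) steps and flatten the result."""
    step_to_key = {"c": "chunks", "m": "matches"}
    current = list(docs)
    for step in path:
        key = step_to_key[step]
        current = [child for doc in current for child in doc.get(key, [])]
    return current


docs = [
    {
        "text": "page",
        "chunks": [
            {"text": "sentence 1", "chunks": [{"text": "word a"}, {"text": "word b"}]},
            {"text": "sentence 2"},
        ],
        "matches": [{"text": "similar page"}],
    }
]

sentences = [d["text"] for d in select(docs, "c")]
words = [d["text"] for d in select(docs, "cc")]
neighbours = [d["text"] for d in select(docs, "m")]
```

Each call returns a flat list, so the same batch operations can be applied at any level of nesting.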
24 .What is DocArray? Scalable data structure
25 .What is Dataclass? Dataclass DocArray’s dataclass is a high-level API for representing a multimodal document using a nested Document structure.
26 .Integrations Jupyter Notebook/Colab Many data scientists work with Jupyter Notebook or Google Colab, especially at the early prototyping stage.
27 .Integrations Deep Learning Frameworks DocArray can be easily integrated with the PyTorch, TensorFlow, and PaddlePaddle frameworks.
28 .Integrations Deep Learning Frameworks dataloader: Load, map, and batch in one shot
29 .Integrations FastAPI/Pydantic DocArray supports Pydantic data models via PydanticDocument and PydanticDocumentArray.
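The mapping from a typed, multimodal record to a nested document can be sketched with the standard-library `dataclasses` module. This is an illustration of the idea only: `Article`, the `modality` field metadata, and `to_nested_document` are hypothetical, and DocArray's own dataclass API is not shown.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Article:
    # Each field carries its modality as metadata (an assumption of this sketch).
    title: str = field(metadata={"modality": "text"})
    banner_uri: str = field(metadata={"modality": "image"})


def to_nested_document(obj):
    """Turn each dataclass field into one chunk of a parent document."""
    chunks = [
        {
            "modality": f.metadata["modality"],
            "tags": {"field": f.name},
            "content": getattr(obj, f.name),
        }
        for f in fields(obj)
    ]
    return {"modality": "multimodal", "chunks": chunks}


doc = to_nested_document(Article(title="Hello DocArray", banner_uri="banner.png"))
modalities = [c["modality"] for c in doc["chunks"]]
```

The user works with named, typed fields, while the underlying storage stays a single nested structure with one chunk per modality.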
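The load, map, batch pattern can be sketched as a lazy generator pipeline in plain Python. The `dataloader` and `load_lines` functions below are hypothetical stand-ins written for this example and do not reproduce the signature of DocArray's dataloader.

```python
from itertools import islice


def load_lines(lines):
    """Load step: lazily turn raw lines into document-like dicts."""
    for line in lines:
        yield {"text": line.strip()}


def dataloader(source, func, batch_size):
    """Apply func to each loaded item and yield lists of up to batch_size items."""
    mapped = map(func, source)  # lazy: nothing is computed until a batch is requested
    while True:
        batch = list(islice(mapped, batch_size))
        if not batch:
            return
        yield batch


def add_length(doc):
    """Map step: a toy preprocessing function."""
    return {**doc, "length": len(doc["text"])}


raw = ["  red shoes \n", "blue shirt\n", " green hat\n"]
batches = list(dataloader(load_lines(raw), add_length, batch_size=2))
```

Because every stage is lazy, only one batch is held in memory at a time, which matters when the source is larger than RAM.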