tutorials/zh/05_builtin_analytical_algorithms.ipynb (313 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 内置图分析算法\n", "\n", "图分析在真实世界中有广泛的用处。许多算法比如社区发现,路径连通性,中心性算法都被证明在许多商业中非常有用。\n", "\n", "GraphScope 内置了一系列算法,使得用户可以方便的在图数据上做分析。\n", "\n", "这个教程展示了如何使用内置算法来完成图分析任务。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install graphscope package if you are NOT in the Playground\n", "\n", "!pip3 install graphscope" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 准备工作\n", "\n", "首先,我们需要载入一张属性图。\n", "\n", "这里我们从 [Gnutella peer-to-peer network, August 31 2002](http://snap.stanford.edu/data/p2p-Gnutella31.html)\n", "使用 peer-to-peer network 数据集,并在此基础上为点和边加入了属性。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the graphscope module.\n", "\n", "import graphscope\n", "\n", "graphscope.set_option(show_log=False) # enable logging" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load p2p network dataset\n", "\n", "from graphscope.dataset import load_p2p_network\n", "\n", "graph = load_p2p_network(directed=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "来看一下图的数据模型。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(graph.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如上所示,我们载入了一张属性图,点标签为 \"host\",有两个属性 (\"weight\" 和 \"id\"),边标签为 \"connect\", 有三个属性,分别为 (\"src_label_id\", \"dst_label_id\", \"dist\")。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 投影为简单图\n", "\n", "许多图分析算法都只能查询 **简单图**, 在这里我们定义 **简单图** 为只包含一种点和一种边,且点和边最多只有一个属性。\n", "\n", "`GraphScope` 提供了一个函数 `project` 来将属性图投影为简单图,我们可以选择某一种点和边,以及其一个或零个属性,来获得属性图的一个 **投影**。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simple_graph = graph.project(vertices={\"host\": []}, edges={\"connect\": [\"dist\"]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 运行算法\n", "\n", "在接下来的部分,我们将运行几个算法,并查看结果。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 单源最短路径\n", "\n", "单源最短路径算法(Single Source Shortest Path,接下来将以 `sssp` 代称),需要两个参数,第一个参数是 `graph`,第二个参数是查询的出发点 `src`。\n", "\n", "在这里,我们将查询从 ID 为 6 的点出发到所有点的最短路径长度。\n", "\n", "在后端,GraphScope 会生成一个兼容被查询的图的算法,编译为一个动态链接库。额外的 GraphScope 会对类型做一些检查,比如,在这里 `sssp` 算法要求边有一个 `int` 或 `double` 的属性作为距离。\n", "\n", "第一次编译动态链接库时需要一些时间,但是对于同一个算法,这一步在同样类型的图上只会进行一次。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from graphscope import sssp\n", "\n", "sssp_context = sssp(simple_graph, src=6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "完成计算后,计算结果分布式的存储到了集群上 `vineyard` 的实例上。\n", "\n", "算法会返回一个 `Context` 对象,包含若干种取回结果的方法。\n", "\n", "关于 `Context` 的更多信息,请参照 [Context](https://graphscope.io/docs/reference/context.html)\n", "\n", "在这里,结果表示从起始点出发到所有点的最短路径,我们使用返回的对象来取得一部分结果,以及取回点ID一并展示。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sssp_context.to_dataframe(\n", " selector={\"id\": \"v.id\", \"dist\": \"r\"}, vertex_range={\"begin\": 1, \"end\": 10}\n", ").sort_values(by=\"id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另外, 我们还可以将结果输出到客户端的机器的文件系统上。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sssp_context.output_to_client(\"./sssp_result.csv\", selector={\"id\": \"v.id\", \"dist\": \"r\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "查看一下输出的文件。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head ./sssp_result.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PageRank\n", "\n", "PageRank 是非常著名的一个图分析算法,让我们来看一下如何仅用两行计算 PageRank。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from graphscope import pagerank\n", "\n", "pr_context = pagerank(simple_graph, delta=0.85, max_round=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pr_context.to_dataframe(\n", " selector={\"id\": \"v.id\", \"rank\": \"r\"}, vertex_range={\"begin\": 1, \"end\": 10}\n", ").sort_values(by=\"id\")" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "outputs_hidden": true } }, "source": [ "输出结果到本地。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pr_context.output_to_client(\n", " \"./pagerank_result.csv\", selector={\"id\": \"v.id\", \"rank\": \"r\"}\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 弱联通分量\n", "\n", "在图理论中,无向图中的一个分量,有时被称为一个联通分量,代表一个任意两个节点之间都有边的子图,且这个子图没有指向子图外的点的边。\n", "\n", "弱联通分量算法(Weakly Connected Components, WCC)计算每个点所属的联通分量。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from graphscope import wcc\n", "\n", "wcc_context = wcc(simple_graph)\n", "wcc_context.to_dataframe(\n", " selector={\"id\": \"v.id\", \"cc\": \"r\"}, vertex_range={\"begin\": 1, \"end\": 10}\n", ").sort_values(by=\"id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "请查阅 [Builtin algorithms](https://graphscope.io/docs/analytics_engine.html#built-in-algorithms) 来获得更多的内置算法的信息,并欢迎试用更多的算法。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }