tutorials/zh/05_builtin_analytical_algorithms.ipynb (313 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 内置图分析算法\n",
"\n",
"图分析在真实世界中有广泛的用处。许多算法比如社区发现,路径连通性,中心性算法都被证明在许多商业中非常有用。\n",
"\n",
"GraphScope 内置了一系列算法,使得用户可以方便的在图数据上做分析。\n",
"\n",
"这个教程展示了如何使用内置算法来完成图分析任务。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install graphscope package if you are NOT in the Playground\n",
"\n",
"!pip3 install graphscope"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 准备工作\n",
"\n",
"首先,我们需要载入一张属性图。\n",
"\n",
"这里我们从 [Gnutella peer-to-peer network, August 31 2002](http://snap.stanford.edu/data/p2p-Gnutella31.html)\n",
"使用 peer-to-peer network 数据集,并在此基础上为点和边加入了属性。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the graphscope module.\n",
"\n",
"import graphscope\n",
"\n",
"graphscope.set_option(show_log=False) # enable logging"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load p2p network dataset\n",
"\n",
"from graphscope.dataset import load_p2p_network\n",
"\n",
"graph = load_p2p_network(directed=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"来看一下图的数据模型。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(graph.schema)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如上所示,我们载入了一张属性图,点标签为 \"host\",有两个属性 (\"weight\" 和 \"id\"),边标签为 \"connect\", 有三个属性,分别为 (\"src_label_id\", \"dst_label_id\", \"dist\")。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 投影为简单图\n",
"\n",
"许多图分析算法都只能查询 **简单图**, 在这里我们定义 **简单图** 为只包含一种点和一种边,且点和边最多只有一个属性。\n",
"\n",
"`GraphScope` 提供了一个函数 `project` 来将属性图投影为简单图,我们可以选择某一种点和边,以及其一个或零个属性,来获得属性图的一个 **投影**。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simple_graph = graph.project(vertices={\"host\": []}, edges={\"connect\": [\"dist\"]})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 运行算法\n",
"\n",
"在接下来的部分,我们将运行几个算法,并查看结果。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 单源最短路径\n",
"\n",
"单源最短路径算法(Single Source Shortest Path,接下来将以 `sssp` 代称),需要两个参数,第一个参数是 `graph`,第二个参数是查询的出发点 `src`。\n",
"\n",
"在这里,我们将查询从 ID 为 6 的点出发到所有点的最短路径长度。\n",
"\n",
"在后端,GraphScope 会生成一个兼容被查询的图的算法,编译为一个动态链接库。额外的 GraphScope 会对类型做一些检查,比如,在这里 `sssp` 算法要求边有一个 `int` 或 `double` 的属性作为距离。\n",
"\n",
"第一次编译动态链接库时需要一些时间,但是对于同一个算法,这一步在同样类型的图上只会进行一次。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from graphscope import sssp\n",
"\n",
"sssp_context = sssp(simple_graph, src=6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"完成计算后,计算结果分布式的存储到了集群上 `vineyard` 的实例上。\n",
"\n",
"算法会返回一个 `Context` 对象,包含若干种取回结果的方法。\n",
"\n",
"关于 `Context` 的更多信息,请参照 [Context](https://graphscope.io/docs/reference/context.html)\n",
"\n",
"在这里,结果表示从起始点出发到所有点的最短路径,我们使用返回的对象来取得一部分结果,以及取回点ID一并展示。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sssp_context.to_dataframe(\n",
" selector={\"id\": \"v.id\", \"dist\": \"r\"}, vertex_range={\"begin\": 1, \"end\": 10}\n",
").sort_values(by=\"id\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另外, 我们还可以将结果输出到客户端的机器的文件系统上。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sssp_context.output_to_client(\"./sssp_result.csv\", selector={\"id\": \"v.id\", \"dist\": \"r\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"查看一下输出的文件。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!head ./sssp_result.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PageRank\n",
"\n",
"PageRank 是非常著名的一个图分析算法,让我们来看一下如何仅用两行计算 PageRank。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from graphscope import pagerank\n",
"\n",
"pr_context = pagerank(simple_graph, delta=0.85, max_round=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pr_context.to_dataframe(\n",
" selector={\"id\": \"v.id\", \"rank\": \"r\"}, vertex_range={\"begin\": 1, \"end\": 10}\n",
").sort_values(by=\"id\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"jupyter": {
"outputs_hidden": true
}
},
"source": [
"输出结果到本地。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pr_context.output_to_client(\n",
" \"./pagerank_result.csv\", selector={\"id\": \"v.id\", \"rank\": \"r\"}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 弱联通分量\n",
"\n",
"在图理论中,无向图中的一个分量,有时被称为一个联通分量,代表一个任意两个节点之间都有边的子图,且这个子图没有指向子图外的点的边。\n",
"\n",
"弱联通分量算法(Weakly Connected Components, WCC)计算每个点所属的联通分量。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from graphscope import wcc\n",
"\n",
"wcc_context = wcc(simple_graph)\n",
"wcc_context.to_dataframe(\n",
" selector={\"id\": \"v.id\", \"cc\": \"r\"}, vertex_range={\"begin\": 1, \"end\": 10}\n",
").sort_values(by=\"id\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"请查阅 [Builtin algorithms](https://graphscope.io/docs/analytics_engine.html#built-in-algorithms) 来获得更多的内置算法的信息,并欢迎试用更多的算法。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}