{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# System Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. Install system dependencies and python packages. Prepare the environment.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, install all dependencies and python packages as `root`. Run commands and make sure the installations are successful.\n",
"\n",
"```bash\n",
"apt update\n",
"\n",
"apt install -y sudo locales wget tar tzdata git ccache cmake ninja-build build-essential llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven vim pip sysstat gcc-9 libjemalloc-dev nvme-cli curl zip unzip bison flex linux-tools-common linux-tools-generic linux-tools-`uname -r` mailutils\n",
"\n",
"python3 -m pip install notebook==6.5.2\n",
"python3 -m pip install jupyter_server==1.23.4\n",
"python3 -m pip install jupyter_highlight_selected_word\n",
"python3 -m pip install jupyter_contrib_nbextensions\n",
"python3 -m pip install virtualenv==20.21.1\n",
"python3 -m pip uninstall -y ipython\n",
"python3 -m pip install ipython==8.21.0\n",
"python3 -m pip uninstall -y traitlets\n",
"python3 -m pip install traitlets==5.9.0\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Required for Ubuntu***\n",
"\n",
"Check that there isn't an entry for your hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts (Ubuntu is notorious for this). If there is, delete it.\n",
"Then add `<ip>` and `<hostname>` for master and worker nodes.\n",
"\n",
"Example /etc/hosts:\n",
" \n",
"```\n",
"127.0.0.1 localhost\n",
"\n",
"# The following lines are desirable for IPv6 capable hosts\n",
"::1 ip6-localhost ip6-loopback\n",
"fe00::0 ip6-localnet\n",
"ff00::0 ip6-mcastprefix\n",
"ff02::1 ip6-allnodes\n",
"ff02::2 ip6-allrouters\n",
"\n",
"10.0.0.117 sr217\n",
"10.0.0.113 sr213\n",
"```"
]
},
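{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (not part of the original steps), verify that the hostname resolves to the LAN IP rather than a loopback address:\n",
"\n",
"```bash\n",
"# Should print the node's LAN IP (e.g. 10.0.0.117), not 127.0.0.1 or 127.0.1.1\n",
"getent hosts `hostname`\n",
"ping -c 1 `hostname`\n",
"```"
]
},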
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. Format and mount disks**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a python virtual environment to finish the system setup process:\n",
"\n",
"```bash\n",
"virtualenv -p python3 -v venv\n",
"source venv/bin/activate\n",
"```\n",
"\n",
"And install packages under `venv`:\n",
"```bash\n",
"(venv) python3 -m pip install questionary\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run script [init_disks.py](./init_disks.py) to format and mount disks. **Be careful when choosing the disks to format.** If you see errors like `device or resource busy`, perhaps the partition has been mounted, you should unmount it first. If you still see this error, reboot the system and try again."
]
},
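{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, you can inspect the disks before and after running the script (the device name below is only a placeholder; adjust it to your system):\n",
"\n",
"```bash\n",
"lsblk                  # list disks and existing partitions/mounts\n",
"umount /dev/nvme1n1p1  # example only: unmount a busy partition before re-running\n",
"df -h | grep /data     # verify the /data* mounts after the script finishes\n",
"```"
]
},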
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exit `venv`:\n",
"```bash\n",
"(venv) deactivate\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. Create user `sparkuser`**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create user `sparkuser` without password and with sudo priviledge. It's recommended to use one of the disks as the home directory instead of the system drive.\n",
"\n",
"```bash\n",
"mkdir -p /data0/home/sparkuser\n",
"ln -s /data0/home/sparkuser /home/sparkuser\n",
"cp -r /etc/skel/. /home/sparkuser/\n",
"adduser --home /home/sparkuser --disabled-password --gecos \"\" sparkuser\n",
"\n",
"chown -R sparkuser:sparkuser /data*\n",
"\n",
"echo 'sparkuser ALL=(ALL:ALL) NOPASSWD:ALL' | EDITOR='tee -a' visudo\n",
"```"
]
},
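{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally verify that the user exists and passwordless sudo works:\n",
"\n",
"```bash\n",
"id sparkuser\n",
"su - sparkuser -c 'sudo -n true && echo \"passwordless sudo OK\"'\n",
"```"
]
},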
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate ssh keys for `sparkuser`\n",
"\n",
"```bashrc\n",
"su - sparkuser\n",
"```\n",
"\n",
"```bashrc\n",
"rm -rf ~/.ssh\n",
"ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1\n",
"cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys\n",
"\n",
"exit\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate ssh keys for `root`, and enable no password ssh from `sparkuser`\n",
"\n",
"```bash\n",
"rm -rf /root/.ssh\n",
"ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1\n",
"cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys\n",
"cat /home/sparkuser/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Login to `sparkuser` and run the first-time ssh to the `root`\n",
"\n",
"```bash\n",
"su - sparkuser\n",
"```\n",
"\n",
"```bash\n",
"ssh -o StrictHostKeyChecking=no root@localhost ls\n",
"ssh -o StrictHostKeyChecking=no root@127.0.0.1 ls\n",
"ssh -o StrictHostKeyChecking=no root@`hostname` ls\n",
"```"
]
},
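{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally confirm that the logins are now non-interactive (each command should print `OK` without prompting for a password):\n",
"\n",
"```bash\n",
"ssh -o BatchMode=yes root@localhost true && echo OK\n",
"ssh -o BatchMode=yes root@`hostname` true && echo OK\n",
"```"
]
},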
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Required for Ubuntu***\n",
"\n",
"Run below command to comment out lines starting from `If not running interactively, don't do anything` in ~/.bashrc\n",
"\n",
"```bash\n",
"sed -i '5,9 s/^/# /' ~/.bashrc\n",
"```"
]
},
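{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `5,9` range assumes the default Ubuntu ~/.bashrc layout. A hedged alternative that matches the block by content instead of line numbers:\n",
"\n",
"```bash\n",
"# Comment out the interactivity check, from its leading comment through the closing 'esac'\n",
"sed -i '/^# If not running interactively/,/^esac/ s/^/# /' ~/.bashrc\n",
"```"
]
},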
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. Configure jupyter notebook**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As `sparkuser`, install python packages\n",
"\n",
"```bash\n",
"cd /home/sparkuser/.local/lib/ && rm -rf python*\n",
"\n",
"python3 -m pip install --upgrade jsonschema\n",
"python3 -m pip install jsonschema[format]\n",
"python3 -m pip install sqlalchemy==1.4.46\n",
"python3 -m pip install papermill Black\n",
"python3 -m pip install NotebookScripter\n",
"python3 -m pip install findspark spylon-kernel matplotlib pandasql pyhdfs\n",
"python3 -m pip install ipywidgets jupyter_nbextensions_configurator ipyparallel\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Configure jupyter notebook. Setup password when it prompts\n",
"\n",
"```bash\n",
"jupyter notebook --generate-config\n",
"\n",
"jupyter notebook password\n",
"\n",
"mkdir -p ~/.jupyter/custom/\n",
"\n",
"echo '.container { width:100% !important; }' >> ~/.jupyter/custom/custom.css\n",
"\n",
"echo 'div.output_stderr { background: #ffdd; display: none; }' >> ~/.jupyter/custom/custom.css\n",
"\n",
"jupyter nbextension install --py jupyter_highlight_selected_word --user\n",
"\n",
"jupyter nbextension enable highlight_selected_word/main\n",
"\n",
"jupyter nbextension install --py widgetsnbextension --user\n",
"\n",
"jupyter contrib nbextension install --user\n",
"\n",
"jupyter nbextension enable codefolding/main\n",
"\n",
"jupyter nbextension enable code_prettify/code_prettify\n",
"\n",
"jupyter nbextension enable codefolding/edit\n",
"\n",
"jupyter nbextension enable code_font_size/code_font_size\n",
"\n",
"jupyter nbextension enable collapsible_headings/main\n",
"\n",
"jupyter nbextension enable highlight_selected_word/main\n",
"\n",
"jupyter nbextension enable ipyparallel/main\n",
"\n",
"jupyter nbextension enable move_selected_cells/main\n",
"\n",
"jupyter nbextension enable nbTranslate/main\n",
"\n",
"jupyter nbextension enable scratchpad/main\n",
"\n",
"jupyter nbextension enable tree-filter/index\n",
"\n",
"jupyter nbextension enable comment-uncomment/main\n",
"\n",
"jupyter nbextension enable export_embedded/main\n",
"\n",
"jupyter nbextension enable hide_header/main\n",
"\n",
"jupyter nbextension enable highlighter/highlighter\n",
"\n",
"jupyter nbextension enable scroll_down/main\n",
"\n",
"jupyter nbextension enable snippets/main\n",
"\n",
"jupyter nbextension enable toc2/main\n",
"\n",
"jupyter nbextension enable varInspector/main\n",
"\n",
"jupyter nbextension enable codefolding/edit\n",
"\n",
"jupyter nbextension enable contrib_nbextensions_help_item/main\n",
"\n",
"jupyter nbextension enable freeze/main\n",
"\n",
"jupyter nbextension enable hide_input/main\n",
"\n",
"jupyter nbextension enable jupyter-js-widgets/extension\n",
"\n",
"jupyter nbextension enable snippets_menu/main\n",
"\n",
"jupyter nbextension enable table_beautifier/main\n",
"\n",
"jupyter nbextension enable hide_input_all/main\n",
"\n",
"jupyter nbextension enable spellchecker/main\n",
"\n",
"jupyter nbextension enable toggle_all_line_numbers/main\n",
"\n",
"jupyter nbextensions_configurator enable --user\n",
"```"
]
},
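{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally list the installed extensions to confirm they are enabled:\n",
"\n",
"```bash\n",
"jupyter nbextension list\n",
"```"
]
},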
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clone Gluten\n",
"\n",
"```bash\n",
"cd ~\n",
"git clone https://github.com/apache/incubator-gluten.git gluten\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Start jupyter notebook\n",
"\n",
"```bash\n",
"mkdir -p ~/ipython\n",
"cd ~/ipython\n",
"\n",
"nohup jupyter notebook --ip=0.0.0.0 --port=8888 &\n",
"\n",
"find ~/gluten/tools/workload/benchmark_velox/ -maxdepth 1 -type f -exec cp {} ~/ipython \\;\n",
"```"
]
},
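{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can confirm the server is running and get its URL with (optional check):\n",
"\n",
"```bash\n",
"jupyter notebook list  # running servers and their URLs\n",
"tail -n 20 nohup.out   # server log, in case it failed to start\n",
"```"
]
},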
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Initialize\n",
"<font color=red size=3> Run this section after notebook restart! </font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Specify datadir. The directories are used for spark.local.dirs and hadoop namenode/datanode."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"datadir=[f'/data{i}' for i in range(0, 8)]\n",
"datadir"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Specify clients(workers). Leave it empty if the cluster is setup on the local machine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"clients=''''''.split()\n",
"print(clients)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Specify JAVA_HOME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"java_home = '/usr/lib/jvm/java-8-openjdk-amd64'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"import os\n",
"import socket\n",
"import platform\n",
"\n",
"user=os.getenv('USER')\n",
"print(f\"user: {user}\")\n",
"print()\n",
"\n",
"masterip=socket.gethostbyname(socket.gethostname())\n",
"hostname=socket.gethostname() \n",
"print(f\"masterip: {masterip} hostname: {hostname}\")\n",
"print()\n",
"\n",
"hclients=clients.copy()\n",
"hclients.append(hostname)\n",
"print(f\"master and workers: {hclients}\")\n",
"print()\n",
"\n",
"\n",
"if clients:\n",
" cmd = f\"ssh {clients[0]} \" + \"\\\"lscpu | grep '^CPU(s)'\\\"\" + \" | awk '{print $2}'\"\n",
" client_cpu = !{cmd}\n",
" cpu_num = client_cpu[0]\n",
"\n",
" cmd = f\"ssh {clients[0]} \" + \"\\\"cat /proc/meminfo | grep MemTotal\\\"\" + \" | awk '{print $2}'\"\n",
" totalmemory = !{cmd}\n",
" totalmemory = int(totalmemory[0])\n",
"else:\n",
" cpu_num = os.cpu_count()\n",
" totalmemory = !cat /proc/meminfo | grep MemTotal | awk '{print $2}'\n",
" totalmemory = int(totalmemory[0])\n",
" \n",
"print(f\"cpu_num: {cpu_num}\")\n",
"print()\n",
"\n",
"print(\"total memory: \", totalmemory, \"KB\")\n",
"print()\n",
"\n",
"mem_mib = int(totalmemory/1024)-1024\n",
"print(f\"mem_mib: {mem_mib}\")\n",
"print()\n",
"\n",
"is_arm = platform.machine() == 'aarch64'\n",
"print(\"is_arm: \",is_arm)\n",
"print()\n",
"\n",
"sparklocals=\",\".join([f'{l}/{user}/yarn/local' for l in datadir])\n",
"print(f\"SPARK_LOCAL_DIR={sparklocals}\")\n",
"print()\n",
"\n",
"%cd ~"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Set up clients\n",
"<font color=red size=3> SKIP for single node </font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Install dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Manually configure ssh login without password to all clients\n",
"\n",
"```bash\n",
"ssh-copy-id -o StrictHostKeyChecking=no root@<client ip>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh root@{l} apt update > /dev/null 2>&1\n",
" !ssh root@{l} apt install -y sudo locales wget tar tzdata git ccache cmake ninja-build build-essential llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven vim pip sysstat gcc-9 libjemalloc-dev nvme-cli curl zip unzip bison flex linux-tools-common linux-tools-generic linux-tools-`uname -r` > /dev/null 2>&1"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Create user"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh -o StrictHostKeyChecking=no root@{l} ls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh root@{l} adduser --disabled-password --gecos '\"\"' {user}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh root@{l} cp -r .ssh /home/{user}/\n",
" !ssh root@{l} chown -R {user}:{user} /home/{user}/.ssh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh root@{l} \"echo -e 'sparkuser ALL=(ALL:ALL) NOPASSWD:ALL' | EDITOR='tee -a' visudo\""
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"***Required for Ubuntu***\n",
"\n",
"Run below command to comment out lines starting from If not running interactively, don't do anything in ~/.bashrc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh {l} sed -i \"'5,9 s/^/# /'\" ~/.bashrc"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Use /etc/hosts on master node"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !scp /etc/hosts root@{l}:/etc/hosts"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Setup disks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !ssh root@{l} apt update > /dev/null 2>&1\n",
" !ssh root@{l} apt install -y pip > /dev/null 2>&1\n",
" !ssh root@{l} python3 -m pip install virtualenv"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Manually run **2. Format and mount disks** section under [System Setup](#System-Setup)"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Configure Spark, Hadoop"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Download packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz > /dev/null 2>&1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz > /dev/null 2>&1\n",
"# backup url: !wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz > /dev/null 2>&1\n",
"if is_arm:\n",
" # download both versions\n",
" !wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5-aarch64.tar.gz > /dev/null 2>&1"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Create directories"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"cmd=\";\".join([f\"chown -R {user}:{user} \" + l for l in datadir])\n",
"for l in hclients:\n",
" !ssh root@{l} '{cmd}'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"cmd=\";\".join([f\"rm -rf {l}/tmp; mkdir -p {l}/tmp\" for l in datadir])\n",
"for l in hclients:\n",
" !ssh {l} '{cmd}'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"cmd=\";\".join([f\"mkdir -p {l}/{user}/hdfs/data; mkdir -p {l}/{user}/yarn/local\" for l in datadir])\n",
"for l in hclients:\n",
" !ssh {l} '{cmd}'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!mkdir -p {datadir[0]}/{user}/hdfs/name\n",
"!mkdir -p {datadir[0]}/{user}/hdfs/namesecondary"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !scp hadoop-3.2.4.tar.gz {l}:~/\n",
" !scp spark-3.3.1-bin-hadoop3.tgz {l}:~/\n",
" !ssh {l} \"mv -f hadoop hadoop.bak; mv -f spark spark.bak\"\n",
" !ssh {l} \"tar zxvf hadoop-3.2.4.tar.gz > /dev/null 2>&1\"\n",
" !ssh {l} \"tar -zxvf spark-3.3.1-bin-hadoop3.tgz > /dev/null 2>&1\"\n",
" !ssh root@{l} \"apt install -y openjdk-8-jdk > /dev/null 2>&1\"\n",
" !ssh {l} \"ln -s hadoop-3.2.4 hadoop; ln -s spark-3.3.1-bin-hadoop3 spark\"\n",
" if is_arm:\n",
" !ssh {l} \"tar zxvf hadoop-3.3.5-aarch64.tar.gz > /dev/null 2>&1\"\n",
" !ssh {l} \"cd hadoop && mv lib lib.bak && cp -rf ~/hadoop-3.3.5/lib ~/hadoop\""
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Configure bashrc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"\n",
"cfg=f'''export HADOOP_HOME=~/hadoop\n",
"export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin\n",
"\n",
"export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop\n",
"export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop\n",
"\n",
"export SPARK_HOME=~/spark\n",
"export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH\n",
"export PATH=$SPARK_HOME/bin:$PATH\n",
"\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"if is_arm:\n",
" cfg += 'export CPU_TARGET=\"aarch64\"\\nexport JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64\\nexport PATH=$JAVA_HOME/bin:$PATH\\n'\n",
"else:\n",
" cfg += f'export JAVA_HOME={java_home}\\nexport PATH=$JAVA_HOME/bin:$PATH\\n'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"with open(\"tmpcfg\",'w') as f:\n",
" f.writelines(cfg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !scp tmpcfg {l}:~/tmpcfg.in\n",
" !ssh {l} \"cat ~/tmpcfg.in >> ~/.bashrc\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} tail -n10 ~/.bashrc"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Configure Hadoop"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh root@{l} \"apt install -y libiberty-dev libxml2-dev libkrb5-dev libgsasl7-dev libuuid1 uuid-dev > /dev/null 2>&1\""
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### setup short-circuit "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh root@{l} \"mkdir -p /var/lib/hadoop-hdfs/\"\n",
" !ssh root@{l} 'chown {user}:{user} /var/lib/hadoop-hdfs/'"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### enable security.authorization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"coresite='''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
"<!--\n",
" Licensed under the Apache License, Version 2.0 (the \"License\");\n",
" you may not use this file except in compliance with the License.\n",
" You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing, software\n",
" distributed under the License is distributed on an \"AS IS\" BASIS,\n",
" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
" See the License for the specific language governing permissions and\n",
" limitations under the License. See accompanying LICENSE file.\n",
"-->\n",
"\n",
"<!-- Put site-specific property overrides in this file. -->\n",
"\n",
"<configuration>\n",
" <property>\n",
" <name>fs.default.name</name>\n",
" <value>hdfs://{:s}:8020</value>\n",
" <final>true</final>\n",
" </property>\n",
" <property>\n",
" <name>hadoop.security.authentication</name>\n",
" <value>simple</value>\n",
" </property>\n",
" <property>\n",
" <name>hadoop.security.authorization</name>\n",
" <value>true</value>\n",
" </property>\n",
"</configuration>\n",
"'''.format(hostname)\n",
"\n",
"with open(f'/home/{user}/hadoop/etc/hadoop/core-site.xml','w') as f:\n",
" f.writelines(coresite)\n",
" \n",
"for l in clients:\n",
" !scp ~/hadoop/etc/hadoop/core-site.xml {l}:~/hadoop/etc/hadoop/core-site.xml >/dev/null 2>&1"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### set IP check, note the command <font color=red>\", \"</font>.join"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"hadooppolicy='''<?xml version=\"1.0\"?>\n",
"<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
"<!--\n",
"\n",
" Licensed to the Apache Software Foundation (ASF) under one\n",
" or more contributor license agreements. See the NOTICE file\n",
" distributed with this work for additional information\n",
" regarding copyright ownership. The ASF licenses this file\n",
" to you under the Apache License, Version 2.0 (the\n",
" \"License\"); you may not use this file except in compliance\n",
" with the License. You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing, software\n",
" distributed under the License is distributed on an \"AS IS\" BASIS,\n",
" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
" See the License for the specific language governing permissions and\n",
" limitations under the License.\n",
"\n",
"-->\n",
"\n",
"<!-- Put site-specific property overrides in this file. -->\n",
"\n",
"<configuration>\n",
" <property>\n",
" <name>security.service.authorization.default.hosts</name>\n",
" <value>{:s}</value>\n",
" </property>\n",
" <property>\n",
" <name>security.service.authorization.default.acl</name>\n",
" <value>{:s} {:s}</value>\n",
" </property>\n",
" \n",
" \n",
" <property>\n",
" <name>security.client.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ClientProtocol, which is used by user code\n",
" via the DistributedFileSystem.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.client.datanode.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ClientDatanodeProtocol, the client-to-datanode protocol\n",
" for block recovery.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.datanode.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for DatanodeProtocol, which is used by datanodes to\n",
" communicate with the namenode.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.inter.datanode.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for InterDatanodeProtocol, the inter-datanode protocol\n",
" for updating generation timestamp.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.namenode.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for NamenodeProtocol, the protocol used by the secondary\n",
" namenode to communicate with the namenode.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.admin.operations.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for AdminOperationsProtocol. Used for admin commands.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.refresh.user.mappings.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for RefreshUserMappingsProtocol. Used to refresh\n",
" users mappings. The ACL is a comma-separated list of user and\n",
" group names. The user and group list is separated by a blank. For\n",
" e.g. \"alice,bob users,wheel\". A special value of \"*\" means all\n",
" users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.refresh.policy.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for RefreshAuthorizationPolicyProtocol, used by the\n",
" dfsadmin and mradmin commands to refresh the security policy in-effect.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.ha.service.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for HAService protocol used by HAAdmin to manage the\n",
" active and stand-by states of namenode.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.zkfc.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for access to the ZK Failover Controller\n",
" </description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.qjournal.service.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for QJournalProtocol, used by the NN to communicate with\n",
" JNs when using the QuorumJournalManager for edit logs.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.mrhs.client.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for HSClientProtocol, used by job clients to\n",
" communciate with the MR History Server job status etc.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <!-- YARN Protocols -->\n",
"\n",
" <property>\n",
" <name>security.resourcetracker.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ResourceTrackerProtocol, used by the\n",
" ResourceManager and NodeManager to communicate with each other.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.resourcemanager-administration.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ResourceManagerAdministrationProtocol, for admin commands.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.applicationclient.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ApplicationClientProtocol, used by the ResourceManager\n",
" and applications submission clients to communicate with each other.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.applicationmaster.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ApplicationMasterProtocol, used by the ResourceManager\n",
" and ApplicationMasters to communicate with each other.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.containermanagement.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ContainerManagementProtocol protocol, used by the NodeManager\n",
" and ApplicationMasters to communicate with each other.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.resourcelocalizer.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ResourceLocalizer protocol, used by the NodeManager\n",
" and ResourceLocalizer to communicate with each other.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.job.task.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for TaskUmbilicalProtocol, used by the map and reduce\n",
" tasks to communicate with the parent tasktracker.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.job.client.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for MRClientProtocol, used by job clients to\n",
" communciate with the MR ApplicationMaster to query job status etc.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>security.applicationhistory.protocol.acl</name>\n",
" <value>*</value>\n",
" <description>ACL for ApplicationHistoryProtocol, used by the timeline\n",
" server and the generic history service client to communicate with each other.\n",
" The ACL is a comma-separated list of user and group names. The user and\n",
" group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
" A special value of \"*\" means all users are allowed.</description>\n",
" </property>\n",
"\n",
" \n",
" \n",
" \n",
" \n",
"</configuration>\n",
"'''.format((\",\").join(hclients),user,user)\n",
"\n",
"with open(f'/home/{user}/hadoop/etc/hadoop/hadoop-policy.xml','w') as f:\n",
" f.writelines(hadooppolicy)\n",
" \n",
"for l in clients:\n",
" !scp ~/hadoop/etc/hadoop/hadoop-policy.xml {l}:~/hadoop/etc/hadoop/hadoop-policy.xml >/dev/null 2>&1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### hdfs config, set replication to 1 to cache all the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"code_folding": [],
"hidden": true
},
"outputs": [],
"source": [
"hdfs_data=\",\".join([f'{l}/{user}/hdfs/data' for l in datadir])\n",
"\n",
"hdfs_site=f'''<?xml version=\"1.0\"?>\n",
"<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
"\n",
"<!-- Put site-specific property overrides in this file. -->\n",
"\n",
"<configuration>\n",
" <property>\n",
" <name>dfs.namenode.secondary.http-address</name>\n",
" <value>{hostname}:50090</value>\n",
" </property>\n",
" <property>\n",
" <name>dfs.namenode.name.dir</name>\n",
" <value>{datadir[0]}/{user}/hdfs/name</value>\n",
" <final>true</final>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>dfs.datanode.data.dir</name>\n",
" <value>{hdfs_data}</value>\n",
" <final>true</final>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>dfs.namenode.checkpoint.dir</name>\n",
" <value>{datadir[0]}/{user}/hdfs/namesecondary</value>\n",
" <final>true</final>\n",
" </property>\n",
" <property>\n",
" <name>dfs.name.handler.count</name>\n",
" <value>100</value>\n",
" </property>\n",
" <property>\n",
" <name>dfs.blocksize</name>\n",
" <value>128m</value>\n",
"</property>\n",
" <property>\n",
" <name>dfs.replication</name>\n",
" <value>1</value>\n",
"</property>\n",
"\n",
"<property>\n",
" <name>dfs.client.read.shortcircuit</name>\n",
" <value>true</value>\n",
"</property>\n",
"\n",
"<property>\n",
" <name>dfs.domain.socket.path</name>\n",
" <value>/var/lib/hadoop-hdfs/dn_socket</value>\n",
"</property>\n",
"\n",
"</configuration>\n",
"'''\n",
"\n",
"\n",
"with open(f'/home/{user}/hadoop/etc/hadoop/hdfs-site.xml','w') as f:\n",
" f.writelines(hdfs_site)\n",
" \n",
"for l in clients:\n",
" !scp ~/hadoop/etc/hadoop/hdfs-site.xml {l}:~/hadoop/etc/hadoop/hdfs-site.xml >/dev/null 2>&1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### mapreduce config"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"mapreduce='''<?xml version=\"1.0\"?>\n",
"<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
"<!--\n",
" Licensed under the Apache License, Version 2.0 (the \"License\");\n",
" you may not use this file except in compliance with the License.\n",
" You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing, software\n",
" distributed under the License is distributed on an \"AS IS\" BASIS,\n",
" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
" See the License for the specific language governing permissions and\n",
" limitations under the License. See accompanying LICENSE file.\n",
"-->\n",
"\n",
"<!-- Put site-specific property overrides in this file. -->\n",
"\n",
"<configuration>\n",
" <property>\n",
" <name>mapreduce.framework.name</name>\n",
" <value>yarn</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>mapreduce.job.maps</name>\n",
" <value>288</value>\n",
" </property>\n",
" <property>\n",
" <name>mapreduce.job.reduces</name>\n",
" <value>64</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>mapreduce.map.java.opts</name>\n",
" <value>-Xmx5120M -DpreferIPv4Stack=true</value>\n",
" </property>\n",
" <property>\n",
" <name>mapreduce.map.memory.mb</name>\n",
" <value>6144</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>mapreduce.reduce.java.opts</name>\n",
" <value>-Xmx5120M -DpreferIPv4Stack=true</value>\n",
" </property>\n",
" <property>\n",
" <name>mapreduce.reduce.memory.mb</name>\n",
" <value>6144</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.app.mapreduce.am.staging-dir</name>\n",
" <value>/user</value>\n",
" </property>\n",
" <property>\n",
" <name>mapreduce.task.io.sort.mb</name>\n",
" <value>2000</value>\n",
" </property>\n",
" <property>\n",
" <name>mapreduce.task.timeout</name>\n",
" <value>3600000</value>\n",
" </property>\n",
"<!-- MapReduce Job History Server security configs -->\n",
"<property>\n",
" <name>mapreduce.jobhistory.address</name>\n",
" <value>{:s}:10020</value>\n",
"</property>\n",
"\n",
"</configuration>\n",
"'''.format(hostname)\n",
"\n",
"\n",
"with open(f'/home/{user}/hadoop/etc/hadoop/mapred-site.xml','w') as f:\n",
" f.writelines(mapreduce)\n",
" \n",
"for l in clients:\n",
" !scp ~/hadoop/etc/hadoop/mapred-site.xml {l}:~/hadoop/etc/hadoop/mapred-site.xml >/dev/null 2>&1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### yarn config"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"code_folding": [],
"hidden": true
},
"outputs": [],
"source": [
"yarn_site=f'''<?xml version=\"1.0\"?>\n",
"<!--\n",
" Licensed under the Apache License, Version 2.0 (the \"License\");\n",
" you may not use this file except in compliance with the License.\n",
" You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing, software\n",
" distributed under the License is distributed on an \"AS IS\" BASIS,\n",
" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
" See the License for the specific language governing permissions and\n",
" limitations under the License. See accompanying LICENSE file.\n",
"-->\n",
"<configuration>\n",
" <property>\n",
" <name>yarn.resourcemanager.hostname</name>\n",
" <value>{hostname}</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.resourcemanager.address</name>\n",
" <value>{hostname}:8032</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.resourcemanager.webapp.address</name>\n",
" <value>{hostname}:8088</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.resource.memory-mb</name>\n",
" <value>{mem_mib}</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.resource.cpu-vcores</name>\n",
" <value>{cpu_num}</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.pmem-check-enabled</name>\n",
" <value>false</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>yarn.nodemanager.vmem-check-enabled</name>\n",
" <value>false</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.vmem-pmem-ratio</name>\n",
" <value>4.1</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.aux-services</name>\n",
" <value>mapreduce_shuffle,spark_shuffle</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>yarn.scheduler.minimum-allocation-mb</name>\n",
" <value>1024</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.scheduler.maximum-allocation-mb</name>\n",
" <value>{mem_mib}</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.scheduler.minimum-allocation-vcores</name>\n",
" <value>1</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.scheduler.maximum-allocation-vcores</name>\n",
" <value>{cpu_num}</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>yarn.log-aggregation-enable</name>\n",
" <value>false</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.log.retain-seconds</name>\n",
" <value>36000</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.delete.debug-delay-sec</name>\n",
" <value>3600</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.log.server.url</name>\n",
" <value>http://{hostname}:19888/jobhistory/logs/</value>\n",
" </property>\n",
"\n",
" <property>\n",
" <name>yarn.nodemanager.log-dirs</name>\n",
" <value>/home/{user}/hadoop/logs/userlogs</value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.local-dirs</name>\n",
" <value>{sparklocals}\n",
" </value>\n",
" </property>\n",
" <property>\n",
" <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>\n",
" <value>org.apache.spark.network.yarn.YarnShuffleService</value>\n",
" </property>\n",
"</configuration>\n",
"'''\n",
"\n",
"\n",
"with open(f'/home/{user}/hadoop/etc/hadoop/yarn-site.xml','w') as f:\n",
" f.writelines(yarn_site)\n",
" \n",
"for l in clients:\n",
" !scp ~/hadoop/etc/hadoop/yarn-site.xml {l}:~/hadoop/etc/hadoop/yarn-site.xml >/dev/null 2>&1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### hadoop-env"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"#config java home\n",
"if is_arm:\n",
" !echo \"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64\" >> ~/hadoop/etc/hadoop/hadoop-env.sh\n",
"else:\n",
" !echo \"export JAVA_HOME={java_home}\" >> ~/hadoop/etc/hadoop/hadoop-env.sh\n",
"\n",
"for l in clients:\n",
" !scp hadoop/etc/hadoop/hadoop-env.sh {l}:~/hadoop/etc/hadoop/ >/dev/null 2>&1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### workers config"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"if clients:\n",
" with open(f'/home/{user}/hadoop/etc/hadoop/workers','w') as f:\n",
" f.writelines(\"\\n\".join(clients))\n",
" for l in clients:\n",
" !scp hadoop/etc/hadoop/workers {l}:~/hadoop/etc/hadoop/ >/dev/null 2>&1\n",
"else:\n",
" !echo {hostname} > ~/hadoop/etc/hadoop/workers"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"### Copy jar from Spark for external shuffle service"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !scp spark/yarn/spark-3.3.1-yarn-shuffle.jar {l}:~/hadoop/share/hadoop/common/lib/"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Configure Spark"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"eventlog_dir=f'hdfs://{hostname}:8020/tmp/sparkEventLog'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"sparkconf=f'''\n",
"spark.eventLog.enabled true\n",
"spark.eventLog.dir {eventlog_dir}\n",
"spark.history.fs.logDirectory {eventlog_dir}\n",
"'''\n",
"\n",
"with open(f'/home/{user}/spark/conf/spark-defaults.conf','w+') as f:\n",
" f.writelines(sparkconf)\n",
" \n",
"for l in clients:\n",
" !scp ~/spark/conf/spark-defaults.conf {l}:~/spark/conf/spark-defaults.conf >/dev/null 2>&1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"sparkenv = f'export SPARK_LOCAL_DIRS={sparklocals}\\n'\n",
"with open(f'/home/{user}/.bashrc', 'a+') as f:\n",
" f.writelines(sparkenv)\n",
"for l in clients:\n",
" !scp ~/.bashrc {l}:~/.bashrc >/dev/null 2>&1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} tail -n10 ~/.bashrc"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Configure monitor & startups"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!cd ~\n",
"!git clone https://github.com/trailofbits/tsc_freq_khz.git\n",
"\n",
"for l in clients:\n",
" !scp -r tsc_freq_khz {l}:~/\n",
"\n",
"for l in hclients:\n",
" !ssh {l} 'cd tsc_freq_khz && make && sudo insmod ./tsc_freq_khz.ko' >/dev/null 2>&1\n",
" !ssh root@{l} 'dmesg | grep tsc_freq_khz'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"startup=f'''#!/bin/bash\n",
"echo -1 > /proc/sys/kernel/perf_event_paranoid\n",
"echo 0 > /proc/sys/kernel/kptr_restrict\n",
"echo madvise >/sys/kernel/mm/transparent_hugepage/enabled\n",
"echo 1 > /proc/sys/kernel/numa_balancing\n",
"end=$(($(nproc) - 1))\n",
"for i in $(seq 0 $end); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done\n",
"for file in $(find /sys/devices/system/cpu/cpu*/power/energy_perf_bias); do echo \"0\" > $file; done\n",
"\n",
"if [ -d /home/{user}/sep_installed ]; then\n",
" /home/{user}/sep_installed/sepdk/src/insmod-sep -g {user}\n",
"fi\n",
"'''\n",
"\n",
"with open('/tmp/tmpstartup', 'w') as f:\n",
" f.writelines(startup)\n",
"\n",
"startup_service=f'''[Unit]\n",
"Description=Configure Transparent Hugepage, Auto NUMA Balancing, CPU Freq Scaling Governor\n",
"\n",
"[Service]\n",
"ExecStart=/usr/local/bin/mystartup.sh\n",
"\n",
"[Install]\n",
"WantedBy=multi-user.target\n",
"'''\n",
"\n",
"with open('/tmp/tmpstartup_service', 'w') as f:\n",
" f.writelines(startup_service)\n",
" \n",
"for l in hclients:\n",
" !scp /tmp/tmpstartup $l:/tmp/tmpstartup\n",
" !scp /tmp/tmpstartup_service $l:/tmp/tmpstartup_service\n",
" !ssh root@$l \"cat /tmp/tmpstartup > /usr/local/bin/mystartup.sh\"\n",
" !ssh root@$l \"chmod +x /usr/local/bin/mystartup.sh\"\n",
" !ssh root@$l \"cat /tmp/tmpstartup_service > /etc/systemd/system/mystartup.service\"\n",
" !ssh $l \"sudo systemctl enable mystartup.service\"\n",
" !ssh $l \"sudo systemctl start mystartup.service\"\n",
" !ssh $l \"sudo systemctl status mystartup.service\""
]
},
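{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"A quick check (not part of the original flow) that the startup settings took effect on a node:\n",
"\n",
"```bash\n",
"cat /sys/kernel/mm/transparent_hugepage/enabled            # expect [madvise]\n",
"cat /proc/sys/kernel/perf_event_paranoid                   # expect -1\n",
"cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # expect performance\n",
"```"
]
},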
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Install Emon"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"<font color=red size=3> Get the latest offline installer from [link](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html?operatingsystem=linux&linux-install-type=offline) </font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"offline_installer = 'https://registrationcenter-download.intel.com/akdlm/IRC_NAS/e7797b12-ce87-4df0-aa09-df4a272fc5d9/intel-vtune-2025.0.0.1130_offline.sh'\n",
"for l in hclients:\n",
" !ssh {l} \"wget {offline_installer} -q && chmod +x intel-vtune-2025.0.0.1130_offline.sh\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"scrolled": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} \"sudo ./intel-vtune-2025.0.0.1130_offline.sh -a -c -s --eula accept\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} \"sudo chown -R {user}:{user} /opt/intel/oneapi/vtune/ && rm -f sep_installed && ln -s /opt/intel/oneapi/vtune/latest sep_installed\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"scrolled": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} \"cd sep_installed/sepdk/src/; echo -e \\\"\\\\n\\\\n\\\\n\\\" | ./build-driver\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"scrolled": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh root@{l} \"/home/{user}/sep_installed/sepdk/src/rmmod-sep && /home/{user}/sep_installed/sepdk/src/insmod-sep -g {user}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} \"source /home/{user}/sep_installed/sep_vars.sh > /dev/null 2>&1; emon -v | head -n 1\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} 'echo \"source /home/{user}/sep_installed/sep_vars.sh > /dev/null 2>&1\" >> ~/.bashrc'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for c in hclients:\n",
" !ssh {c} 'tail -n1 ~/.bashrc'"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"heading_collapsed": true,
"hidden": true,
"run_control": {
"frozen": true
}
},
"source": [
"## Inspect CPU Freq & HT"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"hidden": true,
"run_control": {
"frozen": true
}
},
"outputs": [],
"source": [
"if is_arm:\n",
" t = r'''\n",
" #include <stdlib.h>\n",
" #include <stdio.h>\n",
" #include <assert.h>\n",
" #include <sys/time.h>\n",
" #include <sched.h>\n",
" #include <sys/shm.h>\n",
" #include <errno.h>\n",
" #include <sys/mman.h>\n",
" #include <unistd.h> //used for parsing the command line arguments\n",
" #include <fcntl.h> //used for opening the memory device file\n",
" #include <math.h> //used for rounding functions\n",
" #include <signal.h>\n",
" #include <linux/types.h>\n",
" #include <stdint.h>\n",
"\n",
" static inline uint64_t GetTickCount()\n",
" {//Return ns counts\n",
" struct timeval tp;\n",
" gettimeofday(&tp,NULL);\n",
" return tp.tv_sec*1000+tp.tv_usec/1000;\n",
" }\n",
"\n",
" uint64_t CNT=CNT_DEF;\n",
"\n",
" int main()\n",
" {\n",
"\n",
" uint64_t start, end;\n",
" start=end=GetTickCount();\n",
"\n",
" asm volatile (\n",
" \"1:\\n\"\n",
" \"SUBS %0,%0,#1\\n\"\n",
" \"bne 1b\\n\"\n",
" ::\"r\"(CNT)\n",
" );\n",
"\n",
" end=GetTickCount();\n",
"\n",
" printf(\" total time = %lu, freq = %lu \\n\", end-start, CNT/(end-start)/1000);\n",
"\n",
" return 0;\n",
" }\n",
" '''\n",
"else:\n",
" t=r'''\n",
" #include <stdlib.h>\n",
" #include <stdio.h>\n",
" #include <assert.h>\n",
" #include <sys/time.h>\n",
" #include <sched.h>\n",
" #include <sys/shm.h>\n",
" #include <errno.h>\n",
" #include <sys/mman.h>\n",
" #include <unistd.h> //used for parsing the command line arguments\n",
" #include <fcntl.h> //used for opening the memory device file\n",
" #include <math.h> //used for rounding functions\n",
" #include <signal.h>\n",
" #include <linux/types.h>\n",
" #include <stdint.h>\n",
"\n",
" static inline uint64_t GetTickCount()\n",
" {//Return ns counts\n",
" struct timeval tp;\n",
" gettimeofday(&tp,NULL);\n",
" return tp.tv_sec*1000+tp.tv_usec/1000;\n",
" }\n",
"\n",
" uint64_t CNT=CNT_DEF;\n",
"\n",
" int main()\n",
" {\n",
"\n",
" uint64_t start, end;\n",
" start=end=GetTickCount();\n",
"\n",
" asm volatile (\n",
" \"1:\\n\"\n",
" \"dec %0\\n\"\n",
" \"jnz 1b\\n\"\n",
" ::\"r\"(CNT)\n",
" );\n",
"\n",
" end=GetTickCount();\n",
"\n",
" printf(\" total time = %lu, freq = %lu \\n\", end-start, CNT/(end-start)/1000);\n",
"\n",
" return 0;\n",
" }\n",
" '''"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"hidden": true,
"run_control": {
"frozen": true
}
},
"outputs": [],
"source": [
"%cd ~\n",
"with open(\"t.c\", 'w') as f:\n",
" f.writelines(t)\n",
"!gcc -O3 -DCNT_DEF=10000000000LL -o t t.c; gcc -O3 -DCNT_DEF=1000000000000LL -o t.delay t.c;\n",
"!for j in `seq 1 $(nproc)`; do echo -n $j; (for i in `seq 1 $j`; do taskset -c $i ./t.delay & done); sleep 1; ./t; killall t.delay; sleep 2; done"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# <font color=red>Shutdown Jupyter; source ~/.bashrc; reboot Jupyter; run section [Initialize](#Initialize)</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Build gluten"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Install docker"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Instructions from https://docs.docker.com/engine/install/ubuntu/\n",
"\n",
"# Add Docker's official GPG key:\n",
"!sudo -E apt-get update\n",
"!sudo -E apt-get install ca-certificates curl\n",
"!sudo -E install -m 0755 -d /etc/apt/keyrings\n",
"!sudo -E curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc\n",
"!sudo chmod a+r /etc/apt/keyrings/docker.asc\n",
"\n",
"# Add the repository to Apt sources:\n",
"!echo \\\n",
" \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \\\n",
" $(. /etc/os-release && echo \"$VERSION_CODENAME\") stable\" | \\\n",
" sudo -E tee /etc/apt/sources.list.d/docker.list > /dev/null\n",
"!sudo -E apt-get update"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!sudo -E apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin >/dev/null 2>&1"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Configure docker proxy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"import os\n",
"http_proxy=os.getenv('http_proxy')\n",
"https_proxy=os.getenv('https_proxy')\n",
"\n",
"if http_proxy or https_proxy:\n",
" !sudo mkdir -p /etc/systemd/system/docker.service.d\n",
" with open('/tmp/http-proxy.conf', 'w') as f:\n",
" s = '''\n",
"[Service]\n",
"{}\n",
"{}\n",
"'''.format(f'Environment=\"HTTP_PROXY={http_proxy}\"' if http_proxy else '', f'Environment=\"HTTPS_PROXY={https_proxy}\"' if https_proxy else '')\n",
" f.writelines(s)\n",
" !sudo cp /tmp/http-proxy.conf /etc/systemd/system/docker.service.d\n",
" \n",
" !ssh root@localhost \"mkdir -p /root/.docker\"\n",
" with open(f'/tmp/config.json', 'w') as f:\n",
" s = f'''\n",
"{{\n",
" \"proxies\": {{\n",
" \"default\": {{\n",
" \"httpProxy\": \"{http_proxy}\",\n",
" \"httpsProxy\": \"{https_proxy}\",\n",
" \"noProxy\": \"127.0.0.0/8\"\n",
" }}\n",
" }}\n",
"}}\n",
" '''\n",
" f.writelines(s)\n",
" !ssh root@localhost \"cp -f /tmp/config.json /root/.docker\""
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Configure maven proxy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!mkdir -p ~/.m2\n",
"\n",
"def get_proxy(proxy):\n",
" pos0 = proxy.rfind('/')\n",
" pos = proxy.rfind(':')\n",
" host = http_proxy[pos0+1:pos]\n",
" port = http_proxy[pos+1:]\n",
" return host, port\n",
"\n",
"if http_proxy or https_proxy:\n",
" with open(f\"/home/{user}/.m2/settings.xml\",\"w+\") as f:\n",
" f.write('''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<settings xmlns=\"http://maven.apache.org/SETTINGS/1.2.0\"\n",
" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n",
" xsi:schemaLocation=\"http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd\">\n",
" <proxies>''')\n",
" if http_proxy:\n",
" host, port = get_proxy(http_proxy)\n",
" f.write(f'''\n",
" <proxy>\n",
" <id>http_proxy</id>\n",
" <active>true</active>\n",
" <protocol>http</protocol>\n",
" <host>{host}</host>\n",
" <port>{port}</port>\n",
" </proxy>''')\n",
" if https_proxy:\n",
" host, port = get_proxy(http_proxy)\n",
" f.write(f'''\n",
" <proxy>\n",
" <id>https_proxy</id>\n",
" <active>true</active>\n",
" <protocol>https</protocol>\n",
" <host>{host}</host>\n",
" <port>{port}</port>\n",
" </proxy>''')\n",
" f.write('''\n",
" </proxies>\n",
"</settings>\n",
"''')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!sudo systemctl daemon-reload"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!sudo systemctl restart docker.service"
]
},
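{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Optional sanity check that Docker restarted cleanly and picked up the proxy settings (if any):\n",
"\n",
"```bash\n",
"sudo docker info | grep -i proxy\n",
"sudo docker run --rm hello-world\n",
"```"
]
},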
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Build gluten"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"%cd ~"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"import os\n",
"if not os.path.exists('gluten'):\n",
" !git clone https://github.com/apache/incubator-gluten.git gluten"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Build Arrow for the first time build.\n",
"!sed -i 's/--build_arrow=OFF/--build_arrow=ON/' ~/gluten/dev/package-vcpkg.sh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"%cd ~/gluten/tools/workload/benchmark_velox"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!bash build_gluten.sh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !scp ~/gluten/package/target/gluten-velox-bundle-spark*.jar {l}:~/"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Generate data"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Build spark-sql-perf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!echo \"deb https://repo.scala-sbt.org/scalasbt/debian all main\" | sudo tee /etc/apt/sources.list.d/sbt.list\n",
"!echo \"deb https://repo.scala-sbt.org/scalasbt/debian /\" | sudo tee /etc/apt/sources.list.d/sbt_old.list\n",
"!curl -sL \"https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823\" | sudo apt-key add\n",
"!sudo -E apt-get update > /dev/null 2>&1\n",
"!sudo -E apt-get install sbt > /dev/null 2>&1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"import os\n",
"http_proxy=os.getenv('http_proxy')\n",
"https_proxy=os.getenv('https_proxy')\n",
"\n",
"def get_proxy(proxy):\n",
" pos0 = proxy.rfind('/')\n",
" pos = proxy.rfind(':')\n",
" host = http_proxy[pos0+1:pos]\n",
" port = http_proxy[pos+1:]\n",
" return host, port\n",
"\n",
"sbt_opts=''\n",
"\n",
"if http_proxy:\n",
" host, port = get_proxy(http_proxy)\n",
" sbt_opts = f'{sbt_opts} -Dhttp.proxyHost={host} -Dhttp.proxyPort={port}'\n",
"if https_proxy:\n",
" host, port = get_proxy(https_proxy)\n",
" sbt_opts = f'{sbt_opts} -Dhttps.proxyHost={host} -Dhttps.proxyPort={port}'\n",
" \n",
"if sbt_opts:\n",
" %env SBT_OPTS={sbt_opts}"
]
},
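{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"As an illustration, if `http_proxy` were set to the made-up address `http://proxy.example.com:8080`, the cell above would set `SBT_OPTS` to `-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080`, which sbt uses to route its downloads through the proxy."
]
},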
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!git clone https://github.com/databricks/spark-sql-perf.git ~/spark-sql-perf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!cd ~/spark-sql-perf && sbt package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!cp ~/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar ~/ipython/"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## Start Hadoop/Spark cluster, Spark history server"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!~/hadoop/bin/hadoop namenode -format"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!~/hadoop/bin/hadoop datanode -format "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!~/hadoop/sbin/start-dfs.sh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!hadoop dfsadmin -safemode leave"
]
},
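{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Optionally, run a quick sanity check that HDFS is out of safe mode and all datanodes have registered before generating data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!~/hadoop/bin/hdfs dfsadmin -report | head -n 20"
]
},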
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!hadoop fs -mkdir -p /tmp/sparkEventLog"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!cd ~/spark && sbin/start-history-server.sh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"import os\n",
"\n",
"master=''\n",
"if clients:\n",
" !~/hadoop/sbin/start-yarn.sh\n",
" master='yarn'\n",
"else:\n",
" # If we run on single node, we use standalone mode\n",
" !{os.environ['SPARK_HOME']}/sbin/stop-slave.sh\n",
" !{os.environ['SPARK_HOME']}/sbin/stop-master.sh\n",
" !{os.environ['SPARK_HOME']}/sbin/start-master.sh\n",
" !{os.environ['SPARK_HOME']}/sbin/start-worker.sh spark://{hostname}:7077 -c {cpu_num}\n",
" master=f'spark://{hostname}:7077'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!jps\n",
"for l in clients:\n",
" !ssh {l} jps"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## TPCH"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!rm -rf ~/tpch-dbgen\n",
"!git clone https://github.com/databricks/tpch-dbgen.git ~/tpch-dbgen"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !scp -r ~/tpch-dbgen {l}:~/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} cd ~/tpch-dbgen && git checkout 0469309147b42abac8857fa61b4cf69a6d3128a8 && make clean && make OS=LINUX"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"%cd ~/gluten/tools/workload/tpch/gen_data/parquet_dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Suggest 2x cpu# partitions.\n",
"scaleFactor = 1500\n",
"numPartitions = 2*cpu_num if len(clients)==0 else len(clients)*2*cpu_num\n",
"dataformat = \"parquet\" # data format of data source\n",
"dataSourceCodec = \"snappy\"\n",
"rootDir = f\"/tpch_sf{scaleFactor}_{dataformat}_{dataSourceCodec}\" # root directory of location to create data in."
]
},
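{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"For example, with 2 worker nodes and `cpu_num = 64`, the formula above gives `numPartitions = 2 * 2 * 64 = 256` dbgen tasks; on a single node it would be `2 * 64 = 128`. The numbers here are illustrative only."
]
},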
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Verify parameters\n",
"print(f'scaleFactor = {scaleFactor}')\n",
"print(f'numPartitions = {numPartitions}')\n",
"print(f'dataformat = {dataformat}')\n",
"print(f'rootDir = {rootDir}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"scala=f'''import com.databricks.spark.sql.perf.tpch._\n",
"\n",
"\n",
"val scaleFactor = \"{scaleFactor}\" // scaleFactor defines the size of the dataset to generate (in GB).\n",
"val numPartitions = {numPartitions} // how many dsdgen partitions to run - number of input tasks.\n",
"\n",
"val format = \"{dataformat}\" // valid spark format like parquet \"parquet\".\n",
"val rootDir = \"{rootDir}\" // root directory of location to create data in.\n",
"val dbgenDir = \"/home/{user}/tpch-dbgen\" // location of dbgen\n",
"\n",
"val tables = new TPCHTables(spark.sqlContext,\n",
" dbgenDir = dbgenDir,\n",
" scaleFactor = scaleFactor,\n",
" useDoubleForDecimal = false, // true to replace DecimalType with DoubleType\n",
" useStringForDate = false) // true to replace DateType with StringType\n",
"\n",
"\n",
"tables.genData(\n",
" location = rootDir,\n",
" format = format,\n",
" overwrite = true, // overwrite the data that is already there\n",
" partitionTables = false, // do not create the partitioned fact tables\n",
" clusterByPartitionColumns = false, // shuffle to get partitions coalesced into single files.\n",
" filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value\n",
" tableFilter = \"\", // \"\" means generate all tables\n",
" numPartitions = numPartitions) // how many dsdgen partitions to run - number of input tasks.\n",
"'''\n",
"\n",
"with open(\"tpch_datagen_parquet.scala\",\"w\") as f:\n",
" f.writelines(scala)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"executor_cores = 8\n",
"num_executors=cpu_num/executor_cores\n",
"executor_memory = (totalmemory - 10*1024*1024)/num_executors - 1*1024*1024\n",
"\n",
"# Verify parameters\n",
"print(f'--master {master}')\n",
"print(f'--num-executors {int(num_executors)}')\n",
"print(f'--executor-cores {int(executor_cores)}')\n",
"print(f'--executor-memory {int(executor_memory)}k')\n",
"print(f'--conf spark.sql.shuffle.partitions={numPartitions}')\n",
"print(f'--conf spark.sql.parquet.compression.codec={dataSourceCodec}')"
]
},
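{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"The sizing above assumes `totalmemory` is in KB (it is passed to `--executor-memory` with a `k` suffix): 10 GB is reserved for the driver, the remainder is split across executors, and 1 GB per executor is left for `spark.executor.memoryOverhead`. As an illustrative example, with 256 GB of memory and `cpu_num = 64`: `num_executors = 64 / 8 = 8` and `executor_memory = (256 GB - 10 GB) / 8 - 1 GB = 29.75 GB` per executor."
]
},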
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"tpch_datagen_parquet=f'''\n",
"cat tpch_datagen_parquet.scala | {os.environ['SPARK_HOME']}/bin/spark-shell \\\n",
" --master {master} \\\n",
" --name tpch_gen_parquet \\\n",
" --driver-memory 10g \\\n",
" --num-executors {int(num_executors)} \\\n",
" --executor-cores {int(executor_cores)} \\\n",
" --executor-memory {int(executor_memory)}k \\\n",
" --conf spark.executor.memoryOverhead=1g \\\n",
" --conf spark.sql.broadcastTimeout=4800 \\\n",
" --conf spark.driver.maxResultSize=4g \\\n",
" --conf spark.sql.shuffle.partitions={numPartitions} \\\n",
" --conf spark.sql.parquet.compression.codec={dataSourceCodec} \\\n",
" --conf spark.network.timeout=800s \\\n",
" --conf spark.executor.heartbeatInterval=200s \\\n",
" --jars /home/{user}/ipython/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \\\n",
"'''\n",
"\n",
"with open(\"tpch_datagen_parquet.sh\",\"w\") as f:\n",
" f.writelines(tpch_datagen_parquet)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!nohup bash tpch_datagen_parquet.sh"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"## TPCDS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!rm -rf ~/tpcds-kit\n",
"!git clone https://github.com/databricks/tpcds-kit.git ~/tpcds-kit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !scp -r ~/tpcds-kit {l}:~/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in hclients:\n",
" !ssh {l} \"cd ~/tpcds-kit/tools && make clean && make OS=LINUX CC=gcc-9\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"%cd ~/gluten/tools/workload/tpcds/gen_data/parquet_dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Suggest 2x cpu# partitions\n",
"scaleFactor = 1500\n",
"numPartitions = 2*cpu_num if len(clients)==0 else len(clients)*2*cpu_num\n",
"dataformat = \"parquet\" # data format of data source\n",
"dataSourceCodec = \"snappy\"\n",
"rootDir = f\"/tpcds_sf{scaleFactor}_{dataformat}_{dataSourceCodec}\" # root directory of location to create data in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"# Verify parameters\n",
"print(f'scaleFactor = {scaleFactor}')\n",
"print(f'numPartitions = {numPartitions}')\n",
"print(f'dataformat = {dataformat}')\n",
"print(f'rootDir = {rootDir}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"scala=f'''import com.databricks.spark.sql.perf.tpcds._\n",
"\n",
"val scaleFactor = \"{scaleFactor}\" // scaleFactor defines the size of the dataset to generate (in GB).\n",
"val numPartitions = {numPartitions} // how many dsdgen partitions to run - number of input tasks.\n",
"\n",
"val format = \"{dataformat}\" // valid spark format like parquet \"parquet\".\n",
"val rootDir = \"{rootDir}\" // root directory of location to create data in.\n",
"val dsdgenDir = \"/home/{user}/tpcds-kit/tools/\" // location of dbgen\n",
"\n",
"val tables = new TPCDSTables(spark.sqlContext,\n",
" dsdgenDir = dsdgenDir,\n",
" scaleFactor = scaleFactor,\n",
" useDoubleForDecimal = false, // true to replace DecimalType with DoubleType\n",
" useStringForDate = false) // true to replace DateType with StringType\n",
"\n",
"\n",
"tables.genData(\n",
" location = rootDir,\n",
" format = format,\n",
" overwrite = true, // overwrite the data that is already there\n",
" partitionTables = true, // create the partitioned fact tables\n",
" clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.\n",
" filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value\n",
" tableFilter = \"\", // \"\" means generate all tables\n",
" numPartitions = numPartitions) // how many dsdgen partitions to run - number of input tasks.\n",
"'''\n",
"\n",
"with open(\"tpcds_datagen_parquet.scala\",\"w\") as f:\n",
" f.writelines(scala)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"executor_cores = 8\n",
"num_executors=cpu_num/executor_cores\n",
"executor_memory = (totalmemory - 10*1024*1024)/num_executors - 1*1024*1024\n",
"\n",
"# Verify parameters\n",
"print(f'--master {master}')\n",
"print(f'--num-executors {int(num_executors)}')\n",
"print(f'--executor-cores {int(executor_cores)}')\n",
"print(f'--executor-memory {int(executor_memory)}k')\n",
"print(f'--conf spark.sql.shuffle.partitions={numPartitions}')\n",
"print(f'--conf spark.sql.parquet.compression.codec={dataSourceCodec}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"tpcds_datagen_parquet=f'''\n",
"cat tpcds_datagen_parquet.scala | {os.environ['SPARK_HOME']}/bin/spark-shell \\\n",
" --master {master} \\\n",
" --name tpcds_gen_parquet \\\n",
" --driver-memory 10g \\\n",
" --num-executors {int(num_executors)} \\\n",
" --executor-cores {int(executor_cores)} \\\n",
" --executor-memory {int(executor_memory)}k \\\n",
" --conf spark.executor.memoryOverhead=1g \\\n",
" --conf spark.sql.broadcastTimeout=4800 \\\n",
" --conf spark.driver.maxResultSize=4g \\\n",
" --conf spark.sql.shuffle.partitions={numPartitions} \\\n",
" --conf spark.sql.parquet.compression.codec={dataSourceCodec} \\\n",
" --conf spark.network.timeout=800s \\\n",
" --conf spark.executor.heartbeatInterval=200s \\\n",
" --jars /home/{user}/ipython/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \\\n",
"'''\n",
"\n",
"with open(\"tpcds_datagen_parquet.sh\",\"w\") as f:\n",
" f.writelines(tpcds_datagen_parquet)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!nohup bash tpcds_datagen_parquet.sh"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"# Set up perf analysis tools (optional)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"We have a set of perf analysis scripts under $GLUTEN_HOME/tools/workload/benchmark_velox/analysis. You can follow below steps to deploy the scripts on the same cluster and use them for performance analysis after each run."
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"source": [
"## Install and deploy Trace-Viewer"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Clone the master branch of project catapult:\n",
"```\n",
"cd ~\n",
"git clone https://github.com/catapult-project/catapult.git -b master\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Trace-Viewer requires python version 2.7. Create a virtualenv for python2.7:\n",
"```\n",
"sudo apt install -y python2.7\n",
"virtualenv -p /usr/bin/python2.7 py27-env\n",
"source py27-env/bin/activate\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Apply patch:\n",
"\n",
"```\n",
"cd catapult\n",
"```\n",
"```\n",
"git apply <<EOF\n",
"diff --git a/catapult_build/dev_server.py b/catapult_build/dev_server.py\n",
"index a8b25c485..ec9638c7b 100644\n",
"--- a/catapult_build/dev_server.py\n",
"+++ b/catapult_build/dev_server.py\n",
"@@ -328,11 +328,11 @@ def Main(argv):\n",
"\n",
" app = DevServerApp(pds, args=args)\n",
"\n",
"- server = httpserver.serve(app, host='127.0.0.1', port=args.port,\n",
"+ server = httpserver.serve(app, host='0.0.0.0', port=args.port,\n",
" start_loop=False, daemon_threads=True)\n",
" _AddPleaseExitMixinToServer(server)\n",
" # pylint: disable=no-member\n",
"- server.urlbase = 'http://127.0.0.1:%i' % server.server_port\n",
"+ server.urlbase = 'http://0.0.0.0:%i' % server.server_port\n",
" app.server = server\n",
"\n",
" sys.stderr.write('Now running on %s\\n' % server.urlbase)\n",
"EOF\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Start the service:\n",
"\n",
"```\n",
"mkdir -p ~/trace_result\n",
"cd ~/catapult && nohup ./bin/run_dev_server --no-install-hooks -d ~/trace_result -p1088 &\n",
"```"
]
},
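{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"With the patch applied, the server listens on all interfaces, so the Trace-Viewer UI should be reachable at `http://<hostname>:1088` (the port passed via `-p`), serving the trace files placed under `~/trace_result`."
]
},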
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"source": [
"## Deploy perf analysis scripts"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Create a virtualenv to run the perf analaysis scripts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"%cd ~"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!virtualenv -p python3 -v paus-env\n",
"!source ~/paus-env/bin/activate && python3 -m pip install -r ~/gluten/tools/workload/benchmark_velox/analysis/requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"We will put all perf analysis notebooks under `$HOME/PAUS`. Create the directory and start the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!mkdir -p ~/PAUS && cd ~/PAUS && source ~/paus-env/bin/activate && nohup jupyter notebook --ip=0.0.0.0 --port=8889 &"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Package the virtual environment so that it can be distributed to other nodes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"!cd ~ && tar -czf paus-env.tar.gz paus-env"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"source": [
"Distribute to the worker nodes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true
},
"outputs": [],
"source": [
"for l in clients:\n",
" !scp ~/paus-env.tar.gz {l}:~/\n",
" !ssh {l} tar -zxf paus-env.tar.gz"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"nbTranslate": {
"displayLangs": [
"*"
],
"hotkey": "alt-t",
"langInMainMenu": true,
"sourceLang": "en",
"targetLang": "fr",
"useGoogleTranslate": true
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}