tools/workload/benchmark_velox/initialize.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# System Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. Install system dependencies and python packages. Prepare the environment.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, install all dependencies and python packages as `root`. Run commands and make sure the installations are successful.\n", "\n", "```bash\n", "apt update\n", "\n", "apt install -y sudo locales wget tar tzdata git ccache cmake ninja-build build-essential llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven vim pip sysstat gcc-9 libjemalloc-dev nvme-cli curl zip unzip bison flex linux-tools-common linux-tools-generic linux-tools-`uname -r` mailutils\n", "\n", "python3 -m pip install notebook==6.5.2\n", "python3 -m pip install jupyter_server==1.23.4\n", "python3 -m pip install jupyter_highlight_selected_word\n", "python3 -m pip install jupyter_contrib_nbextensions\n", "python3 -m pip install virtualenv==20.21.1\n", "python3 -m pip uninstall -y ipython\n", "python3 -m pip install ipython==8.21.0\n", "python3 -m pip uninstall -y traitlets\n", "python3 -m pip install traitlets==5.9.0\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Required for Ubuntu***\n", "\n", "Check that there isn't an entry for your hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts (Ubuntu is notorious for this). If there is, delete it.\n", "Then add `<ip>` and `<hostname>` for master and worker nodes.\n", "\n", "Example /etc/hosts:\n", " \n", "```\n", "127.0.0.1 localhost\n", "\n", "# The following lines are desirable for IPv6 capable hosts\n", "::1 ip6-localhost ip6-loopback\n", "fe00::0 ip6-localnet\n", "ff00::0 ip6-mcastprefix\n", "ff02::1 ip6-allnodes\n", "ff02::2 ip6-allrouters\n", "\n", "10.0.0.117 sr217\n", "10.0.0.113 sr213\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Format and mount disks**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a python virtual environment to finish the system setup process:\n", "\n", "```bash\n", "virtualenv -p python3 -v venv\n", "source venv/bin/activate\n", "```\n", "\n", "And install packages under `venv`:\n", "```bash\n", "(venv) python3 -m pip install questionary\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run script [init_disks.py](./init_disks.py) to format and mount disks. **Be careful when choosing the disks to format.** If you see errors like `device or resource busy`, perhaps the partition has been mounted, you should unmount it first. If you still see this error, reboot the system and try again." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exit `venv`:\n", "```bash\n", "(venv) deactivate\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. Create user `sparkuser`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create user `sparkuser` without password and with sudo priviledge. It's recommended to use one of the disks as the home directory instead of the system drive.\n", "\n", "```bash\n", "mkdir -p /data0/home/sparkuser\n", "ln -s /data0/home/sparkuser /home/sparkuser\n", "cp -r /etc/skel/. 
/home/sparkuser/\n", "adduser --home /home/sparkuser --disabled-password --gecos \"\" sparkuser\n", "\n", "chown -R sparkuser:sparkuser /data*\n", "\n", "echo 'sparkuser ALL=(ALL:ALL) NOPASSWD:ALL' | EDITOR='tee -a' visudo\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate ssh keys for `sparkuser`\n", "\n", "```bashrc\n", "su - sparkuser\n", "```\n", "\n", "```bashrc\n", "rm -rf ~/.ssh\n", "ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1\n", "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys\n", "\n", "exit\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate ssh keys for `root`, and enable no password ssh from `sparkuser`\n", "\n", "```bash\n", "rm -rf /root/.ssh\n", "ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1\n", "cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys\n", "cat /home/sparkuser/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Login to `sparkuser` and run the first-time ssh to the `root`\n", "\n", "```bash\n", "su - sparkuser\n", "```\n", "\n", "```bash\n", "ssh -o StrictHostKeyChecking=no root@localhost ls\n", "ssh -o StrictHostKeyChecking=no root@127.0.0.1 ls\n", "ssh -o StrictHostKeyChecking=no root@`hostname` ls\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Required for Ubuntu***\n", "\n", "Run below command to comment out lines starting from `If not running interactively, don't do anything` in ~/.bashrc\n", "\n", "```bash\n", "sed -i '5,9 s/^/# /' ~/.bashrc\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Configure jupyter notebook**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As `sparkuser`, install python packages\n", "\n", "```bash\n", "cd /home/sparkuser/.local/lib/ && rm -rf python*\n", "\n", "python3 -m pip install --upgrade jsonschema\n", "python3 -m pip install jsonschema[format]\n", "python3 -m pip install sqlalchemy==1.4.46\n", "python3 -m pip install papermill Black\n", "python3 -m pip install NotebookScripter\n", "python3 -m pip install findspark spylon-kernel matplotlib pandasql pyhdfs\n", "python3 -m pip install ipywidgets jupyter_nbextensions_configurator ipyparallel\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure jupyter notebook. 
Setup password when it prompts\n", "\n", "```bash\n", "jupyter notebook --generate-config\n", "\n", "jupyter notebook password\n", "\n", "mkdir -p ~/.jupyter/custom/\n", "\n", "echo '.container { width:100% !important; }' >> ~/.jupyter/custom/custom.css\n", "\n", "echo 'div.output_stderr { background: #ffdd; display: none; }' >> ~/.jupyter/custom/custom.css\n", "\n", "jupyter nbextension install --py jupyter_highlight_selected_word --user\n", "\n", "jupyter nbextension enable highlight_selected_word/main\n", "\n", "jupyter nbextension install --py widgetsnbextension --user\n", "\n", "jupyter contrib nbextension install --user\n", "\n", "jupyter nbextension enable codefolding/main\n", "\n", "jupyter nbextension enable code_prettify/code_prettify\n", "\n", "jupyter nbextension enable codefolding/edit\n", "\n", "jupyter nbextension enable code_font_size/code_font_size\n", "\n", "jupyter nbextension enable collapsible_headings/main\n", "\n", "jupyter nbextension enable highlight_selected_word/main\n", "\n", "jupyter nbextension enable ipyparallel/main\n", "\n", "jupyter nbextension enable move_selected_cells/main\n", "\n", "jupyter nbextension enable nbTranslate/main\n", "\n", "jupyter nbextension enable scratchpad/main\n", "\n", "jupyter nbextension enable tree-filter/index\n", "\n", "jupyter nbextension enable comment-uncomment/main\n", "\n", "jupyter nbextension enable export_embedded/main\n", "\n", "jupyter nbextension enable hide_header/main\n", "\n", "jupyter nbextension enable highlighter/highlighter\n", "\n", "jupyter nbextension enable scroll_down/main\n", "\n", "jupyter nbextension enable snippets/main\n", "\n", "jupyter nbextension enable toc2/main\n", "\n", "jupyter nbextension enable varInspector/main\n", "\n", "jupyter nbextension enable codefolding/edit\n", "\n", "jupyter nbextension enable contrib_nbextensions_help_item/main\n", "\n", "jupyter nbextension enable freeze/main\n", "\n", "jupyter nbextension enable hide_input/main\n", "\n", "jupyter nbextension enable jupyter-js-widgets/extension\n", "\n", "jupyter nbextension enable snippets_menu/main\n", "\n", "jupyter nbextension enable table_beautifier/main\n", "\n", "jupyter nbextension enable hide_input_all/main\n", "\n", "jupyter nbextension enable spellchecker/main\n", "\n", "jupyter nbextension enable toggle_all_line_numbers/main\n", "\n", "jupyter nbextensions_configurator enable --user\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clone Gluten\n", "\n", "```bash\n", "cd ~\n", "git clone https://github.com/apache/incubator-gluten.git gluten\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start jupyter notebook\n", "\n", "```bash\n", "mkdir -p ~/ipython\n", "cd ~/ipython\n", "\n", "nohup jupyter notebook --ip=0.0.0.0 --port=8888 &\n", "\n", "find ~/gluten/tools/workload/benchmark_velox/ -maxdepth 1 -type f -exec cp {} ~/ipython \\;\n", "```" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Initialize\n", "<font color=red size=3> Run this section after notebook restart! </font>" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Specify datadir. The directories are used for spark.local.dirs and hadoop namenode/datanode." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "datadir=[f'/data{i}' for i in range(0, 8)]\n", "datadir" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Specify clients(workers). 
Leave it empty if the cluster is setup on the local machine." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "clients=''''''.split()\n", "print(clients)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Specify JAVA_HOME" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "java_home = '/usr/lib/jvm/java-8-openjdk-amd64'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "import os\n", "import socket\n", "import platform\n", "\n", "user=os.getenv('USER')\n", "print(f\"user: {user}\")\n", "print()\n", "\n", "masterip=socket.gethostbyname(socket.gethostname())\n", "hostname=socket.gethostname() \n", "print(f\"masterip: {masterip} hostname: {hostname}\")\n", "print()\n", "\n", "hclients=clients.copy()\n", "hclients.append(hostname)\n", "print(f\"master and workers: {hclients}\")\n", "print()\n", "\n", "\n", "if clients:\n", " cmd = f\"ssh {clients[0]} \" + \"\\\"lscpu | grep '^CPU(s)'\\\"\" + \" | awk '{print $2}'\"\n", " client_cpu = !{cmd}\n", " cpu_num = client_cpu[0]\n", "\n", " cmd = f\"ssh {clients[0]} \" + \"\\\"cat /proc/meminfo | grep MemTotal\\\"\" + \" | awk '{print $2}'\"\n", " totalmemory = !{cmd}\n", " totalmemory = int(totalmemory[0])\n", "else:\n", " cpu_num = os.cpu_count()\n", " totalmemory = !cat /proc/meminfo | grep MemTotal | awk '{print $2}'\n", " totalmemory = int(totalmemory[0])\n", " \n", "print(f\"cpu_num: {cpu_num}\")\n", "print()\n", "\n", "print(\"total memory: \", totalmemory, \"KB\")\n", "print()\n", "\n", "mem_mib = int(totalmemory/1024)-1024\n", "print(f\"mem_mib: {mem_mib}\")\n", "print()\n", "\n", "is_arm = platform.machine() == 'aarch64'\n", "print(\"is_arm: \",is_arm)\n", "print()\n", "\n", "sparklocals=\",\".join([f'{l}/{user}/yarn/local' for l in datadir])\n", "print(f\"SPARK_LOCAL_DIR={sparklocals}\")\n", "print()\n", "\n", "%cd ~" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Set up clients\n", "<font color=red size=3> SKIP for single node </font>" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Install dependencies" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Manually configure ssh login without password to all clients\n", "\n", "```bash\n", "ssh-copy-id -o StrictHostKeyChecking=no root@<client ip>\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh root@{l} apt update > /dev/null 2>&1\n", " !ssh root@{l} apt install -y sudo locales wget tar tzdata git ccache cmake ninja-build build-essential llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven vim pip sysstat gcc-9 libjemalloc-dev nvme-cli curl zip unzip bison flex linux-tools-common linux-tools-generic linux-tools-`uname -r` > /dev/null 2>&1" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Create user" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh -o StrictHostKeyChecking=no root@{l} ls" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh root@{l} adduser --disabled-password --gecos 
'\"\"' {user}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh root@{l} cp -r .ssh /home/{user}/\n", " !ssh root@{l} chown -R {user}:{user} /home/{user}/.ssh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh root@{l} \"echo -e 'sparkuser ALL=(ALL:ALL) NOPASSWD:ALL' | EDITOR='tee -a' visudo\"" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "***Required for Ubuntu***\n", "\n", "Run below command to comment out lines starting from If not running interactively, don't do anything in ~/.bashrc" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh {l} sed -i \"'5,9 s/^/# /'\" ~/.bashrc" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Use /etc/hosts on master node" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !scp /etc/hosts root@{l}:/etc/hosts" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Setup disks" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !ssh root@{l} apt update > /dev/null 2>&1\n", " !ssh root@{l} apt install -y pip > /dev/null 2>&1\n", " !ssh root@{l} python3 -m pip install virtualenv" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Manually run **2. Format and mount disks** section under [System Setup](#System-Setup)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Configure Spark, Hadoop" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Download packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz > /dev/null 2>&1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz > /dev/null 2>&1\n", "# backup url: !wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz > /dev/null 2>&1\n", "if is_arm:\n", " # download both versions\n", " !wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5-aarch64.tar.gz > /dev/null 2>&1" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Create directories" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "cmd=\";\".join([f\"chown -R {user}:{user} \" + l for l in datadir])\n", "for l in hclients:\n", " !ssh root@{l} '{cmd}'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "cmd=\";\".join([f\"rm -rf {l}/tmp; mkdir -p {l}/tmp\" for l in datadir])\n", "for l in hclients:\n", " !ssh {l} '{cmd}'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "cmd=\";\".join([f\"mkdir -p {l}/{user}/hdfs/data; mkdir -p {l}/{user}/yarn/local\" for l in datadir])\n", "for l in hclients:\n", " !ssh {l} '{cmd}'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, 
"outputs": [], "source": [ "!mkdir -p {datadir[0]}/{user}/hdfs/name\n", "!mkdir -p {datadir[0]}/{user}/hdfs/namesecondary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !scp hadoop-3.2.4.tar.gz {l}:~/\n", " !scp spark-3.3.1-bin-hadoop3.tgz {l}:~/\n", " !ssh {l} \"mv -f hadoop hadoop.bak; mv -f spark spark.bak\"\n", " !ssh {l} \"tar zxvf hadoop-3.2.4.tar.gz > /dev/null 2>&1\"\n", " !ssh {l} \"tar -zxvf spark-3.3.1-bin-hadoop3.tgz > /dev/null 2>&1\"\n", " !ssh root@{l} \"apt install -y openjdk-8-jdk > /dev/null 2>&1\"\n", " !ssh {l} \"ln -s hadoop-3.2.4 hadoop; ln -s spark-3.3.1-bin-hadoop3 spark\"\n", " if is_arm:\n", " !ssh {l} \"tar zxvf hadoop-3.3.5-aarch64.tar.gz > /dev/null 2>&1\"\n", " !ssh {l} \"cd hadoop && mv lib lib.bak && cp -rf ~/hadoop-3.3.5/lib ~/hadoop\"" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Configure bashrc" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "\n", "cfg=f'''export HADOOP_HOME=~/hadoop\n", "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin\n", "\n", "export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop\n", "export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop\n", "\n", "export SPARK_HOME=~/spark\n", "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH\n", "export PATH=$SPARK_HOME/bin:$PATH\n", "\n", "'''" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "if is_arm:\n", " cfg += 'export CPU_TARGET=\"aarch64\"\\nexport JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64\\nexport PATH=$JAVA_HOME/bin:$PATH\\n'\n", "else:\n", " cfg += f'export JAVA_HOME={java_home}\\nexport PATH=$JAVA_HOME/bin:$PATH\\n'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "with open(\"tmpcfg\",'w') as f:\n", " f.writelines(cfg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !scp tmpcfg {l}:~/tmpcfg.in\n", " !ssh {l} \"cat ~/tmpcfg.in >> ~/.bashrc\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} tail -n10 ~/.bashrc" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Configure Hadoop" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh root@{l} \"apt install -y libiberty-dev libxml2-dev libkrb5-dev libgsasl7-dev libuuid1 uuid-dev > /dev/null 2>&1\"" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### setup short-circuit " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh root@{l} \"mkdir -p /var/lib/hadoop-hdfs/\"\n", " !ssh root@{l} 'chown {user}:{user} /var/lib/hadoop-hdfs/'" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### enable security.authorization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "coresite='''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n", "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n", "<!--\n", " Licensed under the Apache License, Version 2.0 (the \"License\");\n", " you may not use 
this file except in compliance with the License.\n", " You may obtain a copy of the License at\n", "\n", " http://www.apache.org/licenses/LICENSE-2.0\n", "\n", " Unless required by applicable law or agreed to in writing, software\n", " distributed under the License is distributed on an \"AS IS\" BASIS,\n", " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", " See the License for the specific language governing permissions and\n", " limitations under the License. See accompanying LICENSE file.\n", "-->\n", "\n", "<!-- Put site-specific property overrides in this file. -->\n", "\n", "<configuration>\n", " <property>\n", " <name>fs.default.name</name>\n", " <value>hdfs://{:s}:8020</value>\n", " <final>true</final>\n", " </property>\n", " <property>\n", " <name>hadoop.security.authentication</name>\n", " <value>simple</value>\n", " </property>\n", " <property>\n", " <name>hadoop.security.authorization</name>\n", " <value>true</value>\n", " </property>\n", "</configuration>\n", "'''.format(hostname)\n", "\n", "with open(f'/home/{user}/hadoop/etc/hadoop/core-site.xml','w') as f:\n", " f.writelines(coresite)\n", " \n", "for l in clients:\n", " !scp ~/hadoop/etc/hadoop/core-site.xml {l}:~/hadoop/etc/hadoop/core-site.xml >/dev/null 2>&1" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### set IP check, note the command <font color=red>\", \"</font>.join" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "hadooppolicy='''<?xml version=\"1.0\"?>\n", "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n", "<!--\n", "\n", " Licensed to the Apache Software Foundation (ASF) under one\n", " or more contributor license agreements. See the NOTICE file\n", " distributed with this work for additional information\n", " regarding copyright ownership. The ASF licenses this file\n", " to you under the Apache License, Version 2.0 (the\n", " \"License\"); you may not use this file except in compliance\n", " with the License. You may obtain a copy of the License at\n", "\n", " http://www.apache.org/licenses/LICENSE-2.0\n", "\n", " Unless required by applicable law or agreed to in writing, software\n", " distributed under the License is distributed on an \"AS IS\" BASIS,\n", " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", " See the License for the specific language governing permissions and\n", " limitations under the License.\n", "\n", "-->\n", "\n", "<!-- Put site-specific property overrides in this file. -->\n", "\n", "<configuration>\n", " <property>\n", " <name>security.service.authorization.default.hosts</name>\n", " <value>{:s}</value>\n", " </property>\n", " <property>\n", " <name>security.service.authorization.default.acl</name>\n", " <value>{:s} {:s}</value>\n", " </property>\n", " \n", " \n", " <property>\n", " <name>security.client.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ClientProtocol, which is used by user code\n", " via the DistributedFileSystem.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. 
\"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.client.datanode.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ClientDatanodeProtocol, the client-to-datanode protocol\n", " for block recovery.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.datanode.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for DatanodeProtocol, which is used by datanodes to\n", " communicate with the namenode.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.inter.datanode.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for InterDatanodeProtocol, the inter-datanode protocol\n", " for updating generation timestamp.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.namenode.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for NamenodeProtocol, the protocol used by the secondary\n", " namenode to communicate with the namenode.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.admin.operations.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for AdminOperationsProtocol. Used for admin commands.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.refresh.user.mappings.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for RefreshUserMappingsProtocol. Used to refresh\n", " users mappings. The ACL is a comma-separated list of user and\n", " group names. The user and group list is separated by a blank. For\n", " e.g. \"alice,bob users,wheel\". A special value of \"*\" means all\n", " users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.refresh.policy.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for RefreshAuthorizationPolicyProtocol, used by the\n", " dfsadmin and mradmin commands to refresh the security policy in-effect.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. 
\"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.ha.service.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for HAService protocol used by HAAdmin to manage the\n", " active and stand-by states of namenode.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.zkfc.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for access to the ZK Failover Controller\n", " </description>\n", " </property>\n", "\n", " <property>\n", " <name>security.qjournal.service.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for QJournalProtocol, used by the NN to communicate with\n", " JNs when using the QuorumJournalManager for edit logs.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.mrhs.client.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for HSClientProtocol, used by job clients to\n", " communciate with the MR History Server job status etc.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <!-- YARN Protocols -->\n", "\n", " <property>\n", " <name>security.resourcetracker.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ResourceTrackerProtocol, used by the\n", " ResourceManager and NodeManager to communicate with each other.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.resourcemanager-administration.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ResourceManagerAdministrationProtocol, for admin commands.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.applicationclient.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ApplicationClientProtocol, used by the ResourceManager\n", " and applications submission clients to communicate with each other.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.applicationmaster.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ApplicationMasterProtocol, used by the ResourceManager\n", " and ApplicationMasters to communicate with each other.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. 
\"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.containermanagement.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ContainerManagementProtocol protocol, used by the NodeManager\n", " and ApplicationMasters to communicate with each other.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.resourcelocalizer.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ResourceLocalizer protocol, used by the NodeManager\n", " and ResourceLocalizer to communicate with each other.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.job.task.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for TaskUmbilicalProtocol, used by the map and reduce\n", " tasks to communicate with the parent tasktracker.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.job.client.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for MRClientProtocol, used by job clients to\n", " communciate with the MR ApplicationMaster to query job status etc.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " <property>\n", " <name>security.applicationhistory.protocol.acl</name>\n", " <value>*</value>\n", " <description>ACL for ApplicationHistoryProtocol, used by the timeline\n", " server and the generic history service client to communicate with each other.\n", " The ACL is a comma-separated list of user and group names. The user and\n", " group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n", " A special value of \"*\" means all users are allowed.</description>\n", " </property>\n", "\n", " \n", " \n", " \n", " \n", "</configuration>\n", "'''.format((\",\").join(hclients),user,user)\n", "\n", "with open(f'/home/{user}/hadoop/etc/hadoop/hadoop-policy.xml','w') as f:\n", " f.writelines(hadooppolicy)\n", " \n", "for l in clients:\n", " !scp ~/hadoop/etc/hadoop/hadoop-policy.xml {l}:~/hadoop/etc/hadoop/hadoop-policy.xml >/dev/null 2>&1\n" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### hdfs config, set replication to 1 to cache all the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "code_folding": [], "hidden": true }, "outputs": [], "source": [ "hdfs_data=\",\".join([f'{l}/{user}/hdfs/data' for l in datadir])\n", "\n", "hdfs_site=f'''<?xml version=\"1.0\"?>\n", "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n", "\n", "<!-- Put site-specific property overrides in this file. 
-->\n", "\n", "<configuration>\n", " <property>\n", " <name>dfs.namenode.secondary.http-address</name>\n", " <value>{hostname}:50090</value>\n", " </property>\n", " <property>\n", " <name>dfs.namenode.name.dir</name>\n", " <value>{datadir[0]}/{user}/hdfs/name</value>\n", " <final>true</final>\n", " </property>\n", "\n", " <property>\n", " <name>dfs.datanode.data.dir</name>\n", " <value>{hdfs_data}</value>\n", " <final>true</final>\n", " </property>\n", "\n", " <property>\n", " <name>dfs.namenode.checkpoint.dir</name>\n", " <value>{datadir[0]}/{user}/hdfs/namesecondary</value>\n", " <final>true</final>\n", " </property>\n", " <property>\n", " <name>dfs.name.handler.count</name>\n", " <value>100</value>\n", " </property>\n", " <property>\n", " <name>dfs.blocksize</name>\n", " <value>128m</value>\n", "</property>\n", " <property>\n", " <name>dfs.replication</name>\n", " <value>1</value>\n", "</property>\n", "\n", "<property>\n", " <name>dfs.client.read.shortcircuit</name>\n", " <value>true</value>\n", "</property>\n", "\n", "<property>\n", " <name>dfs.domain.socket.path</name>\n", " <value>/var/lib/hadoop-hdfs/dn_socket</value>\n", "</property>\n", "\n", "</configuration>\n", "'''\n", "\n", "\n", "with open(f'/home/{user}/hadoop/etc/hadoop/hdfs-site.xml','w') as f:\n", " f.writelines(hdfs_site)\n", " \n", "for l in clients:\n", " !scp ~/hadoop/etc/hadoop/hdfs-site.xml {l}:~/hadoop/etc/hadoop/hdfs-site.xml >/dev/null 2>&1\n" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### mapreduce config" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "mapreduce='''<?xml version=\"1.0\"?>\n", "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n", "<!--\n", " Licensed under the Apache License, Version 2.0 (the \"License\");\n", " you may not use this file except in compliance with the License.\n", " You may obtain a copy of the License at\n", "\n", " http://www.apache.org/licenses/LICENSE-2.0\n", "\n", " Unless required by applicable law or agreed to in writing, software\n", " distributed under the License is distributed on an \"AS IS\" BASIS,\n", " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", " See the License for the specific language governing permissions and\n", " limitations under the License. See accompanying LICENSE file.\n", "-->\n", "\n", "<!-- Put site-specific property overrides in this file. 
-->\n", "\n", "<configuration>\n", " <property>\n", " <name>mapreduce.framework.name</name>\n", " <value>yarn</value>\n", " </property>\n", "\n", " <property>\n", " <name>mapreduce.job.maps</name>\n", " <value>288</value>\n", " </property>\n", " <property>\n", " <name>mapreduce.job.reduces</name>\n", " <value>64</value>\n", " </property>\n", "\n", " <property>\n", " <name>mapreduce.map.java.opts</name>\n", " <value>-Xmx5120M -DpreferIPv4Stack=true</value>\n", " </property>\n", " <property>\n", " <name>mapreduce.map.memory.mb</name>\n", " <value>6144</value>\n", " </property>\n", "\n", " <property>\n", " <name>mapreduce.reduce.java.opts</name>\n", " <value>-Xmx5120M -DpreferIPv4Stack=true</value>\n", " </property>\n", " <property>\n", " <name>mapreduce.reduce.memory.mb</name>\n", " <value>6144</value>\n", " </property>\n", " <property>\n", " <name>yarn.app.mapreduce.am.staging-dir</name>\n", " <value>/user</value>\n", " </property>\n", " <property>\n", " <name>mapreduce.task.io.sort.mb</name>\n", " <value>2000</value>\n", " </property>\n", " <property>\n", " <name>mapreduce.task.timeout</name>\n", " <value>3600000</value>\n", " </property>\n", "<!-- MapReduce Job History Server security configs -->\n", "<property>\n", " <name>mapreduce.jobhistory.address</name>\n", " <value>{:s}:10020</value>\n", "</property>\n", "\n", "</configuration>\n", "'''.format(hostname)\n", "\n", "\n", "with open(f'/home/{user}/hadoop/etc/hadoop/mapred-site.xml','w') as f:\n", " f.writelines(mapreduce)\n", " \n", "for l in clients:\n", " !scp ~/hadoop/etc/hadoop/mapred-site.xml {l}:~/hadoop/etc/hadoop/mapred-site.xml >/dev/null 2>&1\n" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### yarn config" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "code_folding": [], "hidden": true }, "outputs": [], "source": [ "yarn_site=f'''<?xml version=\"1.0\"?>\n", "<!--\n", " Licensed under the Apache License, Version 2.0 (the \"License\");\n", " you may not use this file except in compliance with the License.\n", " You may obtain a copy of the License at\n", "\n", " http://www.apache.org/licenses/LICENSE-2.0\n", "\n", " Unless required by applicable law or agreed to in writing, software\n", " distributed under the License is distributed on an \"AS IS\" BASIS,\n", " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", " See the License for the specific language governing permissions and\n", " limitations under the License. 
See accompanying LICENSE file.\n", "-->\n", "<configuration>\n", " <property>\n", " <name>yarn.resourcemanager.hostname</name>\n", " <value>{hostname}</value>\n", " </property>\n", " <property>\n", " <name>yarn.resourcemanager.address</name>\n", " <value>{hostname}:8032</value>\n", " </property>\n", " <property>\n", " <name>yarn.resourcemanager.webapp.address</name>\n", " <value>{hostname}:8088</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.resource.memory-mb</name>\n", " <value>{mem_mib}</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.resource.cpu-vcores</name>\n", " <value>{cpu_num}</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.pmem-check-enabled</name>\n", " <value>false</value>\n", " </property>\n", "\n", " <property>\n", " <name>yarn.nodemanager.vmem-check-enabled</name>\n", " <value>false</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.vmem-pmem-ratio</name>\n", " <value>4.1</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.aux-services</name>\n", " <value>mapreduce_shuffle,spark_shuffle</value>\n", " </property>\n", "\n", " <property>\n", " <name>yarn.scheduler.minimum-allocation-mb</name>\n", " <value>1024</value>\n", " </property>\n", " <property>\n", " <name>yarn.scheduler.maximum-allocation-mb</name>\n", " <value>{mem_mib}</value>\n", " </property>\n", " <property>\n", " <name>yarn.scheduler.minimum-allocation-vcores</name>\n", " <value>1</value>\n", " </property>\n", " <property>\n", " <name>yarn.scheduler.maximum-allocation-vcores</name>\n", " <value>{cpu_num}</value>\n", " </property>\n", "\n", " <property>\n", " <name>yarn.log-aggregation-enable</name>\n", " <value>false</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.log.retain-seconds</name>\n", " <value>36000</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.delete.debug-delay-sec</name>\n", " <value>3600</value>\n", " </property>\n", " <property>\n", " <name>yarn.log.server.url</name>\n", " <value>http://{hostname}:19888/jobhistory/logs/</value>\n", " </property>\n", "\n", " <property>\n", " <name>yarn.nodemanager.log-dirs</name>\n", " <value>/home/{user}/hadoop/logs/userlogs</value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.local-dirs</name>\n", " <value>{sparklocals}\n", " </value>\n", " </property>\n", " <property>\n", " <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>\n", " <value>org.apache.spark.network.yarn.YarnShuffleService</value>\n", " </property>\n", "</configuration>\n", "'''\n", "\n", "\n", "with open(f'/home/{user}/hadoop/etc/hadoop/yarn-site.xml','w') as f:\n", " f.writelines(yarn_site)\n", " \n", "for l in clients:\n", " !scp ~/hadoop/etc/hadoop/yarn-site.xml {l}:~/hadoop/etc/hadoop/yarn-site.xml >/dev/null 2>&1\n" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### hadoop-env" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "#config java home\n", "if is_arm:\n", " !echo \"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64\" >> ~/hadoop/etc/hadoop/hadoop-env.sh\n", "else:\n", " !echo \"export JAVA_HOME={java_home}\" >> ~/hadoop/etc/hadoop/hadoop-env.sh\n", "\n", "for l in clients:\n", " !scp hadoop/etc/hadoop/hadoop-env.sh {l}:~/hadoop/etc/hadoop/ >/dev/null 2>&1\n" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### workers config" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": { "hidden": true }, "outputs": [], "source": [ "if clients:\n", " with open(f'/home/{user}/hadoop/etc/hadoop/workers','w') as f:\n", " f.writelines(\"\\n\".join(clients))\n", " for l in clients:\n", " !scp hadoop/etc/hadoop/workers {l}:~/hadoop/etc/hadoop/ >/dev/null 2>&1\n", "else:\n", " !echo {hostname} > ~/hadoop/etc/hadoop/workers" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Copy jar from Spark for external shuffle service" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !scp spark/yarn/spark-3.3.1-yarn-shuffle.jar {l}:~/hadoop/share/hadoop/common/lib/" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Configure Spark" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "eventlog_dir=f'hdfs://{hostname}:8020/tmp/sparkEventLog'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "sparkconf=f'''\n", "spark.eventLog.enabled true\n", "spark.eventLog.dir {eventlog_dir}\n", "spark.history.fs.logDirectory {eventlog_dir}\n", "'''\n", "\n", "with open(f'/home/{user}/spark/conf/spark-defaults.conf','w+') as f:\n", " f.writelines(sparkconf)\n", " \n", "for l in clients:\n", " !scp ~/spark/conf/spark-defaults.conf {l}:~/spark/conf/spark-defaults.conf >/dev/null 2>&1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "sparkenv = f'export SPARK_LOCAL_DIRS={sparklocals}\\n'\n", "with open(f'/home/{user}/.bashrc', 'a+') as f:\n", " f.writelines(sparkenv)\n", "for l in clients:\n", " !scp ~/.bashrc {l}:~/.bashrc >/dev/null 2>&1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} tail -n10 ~/.bashrc" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Configure monitor & startups" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!cd ~\n", "!git clone https://github.com/trailofbits/tsc_freq_khz.git\n", "\n", "for l in clients:\n", " !scp -r tsc_freq_khz {l}:~/\n", "\n", "for l in hclients:\n", " !ssh {l} 'cd tsc_freq_khz && make && sudo insmod ./tsc_freq_khz.ko' >/dev/null 2>&1\n", " !ssh root@{l} 'dmesg | grep tsc_freq_khz'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "startup=f'''#!/bin/bash\n", "echo -1 > /proc/sys/kernel/perf_event_paranoid\n", "echo 0 > /proc/sys/kernel/kptr_restrict\n", "echo madvise >/sys/kernel/mm/transparent_hugepage/enabled\n", "echo 1 > /proc/sys/kernel/numa_balancing\n", "end=$(($(nproc) - 1))\n", "for i in $(seq 0 $end); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done\n", "for file in $(find /sys/devices/system/cpu/cpu*/power/energy_perf_bias); do echo \"0\" > $file; done\n", "\n", "if [ -d /home/{user}/sep_installed ]; then\n", " /home/{user}/sep_installed/sepdk/src/insmod-sep -g {user}\n", "fi\n", "'''\n", "\n", "with open('/tmp/tmpstartup', 'w') as f:\n", " f.writelines(startup)\n", "\n", "startup_service=f'''[Unit]\n", "Description=Configure Transparent Hugepage, Auto NUMA Balancing, CPU Freq Scaling Governor\n", "\n", "[Service]\n", "ExecStart=/usr/local/bin/mystartup.sh\n", "\n", "[Install]\n", "WantedBy=multi-user.target\n", 
"'''\n", "\n", "with open('/tmp/tmpstartup_service', 'w') as f:\n", " f.writelines(startup_service)\n", " \n", "for l in hclients:\n", " !scp /tmp/tmpstartup $l:/tmp/tmpstartup\n", " !scp /tmp/tmpstartup_service $l:/tmp/tmpstartup_service\n", " !ssh root@$l \"cat /tmp/tmpstartup > /usr/local/bin/mystartup.sh\"\n", " !ssh root@$l \"chmod +x /usr/local/bin/mystartup.sh\"\n", " !ssh root@$l \"cat /tmp/tmpstartup_service > /etc/systemd/system/mystartup.service\"\n", " !ssh $l \"sudo systemctl enable mystartup.service\"\n", " !ssh $l \"sudo systemctl start mystartup.service\"\n", " !ssh $l \"sudo systemctl status mystartup.service\"" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Install Emon" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "<font color=red size=3> Get the latest offline installer from [link](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html?operatingsystem=linux&linux-install-type=offline) </font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "offline_installer = 'https://registrationcenter-download.intel.com/akdlm/IRC_NAS/e7797b12-ce87-4df0-aa09-df4a272fc5d9/intel-vtune-2025.0.0.1130_offline.sh'\n", "for l in hclients:\n", " !ssh {l} \"wget {offline_installer} -q && chmod +x intel-vtune-2025.0.0.1130_offline.sh\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} \"sudo ./intel-vtune-2025.0.0.1130_offline.sh -a -c -s --eula accept\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} \"sudo chown -R {user}:{user} /opt/intel/oneapi/vtune/ && rm -f sep_installed && ln -s /opt/intel/oneapi/vtune/latest sep_installed\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} \"cd sep_installed/sepdk/src/; echo -e \\\"\\\\n\\\\n\\\\n\\\" | ./build-driver\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh root@{l} \"/home/{user}/sep_installed/sepdk/src/rmmod-sep && /home/{user}/sep_installed/sepdk/src/insmod-sep -g {user}\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} \"source /home/{user}/sep_installed/sep_vars.sh > /dev/null 2>&1; emon -v | head -n 1\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} 'echo \"source /home/{user}/sep_installed/sep_vars.sh > /dev/null 2>&1\" >> ~/.bashrc'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for c in hclients:\n", " !ssh {c} 'tail -n1 ~/.bashrc'" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "heading_collapsed": true, "hidden": true, "run_control": { "frozen": true } }, "source": [ "## Inspect CPU Freq & HT" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "hidden": true, "run_control": { "frozen": true } }, "outputs": [], "source": [ "if is_arm:\n", " t = r'''\n", " #include <stdlib.h>\n", 
" #include <stdio.h>\n", " #include <assert.h>\n", " #include <sys/time.h>\n", " #include <sched.h>\n", " #include <sys/shm.h>\n", " #include <errno.h>\n", " #include <sys/mman.h>\n", " #include <unistd.h> //used for parsing the command line arguments\n", " #include <fcntl.h> //used for opening the memory device file\n", " #include <math.h> //used for rounding functions\n", " #include <signal.h>\n", " #include <linux/types.h>\n", " #include <stdint.h>\n", "\n", " static inline uint64_t GetTickCount()\n", " {//Return ns counts\n", " struct timeval tp;\n", " gettimeofday(&tp,NULL);\n", " return tp.tv_sec*1000+tp.tv_usec/1000;\n", " }\n", "\n", " uint64_t CNT=CNT_DEF;\n", "\n", " int main()\n", " {\n", "\n", " uint64_t start, end;\n", " start=end=GetTickCount();\n", "\n", " asm volatile (\n", " \"1:\\n\"\n", " \"SUBS %0,%0,#1\\n\"\n", " \"bne 1b\\n\"\n", " ::\"r\"(CNT)\n", " );\n", "\n", " end=GetTickCount();\n", "\n", " printf(\" total time = %lu, freq = %lu \\n\", end-start, CNT/(end-start)/1000);\n", "\n", " return 0;\n", " }\n", " '''\n", "else:\n", " t=r'''\n", " #include <stdlib.h>\n", " #include <stdio.h>\n", " #include <assert.h>\n", " #include <sys/time.h>\n", " #include <sched.h>\n", " #include <sys/shm.h>\n", " #include <errno.h>\n", " #include <sys/mman.h>\n", " #include <unistd.h> //used for parsing the command line arguments\n", " #include <fcntl.h> //used for opening the memory device file\n", " #include <math.h> //used for rounding functions\n", " #include <signal.h>\n", " #include <linux/types.h>\n", " #include <stdint.h>\n", "\n", " static inline uint64_t GetTickCount()\n", " {//Return ns counts\n", " struct timeval tp;\n", " gettimeofday(&tp,NULL);\n", " return tp.tv_sec*1000+tp.tv_usec/1000;\n", " }\n", "\n", " uint64_t CNT=CNT_DEF;\n", "\n", " int main()\n", " {\n", "\n", " uint64_t start, end;\n", " start=end=GetTickCount();\n", "\n", " asm volatile (\n", " \"1:\\n\"\n", " \"dec %0\\n\"\n", " \"jnz 1b\\n\"\n", " ::\"r\"(CNT)\n", " );\n", "\n", " end=GetTickCount();\n", "\n", " printf(\" total time = %lu, freq = %lu \\n\", end-start, CNT/(end-start)/1000);\n", "\n", " return 0;\n", " }\n", " '''" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "hidden": true, "run_control": { "frozen": true } }, "outputs": [], "source": [ "%cd ~\n", "with open(\"t.c\", 'w') as f:\n", " f.writelines(t)\n", "!gcc -O3 -DCNT_DEF=10000000000LL -o t t.c; gcc -O3 -DCNT_DEF=1000000000000LL -o t.delay t.c;\n", "!for j in `seq 1 $(nproc)`; do echo -n $j; (for i in `seq 1 $j`; do taskset -c $i ./t.delay & done); sleep 1; ./t; killall t.delay; sleep 2; done" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# <font color=red>Shutdown Jupyter; source ~/.bashrc; reboot Jupyter; run section [Initialize](#Initialize)</font>" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Build gluten" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Install docker" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Instructions from https://docs.docker.com/engine/install/ubuntu/\n", "\n", "# Add Docker's official GPG key:\n", "!sudo -E apt-get update\n", "!sudo -E apt-get install ca-certificates curl\n", "!sudo -E install -m 0755 -d /etc/apt/keyrings\n", "!sudo -E curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc\n", "!sudo chmod a+r 
/etc/apt/keyrings/docker.asc\n", "\n", "# Add the repository to Apt sources:\n", "!echo \\\n", " \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \\\n", " $(. /etc/os-release && echo \"$VERSION_CODENAME\") stable\" | \\\n", " sudo -E tee /etc/apt/sources.list.d/docker.list > /dev/null\n", "!sudo -E apt-get update" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!sudo -E apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin >/dev/null 2>&1" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Configure docker proxy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "import os\n", "http_proxy=os.getenv('http_proxy')\n", "https_proxy=os.getenv('https_proxy')\n", "\n", "if http_proxy or https_proxy:\n", " !sudo mkdir -p /etc/systemd/system/docker.service.d\n", " with open('/tmp/http-proxy.conf', 'w') as f:\n", " s = '''\n", "[Service]\n", "{}\n", "{}\n", "'''.format(f'Environment=\"HTTP_PROXY={http_proxy}\"' if http_proxy else '', f'Environment=\"HTTPS_PROXY={https_proxy}\"' if https_proxy else '')\n", " f.writelines(s)\n", " !sudo cp /tmp/http-proxy.conf /etc/systemd/system/docker.service.d\n", " \n", " !ssh root@localhost \"mkdir -p /root/.docker\"\n", " with open(f'/tmp/config.json', 'w') as f:\n", " s = f'''\n", "{{\n", " \"proxies\": {{\n", " \"default\": {{\n", " \"httpProxy\": \"{http_proxy}\",\n", " \"httpsProxy\": \"{https_proxy}\",\n", " \"noProxy\": \"127.0.0.0/8\"\n", " }}\n", " }}\n", "}}\n", " '''\n", " f.writelines(s)\n", " !ssh root@localhost \"cp -f /tmp/config.json /root/.docker\"" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Configure maven proxy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!mkdir -p ~/.m2\n", "\n", "def get_proxy(proxy):\n", " pos0 = proxy.rfind('/')\n", " pos = proxy.rfind(':')\n", " host = http_proxy[pos0+1:pos]\n", " port = http_proxy[pos+1:]\n", " return host, port\n", "\n", "if http_proxy or https_proxy:\n", " with open(f\"/home/{user}/.m2/settings.xml\",\"w+\") as f:\n", " f.write('''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n", "<settings xmlns=\"http://maven.apache.org/SETTINGS/1.2.0\"\n", " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n", " xsi:schemaLocation=\"http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd\">\n", " <proxies>''')\n", " if http_proxy:\n", " host, port = get_proxy(http_proxy)\n", " f.write(f'''\n", " <proxy>\n", " <id>http_proxy</id>\n", " <active>true</active>\n", " <protocol>http</protocol>\n", " <host>{host}</host>\n", " <port>{port}</port>\n", " </proxy>''')\n", " if https_proxy:\n", " host, port = get_proxy(http_proxy)\n", " f.write(f'''\n", " <proxy>\n", " <id>https_proxy</id>\n", " <active>true</active>\n", " <protocol>https</protocol>\n", " <host>{host}</host>\n", " <port>{port}</port>\n", " </proxy>''')\n", " f.write('''\n", " </proxies>\n", "</settings>\n", "''')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!sudo systemctl daemon-reload" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!sudo systemctl restart docker.service" ] }, { "cell_type": "markdown", "metadata": { 
"hidden": true }, "source": [ "## Build gluten" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "%cd ~" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "import os\n", "if not os.path.exists('gluten'):\n", " !git clone https://github.com/apache/incubator-gluten.git gluten" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Build Arrow for the first time build.\n", "!sed -i 's/--build_arrow=OFF/--build_arrow=ON/' ~/gluten/dev/package-vcpkg.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "%cd ~/gluten/tools/workload/benchmark_velox" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!bash build_gluten.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !scp ~/gluten/package/target/gluten-velox-bundle-spark*.jar {l}:~/" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Generate data" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Build spark-sql-perf" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!echo \"deb https://repo.scala-sbt.org/scalasbt/debian all main\" | sudo tee /etc/apt/sources.list.d/sbt.list\n", "!echo \"deb https://repo.scala-sbt.org/scalasbt/debian /\" | sudo tee /etc/apt/sources.list.d/sbt_old.list\n", "!curl -sL \"https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823\" | sudo apt-key add\n", "!sudo -E apt-get update > /dev/null 2>&1\n", "!sudo -E apt-get install sbt > /dev/null 2>&1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "import os\n", "http_proxy=os.getenv('http_proxy')\n", "https_proxy=os.getenv('https_proxy')\n", "\n", "def get_proxy(proxy):\n", " pos0 = proxy.rfind('/')\n", " pos = proxy.rfind(':')\n", " host = http_proxy[pos0+1:pos]\n", " port = http_proxy[pos+1:]\n", " return host, port\n", "\n", "sbt_opts=''\n", "\n", "if http_proxy:\n", " host, port = get_proxy(http_proxy)\n", " sbt_opts = f'{sbt_opts} -Dhttp.proxyHost={host} -Dhttp.proxyPort={port}'\n", "if https_proxy:\n", " host, port = get_proxy(https_proxy)\n", " sbt_opts = f'{sbt_opts} -Dhttps.proxyHost={host} -Dhttps.proxyPort={port}'\n", " \n", "if sbt_opts:\n", " %env SBT_OPTS={sbt_opts}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!git clone https://github.com/databricks/spark-sql-perf.git ~/spark-sql-perf" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!cd ~/spark-sql-perf && sbt package" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!cp ~/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar ~/ipython/" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## Start Hadoop/Spark cluster, Spark history server" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!~/hadoop/bin/hadoop namenode -format" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": { "hidden": true }, "outputs": [], "source": [ "!~/hadoop/bin/hadoop datanode -format " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!~/hadoop/sbin/start-dfs.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!hadoop dfsadmin -safemode leave" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!hadoop fs -mkdir -p /tmp/sparkEventLog" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!cd ~/spark && sbin/start-history-server.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "import os\n", "\n", "master=''\n", "if clients:\n", " !~/hadoop/sbin/start-yarn.sh\n", " master='yarn'\n", "else:\n", " # If we run on single node, we use standalone mode\n", " !{os.environ['SPARK_HOME']}/sbin/stop-slave.sh\n", " !{os.environ['SPARK_HOME']}/sbin/stop-master.sh\n", " !{os.environ['SPARK_HOME']}/sbin/start-master.sh\n", " !{os.environ['SPARK_HOME']}/sbin/start-worker.sh spark://{hostname}:7077 -c {cpu_num}\n", " master=f'spark://{hostname}:7077'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!jps\n", "for l in clients:\n", " !ssh {l} jps" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## TPCH" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!rm -rf ~/tpch-dbgen\n", "!git clone https://github.com/databricks/tpch-dbgen.git ~/tpch-dbgen" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !scp -r ~/tpch-dbgen {l}:~/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} cd ~/tpch-dbgen && git checkout 0469309147b42abac8857fa61b4cf69a6d3128a8 && make clean && make OS=LINUX" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "%cd ~/gluten/tools/workload/tpch/gen_data/parquet_dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Suggest 2x cpu# partitions.\n", "scaleFactor = 1500\n", "numPartitions = 2*cpu_num if len(clients)==0 else len(clients)*2*cpu_num\n", "dataformat = \"parquet\" # data format of data source\n", "dataSourceCodec = \"snappy\"\n", "rootDir = f\"/tpch_sf{scaleFactor}_{dataformat}_{dataSourceCodec}\" # root directory of location to create data in." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Verify parameters\n", "print(f'scaleFactor = {scaleFactor}')\n", "print(f'numPartitions = {numPartitions}')\n", "print(f'dataformat = {dataformat}')\n", "print(f'rootDir = {rootDir}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "scala=f'''import com.databricks.spark.sql.perf.tpch._\n", "\n", "\n", "val scaleFactor = \"{scaleFactor}\" // scaleFactor defines the size of the dataset to generate (in GB).\n", "val numPartitions = {numPartitions} // how many dsdgen partitions to run - number of input tasks.\n", "\n", "val format = \"{dataformat}\" // valid spark format like parquet \"parquet\".\n", "val rootDir = \"{rootDir}\" // root directory of location to create data in.\n", "val dbgenDir = \"/home/{user}/tpch-dbgen\" // location of dbgen\n", "\n", "val tables = new TPCHTables(spark.sqlContext,\n", " dbgenDir = dbgenDir,\n", " scaleFactor = scaleFactor,\n", " useDoubleForDecimal = false, // true to replace DecimalType with DoubleType\n", " useStringForDate = false) // true to replace DateType with StringType\n", "\n", "\n", "tables.genData(\n", " location = rootDir,\n", " format = format,\n", " overwrite = true, // overwrite the data that is already there\n", " partitionTables = false, // do not create the partitioned fact tables\n", " clusterByPartitionColumns = false, // shuffle to get partitions coalesced into single files.\n", " filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value\n", " tableFilter = \"\", // \"\" means generate all tables\n", " numPartitions = numPartitions) // how many dsdgen partitions to run - number of input tasks.\n", "'''\n", "\n", "with open(\"tpch_datagen_parquet.scala\",\"w\") as f:\n", " f.writelines(scala)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "executor_cores = 8\n", "num_executors=cpu_num/executor_cores\n", "executor_memory = (totalmemory - 10*1024*1024)/num_executors - 1*1024*1024\n", "\n", "# Verify parameters\n", "print(f'--master {master}')\n", "print(f'--num-executors {int(num_executors)}')\n", "print(f'--executor-cores {int(executor_cores)}')\n", "print(f'--executor-memory {int(executor_memory)}k')\n", "print(f'--conf spark.sql.shuffle.partitions={numPartitions}')\n", "print(f'--conf spark.sql.parquet.compression.codec={dataSourceCodec}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "tpch_datagen_parquet=f'''\n", "cat tpch_datagen_parquet.scala | {os.environ['SPARK_HOME']}/bin/spark-shell \\\n", " --master {master} \\\n", " --name tpch_gen_parquet \\\n", " --driver-memory 10g \\\n", " --num-executors {int(num_executors)} \\\n", " --executor-cores {int(executor_cores)} \\\n", " --executor-memory {int(executor_memory)}k \\\n", " --conf spark.executor.memoryOverhead=1g \\\n", " --conf spark.sql.broadcastTimeout=4800 \\\n", " --conf spark.driver.maxResultSize=4g \\\n", " --conf spark.sql.shuffle.partitions={numPartitions} \\\n", " --conf spark.sql.parquet.compression.codec={dataSourceCodec} \\\n", " --conf spark.network.timeout=800s \\\n", " --conf spark.executor.heartbeatInterval=200s \\\n", " --jars /home/{user}/ipython/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \\\n", "'''\n", "\n", "with open(\"tpch_datagen_parquet.sh\",\"w\") as f:\n", " f.writelines(tpch_datagen_parquet)\n" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!nohup bash tpch_datagen_parquet.sh" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "## TPCDS" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!rm -rf ~/tpcds-kit\n", "!git clone https://github.com/databricks/tpcds-kit.git ~/tpcds-kit" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !scp -r ~/tpcds-kit {l}:~/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in hclients:\n", " !ssh {l} \"cd ~/tpcds-kit/tools && make clean && make OS=LINUX CC=gcc-9\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "%cd ~/gluten/tools/workload/tpcds/gen_data/parquet_dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Suggest 2x cpu# partitions\n", "scaleFactor = 1500\n", "numPartitions = 2*cpu_num if len(clients)==0 else len(clients)*2*cpu_num\n", "dataformat = \"parquet\" # data format of data source\n", "dataSourceCodec = \"snappy\"\n", "rootDir = f\"/tpcds_sf{scaleFactor}_{dataformat}_{dataSourceCodec}\" # root directory of location to create data in." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Verify parameters\n", "print(f'scaleFactor = {scaleFactor}')\n", "print(f'numPartitions = {numPartitions}')\n", "print(f'dataformat = {dataformat}')\n", "print(f'rootDir = {rootDir}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "scala=f'''import com.databricks.spark.sql.perf.tpcds._\n", "\n", "val scaleFactor = \"{scaleFactor}\" // scaleFactor defines the size of the dataset to generate (in GB).\n", "val numPartitions = {numPartitions} // how many dsdgen partitions to run - number of input tasks.\n", "\n", "val format = \"{dataformat}\" // valid spark format like parquet \"parquet\".\n", "val rootDir = \"{rootDir}\" // root directory of location to create data in.\n", "val dsdgenDir = \"/home/{user}/tpcds-kit/tools/\" // location of dbgen\n", "\n", "val tables = new TPCDSTables(spark.sqlContext,\n", " dsdgenDir = dsdgenDir,\n", " scaleFactor = scaleFactor,\n", " useDoubleForDecimal = false, // true to replace DecimalType with DoubleType\n", " useStringForDate = false) // true to replace DateType with StringType\n", "\n", "\n", "tables.genData(\n", " location = rootDir,\n", " format = format,\n", " overwrite = true, // overwrite the data that is already there\n", " partitionTables = true, // create the partitioned fact tables\n", " clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.\n", " filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value\n", " tableFilter = \"\", // \"\" means generate all tables\n", " numPartitions = numPartitions) // how many dsdgen partitions to run - number of input tasks.\n", "'''\n", "\n", "with open(\"tpcds_datagen_parquet.scala\",\"w\") as f:\n", " f.writelines(scala)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "executor_cores = 8\n", "num_executors=cpu_num/executor_cores\n", "executor_memory = (totalmemory - 
10*1024*1024)/num_executors - 1*1024*1024\n", "\n", "# Verify parameters\n", "print(f'--master {master}')\n", "print(f'--num-executors {int(num_executors)}')\n", "print(f'--executor-cores {int(executor_cores)}')\n", "print(f'--executor-memory {int(executor_memory)}k')\n", "print(f'--conf spark.sql.shuffle.partitions={numPartitions}')\n", "print(f'--conf spark.sql.parquet.compression.codec={dataSourceCodec}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "tpcds_datagen_parquet=f'''\n", "cat tpcds_datagen_parquet.scala | {os.environ['SPARK_HOME']}/bin/spark-shell \\\n", " --master {master} \\\n", " --name tpcds_gen_parquet \\\n", " --driver-memory 10g \\\n", " --num-executors {int(num_executors)} \\\n", " --executor-cores {int(executor_cores)} \\\n", " --executor-memory {int(executor_memory)}k \\\n", " --conf spark.executor.memoryOverhead=1g \\\n", " --conf spark.sql.broadcastTimeout=4800 \\\n", " --conf spark.driver.maxResultSize=4g \\\n", " --conf spark.sql.shuffle.partitions={numPartitions} \\\n", " --conf spark.sql.parquet.compression.codec={dataSourceCodec} \\\n", " --conf spark.network.timeout=800s \\\n", " --conf spark.executor.heartbeatInterval=200s \\\n", " --jars /home/{user}/ipython/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \\\n", "'''\n", "\n", "with open(\"tpcds_datagen_parquet.sh\",\"w\") as f:\n", " f.writelines(tpcds_datagen_parquet)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!nohup bash tpcds_datagen_parquet.sh" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Set up perf analysis tools (optional)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We have a set of perf analysis scripts under $GLUTEN_HOME/tools/workload/benchmark_velox/analysis. You can follow the steps below to deploy the scripts on the same cluster and use them for performance analysis after each run." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## Install and deploy Trace-Viewer" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Clone the master branch of the catapult project:\n", "```\n", "cd ~\n", "git clone https://github.com/catapult-project/catapult.git -b master\n", "```" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Trace-Viewer requires python version 2.7. 
Create a virtualenv for python2.7:\n", "```\n", "sudo apt install -y python2.7\n", "virtualenv -p /usr/bin/python2.7 py27-env\n", "source py27-env/bin/activate\n", "```" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Apply patch:\n", "\n", "```\n", "cd catapult\n", "```\n", "```\n", "git apply <<EOF\n", "diff --git a/catapult_build/dev_server.py b/catapult_build/dev_server.py\n", "index a8b25c485..ec9638c7b 100644\n", "--- a/catapult_build/dev_server.py\n", "+++ b/catapult_build/dev_server.py\n", "@@ -328,11 +328,11 @@ def Main(argv):\n", "\n", " app = DevServerApp(pds, args=args)\n", "\n", "- server = httpserver.serve(app, host='127.0.0.1', port=args.port,\n", "+ server = httpserver.serve(app, host='0.0.0.0', port=args.port,\n", " start_loop=False, daemon_threads=True)\n", " _AddPleaseExitMixinToServer(server)\n", " # pylint: disable=no-member\n", "- server.urlbase = 'http://127.0.0.1:%i' % server.server_port\n", "+ server.urlbase = 'http://0.0.0.0:%i' % server.server_port\n", " app.server = server\n", "\n", " sys.stderr.write('Now running on %s\\n' % server.urlbase)\n", "EOF\n", "```" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Start the service:\n", "\n", "```\n", "mkdir -p ~/trace_result\n", "cd ~/catapult && nohup ./bin/run_dev_server --no-install-hooks -d ~/trace_result -p1088 &\n", "```" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## Deploy perf analysis scripts" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Create a virtualenv to run the perf analysis scripts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "%cd ~" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!virtualenv -p python3 -v paus-env\n", "!source ~/paus-env/bin/activate && python3 -m pip install -r ~/gluten/tools/workload/benchmark_velox/analysis/requirements.txt" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We will put all perf analysis notebooks under `$HOME/PAUS`. 
Create the directory and start the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!mkdir -p ~/PAUS && cd ~/PAUS && source ~/paus-env/bin/activate && nohup jupyter notebook --ip=0.0.0.0 --port=8889 &" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Package the virtual environment so that it can be distributed to other nodes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "!cd ~ && tar -czf paus-env.tar.gz paus-env" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Distribute to the worker nodes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "for l in clients:\n", " !scp ~/paus-env.tar.gz {l}:~/\n", " !ssh {l} tar -zxf paus-env.tar.gz" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "nbTranslate": { "displayLangs": [ "*" ], "hotkey": "alt-t", "langInMainMenu": true, "sourceLang": "en", "targetLang": "fr", "useGoogleTranslate": true }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }