{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook demonstrates how client-side encryption with SageMaker Feature Store is done using the [AWS Encryption SDK library](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/introduction.html) to encrypt your data prior to ingesting it into your Online or Offline Feature Store. We first demonstrate how to encrypt your data using the AWS Encryption SDK library, and then show how to use [Amazon Athena](https://aws.amazon.com/athena/) to query for a subset of encrypted columns of features for model training.\n",
"\n",
"Currently, Feature Store supports encryption at rest and encryption in transit. With this notebook, we showcase an additional layer of security where your data is encrypted and then stored in your Feature Store. This notebook also covers the scenario where you want to query a subset of encrypted data using Amazon Athena for model training. This becomes particularly useful when you want to store encrypted data sets in a single Feature Store, and want to perform model training using only a subset of encrypted columns, forcing privacy over the remaining columns. \n",
"\n",
"If you are interested in server side encryption with Feature Store, see [Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html). \n",
"\n",
"For more information on the AWS Encryption library, see [AWS Encryption SDK library](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/introduction.html). \n",
"\n",
"For detailed information about Feature Store, see the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).\n",
"\n",
"### Overview\n",
"1. Set up\n",
"2. Load in and encrypt your data using AWS Encryption library (`aws-encryption-sdk`)\n",
"3. Create Feature Group and ingest your encrypted data into it\n",
"4. Query your encrypted data in your feature store using Amazon Athena\n",
"5. Decrypt the data you queried\n",
"\n",
"### Prerequisites\n",
"This notebook uses the Python SDK library for Feature Store, the AWS Encryption SDK library, `aws-encryption-sdk` and the `Python 3 (DataScience)` kernel. To use the`aws-encryption-sdk` library you will need to have an active KMS key that you created. If you do not have a KMS key, then you can create one by following the [KMS Policy Template](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html#KMS-Policy-Template) steps, or you can visit the [KMS section in the console](https://console.aws.amazon.com/kms/home) and follow the button prompts for creating a KMS key. This notebook works with SageMaker Studio, Jupyter, and JupyterLab. \n",
"\n",
"### Library Dependencies:\n",
"* `sagemaker>=2.0.0`\n",
"* `numpy`\n",
"* `pandas`\n",
"* `aws-encryption-sdk`\n",
"\n",
"### Data\n",
"This notebook uses a synthetic data set that has the following features: `customer_id`, `ssn` (social security number), `credit_score`, `age`, and aims to simulate a relaxed data set that has some important features that would be needed during the credit card approval process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install -q 'aws-encryption-sdk'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sagemaker_session = sagemaker.Session()\n",
"s3_bucket_name = sagemaker_session.default_bucket()\n",
"prefix = \"sagemaker-featurestore-demo\"\n",
"role = sagemaker.get_execution_role()\n",
"region = sagemaker_session.boto_region_name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instantiate an encryption SDK client and provide your KMS ARN key to the `StrictAwsKmsMasterKeyProvider` object. This will be needed for data encryption and decryption by the AWS Encryption SDK library. You will need to substitute your KMS Key ARN for `kms_key`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import aws_encryption_sdk\n",
"from aws_encryption_sdk.identifiers import CommitmentPolicy\n",
"\n",
"client = aws_encryption_sdk.EncryptionSDKClient(\n",
" commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT\n",
")\n",
"\n",
"kms_key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(\n",
" key_ids=[kms_key] ## Add your KMS key here\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load in your data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data = pd.read_csv(\"data/credit_card_approval_synthetic.csv\")"
]
},
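{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the bundled `data/credit_card_approval_synthetic.csv` file is not present in your environment, the optional cell below sketches a minimal stand-in with the same schema (`customer_id`, `SSN`, `credit_score`, `age`). The value ranges are illustrative assumptions, not properties of the original data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Optional: uncomment to generate a small synthetic stand-in data set.\n",
"# rng = np.random.default_rng(seed=42)\n",
"# n = 100\n",
"# credit_card_data = pd.DataFrame(\n",
"#     {\n",
"#         \"customer_id\": np.arange(10_000, 10_000 + n),\n",
"#         \"SSN\": rng.integers(100_000_000, 1_000_000_000, size=n),\n",
"#         \"credit_score\": rng.integers(300, 851, size=n),\n",
"#         \"age\": rng.integers(18, 91, size=n),\n",
"#     }\n",
"# )"
]
},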
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Client-Side Encryption Methods\n",
"\n",
"Below are some methods that use the Amazon Encryption SDK library for data encryption, and decryption. Note that the data type of the encryption is byte which we convert to an integer prior to storing it into Feature Store and do the reverse prior to decrypting. This is because Feature Store doesn't support byte format directly, thus why we convert the byte encryption to an integer. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def encrypt_data_frame(df, columns):\n",
" \"\"\"\n",
" Input:\n",
" df: A pandas Dataframe\n",
" columns: A list of column names.\n",
"\n",
" Encrypt the provided columns in df. This method assumes that column names provided in columns exist in df,\n",
" and uses the AWS Encryption SDK library.\n",
" \"\"\"\n",
" for col in columns:\n",
" buffer = []\n",
" for entry in np.array(df[col]):\n",
" entry = str(entry)\n",
" encrypted_entry, encryptor_header = client.encrypt(\n",
" source=entry, key_provider=kms_key_provider\n",
" )\n",
" buffer.append(encrypted_entry)\n",
" df[col] = buffer\n",
"\n",
"\n",
"def decrypt_data_frame(df, columns):\n",
" \"\"\"\n",
" Input:\n",
" df: A pandas Dataframe\n",
" columns: A list of column names.\n",
"\n",
" Decrypt the provided columns in df. This method assumes that column names provided in columns exist in df,\n",
" and uses the AWS Encryption SDK library.\n",
" \"\"\"\n",
" for col in columns:\n",
" buffer = []\n",
" for entry in np.array(df[col]):\n",
" decrypted_entry, decryptor_header = client.decrypt(\n",
" source=entry, key_provider=kms_key_provider\n",
" )\n",
" buffer.append(float(decrypted_entry))\n",
" df[col] = np.array(buffer)\n",
"\n",
"\n",
"def bytes_to_int(df, columns):\n",
" \"\"\"\n",
" Input:\n",
" df: A pandas Dataframe\n",
" columns: A list of column names.\n",
"\n",
" Convert the provided columns in df of type bytes to integers. This method assumes that column names provided\n",
" in columns exist in df and that the columns passed in are of type bytes.\n",
" \"\"\"\n",
" for col in columns:\n",
" for index, entry in enumerate(np.array(df[col])):\n",
" df[col][index] = int.from_bytes(entry, \"little\")\n",
"\n",
"\n",
"def int_to_bytes(df, columns):\n",
" \"\"\"\n",
" Input:\n",
" df: A pandas Dataframe\n",
" columns: A list of column names.\n",
"\n",
" Convert the provided columns in df of type integers to bytes. This method assumes that column names provided\n",
" in columns exist in df and that the columns passed in are of type integers.\n",
" \"\"\"\n",
" for col in columns:\n",
" buffer = []\n",
" for index, entry in enumerate(np.array(df[col])):\n",
" current = int(df[col][index])\n",
" current_bit_length = current.bit_length() + 1 # include the sign bit, 1\n",
" current_byte_length = (current_bit_length + 7) // 8\n",
" buffer.append(current.to_bytes(current_byte_length, \"little\"))\n",
" df[col] = pd.Series(buffer)"
]
},
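{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, purely illustrative sanity check of the conversion methods above, the cell below round-trips a small ciphertext-like byte string through `int.from_bytes` and `int.to_bytes`, using the same sign-bit padding as `int_to_bytes`. Note that the round trip relies on the most significant byte of the input being nonzero, since leading zero bits are not represented in the integer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Sanity check: the bytes <-> int conversion used above is reversible.\n",
"sample = b\"\\x2a\\x01\"  ## stand-in for a small piece of ciphertext\n",
"as_int = int.from_bytes(sample, \"little\")\n",
"bit_length = as_int.bit_length() + 1  ## include the sign bit\n",
"byte_length = (bit_length + 7) // 8\n",
"restored = as_int.to_bytes(byte_length, \"little\")\n",
"assert restored == sample\n",
"print(as_int, restored)"
]
},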
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Encrypt credit card data. Note that we treat `customer_id` as a primary key, and since it's encryption is unique we can encrypt it.\n",
"encrypt_data_frame(credit_card_data, [\"customer_id\", \"age\", \"SSN\", \"credit_score\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(credit_card_data.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Cast encryption of type bytes to an integer so it can be stored in Feature Store.\n",
"bytes_to_int(credit_card_data, [\"customer_id\", \"age\", \"SSN\", \"credit_score\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(credit_card_data.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def cast_object_to_string(data_frame):\n",
" \"\"\"\n",
" Input:\n",
" data_frame: A pandas Dataframe\n",
"\n",
" Cast all columns of data_frame of type object to type string.\n",
" \"\"\"\n",
" for label in data_frame.columns:\n",
" if data_frame.dtypes[label] == object:\n",
" data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n",
" return data_frame\n",
"\n",
"\n",
"credit_card_data = cast_object_to_string(credit_card_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(credit_card_data.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create your Feature Group and Ingest your encrypted data into it\n",
"\n",
"Below we start by appending the `EventTime` feature to your data to timestamp entries, then we load the feature definition, and instantiate the Feature Group object. Then lastly we ingest the data into your feature store. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from time import gmtime, strftime, sleep\n",
"\n",
"credit_card_feature_group_name = \"credit-card-feature-group-\" + strftime(\"%d-%H-%M-%S\", gmtime())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instantiate a FeatureGroup object for `credit_card_data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.feature_store.feature_group import FeatureGroup\n",
"\n",
"credit_card_feature_group = FeatureGroup(\n",
" name=credit_card_feature_group_name, sagemaker_session=sagemaker_session\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"current_time_sec = int(round(time.time()))\n",
"\n",
"## Recall customer_id is encrypted therefore unique, and so it can be used as a record identifier.\n",
"record_identifier_feature_name = \"customer_id\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Append the `EventTime` feature to your data frame. This parameter is required, and time stamps each data point."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data[\"EventTime\"] = pd.Series(\n",
" [current_time_sec] * len(credit_card_data), dtype=\"float64\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(credit_card_data.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_feature_group.load_feature_definitions(data_frame=credit_card_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_feature_group.create(\n",
" s3_uri=f\"s3://{s3_bucket_name}/{prefix}\",\n",
" record_identifier_name=record_identifier_feature_name,\n",
" event_time_feature_name=\"EventTime\",\n",
" role_arn=role,\n",
" enable_online_store=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time.sleep(60)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ingest your data into your feature group. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_feature_group.ingest(data_frame=credit_card_data, max_workers=3, wait=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time.sleep(30)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Continually check your offline store until your data is available in it. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s3_client = sagemaker_session.boto_session.client(\"s3\", region_name=region)\n",
"\n",
"credit_card_feature_group_s3_uri = (\n",
" credit_card_feature_group.describe()\n",
" .get(\"OfflineStoreConfig\")\n",
" .get(\"S3StorageConfig\")\n",
" .get(\"ResolvedOutputS3Uri\")\n",
")\n",
"\n",
"credit_card_feature_group_s3_prefix = credit_card_feature_group_s3_uri.replace(\n",
" f\"s3://{s3_bucket_name}/\", \"\"\n",
")\n",
"offline_store_contents = None\n",
"while offline_store_contents is None:\n",
" objects_in_bucket = s3_client.list_objects(\n",
" Bucket=s3_bucket_name, Prefix=credit_card_feature_group_s3_prefix\n",
" )\n",
" if \"Contents\" in objects_in_bucket and len(objects_in_bucket[\"Contents\"]) > 1:\n",
" offline_store_contents = objects_in_bucket[\"Contents\"]\n",
" else:\n",
" print(\"Waiting for data in offline store...\\n\")\n",
" time.sleep(60)\n",
"\n",
"print(\"Data available.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use Amazon Athena to Query your Encrypted Data in your Feature Store\n",
"Using Amazon Athena, we query columns `customer_id`, `age`, and `credit_score` from your offline feature store where your encrypted data is. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_query = credit_card_feature_group.athena_query()\n",
"\n",
"credit_card_table = credit_card_query.table_name\n",
"\n",
"query_credit_card_table = 'SELECT customer_id, age, credit_score FROM \"' + credit_card_table + '\"'\n",
"\n",
"print(\"Running \" + query_credit_card_table)\n",
"\n",
"# Run the Athena query\n",
"credit_card_query.run(\n",
" query_string=query_credit_card_table,\n",
" output_location=\"s3://\" + s3_bucket_name + \"/\" + prefix + \"/query_results/\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time.sleep(60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_dataset = credit_card_query.as_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(credit_card_dataset.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"int_to_bytes(credit_card_dataset, [\"customer_id\", \"age\", \"credit_score\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"decrypt_data_frame(credit_card_dataset, [\"customer_id\", \"age\", \"credit_score\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we queried a subset of encrypted features. From here you can now train a model on this new dataset while remaining privacy over other columns e.g., `ssn`."
]
},
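{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a purely illustrative sketch (this notebook does not include model training), you could fit a trivial baseline on the decrypted subset, for example a linear fit of `credit_score` against `age` using `numpy.polyfit`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Illustrative only: a linear baseline of credit_score as a function of age.\n",
"slope, intercept = np.polyfit(credit_card_dataset[\"age\"], credit_card_dataset[\"credit_score\"], 1)\n",
"print(slope, intercept)"
]
},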
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Clean Up Resources\n",
"Remove the Feature Group that was created. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_card_feature_group.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next Steps\n",
"\n",
"In this notebook we covered client-side encryption with Feature Store. If you are interested in understanding how server-side encryption is done with Feature Store, see [Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html). \n",
"\n",
"For more information on the AWS Encryption library, see [AWS Encryption SDK library](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/introduction.html). \n",
"\n",
"For detailed information about Feature Store, see the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}