{
"cells": [
{
"cell_type": "markdown",
"id": "0525ed64-b083-4fdb-8660-107454e36bbb",
"metadata": {},
"source": [
"# Available datasets\n",
"\n",
"The following GCS bucket contains the files for each available dataset\n",
"\n",
"| GCP_PROJECT | GCS bucket|\n",
"|-----------------------------|------------------------------------------------------------------------------------------------------------------------------|\n",
"|dataproc-workspaces-notebooks|[gs://dataproc-metastore-public-binaries](https://console.cloud.google.com/storage/browser/dataproc-metastore-public-binaries)|"
]
},
{
"cell_type": "markdown",
"id": "75d7f1bb-f5c4-45c9-b9f0-fdff21434a2b",
"metadata": {},
"source": [
"#### Examples reading datasets using Bigframes"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5dd79ea9-a1eb-497a-9b43-63301729a1e5",
"metadata": {
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"import bigframes.pandas as bpd"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19757bde-5d1e-4043-b2cb-4d7a8ac03a3c",
"metadata": {
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"# Datasets in public GCS bucket\n",
"gcs_df = bpd.read_csv('gs://dataproc-metastore-public-binaries/ai4i_2020_predictive_maintenance/')\n",
"# Datasets in public BigQuery datasets\n",
"bq_df = bpd.read_gbq('bigquery-public-data.faa.us_airports')"
]
},
{
"cell_type": "markdown",
"id": "95753adb-84c4-45d4-9621-ea2b60d2893d",
"metadata": {},
"source": [
"#### Examples reading datasets using PySpark"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1a72f1f-ba5a-4bdd-b4eb-707a7c325e25",
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a6c0da30-5433-49f2-81c8-0fba256f29cc",
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder \\\n",
" .appName(\"Dataproc Public Datasets Example\") \\\n",
" .enableHiveSupport() \\\n",
" .getOrCreate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a461755f-01ce-47eb-ab5d-a772998e8da1",
"metadata": {},
"outputs": [],
"source": [
"# Datasets in public GCS bucket\n",
"dataset_df_1 = spark.read.option(\"header\", True).csv(\"gs://dataproc-metastore-public-binaries/ai4i_2020_predictive_maintenance/\")\n",
"dataset_df_2 = spark.read.parquet(\"gs://dataproc-metastore-public-binaries/real_estate_sales/\")\n",
"dataset_df_3 = spark.read.format(\"binaryFile\").option(\"recursiveFileLookup\", \"true\").load(\"gs://dataproc-metastore-public-binaries/youtube_ucg/\")\n",
"# Datasets in public BigQuery datasets\n",
"movie_reviews = spark.read.format(\"bigquery\").option(\"table\", \"bigquery-public-data.imdb.reviews\").load()"
]
},
{
"cell_type": "markdown",
"id": "ea790685-ccce-416f-b439-b65b84d5e83f",
"metadata": {},
"source": [
"## Datasets"
]
},
{
"cell_type": "markdown",
"id": "5db18531-9649-457c-aabd-f85f098b3fc1",
"metadata": {},
"source": [
"[**cuad_v1**](https://www.atticusprojectai.org/cuad) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/cuad_v1/full_contract_pdf/\n",
"- Format: .pdf\n",
"- Metastore referece: public_datasets.cuad_v1\n",
"- Description: 510 PDF files of legal contracts\n",
"\n",
"| path| modificationTime| length| content|\n",
"|--------------------|--------------------|-------|--------------------|\n",
"|gs://dataproc-met...|2023-05-15 20:53:...|3683550|[25 50 44 46 2D 3...|\n",
"|gs://dataproc-met...|2023-05-15 20:53:...|2881262|[25 50 44 46 2D 3...|\n",
"|gs://dataproc-met...|2023-05-15 20:54:...|1778356|[25 50 44 46 2D 3...|\n",
"|gs://dataproc-met...|2023-05-15 20:53:...|1557129|[25 50 44 46 2D 3...|\n",
"|gs://dataproc-met...|2023-05-15 20:53:...|1452180|[25 50 44 46 2D 3...|"
]
},
{
"cell_type": "markdown",
"id": "ab4b6f6b-a090-4b47-b420-887dc31987e8",
"metadata": {},
"source": [
"[**winequality_red**](https://archive.ics.uci.edu/dataset/186/wine+quality) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/winequality_red/\n",
"- Format: .csv\n",
"- Metastore referece: public_datasets.winequality_red\n",
"- Description: 4898 Wine quality data of white vinho verde samples\n",
"\n",
"|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density| pH|sulphates|alcohol|quality|\n",
"|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|\n",
"| 7.0| 0.27| 0.36| 20.7| 0.045| 45.0| 170.0| 1.001| 3.0| 0.45| 8.8| 6|\n",
"| 6.3| 0.3| 0.34| 1.6| 0.049| 14.0| 132.0| 0.994| 3.3| 0.49| 9.5| 6|\n",
"| 8.1| 0.28| 0.4| 6.9| 0.05| 30.0| 97.0| 0.9951|3.26| 0.44| 10.1| 6|\n",
"| 7.2| 0.23| 0.32| 8.5| 0.058| 47.0| 186.0| 0.9956|3.19| 0.4| 9.9| 6|\n",
"| 7.2| 0.23| 0.32| 8.5| 0.058| 47.0| 186.0| 0.9956|3.19| 0.4| 9.9| 6|"
]
},
{
"cell_type": "markdown",
"id": "572591d5-a3ff-4bfa-b03d-6130838deda4",
"metadata": {},
"source": [
"[**winequality_white**](https://archive.ics.uci.edu/dataset/186/wine+quality) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/winequality_white/\n",
"- Format: .csv\n",
"- Metastore referece: public_datasets.winequality_white\n",
"- Description: 1599 Wine quality data of red vinho verde samples\n",
"\n",
"|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density| pH|sulphates|alcohol|quality|\n",
"|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|\n",
"| 7.0| 0.27| 0.36| 20.7| 0.045| 45.0| 170.0| 1.001| 3.0| 0.45| 8.8| 6|\n",
"| 6.3| 0.3| 0.34| 1.6| 0.049| 14.0| 132.0| 0.994| 3.3| 0.49| 9.5| 6|\n",
"| 8.1| 0.28| 0.4| 6.9| 0.05| 30.0| 97.0| 0.9951|3.26| 0.44| 10.1| 6|\n",
"| 7.2| 0.23| 0.32| 8.5| 0.058| 47.0| 186.0| 0.9956|3.19| 0.4| 9.9| 6|\n",
"| 7.2| 0.23| 0.32| 8.5| 0.058| 47.0| 186.0| 0.9956|3.19| 0.4| 9.9| 6|"
]
},
{
"cell_type": "markdown",
"id": "f2594958-bb7d-440c-9ef0-274670d2361c",
"metadata": {},
"source": [
"[**real_estate_sales**](https://catalog.data.gov/dataset/real-estate-sales-2001-2018) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/real_estate_sales/\n",
"- Format: .parquet\n",
"- Metastore referece: public_datasets.real_estate_sales\n",
"- Description: 997213 samples of real estate sales\n",
"\n",
"| serial_number | list_year | date_recorded | town | address | assessed_value | sale_amount | sales_ratio | property_type | residential_type | non_use_code | assessor_remarks | opm_remarks | longitude | latitude |\n",
"|---------------|-----------|---------------|---------|----------------------|----------------|-------------|-------------|---------------|------------------|----------------------|----------------------|----------------------|----------------------|--------------------|\n",
"| 200594 | 2020 | 2021-02-16 | Danbury | 8 HICKORY ST | 121600.0 | 146216.0 | 0.8316463 | Residential | Single Family | 25 - Other | I11192 | HOUSE HAS SETTLED... | -73.44696 | 40.41404 |\n",
"| 200562 | 2020 | 2021-02-03 | Danbury | 19 MILL RD | 263600.0 | 415000.0 | 0.6351807 | Residential | Single Family | 25 - Other | AFFORDABLE HOUSIN... | INCORRECT DATA PE... | -23.42596 | 21.41424 |\n",
"| 200968 | 2020 | 2021-05-24 | Danbury | 4A FLIRTATION DR | 205700.0 | 515000.0 | 0.3994175 | Residential | Single Family | 07 - Change in Pr... | B17008 | UPDATED KITCHEN P... | -73.55596 | 48.43564 |\n",
"| 200260 | 2020 | 2020-11-23 | Danbury | 32 COALPIT HILL R... | 84900.0 | 181778.0 | 0.4670532 | Residential | Condo | 25 - Other | J16087-4 | MULTIPLE UNIT SALE | -75.46666 | 75.56424 |\n",
"| 200262 | 2020 | 2020-11-23 | Danbury | 32 COALPIT HILL R... | 84900.0 | 181778.0 | 0.4670532 | Residential | Condo | 25 - Other | J16087-6 | MULTIPLE UNIT SALE | -13.44696 | 22.48524 |"
]
},
{
"cell_type": "markdown",
"id": "a616040a-e3d6-4207-ae69-780e1bfa7e88",
"metadata": {},
"source": [
"[**sms_spam_collection**](https://archive.ics.uci.edu/dataset/228/sms+spam+collection) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/sms_spam_collection/\n",
"- Format: .csv\n",
"- Metastore referece: public_datasets.sms_spam_collection\n",
"- Description: 5574 SMS messages (4827 ham and 747 spam)\n",
"\n",
"|label| text|\n",
"|-----|--------------------|\n",
"| ham|Go until jurong p...|\n",
"| ham|Ok lar... Joking ...|\n",
"| spam|Free entry in 2 a...|\n",
"| ham|U dun say so earl...|\n",
"| ham|Nah I don't think...|"
]
},
{
"cell_type": "markdown",
"id": "cd1179d7-3c59-494b-9423-9fa82b6d6b72",
"metadata": {},
"source": [
"[**us_customer_price_index_yearly**](https://data.bls.gov/timeseries/CUUR0000SA0L1E?output_view=pct_12mths) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/us_customer_price_index_yearly/\n",
"- Format: .csv\n",
"- Metastore referece: public_datasets.us_customer_price_index_yearly\n",
"- Description: Customer Price Indexes from 1913 to 2022\n",
"\n",
"|Year| CPI|\n",
"|----|----|\n",
"|1913| 9.9|\n",
"|1914|10.0|\n",
"|1915|10.1|\n",
"|1916|10.9|\n",
"|1917|12.8|"
]
},
{
"cell_type": "markdown",
"id": "82dd7a52-d429-4730-ae26-fa0c2f4fcb00",
"metadata": {},
"source": [
"[**ai4i_2020_predictive_maintenance**](https://archive.ics.uci.edu/dataset/601/ai4i+2020+predictive+maintenance+dataset) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/ai4i_2020_predictive_maintenance/\n",
"- Format: .csv\n",
"- Metastore referece: public_datasets.ai4i_2020_predictive_maintenance\n",
"- Description: 10000 data points stored as rows with 14 features in columns\n",
"\n",
"|udi|product_id|type|air_temperature_k|process_temperature_k|rotational_speed_rpm|torque_nm|tool_wear_min|machine_failure|twf|hdf|pwf|osf|rnf|\n",
"|---|----------|----|-----------------|---------------------|--------------------|---------|-------------|---------------|---|---|---|---|---|\n",
"| 1| M14860| M| 298.1| 308.6| 1551| 42.8| 0| 0| 0| 0| 0| 0| 0|\n",
"| 2| L47181| L| 298.2| 308.7| 1408| 46.3| 3| 0| 0| 0| 0| 0| 0|\n",
"| 3| L47182| L| 298.1| 308.5| 1498| 49.4| 5| 0| 0| 0| 0| 0| 0|\n",
"| 4| L47183| L| 298.2| 308.6| 1433| 39.5| 7| 0| 0| 0| 0| 0| 0|\n",
"| 5| L47184| L| 298.2| 308.7| 1408| 40.0| 9| 0| 0| 0| 0| 0| 0|"
]
},
{
"cell_type": "markdown",
"id": "cbe31c85-6721-4bb9-8452-ee23968b42e7",
"metadata": {},
"source": [
"[**stanford_online_products**](https://cvgl.stanford.edu/projects/lifted_struct/) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/stanford_online_products/\n",
"- Format: .jpg\n",
"- Metastore referece: public_datasets.stanford_online_products\n",
"- Description: 120k images of 23k classes of online products\n",
"\n",
"| path| modificationTime| length|\n",
"|--------------------|--------------------|-------|\n",
"|gs://dataproc-met...|2023-12-12 20:45:...|3051905|\n",
"|gs://dataproc-met...|2023-12-12 20:45:...|2998889|\n",
"|gs://dataproc-met...|2023-12-12 20:44:...|2646343|\n",
"|gs://dataproc-met...|2023-12-12 20:44:...|2605636|\n",
"|gs://dataproc-met...|2023-12-12 20:45:...|2457016|"
]
},
{
"cell_type": "markdown",
"id": "9721d338-9ee6-4ec1-b08a-a3c3c7e0af2b",
"metadata": {},
"source": [
"[**youtube_ucg**](https://media.withyoutube.com/) \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/youtube_ucg/\n",
"- Format: .mp4\n",
"- Metastore referece: public_datasets.youtube_ucg\n",
"- Description: 20 youtube videos\n",
"\n",
"| path| modificationTime| length|\n",
"|--------------------|--------------------|-------|\n",
"|gs://dataproc-met...|2024-01-04 20:08:...|5051568|\n",
"|gs://dataproc-met...|2024-01-04 20:08:...|4450939|\n",
"|gs://dataproc-met...|2024-01-04 20:08:...|2766749|\n",
"|gs://dataproc-met...|2024-01-04 20:08:...|2525019|\n",
"|gs://dataproc-met...|2024-01-04 20:08:...|2311945|\n",
"|gs://dataproc-met...|2024-01-04 20:08:...|1682359|"
]
},
{
"cell_type": "markdown",
"id": "af591426-391c-46f6-83ac-1dadc5d6f01b",
"metadata": {},
"source": [
"**ads_banners_images** \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/ads_banners_images/\n",
"- Format: .png\n",
"- Description: image files of public ads banners"
]
},
{
"cell_type": "markdown",
"id": "8a4d7d59-2f48-4bf1-a364-18a5419f6041",
"metadata": {
"tags": []
},
"source": [
"**[gas_sensors](https://archive.ics.uci.edu/dataset/487/gas+sensor+array+temperature+modulation)** \n",
"GCS bucket path: gs://dataproc-metastore-public-binaries/gas_sensors/\n",
"- Format: [delta](https://delta.io/)\n",
"- Description: Gas sensor array temperature modulation - A chemical detection platform composed of 14 temperature-modulated metal oxide semiconductor (MOX) gas sensors was exposed to dynamic mixtures of carbon monoxide (CO) and humid synthetic air in a gas chamber. \n",
"\n",
"| Times| COppm|Humidityrh|TemperatureC|FlowratemLmin|HeatervoltageV| R1MOhm| R2MOhm| R3MOhm| R4MOhm| R5MOhm| R6MOhm| R7MOhm| R8MOhm|R9MOhm|R10MOhm|R11MOhm|R12MOhm|R13MOhm|R14MOhm|\n",
"|------|------|----------|------------|-------------|--------------|-------|-------|-------|-------|-------|-------|-------|-------|------|-------|-------|-------|-------|-------|\n",
"|0.0000|0.0000| 32.7023| 9.9032| 240.8069| 0.8900| 0.0738| 0.1314| 0.0987| 0.0936| 0.1026| 0.1152| 0.1105| 0.0891|0.0951| 0.1083| 0.1037| 0.1009| 0.0927| 0.1009|\n",
"|0.3070|0.0000| 45.5699| 13.7998| 240.8029| 0.8950| 0.0786| 0.1286| 0.1019| 0.0932| 0.1051| 0.1129| 0.1128| 0.0905|0.0958| 0.1103| 0.1043| 0.1025| 0.0942| 0.1020|\n",
"|0.6170|0.0000| 58.5539| 17.7318| 240.7989| 0.8979| 0.0816| 0.1287| 0.1052| 0.0940| 0.1075| 0.1128| 0.1149| 0.0918|0.0966| 0.1119| 0.1050| 0.1036| 0.0958| 0.1028|\n",
"|0.9240|0.0000| 71.4123| 21.6256| 240.7949| 0.8971| 0.0834| 0.1303| 0.1083| 0.0954| 0.1100| 0.1138| 0.1172| 0.0928|0.0972| 0.1130| 0.1054| 0.1044| 0.0968| 0.1033|\n",
"|1.2340|0.0000| 83.8100| 25.3800| 240.7917| 0.8980| 0.0851| 0.1324| 0.1112| 0.0967| 0.1120| 0.1151| 0.1194| 0.0936|0.0976| 0.1138| 0.1057| 0.1049| 0.0975| 0.1038|\n",
"|1.5410|0.0000| 83.8100| 25.3800| 240.8061| 0.8986| 0.0867| 0.1343| 0.1138| 0.0981| 0.1139| 0.1165| 0.1213| 0.0942|0.0980| 0.1144| 0.1058| 0.1053| 0.0980| 0.1041|"
]
},
{
"cell_type": "markdown",
"id": "82809505",
"metadata": {
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"**[tree_preservations](https://data.yorkopendata.org/dataset/7866ce4e-7b72-4a4c-b59f-869697d91029 )** \n",
"GCS bucket paths: \n",
"gs://dataproc-metastore-public-binaries/tree_preservations/csv/tree_preservations.csv \n",
"gs://dataproc-metastore-public-binaries/tree_preservations/kml/tree_preservations.kml \n",
"gs://dataproc-metastore-public-binaries/tree_preservations/geojson/tree_preservations.geojson \n",
"gs://dataproc-metastore-public-binaries/tree_preservations/shapefiles/\n",
"\n",
"- Format: .csv, .kml, .geojson, shapefiles\n",
"- Description: Tree data from City of York Council.\n",
"- Obs: 428 samples with non-null heights"
]
},
{
"cell_type": "markdown",
"id": "4e17d50f-2875-48fc-98cb-b315f05c8e10",
"metadata": {},
"source": [
"| index | X | Y | OBJECTID | TREEID | LOCATION | BOTANICAL | SPECIES | OWNER | SITE_NAME | TYPE | AGE | SPREAD | HEIGHT | TRUNK | Planted | \n",
"|----------|-----------|----------|---------------|--------------------|-----------------------|-----------------------|---------|--------------------------------------|-------------|-----|--------|--------|-------|---------|---|\n",
"| 0 | -1.145306 | 53.988087| 74123 | 1/1970-T41 | Sycamore | Acer Pseudoplatanus | Sycamore| Private | Nether & Upper Poppleton | Single tree | 0 | 0.0 | 0.0 | 0 | 0 |0 |\n",
"| 1 | -1.142676 | 53.987682| 74124 | 1/1970-T72 | Beech | Fagus Sylvatica | Beech | Private | Nether & Upper Poppleton | Single tree | 0 | 0.0 | 0.0 | 0 | 0 |0 |\n",
"| 2 | -0.965594 | 53.895187| 74125 | 1/1988-T15 | Horse Chestnut | Aesculus Hippocastanum| Horse Chestnut | Private | Main Street/Back Lane South, Wheldrake | Single tree | 0 | 0.0 | 0.0 | 0 | 0 |0 |\n",
"| 3 | -1.108463 | 53.902506| 74126 | 10/1988-T3 | Ash | Fraxinus Excelsior | Ash | Private | Darling Lane/Mill Lane, Acaster Malbis | Single tree | 0 | 0.0 | 0.0 | 0 | 0 |0 |\n",
"| 4 | -1.134858 | 53.999193| 74127 | 1973/107-T9 | Sycamore | Acer Pseudoplatanus | Sycamore| Private | The Vale, Skelton Village | Single tree | 0 | 0.0 | 0.0 | 0 | 0 |0 |"
]
},
{
"cell_type": "markdown",
"id": "9d34d728-6bd2-414d-9c39-dad38541bd4c",
"metadata": {},
"source": [
"**[jigsaw_multilingual_toxic_comment_classification](https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data)** \n",
"GCS bucket paths: \n",
"gs://dataproc-metastore-public-binaries/jigsaw/jigsaw-toxic-comment-train-processed-seqlen128.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/jigsaw-toxic-comment-train.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/jigsaw-unintended-bias-train-processed-seqlen128.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/jigsaw-unintended-bias-train.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/sample_submission.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/test-processed-seqlen128.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/test.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/test_labels.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/validation-processed-seqlen128.csv \n",
"gs://dataproc-metastore-public-binaries/jigsaw/validation.csv \n",
"\n",
"- Format: .csv\n",
"- Description: 223.549 Human text toxic comments classified into different Responsible AI harms categories: \"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\""
]
},
{
"cell_type": "markdown",
"id": "479e1d96-2a52-4cce-bdf8-df2d49efe4c4",
"metadata": {},
"source": [
"| index | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |\n",
"|-------|------------------------------------|-------------------------------------------------------------------------|-------|--------------|---------|--------|--------|----------------|\n",
"| 0 | 0000997932d777bf | Explanation Why the edits made under my username... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm so... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 3 | 0001b41b1c6bb37e | \" More I can't make any real suggestions on im... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 5 | 00025465d4725e87 | \" Congratulations from me as well, use the to... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 6 | 0002bcb3da6cb337 | COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK | 1 | 1 | 1 | 0 | 1 | 0 |\n",
"| 7 | 00031b1e95af7921 | Your vandalism to the Matt Shirvington article... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 8 | 00037261f536c51d | Sorry if the word 'nonsense' was offensive to... | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 9 | 00040093b2687caa | alignment on this subject and which are contra... | 0 | 0 | 0 | 0 | 0 | 0 |\n"
]
},
{
"cell_type": "markdown",
"id": "e8d2d495",
"metadata": {},
"source": [
"**[wikipedia_translated_clusters](https://github.com/google-research-datasets/wiki-translated-clusters-nli/tree/main)** \n",
"GCS bucket path: \n",
"gs://dataproc-metastore-public-binaries/wikipedia_translated_clusters/*.json\n",
"\n",
"- Format: .json\n",
"- Description: It is a collection of 5K introductions to popular English Wikipedia articles, with their parallel versions in 10 other languages, and machine translations to English. Also includes a synthetically corrupted dataset where one sentence out of the English Wiki is modified, and the task is to use the multilingual documents to identify the outlier with natural language inference (NLI)"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m126",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}