document_ai_warehouse/document-ai-warehouse-java-samples/document_ai_warehouse.ipynb (503 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "f3b664d0",
"metadata": {},
"source": [
"We start by loading the JARs used in this sample from the Maven repository. The artifact is named \"contentwarehouse\", which is the name the API and client library use for Document AI Warehouse."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ba1044c2",
"metadata": {},
"outputs": [],
"source": [
"%%loadFromPOM\n",
"<dependency>\n",
" <groupId>com.google.cloud</groupId>\n",
" <artifactId>google-cloud-contentwarehouse</artifactId>\n",
" <version>0.3.0</version>\n",
"</dependency>"
]
},
{
"cell_type": "markdown",
"id": "b3d31c8a",
"metadata": {},
"source": [
"Next we import the classes that we are going to use."
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "fe87b898",
"metadata": {},
"outputs": [],
"source": [
"import com.google.cloud.contentwarehouse.v1.CreateDocumentRequest;\n",
"import com.google.cloud.contentwarehouse.v1.CreateDocumentResponse;\n",
"import com.google.cloud.contentwarehouse.v1.DateTimeTypeOptions;\n",
"import com.google.cloud.contentwarehouse.v1.DeleteDocumentRequest;\n",
"import com.google.cloud.contentwarehouse.v1.Document;\n",
"import com.google.cloud.contentwarehouse.v1.DocumentQuery;\n",
"import com.google.cloud.contentwarehouse.v1.DocumentSchema;\n",
"import com.google.cloud.contentwarehouse.v1.DocumentSchemaServiceClient;\n",
"import com.google.cloud.contentwarehouse.v1.DocumentServiceClient;\n",
"import com.google.cloud.contentwarehouse.v1.FloatTypeOptions;\n",
"import com.google.cloud.contentwarehouse.v1.LocationName;\n",
"import com.google.cloud.contentwarehouse.v1.Property;\n",
"import com.google.cloud.contentwarehouse.v1.PropertyDefinition;\n",
"import com.google.cloud.contentwarehouse.v1.RawDocumentFileType;\n",
"import com.google.cloud.contentwarehouse.v1.RequestMetadata;\n",
"import com.google.cloud.contentwarehouse.v1.SearchDocumentsRequest;\n",
"import com.google.cloud.contentwarehouse.v1.SearchDocumentsResponse;\n",
"import com.google.cloud.contentwarehouse.v1.TextArray;\n",
"import com.google.cloud.contentwarehouse.v1.TextTypeOptions;\n",
"import com.google.cloud.contentwarehouse.v1.UserInfo;\n",
"\n",
"import com.google.cloud.documentai.v1.DocumentProcessorServiceClient;\n",
"import com.google.cloud.documentai.v1.ProcessRequest;\n",
"import com.google.cloud.documentai.v1.ProcessResponse;\n",
"import com.google.cloud.documentai.v1.RawDocument;\n",
"\n",
"import com.google.protobuf.ByteString;"
]
},
{
"cell_type": "markdown",
"id": "0762a9d8",
"metadata": {},
"source": [
"### Change Here\n",
"In the following cell, be sure to change the values to match your own environment. In particular, supply\n",
"your own values for `PROJECT_NUMBER` and `USERID`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "a1e84d94",
"metadata": {},
"outputs": [],
"source": [
"// Change the following variables\n",
"final String PROJECT_NUMBER = \"41208676560\";\n",
"final String LOCATION = \"us\";\n",
"final String USERID = \"user:kolban@kolban.altostrat.com\";\n",
"\n",
"// End of change area ...\n",
"final RequestMetadata requestMetadata = RequestMetadata.newBuilder()\n",
" .setUserInfo(UserInfo.newBuilder()\n",
" .setId(USERID)\n",
" .build())\n",
" .build();\n",
"final LocationName parent = LocationName.of(PROJECT_NUMBER, LOCATION);\n",
"DocumentSchemaServiceClient documentSchemaServiceClient = DocumentSchemaServiceClient.create();\n",
"DocumentServiceClient documentServiceClient = DocumentServiceClient.create();"
]
},
{
"cell_type": "markdown",
"id": "2b04235e",
"metadata": {},
"source": [
"## Create Schema\n",
"In this example we create a new schema. The code looks long, but most of it is repeated calls to `addPropertyDefinitions`, one for each property we add to the schema. A schema can have zero or more properties, and this example defines six.\n",
"\n",
"At the highest level, the fragment builds a `DocumentSchema` object that describes what the resulting schema should contain. We then call the client method `createDocumentSchema`, which takes that description as input and creates the schema. On completion, the new schema has been assigned a unique name (its identity), which we log."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a5c72310",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name\n",
"-------------------------------------------------------------------------\n",
"projects/41208676560/locations/us/documentSchemas/3loccu79n5t88\n"
]
}
],
"source": [
"public void createSchema() {\n",
" DocumentSchema documentSchema = DocumentSchema.newBuilder()\n",
" .setDisplayName(\"Invoice\")\n",
" .setDescription(\"Invoice Schema\")\n",
" .setDocumentIsFolder(false)\n",
" .addPropertyDefinitions(PropertyDefinition.newBuilder()\n",
" .setName(\"payee\")\n",
" .setDisplayName(\"Payee\")\n",
" .setIsFilterable(true).setIsSearchable(true).setIsMetadata(true).setIsRequired(true)\n",
" .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n",
" .build())\n",
" .addPropertyDefinitions(PropertyDefinition.newBuilder()\n",
" .setName(\"payer\")\n",
" .setDisplayName(\"Payer\")\n",
" .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(true)\n",
" .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n",
" .build())\n",
" .addPropertyDefinitions(PropertyDefinition.newBuilder()\n",
" .setName(\"amount\")\n",
" .setDisplayName(\"Amount\")\n",
" .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n",
" .setFloatTypeOptions(FloatTypeOptions.newBuilder().build())\n",
" .build())\n",
" .addPropertyDefinitions(PropertyDefinition.newBuilder()\n",
" .setName(\"id\")\n",
" .setDisplayName(\"Invoice ID\")\n",
" .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n",
" .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n",
" .build())\n",
" .addPropertyDefinitions(PropertyDefinition.newBuilder()\n",
" .setName(\"date\")\n",
" .setDisplayName(\"Date\")\n",
" .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n",
" .setDateTimeTypeOptions(DateTimeTypeOptions.newBuilder().build())\n",
" .build())\n",
" .addPropertyDefinitions(PropertyDefinition.newBuilder()\n",
" .setName(\"notes\")\n",
" .setDisplayName(\"Notes\")\n",
" .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n",
" .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n",
" .build())\n",
" .build();\n",
" \n",
" DocumentSchema newDocumentSchema = documentSchemaServiceClient.createDocumentSchema(parent, documentSchema);\n",
" \n",
" System.out.println(\"name\");\n",
" System.out.println(\"-------------------------------------------------------------------------\");\n",
" System.out.println(newDocumentSchema.getName());\n",
"} // createSchema\n",
"\n",
"createSchema();"
]
},
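{
"cell_type": "markdown",
"id": "7e1a9c42",
"metadata": {},
"source": [
"The name logged above is a full resource name of the form `projects/{project}/locations/{location}/documentSchemas/{id}`. When only the trailing ID is of interest, for example for logging, it can be split off with plain string handling. The following is a minimal sketch, not part of the client library: the class `ResourceNames` and its method `lastSegment` are hypothetical names introduced here for illustration. Under that assumption, `ResourceNames.lastSegment(\"projects/41208676560/locations/us/documentSchemas/3loccu79n5t88\")` returns `3loccu79n5t88`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2f0d83",
"metadata": {},
"outputs": [],
"source": [
"// Hypothetical helper for illustration; not part of the Document AI Warehouse client library.\n",
"public class ResourceNames {\n",
"    // Returns the text after the last '/' in a resource name,\n",
"    // or the whole input when it contains no '/'.\n",
"    public static String lastSegment(String resourceName) {\n",
"        int slash = resourceName.lastIndexOf('/');\n",
"        return slash < 0 ? resourceName : resourceName.substring(slash + 1);\n",
"    }\n",
"}"
]
},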
{
"cell_type": "markdown",
"id": "fc21e923",
"metadata": {},
"source": [
"## List Schemas\n",
"Having just created a new schema, we should be able to list all our schemas and see it among them. We invoke the `listDocumentSchemas` method of the client, which returns an iterable over all schemas, and log each one."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "709e8ce6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"display name name\n",
"--------------- ---------------------------------------------------------------------\n",
"Invoice projects/41208676560/locations/us/documentSchemas/06l7hah2jjqqo\n",
"Invoice projects/41208676560/locations/us/documentSchemas/13smg321hoo1o\n",
"S1 projects/41208676560/locations/us/documentSchemas/1cp01ej1hk798\n",
"Invoice projects/41208676560/locations/us/documentSchemas/3loccu79n5t88\n",
"Invoice projects/41208676560/locations/us/documentSchemas/3qth4jbn7n4jo\n",
"Invoice projects/41208676560/locations/us/documentSchemas/42gj9il1an3d8\n",
"Invoice projects/41208676560/locations/us/documentSchemas/4449qkpphffdo\n",
"Invoice projects/41208676560/locations/us/documentSchemas/6gap487vkc1a8\n"
]
}
],
"source": [
"public void listSchema() {\n",
" DocumentSchemaServiceClient.ListDocumentSchemasPagedResponse response\n",
" = documentSchemaServiceClient.listDocumentSchemas(parent);\n",
" System.out.println(\"display name name\");\n",
" System.out.println(\"--------------- ---------------------------------------------------------------------\");\n",
" for (DocumentSchema currentSchema: response.iterateAll()) {\n",
" System.out.printf(\"%-15.15s %s\\n\",currentSchema.getDisplayName() , currentSchema.getName());\n",
" }\n",
"} // listSchema\n",
"\n",
"listSchema();"
]
},
{
"cell_type": "markdown",
"id": "2cc4a2b8",
"metadata": {},
"source": [
"## Create a document\n",
"In this fragment we ingest a document into Document AI Warehouse. We take the content of the document from a local file. We must also specify the schema we want to associate with our document."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dd422b7e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name\n",
"-------------------------------------------------------------------------\n",
"projects/41208676560/locations/us/documents/6fd1gpd2f51e0\n"
]
}
],
"source": [
"public void createDocument(String schemaName, ByteString fileData) {\n",
" Document document = Document.newBuilder()\n",
" .setDisplayName(\"Invoice 1\")\n",
" .setTitle(\"My Invoice 1\")\n",
" .setDocumentSchemaName(schemaName)\n",
" .setInlineRawDocument(fileData)\n",
" .setRawDocumentFileType(RawDocumentFileType.RAW_DOCUMENT_FILE_TYPE_PDF)\n",
" .setTextExtractionDisabled(true)\n",
" .addProperties(Property.newBuilder()\n",
" .setName(\"payee\")\n",
" .setTextValues(TextArray.newBuilder().addValues(\"Developer Company\").build())\n",
" .build())\n",
" .addProperties(Property.newBuilder()\n",
" .setName(\"payer\")\n",
" .setTextValues(TextArray.newBuilder().addValues(\"Buyer Company\").build())\n",
" .build())\n",
" .build();\n",
"\n",
" CreateDocumentRequest createDocumentRequest = CreateDocumentRequest.newBuilder()\n",
" .setDocument(document)\n",
" .setParent(parent.toString())\n",
" .setRequestMetadata(requestMetadata)\n",
" .build();\n",
"\n",
" CreateDocumentResponse createDocumentResponse = documentServiceClient.createDocument(createDocumentRequest);\n",
"\n",
" System.out.println(\"name\");\n",
" System.out.println(\"-------------------------------------------------------------------------\");\n",
" System.out.println(createDocumentResponse.getDocument().getName());\n",
"} // createDocument\n",
"\n",
"String schemaName = \"projects/41208676560/locations/us/documentSchemas/4449qkpphffdo\";\n",
"String fileName = \"data/SampleInvoice1.pdf\";\n",
"\n",
"ByteString fileData = ByteString.readFrom(new FileInputStream(fileName));\n",
"createDocument(schemaName, fileData);"
]
},
{
"cell_type": "markdown",
"id": "9c0bc35a",
"metadata": {},
"source": [
"## Document Deletion\n",
"Having looked at how we can create a document, we now look at how to delete a document."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "84114161",
"metadata": {},
"outputs": [],
"source": [
"public void deleteDocument(String documentName) {\n",
" DeleteDocumentRequest deleteDocumentRequest = DeleteDocumentRequest.newBuilder()\n",
" .setName(documentName)\n",
" .setRequestMetadata(requestMetadata)\n",
" .build();\n",
" documentServiceClient.deleteDocument(deleteDocumentRequest);\n",
"} // deleteDocument\n",
"\n",
"String documentName = \"projects/41208676560/locations/us/documents/6fd1gpd2f51e0\";\n",
"\n",
"deleteDocument(documentName);"
]
},
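{
"cell_type": "markdown",
"id": "3c8d6a17",
"metadata": {},
"source": [
"The document name passed to `deleteDocument` follows the same pattern as the name printed when the document was created: `projects/{project}/locations/{location}/documents/{id}`. Rather than pasting the full string, the name can be assembled from its parts. This is a minimal sketch under that assumption; the class `DocumentNames` and its method `of` are hypothetical helpers introduced here, not part of the client library. With the values used in this notebook, `DocumentNames.of(PROJECT_NUMBER, LOCATION, \"6fd1gpd2f51e0\")` yields the same string as the hard-coded `documentName`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f4e2b60",
"metadata": {},
"outputs": [],
"source": [
"// Hypothetical helper for illustration; not part of the Document AI Warehouse client library.\n",
"public class DocumentNames {\n",
"    // Assembles a document resource name from project, location and document ID,\n",
"    // following the pattern seen in the names this notebook prints.\n",
"    public static String of(String project, String location, String documentId) {\n",
"        return String.format(\"projects/%s/locations/%s/documents/%s\", project, location, documentId);\n",
"    }\n",
"}"
]
},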
{
"cell_type": "markdown",
"id": "22e84d33",
"metadata": {},
"source": [
"## Document Search\n",
"One of the most important features of Document AI Warehouse is the ability to search for documents. In this fragment we perform a search and show the documents that matched. The result of a search is an object that contains an iterable over the matching documents."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "8c7a3865",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"display name name\n",
"--------------- ------------------------------------------------------------------------\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/5uve1gj0vtrk8\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/4tba7qmdk0mag\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/4i3tjqjqj95og\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/3j8fs62gl2d0o\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/2rjtn7u6sp6s8\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/2nc617rhono70\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/1oljii2hnfdio\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/145ln3mgq2h18\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/0sjgojp1jomp0\n",
"Invoice GCS 1 projects/41208676560/locations/us/documents/0enlnenumms4o\n"
]
}
],
"source": [
"public void searchDocuments(String query) {\n",
" DocumentQuery documentQuery = DocumentQuery.newBuilder()\n",
" .setQuery(query)\n",
" .build();\n",
"\n",
" SearchDocumentsRequest searchDocumentsRequest = SearchDocumentsRequest.newBuilder()\n",
" .setDocumentQuery(documentQuery)\n",
" .setParent(parent.toString())\n",
" .setRequestMetadata(requestMetadata)\n",
" .build();\n",
"\n",
" DocumentServiceClient.SearchDocumentsPagedResponse response\n",
" = documentServiceClient.searchDocuments(searchDocumentsRequest);\n",
" \n",
" System.out.println(\"display name name\");\n",
" System.out.println(\"--------------- ------------------------------------------------------------------------\");\n",
" for (SearchDocumentsResponse.MatchingDocument matchingDocument: response.iterateAll()) {\n",
" System.out.printf(\"%-15.15s %s\\n\",\n",
" matchingDocument.getDocument().getDisplayName() , matchingDocument.getDocument().getName());\n",
" }\n",
"} // searchDocuments\n",
"searchDocuments(\"12-345678\");"
]
},
{
"cell_type": "markdown",
"id": "c1e5cd09",
"metadata": {},
"source": [
"## Document Creation with Doc AI\n",
"Next our example gets a little richer. This time we invoke Document AI (Doc AI) to process (parse) a document, then pass the resulting Doc AI `Document` to Document AI Warehouse so that both the file and the parsed data are stored."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "a312df37",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name\n",
"-------------------------------------------------------------------------\n",
"projects/41208676560/locations/us/documents/6s9fc7l3re52g\n"
]
}
],
"source": [
"public void createDocAIDocument(\n",
" String schemaName,\n",
" com.google.cloud.documentai.v1.Document docAiDocument,\n",
" ByteString fileData) {\n",
"\n",
" Document document = Document.newBuilder()\n",
" .setDisplayName(\"Invoice 1\")\n",
" .setTitle(\"My Invoice 1\")\n",
" .setDocumentSchemaName(schemaName)\n",
" .setCloudAiDocument(docAiDocument)\n",
" .setInlineRawDocument(fileData)\n",
" .setRawDocumentFileType(RawDocumentFileType.RAW_DOCUMENT_FILE_TYPE_PDF)\n",
" .setTextExtractionDisabled(true)\n",
" .addProperties(Property.newBuilder()\n",
" .setName(\"payee\")\n",
" .setTextValues(TextArray.newBuilder().addValues(\"Developer Company\").build())\n",
" .build())\n",
" .addProperties(Property.newBuilder()\n",
" .setName(\"payer\")\n",
" .setTextValues(TextArray.newBuilder().addValues(\"Buyer Company\").build())\n",
" .build())\n",
" .build();\n",
"\n",
" CreateDocumentRequest createDocumentRequest = CreateDocumentRequest.newBuilder()\n",
" .setDocument(document)\n",
" .setParent(parent.toString())\n",
" .setRequestMetadata(requestMetadata)\n",
" .build();\n",
"\n",
" CreateDocumentResponse createDocumentResponse = documentServiceClient.createDocument(createDocumentRequest);\n",
"\n",
" System.out.println(\"name\");\n",
" System.out.println(\"-------------------------------------------------------------------------\");\n",
" System.out.println(createDocumentResponse.getDocument().getName());\n",
"\n",
"} // createDocAIDocument\n",
"\n",
"public com.google.cloud.documentai.v1.Document processDocAI(String processorName, ByteString fileData) {\n",
" try {\n",
" try (DocumentProcessorServiceClient documentProcessorServiceClient = DocumentProcessorServiceClient.create()) {\n",
" RawDocument rawDocument = RawDocument.newBuilder()\n",
" .setContent(fileData)\n",
" .setMimeType(\"application/pdf\")\n",
" .build();\n",
" ProcessRequest processRequest = ProcessRequest.newBuilder()\n",
" .setName(processorName)\n",
" .setRawDocument(rawDocument)\n",
" .build();\n",
" ProcessResponse response = documentProcessorServiceClient.processDocument(processRequest);\n",
" return response.getDocument();\n",
" }\n",
" } catch(Exception e) {\n",
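" // Fail loudly: returning null here would only surface later as a NullPointerException in setCloudAiDocument.\n",
" throw new RuntimeException(\"Document AI processing failed\", e);\n",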
" }\n",
"} // processDocAI\n",
"\n",
"\n",
"String processorName = \"projects/41208676560/locations/us/processors/7bc4dc0dfcc7e040\";\n",
"com.google.cloud.documentai.v1.Document docAiDocument = processDocAI(processorName, fileData);\n",
"createDocAIDocument(schemaName, docAiDocument, fileData);"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8272a5b4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Java",
"language": "java",
"name": "java"
},
"language_info": {
"codemirror_mode": "java",
"file_extension": ".jshell",
"mimetype": "text/x-java-source",
"name": "Java",
"pygments_lexer": "java",
"version": "11.0.16+8-post-Debian-1deb10u1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}