document_ai_warehouse/document-ai-warehouse-java-samples/document_ai_warehouse.ipynb (503 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "f3b664d0", "metadata": {}, "source": [ "We start by importing the JARs from the Maven repository that are going to be used in this sample. We see that we import \"contentwarehouse\" which is just a synonym for Document AI Warehouse." ] }, { "cell_type": "code", "execution_count": 1, "id": "ba1044c2", "metadata": {}, "outputs": [], "source": [ "%%loadFromPOM\n", "<dependency>\n", " <groupId>com.google.cloud</groupId>\n", " <artifactId>google-cloud-contentwarehouse</artifactId>\n", " <version>0.3.0</version>\n", "</dependency>" ] }, { "cell_type": "markdown", "id": "b3d31c8a", "metadata": {}, "source": [ "Next we import the classes that we are going to use." ] }, { "cell_type": "code", "execution_count": 41, "id": "fe87b898", "metadata": {}, "outputs": [], "source": [ "import com.google.cloud.contentwarehouse.v1.CreateDocumentRequest;\n", "import com.google.cloud.contentwarehouse.v1.CreateDocumentResponse;\n", "import com.google.cloud.contentwarehouse.v1.DateTimeTypeOptions;\n", "import com.google.cloud.contentwarehouse.v1.DeleteDocumentRequest;\n", "import com.google.cloud.contentwarehouse.v1.Document;\n", "import com.google.cloud.contentwarehouse.v1.DocumentQuery;\n", "import com.google.cloud.contentwarehouse.v1.DocumentSchema;\n", "import com.google.cloud.contentwarehouse.v1.DocumentSchemaServiceClient;\n", "import com.google.cloud.contentwarehouse.v1.DocumentServiceClient;\n", "import com.google.cloud.contentwarehouse.v1.FloatTypeOptions;\n", "import com.google.cloud.contentwarehouse.v1.LocationName;\n", "import com.google.cloud.contentwarehouse.v1.Property;\n", "import com.google.cloud.contentwarehouse.v1.PropertyDefinition;\n", "import com.google.cloud.contentwarehouse.v1.RawDocumentFileType;\n", "import com.google.cloud.contentwarehouse.v1.RequestMetadata;\n", "import com.google.cloud.contentwarehouse.v1.SearchDocumentsRequest;\n", "import com.google.cloud.contentwarehouse.v1.SearchDocumentsResponse;\n", "import com.google.cloud.contentwarehouse.v1.TextArray;\n", "import com.google.cloud.contentwarehouse.v1.TextTypeOptions;\n", "import com.google.cloud.contentwarehouse.v1.UserInfo;\n", "\n", "import com.google.cloud.documentai.v1.DocumentProcessorServiceClient;\n", "import com.google.cloud.documentai.v1.ProcessRequest;\n", "import com.google.cloud.documentai.v1.ProcessResponse;\n", "import com.google.cloud.documentai.v1.RawDocument;\n", "\n", "import com.google.protobuf.ByteString;" ] }, { "cell_type": "markdown", "id": "0762a9d8", "metadata": {}, "source": [ "### Change Here\n", "In the following cell, be sure and change the values to reflect your own environment. Specifically, you\n", "should definitely supply your own value for `PROJECT_NUMBER`." ] }, { "cell_type": "code", "execution_count": 25, "id": "a1e84d94", "metadata": {}, "outputs": [], "source": [ "// Change the following variables\n", "final String PROJECT_NUMBER = \"41208676560\";\n", "final String LOCATION = \"us\";\n", "final String USERID = \"user:kolban@kolban.altostrat.com\";\n", "\n", "// End of change area ...\n", "final RequestMetadata requestMetadata = RequestMetadata.newBuilder()\n", " .setUserInfo(UserInfo.newBuilder()\n", " .setId(USERID)\n", " .build())\n", " .build();\n", "final LocationName parent = LocationName.of(PROJECT_NUMBER, LOCATION);\n", "DocumentSchemaServiceClient documentSchemaServiceClient = DocumentSchemaServiceClient.create();\n", "DocumentServiceClient documentServiceClient = DocumentServiceClient.create();" ] }, { "cell_type": "markdown", "id": "2b04235e", "metadata": {}, "source": [ "## Create Schema\n", "In this example we will be creating a new schema. While it looks like a large amount of code, don't let that fool you. A schema can have zero or more properties and in this example we are setting quite a few. As such, most of the code is merely repetitions of `addPropertyDefinitions` where we add new properties to the description of the schema we wish to create.\n", "\n", "At the highest level, our fragment populates an instance of an object of type `DocumentSchema`. This describes what we want our resulting schema to contain. Next we invoke a client method called `createDocumentSchema` that takes as input our schema description and causes the creation of a new schema based on our description. On completion, a new schema will have been created and will have been assigned a unique name (identity). The value of that name is then logged." ] }, { "cell_type": "code", "execution_count": 17, "id": "a5c72310", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name\n", "-------------------------------------------------------------------------\n", "projects/41208676560/locations/us/documentSchemas/3loccu79n5t88\n" ] } ], "source": [ "public void createSchema() {\n", " DocumentSchema documentSchema = DocumentSchema.newBuilder()\n", " .setDisplayName(\"Invoice\")\n", " .setDescription(\"Invoice Schema\")\n", " .setDocumentIsFolder(false)\n", " .addPropertyDefinitions(PropertyDefinition.newBuilder()\n", " .setName(\"payee\")\n", " .setDisplayName(\"Payee\")\n", " .setIsFilterable(true).setIsSearchable(true).setIsMetadata(true).setIsRequired(true)\n", " .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n", " .build())\n", " .addPropertyDefinitions(PropertyDefinition.newBuilder()\n", " .setName(\"payer\")\n", " .setDisplayName(\"Payer\")\n", " .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(true)\n", " .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n", " .build())\n", " .addPropertyDefinitions(PropertyDefinition.newBuilder()\n", " .setName(\"amount\")\n", " .setDisplayName(\"Amount\")\n", " .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n", " .setFloatTypeOptions(FloatTypeOptions.newBuilder().build())\n", " .build())\n", " .addPropertyDefinitions(PropertyDefinition.newBuilder()\n", " .setName(\"id\")\n", " .setDisplayName(\"Invoice ID\")\n", " .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n", " .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n", " .build())\n", " .addPropertyDefinitions(PropertyDefinition.newBuilder()\n", " .setName(\"date\")\n", " .setDisplayName(\"Date\")\n", " .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n", " .setDateTimeTypeOptions(DateTimeTypeOptions.newBuilder().build())\n", " .build())\n", " .addPropertyDefinitions(PropertyDefinition.newBuilder()\n", " .setName(\"notes\")\n", " .setDisplayName(\"Notes\")\n", " .setIsFilterable(true).setIsSearchable(false).setIsMetadata(true).setIsRequired(false)\n", " .setTextTypeOptions(TextTypeOptions.newBuilder().build())\n", " .build())\n", " .build();\n", " \n", " DocumentSchema newDocumentSchema = documentSchemaServiceClient.createDocumentSchema(parent, documentSchema);\n", " \n", " System.out.println(\"name\");\n", " System.out.println(\"-------------------------------------------------------------------------\");\n", " System.out.println(newDocumentSchema.getName());\n", "} // createSchema\n", "\n", "createSchema()" ] }, { "cell_type": "markdown", "id": "fc21e923", "metadata": {}, "source": [ "## List Schemas\n", "Having just create a new schema, we should be able to list all our schemas and see the one we just created. There isn't much to explain here. We invoke the `listDocumentSchemas` method of the client which returns an iterrable over the list of all schemas that we then log." ] }, { "cell_type": "code", "execution_count": 18, "id": "709e8ce6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "display name name\n", "--------------- ---------------------------------------------------------------------\n", "Invoice projects/41208676560/locations/us/documentSchemas/06l7hah2jjqqo\n", "Invoice projects/41208676560/locations/us/documentSchemas/13smg321hoo1o\n", "S1 projects/41208676560/locations/us/documentSchemas/1cp01ej1hk798\n", "Invoice projects/41208676560/locations/us/documentSchemas/3loccu79n5t88\n", "Invoice projects/41208676560/locations/us/documentSchemas/3qth4jbn7n4jo\n", "Invoice projects/41208676560/locations/us/documentSchemas/42gj9il1an3d8\n", "Invoice projects/41208676560/locations/us/documentSchemas/4449qkpphffdo\n", "Invoice projects/41208676560/locations/us/documentSchemas/6gap487vkc1a8\n" ] } ], "source": [ "public void listSchema() {\n", " DocumentSchemaServiceClient.ListDocumentSchemasPagedResponse response\n", " = documentSchemaServiceClient.listDocumentSchemas(parent);\n", " System.out.println(\"display name name\");\n", " System.out.println(\"--------------- ---------------------------------------------------------------------\");\n", " for (DocumentSchema currentSchema: response.iterateAll()) {\n", " System.out.printf(\"%-15.15s %s\\n\",currentSchema.getDisplayName() , currentSchema.getName());\n", " }\n", "} // listSchema\n", "\n", "listSchema();" ] }, { "cell_type": "markdown", "id": "2cc4a2b8", "metadata": {}, "source": [ "## Create a document\n", "In this fragment we ingest a document into Document AI Warehouse. We take the content of the document from a local file. We must also specify the schema we want to associate with our document." ] }, { "cell_type": "code", "execution_count": 33, "id": "dd422b7e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name\n", "-------------------------------------------------------------------------\n", "projects/41208676560/locations/us/documents/6fd1gpd2f51e0\n" ] } ], "source": [ "public void createDocument(String schemaName, ByteString fileData) {\n", " Document document = Document.newBuilder()\n", " .setDisplayName(\"Invoice 1\")\n", " .setTitle(\"My Invoice 1\")\n", " .setDocumentSchemaName(schemaName)\n", " .setInlineRawDocument(fileData)\n", " .setRawDocumentFileType(RawDocumentFileType.RAW_DOCUMENT_FILE_TYPE_PDF)\n", " .setTextExtractionDisabled(true)\n", " .addProperties(Property.newBuilder()\n", " .setName(\"payee\")\n", " .setTextValues(TextArray.newBuilder().addValues(\"Developer Company\").build())\n", " .build())\n", " .addProperties(Property.newBuilder()\n", " .setName(\"payer\")\n", " .setTextValues(TextArray.newBuilder().addValues(\"Buyer Company\").build())\n", " .build())\n", " .build();\n", "\n", " CreateDocumentRequest createDocumentRequest = CreateDocumentRequest.newBuilder()\n", " .setDocument(document)\n", " .setParent(parent.toString())\n", " .setRequestMetadata(requestMetadata)\n", " .build();\n", "\n", " CreateDocumentResponse createDocumentResponse = documentServiceClient.createDocument(createDocumentRequest);\n", "\n", " System.out.println(\"name\");\n", " System.out.println(\"-------------------------------------------------------------------------\");\n", " System.out.println(createDocumentResponse.getDocument().getName());\n", "} // createDocument\n", "\n", "String schemaName = \"projects/41208676560/locations/us/documentSchemas/4449qkpphffdo\";\n", "String fileName = \"data/SampleInvoice1.pdf\";\n", "\n", "ByteString fileData = ByteString.readFrom(new FileInputStream(fileName));\n", "createDocument(schemaName, fileData);" ] }, { "cell_type": "markdown", "id": "9c0bc35a", "metadata": {}, "source": [ "## Document Deletion\n", "Having looked at how we can create a document, we now look at how to delete a document." ] }, { "cell_type": "code", "execution_count": 34, "id": "84114161", "metadata": {}, "outputs": [], "source": [ "public void deleteDocument(String documentName) {\n", " DeleteDocumentRequest deleteDocumentRequest = DeleteDocumentRequest.newBuilder()\n", " .setName(documentName)\n", " .setRequestMetadata(requestMetadata)\n", " .build();\n", " documentServiceClient.deleteDocument(deleteDocumentRequest);\n", "} // deleteDocument\n", "\n", "String documentName = \"projects/41208676560/locations/us/documents/6fd1gpd2f51e0\";\n", "\n", "deleteDocument(documentName);" ] }, { "cell_type": "markdown", "id": "22e84d33", "metadata": {}, "source": [ "## Document Search\n", "One of the most important features of Document AI Warehouse is the ability to search for documents. In this fragment we perform a search and show the documents that matched. The result of a search is an object that contains an iterrable that will walk us over the documents that matched." ] }, { "cell_type": "code", "execution_count": 37, "id": "8c7a3865", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "display name name\n", "--------------- ------------------------------------------------------------------------\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/5uve1gj0vtrk8\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/4tba7qmdk0mag\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/4i3tjqjqj95og\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/3j8fs62gl2d0o\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/2rjtn7u6sp6s8\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/2nc617rhono70\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/1oljii2hnfdio\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/145ln3mgq2h18\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/0sjgojp1jomp0\n", "Invoice GCS 1 projects/41208676560/locations/us/documents/0enlnenumms4o\n" ] } ], "source": [ "public void searchDocuments(String query) {\n", " DocumentQuery documentQuery = DocumentQuery.newBuilder()\n", " .setQuery(query)\n", " .build();\n", "\n", " SearchDocumentsRequest searchDocumentsRequest = SearchDocumentsRequest.newBuilder()\n", " .setDocumentQuery(documentQuery)\n", " .setParent(parent.toString())\n", " .setRequestMetadata(requestMetadata)\n", " .build();\n", "\n", " DocumentServiceClient.SearchDocumentsPagedResponse response\n", " = documentServiceClient.searchDocuments(searchDocumentsRequest);\n", " \n", " System.out.println(\"display name name\");\n", " System.out.println(\"--------------- ------------------------------------------------------------------------\");\n", " for (SearchDocumentsResponse.MatchingDocument matchingDocument: response.iterateAll()) {\n", " System.out.printf(\"%-15.15s %s\\n\",\n", " matchingDocument.getDocument().getDisplayName() , matchingDocument.getDocument().getName());\n", " }\n", "} // searchDocuments\n", "searchDocuments(\"12-345678\");" ] }, { "cell_type": "markdown", "id": "c1e5cd09", "metadata": {}, "source": [ "## Document Creation with Doc AI\n", "Next our example gets a little richer. This time we invoke Doc AI to process (parse) a document and pass the Doc AI Document results returned into Document AI Warehouse to store both the file and the parsed data." ] }, { "cell_type": "code", "execution_count": 44, "id": "a312df37", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name\n", "-------------------------------------------------------------------------\n", "projects/41208676560/locations/us/documents/6s9fc7l3re52g\n" ] } ], "source": [ "public void createDocAIDocument(\n", " String schemaName,\n", " com.google.cloud.documentai.v1.Document docAiDocument,\n", " ByteString fileData) {\n", "\n", " Document document = Document.newBuilder()\n", " .setDisplayName(\"Invoice 1\")\n", " .setTitle(\"My Invoice 1\")\n", " .setDocumentSchemaName(schemaName)\n", " .setCloudAiDocument(docAiDocument)\n", " .setInlineRawDocument(fileData)\n", " .setRawDocumentFileType(RawDocumentFileType.RAW_DOCUMENT_FILE_TYPE_PDF)\n", " .setTextExtractionDisabled(true)\n", " .addProperties(Property.newBuilder()\n", " .setName(\"payee\")\n", " .setTextValues(TextArray.newBuilder().addValues(\"Developer Company\").build())\n", " .build())\n", " .addProperties(Property.newBuilder()\n", " .setName(\"payer\")\n", " .setTextValues(TextArray.newBuilder().addValues(\"Buyer Company\").build())\n", " .build())\n", " .build();\n", "\n", " RequestMetadata requestMetadata = RequestMetadata.newBuilder()\n", " .setUserInfo(UserInfo.newBuilder()\n", " .setId(USERID)\n", " .build())\n", " .build();\n", " \n", " CreateDocumentRequest createDocumentRequest = CreateDocumentRequest.newBuilder()\n", " .setDocument(document)\n", " .setParent(parent.toString())\n", " .setRequestMetadata(requestMetadata)\n", " .build();\n", "\n", " CreateDocumentResponse createDocumentResponse = documentServiceClient.createDocument(createDocumentRequest);\n", "\n", " System.out.println(\"name\");\n", " System.out.println(\"-------------------------------------------------------------------------\");\n", " System.out.println(createDocumentResponse.getDocument().getName());\n", "\n", "} // createDocument\n", "\n", "public com.google.cloud.documentai.v1.Document processDocAI(String processorName, ByteString fileData) {\n", " try {\n", " try (DocumentProcessorServiceClient documentProcessorServiceClient = DocumentProcessorServiceClient.create()) {\n", " RawDocument rawDocument = RawDocument.newBuilder()\n", " .setContent(fileData)\n", " .setMimeType(\"application/pdf\")\n", " .build();\n", " ProcessRequest processRequest = ProcessRequest.newBuilder()\n", " .setName(processorName)\n", " .setRawDocument(rawDocument)\n", " .build();\n", " ProcessResponse response = documentProcessorServiceClient.processDocument(processRequest);\n", " return response.getDocument();\n", " }\n", " } catch(Exception e) {\n", " e.printStackTrace();\n", " return null;\n", " }\n", "} // processDocAI\n", "\n", "\n", "String processorName = \"projects/41208676560/locations/us/processors/7bc4dc0dfcc7e040\";\n", "com.google.cloud.documentai.v1.Document docAiDocument = processDocAI(processorName, fileData);\n", "createDocAIDocument(schemaName, docAiDocument, fileData);" ] }, { "cell_type": "code", "execution_count": null, "id": "8272a5b4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Java", "language": "java", "name": "java" }, "language_info": { "codemirror_mode": "java", "file_extension": ".jshell", "mimetype": "text/x-java-source", "name": "Java", "pygments_lexer": "java", "version": "11.0.16+8-post-Debian-1deb10u1" } }, "nbformat": 4, "nbformat_minor": 5 }