Searching PDF content – AWS CloudSearch

In my earlier post, we saw how to use AWS CloudSearch to make data in an RDBMS easily searchable. The documentation says that CloudSearch also supports a variety of other document formats, such as PDF and Word, so I tried to figure out how to search the content of PDF files. This information was not readily available in the AWS documentation, or on any forum or blog I could find.

My first attempt was dead on arrival. I used the “upload documents” feature in the domain dashboard in CloudSearch. The file uploaded successfully, the console asked me to run indexing, and I did. However, when I searched for any word that was in the PDF file, I got nothing. I repeated the steps, and this time I reviewed the data the wizard generated before uploading. To my surprise, I found that the “content” field was in binary! No wonder it didn’t work. I started researching and noticed that two other people had posted the same question on Stack Overflow, but no one had answered. I posted to the AWS forum as well, but no answer came in two days. [ I answered my own query at the end of it – nice way to score some brownie points :-)]

I then looked at competing products such as ElasticSearch and Solr, and their documentation stated that the file needs to be encoded as Base64. So I converted the PDF to Base64 and uploaded it to AWS, but that did not work either. Re-reading the AWS documentation, I decided to try their CLI command cs-import-documents. This is the command-line equivalent of the “upload documents” wizard, and it can upload directly to the domain. However, my attempt to upload directly gave a fatal error. I then used the --output option to generate the SDF (Search Data Format) file for the PDF I wanted instead of uploading it. The tool generated an SDF JSON file. The content was all text, and it looked as follows:

[ {
  "type" : "add",
  "id" : "C:_Downloads_100q.pdf",
  "fields" : {
    "content_type" : "application/pdf",
    "resourcename" : "100q.pdf",
    "content" : "The is the content of the pdf",

    "xmptpg_npages" : "11"
  }
} ]

I uploaded this JSON document using the “upload documents” button on the AWS CloudSearch console. Then I ran a search from the “Run Test” interface and voila! It found my PDF and displayed the text stored in the content field.
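If you already have the extracted text in hand (for example from Apache Tika), a document in the same shape as the generated file above can be put together by hand. Here is a minimal Python sketch of that idea. The function names are my own, the document ID in the example is made up, and the field names are simply the ones that appeared in the file the tool generated for me; your domain needs matching index fields for them.

```python
import json


def make_add_operation(doc_id, resource_name, text,
                       content_type="application/pdf"):
    """Build one "add" operation shaped like the generated SDF sample.

    doc_id, resource_name and text are supplied by the caller; the field
    names are copied from the sample and are an assumption about how the
    domain's index fields are configured.
    """
    return {
        "type": "add",
        "id": doc_id,
        "fields": {
            "content_type": content_type,
            "resourcename": resource_name,
            "content": text,
        },
    }


def make_batch(operations):
    """Serialize a list of operations as a JSON array (one batch)."""
    return json.dumps(operations, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    op = make_add_operation("100q_pdf", "100q.pdf", "Text extracted from the PDF")
    # Save this output to a .json file and upload it from the console.
    print(make_batch([op]))
```

The output is a JSON array with a single add operation, which is the same structure I uploaded through the console.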

Though I got it to work, the approach has some limitations. We already saw that each document cannot be more than 1 MB in size and a batch cannot be more than 5 MB. You therefore need to chop large PDFs into multiple smaller files and run each through the CLI utility, or extract the text with a content extraction tool such as Apache Tika, create the JSON documents yourself, and then upload them. (It would be nice if the CLI tool automatically split the content into 1 MB pieces when uploading or generating the SDF files.) The other limitation is that a search returns the entire content field, not the matching paragraph, page number, or any similar additional information. However, based on the document name in the result, you can fetch the PDF, search within it, and present it the way you want.
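To illustrate the splitting, here is a rough Python sketch that cuts already-extracted text into pieces and groups them into batches under the limits mentioned above. Everything in it is an assumption for illustration: the function names, the "_0", "_1" suffix on the document ID, and the 900,000-byte content budget, which only leaves some headroom for the JSON wrapper and escaping. It does not call AWS at all; it only prepares the JSON.

```python
import json

MAX_DOC_BYTES = 1024 * 1024        # 1 MB per document
MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB per batch


def split_text(text, max_bytes):
    """Split text into chunks of at most max_bytes when UTF-8 encoded."""
    if max_bytes < 4:
        raise ValueError("max_bytes must fit at least one UTF-8 character")
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back up so a multi-byte character is never cut in half.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


def json_size(obj):
    """Size in bytes of obj once serialized as UTF-8 JSON."""
    return len(json.dumps(obj, ensure_ascii=False).encode("utf-8"))


def make_operations(doc_id, resource_name, text,
                    budget=900_000, max_doc_bytes=MAX_DOC_BYTES):
    """One "add" operation per chunk; IDs get a hypothetical _N suffix."""
    ops = []
    for i, chunk in enumerate(split_text(text, budget)):
        op = {"type": "add", "id": f"{doc_id}_{i}",
              "fields": {"resourcename": resource_name, "content": chunk}}
        if json_size(op) > max_doc_bytes:
            # JSON escaping can inflate the text; retry with a lower budget.
            raise ValueError(f"chunk {i} exceeds the per-document limit")
        ops.append(op)
    return ops


def make_batches(ops, max_bytes=MAX_BATCH_BYTES):
    """Group operations into JSON arrays that each stay under max_bytes."""
    batches, current, size = [], [], 2  # 2 bytes for "[" and "]"
    for op in ops:
        op_size = json_size(op) + 2     # plus the ", " separator
        if current and size + op_size > max_bytes:
            batches.append(json.dumps(current, ensure_ascii=False))
            current, size = [], 2
        current.append(op)
        size += op_size
    if current:
        batches.append(json.dumps(current, ensure_ascii=False))
    return batches
```

Each string returned by make_batches can be saved as its own .json file and uploaded separately. Since every chunk carries the original file name in resourcename, a hit on any chunk still tells you which PDF to fetch.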

Here is an article on how to do this with Elasticsearch: https://www.elastic.co/blog/ingesting-and-exploring-scientific-papers-using-elastic-cloud

I have already signed up for the ElasticSearch service and hope to try this out there. If you have done this on your own, please share your experience and give me a pointer. I would like to hear your thoughts.
