Rich Content Extraction

For some projects it is desirable to index text content which is stored in structured files such as PDFs, Microsoft Office documents, images, etc. Currently only Solr’s ExtractingRequestHandler is directly supported by Haystack but the approach below could be used with any backend which supports this feature.

Extracting Content

SearchBackend.extract_file_contents() accepts a file or file-like object and returns a dictionary containing two keys: metadata and contents. The contents value will be a string containing all of the text which the backend managed to extract from the file contents. metadata will always be a dictionary but the keys and values will vary based on the underlying extraction engine and the type of file provided.

Indexing Extracted Content

Generally you will want to include the extracted text in your main document field along with everything else specified in your search template. This example shows how to override a hypothetical FileIndex’s prepare method to include the extract content along with information retrieved from the database:

def prepare(self, obj):
    data = super(FileIndex, self).prepare(obj)

    # This could also be a regular Python open() call, a StringIO instance
    # or the result of opening a URL. Note that due to a library limitation
    # file_obj must have a .name attribute even if you need to set one
    # manually before calling extract_file_contents:
    file_obj = obj.the_file.open()

    extracted_data = self.get_backend().extract_file_contents(file_obj)

    # Now we'll finally perform the template processing to render the
    # text field with *all* of our metadata visible for templating:
    t = loader.select_template(('search/indexes/myapp/file_text.txt', ))
    data['text'] = t.render(Context({'object': obj,
                                     'extracted': extracted_data}))

    return data

This allows you to insert the extracted text at the appropriate place in your template, modified or intermixed with database content as appropriate:

{{ object.title }}
{{ object.owner.name }}{% for k, v in extracted.metadata.items %}
    {% for val in v %}
        {{ k }}: {{ val|safe }}
    {% endfor %}
{% endfor %}

{{ extracted.contents|striptags|safe }}