.. _ref-rich_content_extraction: ======================= Rich Content Extraction ======================= For some projects it is desirable to index text content which is stored in structured files such as PDFs, Microsoft Office documents, images, etc. Currently only Solr's `ExtractingRequestHandler`_ is directly supported by Haystack but the approach below could be used with any backend which supports this feature. .. _`ExtractingRequestHandler`: http://wiki.apache.org/solr/ExtractingRequestHandler Extracting Content ================== :meth:`SearchBackend.extract_file_contents` accepts a file or file-like object and returns a dictionary containing two keys: ``metadata`` and ``contents``. The ``contents`` value will be a string containing all of the text which the backend managed to extract from the file contents. ``metadata`` will always be a dictionary but the keys and values will vary based on the underlying extraction engine and the type of file provided. Indexing Extracted Content ========================== Generally you will want to include the extracted text in your main document field along with everything else specified in your search template. This example shows how to override a hypothetical ``FileIndex``'s ``prepare`` method to include the extract content along with information retrieved from the database:: def prepare(self, obj): data = super(FileIndex, self).prepare(obj) # This could also be a regular Python open() call, a StringIO instance # or the result of opening a URL. Note that due to a library limitation # file_obj must have a .name attribute even if you need to set one # manually before calling extract_file_contents: file_obj = obj.the_file.open() extracted_data = self.backend.extract_file_contents(file_obj) # Now we'll finally perform the template processing to render the # text field with *all* of our metadata visible for templating: t = loader.select_template(('search/indexes/myapp/file_text.txt', )) data['text'] = t.render(Context({'object': obj, 'extracted': extracted_data})) return data This allows you to insert the extracted text at the appropriate place in your template, modified or intermixed with database content as appropriate: .. code-block:: html+django {{ object.title }} {{ object.owner.name }} … {% for k, v in extracted.metadata.items %} {% for val in v %} {{ k }}: {{ val|safe }} {% endfor %} {% endfor %} {{ extracted.contents|striptags|safe }}