Oct 11, 2012 13:30 CEST - "I am very glad that KMWorld recognizes Hydra as a trend-setting product. Our goal in releasing Hydra as open source was to contribute technology back to the community that we think will be useful in many projects," says Helge Legernes, founding partner and CTO at Findwise.
Findwise unveils their content processing framework “Hydra” as Open Source
May 09, 2012 22:00 CEST
Today the Sweden-based company Findwise unveils Hydra, its content processing framework for search-driven solutions. Hydra is released to give clients, the market and the industry the best possible opportunity to meet the needs of today’s growing information landscape. It is open source and free for all to use, develop and benefit from.
Helge Legernes, founder and CTO at Findwise, says: "I am pleased and proud that a Swedish software company is providing such leading technology in document processing that even Silicon Valley is looking towards Sweden. Our goal is that Findwise technology will be the world leader in this field. This goal is the reason why we put out this framework as free open source. It is well suited for enterprise search solutions and Big Data applications."
What is Hydra?
When working with free text search, the quality of the data in the index is a key factor in the quality of the results delivered and has a major impact on the information consumption experience. Hydra is designed to give a search solution the tools it needs to modify the data to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline that documents pass through before being indexed into the search engine.
The pipeline is an essential part of any search architecture, responsible for enriching data before it is indexed. At Findwise we use the pipeline for a wide range of tasks that power and enhance the search solutions we create. These may be simple, like extracting headlines from documents, generating thumbnails, or detecting the language a document is written in. They may also be more complex, such as Natural Language Processing tasks that add metadata about which entities are mentioned in a text (entity extraction) or determine whether a news article is positive or negative in tone (sentiment analysis). A good framework that handles this properly is one of the key factors for success.
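To make the idea of a pipeline of enrichment stages concrete, here is a minimal sketch in plain Java. It is not Hydra's API: the class `PipelineSketch`, the methods `extractHeadline`, `guessLanguage`, `defaultStages` and `run`, and the field names `text`, `headline` and `language` are all hypothetical, and the language guess is a toy word-count heuristic rather than real language detection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: none of these names come from Hydra's API.
class PipelineSketch {

    // Tiny word lists for the toy language heuristic below.
    private static final Set<String> SWEDISH = new HashSet<>(Arrays.asList("och", "att", "det", "som"));
    private static final Set<String> ENGLISH = new HashSet<>(Arrays.asList("the", "and", "is", "of"));

    // Stage 1: copy the document and store the first line of "text" as "headline".
    static Map<String, String> extractHeadline(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        String firstLine = doc.getOrDefault("text", "").split("\n", 2)[0].trim();
        out.put("headline", firstLine);
        return out;
    }

    // Stage 2: count known function words and store "sv" or "en" as "language".
    // Ties (including no matches at all) fall back to "en".
    static Map<String, String> guessLanguage(Map<String, String> doc) {
        int sv = 0;
        int en = 0;
        String text = doc.getOrDefault("text", "").toLowerCase(Locale.ROOT);
        for (String token : text.split("[^\\p{L}]+")) {
            if (SWEDISH.contains(token)) sv++;
            if (ENGLISH.contains(token)) en++;
        }
        Map<String, String> out = new HashMap<>(doc);
        out.put("language", sv > en ? "sv" : "en");
        return out;
    }

    // The stages in the order they should run.
    static List<UnaryOperator<Map<String, String>>> defaultStages() {
        List<UnaryOperator<Map<String, String>>> stages = new ArrayList<>();
        stages.add(PipelineSketch::extractHeadline);
        stages.add(PipelineSketch::guessLanguage);
        return stages;
    }

    // Pass the document through every stage in turn and return the enriched copy.
    static Map<String, String> run(List<UnaryOperator<Map<String, String>>> stages,
                                   Map<String, String> doc) {
        Map<String, String> current = doc;
        for (UnaryOperator<Map<String, String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("text", "Hydra released\nThe framework is free and the code is open.");
        Map<String, String> enriched = run(defaultStages(), doc);
        System.out.println("headline: " + enriched.get("headline"));
        System.out.println("language: " + enriched.get("language"));
    }
}
```

Each stage returns an enriched copy instead of modifying its input, which keeps stages independent of one another. That is a choice made for this sketch, not a statement about how Hydra stages behave.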
Hydra is designed to work with any data. Whether you want to make a folder of unstructured PDF documents, plain text web pages, or highly structured XML data generated by your CMS searchable, Hydra will handle it all. We have also designed Hydra to feed documents to any search engine or database, proprietary or open source.
Hydra Technical Details
Hydra revolves around a central data repository where all documents are kept while being processed. The repository is implemented with an instance of MongoDB, a document-oriented database. Connected to this database are the processing nodes, each a standalone Java application that runs the processing stages the user has configured. The configuration is also stored in and read from MongoDB, so you can add processing nodes simply by starting the framework on a new machine. Hydra is designed to be:
- Scalable: the central repository as well as the number of worker nodes can scale horizontally with little to no performance loss.
- Distributed: any processing node can work on any document, and a single document may be processed on any number of physical machines.
- Fail-safe: if a processing node goes down, this will not affect the documents in the pipeline, which are persisted centrally, and any other node can simply and automatically pick up where the other left off.
- Robust: all stages run in separate JVMs. If, for instance, Tika crashes, only its own JVM goes down and is automatically restarted, without stopping the processing pipeline for less problematic documents.
- Easy to use/configure: stages can be run from your IDE during development, allowing testing against the actual data in the repository.
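
The fail-safe property described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: an in-memory map stands in for the MongoDB repository, the nodes run one after another instead of concurrently, and the names `RepositorySketch`, `nextFor`, `markDone`, `allDone` and `runNode` are hypothetical, not Hydra's API. The point it shows is that progress is recorded centrally, so a second node can continue where a stopped node left off.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: an in-memory stand-in for the central repository.
class RepositorySketch {

    // Document id -> names of the stages that have already processed it.
    private final Map<String, Set<String>> completed = new LinkedHashMap<>();

    synchronized void add(String docId) {
        completed.put(docId, new LinkedHashSet<>());
    }

    // Returns the id of a document the given stage has not processed yet, or null.
    synchronized String nextFor(String stage) {
        for (Map.Entry<String, Set<String>> entry : completed.entrySet()) {
            if (!entry.getValue().contains(stage)) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Records centrally that the stage has finished with the document.
    synchronized void markDone(String docId, String stage) {
        completed.get(docId).add(stage);
    }

    // True once every document has been processed by every listed stage.
    synchronized boolean allDone(List<String> stages) {
        for (Set<String> done : completed.values()) {
            if (!done.containsAll(stages)) {
                return false;
            }
        }
        return true;
    }

    // Simulates one processing node. The node performs at most `budget`
    // stage executions and then stops, as if the machine went down.
    // Returns the number of stage executions it completed.
    static int runNode(RepositorySketch repo, List<String> stages, int budget) {
        int steps = 0;
        for (String stage : stages) {
            String docId;
            while (steps < budget && (docId = repo.nextFor(stage)) != null) {
                // Real stage work would happen here before the result is recorded.
                repo.markDone(docId, stage);
                steps++;
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        RepositorySketch repo = new RepositorySketch();
        repo.add("doc-1");
        repo.add("doc-2");
        repo.add("doc-3");
        List<String> stages = Arrays.asList("headline", "language");

        int first = runNode(repo, stages, 2);                  // node A stops early
        int second = runNode(repo, stages, Integer.MAX_VALUE); // node B finishes the rest
        System.out.println("node A steps: " + first);
        System.out.println("node B steps: " + second);
        System.out.println("all done: " + repo.allDone(stages));
    }
}
```

With several nodes running at the same time, the fetch and the mark would have to be combined into one atomic claim so that two nodes do not process the same document for the same stage. The sketch leaves that out to stay short.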