Distributed architecture: InfoCrawler was designed from the ground up for distributed operation. It is a 100% Java service that can run continuously on one or more machines. Its components (the administration interface, the spider, and the indexing engine) communicate using XML and can be installed on different machines.
Intuitive administration: Through its own web-based administration interface, you can administer and monitor the different collections in a user-friendly manner. This simplicity and flexibility limit the total cost of ownership.
Optimized crawling: Thanks to its multi-threaded architecture, InfoCrawler can spider many collections in parallel and can run many threads per collection.
Powerful indexing: InfoCrawler uses the Lucene indexing engine and can index various file types: HTML files, Microsoft Office documents, PDF, XML, and more.
Open technology: InfoCrawler does not use any proprietary technology. URLs are maintained in a MySQL database, the indexing engine is Lucene (an open source indexer), the web administration runs on Apache Tomcat and JSP, the communication between the administration and the spider uses XML, and the spider itself is 100% Java.
Flexible: Being compatible with standards such as HTML, XML, JSP, Java, and JDBC, InfoCrawler can be integrated easily into large projects.
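
To illustrate the "many collections in parallel, many threads per collection" idea from the crawling feature above, here is a minimal Java sketch. It is not InfoCrawler's actual code: the class ParallelCrawlSketch, the methods fetch and crawlAll, and the threadsPerCollection parameter are all hypothetical names chosen for this example, and fetch is a stand-in that performs no network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of InfoCrawler.
final class ParallelCrawlSketch {

    // Stand-in for downloading and parsing a page (assumption: no real I/O here).
    static String fetch(String url) {
        return "content of " + url;
    }

    // Crawls all collections concurrently. Each collection gets its own
    // fixed-size thread pool, so every collection can use several threads.
    // Returns the number of URLs fetched per collection name.
    static Map<String, Integer> crawlAll(Map<String, List<String>> collections,
                                         int threadsPerCollection) {
        Map<String, Integer> fetchedCounts = new ConcurrentHashMap<>();
        List<ExecutorService> pools = new ArrayList<>();

        for (Map.Entry<String, List<String>> entry : collections.entrySet()) {
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerCollection);
            pools.add(pool);
            for (String url : entry.getValue()) {
                pool.submit(() -> {
                    fetch(url);
                    // merge() on ConcurrentHashMap is atomic, so counts stay correct.
                    fetchedCounts.merge(entry.getKey(), 1, Integer::sum);
                });
            }
        }

        // Stop accepting new work and wait for every pool to finish.
        for (ExecutorService pool : pools) {
            pool.shutdown();
            try {
                pool.awaitTermination(1, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return fetchedCounts;
    }
}
```

Calling ParallelCrawlSketch.crawlAll with a map of collection names to URL lists and a thread count of 2 would give each collection two worker threads. Using one pool per collection keeps a slow collection from starving the others, at the cost of more threads overall; a single shared pool would be a reasonable alternative.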