What is InfoCrawler?

InfoCrawler is an open source knowledge management solution that lets you crawl, index, and query many types of documents drawn from a range of sources: intranets, public web sites, newsgroups, FTP sites, and local or remote file systems.

Main Features

Distributed architecture: InfoCrawler was designed from the ground up for a distributed architecture. It is a 100% Java service that can run permanently on one or more machines. Its components (the administration, the spider, and the indexing engine) communicate using XML and can be installed on different machines.

Intuitive administration: Through its own web-based administration interface, you can administer and monitor the different collections in a user-friendly manner. This simplicity and flexibility limit the total cost of ownership.

Optimized crawling: Thanks to its multi-threaded architecture, InfoCrawler can spider many collections in parallel and can run many threads per collection.
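
The idea of crawling several collections in parallel, with several threads per collection, can be illustrated with a minimal Java sketch. This is not InfoCrawler's actual code: the class name ParallelCrawlSketch, the methods crawlAll and fetch, and the example URLs are all hypothetical, and the sketch simply assumes one fixed-size thread pool per collection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of InfoCrawler.
public class ParallelCrawlSketch {

    // Stand-in for downloading and parsing one URL (assumption: no real I/O here).
    static String fetch(String url) {
        return "fetched:" + url;
    }

    // Starts one fixed-size thread pool per collection, so collections are
    // crawled in parallel and each collection has several worker threads.
    static Map<String, String> crawlAll(Map<String, List<String>> collections,
                                        int threadsPerCollection)
            throws InterruptedException {
        Map<String, String> results = new ConcurrentHashMap<>();
        List<ExecutorService> pools = new ArrayList<>();

        for (Map.Entry<String, List<String>> entry : collections.entrySet()) {
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerCollection);
            pools.add(pool);
            for (String url : entry.getValue()) {
                // Each URL becomes one task on its collection's pool.
                pool.execute(() -> results.put(url, fetch(url)));
            }
        }

        // Stop accepting new tasks, then wait for the queued ones to finish.
        for (ExecutorService pool : pools) {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<String>> collections = Map.of(
            "intranet", List.of("http://intranet.example/a", "http://intranet.example/b"),
            "public", List.of("http://www.example.com/"));
        Map<String, String> results = crawlAll(collections, 4);
        System.out.println(results.size() + " documents fetched");
    }
}
```

Giving each collection its own pool keeps a slow collection from starving the others, at the cost of more threads overall; a single shared pool would be the simpler alternative.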

Powerful indexing: Using the Lucene indexing engine, InfoCrawler can index various file types: HTML files, Microsoft Office documents, PDF, XML, and more.

Open technology: InfoCrawler does not use any proprietary technology. URLs are maintained in a MySQL database, the indexing engine is Lucene (an open source indexer), the web administration runs on Apache Tomcat and JSP, the communication between the administration and the spider uses XML, and the spider itself is 100% Java.

Flexible: Being compatible with standards like HTML, XML, JSP, Java, and JDBC, InfoCrawler can be integrated easily into large projects.

Unique features: InfoCrawler has some unique features, such as its JavaScript interpreter and its intelligent URL management.


Based on Standards

  • Java
  • Apache
  • MySQL
  • Windows/Linux/Unix
  • Lucene
  • JSP
  • XML
  • Firefox/Internet Explorer
  • English/French/Spanish