Full Text Search Server for Java: lightweight, embeddable, powered by iBoxDB.
    $ cd FTServer
    $ mvn package cargo:run
Input a Full URL to index the Page, then search.
[Word1 Word2 Word3] => text has Word1 and Word2 and Word3
["Word1 Word2 Word3"] => text has "Word1 Word2 Word3" as a whole
Search [https http] => get almost all pages
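
The two query forms above can be illustrated with a small, self-contained matcher. This is only a sketch of the semantics, not FTServer's indexing code: the class `QueryDemo` and its methods are hypothetical names, and it uses plain case-insensitive substring matching instead of an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the query semantics; not part of FTServer.
class QueryDemo {

    // Splits a query into terms: bare words become single terms,
    // a double-quoted run becomes one phrase term.
    static List<String> parse(String query) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : query.toCharArray()) {
            if (c == '"') {
                // A quote ends the current term and toggles phrase mode.
                flush(current, terms);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, terms);
            } else {
                current.append(c);
            }
        }
        flush(current, terms);
        return terms;
    }

    // Moves the buffered characters into the term list, lower-cased.
    private static void flush(StringBuilder current, List<String> terms) {
        String term = current.toString().trim();
        if (!term.isEmpty()) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        current.setLength(0);
    }

    // True when the text contains every term (AND semantics).
    static boolean matches(String text, String query) {
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String term : parse(query)) {
            if (!haystack.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "iBoxDB powers a lightweight full text search server";
        // Both words occur somewhere in the text, in any order.
        System.out.println(matches(text, "search lightweight"));
        // The quoted form must occur as one contiguous phrase.
        System.out.println(matches(text, "\"search lightweight\""));
    }
}
```

In this sketch the unquoted query matches because both words appear in the text, while the quoted query does not, because the words are not adjacent in that order.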
Results are ordered by the id() value of class PageText, in descending order.
A Page has many PageTexts. If you don't need multiple texts, modify Html.getDefaultTexts(Page) to return only one PageText.
The Page.GetRandomContent() method keeps the content shown on the search page changing; it does not affect the real PageText order.
Use the ID number to control the order instead of loading all pages into memory. Alternatively, load the top 100 pages into memory and re-order them by your own preference.
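
The second option, re-ordering a small top slice in memory, might look like the sketch below. The `Hit` class and its `favor` score are hypothetical and exist only for this example; FTServer's own result type is not shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: re-rank a small, ID-descending result slice in memory.
class ReorderDemo {

    // Minimal stand-in for a search result; `favor` is an assumed preference score.
    static final class Hit {
        final long id;
        final int favor;

        Hit(long id, int favor) {
            this.id = id;
            this.favor = favor;
        }
    }

    // Takes at most `limit` hits from the front of an ID-descending list and
    // sorts them by favor (highest first), breaking ties by ID (highest first).
    static List<Hit> reorderTop(List<Hit> hitsByIdDesc, int limit) {
        int end = Math.min(limit, hitsByIdDesc.size());
        List<Hit> top = new ArrayList<>(hitsByIdDesc.subList(0, end));
        top.sort(Comparator.comparingInt((Hit h) -> h.favor).reversed()
                .thenComparing(Comparator.comparingLong((Hit h) -> h.id).reversed()));
        return top;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(new Hit(9, 1), new Hit(8, 5), new Hit(7, 5), new Hit(6, 9));
        for (Hit h : reorderTop(hits, 3)) {
            System.out.println(h.id + " favor=" + h.favor);
        }
    }
}
```

Only the slice that was read is re-ordered, so the cost stays bounded by `limit` rather than by the size of the index.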
search (... String keywords, long startId, long count)
startId => the ID (assigned when the PageText was created) to start from; use startId = Long.MAX_VALUE to read from the top, in descending order
count => the number of records to read. This is the important parameter: search speed depends on it, not on how big the data is.
To read the next page, set startId to the last ID in the search results minus one.
    startId = search("keywords", startId, count);
    nextpage_startId = startId - 1 // this 'minus one' is already done inside search()
    ...
    // read next page
    search("keywords", nextpage_startId, count)
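
The paging pattern above can be tried out with an in-memory stand-in. Everything in this sketch is an assumption made for illustration: `PagingDemo` is not FTServer's class, its `search` takes an extra output list, the `TreeMap` replaces the real index, and returning -1 when no further page exists is a convention of this example only. As the comment above describes, the returned value already has the "minus one" applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the index, used to show ID-based paging.
class PagingDemo {
    private final TreeMap<Long, String> textsById = new TreeMap<>();

    void add(long id, String text) {
        textsById.put(id, text);
    }

    // Appends up to `count` matching IDs at or below startId to `out`, highest ID first.
    // Returns the start ID for the next page (last ID minus one),
    // or -1 when fewer than `count` matches were found (assumed end marker).
    long search(List<Long> out, String keyword, long startId, long count) {
        long lastId = -1;
        long found = 0;
        for (Map.Entry<Long, String> e
                : textsById.headMap(startId, true).descendingMap().entrySet()) {
            if (found == count) {
                break;
            }
            if (e.getValue().contains(keyword)) {
                out.add(e.getKey());
                lastId = e.getKey();
                found++;
            }
        }
        return found < count ? -1 : lastId - 1;
    }

    // Reads every page by feeding each returned start ID into the next call.
    static List<List<Long>> allPages(PagingDemo index, String keyword, long count) {
        List<List<Long>> pages = new ArrayList<>();
        long startId = Long.MAX_VALUE; // read from the top
        while (startId >= 0) {
            List<Long> page = new ArrayList<>();
            startId = index.search(page, keyword, startId, count);
            if (!page.isEmpty()) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        PagingDemo index = new PagingDemo();
        index.add(1, "java one");
        index.add(2, "java two");
        index.add(3, "other");
        index.add(4, "java four");
        index.add(5, "java five");
        System.out.println(allPages(index, "java", 2));
    }
}
```

Each call resumes from a known ID instead of skipping over earlier results, which is why a page deep in the result set costs about the same as the first one in this sketch.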
    public Page Html.get(String url);

Set your private website text:

    Page page = new Page();
    page.url = url;
    page.title = title;
    page.text = replace(doc.body().text());
    page... = ...
    return page;
    $ cat /proc/sys/fs/file-max
    803882
    $ ulimit -a | grep files
    open files (-n) 500000
    $ ulimit -Hn
    500000
    $ ulimit -Sn
    500000
    $ vi /etc/security/limits.conf
    * hard nofile 500000
    root hard nofile 500000
    root soft nofile 500000
    * soft nofile 500000
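
After changing the limits and logging in again, it can be useful to confirm what the JVM itself sees. The sketch below is one possible check, not something FTServer ships: `FdLimitCheck` is a hypothetical class, and it relies on the JDK-specific `com.sun.management.UnixOperatingSystemMXBean`, which is only available on Unix-like platforms with a HotSpot-style JDK.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: reports the open-file limit as seen by the running JVM.
class FdLimitCheck {

    // Returns the maximum number of file descriptors for this process,
    // or -1 when the platform bean does not expose that value.
    static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        long max = maxFileDescriptors();
        if (max < 0) {
            System.out.println("File descriptor limit is not available on this platform");
        } else {
            System.out.println("Max open files for this JVM: " + max);
        }
    }
}
```

If the printed value is lower than the configured `nofile` limit, the server process was probably started from a session that had not yet picked up the new settings.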