Free Republic
Browse · Search
News/Activism
Topics · Post Article

New Search Technology Created for Free Republic
FR Exclusive | 2/23/2005 | Dave Taylor

Posted on 02/23/2005 12:16:01 PM PST by Technocrat

I have created a new search technology that will make it much easier to find related articles and avoid duplicates. It is NOT keyword-based - instead, it uses sense and context to determine what an article is actually about, both in major and minor themes. I had about 500 FR articles lying around, so I shoved them into the indexer to see how well it would do, and the results are noteworthy :)

Some notes: everything is in lower case, to speed indexing. No content is actually stored in the database except the title, URL, and a 250-character snippet of internal text. Instead, semantic compression reduces the entire contents of the article and first three replies to a 150-dimensional vector in domainspace. As a result, comparisons are blindingly fast (although initial indexing takes about half a second on my creaky old 1999 box). Also, to respect the robots policy, I can't automatically retrieve text from FR, so for now you have to copy and paste it from the articles you want to match.
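
As a rough illustration of why comparing fixed-length fingerprints is cheap, here is a minimal sketch in Python. It is not the engine's actual code: the Euclidean distance, the function names `match_cost` and `nearest`, and the random stand-in fingerprints are all assumptions. Only the 150-dimension figure comes from the post.

```python
import math
import random

DIMENSIONS = 150  # vector length stated in the post


def match_cost(a, b):
    """Euclidean distance between two fingerprints (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, index):
    """Return (cost, title) pairs sorted from closest to farthest."""
    return sorted((match_cost(query, vec), title) for title, vec in index.items())


# Hypothetical stand-in fingerprints; a real indexer would derive them from text.
rng = random.Random(0)
index = {
    f"article {i}": [rng.random() for _ in range(DIMENSIONS)] for i in range(5)
}

# Resubmitting an already-indexed article unchanged gives a cost of zero.
query = list(index["article 3"])
results = nearest(query, index)
```

Each comparison touches only 150 numbers regardless of article length, so a linear scan over a few hundred articles stays fast even on old hardware.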

When you run a new article into the engine for comparison, its domain fingerprint is taken and stored for future comparisons. You get to see what that fingerprint is, and then you get three tiers of results: Primary (full) matches, Secondary (partial) matches, and Tertiary (peripheral) matches. Primary matches will contain duplicates, articles about the same thing from different sources, and occasionally different articles about very similar things. For fun, you can press the back button and delete or change some text, submit it again, and see how much change it takes for your original submission to drop out of the Primary Match tier. (Don't change the URL, so it doesn't get inserted into the DB again.) Secondary matches contain closely related articles (although you may see a few of these in the Tertiary matches, especially the first few entries). The number in parentheses is the match cost: how far the given reference is from the one you submitted in domain space.
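
The tiering described above could be sketched as a set of cost cutoffs. This is a guess at the shape of the logic, not the engine's real code: the only figure that appears in the thread is that a match cost under 10 counts as a duplicate, and the other cutoffs and the names `tier` and `group_by_tier` are hypothetical.

```python
# Cutoffs below are hypothetical, except DUPLICATE_MAX: the results page quoted
# in the thread says a match cost under 10 is treated as a duplicate.
DUPLICATE_MAX = 10
PRIMARY_MAX = 25     # assumed
SECONDARY_MAX = 50   # assumed
TERTIARY_MAX = 100   # assumed


def tier(cost):
    """Map a match cost to a tier name, or None if it is too far away."""
    if cost < 0:
        raise ValueError("match cost cannot be negative")
    if cost < PRIMARY_MAX:
        return "primary"
    if cost < SECONDARY_MAX:
        return "secondary"
    if cost < TERTIARY_MAX:
        return "tertiary"
    return None


def group_by_tier(scored):
    """Group (title, cost) pairs by tier, closest first, flagging duplicates."""
    groups = {"primary": [], "secondary": [], "tertiary": []}
    for title, cost in sorted(scored, key=lambda item: item[1]):
        name = tier(cost)
        if name is not None:
            groups[name].append((title, cost, cost < DUPLICATE_MAX))
    return groups


groups = group_by_tier([("c", 40), ("a", 0), ("e", 500), ("b", 12), ("d", 80)])
```

Editing the submitted text and resubmitting, as suggested above, amounts to watching the original article's cost climb until it crosses the first cutoff.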

Tertiary matches let you go on a random walk through domain space. Sure, there's some definite relation to what you posted, but you will definitely be venturing afield in much of the linked article. If you want to see a really good example that works well against the current database, post the text from to the search engine.

You can try it by going here (or better yet, open a new browser window and point it at http://www.neurogy.com/sense/compare.html, so you can copy and paste articles from this side to see similar articles that are already in my database).

Since this is something completely different, please post requests for format and capability, and I'll see what I can do.

Jim and John, once you see this, feel free to use it forever - I didn't have enough cash to contribute in the last fundraiser :( You can look at my previous donation history to get contact information if you want. I can provide you with simple CGI interface calls to make this part of the posting process, to show posters potential duplicates before they post. If you want to index a significant portion of the recent articles, let me know and I'll make it easier.


TOPICS: Your Opinion/Questions
KEYWORDS: donotduplicate; fr; search
To: Technocrat
"where did member X post about subject Y?"

I've often thought it would be useful to see a comprehensive listing of posts between members X and X1 (e.g. is this the freeper who told me to get lost a few months ago?). ;O)

21 posted on 02/23/2005 12:58:58 PM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 11 | View Replies]

To: Technocrat
Very cool. BTT.
22 posted on 02/23/2005 12:59:46 PM PST by Billthedrill
[ Post Reply | Private Reply | To 1 | View Replies]

To: elfman2
The set of topics and assertions made at FR could be considered as a semantic space. You can "subdivide" (bad word) any semantic space into domains, and reference a position in semantic space as a vector made up of measures of membership in each of those domains. If you have two different assertion trees that get you to the same spot in your defined space, just add another dimension until the problem goes away. All you really need to do is make sure that three things are true:


23 posted on 02/23/2005 1:02:15 PM PST by Technocrat
[ Post Reply | Private Reply | To 18 | View Replies]

To: Technocrat
"150-dimensional vector in domainspace"

Does that mean that 150 string matches or match failures are recorded as a fingerprint?

24 posted on 02/23/2005 1:03:35 PM PST by elfman2
[ Post Reply | Private Reply | To 18 | View Replies]

To: Technocrat

nice donation


25 posted on 02/23/2005 1:06:58 PM PST by stainlessbanner (Let's all pray for HenryLee II)
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat

That’s beyond me. I’d have to invest some time to really follow it. Thanks though.


26 posted on 02/23/2005 1:07:46 PM PST by elfman2
[ Post Reply | Private Reply | To 23 | View Replies]

To: elfman2

No - this is much more complicated than simple string matching. A domain is composed of concepts, concepts are activated by assertions, and assertions are probabilistically activated by words, phrases, and other things. What I did for this search engine was to factor a generalized model into a "best fit" situation for this particular site, based on some semantic analysis I did on the 500 articles I grabbed (although it looks like we are up to 508 now). That way, I could run it in realistic time and not lose too much of the advantage of the general model.
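
To make the layering concrete, here is a toy sketch of words activating assertions, assertions activating concepts, and concepts rolling up into domains. Everything in it is an assumption made for illustration: the lookup tables, the probabilities, the two-domain space, and the use of a noisy-OR rule to combine evidence. The post does not say how the real model combines activations.

```python
# Hypothetical toy tables; the real model and its weights are not described.
WORD_TO_ASSERTION = {
    "election": {"a_vote_took_place": 0.8},
    "ballot": {"a_vote_took_place": 0.6},
    "troops": {"military_action": 0.7},
}
ASSERTION_TO_CONCEPT = {
    "a_vote_took_place": "elections",
    "military_action": "war",
}
CONCEPT_TO_DOMAIN = {"elections": "politics", "war": "foreign_policy"}
DOMAINS = ["politics", "foreign_policy"]  # one vector dimension per domain


def noisy_or(probabilities):
    """Probability that at least one independent piece of evidence fires."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining


def fingerprint(text):
    """Return one membership score per domain for the given text."""
    evidence = {}
    for word in text.lower().split():  # lower-cased, as the post describes
        for assertion, p in WORD_TO_ASSERTION.get(word, {}).items():
            evidence.setdefault(assertion, []).append(p)

    per_domain = {domain: [] for domain in DOMAINS}
    for assertion, probs in evidence.items():
        domain = CONCEPT_TO_DOMAIN[ASSERTION_TO_CONCEPT[assertion]]
        per_domain[domain].append(noisy_or(probs))
    return [noisy_or(per_domain[domain]) for domain in DOMAINS]


vector = fingerprint("Election ballot groundhog")
```

A word with no entry in the table contributes nothing to any dimension, which is one way a term could end up domain-neutral.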


27 posted on 02/23/2005 1:08:20 PM PST by Technocrat
[ Post Reply | Private Reply | To 24 | View Replies]

To: stainlessbanner

Thanks! It was more fun than it should have been :)


28 posted on 02/23/2005 1:11:35 PM PST by Technocrat
[ Post Reply | Private Reply | To 25 | View Replies]

To: Technocrat

I think the admin would like to know about this. Send him an email and he might add the link to it.


29 posted on 02/23/2005 1:13:30 PM PST by Wiz
[ Post Reply | Private Reply | To 1 | View Replies]

To: Wiz

Before I do that, let me add some of the features that people have requested here and a few more ideas I have had and then we'll go whole hog on it.


30 posted on 02/23/2005 1:15:02 PM PST by Technocrat
[ Post Reply | Private Reply | To 29 | View Replies]

To: Technocrat

Hasn't this already been posted? LOL


31 posted on 02/23/2005 1:23:47 PM PST by fish hawk
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat
Also, the more text you post, the better your results will be

Certain of our FRiends around here seem to agree. I tend to think they're wrong, though. {grin}

32 posted on 02/23/2005 2:05:51 PM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 19 | View Replies]

To: Nick Danger

ping


33 posted on 02/23/2005 2:08:41 PM PST by Interesting Times (ABCNNBCBS -- yesterday's news.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: fish hawk
Hasn't this already been posted? LOL

Don't laugh. In fact, it was already posted here.

34 posted on 02/23/2005 2:18:24 PM PST by newgeezer (When encryption is outlawed, rwei qtjske ud alsx zkjwejruc.)
[ Post Reply | Private Reply | To 31 | View Replies]

To: Technocrat

Thank you!


35 posted on 02/23/2005 3:24:22 PM PST by Alia
[ Post Reply | Private Reply | To 1 | View Replies]

To: Alia; newgeezer; fish hawk; Interesting Times; Nick Danger; Wiz; stainlessbanner; elfman2; ...

All requested bug fixes and feature upgrades are now complete, except the ability to index replies to articles (needs tighter integration from John Robinson to make that a reality). Let's break it again!


36 posted on 02/24/2005 8:32:47 AM PST by Technocrat
[ Post Reply | Private Reply | To 35 | View Replies]

To: Technocrat
Errr, why are the results related to groundhogs or Groundhog Day? I put in two current articles:

http://www.freerepublic.com/focus/f-news/1350015/posts
http://www.freerepublic.com/focus/f-news/1350121/posts

...and both of them returned this kind of thing:

Primary Matches: (strong match on most subjects, match cost under 10 is duplicate) (Groundhog Day? Here's another article on Punxsutawney Phil.)
Secondary Matches: (significant shared subject matter) (Groundhog Day? Here's an article on rodent-centered holidays in India.)
Tertiary Matches: (somewhat related on one or two points) (Groundhog Day? Here's some articles on weather, and some groundhog recipes.)

It's nice, but I'm not sure the world needs a search engine devoted to groundhogs ;)

37 posted on 02/24/2005 8:46:56 AM PST by general_re ("Frantic orthodoxy is never rooted in faith, but in doubt." - Reinhold Niebuhr)
[ Post Reply | Private Reply | To 36 | View Replies]

To: Technocrat
All requested bug fixes and feature upgrades are now complete

I ran this...

Article Title: The War Against World War IV
Article URL: http://www.commentarymagazine.com/article.asp?aid=11902025_1
Article Text The War Against World War IV Norman Podhoretz A Second-Term Retreat? Will George W. Bush spend the next few years backing down from the ambitious strategy he outlined in the Bush Doctrine for fighting and winning World War IV? To be sure, Bush himself still calls it the "war on terrorism," and has shied away from giving the name World War IV to the great conflict into which we were plunged by 9/11. (World War III, in this accounting, was the cold war.) Yet he has never hesitated to compare the fight against radical Islamism, and the forces nurturing and arming it, with those earlier struggles against Nazism and Communism. Nor has he flinched from suggesting that achieving victory as the Bush Doctrine defines it may take as long as it took to win World War III (which lasted more than four decades—from the promulgation of the Truman Doctrine in 1947 until the fall of the Berlin Wall in 1989). Even more than the Truman Doctrine in its time, the Bush Doctrine was subjected to a ferocious assault by domestic opponents from the moment it was enunciated. Then, when Bush actually started acting on it, the ferocity grew even more intense, finally reaching record levels of vituperation during the presidential campaign. But in defiance of everything that was being thrown at him, and in spite of setbacks in Iraq that posed a serious threat to his reelection, Bush never yielded an inch. Instead of scurrying for protective cover from the assault, he stood out in the open and countered by reaffirming his belief in the soundness of the doctrine as well as his firm intention to stick with it in the years ahead. 
Thus, over and over again he said that he would stay the course in Iraq; that he would go on working for the spread of liberty throughout the greater Middle East (and democratic reform as a condition for the establishment of a Palestinian state); that he would continue reserving the right to take preemptive military action against what in his best judgment were gathering dangers to the security of this country; and that he would if necessary do so unilaterally. Why then, given that he was reelected on this pledge, should a question now be raised about whether he will keep it? And why—more strangely still—should the answer most often be that he is indeed about to renege? Because, comes the response, whether he likes it or not, and whether he intends to or not, he will simply have no other choice. Either his resolve will be sapped by the knowledge that he lacks the necessary political support to push any further ahead with the Bush Doctrine; or he will be prevented by a certain "law" of democratic politics governing Presidents who win a second term; or he will (as Irving Kristol famously said of liberals who turned into neoconservatives) be mugged by reality. War and Moral Values The notion that the Bush Doctrine lacks solid political backing derives from the widely publicized National Election Pool (NEP) exit poll. According to this poll, more voters (22 percent of the sample) were motivated primarily by a concern with moral values than by anything else, and it was among these voters that Bush did best against his Democratic opponent John F. Kerry; and while he also won overwhelmingly among the smaller group (19 percent) who were mainly worried about terrorism, he lost by a correspondingly large margin with the still smaller proportion (15 percent) who chose Iraq as their paramount concern. Not surprisingly, the President’s liberal opponents have interpreted this poll to mean that the election did not constitute a ratification of the Bush Doctrine. 
This is why they have been only too happy to second the claim pressed by spokesmen for various groups on the religious Right that Bush won because of the "faith factor" and the mobilization of the faithful around "family issues, including marriage [and] life." As it happens, a few commentators associated with the religious Right are themselves opposed to the Bush Doctrine, which gives them, too, an incentive for minimizing its role in the President’s victory. But even those religious conservatives who support the Bush Doctrine have inadvertently played into the hands of his antagonists, both domestic and foreign. That is, by claiming the lion’s share of credit for November 2, they have made it a little easier for the antiwar forces to deny that the election held on that day was a referendum on the Bush Doctrine, and that it has the wind of a solid majority of the American people behind it. Yet for all its intensity, this entire debate over the relative importance of moral values and the Bush Doctrine may stem from a complete misreading of the polls. For it is not in the least self-evident that the vague category of moral values was taken by the people who participated in the NEP survey merely as embracing abortion and gay marriage alone. On the contrary: in all probability they understood it more broadly to mean the traditionalist culture in general.
... and it yielded a result that I don't begin to understand (I didn't see the duplicate article). However, when I submitted the same thing a second time, I saw the duplicate article (only 1) with a '(0)' at the top of the Primary matches. I suppose it's possible I simply didn't notice it the first time. Perhaps if you'll give me another example article, I can run the original source through your tool.

Regardless, if your example WW IV article actually appears twice on FR, shouldn't it have appeared twice in the results?

38 posted on 02/24/2005 8:48:56 AM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 36 | View Replies]

To: Alia; newgeezer; fish hawk; Interesting Times; Nick Danger; Wiz; stainlessbanner; elfman2; ...

All requested bug fixes and feature upgrades are now complete, except the ability to index replies to articles (needs tighter integration from John Robinson to make that a reality). Let's break it again!


39 posted on 02/24/2005 8:58:06 AM PST by Technocrat
[ Post Reply | Private Reply | To 35 | View Replies]

To: general_re

That's just an example of the relative relevance, and you'll see that every time you do a search. Actually, the word "groundhog" is domain-neutral for this site, so it wouldn't help in a search anyway :)


40 posted on 02/24/2005 8:59:52 AM PST by Technocrat
[ Post Reply | Private Reply | To 37 | View Replies]


Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson