ClusType

Publication

Note

"./result" folder contains results on a sample of 50k Yelp reviews.

Requirements

We use Ubuntu as an example.

$ sudo apt-get install python
$ sudo apt-get install python-pip
$ sudo pip install numpy
$ sudo pip install scipy
$ sudo pip install scikit-learn
$ sudo pip install textblob
$ sudo python -m textblob.download_corpora
$ sudo pip install lxml
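After installation, a quick check can confirm that each required package imports cleanly (a minimal sketch; it assumes `python` on your PATH is the interpreter the packages were installed into):

```shell
# Report each required package as OK or MISSING.
for pkg in numpy scipy sklearn textblob lxml; do
    python -c "import $pkg" 2>/dev/null \
        && echo "$pkg OK" \
        || echo "$pkg MISSING"
done
```

Any line reading MISSING means the corresponding `pip install` step above needs to be rerun.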

Default Run

$ ./run.sh  

File path setting - run.sh

We use the Yelp dataset as an example.

Input: dataset folder. Two datasets are provided: a sample Yelp review dataset (yelp) and an NYT news dataset (nyt).

DataPath='data/yelp'

Input: raw text file path.

RawText='data/yelp/yelp_sample50k.txt'

Input: type mapping file path.

TypeFile='data/yelp/type_tid.txt'

Input: the Freebase-to-DBpedia mapping file. Download it and place it under the "entity_linking/" directory:

'entity_linking/freebase_links.nt'

Input: stopword list.

StopwordFile='data/stopwords.txt'
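With all input paths set, a quick existence check before running can catch a missing download early (the paths are the Yelp defaults listed above):

```shell
# Report any input file that is not yet in place.
for f in data/yelp/yelp_sample50k.txt data/yelp/type_tid.txt \
         entity_linking/freebase_links.nt data/stopwords.txt; do
    [ -f "$f" ] || echo "missing: $f"
done
```

No output means all four inputs are present and run.sh can be launched.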

Output: output file from candidate generation.

SegmentOutFile='result/segment.txt'

Output: entity linking output file.

NOTE: Our entity linking module calls the DBpedia Spotlight Web service, which has limited querying speed. This process can be greatly accelerated by installing the tool on your local machine.

SeedFile='result/seed.txt'

Output: data statistics on graph construction.

DataStatsFile='result/data_model_stats.txt'

Output: Typed entity mentions.

ResultFile='result/results.txt'

Output: Typed mentions annotated in the segmented text.

ResultFileInText='result/resultsInText.txt'

Parameters - run.sh

Threshold on significance score for candidate generation.

significance="1"

Switch on capitalization feature for candidate generation.

capitalize="1"

Maximal phrase length for candidate generation.

maxLength='4'

Minimal support of phrases for candidate generation.

minSup='10'

Number of relation phrase clusters.

NumRelationPhraseClusters='50'