Data - Tools
SemSim: Resources for Normalized Semantic Similarity Computation Using Lexical Networks (presented @ LREC 2012)
Excerpt of noun network
Network vocabulary: An alphabetically ordered list of 8752 nouns extracted from the SemCor3 corpus. The list comes as a single file of 8752 lines. Each line has two space-separated fields: (i) the lexical form of the noun, and (ii) a unique index. Download vocabulary (76K).
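As a minimal sketch of how the vocabulary file could be read, the following Python snippet builds noun-to-index and index-to-noun lookups from lines of the form "noun index". The function name `load_vocabulary` and the sample entries are hypothetical and used only for illustration; the real file may differ in details such as encoding or index base.

```python
import io

def load_vocabulary(lines):
    """Parse lines of the form '<noun> <index>' into two lookup tables."""
    noun_to_index = {}
    index_to_noun = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: left part is the noun, right part the index.
        noun, index = line.rsplit(maxsplit=1)
        noun_to_index[noun] = int(index)
        index_to_noun[int(index)] = noun
    return noun_to_index, index_to_noun

# Made-up sample standing in for the downloaded vocabulary file.
sample = io.StringIO("abbey 1\nability 2\nabsence 3\n")
noun_to_index, index_to_noun = load_vocabulary(sample)
print(noun_to_index["ability"], index_to_noun[3])
```

To read the actual file, pass an open file handle instead of the in-memory sample.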
SemSim corpus: Snippets of web documents for the network vocabulary. For each noun up to 1000 snippets were downloaded. The corpus is organized in sub-corpora, that is one file for each noun. Download corpus (819M).
Tools: The first tool, CParse, is a Perl script that parses a corpus and creates feature vectors. The second tool, CosSim, is fed with the feature vectors and computes similarities. Download tools (16K).
Similarities repository: This repository includes the pairwise similarities of the network's nouns. The similarities were estimated over the above corpus for several values of the context window H. For a given H value, the similarity scores for each noun are stored in a separate file, i.e., 8752 files are available. These files are named according to the corresponding noun indices. The similarity scores are represented as follows. Consider the file of similarities for a noun indexed as i, e.g., "i.sims". The j-th row of "i.sims" holds the similarity between the nouns indexed by i and j.
- Baseline context-based similarities. Similarity scores are available for the following values of contextual window size H:
- Baseline context-based similarities normalized according to local normalization (N-normalization) for N=100. Similarity scores are available for the following values of contextual window size H:
- Baseline context-based similarities normalized according to global normalization (Z-normalization). The statistics of similarities, i.e., mean and variance, were computed across the entire network. Similarity scores are available for the following values of contextual window size H:
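The file layout and the global normalization described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each row of an "i.sims" file is assumed to hold a single numeric score, Z-normalization is assumed to be the standard z-score (subtract the mean, divide by the standard deviation), and the function names and toy scores are hypothetical. Whether row numbers are 0-based or 1-based with respect to the vocabulary indices should be checked against the data.

```python
import math

def parse_sims(lines):
    """Return the scores of one 'i.sims' file in row order (row j -> similarity of i and j)."""
    # Assumption: one score per row; the last whitespace-separated token is taken as the score.
    return [float(line.split()[-1]) for line in lines if line.strip()]

def z_normalize(scores, mean, variance):
    """Standard z-score using statistics computed beforehand (e.g., across the entire network)."""
    std = math.sqrt(variance)
    return [(s - mean) / std for s in scores]

# Toy rows standing in for the contents of a single similarity file.
rows = ["0.10", "0.30", "0.50"]
scores = parse_sims(rows)

# In this toy example the "network" statistics are computed from the same three scores.
mean = sum(scores) / len(scores)
variance = sum((s - mean) ** 2 for s in scores) / len(scores)
z = z_normalize(scores, mean, variance)
print(z)
```

With the real data, the mean and variance would be accumulated over all 8752 files for a given H before normalizing any single file.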
Associative and Semantic Features Extracted From Web-Harvested Corpora (presented @ LREC 2012)
Download data (90K) including associative and semantic pairs, and the corresponding priming coefficients.
Semantic spaces are represented as similarity matrices. Currently, we provide semantic spaces for the nouns of the network vocabulary (8752 nouns in total, see above). Semantic similarities were computed according to co-occurrence-based similarity metrics (currently: Google-based Semantic Relatedness and Dice coefficient) using counts computed over the SemSim corpus (see above). The value of matrix element [i,j] is the semantic similarity score of the nouns with indices i and j. The mapping between nouns and indices can be found in the network vocabulary (see above). The similarity metrics are symmetric, i.e., [i,j] = [j,i]. Download: 8752x8752 matrix for the Google-based Semantic Relatedness metric (122MB in .tar.gz). Note: Approximately 1.4GB are required when decompressed. Download: 8752x8752 matrix for the Dice coefficient (83MB in .tar.gz). Note: Approximately 1.4GB are required when decompressed. A decimation (1000 randomly selected nouns) of the matrix based on Google-based Semantic Relatedness is available. Download: 1000x1000 matrix (1.8MB in .tar.gz). The noun indices included in the 1000x1000 matrix (with respect to the network vocabulary) can be found here.
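To illustrate how a symmetric co-occurrence-based similarity matrix of this kind can be built, here is a small Python sketch using the textbook Dice coefficient, 2 * c(i,j) / (c(i) + c(j)), where c(i) is the occurrence count of word i and c(i,j) the co-occurrence count of i and j. The counts below are made up, the function name `dice` is hypothetical, and how the counts were actually extracted from the SemSim corpus is not covered here.

```python
def dice(count_i, count_j, count_ij):
    """Dice coefficient from occurrence counts and a co-occurrence count."""
    denom = count_i + count_j
    return 2.0 * count_ij / denom if denom else 0.0

# Made-up occurrence counts for three words (indices 0, 1, 2).
occurrences = [10, 30, 20]
# Made-up co-occurrence counts for each unordered pair of words.
cooccurrences = {(0, 1): 5, (0, 2): 0, (1, 2): 10}

n = len(occurrences)
matrix = [[0.0] * n for _ in range(n)]  # diagonal left at 0.0 for brevity
for (i, j), c in cooccurrences.items():
    score = dice(occurrences[i], occurrences[j], c)
    # The metric is symmetric, so both [i][j] and [j][i] receive the same value.
    matrix[i][j] = matrix[j][i] = score

print(matrix[0][1], matrix[1][0])
```

Because of the symmetry, only the upper (or lower) triangle needs to be computed; the other half is a mirror copy.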
Semantic Spaces ++: More Similarity Metrics - More PoS
Semantic similarity matrices are available for four different dictionaries organized by part-of-speech: nouns, verbs, adjectives, and adverbs. Semantic similarity was computed according to two different types of metrics, as follows. (1) Three typical co-occurrence-based metrics: Dice coefficient, Mutual information, and Google-based semantic relatedness. (2) Two novel network-based similarity metrics utilizing semantic neighborhoods: (i) maximum similarity, and (ii) sum of squared similarities. Network metrics were computed using either Dice coefficient or Google-based semantic relatedness.
Download dictionaries (lists) including: nouns (4000 entries), verbs (1434 entries), adjectives (1427 entries), and adverbs (334 entries). Dictionaries (28K).
For each part-of-speech, three similarity matrices were computed using Dice coefficient, Mutual information, and Google-based semantic relatedness. Download: Similarity matrices (139M).
For each part-of-speech, two similarity matrices were computed using the following network similarity metrics, which are based on the notion of semantic neighborhoods of lexical networks: (i) maximum similarity, (ii) sum of squared similarities. Each network metric was computed (a) using Dice coefficient and Google-based semantic relatedness, and (b) for various numbers of neighbors (size of neighborhood): 30, 50, 100, 150. Download similarity matrices for
- Adverbs (4.1M)
- Verbs (79M)
- Adjectives (84M)
- Nouns (4000) (640M)
- Nouns (5884) (348M) Note for the matrix of 5884 nouns: only network-based similarities for 100 neighbors are provided. (Similarities for other neighborhood sizes, 50 and 150, will be available soon.) Also, the dictionary (list) of 5884 nouns is provided.
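The idea of neighborhood-based metrics can be sketched in Python as below. The exact definitions used for the released matrices are those of the associated publications; the formulas here are only one plausible reading and should be treated as assumptions: the neighborhood of a word is taken to be its n most similar words under a base metric, "maximum similarity" compares each word against the other word's neighbors and keeps the largest score, and "sum of squared similarities" takes the square root of the summed squared scores. All function names and the toy matrix are hypothetical.

```python
import math

def neighborhood(sim, i, n):
    """Indices of the n words most similar to word i (excluding i itself)."""
    others = [j for j in range(len(sim)) if j != i]
    return sorted(others, key=lambda j: sim[i][j], reverse=True)[:n]

def max_neighborhood_similarity(sim, i, j, n):
    """Assumed form: largest base similarity between one word and the other's neighbors."""
    a = max(sim[i][x] for x in neighborhood(sim, j, n))
    b = max(sim[j][y] for y in neighborhood(sim, i, n))
    return max(a, b)

def sum_squared_neighborhood_similarity(sim, i, j, n):
    """Assumed form: square root of the summed squared similarities to the other's neighbors."""
    a = sum(sim[i][x] ** 2 for x in neighborhood(sim, j, n))
    b = sum(sim[j][y] ** 2 for y in neighborhood(sim, i, n))
    return math.sqrt(a + b)

# Toy symmetric base-similarity matrix for four words (made-up values).
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.9],
    [0.2, 0.4, 0.9, 1.0],
]

print(neighborhood(sim, 0, 2))
print(max_neighborhood_similarity(sim, 0, 2, 2))
print(sum_squared_neighborhood_similarity(sim, 0, 2, 2))
```

In the released data the neighborhood size n takes the values 30, 50, 100, and 150, and the base metric is either the Dice coefficient or Google-based semantic relatedness.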