Brown Hierarchical Word Clustering Model Card for Urdu and English

Model Description

This implementation of the Brown hierarchical word clustering algorithm groups words into clusters based on their distributional similarity in text. The algorithm builds a binary tree of word clusters in which words that appear in similar contexts are grouped together. This version has been applied to both Urdu and English text data.
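
To make the binary-tree description concrete: a leaf of such a tree can be identified by a bit string that records the left/right choices on the way down from the root, so cutting every bit string to the same prefix length gives a coarser clustering. The sketch below illustrates this with a handful of made-up words and paths. The `TOY_PATHS` table and the `group_by_prefix` helper are hypothetical and are not output of, or part of, the implementation.

```python
from collections import defaultdict

# Hypothetical bit-string paths (root-to-leaf branch choices).
# These are invented for illustration, not real clustering output.
TOY_PATHS = {
    "monday": "0010",
    "tuesday": "0011",
    "london": "0100",
    "lahore": "0101",
    "run": "110",
    "walk": "111",
}


def group_by_prefix(paths, prefix_len):
    """Group words whose paths share the first `prefix_len` bits."""
    groups = defaultdict(list)
    for word, path in paths.items():
        groups[path[:prefix_len]].append(word)
    return {prefix: sorted(words) for prefix, words in groups.items()}


if __name__ == "__main__":
    # Shorter prefixes merge more words into the same group.
    for depth in (1, 2, 3):
        print(depth, group_by_prefix(TOY_PATHS, depth))
```

With a prefix length of 1 the toy vocabulary falls into two groups; with a prefix length of 2 the day names and the city names separate. Choosing the prefix length is a way to trade granularity for vocabulary size.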

Model Details

- Developed by: Percy Liang (original implementation)
- Model type: Unsupervised hierarchical word clustering algorithm
- Languages: Urdu and English
- Version: 1.3
- Last updated: 2012-07-24
- License: Free for research and education purposes with attribution

Intended Uses

- Creating word classes for Urdu and English language models
- Reducing vocabulary size in multilingual NLP applications
- Discovering semantic relationships between words in Urdu and English
- Feature engineering for downstream NLP tasks in these languages
- Cross-lingual applications and research

How to Use

Compile the code:

    make

Cluster words from Urdu or English text:

    ./wcluster --text your_urdu_or_english_text.txt --c 50

The output is written to your_urdu_or_english_text-c50-p1.out/paths.

To visualize the clusters:

    ./cluster-viewer/build-viewer.sh your_urdu_or_english_text-c50-p1.out/paths
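
For downstream use, the paths file can also be read programmatically. The following is a minimal sketch, assuming each line of the file holds three whitespace-separated fields (a cluster bit string, a word, and the word's count). That layout is an assumption to verify against your own output, and `read_paths` and `top_words` are hypothetical helper names, not part of the tool.

```python
from collections import defaultdict


def read_paths(lines):
    """Parse lines of the assumed form '<bit string> <word> <count>'."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        path, word, count = fields
        clusters[path].append((word, int(count)))
    return clusters


def top_words(clusters, k=5):
    """Return up to k words per cluster, most frequent first."""
    return {
        path: [w for w, _ in sorted(members, key=lambda m: (-m[1], m[0]))[:k]]
        for path, members in clusters.items()
    }


# Made-up sample lines standing in for the contents of a paths file.
SAMPLE = [
    "0010\tmonday\t12",
    "0010\ttuesday\t9",
    "0111\tلاہور\t7",
]
print(top_words(read_paths(SAMPLE)))
```

To run this on real output, pass an open file handle instead of `SAMPLE`, for example `read_paths(open(paths_file, encoding="utf-8"))`, where `paths_file` is the path printed by the clustering step. Reading as UTF-8 matters for Urdu text.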

Training Data

This is an algorithm implementation that has been applied to both Urdu and English text. Users can provide their own text data in either language for clustering.
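
When preparing your own text, some light preprocessing is often useful before clustering. The sketch below is one possible approach, assuming the input should be plain text with whitespace-separated tokens. It applies Unicode NFC normalization, optionally lowercases (which only affects cased scripts such as Latin), and splits off common English and Urdu punctuation as separate tokens. The `preprocess_line` function and the punctuation set are illustrative choices, not requirements of the implementation.

```python
import re
import unicodedata

# Punctuation to split off as separate tokens: ASCII marks plus the
# Urdu full stop (U+06D4), Arabic comma (U+060C) and question mark (U+061F).
PUNCT = re.compile(r"([.,!?;:()\u06D4\u060C\u061F])")


def preprocess_line(line, lowercase=True):
    """Return a normalized, whitespace-tokenized version of one line."""
    text = unicodedata.normalize("NFC", line)
    if lowercase:
        text = text.lower()  # no effect on Urdu, which has no letter case
    text = PUNCT.sub(r" \1 ", text)  # surround punctuation with spaces
    return " ".join(text.split())    # collapse runs of whitespace


if __name__ == "__main__":
    print(preprocess_line("Hello, World!"))
    print(preprocess_line("یہ کتاب ہے۔"))
```

Writing the preprocessed lines to a UTF-8 file and passing that file to `--text` keeps the clustering step itself unchanged. Whether to lowercase English text is a judgment call that depends on the downstream task.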

Performance and Limitations

- Time complexity: O(N*C²), where N is the number of word types and C is the number of clusters
- Works best with enough text to estimate word distributions reliably
- No built-in support for multi-word expressions
- Limited to distributional similarity (does not capture all semantic relationships)
- May require language-specific preprocessing for optimal results with Urdu text

References

- Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467-479.
- Liang, P. (2005). Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.

Citation

If you use this implementation in your research, please cite:

@misc{liang2012brown,
  author = {Percy Liang and Sajjad Rasool},
  title = {Brown Hierarchical Word Clustering Algorithm for Urdu and English},
  year = {2012},
  howpublished = {\url{https://github.com/percyliang/brown-cluster}}
}