---
{}
---

# Model Card for Geo-BERT-multilingual

This model predicts the geolocation of short texts (fewer than 500 words) in the form of a two-dimensional distribution, also referred to as a Gaussian Mixture Model (GMM).

## Model Details

Number of predicted points: 5

Custom transformers pipeline and result visualization: https://github.com/K4TEL/geo-twitter/tree/predict

### Model Description

This project aimed to solve the tweet/user geolocation prediction task and to provide a flexible methodology for geotagging textual big data. The suggested approach implements BERT-based neural networks for NLP to estimate the location in the form of two-dimensional GMMs (longitude, latitude, weight, covariance). The base model has been fine-tuned on a Twitter dataset containing the text content and metadata context of the tweets.

- **Developed by:** Kateryna Lutsai
- **Model type:** regression
- **Language(s) (NLP):** multilingual
- **Finetuned from model:** bert-base-multilingual-cased

### Model Sources

- **Repository:** https://github.com/K4TEL/geo-twitter
- **Paper:** https://arxiv.org/pdf/2303.07865.pdf
- **Demo:** https://github.com/K4TEL/geo-twitter/blob/predict/prediction.ipynb

## Uses

Geotagging of big data

### Direct Use

Per-tweet geolocation prediction

### Out-of-Scope Use

Per-tweet geolocation prediction without "user" metadata is expected to show lower prediction accuracy.

## Bias, Risks, and Limitations

Risk of unethical use on the basis of data that is not publicly available.

The limit on text length is dictated by the BERT-based model's input capacity of 500 tokens (words).

### How to Get Started with the Model

Use the code below to get started with the model: https://github.com/K4TEL/geo-twitter/tree/predict

A short startup guide is given in the repository branch description.

## Training Details

### Training Data

The Twitter dataset contained tweets with their text content, metadata ("user" and "place") context, and geolocation coordinates.
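As described above, the model outputs a 5-component two-dimensional GMM rather than a single point. The sketch below shows one common way to turn such an output into a point estimate; the component values here are made up for illustration and the (longitude, latitude, weight, covariance) layout is an assumption about the output ordering, not the repository's actual API.

```python
import numpy as np

# Hypothetical GMM output for one text: 5 components, each
# (longitude, latitude, weight, covariance); values are illustrative only.
components = np.array([
    [ 14.42, 50.09, 0.55, 1.2],   # near Prague
    [ 16.37, 48.21, 0.20, 2.5],   # near Vienna
    [-73.99, 40.73, 0.15, 3.0],   # near New York
    [  2.35, 48.86, 0.07, 4.1],   # near Paris
    [ 37.62, 55.75, 0.03, 5.0],   # near Moscow
])

lon, lat, weight, cov = components.T

# Point estimate 1: the mean of the highest-weight component.
best = int(np.argmax(weight))
point = (lon[best], lat[best])
print(point)  # (14.42, 50.09)

# Point estimate 2: the weight-averaged mean of all components.
# Note that averaging can land far from every mode when the mixture
# is multimodal, which is why the GMM itself carries more information
# than any single point.
avg = (weight @ components[:, :2]) / weight.sum()
```

Taking the highest-weight mode is usually the safer point estimate for multimodal predictions, while the full mixture (weights and covariances) conveys the model's spatial uncertainty.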
### Training Procedure

Information about training the model on user-defined data can be found in the GitHub repository: https://github.com/K4TEL/geo-twitter

#### Training Hyperparameters

- **Learning rate start:** 1e-5
- **Learning rate end:** 1e-6
- **Learning rate scheduler:** cosine
- **Number of epochs:** 3
- **Batch size:** 10
- **Optimizer:** Adam
- **Intra-feature loss:** mean
- **Inter-feature loss:** mean
- **Neg log-likelihood domain:** positive
- **Features:** NON-GEO + GEO-ONLY

## Evaluation

All performance metrics and results are presented in the Results section of the article preprint: https://arxiv.org/pdf/2303.07865.pdf

### Testing Data, Factors & Metrics

#### Testing Data

Worldwide dataset of tweets with TEXT-ONLY and NON-GEO features

#### Metrics

Spatial metrics: mean and median Simple Accuracy Error (SAE), Acc@161

Probabilistic metrics: mean and median Cumulative Accuracy Error (CAE), mean and median Prediction Area Region (PRA) for the 95% density area, Coverage of PRA

### Results

**Tweet geolocation prediction task**

- TEXT-ONLY: mean SAE 1588 km, median SAE 50 km, 61% Acc@161
- NON-GEO: mean SAE 800 km, median SAE 25 km, 80% Acc@161

**User home geolocation prediction task**

- TEXT-ONLY: mean SAE 892 km, median SAE 31 km, 74% Acc@161
- NON-GEO: mean SAE 567 km, median SAE 26 km, 82% Acc@161

### Model Architecture and Objective

A wrapper layer of linear regression with a custom number of output variables operates on the classification token produced by the base BERT model.

#### Hardware

NVIDIA GeForce GTX 1080 Ti

#### Software

Python IDE

## Model Card Contact

lutsai.k@gmail.com
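As a worked illustration of the spatial metrics reported in the Evaluation section above: SAE is the distance error between the predicted and true coordinates (commonly computed as the great-circle/haversine distance), and Acc@161 is the fraction of predictions within 161 km (100 miles) of the true location. The coordinate pairs below are made up for illustration.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points in kilometres."""
    R = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Made-up (predicted, true) coordinate pairs as (lon, lat).
pairs = [
    ((14.42, 50.09), (14.40, 50.08)),   # ~2 km apart
    ((16.37, 48.21), (14.42, 50.09)),   # Vienna vs Prague, ~250 km
    ((2.35, 48.86),  (37.62, 55.75)),   # Paris vs Moscow, ~2480 km
]

errors = [haversine_km(*pred, *true) for pred, true in pairs]
mean_sae = sum(errors) / len(errors)
median_sae = sorted(errors)[len(errors) // 2]
acc_at_161 = sum(e <= 161 for e in errors) / len(errors)  # fraction within 161 km
```

This also shows why the card reports both mean and median SAE: a handful of continent-scale misses inflates the mean far above the median.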