Upload 7 files
- CITATION.cff +22 -0
- DEPLOY.md +115 -0
- README.md +73 -0
- app.py +414 -0
- packages.txt +4 -0
- requirements.txt +10 -0
- sample.md +90 -0
CITATION.cff
ADDED
@@ -0,0 +1,22 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "User"
    given-names: "Hugging Face"
title: "SigmaTriple: Knowledge Graph Extraction from Markdown"
version: 1.0.0
date-released: 2025-04-15
url: "https://huggingface.co/spaces/[your-username]/SigmaTriple"
repository-code: "https://huggingface.co/spaces/[your-username]/SigmaTriple"
license: "MIT"
references:
  - type: software
    authors:
      - family-names: "SciPhi"
    title: "Triplex"
    url: "https://huggingface.co/sciphi/triplex"
  - type: software
    authors:
      - family-names: "vllm-project"
    title: "vllm"
    url: "https://github.com/vllm-project/vllm"
DEPLOY.md
ADDED
@@ -0,0 +1,115 @@
# Deploying SigmaTriple to Hugging Face Spaces

This guide will help you deploy the SigmaTriple application to Hugging Face Spaces.

## Prerequisites

1. A Hugging Face account (sign up at [huggingface.co](https://huggingface.co/join))
2. Git installed on your local machine
3. The Hugging Face CLI (optional, for command-line deployment)

## Deployment Steps

### Option 1: Using the Hugging Face Web Interface

1. **Create a New Space**:
   - Go to [huggingface.co/spaces](https://huggingface.co/spaces)
   - Click "Create new Space"
   - Enter a name for your Space (e.g., "SigmaTriple")
   - Select "Streamlit" as the SDK
   - Choose "Public" or "Private" visibility
   - Select "T4" as the hardware (a GPU is recommended for this application)
   - Click "Create Space"

2. **Upload Files**:
   - You can either upload the files directly through the web interface
   - Or clone the Space repository and push the files using Git (recommended)

3. **Git Deployment**:
   ```bash
   # Clone your new Space repository
   git clone https://huggingface.co/spaces/YOUR_USERNAME/SigmaTriple

   # Copy all files from this project to the cloned repository
   cp -r * /path/to/cloned/repo/
   cp -r .streamlit /path/to/cloned/repo/
   cp .gitignore /path/to/cloned/repo/

   # Navigate to the cloned repository
   cd /path/to/cloned/repo

   # Stage all files
   git add .

   # Commit the changes
   git commit -m "Initial commit of SigmaTriple application"

   # Push to Hugging Face Spaces
   git push
   ```

4. **Wait for Deployment**:
   - Hugging Face will automatically build and deploy your Space
   - This may take a few minutes, especially for the first deployment
   - You can monitor the build process in the "Settings" tab of your Space

### Option 2: Using the Hugging Face CLI

1. **Install the Hugging Face CLI**:
   ```bash
   pip install huggingface_hub
   ```

2. **Log in to Hugging Face**:
   ```bash
   huggingface-cli login
   ```

3. **Create a New Space**:
   ```bash
   huggingface-cli repo create SigmaTriple --type space --space_sdk streamlit
   ```

4. **Clone and Push**:
   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/SigmaTriple
   cp -r * /path/to/cloned/repo/
   cp -r .streamlit /path/to/cloned/repo/
   cp .gitignore /path/to/cloned/repo/
   cd /path/to/cloned/repo
   git add .
   git commit -m "Initial commit of SigmaTriple application"
   git push
   ```

## Configuration Options

You can customize your Space by modifying the following files:

- `.streamlit/config.toml`: Streamlit configuration
- `README.md`: Documentation and Space description
- `requirements.txt`: Python dependencies
- `packages.txt`: System dependencies

## Troubleshooting

If you encounter any issues during deployment:

1. **Check the Build Logs**:
   - Go to the "Settings" tab of your Space
   - Look for any error messages in the build logs

2. **Common Issues**:
   - **Memory Errors**: The model requires significant memory, so make sure you're using a GPU instance.
   - **Dependency Issues**: Check that all required packages are listed in `requirements.txt` and `packages.txt`.
   - **Timeout Errors**: Initial model loading can take a while, and Hugging Face Spaces has a build timeout of 10 minutes.

3. **Reduce Model Size**:
   - If you're experiencing memory issues, you can modify `app.py` to use a smaller model or apply model-loading optimizations, as sketched below.
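A minimal sketch of one such optimization is 8-bit quantized loading, which `app.py` already attempts via bitsandbytes (note that 8-bit loading generally requires a CUDA-capable GPU):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the Triplex weights in 8-bit to reduce memory use.
# Note: bitsandbytes 8-bit loading generally requires a CUDA GPU.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "sciphi/triplex",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
```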
## Accessing Your Space

Once deployed, your Space will be available at:
`https://huggingface.co/spaces/YOUR_USERNAME/SigmaTriple`

You can share this URL with others to let them use your application.
README.md
ADDED
@@ -0,0 +1,73 @@
---
title: SigmaTriple
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.32.0"
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# SigmaTriple: Knowledge Graph Extraction from Markdown

This Hugging Face Space provides a Streamlit interface for extracting knowledge graphs from markdown text using the [SciPhi/Triplex](https://huggingface.co/sciphi/triplex) model.

## Features

- **Extract Knowledge Graphs**: Automatically identify entities and relationships in markdown text
- **Customizable Entity Types and Predicates**: Define the types of entities and relationships you want to extract
- **Batch Processing**: Process large markdown files efficiently using vllm
- **Interactive Visualization**: View the extracted knowledge graph as an interactive network diagram
- **File Upload Support**: Upload markdown files directly or input text manually

## How It Works

1. The application uses the SciPhi/Triplex model, which is fine-tuned for knowledge graph extraction
2. Markdown input is converted to plain text
3. Large texts are processed in overlapping chunks so that context is maintained across chunk boundaries (see the sketch below)
4. The model identifies entities and relationships based on the specified entity types and predicates
5. Results are parsed and visualized as an interactive knowledge graph
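As an illustration of step 3, here is a minimal sketch of the overlapping-chunk split used in `app.py` (the defaults there are `chunk_size=500` characters and `overlap=50`):

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so a sentence or
    entity that straddles a chunk boundary still appears whole in some chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk is sent through the same extraction prompt, and the per-chunk outputs are joined together before parsing.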
## Usage

1. **Configure Entity Types and Predicates**:
   - In the sidebar, customize the entity types (e.g., PERSON, ORGANIZATION) and predicates (e.g., WORKS_AT, FOUNDED) you want to extract

2. **Input Text**:
   - Choose between direct text input and file upload
   - For text input, paste your markdown text into the provided area
   - For file upload, select a markdown (.md or .markdown) or plain-text (.txt) file

3. **Extract Knowledge Graph**:
   - Click the "Extract Knowledge Graph" button to process the text
   - View the raw model output, the extracted triplets table, and the interactive visualization

## Technical Details

- Uses the SciPhi/Triplex model for knowledge graph extraction
- Uses vllm for efficient batch processing when available
- Falls back to the standard transformers library if vllm is not available (see the import guard below)
- Visualizes knowledge graphs using NetworkX and PyVis
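The backend choice is made with a guarded import at the top of `app.py`, reproduced here for reference:

```python
# Prefer vllm when it is installed; otherwise fall back to plain transformers.
try:
    from vllm import LLM, SamplingParams
    VLLM_AVAILABLE = True
except ImportError:
    VLLM_AVAILABLE = False
```

At runtime the app also checks `torch.cuda.is_available()`, since vllm requires a GPU, before deciding which backend to use.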
## Example Use Cases

- **Research Papers**: Extract key concepts and relationships from academic papers
- **Documentation**: Create knowledge graphs from technical documentation
- **Content Analysis**: Identify key entities and relationships in articles or blog posts
- **Educational Content**: Visualize relationships between concepts in educational materials

## Limitations

- The quality of extraction depends on the clarity and structure of the input text
- Very large documents may require significant processing time
- The model may not capture all relationships, especially those requiring deep contextual understanding

## Credits

- [SciPhi/Triplex Model](https://huggingface.co/sciphi/triplex)
- [vllm](https://github.com/vllm-project/vllm) for efficient batch processing
- [Streamlit](https://streamlit.io/) for the web interface
- [NetworkX](https://networkx.org/) and [PyVis](https://pyvis.readthedocs.io/) for graph visualization
app.py
ADDED
@@ -0,0 +1,414 @@
import streamlit as st
import streamlit.components.v1  # makes st.components.v1.html available below
import json
import torch
import os
import tempfile
import networkx as nx
from pyvis.network import Network
import markdown
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Try to import vllm, but don't fail if it's not available
try:
    from vllm import LLM, SamplingParams
    VLLM_AVAILABLE = True
except ImportError:
    VLLM_AVAILABLE = False

# Set page configuration
st.set_page_config(
    page_title="SigmaTriple - Knowledge Graph Extractor",
    page_icon="🔍",
    layout="wide"
)

# Cache the model loading to avoid reloading on each interaction
@st.cache_resource
def load_model():
    with st.spinner("Loading model... This may take several minutes on CPU."):
        # Check if a GPU is available
        gpu_available = torch.cuda.is_available()
        st.info(f"GPU available: {gpu_available}")

        # Try to use vllm if a GPU is available and vllm is installed
        if gpu_available and VLLM_AVAILABLE:
            try:
                # Try to use vllm for faster inference (GPU only)
                model = LLM(
                    model="sciphi/triplex",
                    trust_remote_code=True,
                    tensor_parallel_size=1,  # Adjust based on available GPUs
                )
                tokenizer = AutoTokenizer.from_pretrained("sciphi/triplex", trust_remote_code=True)
                st.success("Successfully loaded model with vllm")
                return model, tokenizer, True  # True indicates vllm is used
            except Exception as e:
                st.warning(f"Failed to load model with vllm: {e}. Falling back to standard transformers.")
        else:
            if not VLLM_AVAILABLE:
                st.warning("vllm is not available. Using standard transformers.")
            elif not gpu_available:
                st.warning("No GPU available. vllm requires a GPU. Using standard transformers.")

        # Fall back to standard transformers (works on both CPU and GPU)
        device = "cuda" if gpu_available else "cpu"
        st.info(f"Loading model on {device}. Note: This model is large and may be very slow on CPU.")

        # Load with standard transformers - try 8-bit quantization on CPU to improve performance
        if device == "cpu":
            try:
                st.info("Attempting to load 8-bit quantized model for better CPU performance...")
                from transformers import BitsAndBytesConfig
                quantization_config = BitsAndBytesConfig(load_in_8bit=True)
                model = AutoModelForCausalLM.from_pretrained(
                    "sciphi/triplex",
                    trust_remote_code=True,
                    device_map=None,
                    quantization_config=quantization_config
                )
            except Exception as e:
                # 8-bit loading typically fails without CUDA; fall back to full precision
                st.warning(f"Failed to load 8-bit model: {e}. Using standard model.")
                model = AutoModelForCausalLM.from_pretrained(
                    "sciphi/triplex",
                    trust_remote_code=True,
                    device_map=None
                )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                "sciphi/triplex",
                trust_remote_code=True,
                device_map=None
            )

        # Move the model to the appropriate device (quantized models are
        # already placed by the loader and cannot be moved with .to())
        if not getattr(model, "is_quantized", False):
            model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained("sciphi/triplex", trust_remote_code=True)
        return model, tokenizer, False  # False indicates standard transformers is used

def triplextract(model, tokenizer, text, entity_types, predicates, use_vllm=True):
    """Run the Triplex extraction prompt on a single piece of text."""
    input_format = """Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.

**Entity Types:**
{entity_types}

**Predicates:**
{predicates}

**Text:**
{text}
"""

    message = input_format.format(
        entity_types=json.dumps({"entity_types": entity_types}),
        predicates=json.dumps({"predicates": predicates}),
        text=text)

    start_time = time.time()

    if use_vllm and VLLM_AVAILABLE:
        # Use vllm for inference
        sampling_params = SamplingParams(
            temperature=0.0,
            max_tokens=2048,
        )
        outputs = model.generate([message], sampling_params)
        output = outputs[0].outputs[0].text
    else:
        # Use standard transformers
        messages = [{'role': 'user', 'content': message}]
        device = next(model.parameters()).device  # Get the device the model is on
        input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)
        # Note: max_length counts prompt tokens plus generated tokens
        output = tokenizer.decode(model.generate(input_ids=input_ids, max_length=2048)[0], skip_special_tokens=True)

    processing_time = time.time() - start_time
    st.info(f"Processing time: {processing_time:.2f} seconds")

    return output

def batch_process_markdown(model, tokenizer, markdown_text, entity_types, predicates, use_vllm=True, chunk_size=500, overlap=50, sample_mode=False):
    """Process large markdown text in batches"""
    # Convert markdown to plain text
    html = markdown.markdown(markdown_text)
    from bs4 import BeautifulSoup
    text = BeautifulSoup(html, features="html.parser").get_text()

    # In sample mode, just take the first 500 characters
    if sample_mode:
        st.info("⚡ Running in sample mode: processing only the first 500 characters for a quick demonstration.")
        text = text[:500]

    # Split the text into chunks with overlap
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)

    # If there are too many chunks, warn the user
    if len(chunks) > 10 and not sample_mode:
        st.warning(f"⚠️ Your text is very large ({len(chunks)} chunks). Processing may take a long time on CPU. Consider using sample mode for a quick demonstration.")

    # Process each chunk with a progress bar
    all_results = []
    progress_bar = st.progress(0)
    status_text = st.empty()
    time_estimate = st.empty()

    # Time the first chunk to estimate the total
    start_time = time.time()

    for i, chunk in enumerate(chunks):
        # Update progress
        progress = (i + 1) / len(chunks)
        progress_bar.progress(progress)
        status_text.text(f"Processing chunk {i+1}/{len(chunks)} ({int(progress*100)}%)")

        # Process the chunk, reporting any error without aborting the batch
        try:
            with st.spinner(f"Processing chunk {i+1}/{len(chunks)}..."):
                chunk_start_time = time.time()
                result = triplextract(model, tokenizer, chunk, entity_types, predicates, use_vllm)
                chunk_time = time.time() - chunk_start_time

                # After the first chunk, estimate the total time
                if i == 0:
                    estimated_total_time = chunk_time * len(chunks)
                    time_estimate.info(f"⏱️ Estimated total processing time: {estimated_total_time:.1f} seconds ({estimated_total_time/60:.1f} minutes)")

                all_results.append(result)

                # Show the time taken for this chunk
                st.success(f"✅ Chunk {i+1}/{len(chunks)} processed in {chunk_time:.1f} seconds")
        except Exception as e:
            st.error(f"Error processing chunk {i+1}: {e}")
            all_results.append(f"Error processing this chunk: {e}")

    # Show the total time taken
    total_time = time.time() - start_time
    st.info(f"Total processing time: {total_time:.1f} seconds ({total_time/60:.1f} minutes)")

    # Clear the progress indicators
    progress_bar.empty()
    status_text.empty()
    time_estimate.empty()

    # Combine the per-chunk results
    combined_result = "\n\n".join(all_results)
    return combined_result

def parse_triplets(output):
    """Parse the model output to extract triplets"""
    try:
        # Find the JSON part in the output
        start_idx = output.find('{')
        end_idx = output.rfind('}') + 1

        # rfind returns -1 when '}' is absent, making end_idx 0, so check
        # that a well-ordered JSON span was actually found
        if start_idx != -1 and end_idx > start_idx:
            json_str = output[start_idx:end_idx]
            data = json.loads(json_str)
            return data
        else:
            # If no JSON was found, try to parse the text format
            triplets = []
            lines = output.split('\n')
            for line in lines:
                if '->' in line and '<-' in line:
                    parts = line.split('->')
                    if len(parts) >= 2:
                        subject = parts[0].strip()
                        rest = parts[1].split('<-')
                        if len(rest) >= 2:
                            predicate = rest[0].strip()
                            object_ = rest[1].strip()
                            triplets.append({
                                "subject": subject,
                                "predicate": predicate,
                                "object": object_
                            })

            if triplets:
                return {"triplets": triplets}

            # If still no triplets were found, return an empty result
            return {"triplets": []}
    except Exception as e:
        st.error(f"Error parsing triplets: {e}")
        return {"triplets": []}

def visualize_knowledge_graph(triplets):
    """Create a network visualization of the knowledge graph"""
    G = nx.DiGraph()

    # Add nodes and edges
    for triplet in triplets:
        subject = triplet.get("subject", "")
        predicate = triplet.get("predicate", "")
        object_ = triplet.get("object", "")

        if subject and object_:
            G.add_node(subject)
            G.add_node(object_)
            G.add_edge(subject, object_, title=predicate, label=predicate)

    # Create the pyvis network
    net = Network(notebook=True, height="600px", width="100%", directed=True)

    # Add nodes
    for node in G.nodes():
        net.add_node(node, label=node, title=node)

    # Add edges
    for edge in G.edges(data=True):
        net.add_edge(edge[0], edge[1], title=edge[2].get('title', ''), label=edge[2].get('label', ''))

    # Write the graph to a temporary HTML file and return its path
    with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as tmp:
        net.save_graph(tmp.name)
        return tmp.name

def main():
    st.title("🔍 SigmaTriple - Knowledge Graph Extractor")
    st.markdown("""
    Extract knowledge graphs from markdown text using the SciPhi/Triplex model.
    """)

    # Load the model (the spinner is inside load_model)
    model, tokenizer, use_vllm = load_model()

    # Add a note about performance
    if not torch.cuda.is_available():
        st.warning("""
        ⚠️ You are running on CPU, which can be very slow for the SciPhi/Triplex model.
        Processing may take 10+ minutes for even small texts. Consider using sample mode for a quick demonstration.
        """)

        # Add a sample-mode checkbox
        sample_mode = st.checkbox("⚡ Use sample mode (process only the first 500 characters for a quick demonstration)", value=True)
    else:
        sample_mode = False

    # Sidebar for configuration
    st.sidebar.title("Configuration")

    # Entity types and predicates input
    st.sidebar.subheader("Entity Types")
    entity_types_default = ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "EVENT", "PRODUCT", "TECHNOLOGY"]
    entity_types_input = st.sidebar.text_area("Enter entity types (one per line)",
                                              "\n".join(entity_types_default),
                                              height=150)
    entity_types = [et.strip() for et in entity_types_input.split("\n") if et.strip()]

    st.sidebar.subheader("Predicates")
    predicates_default = ["WORKS_AT", "LOCATED_IN", "FOUNDED", "DEVELOPED", "USES", "RELATED_TO", "PART_OF", "CREATED", "MEMBER_OF"]
    predicates_input = st.sidebar.text_area("Enter predicates (one per line)",
                                            "\n".join(predicates_default),
                                            height=150)
    predicates = [p.strip() for p in predicates_input.split("\n") if p.strip()]

    # Option to use smaller chunks for better performance
    st.sidebar.subheader("Performance Settings")
    chunk_size = st.sidebar.slider("Chunk Size", 100, 1000, 500,
                                   help="Smaller chunks process faster but may miss context across chunks")

    # Input method selection
    input_method = st.radio("Select input method:", ["Text Input", "File Upload"])

    if input_method == "Text Input":
        markdown_text = st.text_area("Enter markdown text:", height=300)
        process_button = st.button("Extract Knowledge Graph")

        if process_button and markdown_text:
            with st.spinner("Processing text... This may take several minutes on CPU"):
                result = batch_process_markdown(model, tokenizer, markdown_text, entity_types, predicates, use_vllm, chunk_size=chunk_size, sample_mode=sample_mode)

                # Display the raw output in an expandable section
                with st.expander("Raw Model Output"):
                    st.text(result)

                # Parse and visualize the triplets
                parsed_data = parse_triplets(result)
                triplets = parsed_data.get("triplets", [])

                if triplets:
                    st.subheader(f"Extracted {len(triplets)} Knowledge Graph Triplets:")

                    # Display the triplets in a table
                    triplet_data = []
                    for t in triplets:
                        triplet_data.append({
                            "Subject": t.get("subject", ""),
                            "Predicate": t.get("predicate", ""),
                            "Object": t.get("object", "")
                        })

                    st.table(triplet_data)

                    # Visualize the knowledge graph
                    html_file = visualize_knowledge_graph(triplets)
                    st.subheader("Knowledge Graph Visualization:")
                    with open(html_file, 'r') as f:
                        st.components.v1.html(f.read(), height=600)
                    os.unlink(html_file)  # Clean up the temporary file
                else:
                    st.warning("No triplets were extracted from the text.")

    else:  # File Upload
        uploaded_file = st.file_uploader("Upload a markdown file", type=["md", "markdown", "txt"])

        if uploaded_file is not None:
            markdown_text = uploaded_file.read().decode("utf-8")
            st.subheader("File Preview:")
            with st.expander("Show file content"):
                st.markdown(markdown_text)

            process_button = st.button("Extract Knowledge Graph")

            if process_button:
                with st.spinner("Processing file... This may take several minutes on CPU"):
                    result = batch_process_markdown(model, tokenizer, markdown_text, entity_types, predicates, use_vllm, chunk_size=chunk_size, sample_mode=sample_mode)

                    # Display the raw output in an expandable section
                    with st.expander("Raw Model Output"):
                        st.text(result)

                    # Parse and visualize the triplets
                    parsed_data = parse_triplets(result)
                    triplets = parsed_data.get("triplets", [])

                    if triplets:
                        st.subheader(f"Extracted {len(triplets)} Knowledge Graph Triplets:")

                        # Display the triplets in a table
                        triplet_data = []
                        for t in triplets:
                            triplet_data.append({
                                "Subject": t.get("subject", ""),
                                "Predicate": t.get("predicate", ""),
                                "Object": t.get("object", "")
                            })

                        st.table(triplet_data)

                        # Visualize the knowledge graph
                        html_file = visualize_knowledge_graph(triplets)
                        st.subheader("Knowledge Graph Visualization:")
                        with open(html_file, 'r') as f:
                            st.components.v1.html(f.read(), height=600)
                        os.unlink(html_file)  # Clean up the temporary file
                    else:
                        st.warning("No triplets were extracted from the file.")

    # Add information about the model
    st.sidebar.markdown("---")
    st.sidebar.subheader("About")
    st.sidebar.info("""
    This app uses the SciPhi/Triplex model to extract knowledge graphs from text.

    The model performs Named Entity Recognition (NER) and extracts relationships between entities.

    Using vllm: {}
    """.format("Yes" if use_vllm else "No (using standard transformers)"))

if __name__ == "__main__":
    main()
packages.txt
ADDED
@@ -0,0 +1,4 @@
build-essential
python3-dev
libgraphviz-dev
pkg-config
requirements.txt
ADDED
@@ -0,0 +1,10 @@
streamlit==1.32.0
transformers==4.38.2
torch==2.1.2
accelerate==0.27.2
bitsandbytes==0.41.1
markdown==3.5.2
pydantic==2.5.2
networkx==3.2.1
pyvis==0.3.2
beautifulsoup4==4.12.2
sample.md
ADDED
@@ -0,0 +1,90 @@
# Artificial Intelligence and Machine Learning: A Brief Overview

## Introduction

Artificial Intelligence (AI) has become one of the most transformative technologies of the 21st century. Since its theoretical conception in the 1950s, AI has evolved from a scientific curiosity to a powerful tool that impacts nearly every industry.

## Key Organizations and Figures

### Research Organizations

**OpenAI** was founded in December 2015 by Elon Musk, Sam Altman, Greg Brockman, Ilya Sutskever, John Schulman, and Wojciech Zaremba. The organization is headquartered in San Francisco and has developed several groundbreaking AI models, including GPT-4.

**DeepMind** was founded in London in 2010 by Demis Hassabis, Shane Legg, and Mustafa Suleyman. It was later acquired by Google in 2014. DeepMind is known for developing AlphaGo, which defeated world champion Go player Lee Sedol in 2016.

**Meta AI** (formerly Facebook AI Research, or FAIR) was established in 2013 by Yann LeCun. The research lab focuses on advancing the field of artificial intelligence through open research for the benefit of all.

### Notable Researchers

**Geoffrey Hinton**, often referred to as the "Godfather of Deep Learning," has made significant contributions to neural networks. He worked at Google and is a professor at the University of Toronto.

**Yoshua Bengio** is a Canadian computer scientist known for his work on artificial neural networks and deep learning. He is a professor at the University of Montreal and the scientific director of Mila, Quebec's AI institute.

**Andrew Ng** co-founded Google Brain and was formerly Chief Scientist at Baidu. He is also the founder of deeplearning.ai and an adjunct professor at Stanford University.

## Major Developments and Timeline

- **1956**: The term "Artificial Intelligence" was coined at the Dartmouth Conference.
- **1997**: IBM's Deep Blue defeated world chess champion Garry Kasparov.
- **2011**: IBM Watson won the quiz show Jeopardy! against former champions.
- **2012**: AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet competition, marking a breakthrough in computer vision.
- **2014**: Google acquired DeepMind for $500 million.
- **2016**: AlphaGo defeated world champion Go player Lee Sedol.
- **2017**: AlphaZero, developed by DeepMind, mastered chess, shogi, and Go.
- **2018**: BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google.
- **2020**: OpenAI released GPT-3, one of the largest language models at the time.
- **2022**: ChatGPT was released by OpenAI, bringing conversational AI to the mainstream.
- **2023**: GPT-4 was released, further advancing the capabilities of large language models.

## Applications and Technologies

### Natural Language Processing (NLP)

Natural Language Processing has seen remarkable progress with models like BERT, GPT, and T5. These technologies power applications such as:

- Machine translation services like Google Translate
- Virtual assistants like Siri, Alexa, and Google Assistant
- Content generation tools like Jasper and Copy.ai
- Sentiment analysis for social media monitoring

### Computer Vision

Computer vision technologies enable machines to interpret and understand visual information from the world:

- Facial recognition systems used in security and smartphones
- Medical image analysis for disease detection
- Autonomous vehicles developed by companies like Tesla and Waymo
- Augmented reality applications in retail and gaming

### Reinforcement Learning

Reinforcement learning has been applied to solve complex problems:

- Game-playing AI like AlphaGo and MuZero
- Robotics control systems for industrial automation
- Resource management in data centers
- Personalized recommendation systems

## Ethical Considerations

The rapid advancement of AI has raised important ethical questions:

- **Bias and Fairness**: AI systems can perpetuate and amplify existing biases in data.
- **Privacy Concerns**: Facial recognition and surveillance technologies raise questions about privacy rights.
- **Job Displacement**: Automation may lead to significant changes in employment patterns.
- **Autonomous Weapons**: The development of lethal autonomous weapons systems raises moral and legal questions.
- **Alignment Problem**: Ensuring AI systems act in accordance with human values and intentions.

## Future Directions

Research is actively ongoing in several promising areas:

- **Multimodal AI**: Systems that can process and generate multiple types of data (text, images, audio).
- **AI Alignment**: Ensuring AI systems remain beneficial and aligned with human values.
- **Neuromorphic Computing**: Hardware designed to mimic the structure and function of the human brain.
- **Quantum Machine Learning**: Leveraging quantum computing to enhance machine learning capabilities.
- **Explainable AI**: Developing systems that can explain their decision-making processes.

## Conclusion

Artificial Intelligence continues to evolve at a rapid pace, with new breakthroughs and applications emerging regularly. As these technologies become more integrated into our daily lives, collaboration between researchers, policymakers, and the public will be crucial in ensuring that AI development proceeds in a way that is beneficial, ethical, and aligned with human values.