mbudisic committed on
Commit
ba065c1
·
1 Parent(s): e58777e

doc: Updated DEVELOPER.md to reflect current codebase structure

Files changed (1)
  1. docs/DEVELOPER.md +127 -49
docs/DEVELOPER.md CHANGED
@@ -4,86 +4,164 @@

  ```
  .
- ├── app.py # Main Chainlit application
- ├── app_simple_rag.py # Simplified RAG application
- ├── pyproject.toml # Project configuration and dependencies
- ├── pstuts_rag/ # Core package
- │   └── pstuts_rag/ # Source code
- │       ├── __init__.py
- │       ├── datastore.py # Vector database management
- │       ├── loader.py # Data loading utilities
- │       ├── rag.py # RAG implementation
- │       ├── agents.py # Team agent implementation
- │       └── ...
- ├── data/ # Dataset files
- └── README.md # User documentation
  ```

  ## 🧩 Dependency Structure

- Dependencies are organized into logical groups:

- - **Core**: Basic dependencies needed for the RAG system (includes Jupyter support)
- - **Dev**: Development tools (linting, testing, etc.)
- - **Web**: Dependencies for web server functionality
- - **Extras**: Additional optional packages (numpy, ragas, tavily)

- You can install different combinations using pip's extras syntax:
  ```bash
- pip install -e ".[dev,web]" # Install core + dev + web dependencies
  ```

- ## 🔧 Technical Details
-
- The application uses LangChain, LangGraph, and Chainlit to create an agentic RAG system:

  ### Key Components

- - **DatastoreManager**: Manages the Qdrant vector store and document retrieval
- - **RAGChainFactory**: Creates retrieval-augmented generation chains
- - **PsTutsTeamState**: Manages the state of the agent-based system
- - **Langgraph**: Implements the routing logic between different agents
-
- ## 🚀 Running Locally
-
- 1. Create a virtual environment (recommended):
  ```bash
- python -m venv venv
- source venv/bin/activate # On Windows: venv\Scripts\activate
  ```

- 2. Install dependencies:
  ```bash
- pip install -e ".[dev]" # Install with development tools
  ```

- 3. Set up API keys:
  ```bash
- export OPENAI_API_KEY="your-openai-key"
- export TAVILY_API_KEY="your-tavily-key" # Optional, for web search
  ```

- 4. Run the application:
  ```bash
- chainlit run app.py
  ```

- ## 🧪 Code Quality
-
- To check for dependency issues:
  ```bash
  deptry .
- ```

- For linting:
- ```bash
  black .
  ruff check .
  mypy .
  ```

  ## 📚 Resources

- - [Chainlit Documentation](https://docs.chainlit.io)
- - [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction)
- - [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)

  ```
  .
+ ├── app.py # Main Chainlit application (multi-agent RAG)
+ ├── app_simple_rag.py # Simplified single-agent RAG application
+ ├── Dockerfile # Docker container configuration
+ ├── pyproject.toml # Project configuration and dependencies
+ ├── requirements.txt # Basic requirements (for legacy compatibility)
+ ├── uv.lock # Lock file for uv package manager
+ ├── pstuts_rag/ # Package directory
+ │   ├── pstuts_rag/ # Source code
+ │   │   ├── __init__.py # Package initialization
+ │   │   ├── configuration.py # Application configuration settings
+ │   │   ├── datastore.py # Vector database and document management
+ │   │   ├── rag.py # RAG chain implementation and factories
+ │   │   ├── graph.py # LangGraph multi-agent implementation
+ │   │   ├── state.py # Team state management for agents
+ │   │   ├── prompts.py # System prompts for different agents
+ │   │   ├── evaluator_utils.py # RAG evaluation utilities
+ │   │   └── utils.py # General utilities
+ │   ├── setup.py # Package setup (legacy)
+ │   └── CERT_SUBMISSION.md # Certification submission documentation
+ ├── data/ # Dataset files (JSON format)
+ │   ├── train.json # Training dataset
+ │   ├── dev.json # Development dataset
+ │   ├── test.json # Test dataset
+ │   ├── kg_*.json # Knowledge graph datasets
+ │   ├── LICENSE.txt # Dataset license
+ │   └── README.md # Dataset documentation
+ ├── notebooks/ # Jupyter notebooks for development
+ │   ├── evaluate_rag.ipynb # RAG evaluation notebook
+ │   ├── transcript_rag.ipynb # Basic RAG experiments
+ │   ├── transcript_agents.ipynb # Multi-agent experiments
+ │   ├── Fine_Tuning_Embedding_for_PSTuts.ipynb # Embedding fine-tuning
+ │   └── */ # Fine-tuned model checkpoints
+ ├── docs/ # Documentation
+ │   ├── DEVELOPER.md # This file - developer documentation
+ │   ├── ANSWER.md # Technical answer documentation
+ │   ├── BLOGPOST*.md # Blog post drafts
+ │   ├── dataset_card.md # Dataset card documentation
+ │   ├── TODO.md # Development TODO list
+ │   └── chainlit.md # Chainlit welcome message
+ ├── scripts/ # Utility scripts (currently empty)
+ └── README.md # User-facing documentation
  ```

  ## 🧩 Dependency Structure

+ Dependencies are organized into logical groups in `pyproject.toml`:

+ ### Core Dependencies 🎯
+ All required dependencies for the RAG system including:
+ - **LangChain ecosystem**: `langchain`, `langchain-core`, `langchain-community`, `langchain-openai`, `langgraph`
+ - **Vector database**: `qdrant-client`, `langchain-qdrant`
+ - **ML/AI libraries**: `sentence-transformers`, `transformers`, `torch`
+ - **Web interface**: `chainlit==2.0.4`
+ - **Data processing**: `pandas`, `datasets`, `pyarrow`
+ - **Evaluation**: `ragas==0.2.15`
+ - **Jupyter support**: `ipykernel`, `jupyter`, `ipywidgets`
+ - **API integration**: `tavily-python` (web search), `requests`, `python-dotenv`

+ ### Optional Dependencies 🔧
+ - **dev**: Development tools (`pytest`, `black`, `mypy`, `deptry`, `ipdb`)
+ - **web**: Web server components (`fastapi`, `uvicorn`, `python-multipart`)
+
+ Installation examples:
  ```bash
+ pip install -e . # Core only
+ pip install -e ".[dev]" # Core + development tools
+ pip install -e ".[dev,web]" # Core + dev + web server
  ```

+ ## 🔧 Technical Architecture

  ### Key Components

+ #### 🏗️ Core Classes and Factories
+ - **`Configuration`** (`configuration.py`): Application settings including model names, file paths, and parameters
+ - **`DatastoreManager`** (`datastore.py`): Manages Qdrant vector store, document loading, and semantic chunking
+ - **`RAGChainFactory`** (`rag.py`): Creates retrieval-augmented generation chains with reference compilation
+ - **`RAGChainInstance`** (`rag.py`): Encapsulates complete RAG instances with embeddings and vector stores
+
+ #### 🕸️ Multi-Agent System
+ - **`PsTutsTeamState`** (`state.py`): TypedDict managing multi-agent conversation state
+ - **Agent creation functions** (`graph.py`): Factory functions for different agent types:
+   - `create_rag_node()`: Video search agent using RAG
+   - `create_tavily_node()`: Adobe Help web search agent
+   - `create_team_supervisor()`: LLM-based routing supervisor
+ - **LangGraph implementation**: Multi-agent coordination with state management
+
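The supervisor pattern described in the bullets above can be sketched without any framework. The following is a minimal illustration, not the code in `graph.py`: `TeamState`, its field names, the two agent stubs, and the fixed routing rule are all hypothetical stand-ins, and a plain loop takes the place of LangGraph and of the LLM-based routing decision.

```python
from typing import Callable, Dict, List, TypedDict


class TeamState(TypedDict):
    """Hypothetical stand-in for PsTutsTeamState; field names are assumptions."""
    messages: List[str]
    next: str


def video_search(state: TeamState) -> TeamState:
    # Placeholder for the RAG-backed video search agent.
    return {"messages": state["messages"] + ["[video] transcript answer"], "next": ""}


def web_search(state: TeamState) -> TeamState:
    # Placeholder for the Tavily-backed Adobe Help search agent.
    return {"messages": state["messages"] + ["[web] help page answer"], "next": ""}


def supervisor(state: TeamState) -> TeamState:
    # A real supervisor would ask an LLM which agent to call next.
    # This stub uses a fixed rule: video search, then web search, then stop.
    seen = " ".join(state["messages"])
    if "[video]" not in seen:
        nxt = "video_search"
    elif "[web]" not in seen:
        nxt = "web_search"
    else:
        nxt = "FINISH"
    return {"messages": state["messages"], "next": nxt}


AGENTS: Dict[str, Callable[[TeamState], TeamState]] = {
    "video_search": video_search,
    "web_search": web_search,
}


def run_team(question: str, max_steps: int = 10) -> TeamState:
    """Alternate between the supervisor and the agent it selects."""
    state: TeamState = {"messages": [question], "next": ""}
    for _ in range(max_steps):
        state = supervisor(state)
        if state["next"] == "FINISH":
            break
        state = AGENTS[state["next"]](state)
    return state
```

Calling `run_team("How do I crop a layer?")` walks through both stub agents once and stops when the supervisor returns `FINISH`. The `max_steps` cap is a design choice of this sketch to guard against a routing loop that never finishes.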
+ #### 📊 Document Processing
+ - **`VideoTranscriptBulkLoader`**: Loads entire video transcripts as single documents
+ - **`VideoTranscriptChunkLoader`**: Loads individual transcript segments with timestamps
+ - **`chunk_transcripts()`**: Async semantic chunking with timestamp preservation
+ - **Custom embedding models**: Fine-tuned embeddings for PsTuts domain
+
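To make "timestamp preservation" concrete, here is a small sketch of merging transcript segments into chunks that keep the start time of their first segment and the end time of their last. It is not the project's `chunk_transcripts()`: it is synchronous, it swaps semantic chunking for a simple character budget, and the `Segment` and `Chunk` types, `merge_segments()`, and `max_chars` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """One transcript segment with its time span in seconds (assumed layout)."""
    text: str
    start: float
    end: float


@dataclass
class Chunk:
    """Merged text plus the time span it covers."""
    text: str
    start: float
    end: float


def _to_chunk(segments: List[Segment]) -> Chunk:
    # The chunk spans from the first segment's start to the last segment's end.
    return Chunk(
        text=" ".join(s.text for s in segments),
        start=segments[0].start,
        end=segments[-1].end,
    )


def merge_segments(segments: Iterable[Segment], max_chars: int = 200) -> List[Chunk]:
    """Group consecutive segments until the character budget would be exceeded."""
    chunks: List[Chunk] = []
    current: List[Segment] = []
    size = 0
    for seg in segments:
        if current and size + len(seg.text) > max_chars:
            chunks.append(_to_chunk(current))
            current, size = [], 0
        current.append(seg)
        size += len(seg.text)
    if current:
        chunks.append(_to_chunk(current))
    return chunks
```

Because each chunk carries `start` and `end`, a retrieved chunk can be linked back to the matching position in the source video, whichever grouping rule is used.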
+ #### 🔍 Evaluation System
+ - **`evaluator_utils.py`**: RAG evaluation utilities using RAGAS framework
+ - **Notebook-based evaluation**: `evaluate_rag.ipynb` for systematic testing
+
+ ## 🚀 Running the Applications
+
+ ### Multi-Agent RAG (Recommended) 🤖
  ```bash
+ chainlit run app.py
  ```
+ Features a team of agents with video search and web search capabilities.

+ ### Simple RAG (Basic) 🔍
  ```bash
+ chainlit run app_simple_rag.py
  ```
+ Single-agent RAG system for straightforward queries.
+
+ ## 🔬 Development Workflow

+ 1. **Environment Setup**:
  ```bash
+ python -m venv venv
+ source venv/bin/activate # On Windows: venv\Scripts\activate
+ pip install -e ".[dev]"
  ```

+ 2. **Environment Variables**:
  ```bash
+ export OPENAI_API_KEY="your-openai-key"
+ export TAVILY_API_KEY="your-tavily-key" # Optional, for web search
  ```

+ 3. **Code Quality Tools**:
  ```bash
+ # Dependency analysis
  deptry .

+ # Code formatting and linting
  black .
  ruff check .
  mypy .
+
+ # Development debugging
+ ipdb # Available for interactive debugging
  ```

+ 4. **Notebook Development**:
+    - Use `notebooks/` for experimentation
+    - `evaluate_rag.ipynb` for systematic evaluation
+    - Fine-tuning experiments in `Fine_Tuning_Embedding_for_PSTuts.ipynb`
+
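A startup check for the environment variables in step 2 might look like the sketch below. It assumes `OPENAI_API_KEY` is mandatory (the guide only marks `TAVILY_API_KEY` as optional); `check_api_keys()` is a hypothetical helper, not a function from the codebase.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY",)   # assumed mandatory
OPTIONAL_KEYS = ("TAVILY_API_KEY",)   # web search is optional per the guide


def check_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Raise if a required key is unset; return the optional keys that are missing."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return [name for name in OPTIONAL_KEYS if not env.get(name)]
```

Passing the environment as a parameter keeps the check easy to exercise in tests without touching the real process environment.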
+ ## 🏗️ Architecture Notes
+
+ - **Embedding models**: Uses custom fine-tuned `snowflake-arctic-embed-s-ft-pstuts` by default
+ - **Vector store**: Qdrant with semantic chunking for optimal retrieval
+ - **LLM**: GPT-4.1-mini for generation and routing
+ - **Web search**: Tavily integration targeting `helpx.adobe.com`
+ - **State management**: LangGraph for multi-agent coordination
+ - **Evaluation**: RAGAS framework for retrieval and generation metrics
+
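The defaults listed above could be collected in a single settings object. The sketch below is illustrative only and does not reproduce the project's `Configuration` class: the class name `SketchConfig`, its field names, the lowercase model identifier, and the environment variable names used for overrides are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical settings holder; defaults mirror the notes above."""
    embedding_model: str = "snowflake-arctic-embed-s-ft-pstuts"
    llm_model: str = "gpt-4.1-mini"          # assumed identifier for GPT-4.1-mini
    search_domain: str = "helpx.adobe.com"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "SketchConfig":
        # The variable names below are made up for this example.
        defaults = cls()
        return cls(
            embedding_model=env.get("EMBEDDING_MODEL", defaults.embedding_model),
            llm_model=env.get("LLM_MODEL", defaults.llm_model),
            search_domain=env.get("SEARCH_DOMAIN", defaults.search_domain),
        )
```

A frozen dataclass keeps the settings immutable after startup, so agents that share one instance cannot change each other's model choices.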
  ## 📚 Resources

+ - [Chainlit Documentation](https://docs.chainlit.io) 📖
+ - [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction) 🦜
+ - [LangGraph Documentation](https://langchain-ai.github.io/langgraph/) 🕸️
+ - [Qdrant Documentation](https://qdrant.tech/documentation/) 🔍
+ - [RAGAS Documentation](https://docs.ragas.io/) 📊