Export Telegram channel information and use text-embedding-ada-002 to embed channel text - v1

Ultimate goal: build a knowledge base that belongs solely to me. This post covers only the first step toward that goal; it is merely a demo, and many more tasks lie ahead.

(First, let's take a look at the final result.)

Step One - Export Telegram Channel Information

Before you start, you need to ensure that Python 3 is installed. You will also need the following:

Telethon and PySocks libraries: You can install them using pip install telethon PySocks.
Make sure you are a member of the channel from which you want to retrieve messages.
A valid Telegram account, which you use to obtain the API ID and API hash for your Telegram application at https://my.telegram.org. (Keep your API ID and API hash secure and do not expose them in public repositories or settings.)
A proxy server (optional, if you are behind a firewall).
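
One way to follow the advice about not exposing credentials is to read them from environment variables instead of writing them into the script. The sketch below is only an illustration: the variable names TELEGRAM_API_ID and TELEGRAM_API_HASH and the helper load_telegram_credentials are assumptions, not something Telethon defines.

```python
import os


def load_telegram_credentials(env=None):
    """Return (api_id, api_hash) read from environment-style variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    api_id = int(env["TELEGRAM_API_ID"])  # Telegram issues the API ID as a number
    api_hash = env["TELEGRAM_API_HASH"]   # The API hash stays a string
    return api_id, api_hash
```

The returned values could then be passed to TelegramClient in place of the 'api_id' and 'api_hash' placeholders used in the script.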

Code Implementation - Export Channel Information

Save the following Python script as telegram_to_csv.py:

import csv
import socks
from telethon import TelegramClient
from telethon.tl.functions.messages import GetHistoryRequest

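# Set up TelegramClient and connect to the Telegram API.
# Replace the 'api_id' and 'api_hash' placeholders with the values from
# https://my.telegram.org (the API ID is a number, the API hash is a string).
client = TelegramClient(
    'demo',  # Session name; the login is stored in demo.session
    'api_id',
    'api_hash',
    proxy=(socks.SOCKS5, '127.0.0.1', 1080)  # Optional: remove this argument if you don't need a proxy
)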

async def export_to_csv(filename, fieldnames, data):
    """
    Export data to a CSV file.

    Parameters:
    filename -- Name of the export file
    fieldnames -- List of CSV header field names
    data -- List of dictionaries to export
    """
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

async def fetch_messages(channel_username):
    """
    Fetch all messages from the specified channel.

    Parameters:
    channel_username -- Username of the target channel
    """
    channel_entity = await client.get_input_entity(channel_username)
    offset_id = 0  # Initial message ID offset
    all_messages = []  # List to store all messages

    while True:
        # Request message history
        history = await client(GetHistoryRequest(
            peer=channel_entity,
            offset_id=offset_id,
            offset_date=None,
            add_offset=0,
            limit=100,  # Number of messages to request at a time
            max_id=0,
            min_id=0,
            hash=0
        ))
        if not history.messages:  # End loop when there are no more messages
            break

        for message in history.messages:
            if message.message:  # Only process messages with text content
                # Serialize message to dictionary form
                message_dict = {
                    'id': message.id,
                    'date': message.date.strftime('%Y-%m-%d %H:%M:%S'),
                    'text': message.message
                }
                all_messages.append(message_dict)
        offset_id = history.messages[-1].id
        print(f"Fetched messages: {len(all_messages)}")
    return all_messages

async def main():
    """
    Main program: Fetch messages from the specified channel and save to a CSV file.
    """
    await client.start()  # Start the Telegram client
    print("Client Created")

    channel_username = 'niracler_channel'  # Username of the Telegram channel you want to scrape
    all_messages = await fetch_messages(channel_username)  # Fetch messages

    # Define CSV file headers and export
    headers = ['id', 'date', 'text']
    await export_to_csv('channel_messages.csv', headers, all_messages)

# When this script is run as the main program
if __name__ == '__main__':
    client.loop.run_until_complete(main())
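
The loop in fetch_messages pages backwards through the channel: each request returns up to 100 messages older than offset_id, and the ID of the last message in a page becomes the next offset_id. The sketch below imitates that logic against an in-memory list so it can be run without a Telegram account. fetch_page and fetch_all are hypothetical stand-ins, not Telethon functions.

```python
def fetch_page(messages, offset_id, limit):
    """Return up to `limit` messages older than offset_id.

    `messages` is assumed to be sorted newest first; offset_id == 0 means
    "start from the newest message", mirroring the script above.
    """
    older = [m for m in messages if offset_id == 0 or m["id"] < offset_id]
    return older[:limit]


def fetch_all(messages, limit=100):
    """Collect every message by repeatedly asking for the next older page."""
    offset_id = 0
    collected = []
    while True:
        page = fetch_page(messages, offset_id, limit)
        if not page:  # An empty page means the history is exhausted
            break
        collected.extend(page)
        offset_id = page[-1]["id"]  # Continue from the oldest message seen so far
    return collected


# A fake channel with 250 messages, newest first
channel = [{"id": i} for i in range(250, 0, -1)]
print(len(fetch_all(channel)))
```

With 250 messages and a page size of 100, the loop makes three non-empty requests and stops on the fourth, which comes back empty.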

Run the Script telegram_to_csv.py

Run the script in the terminal:

python telegram_to_csv.py

The script fetches every text message from the specified Telegram channel and saves it to channel_messages.csv in the current directory. Each row holds the message ID, date, and content.

On the first run, Telethon asks for your phone number and a login code in the terminal, then stores the session in demo.session so later runs skip the login.
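
To confirm the export worked, the file can be read back with the standard csv module. This is a minimal sketch; summarize_csv is a hypothetical helper, and it counts CSV rows rather than text lines because a single message may contain line breaks.

```python
import csv


def summarize_csv(path):
    """Return the header names and the number of data rows in a CSV file."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        row_count = sum(1 for _ in reader)  # Multi-line messages still count as one row
        return reader.fieldnames, row_count
```

Calling summarize_csv('channel_messages.csv') after the export should report the id, date, and text columns along with the number of exported messages.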

(The results won't be posted here~~)

Step Two - Use OpenAI's text-embedding-ada-002 Model for Text Embedding

The openai and pandas libraries: You can install them using pip install openai pandas.
A valid OpenAI API key.

Code Implementation - Embedding

Save the following Python script as embedding_generator.py:

import pandas as pd
from openai import OpenAI

# Configure OpenAI client
client = OpenAI(api_key='YOUR_API_KEY')

def get_embedding(text, model="text-embedding-ada-002"):
    """
    Get the embedding vector for the text.
    """
    text = text.replace("\n", " ")  # Clean newline characters from text
    response = client.embeddings.create(input=[text], model=model)  # Request embedding vector
    return response.data[0].embedding  # Extract and return the embedding vector

def embedding_gen():
    """
    Generate embedding vectors for the exported channel messages.
    """
    df = pd.read_csv('channel_messages.csv')  # Read CSV file into DataFrame
    df['text_with_date'] = df['date'] + " " + df['text']  # Concatenate date and text
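    # One API request per row. Every row needs an embedding, otherwise the search script
    # fails when it parses the empty cells. For a cheap trial run, shrink the DataFrame
    # first (for example, df = df.head(100)) instead of slicing only this column.
    df['ada_embedding'] = df.text_with_date.apply(get_embedding)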

    del df['text_with_date']  # Delete 'text_with_date' column
    df.to_csv('embedded_1k_reviews.csv', index=False)  # Save results to a new CSV file
    
    # Print the first few rows of the DataFrame for confirmation
    print(df.head())

# When the script is run directly
if __name__ == "__main__":
    embedding_gen()
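
embedding_gen sends one request per message. The embeddings endpoint also accepts a list of inputs, so several messages could be sent per request. The following is a minimal sketch under that assumption: chunked and embed_in_batches are hypothetical helpers, the batch size of 100 is arbitrary, and client is expected to be an OpenAI client like the one created in the script.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(client, texts, model="text-embedding-ada-002", batch_size=100):
    """Return one embedding per input text, requesting them in batches."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        cleaned = [text.replace("\n", " ") for text in batch]  # Same cleanup as get_embedding
        response = client.embeddings.create(input=cleaned, model=model)
        # Each result carries the index of its input; sort so the order matches `batch`
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

The result could be assigned to the DataFrame with df['ada_embedding'] = embed_in_batches(client, df['text_with_date'].tolist()), which replaces the per-row apply call.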

Run the Script

python embedding_generator.py
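
Note that a CSV cell can only hold text, so each embedding list ends up stored as a string such as "[0.25, -0.5]" and has to be parsed back into a list before any arithmetic. The sketch below shows that round trip with the standard csv module. roundtrip_embedding is a hypothetical helper for illustration, and the parsing step is the same ast.literal_eval call that the search script in Step Three applies to the ada_embedding column.

```python
import ast
import csv
import io


def roundtrip_embedding(embedding):
    """Write a list into a CSV cell, read it back, and parse it into a list again."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([embedding])  # Non-string cells are written via str()
    buffer.seek(0)
    cell = next(csv.reader(buffer))[0]        # What comes back is plain text
    return cell, ast.literal_eval(cell)       # literal_eval turns the text into a list


cell, parsed = roundtrip_embedding([0.25, -0.5])
print(type(cell).__name__, parsed)
```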

Step Three - Perform Search

The pandas, numpy, and tabulate libraries: You can install them using pip install pandas numpy tabulate. The script also imports the openai library installed in Step Two, and it needs the same OpenAI API key.
The tabulate library is used to print the DataFrame in table format.

Code Implementation - Search

Save the following Python script as embedding_search.py:

import ast
import sys
import pandas as pd
import numpy as np
from tabulate import tabulate
from openai import OpenAI

# Configure OpenAI client
client = OpenAI(api_key='YOUR_API_KEY')

def get_embedding(text, model="text-embedding-ada-002"):
    """
    Get the embedding vector for the text.
    """
    text = text.replace("\n", " ")  # Clean newline characters from text
    response = client.embeddings.create(input=[text], model=model)  # Request embedding vector
    return response.data[0].embedding  # Extract and return the embedding vector

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def embedding_search(query, df, model="text-embedding-ada-002"):
    """
    Use OpenAI API to search for embedding vectors.
    """
    query_embedding = get_embedding(query, model=model)  # Get embedding vector for the query text
    df['similarity'] = df.ada_embedding.apply(lambda x: cosine_similarity(ast.literal_eval(x), query_embedding))  # Calculate similarity
    df = df.sort_values(by='similarity', ascending=False)  # Sort by similarity in descending order
    df = df.drop(columns=['ada_embedding'])  # Remove embedding vector column
    return df

if __name__ == "__main__":
    df = pd.read_csv('embedded_1k_reviews.csv')  # Read CSV file into DataFrame
    query = " ".join(sys.argv[1:])  # Join all arguments so an unquoted multi-word query is used in full
    df = embedding_search(query, df)  # Search for embedding vectors
    print(tabulate(df.head(10), headers='keys', tablefmt='psql'))  # Print the top 10 results
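
embedding_search parses and scores one row at a time. Once the embeddings are parsed, the same cosine similarity can be computed for all rows at once with a matrix product. This is a minimal sketch, assuming the embeddings have already been converted to a 2-D array; top_k_by_cosine is a hypothetical helper, not part of the script above.

```python
import numpy as np


def top_k_by_cosine(matrix, query, k=10):
    """Return (row indices, scores) of the k rows of `matrix` most similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)  # Shape: (number of messages, embedding size)
    query = np.asarray(query, dtype=float)    # Shape: (embedding size,)
    # Dot product of every row with the query, divided by the product of the norms
    scores = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)[:k]           # Indices of the highest scores first
    return order, scores[order]


# Toy example with 2-dimensional "embeddings"
indices, scores = top_k_by_cosine([[1, 0], [0, 1], [1, 1]], [1, 0], k=2)
print(indices, scores)
```

In the toy example, the first row points in the same direction as the query and ranks first, followed by the row at a 45 degree angle.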

Run the Script - Results

$ python embedding_search.py Animal Crossing
+------+---------------------+--------------------------------------------------------------+----------------+
|      | date                | text                                                         |   similarities |
|------+---------------------+--------------------------------------------------------------+----------------|
| 1041 | 2021-04-03 06:18:40 | Neil's Animal Crossing                                       |       0.843896 |
|  836 | 2021-10-16 02:37:16 | Animal Crossing Direct Chinese Video                         |       0.826405 |
|      |                     | https://www.youtube.com/watch?v=rI_jWfNd2dc                  |                |
| 1208 | 2019-11-10 00:05:56 | Raising animals seems very interesting                       |       0.822377 |
|  489 | 2023-06-16 09:33:15 | Watching the life of a kitten reminds me of Sisyphus in mythology |       0.802677 |
|  369 | 2023-08-16 02:15:54 | Do house cats get bored and lonely?                          |       0.797062 |
|   13 | 2023-12-14 13:17:59 | Attended 🤗                                                  |       0.796492 |
| 1177 | 2020-02-12 10:27:45 | The reason why people eat wild animals repeatedly is related to the deeply rooted concepts in traditional Chinese medicine |       0.796363 |
|      |                     | Health preservation, dietary therapy, supplementation, medicinal cuisine, shape complementing shape, nourishing qi and blood... |                |
|      |                     | Pseudoscience is reviving, and if not curbed now, similar things will happen in the future. |                |
|      |                     | Science is the only way.                                     |                |
|  801 | 2021-11-07 13:46:21 | I didn't expect that this year's game of the year would still be Animal Crossing and Fire Emblem. |       0.796246 |
|      |                     | Animal Crossing is because I didn't play enough before, Fire Emblem is because of a major event that made me want to replay it. |                |
|  837 | 2021-10-16 02:37:16 | No way, is my game of the year going to be Animal Crossing again? |       0.795871 |
|  423 | 2023-07-29 14:11:22 | A profile picture that can be called spiritual pollution~~   |       0.794144 |
+------+---------------------+--------------------------------------------------------------+----------------+

The Long Road Ahead - Many More Tasks to Do

Vector Database: Scanning a CSV file on every search is too inefficient, so the bot should query a vector database instead. I am considering Cloudflare's Vectorize, but I wanted to run a simple experiment first to understand the process. After all, Vectorize requires a paid Cloudflare plan, and I don't yet know whether it will meet my needs.
Continuous Database Updates: Cover not only my channel but also my articles, other relevant data sources, and even some channels I follow, and keep the database updated through a Telegram bot.
Prompt Engineering: When asking ChatGPT a question, retrieve relevant content from this vector database and include it in the prompt.
Basic Knowledge: I can't wait until I have enough background knowledge before starting; I should learn by doing. So far I have completed only the steps I already understand, and I need to fill in the corresponding background later.
Improve Quality: Low-quality content should be excluded, and image-related content should be reduced, since images cannot be embedded with a text embedding model.
Make it a CLI: This functionality actually already lives in the Nayako CLI, but that code is not yet organized and has no exception handling, so I released it as a demo first. Posting such a long block of code here is not ideal either.
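
The prompt engineering item above could start out as simple string assembly: take the top search results and place them in front of the question. The sketch below is only an illustration of that idea. build_prompt, its instruction wording, and the max_chars budget are all assumptions, and the snippets are assumed to arrive sorted by similarity, best first.

```python
def build_prompt(question, snippets, max_chars=2000):
    """Assemble a prompt that places retrieved snippets before the question."""
    context_lines = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > max_chars:  # Stop once the context budget is spent
            break
        context_lines.append(f"- {snippet}")
        used += len(snippet)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The snippets could come from the text column of the DataFrame returned by embedding_search, for example df.head(5)['text'].tolist().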

References

Embedding paragraphs from my blog with E5-large-v2 - This article was the original motivation for this project, but I borrowed only the idea, since I use OpenAI's API for embedding rather than a local model.
Telethon Documentation - Telethon is a Python client library for the Telegram API for personal accounts. Because personal accounts use the MTProto protocol, a library like this is practically required.
OpenAI Embeddings Use Cases - I followed this example to do the embedding.

Postscript

This is also an article with no technical content, just a record of some things I learned.