ChatGPT AI: Prepare Knowledge Data Using Web Content (Difficulty: 4)
Introduction
In a previous tutorial, you learned how to teach ChatGPT new knowledge using semantic search (searching for stored questions whose meaning is similar to the user's question).
The performance of such a chatbot depends mostly on the quality of the knowledge data we provide to the model. Preparing that data is often very time-consuming, as it involves collecting, cleaning, and formatting the data.
In this tutorial, you will learn to prepare the knowledge data for a new chatbot, which will answer questions about the VISIONS organization. The basic idea is to download data from its website and generate question/answer pairs from that data. This is a common use case: most businesses and organizations already have websites, but need to build chatbots on top of them to provide a better user experience.
Step 1 - Starting the Project
You will first need to complete this tutorial: ChatGPT AI: QA Bot Using Semantic Search (RAG) (Difficulty: 4)
Save a copy of that project and rename it as “Chatbot Using Web Data.”
Step 2 - Baseline Test for the VISIONS Organization
As a baseline, we must test what ChatGPT knows about the VISIONS organization. If it can answer most questions without help, it is unnecessary to inject additional knowledge.
To run the test, we can use the plain ChatGPT assistant in this project:
https://play.creaticode.com/projects/6531b7e60fdce080a4481c1d
For example, we can see that ChatGPT already knows about the VISIONS organization:
However, if we try to get more specific information, it would fail:
This proves that ChatGPT needs our help to answer more specific questions about the VISIONS organization, even though such information is readily available on their website.
[Advanced] Why doesn't ChatGPT know the answer?
ChatGPT does not memorize any sentences. Instead, it memorizes the probabilities of the next words: if some words appear together very often in the training data, that pattern gets stored inside the model. For example, if a lot of websites contain a sentence like “The phone number of VISIONS is 1.888.245.8333”, then the next time ChatGPT sees “The phone number of VISIONS is”, it will predict the correct phone number as the next word.
However, since that sentence rarely appears in the training data, the next word with the highest probability is more likely “sorry” or “I”, and the actual phone number has a much lower probability.
Step 3 - Fetch Data from the Website
Now, let’s look at the VISIONS organization website. Open this URL https://visionsvcb.org in a new tab, and you should see the following:
This web page contains a lot of information and links to other web pages. To download the information, we can run the following block (click the green ‘Add Extension’ button at the bottom left and select the ‘Cloud’ category):
Note that this block does two things for you:
- Download the full content of the page;
- Convert the content into the Markdown format.
The Markdown format is very simple. It is almost the same as the text you see on the web page, but additional information is included, such as the URLs of the links on the web page. This will be very useful in the next step.
Note that the content for each URL is cached, so if you fetch the same URL repeatedly on the same day, it will be very fast after the first time.
Also, the website’s content will go through a moderation process, so if any inappropriate content is found, this block will reject the fetch request.
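If you are curious what the fetch block amounts to under the hood, here is a rough Python sketch using the requests and html2text libraries (this is only an approximation: the real block also caches results and runs the moderation check, which the sketch omits):

```python
import requests
import html2text  # third-party library that converts HTML into Markdown

def fetch_as_markdown(url):
    # Download the full content of the page
    html = requests.get(url, timeout=30).text
    # Convert the HTML into Markdown, keeping the [text](url) links
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    return converter.handle(html)

result = fetch_as_markdown("https://visionsvcb.org")
```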
Step 4 - Learn about Web Pages and Links
Like most websites, the VISIONS page we have downloaded contains many links, such as these:
... [Home](https://visionsvcb.org) [About Us](#) [Annual Reports and Financials](https://visionsvcb.org/about/annual-report) [Board of Directors](https://visionsvcb.org/about/board-of-directors) [Management Team](https://visionsvcb.org/about/management-team) [Mission Statement](https://visionsvcb.org/about/mission-statement) [VISIONS History](https://visionsvcb.org/about/visions-history) ...
Each link leads to a new page, which may link to other pages. Some of these links may be duplicates as well. For example, page 1 may contain a link to page 2, and page 2 may contain links to both page 1 and page 3. With more pages, they will form a web with many links between the pages:
In the next few steps, we will write code to parse the links on each page and put these links into a queue with no duplication. We will visit each link in this queue to get more content and extract more links.
Step 5 - The URLs List
To store all the links we will visit, please create a new list named “URLs”. The first URL will be the main URL of the site: “https://visionsvcb.org”. Note that we delete all items from the list first, which ensures we always start with just the main URL.
Step 6 - Iterate Through Each URL
Next, we use a for-loop to visit every URL in the list. Since the list will grow as we discover more links on the pages we visit, we don’t know in advance how many URLs there will be. We will use a count of 1 for now to ensure our code works for a single URL.
We will also add a guard to ensure the “URL Index” is never greater than the length of the list, so we always get a valid URL.
Step 7 - Fetch the Content of One URL
Next, we will fetch the content from the URL at the current index and store it in the “result” variable:
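In Python, steps 5 through 7 amount to something like the sketch below (the blocks use 1-based indexes while Python lists are 0-based; fetch_as_markdown is the helper sketched earlier):

```python
URLs = []                                # delete all items first
URLs.append("https://visionsvcb.org")    # seed the list with the main URL

for url_index in range(1):               # just 1 iteration for now
    if url_index >= len(URLs):           # guard: never read past the end of the list
        break
    result = fetch_as_markdown(URLs[url_index])
```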
Step 8 - Define the “extract links” Block
Since the logic to extract all the links from the page’s content is fairly standalone, it’s better to define a new block dedicated to it: “extract links”. It will take the result from fetching the content and add all the URLs on that page to the “URLs” list.
Step 9 - Find All Links
To find all the links in the page’s content, we need to look for a certain pattern. In the Markdown format, every link’s URL is wrapped inside a pair of parentheses and starts with “http://” or “https://”. Therefore, we can use a regular expression to find all the text that matches this pattern and store the matches in a new list named “links”.
You can copy the exact regular expression from here: \(https?:\/\/[^\)]+\)
Now, if we run the program, it will fetch the content from the main site and extract all 60 links on that page into the “links” list:
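In Python, this step is a one-liner with the re module, using the exact pattern above:

```python
import re

# Match a "(" followed by http:// or https://, then everything up to ")"
links = re.findall(r"\(https?:\/\/[^\)]+\)", result)
# e.g. ['(https://visionsvcb.org)', '(https://visionsvcb.org/about/annual-report)', ...]
```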
Step 10 - Go Through the List of Links
To clean up this list of links, we need another for-loop to process each link. We will store each link in the “link” variable:
Step 11 - Remove the Parentheses
Each link contains a pair of parentheses. To remove them, we extract the substring of the link text from the second character to the second-to-last character. The result will be stored in a variable named “URL”.
Step 12 - Store URL Without Duplication
Now, we can add the new URL to the list of URLs, but we need to ensure the list doesn’t already contain this new URL.
Step 13 - Limit to the Main Site
There is still one small issue. Some links on the page are not from the same domain, such as “(https://accessibility-helper.co.il)”. Note that we should only download the data from the same main site. Otherwise, the list of URLs may grow exponentially as we visit more and more websites. Therefore, we need to add another condition: we only add a URL to our list if it contains “visionsvcb.org”. (Note that when you use this project for other websites, this main URL also needs to be changed.)
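Putting steps 9 through 13 together, the whole “extract links” block corresponds to a small function like this Python sketch (the variable names follow the tutorial):

```python
import re

def extract_links(result, URLs):
    """Find all links in the page content and add new, same-site ones to URLs."""
    links = re.findall(r"\(https?:\/\/[^\)]+\)", result)
    for link in links:
        URL = link[1:-1]                     # step 11: drop the wrapping "(" and ")"
        if URL in URLs:
            continue                         # step 12: no duplicates
        if "visionsvcb.org" not in URL:
            continue                         # step 13: stay on the main site
        URLs.append(URL)
```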
After this, run the program again, and the “URLs” list will grow to 42 items:
Step 14 - Test: Fetch from First 3 URLs
For a test, let’s try to fetch from the first 3 URLs in the list:
After we run this program again, we find the “URLs” list grows to 61 items:
We also find a new problem: a lot of the URLs point to PDF files. Those might be useful if we were building a document-retrieval app, but for this project, we should exclude them since we only need web content. We can add an additional condition in the “extract links” block like this:
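Expressed in Python, the filtering conditions now look like this (whether to test the file extension exactly or just look for “.pdf” anywhere in the URL is a judgment call; this sketch checks the extension):

```python
def should_keep(URL, URLs):
    """Revised filter for the "extract links" block."""
    if URL in URLs:
        return False                     # no duplicates
    if "visionsvcb.org" not in URL:
        return False                     # stay on the main site
    if URL.lower().endswith(".pdf"):     # the new condition: skip PDF files
        return False
    return True
```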
Now, if we rerun the program, the URLs list will only have 42 items in total. That means no new links have been discovered when we visit the second and third URLs.
Step 15 - Add the “Generate QA” Block
Now that we can go through all the pages on a website, the next step is to generate questions and answers from each page. Let’s focus on generating them from the first page. Please define a new block “generate QA” and call it in the repeat loop. We will pass in the “result” variable, which is the text of the current page.
Step 16 - Cut the Page Content into Chunks
Next, we will feed the page content to ChatGPT to generate pairs of questions and answers. Note that we cannot simply give all the content to ChatGPT in one request, because that might exceed ChatGPT’s limit on the request length. Instead, we will have to cut the content into chunks.
For example, suppose we limit each chunk to at most 5000 characters. If the content has 12000 characters in total, we will send 3 requests to ChatGPT: character 1 to 5000, character 5001 to 10000, and character 10001 to 12000.
This idea can be implemented using a for-loop like this:
The “start index” is the position of the first character of each chunk, and the last character’s position will be “start index + 4999”. Therefore, the “chunk” variable will contain the content of each chunk of the page’s content.
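In Python, the chunking loop could be sketched like this (Python slicing is 0-based, so the start indexes run 0, 5000, 10000, … instead of 1, 5001, 10001, …):

```python
CHUNK_SIZE = 5000

def generate_QA(result):
    # Walk through the content 5000 characters at a time
    for start_index in range(0, len(result), CHUNK_SIZE):
        chunk = result[start_index:start_index + CHUNK_SIZE]
        # ... send this chunk to ChatGPT (next step)
```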
Step 17 - Ask ChatGPT to Generate Questions and Answers
It is fairly straightforward to ask ChatGPT to generate some questions and answers using each chunk. However, to make it easier to parse the response of ChatGPT, we need to specify the output format as well. For example, we can use a prompt like this:
You will generate questions and answers based on the content of a web page. Start each question with "--QUESTION:" and start the answer with "--ANSWER:". Here is the web content:
Here is the code to compose and send the request:
Note that each request starts a new chat, so ChatGPT won’t need to worry about previous messages. Otherwise, ChatGPT may not focus on the current chunk it is handling.
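For reference, if you were reproducing this step outside CreatiCode with the OpenAI Python library (an assumption; the tutorial itself uses the built-in ChatGPT block), the request would slot into the chunk loop from the previous sketch, sending only the current chunk with no chat history:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

PROMPT = ('You will generate questions and answers based on the content of a '
          'web page. Start each question with "--QUESTION:" and start the '
          'answer with "--ANSWER:". Here is the web content:\n\n')

# No prior messages are included, so each request is effectively a new chat
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any chat model works
    messages=[{"role": "user", "content": PROMPT + chunk}],
).choices[0].message.content
```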
Step 18 - Split the Response into a List
The response we get from ChatGPT will look like this:
Our next task is to put the questions and answers into a table format so that we can use them to build the semantic database later. First, we need to split the response by the special symbol “--” and put the parts into a list named “qa”:
The list will look like this, with questions and answers as individual items:
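Here is a small illustration of the split in Python (the placeholder text stands in for the real questions and answers):

```python
response = "--QUESTION: Q1--ANSWER: A1--QUESTION: Q2--ANSWER: A2"  # illustrative
qa = response.split("--")
print(qa)  # ['', 'QUESTION: Q1', 'ANSWER: A1', 'QUESTION: Q2', 'ANSWER: A2']
```

Note that the part before the first “--” becomes an empty first item; we will account for that in step 20.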
Step 19 - Clear the Data Table
Since we will accumulate questions and answers and store them in the “data” table, we need to clear the table’s contents at the beginning.
Step 20 - Iterate through the “QA” List
To add all the questions and answers from the “qa” list to the “data” table, we can use a for-loop to go through every item in the “qa” list. We already know the first item of the list is always empty, so we should start with the second item. Also, we will consume 2 items at a time, one for the question and one for the answer, so we should increase the index “i” by 2 each time:
Step 21 - Add One Pair of Question and Answer
Now we can read the question from the item at index “i”, and its answer from the item at index “i + 1”. Then we add both of them as a new row to the “data” table:
For example, if there are 3 pairs of questions and answers, we would get 3 rows in the “data” table:
Step 22 - Remove the prefix
There is a minor issue: the questions contain the prefix “QUESTION:”, and the answers contain the prefix “ANSWER:”. We can remove them using the “replace” operator:
Now if we clear the data table (manually remove the existing items) and run the for-loop by itself again, we would get clean questions and answers:
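Steps 20 through 22 together amount to this Python sketch, using the “qa” list from step 18 (Python indexes start at 0, so the first question sits at index 1 rather than 2):

```python
data = []  # the "data" table: one [question, answer] row per pair

# qa[0] is always empty, so start at the first "QUESTION" item and
# consume two items per iteration
for i in range(1, len(qa) - 1, 2):
    question = qa[i].replace("QUESTION:", "").strip()
    answer = qa[i + 1].replace("ANSWER:", "").strip()
    data.append([question, answer])
```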
Step 23 - Test Run
Now our program is ready. Let’s test it with the first 3 URLs:
It will take some time to run this since we need to use ChatGPT to generate the questions and answers from each page. When the run finishes, we will get some questions in the “data” table. The exact number of questions may be different when you run it.
The rest of the process is the same as before: we can create a semantic database using the “data” table (no need to repeat this if the data stays the same), then query this database when the user asks a new question, and feed the query results to ChatGPT as reference.
Next Step
You can now try to change the program to fetch data from any other website, and then publish a chatbot using that data. Note that you will need to specify the new website in 2 places:
- In the beginning, when you specify the initial URL:
- In the “extract links” block’s definition, where you remove URLs not related to the target website: