Embeddings JSON of PDF for Gemini
This video shows the Python code to generate an embeddings JSON file for a sample PDF document. The code first extracts the text from the PDF using the PyPDF2 library. It then splits the text into chunks of 500 words each (the default chunk size). Finally, it uses the 'embedding-001' model to generate the embeddings, which can be used for a RAG implementation with the Gemini 1.5 Flash model.
I hope you like this video. For any questions, suggestions or appreciation please contact us at: https://programmerworld.co/contact/ or email at: programmerworld1990@gmail.com
Details:
import google.generativeai as genai
import PyPDF2
import os
import numpy as np
from typing import List, Dict
import time
import json
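# Configure your Google API key.
# The key is read from an environment variable rather than hard-coded in the script.
# The variable name GOOGLE_API_KEY is an assumption; use whichever name you prefer.
API_KEY = os.environ["GOOGLE_API_KEY"]
genai.configure(api_key=API_KEY)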
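
def extract_text_from_pdf(pdf_path: str) -> str:
    """Return the text of all pages in the PDF, separated by newlines."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text (for example, scanned images)
            text += (page.extract_text() or "") + "\n"
    return text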

def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Split the text into chunks of at most chunk_size words."""
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks

def generate_embeddings(chunks: List[str]) -> List[List[float]]:
    """Generate one embedding vector per chunk with the embedding-001 model."""
    embeddings = []
    for chunk in chunks:
        result = genai.embed_content(
            model="models/embedding-001",
            content=chunk,
            task_type="retrieval_document"
        )
        embeddings.append(result['embedding'])
    return embeddings

class SimpleVectorStore:
    """In-memory store that ranks chunks by cosine similarity to a query embedding."""

    def __init__(self, chunks: List[str], embeddings: List[List[float]]):
        self.chunks = chunks
        self.embeddings = np.array(embeddings)

    def retrieve(self, query_embedding: List[float], top_k: int = 3) -> List[Dict[str, object]]:
        query_embedding = np.array(query_embedding)
        # Cosine similarity between the query and every stored embedding
        similarities = np.dot(self.embeddings, query_embedding) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_embedding)
        )
        # Indices of the top_k most similar chunks, best match first
        top_indices = similarities.argsort()[-top_k:][::-1]
        return [{"text": self.chunks[i], "similarity": float(similarities[i])} for i in top_indices]

def rag_query(query: str, vector_store: SimpleVectorStore) -> str:
    """Answer the query with Gemini, using the most similar chunks as context."""
    query_embedding = genai.embed_content(
        model="models/embedding-001",
        content=query,
        task_type="retrieval_query"
    )['embedding']
    retrieved_chunks = vector_store.retrieve(query_embedding)
    context = "\n\n".join([chunk["text"] for chunk in retrieved_chunks])
    prompt = f"""
    Context:
    {context}

    Question: {query}

    Answer based on the context provided above. If the answer isn't clear from the context, say so.
    """
    model = genai.GenerativeModel('gemini-1.5-flash')
    response = model.generate_content(prompt)
    return response.text

def main():
    pdf_path = "custom_document.pdf"
    print("Extracting text from PDF...")
    pdf_text = extract_text_from_pdf(pdf_path)
    print("Chunking text...")
    chunks = chunk_text(pdf_text)
    print("Generating embeddings...")
    embeddings = generate_embeddings(chunks)
    print("Creating vector store...")
    vector_store = SimpleVectorStore(chunks, embeddings)
    query = "What is the main topic of the document?"
    print(f"\nQuery: {query}")
    response = rag_query(query, vector_store)
    print(f"Response: {response}")
    # Export chunks and embeddings to a JSON file
    data_to_export = {
        "chunks": chunks,
        "embeddings": embeddings
    }
    # Use one variable so the file name and the printed message always match
    output_path = "rag_data.json"
    with open(output_path, "w") as f:
        json.dump(data_to_export, f)
    print(f"Exported chunks and embeddings to {output_path}")
    time.sleep(2)  # Keep this for gRPC cleanup

if __name__ == "__main__":
    main()

(virtualpython) C:\Tools\Python\RAG_Gemini>python script.py
Extracting text from PDF...
Chunking text...
Generating embeddings...
Creating vector store...
Query: What is the main topic of the document?
Response: The main topic of the document is Artificial Intelligence (AI). The document provides an introduction to AI, its history, applications, and future prospects, including ethical considerations.
Exported chunks and embeddings to rag_data.json
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740591643.817614 5540 init.cc:232] grpc_wait_for_shutdown_with_timeout() timed out.
(virtualpython) C:\Tools\Python\RAG_Gemini>
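
The word-based chunking step is easier to see on a tiny input. The sketch below repeats the chunk_text function from the script and runs it on a made-up 12-word string with chunk_size=5. Both the sample string and the chunk size are assumptions chosen only for illustration; the script itself uses the default of 500 words.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Hypothetical toy input: 12 words, split 5 words at a time.
sample = "one two three four five six seven eight nine ten eleven twelve"
chunks = chunk_text(sample, chunk_size=5)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
# 1 one two three four five
# 2 six seven eight nine ten
# 3 eleven twelve
```

A document with fewer words than chunk_size produces a single chunk, and therefore a single embedding. Note also that the split ignores sentence boundaries, so a sentence can be cut between two chunks.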
Embeddings JSON generated from the sample PDF file used:
{"chunks": ["Sample Document: Introduction to Artificial Intelligence Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence. These tasks include learning, problem -solving, decision -making, and perception. AI systems are powered by algorithm s and data, enabling them to mimic cognitive functions and improve over time through experience. History of AI The concept of AI dates back to the 1950s, with pioneers like Alan Turing proposing machines that could simulate human thought. Turing\u2019s famous \"Turing Test\" questioned whether a machine could exhibit intelligent behavior indistinguishable from a human. Th e term \"Artificial Intelligence\" was officially coined by John McCarthy in 1956 during the Dartmouth Conference, marking the birth of AI as a field of study. Early AI research focused on symbolic reasoning and rule-based systems, while modern AI leverages statistical methods and neural networks. Applications of AI AI is used in various fields such as healthcare, finance, and transportation. In healthcare, AI powers diagnostic tools that analyze medical images to detect diseases like cancer. In finance, it drives fraud detection systems by identifying unusual pattern s in transactions. In transportation, AI enables autonomous vehicles to navigate roads using sensors and real -time decision -making. Other applications include virtual assistants (e.g., Siri, Alexa), recommendation systems (e.g., Netflix, Spotify), and natu ral language processing tools. Future of AI The future of AI promises advancements in natural language processing, robotics, and ethical AI development. Researchers aim to create more sophisticated models that understand context and emotions better. Robotics powered by AI could revolutionize industr ies like manufacturing and elderly care. 
However, challenges remain, including ensuring fairness in AI decisions, maintaining transparency in how models work, and addressing job displacement caused by automation. The ethical implications of AI are a growing area of concern as its influence expands."], "embeddings": [[0.015937733, -0.049690623, -0.0204021, 0.0007430521, 0.05593546, 0.060835373, 0.034127835, 0.015798252, 0.014865388, 0.0647824, 0.004501347, 0.02073167, 0.041307747, 0.019409342, -0.02338441, -0.048979405, 0.00175639, 0.0494867, -0.05169524, -0.0029872488, 0.008235313, -0.04425001, 0.020743825, -0.021080898, 0.024118695, -0.026611507, -0.0024623415, -0.02872784, -0.0103659, 0.021297151, -0.035618663, 0.03648871, -0.029480629, 0.014664115, 0.0064022066, -0.040722784, 0.0067286775, 0.03197567, 0.022416105, 0.03604338, 0.008371223, -0.03324304, -0.022109516, 0.008300066, 0.009181495, -0.044486675, 0.00013025786, 0.036705945, 0.043699875, -0.061026238, 0.030195666, 0.024267554, 0.08894829, 0.0028238026, 0.0010572604, -0.04599659, 0.025037382, 0.015956644, -0.012850152, 0.010750755, -0.0092878835, 0.013064177, -0.012717866, 0.0054410263, 0.012139852, -0.04487596, -0.056222692, 0.018665195, 0.036022894, 0.0007064748, 0.06049654, 0.0054333922, 0.08654995, -0.012002155, -0.0398773, -0.07271985, -0.03050464, 0.026083704, 0.03753472, -0.029052755, -0.013759981, -0.026409278, -0.04711357, -0.044707566, -0.05994972, 0.019444847, 6.912404e-05, -0.030967508, -0.03660031, 0.01612776, -0.035901815, 0.040098023, 0.043284886, -0.032243546, -0.0075980285, 0.061743252, -0.017150588, -0.043187395, 0.0052927965, -0.041648347, -0.0003984908, -0.020512585, -0.06376851, 0.0369314, 0.0387055, 0.020558184, -0.012336664, 0.07740225, 0.009710032, 0.059394155, -0.043346696, -0.011663661, -0.013236541, -0.02946507, 0.05073721, -0.015902925, 0.003983061, -0.0012480431, 0.033538993, 0.01357407, 0.0030091542, -5.738158e-05, 0.06705514, -0.014455231, 0.004190682, 0.06077048, 0.0072714896, 0.014704358, 0.030258518, -0.01325368, 
0.02309376, -0.025445024, -0.014987877, 0.0057038465, 0.034884263, 0.09502264, 0.04071962, 0.00695927, 0.035410013, 0.020899177, 0.009136633, -0.010116534, -0.007016623, 0.02180859, -0.019940283, 0.022875443, -0.06398348, -0.0014229885, 0.04992978, -0.06373337, 0.0015378344, 0.006679453, -0.05333586, -0.021301636, 0.056033663, 0.02073949, -0.009454011, 0.07411683, 0.021824801, 0.03633786, 0.009664076, 0.010216537, 0.016735462, 0.06768287, -0.04581567, -0.047274772, 0.0016150668, -0.026179891, 0.014153067, 0.0172116, -0.013210634, 0.044761594, -0.0013930007, -0.062453393, -0.02727489, -0.061817367, -0.0018438917, 0.017063046, -0.02036668, -0.006428096, -0.03683801, -0.022479918, 0.03659679, 0.019906605, 0.028995853, 0.0003945944, 0.053864058, -0.034257345, -0.06734754, 0.008313122, 0.0017552855, -0.014964498, -0.03535838, 1.1424553e-05, 0.025452448, 0.07250688, 0.05883478, 0.03916208, 0.0061630104, -0.017803451, 0.021428367, 0.06500971, -0.019534871, -0.060047086, -0.005530852, 0.0033090585, 0.0660379, -0.0077277333, -0.070355386, 0.049699314, -0.05304278, 0.024587676, 0.0034271968, 0.01679895, 0.0071134917, -0.0020015922, 0.023699218, -0.0076949187, 0.019677768, -0.025051823, -0.008719723, 0.0061841, -0.064047776, 0.046122093, -0.008883343, 0.052943565, -0.023913383, 0.024415245, 0.02907757, -0.04285641, -0.016512824, 0.073953815, 0.011839396, -0.015737407, 0.064493105, 0.0071743196, 0.009867819, 0.037423916, 0.04324142, 0.054122593, -0.03992351, 0.018336473, 0.0496274, 0.019233147, -0.053372253, -0.024768358, -0.058523644, 0.022124853, 0.024964062, 0.036711246, -0.027265575, -0.01956039, -0.015557396, 0.0047413725, -0.04075764, 0.0386848, -0.05072684, 0.0116405925, -0.012607731, -0.0006128997, 0.023374066, 0.014387301, 0.01986946, 0.010927249, -0.008451515, -0.03137258, -0.04829457, -0.07822872, 0.017459549, 0.05125215, -0.03377331, -0.08442833, 0.05544913, 0.0058780718, 0.04611517, 0.011229092, 0.0017561985, 0.03190164, 0.015662117, -0.024430783, 0.0005791091, 
-0.012369232, 0.036429815, -0.011309613, -0.016431445, 0.023900557, -0.053013287, -0.029711781, 0.013283226, -0.06636632, -0.028312866, -0.0044030487, -0.002637525, -0.08722542, -0.077643335, 0.020380896, -0.0115109775, 0.044436984, 0.040254474, -0.041377578, -0.016876306, -0.06668526, 0.00517847, -0.06631592, -0.0246784, -0.008839854, 0.02837776, -0.03193792, 0.024396565, 0.05067571, 0.04628903, -0.0589773, -0.039378867, -0.012344068, 0.07259791, 0.07994252, 0.021630451, -0.001081549, -0.00552696, 0.021120338, 0.030826252, 0.051611647, 0.012984188, -0.017994417, -0.04069849, 0.043664422, -0.03439143, 0.013621143, -0.0043348926, 0.012852082, -0.029259088, -0.0028064342, -0.047893044, 0.018586196, -0.01844385, 0.013625844, -0.11570167, 0.03394006, 0.01750292, -0.008058954, 0.011012508, 0.021228703, -0.017727114, -0.013536644, 0.026722452, 0.018608596, -0.08612556, -0.001815377, 0.086088754, -0.0054167397, 0.030782707, 0.051644508, -0.0493394, -0.010569267, -0.00025885765, -0.004875795, 0.044858832, -0.009041357, 0.09971623, -0.020451976, -0.027086664, 0.039829213, -0.023165608, -0.017565817, -0.024190255, -0.012250289, -0.0006819277, 0.047523163, -0.0067309006, 0.07020531, 0.025200514, -0.02797345, -0.0036610565, -0.037956573, 0.02860782, -0.022278793, -0.024428299, -0.05687103, 0.018811999, 0.004646171, -0.04951052, -0.014922714, 0.07843073, 0.04389206, -0.00089073874, 0.0054843314, -0.0051236134, 0.05634164, -0.027470073, 0.049337145, -0.014935135, 0.01125057, 0.100979894, -0.00036375507, 0.0034577502, -0.03856103, -0.03823353, -0.04507229, 0.0016518702, 0.048400387, -0.03177638, -0.019780912, -0.006582053, -0.0141180465, -0.013500944, -0.03010358, 0.044978272, -0.01982992, -0.013967415, 0.04356579, 0.009341139, -0.0092959115, 0.008178898, -0.0475293, -0.046589285, -0.011702044, 0.011692903, -0.0019787399, -0.0023206756, 0.025687642, -0.029119054, -0.032207325, 0.012919956, -0.0076698554, -0.07871898, -0.018896496, 0.014371966, 0.0102103, 0.028057354, 0.018049635, 
0.027571365, -0.041244313, 0.0341185, -0.0132847885, 0.009983149, -0.030771801, -0.00802973, 0.025323084, -0.049184617, -0.008115789, 0.01897837, -0.0009168715, 0.0081048235, -0.044833012, -0.0627945, -0.036278456, -0.01569583, -0.053250417, 0.020824011, -0.1010228, 0.014328087, -0.03278986, -0.04869567, -0.024637906, -0.027096026, -0.0320477, 0.009735668, 0.030937275, -0.020631185, -0.003415371, 0.045160368, -0.07016449, 0.0060208514, -0.07534053, 0.025457505, -0.057314463, 0.010813801, -0.027304871, -0.039581034, 0.044135723, 0.018805286, -0.052193917, 0.057560883, 0.0026461673, -0.039432835, -0.027079953, -0.046797697, 0.054896712, -0.097453065, 0.009493537, 0.018215142, 0.03559076, 0.041505467, 0.009291867, -0.037611373, -0.008220406, 0.05739598, 0.018572247, -0.0071714544, 0.027492383, -0.0007672517, 0.029340424, -0.01908262, -0.050116904, -0.034887534, 0.0015773056, -0.03752532, 0.083446674, 0.044683754, 0.043232914, -0.004991584, 0.009663692, -0.010543812, 0.0098278485, 0.091826245, -0.044635363, 0.04331855, -0.006996937, -0.03972313, -0.012431925, 0.02074892, 0.031899042, -0.019880988, 0.0077145384, 0.06840948, -0.036762588, 0.018286794, -0.0060506593, 0.028786978, 0.0020494899, 0.010927087, -0.017445471, -0.117860846, -0.010804658, 0.04260833, -0.04460682, 0.004313246, 0.029645534, -0.06371792, -0.025484318, -0.034365837, 0.039286356, -0.048986245, -0.0022667223, -0.00073199923, -0.016565606, -0.010210421, -0.0010090302, 0.018692797, -0.02195638, 0.029425928, 0.0028283575, 0.04839934, 0.039316483, -0.031129254, 0.008511486, 0.021532277, -0.10140012, 0.017440228, -0.045239262, -0.014356461, -0.0006704565, -0.003675826, -0.047367416, 0.032411683, -0.012287991, -0.0078509655, -0.020727087, 0.0016333379, -0.020598425, -0.020938404, 0.027403027, 0.033502463, 0.029717032, 0.059931103, 0.001264345, -0.033374254, -0.020484531, 0.04942524, -0.042713333, -0.0114802625, -0.044165462, 0.020019066, -0.019690126, 0.03163057, -0.023107199, -0.021289922, 0.02055969, 
-0.039849162, -0.02369738, 0.072991244, 0.017919794, -0.011995686, 0.017090427, 0.010134111, -0.016831394, 0.031115562, 0.036736, 0.012217628, 0.0016613844, -0.06303714, 0.046404876, -0.030550165, 0.004221511, 0.0015106994, 0.035962638, 0.019923486, -0.036546458, -0.01274042, 0.023699276, -0.039390557, -0.056512266, 0.005445429, -0.0023738767, 0.032482892, 0.032642107, -9.127115e-06, 0.016064115, 0.00048732743, 0.019533297, 0.002003949, -0.055820957, -0.014017039, -0.0013552338, -0.028416937, -0.057297293, 0.07549086, 0.03002154, -0.0344023, -0.049031887, 0.012205708, 0.019521939, 0.038936846, 0.025922872, 0.023197515, 0.0027053559, -0.041828368, -0.011932597, 0.09492208, 0.07031678, 0.05003787, 0.06771662, -0.013675302, -0.02507077, -0.047122367, -0.032464422, -0.018087894, 0.024937652, -0.00084780063, 0.016263701, -0.06959095, 0.0064508244, 0.002332378, -0.011606934, 0.03548693, 0.039297428, -0.023541367, -0.08397377, -0.047677506, 0.0035153893, -0.031850506, 0.014992461, 0.04920879, -0.026219416, 0.0149139995, -0.003956472, -0.0077915904, -0.023765916, 0.05076665, 0.00068453775, -0.013344515, 0.021755088, 0.014721706, 0.009876042, -0.008728505, -0.029906679, -0.017609712, -0.06329055, -0.009052265, 0.033339575, -0.07146805, 0.055614427, 0.042565517, -0.027625734, 0.047788728, 0.0073321434, -0.0018334655, 0.058487877, 0.023336174, 0.0010983209, 0.0060013416, 0.017578518, -0.021726642, 0.016368028, -0.017886557, 0.0072122314, 0.0072680414, -0.014004148, -0.04558729, -0.024214637, 0.020501517, -0.02785427, 0.02573613, 0.043737296, 0.033565212, -0.006454003, -0.019322738, -0.03945689, 0.006336293, 0.051637273, 0.008084477, -0.062842764, 0.003810282, 0.041968573, 0.012026803, -0.014858716, -0.00031981067, 0.05912095, -0.0026807087, 0.04228106, 0.045894567, -0.04709101, 0.00897077, 0.009761777, -0.014156779, -0.06513087, 0.0134507725, 0.008747292, -0.01693257, 0.03207501, 0.046277817, 0.02978224, 0.034059193, -0.026005294, -0.070972174, 0.07830663, -0.055013753, 
-0.005208802, -0.0054895882, 0.016601022, 0.032797925, -0.039050337, -0.015307272, 0.059401408, -0.034186713, 0.036989942, 0.01034918, 0.06314615, -0.021075713, -0.095132895, 0.0067423834, -0.018803598, -0.012408568, 0.076515876, 0.056670092, -0.056374375, -0.01929818, 0.0014459653, 0.04872634, -0.03402996, -0.04095938, -0.05226812, -0.010170962, 0.045151826, 0.063585676, -0.031659152, -8.947953e-05, -0.015878012, -0.009755888, -0.0077300514, -0.039799515, -0.011702895, 0.0026955942, -0.017514782, 0.041243166, 0.02918135, -0.026672442, 0.023746604]]}
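
Once the JSON is exported, the chunks and embeddings can be reused without calling the embedding model again for the document. The following is a minimal sketch of that idea, not part of the original script: load_rag_data and rank_chunks are hypothetical helper names, and the ranking uses plain cosine similarity with only the standard library (the script above does the same calculation with NumPy). The query embedding is passed in by the caller.

```python
import json
import math
from typing import Dict, List, Tuple


def load_rag_data(path: str) -> Tuple[List[str], List[List[float]]]:
    """Load the {"chunks": [...], "embeddings": [...]} JSON written by the script."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    chunks, embeddings = data["chunks"], data["embeddings"]
    if len(chunks) != len(embeddings):
        raise ValueError("Expected one embedding per chunk")
    return chunks, embeddings


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    chunks: List[str],
    embeddings: List[List[float]],
    query_embedding: List[float],
    top_k: int = 3,
) -> List[Dict[str, object]]:
    """Return the top_k chunks most similar to the query, best match first."""
    similarities = [cosine_similarity(e, query_embedding) for e in embeddings]
    order = sorted(range(len(chunks)), key=lambda i: similarities[i], reverse=True)
    return [
        {"text": chunks[i], "similarity": similarities[i]}
        for i in order[:top_k]
    ]
```

For example, call load_rag_data with the path of the exported JSON file, then pass the result to rank_chunks together with a query embedding obtained from genai.embed_content with task_type="retrieval_query", as in rag_query. The query must be embedded with the same model as the chunks, otherwise the similarity scores are not meaningful.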