= """&P225104 = TIM 10, 134
akk1 #atf: use lexical
#Nippur 2N-T496; proverb; Alster proverbs
@tablet
@obverse
@column 1
1. dub-sar hu-ru
2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne
3. dub-sar hu-ru
4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne
@reverse
@column 1
1. igi-bi 3(disz) 3(asz) 6(disz)
"""
Brushing up on your Python Skills
The basics of this class are taught in Python. And the neglected basics of ALP is preprocessing our texts.
Preprocessing for ALP is much broader than what computer and data scientists usually mean. Philological conventions in printed and digital publications hold much more information that needs to be correctly parsed before any computational manipulation (analysis).
In this notebook, we are going to provide four examples of messy texts: one in Egyptian and two in Akkadian. We are going to work through the process of how we should parse the texts, what information we are losing when parsing them, and brushing up on basic Python syntax and functions while we’re at it.
Akkadian Example 1:
https://cdli.mpiwg-berlin.mpg.de/artifacts/225104
&P225104 = TIM 10, 134 #atf: use lexical #Nippur 2N-T496; proverb; Alster proverbs @tablet @obverse @column 1 1. dub-sar hu-ru 2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne 3. dub-sar hu-ru 4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne @reverse @column 1 1. igi-bi 3(disz) 3(asz) 6(disz)
Task 1:
How do we turn this raw text into a list of words?
akk1
'&P225104 = TIM 10, 134\n#atf: use lexical\n#Nippur 2N-T496; proverb; Alster proverbs\n@tablet\n@obverse\n@column 1\n1. dub-sar hu-ru\n2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne\n3. dub-sar hu-ru\n4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne\n@reverse\n@column 1\n1. igi-bi 3(disz) 3(asz) 6(disz)\n'
# split string to lines of texts
= akk1.split("\n")
lines lines
['&P225104 = TIM 10, 134',
'#atf: use lexical',
'#Nippur 2N-T496; proverb; Alster proverbs',
'@tablet',
'@obverse',
'@column 1',
'1. dub-sar hu-ru',
'2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
'3. dub-sar hu-ru',
'4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
'@reverse',
'@column 1',
'1. igi-bi 3(disz) 3(asz) 6(disz)',
'']
# remove blanks
= []
lines_full for line in lines:
if line != "":
lines_full.append(line)
lines_full
['&P225104 = TIM 10, 134',
'#atf: use lexical',
'#Nippur 2N-T496; proverb; Alster proverbs',
'@tablet',
'@obverse',
'@column 1',
'1. dub-sar hu-ru',
'2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
'3. dub-sar hu-ru',
'4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
'@reverse',
'@column 1',
'1. igi-bi 3(disz) 3(asz) 6(disz)']
# keep only lines that begin with a number
# use regular expressions
import re
= []
text_lines for line in lines_full:
if re.match("^\d", line) != None:
text_lines.append(line)
text_lines
['1. dub-sar hu-ru',
'2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
'3. dub-sar hu-ru',
'4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
'1. igi-bi 3(disz) 3(asz) 6(disz)']
# separate lines into words
= []
words_appended = []
words_extended for line in text_lines:
= line.split()
temp_words 1:]) # creates list of lists
words_appended.append(temp_words[1:]) # creates list
words_extended.extend(temp_words[
print(words_appended)
print("-------------------------------")
print(words_extended)
[['dub-sar', 'hu-ru'], ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne'], ['dub-sar', 'hu-ru'], ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne'], ['igi-bi', '3(disz)', '3(asz)', '6(disz)']]
-------------------------------
['dub-sar', 'hu-ru', 'a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne', 'dub-sar', 'hu-ru', 'a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne', 'igi-bi', '3(disz)', '3(asz)', '6(disz)']
# rewrite the code above as a function
What information did we lose when preprocessing the texts in this way?
Task 2:
Create a dictionary from the raw texts, of the following format:
{"pnum": ...
"textID": ...
"surface": [{
"surfaceType": ...
"columns": [{
"columnNum": ...
"text": [{
"lineNum": ...
"words": [..., ..., ...]
}]
}]
}]}
# separate text into lines
= akk1.split("\n")
lines
= []
lines_full for line in lines:
if line != "":
lines_full.append(line)
lines_full
['&P225104 = TIM 10, 134',
'#atf: use lexical',
'#Nippur 2N-T496; proverb; Alster proverbs',
'@tablet',
'@obverse',
'@column 1',
'1. dub-sar hu-ru',
'2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
'3. dub-sar hu-ru',
'4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
'@reverse',
'@column 1',
'1. igi-bi 3(disz) 3(asz) 6(disz)']
# store the pnum and textID in variables
= lines_full[0]
text_ids = text_ids.split("=")
pnum, textID
= pnum.strip()[1:]
pnum = textID.strip()
textID print(pnum)
print(textID)
P225104
TIM 10, 134
# @title
# create a dictionary for each surface (simple no regex method)
# what do you do when you have different type of inscribed object? (e.g. cylinder, prism, bowl, slab, etc.)
= ["@obverse", "@reverse"]
valid_surface_values
= []
surface_idx
for index, line in enumerate(lines_full):
if line in valid_surface_values: # what is dangerous in this line? if the line of text is not exactly(!) part of surface, no lines will be found
surface_idx.append(index)print(surface_idx)
[4, 10]
# @title
# create a dictionary for each surface (complicated with regex method)
# what do you do when you have different type of inscribed object? (e.g. cylinder, prism, bowl, slab, etc.)
= ["@obverse", "@reverse"]
valid_surface_values
= r"^(?:" + "|".join([re.escape(value) for value in valid_surface_values]) + ")"
pattern
= []
surface_idx
for index, line in enumerate(lines_full):
if re.match(pattern, line) != None:
surface_idx.append(index)print(surface_idx)
[4, 10]
# @title
# use surface indices to create surface dictionaries
# surfaceType; columnNum; lineNum; words
# surfaceType extracted using id values of lines
# columnNum needs first to check whether a column actually exists, then extracted using regex(?)/tokenize on space for any number after the word column
# lineNum is regex for any line that begins with a number plus any tags attached: how would be best to define line numbers, as integers or as string variables?
# words extracted from each text line after lineNum using regex and tokenized on spaces
for index, id in enumerate(surface_idx):
= lines_full[id].replace('@', '')
surfaceType print(index, id)
if index < len(surface_idx) - 1:
= surface_idx[index+1]
end_of_surface else:
= len(lines_full)
end_of_surface
# Extract the text content for the current surface designation
= lines_full[id+1:end_of_surface]
surface_content
# Print the surface type and its content
print(f"Surface Type: {surfaceType}")
# print("Content:")
# print('\n'.join(surface_content))
print('---')
# Extract column number, line numbers, and words for each surface content
for line in surface_content:
= None
columnNum = None
lineNum = []
words
# Check if the line contains a column number
if '@column' in line:
= line.split()
parts if len(parts) >= 2:
try:
= int(parts[1])
columnNum except ValueError:
pass
print(f"Column Number: {columnNum}")
print('---')
continue # Skip processing the line with @column
# Check if the line contains a line number
if '.' in line:
= line.split('.')
parts if len(parts) >= 2:
= parts[0].strip()
lineNum
# Tokenize the words in the line
if lineNum:
= parts[1].strip().split()
words else:
= line.strip().split()
words
# Print the extracted information for each line
print(f"Line Number: {lineNum}")
print(f"Words: {words}")
print('---')
0 4
Surface Type: obverse
---
Column Number: 1
---
Line Number: 1
Words: ['dub-sar', 'hu-ru']
---
Line Number: 2
Words: ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne']
---
Line Number: 3
Words: ['dub-sar', 'hu-ru']
---
Line Number: 4
Words: ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne']
---
1 10
Surface Type: reverse
---
Column Number: 1
---
Line Number: 1
Words: ['igi-bi', '3(disz)', '3(asz)', '6(disz)']
---
# @title
# Combine the surfaces and metadata into one dictionary
= {
output "pnum": pnum,
"textID": textID,
"surface": []
}
for index, id in enumerate(surface_idx):
= lines_full[id].replace('@', '')
surfaceType = {
surface "surfaceType": surfaceType,
"columns": []
}
if index < len(surface_idx) - 1:
= surface_idx[index+1]
end_of_surface else:
= len(lines_full)
end_of_surface
# Extract the text content for the current surface designation
= lines_full[id+1:end_of_surface]
surface_content
# Extract column number, line numbers, and words for each surface content
= None
columnNum = {
column "columnNum": None,
"text": []
}for line in surface_content:
= None
lineNum = []
words
# Check if the line contains a column number
if '@column' in line:
= line.split()
parts if len(parts) >= 2:
try:
= int(parts[1])
columnNum "columnNum"] = columnNum
column[except ValueError:
pass
continue # Skip processing the line with @column
# Check if the line contains a line number
if '.' in line:
= line.split('.')
parts if len(parts) >= 2:
= parts[0].strip()
lineNum
# Tokenize the words in the line
if lineNum:
= parts[1].strip().split()
words else:
= line.strip().split()
words
# Add the line information to the column
= {
line_info "lineNum": lineNum,
"words": words
}"text"].append(line_info)
column[
# Add the column to the surface
"columns"].append(column)
surface[
# Add the surface to the output
"surface"].append(surface)
output[
# Print the output in the specified dictionary format
print(output)
{'pnum': 'P225104', 'textID': 'TIM 10, 134', 'surface': [{'surfaceType': 'obverse', 'columns': [{'columnNum': 1, 'text': [{'lineNum': '1', 'words': ['dub-sar', 'hu-ru']}, {'lineNum': '2', 'words': ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne']}, {'lineNum': '3', 'words': ['dub-sar', 'hu-ru']}, {'lineNum': '4', 'words': ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne']}]}]}, {'surfaceType': 'reverse', 'columns': [{'columnNum': 1, 'text': [{'lineNum': '1', 'words': ['igi-bi', '3(disz)', '3(asz)', '6(disz)']}]}]}]}
# Save the output dictionary as a JSON file
import json
with open(f"{pnum}.json", "w") as json_file:
=4) json.dump(output, json_file, indent
# rewrite the code above into a function
Egyptian Example 1:
= """,text,line,word,ref,frag,norm,unicode_word,unicode,lemma_id,cf,pos,sense
eg2_csv 92,3Z5EM77HJFCOPKZDDZFEMI6KVY,5,7,3Z5EM77HJFCOPKZDDZFEMI6KVY.5.7,gꜣu̯.w,gꜣu̯.w,<g>V96</g>𓅱,"['<', 'g', '>', 'V', '9', '6', '<', '/', 'g', '>', '𓅱']",166210,gꜣu̯,VERB,eng sein; entbehren; (jmdn.) Not leiden lassen
151,4WVXFJZFLNAYHP3Y5O5SLWD7DA,2,2,4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.2,nꜥw,nꜥw,𓈖𓂝𓅱<g>I14C</g>𓏤,"['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', 'C', '<', '/', 'g', '>', '𓏤']",80510,Nꜥw,PROPN,Sich windender (Personifikation der Schlange)
153,4WVXFJZFLNAYHP3Y5O5SLWD7DA,2,5,4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.5,nꜥw,nꜥw,𓈖𓂝𓅱<g>I14C</g>𓏤,"['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', 'C', '<', '/', 'g', '>', '𓏤']",80510,Nꜥw,PROPN,Sich windender (Personifikation der Schlange)
200,67HZI45S3REA3LWVZOKJ6QJOIE,14,9,67HZI45S3REA3LWVZOKJ6QJOIE.14.9,nbi̯.n,nbi̯.n,𓈖𓎟𓃀<g>D107</g>𓈖,"['𓈖', '𓎟', '𓃀', '<', 'g', '>', 'D', '1', '0', '7', '<', '/', 'g', '>', '𓈖']",82520,nbi̯,VERB,schmelzen; gießen
204,67HZI45S3REA3LWVZOKJ6QJOIE,14,13,67HZI45S3REA3LWVZOKJ6QJOIE.14.13,nḏr.n,nḏr.n,𓈖𓇦𓂋<g>U19A</g>𓆱𓈖,"['𓈖', '𓇦', '𓂋', '<', 'g', '>', 'U', '1', '9', 'A', '<', '/', 'g', '>', '𓆱', '𓈖']",91630,nḏr,VERB,(Holz) bearbeiten; zimmern
206,67HZI45S3REA3LWVZOKJ6QJOIE,14,15,67HZI45S3REA3LWVZOKJ6QJOIE.14.15,b(w)n.wDU,bwn.wDU,𓃀𓈖𓏌𓅱<g>T86</g><g>T86</g>,"['𓃀', '𓈖', '𓏌', '𓅱', '<', 'g', '>', 'T', '8', '6', '<', '/', 'g', '>', '<', 'g', '>', 'T', '8', '6', '<', '/', 'g', '>']",55330,bwn,NOUN,Speerspitzen (des Fischspeeres)
"""
import pandas as pd
from io import StringIO
# Convert the string into a StringIO object
# This is only necessary because we presented the csv as a string instead of loading it as a file
= StringIO(eg2_csv)
csv_data
# Read the data into a pandas DataFrame
= pd.read_csv(csv_data)
df
# Display the DataFrame
df
Unnamed: 0 | text | line | word | ref | frag | norm | unicode_word | unicode | lemma_id | cf | pos | sense | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 92 | 3Z5EM77HJFCOPKZDDZFEMI6KVY | 5 | 7 | 3Z5EM77HJFCOPKZDDZFEMI6KVY.5.7 | gꜣu̯.w | gꜣu̯.w | <g>V96</g>𓅱 | ['<', 'g', '>', 'V', '9', '6', '<', '/', 'g', ... | 166210 | gꜣu̯ | VERB | eng sein; entbehren; (jmdn.) Not leiden lassen |
1 | 151 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA | 2 | 2 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.2 | nꜥw | nꜥw | 𓈖𓂝𓅱<g>I14C</g>𓏤 | ['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', ... | 80510 | Nꜥw | PROPN | Sich windender (Personifikation der Schlange) |
2 | 153 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA | 2 | 5 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.5 | nꜥw | nꜥw | 𓈖𓂝𓅱<g>I14C</g>𓏤 | ['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', ... | 80510 | Nꜥw | PROPN | Sich windender (Personifikation der Schlange) |
3 | 200 | 67HZI45S3REA3LWVZOKJ6QJOIE | 14 | 9 | 67HZI45S3REA3LWVZOKJ6QJOIE.14.9 | nbi̯.n | nbi̯.n | 𓈖𓎟𓃀<g>D107</g>𓈖 | ['𓈖', '𓎟', '𓃀', '<', 'g', '>', 'D', '1', '0', ... | 82520 | nbi̯ | VERB | schmelzen; gießen |
4 | 204 | 67HZI45S3REA3LWVZOKJ6QJOIE | 14 | 13 | 67HZI45S3REA3LWVZOKJ6QJOIE.14.13 | nḏr.n | nḏr.n | 𓈖𓇦𓂋<g>U19A</g>𓆱𓈖 | ['𓈖', '𓇦', '𓂋', '<', 'g', '>', 'U', '1', '9', ... | 91630 | nḏr | VERB | (Holz) bearbeiten; zimmern |
5 | 206 | 67HZI45S3REA3LWVZOKJ6QJOIE | 14 | 15 | 67HZI45S3REA3LWVZOKJ6QJOIE.14.15 | b(w)n.wDU | bwn.wDU | 𓃀𓈖𓏌𓅱<g>T86</g><g>T86</g> | ['𓃀', '𓈖', '𓏌', '𓅱', '<', 'g', '>', 'T', '8', ... | 55330 | bwn | NOUN | Speerspitzen (des Fischspeeres) |
def split_tags(text):
= [] # List to collect output of the function
parts while '<g>' in text and '</g>' in text:
= text.split('<g>', 1) # splits at the first <g> found
pre, rest = rest.split('</g>', 1) # splits the rest at the first </g> found
tag_content, post
# adds elements before the first <g></g> tag to the List
parts.extend(pre)
# adds element inside the first <g></g> tag to the List
parts.append(tag_content)
# text variable is set to remaining text
= post
text
# After last tag found, the remainder of the text is split and added to the List
parts.extend(text)return parts
def process_text(text):
if pd.isna(text): # deals with NaN
return []
else:
return split_tags(text)
# apply functions to every row of the column 'unicode_word'
'unicode_splitted'] = df['unicode_word'].apply(process_text)
df[# delete obsolete column
'unicode', axis=1, inplace=True)
df.drop(
# show modified dataframe
df# export as *.csv file
#df.to_csv("EG-TLA-example.csv")
Unnamed: 0 | text | line | word | ref | frag | norm | unicode_word | lemma_id | cf | pos | sense | unicode_splitted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 92 | 3Z5EM77HJFCOPKZDDZFEMI6KVY | 5 | 7 | 3Z5EM77HJFCOPKZDDZFEMI6KVY.5.7 | gꜣu̯.w | gꜣu̯.w | <g>V96</g>𓅱 | 166210 | gꜣu̯ | VERB | eng sein; entbehren; (jmdn.) Not leiden lassen | [V96, 𓅱] |
1 | 151 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA | 2 | 2 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.2 | nꜥw | nꜥw | 𓈖𓂝𓅱<g>I14C</g>𓏤 | 80510 | Nꜥw | PROPN | Sich windender (Personifikation der Schlange) | [𓈖, 𓂝, 𓅱, I14C, 𓏤] |
2 | 153 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA | 2 | 5 | 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.5 | nꜥw | nꜥw | 𓈖𓂝𓅱<g>I14C</g>𓏤 | 80510 | Nꜥw | PROPN | Sich windender (Personifikation der Schlange) | [𓈖, 𓂝, 𓅱, I14C, 𓏤] |
3 | 200 | 67HZI45S3REA3LWVZOKJ6QJOIE | 14 | 9 | 67HZI45S3REA3LWVZOKJ6QJOIE.14.9 | nbi̯.n | nbi̯.n | 𓈖𓎟𓃀<g>D107</g>𓈖 | 82520 | nbi̯ | VERB | schmelzen; gießen | [𓈖, 𓎟, 𓃀, D107, 𓈖] |
4 | 204 | 67HZI45S3REA3LWVZOKJ6QJOIE | 14 | 13 | 67HZI45S3REA3LWVZOKJ6QJOIE.14.13 | nḏr.n | nḏr.n | 𓈖𓇦𓂋<g>U19A</g>𓆱𓈖 | 91630 | nḏr | VERB | (Holz) bearbeiten; zimmern | [𓈖, 𓇦, 𓂋, U19A, 𓆱, 𓈖] |
5 | 206 | 67HZI45S3REA3LWVZOKJ6QJOIE | 14 | 15 | 67HZI45S3REA3LWVZOKJ6QJOIE.14.15 | b(w)n.wDU | bwn.wDU | 𓃀𓈖𓏌𓅱<g>T86</g><g>T86</g> | 55330 | bwn | NOUN | Speerspitzen (des Fischspeeres) | [𓃀, 𓈖, 𓏌, 𓅱, T86, T86] |
Egyptian Example 2:
A sentence from the sarcophagus of the Napatan king Aspelta (c. 600-580 BCE), found in his pyramid in Nuri, Sudan (Nu. 8), https://collections.mfa.org/objects/145117
Get the context of the sentence from the Thesaurus Linguae Aegyptiae: https://thesaurus-linguae-aegyptiae.de/text/27KHHMEP4VHSDH737F2OFLKNSE/sentences
# This Dictionary was created from the original json file
= {'publication_statement': {'credit_citation': 'Doris Topmann, Sentence ID 2CBOF5UQ7JGETCXG2CQKPCWDZM <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data/blob/v17/sentences/2CBOF5UQ7JGETCXG2CQKPCWDZM.json>, in: Thesaurus Linguae Aegyptiae: Raw Data <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data>, Corpus issue 17 (31 October 2022), ed. by Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig (first published: 22 September 2023)', 'collection_editors': 'Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig', 'data_engineers': {'input_software_BTS': ['Christoph Plutte', 'Jakob Höper'], 'database_curation': ['Simon D. Schweitzer'], 'data_transformation': ['Jakob Höper', 'R. Dominik Blöse', 'Daniel A. Werning']}, 'date_published_in_TLA': '2022-10-31', 'rawdata_first_published': '2023-09-22', 'corresponding_TLA_URL': 'https://thesaurus-linguae-aegyptiae.de/sentence/2CBOF5UQ7JGETCXG2CQKPCWDZM', 'license': 'Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0/>'}, 'context': {'line': 'III', 'paragraph': None, 'pos': 7, 'textId': '27KHHMEP4VHSDH737F2OFLKNSE', 'textType': 'Text', 'variants': 1}, 'eclass': 'BTSSentence', 'glyphs': {'mdc_compact': None, 'unicode': None}, 'id': '2CBOF5UQ7JGETCXG2CQKPCWDZM', 'relations': {'contains': [{'eclass': 'BTSAnnotation', 'id': 'DYJEAXFKBJAXJPVLJGWREJZJ5M', 'ranges': [{'end': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'start': '22TFIMS2CBBCFFCDSCAIT3HR3Y'}], 'type': 'ägyptologische Textsegmentierung'}], 'partOf': [{'eclass': 'BTSText', 'id': '27KHHMEP4VHSDH737F2OFLKNSE', 'name': 'Isis (HT 15, HT 14, HT 17)', 'type': 'Text'}]}, 'tokens': [{'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'PTCL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D35:N35', 'mdc_original': 'D35-N35', 'mdc_original_safe': None, 'mdc_tla': 'D35-N35', 'order': [1, 2], 'unicode': '𓂜𓈖'}, 'id': '22TFIMS2CBBCFFCDSCAIT3HR3Y', 'label': 'nn', 'lemma': {'POS': {'type': 'particle'}, 'id': '851961'}, 'transcription': {'mdc': 'nn', 'unicode': 'nn'}, 'translations': {'de': ['[Negationspartikel]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'SC.act.ngem.nom.subj_Neg.nn', 'lingGloss': 'V\\tam.act', 'numeric': 210020}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'W11-V28-A7', 'mdc_original': 'W11-V28-A7', 'mdc_original_safe': None, 'mdc_tla': 'W11-V28-A7', 'order': [2, 3, 4], 'unicode': '𓎼𓎛𓀉'}, 'id': 'IOLUGQXLCRGNLMTAPJ65LI7MHU', 'label': 'gḥ', 'lemma': {'POS': {'subtype': 'verb_3-lit', 'type': 'verb'}, 'id': '166480'}, 'transcription': {'mdc': 'gH', 'unicode': 'gḥ'}, 'translations': {'de': ['matt sein']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'Noun.pl.stpr.3sgm', 'lingGloss': 'N.f:pl:stpr', 'numeric': 70154}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D36:X1*F51B-Z2', 'mdc_original': 'D36-X1-F51B-Z2', 'mdc_original_safe': None, 'mdc_tla': 'D36-X1-F51B-Z2', 'order': [5, 6, 7, 8], 'unicode': '𓂝𓏏𓄹︀\U00013440𓏥'}, 'id': 'GUVBJUGCSVF5VN55PN6RYS4YLI', 'label': 'ꜥ,t.pl', 'lemma': {'POS': {'subtype': 'substantive_fem', 'type': 'substantive'}, 'id': '34550'}, 'transcription': {'mdc': 'a.t.PL', 'unicode': 'ꜥ.t.PL'}, 'translations': {'de': ['Glied; Körperteil']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': '-3sg.m', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'I9', 'mdc_original': 'I9', 'mdc_original_safe': None, 'mdc_tla': 'I9', 'order': [9], 'unicode': '𓆑'}, 'id': 'GIHCJ27JXVAM7GDUYWGEPKBRB4', 'label': '=f', 'lemma': {'POS': {'subtype': 'personal_pronoun', 'type': 'pronoun'}, 'id': '10050'}, 'transcription': {'mdc': '=f', 'unicode': '=f'}, 'translations': {'de': ['[Suffix Pron. sg.3.m.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'dem.f.pl', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M17-Q3:N35', 'mdc_original': 'M17-Q3-N35', 'mdc_original_safe': None, 'mdc_tla': 'M17-Q3-N35', 'order': [10, 11, 12], 'unicode': '𓇋𓊪𓈖'}, 'id': 'Z6HTGGPBPRDT3OZTZNXRF2GRDA', 'label': 'jp〈t〉n', 'lemma': {'POS': {'subtype': 'demonstrative_pronoun', 'type': 'pronoun'}, 'id': '850009'}, 'transcription': {'mdc': 'jp〈t〉n', 'unicode': 'jp〈t〉n'}, 'translations': {'de': ['diese [Dem.Pron. pl.f.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'D4-Q1-A40', 'mdc_original': 'D4-Q1-A40', 'mdc_original_safe': None, 'mdc_tla': 'D4-Q1-A40', 'order': [13, 14, 15], 'unicode': '𓁹𓊨𓀭'}, 'id': 'UCFJWBLRKJG4NJWTWT22WDR2MU', 'label': 'Wsr,w', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '49461'}, 'transcription': {'mdc': 'wsr.w', 'unicode': 'Wsr.w'}, 'translations': {'de': ['Osiris (Totentitel des Verstorbenen)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M23-X1:N35', 'mdc_original': 'M23-X1-N35', 'mdc_original_safe': None, 'mdc_tla': 'M23-X1-N35', 'order': [16, 17, 18], 'unicode': '𓇓𓏏𓈖'}, 'id': 'LI5FJI4ZUJEMPIKS5RQ5HHNBUE', 'label': 'nzw', 'lemma': {'POS': {'type': 'substantive'}, 'id': '88040'}, 'transcription': {'mdc': 'nzw', 'unicode': 'nzw'}, 'translations': {'de': ['König']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:N17-N17', 'mdc_original': 'V30-N17-N17', 'mdc_original_safe': None, 'mdc_tla': 'V30-N17-N17', 'order': [19, 20, 21], 'unicode': '𓎟𓇿𓇿'}, 'id': 'ICADWHGbHkfdokpooG4eCy3Zfe8', 'label': 'nb-Tꜣ,du', 'lemma': {'POS': {'subtype': 'epith_king', 'type': 'epitheton_title'}, 'id': '400038'}, 'transcription': {'mdc': 'nb-tA.DU', 'unicode': 'nb-Tꜣ.DU'}, 'translations': {'de': ['Herr der Beiden Länder (Könige)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:D4-Aa1*X1:Y1', 'mdc_original': 'V30-D4-Aa1-X1-Y1', 'mdc_original_safe': None, 'mdc_tla': 'V30-D4-Aa1-X1-Y1', 'order': [22, 23, 24, 25, 26], 'unicode': '𓎟𓁹𓐍𓏏𓏛'}, 'id': 'ICADWHT2O1dc30SXuRZUlquIDpM', 'label': 'nb-jr(,t)-(j)ḫ,t', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '400354'}, 'transcription': {'mdc': 'nb-jr(.t)-(j)x.t', 'unicode': 'nb-jr(.t)-(j)ḫ.t'}, 'translations': {'de': ['Herr des Rituals']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': '<-M17-O34:Q3-E23-N17->', 'mdc_original': '<-M17-O34-Q3-E23-N17->', 'mdc_original_safe': None, 'mdc_tla': '<-M17-O34-Q3-E23-N17->', 'order': [18, 19, 20, 21, 22, 23], 'unicode': '𓍹\U0001343c𓇋𓊃𓊪𓃭𓇿\U0001343d𓍺'}, 'id': 'J3MLYALWVNAMDDG33VZ3RIEEUA', 'label': 'Jsplt', 'lemma': {'POS': {'subtype': 'kings_name', 'type': 'entity_name'}, 'id': '850103'}, 'transcription': {'mdc': 'jsplt', 'unicode': 'Jsplt'}, 'translations': {'de': ['Aspelta']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N.m:sg', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'U5:D36-P8h', 'mdc_original': 'U5-D36-P8h', 'mdc_original_safe': None, 'mdc_tla': 'U5-D36-P8h', 'order': [25, 26, 27], 'unicode': '𓌷𓂝𓊤︂'}, 'id': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'label': 'mꜣꜥ-ḫrw', 'lemma': {'POS': {'subtype': 'substantive_masc', 'type': 'substantive'}, 'id': '66750'}, 'transcription': {'mdc': 'mAa-xrw', 'unicode': 'mꜣꜥ-ḫrw'}, 'translations': {'de': ['Gerechtfertigter (der selige Tote)']}, 'type': 'word'}], 'transcription': {'mdc': 'nn gH a.t.PL=f jp〈t〉n wsr.w nzw nb-tA.DU nb-jr(.t)-(j)x.t jsplt mAa-xrw', 'unicode': 'nn gḥ ꜥ.t.PL=f jp〈t〉n Wsr.w nzw nb-Tꜣ.DU nb-jr(.t)-(j)ḫ.t Jsplt mꜣꜥ-ḫrw'}, 'translations': {'de': ['Diese seine Glieder werden nicht matt sein, (die des) Osiris Königs, des Herrn der Beiden Länder, des Herrn des Rituals, Aspelta, des Gerechtfertigten.']}, 'type': None, 'wordCount': 11, 'editors': {'author': 'Doris Topmann', 'contributors': None, 'created': '2020-12-23 12:24:26', 'type': None, 'updated': '2022-08-29 10:22:01'}}
eg1 print(eg1)
{'publication_statement': {'credit_citation': 'Doris Topmann, Sentence ID 2CBOF5UQ7JGETCXG2CQKPCWDZM <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data/blob/v17/sentences/2CBOF5UQ7JGETCXG2CQKPCWDZM.json>, in: Thesaurus Linguae Aegyptiae: Raw Data <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data>, Corpus issue 17 (31 October 2022), ed. by Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig (first published: 22 September 2023)', 'collection_editors': 'Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig', 'data_engineers': {'input_software_BTS': ['Christoph Plutte', 'Jakob Höper'], 'database_curation': ['Simon D. Schweitzer'], 'data_transformation': ['Jakob Höper', 'R. Dominik Blöse', 'Daniel A. Werning']}, 'date_published_in_TLA': '2022-10-31', 'rawdata_first_published': '2023-09-22', 'corresponding_TLA_URL': 'https://thesaurus-linguae-aegyptiae.de/sentence/2CBOF5UQ7JGETCXG2CQKPCWDZM', 'license': 'Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0/>'}, 'context': {'line': 'III', 'paragraph': None, 'pos': 7, 'textId': '27KHHMEP4VHSDH737F2OFLKNSE', 'textType': 'Text', 'variants': 1}, 'eclass': 'BTSSentence', 'glyphs': {'mdc_compact': None, 'unicode': None}, 'id': '2CBOF5UQ7JGETCXG2CQKPCWDZM', 'relations': {'contains': [{'eclass': 'BTSAnnotation', 'id': 'DYJEAXFKBJAXJPVLJGWREJZJ5M', 'ranges': [{'end': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'start': '22TFIMS2CBBCFFCDSCAIT3HR3Y'}], 'type': 'ägyptologische Textsegmentierung'}], 'partOf': [{'eclass': 'BTSText', 'id': '27KHHMEP4VHSDH737F2OFLKNSE', 'name': 'Isis (HT 15, HT 14, HT 17)', 'type': 'Text'}]}, 'tokens': [{'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'PTCL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D35:N35', 'mdc_original': 'D35-N35', 'mdc_original_safe': None, 'mdc_tla': 'D35-N35', 'order': [1, 2], 'unicode': '𓂜𓈖'}, 'id': '22TFIMS2CBBCFFCDSCAIT3HR3Y', 'label': 'nn', 'lemma': {'POS': {'type': 'particle'}, 'id': '851961'}, 'transcription': {'mdc': 'nn', 'unicode': 'nn'}, 'translations': {'de': ['[Negationspartikel]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'SC.act.ngem.nom.subj_Neg.nn', 'lingGloss': 'V\\tam.act', 'numeric': 210020}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'W11-V28-A7', 'mdc_original': 'W11-V28-A7', 'mdc_original_safe': None, 'mdc_tla': 'W11-V28-A7', 'order': [2, 3, 4], 'unicode': '𓎼𓎛𓀉'}, 'id': 'IOLUGQXLCRGNLMTAPJ65LI7MHU', 'label': 'gḥ', 'lemma': {'POS': {'subtype': 'verb_3-lit', 'type': 'verb'}, 'id': '166480'}, 'transcription': {'mdc': 'gH', 'unicode': 'gḥ'}, 'translations': {'de': ['matt sein']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'Noun.pl.stpr.3sgm', 'lingGloss': 'N.f:pl:stpr', 'numeric': 70154}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D36:X1*F51B-Z2', 'mdc_original': 'D36-X1-F51B-Z2', 'mdc_original_safe': None, 'mdc_tla': 'D36-X1-F51B-Z2', 'order': [5, 6, 7, 8], 'unicode': '𓂝𓏏𓄹︀\U00013440𓏥'}, 'id': 'GUVBJUGCSVF5VN55PN6RYS4YLI', 'label': 'ꜥ,t.pl', 'lemma': {'POS': {'subtype': 'substantive_fem', 'type': 'substantive'}, 'id': '34550'}, 'transcription': {'mdc': 'a.t.PL', 'unicode': 'ꜥ.t.PL'}, 'translations': {'de': ['Glied; Körperteil']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': '-3sg.m', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'I9', 'mdc_original': 'I9', 'mdc_original_safe': None, 'mdc_tla': 'I9', 'order': [9], 'unicode': '𓆑'}, 'id': 'GIHCJ27JXVAM7GDUYWGEPKBRB4', 'label': '=f', 'lemma': {'POS': {'subtype': 'personal_pronoun', 'type': 'pronoun'}, 'id': '10050'}, 'transcription': {'mdc': '=f', 'unicode': '=f'}, 'translations': {'de': ['[Suffix Pron. sg.3.m.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'dem.f.pl', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M17-Q3:N35', 'mdc_original': 'M17-Q3-N35', 'mdc_original_safe': None, 'mdc_tla': 'M17-Q3-N35', 'order': [10, 11, 12], 'unicode': '𓇋𓊪𓈖'}, 'id': 'Z6HTGGPBPRDT3OZTZNXRF2GRDA', 'label': 'jp〈t〉n', 'lemma': {'POS': {'subtype': 'demonstrative_pronoun', 'type': 'pronoun'}, 'id': '850009'}, 'transcription': {'mdc': 'jp〈t〉n', 'unicode': 'jp〈t〉n'}, 'translations': {'de': ['diese [Dem.Pron. pl.f.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'D4-Q1-A40', 'mdc_original': 'D4-Q1-A40', 'mdc_original_safe': None, 'mdc_tla': 'D4-Q1-A40', 'order': [13, 14, 15], 'unicode': '𓁹𓊨𓀭'}, 'id': 'UCFJWBLRKJG4NJWTWT22WDR2MU', 'label': 'Wsr,w', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '49461'}, 'transcription': {'mdc': 'wsr.w', 'unicode': 'Wsr.w'}, 'translations': {'de': ['Osiris (Totentitel des Verstorbenen)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M23-X1:N35', 'mdc_original': 'M23-X1-N35', 'mdc_original_safe': None, 'mdc_tla': 'M23-X1-N35', 'order': [16, 17, 18], 'unicode': '𓇓𓏏𓈖'}, 'id': 'LI5FJI4ZUJEMPIKS5RQ5HHNBUE', 'label': 'nzw', 'lemma': {'POS': {'type': 'substantive'}, 'id': '88040'}, 'transcription': {'mdc': 'nzw', 'unicode': 'nzw'}, 'translations': {'de': ['König']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:N17-N17', 'mdc_original': 'V30-N17-N17', 'mdc_original_safe': None, 'mdc_tla': 'V30-N17-N17', 'order': [19, 20, 21], 'unicode': '𓎟𓇿𓇿'}, 'id': 'ICADWHGbHkfdokpooG4eCy3Zfe8', 'label': 'nb-Tꜣ,du', 'lemma': {'POS': {'subtype': 'epith_king', 'type': 'epitheton_title'}, 'id': '400038'}, 'transcription': {'mdc': 'nb-tA.DU', 'unicode': 'nb-Tꜣ.DU'}, 'translations': {'de': ['Herr der Beiden Länder (Könige)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:D4-Aa1*X1:Y1', 'mdc_original': 'V30-D4-Aa1-X1-Y1', 'mdc_original_safe': None, 'mdc_tla': 'V30-D4-Aa1-X1-Y1', 'order': [22, 23, 24, 25, 26], 'unicode': '𓎟𓁹𓐍𓏏𓏛'}, 'id': 'ICADWHT2O1dc30SXuRZUlquIDpM', 'label': 'nb-jr(,t)-(j)ḫ,t', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '400354'}, 'transcription': {'mdc': 'nb-jr(.t)-(j)x.t', 'unicode': 'nb-jr(.t)-(j)ḫ.t'}, 'translations': {'de': ['Herr des Rituals']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': '<-M17-O34:Q3-E23-N17->', 'mdc_original': '<-M17-O34-Q3-E23-N17->', 'mdc_original_safe': None, 'mdc_tla': '<-M17-O34-Q3-E23-N17->', 'order': [18, 19, 20, 21, 22, 23], 'unicode': '𓍹\U0001343c𓇋𓊃𓊪𓃭𓇿\U0001343d𓍺'}, 'id': 'J3MLYALWVNAMDDG33VZ3RIEEUA', 'label': 'Jsplt', 'lemma': {'POS': {'subtype': 'kings_name', 'type': 'entity_name'}, 'id': '850103'}, 'transcription': {'mdc': 'jsplt', 'unicode': 'Jsplt'}, 'translations': {'de': ['Aspelta']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N.m:sg', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'U5:D36-P8h', 'mdc_original': 'U5-D36-P8h', 'mdc_original_safe': None, 'mdc_tla': 'U5-D36-P8h', 'order': [25, 26, 27], 'unicode': '𓌷𓂝𓊤︂'}, 'id': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'label': 'mꜣꜥ-ḫrw', 'lemma': {'POS': {'subtype': 'substantive_masc', 'type': 'substantive'}, 'id': '66750'}, 'transcription': {'mdc': 'mAa-xrw', 'unicode': 'mꜣꜥ-ḫrw'}, 'translations': {'de': ['Gerechtfertigter (der selige Tote)']}, 'type': 'word'}], 'transcription': {'mdc': 'nn gH a.t.PL=f jp〈t〉n wsr.w nzw nb-tA.DU nb-jr(.t)-(j)x.t jsplt mAa-xrw', 'unicode': 'nn gḥ ꜥ.t.PL=f jp〈t〉n Wsr.w nzw nb-Tꜣ.DU nb-jr(.t)-(j)ḫ.t Jsplt mꜣꜥ-ḫrw'}, 'translations': {'de': ['Diese seine Glieder werden nicht matt sein, (die des) Osiris Königs, des Herrn der Beiden Länder, des Herrn des Rituals, Aspelta, des Gerechtfertigten.']}, 'type': None, 'wordCount': 11, 'editors': {'author': 'Doris Topmann', 'contributors': None, 'created': '2020-12-23 12:24:26', 'type': None, 'updated': '2022-08-29 10:22:01'}}
# parse the dictionary (json)
= []
unicodeHiero = []
transcription = []
translLemma = []
posLemma = []
tokenID
for text_word in eg1["tokens"] :
print(text_word["glyphs"]["unicode"], text_word["transcription"]["unicode"], text_word["translations"]["de"][0], text_word["lemma"]["POS"]["type"], text_word["id"] )
"id"])
tokenID.append(text_word["glyphs"]["unicode"])
unicodeHiero.append(text_word["translations"]["de"][0])
translLemma.append(text_word["lemma"]["POS"]["type"])
posLemma.append(text_word[
if text_word["transcription"]["unicode"][0] == "=" : # replace equal sign as it will cause trouble in spreadsheet software like MS Excel
"transcription"]["unicode"].replace("=", '⸗')) # U+2E17
transcription.append(text_word[else :
"transcription"]["unicode"]) transcription.append(text_word[
𓂜𓈖 nn [Negationspartikel] particle 22TFIMS2CBBCFFCDSCAIT3HR3Y
𓎼𓎛𓀉 gḥ matt sein verb IOLUGQXLCRGNLMTAPJ65LI7MHU
𓂝𓏏𓄹︀𓑀𓏥 ꜥ.t.PL Glied; Körperteil substantive GUVBJUGCSVF5VN55PN6RYS4YLI
𓆑 =f [Suffix Pron. sg.3.m.] pronoun GIHCJ27JXVAM7GDUYWGEPKBRB4
𓇋𓊪𓈖 jp〈t〉n diese [Dem.Pron. pl.f.] pronoun Z6HTGGPBPRDT3OZTZNXRF2GRDA
𓁹𓊨𓀭 Wsr.w Osiris (Totentitel des Verstorbenen) epitheton_title UCFJWBLRKJG4NJWTWT22WDR2MU
𓇓𓏏𓈖 nzw König substantive LI5FJI4ZUJEMPIKS5RQ5HHNBUE
𓎟𓇿𓇿 nb-Tꜣ.DU Herr der Beiden Länder (Könige) epitheton_title ICADWHGbHkfdokpooG4eCy3Zfe8
𓎟𓁹𓐍𓏏𓏛 nb-jr(.t)-(j)ḫ.t Herr des Rituals epitheton_title ICADWHT2O1dc30SXuRZUlquIDpM
𓍹𓇋𓊃𓊪𓃭𓇿𓍺 Jsplt Aspelta entity_name J3MLYALWVNAMDDG33VZ3RIEEUA
𓌷𓂝𓊤︂ mꜣꜥ-ḫrw Gerechtfertigter (der selige Tote) substantive OKLGJLCEQFHU7HDRYUTYR352YA
# get the ID of this sentence
= eg1["id"] sentenceID
# create a dataframe and fill it
import pandas as pd
= pd.DataFrame({
df_eg 'unicode_hieroglyphs': unicodeHiero,
'unicode_transcription': transcription,
'lemma_translation': translLemma,
'part-of-speech': posLemma,
'tokenID' : tokenID
})
df_eg
unicode_hieroglyphs | unicode_transcription | lemma_translation | part-of-speech | tokenID | |
---|---|---|---|---|---|
0 | 𓂜𓈖 | nn | [Negationspartikel] | particle | 22TFIMS2CBBCFFCDSCAIT3HR3Y |
1 | 𓎼𓎛𓀉 | gḥ | matt sein | verb | IOLUGQXLCRGNLMTAPJ65LI7MHU |
2 | 𓂝𓏏𓄹︀𓑀𓏥 | ꜥ.t.PL | Glied; Körperteil | substantive | GUVBJUGCSVF5VN55PN6RYS4YLI |
3 | 𓆑 | ⸗f | [Suffix Pron. sg.3.m.] | pronoun | GIHCJ27JXVAM7GDUYWGEPKBRB4 |
4 | 𓇋𓊪𓈖 | jp〈t〉n | diese [Dem.Pron. pl.f.] | pronoun | Z6HTGGPBPRDT3OZTZNXRF2GRDA |
5 | 𓁹𓊨𓀭 | Wsr.w | Osiris (Totentitel des Verstorbenen) | epitheton_title | UCFJWBLRKJG4NJWTWT22WDR2MU |
6 | 𓇓𓏏𓈖 | nzw | König | substantive | LI5FJI4ZUJEMPIKS5RQ5HHNBUE |
7 | 𓎟𓇿𓇿 | nb-Tꜣ.DU | Herr der Beiden Länder (Könige) | epitheton_title | ICADWHGbHkfdokpooG4eCy3Zfe8 |
8 | 𓎟𓁹𓐍𓏏𓏛 | nb-jr(.t)-(j)ḫ.t | Herr des Rituals | epitheton_title | ICADWHT2O1dc30SXuRZUlquIDpM |
9 | 𓍹𓇋𓊃𓊪𓃭𓇿𓍺 | Jsplt | Aspelta | entity_name | J3MLYALWVNAMDDG33VZ3RIEEUA |
10 | 𓌷𓂝𓊤︂ | mꜣꜥ-ḫrw | Gerechtfertigter (der selige Tote) | substantive | OKLGJLCEQFHU7HDRYUTYR352YA |
# save as *.csv
= "aspelta_TLA_Sentence_" + sentenceID + ".csv"
fileName df_eg.to_csv(fileName)
Akkadian Example 2:
consider the following Akkadian text:
http://www.achemenet.com//fr/item/?/sources-textuelles/textes-par-publication/Strassmaier_Cyrus/1665118
6 udu-nita2 ina šuII Iden-gi a-šú šá Id[
a-na 8 gín 4-tú kù-babbar i-na kù-babbar
šá i-di é [ o o o ] a-na é-babbar-ra
it-ta-din 5 udu-nita2 šá Ika-ṣir
a-šú šá Iden-mu a-na 7 gín 4-tú
kù-babbar šá muh-hi dul-lu Imu-mu
ú-šá-hi-su a-na lìb-bi sì-na
1 udu-nita2 a-na 1 gín 4-tú kù-babbar
ina šuII Idutu-ba-šá! [
1 udu-nita2 šá IDU-[
a-na 1? gín [
pap [13 udu-nita2-meš
iti du6 u4 [o-kam] mu sag nam-lugal-la
Iku-ra-áš lugal tin-tirki u kur-kur
How would you preprocess this text?
= """ 6 udu-nita<sub>2</sub> <i>ina</i> šu<sup>II</sup> <sup>Id</sup>en-gi a-<i>šú šá</i> <sup>Id</sup>[
akk2 <i>a-na</i> 8 gín 4-<i>tú</i> kù-babbar <i>i-na</i> kù-babbar
<i>šá</i> <i>i-di</i> é [ o o o ] <i>a-na</i> é-babbar-ra
<i>it-ta-din</i> 5 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup><i>ka-ṣir</i>
a-<i>šú šá</i> <sup>Id</sup>en-mu <i>a-na</i> 7 gín 4-<i>tú</i>
kù-babbar <i>šá</i> <i>muh-hi</i> <i>dul-lu</i> <sup>I</sup>mu-mu
<i>ú-šá-hi-su a-na</i> <i>lìb-bi</i> sì-<i>na</i>
1 udu-nita<sub>2</sub> <i>a-na</i> 1 gín 4-<i>tú </i>kù-babbar
<i>ina</i> šu<sup>II</sup> <sup>Id</sup>utu-ba-<i>šá</i><sup>!</sup> [
1 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup>DU-[
<i>a-na</i> 1<sup>?</sup> gín [
pap [13 udu-nita<sub>2</sub>-meš
iti du<sub>6</sub> u<sub>4</sub> [o-kam] mu sag nam-lugal-la
<sup>I</sup><i>ku-ra-áš</i> lugal tr/i> kur-kur """
= akk2.split('\n')
akk2_lines print(akk2_lines)
[' 6 udu-nita<sub>2</sub> <i>ina</i> šu<sup>II</sup> <sup>Id</sup>en-gi a-<i>šú šá</i> <sup>Id</sup>[', ' <i>a-na</i> 8 gín 4-<i>tú</i> kù-babbar <i>i-na</i> kù-babbar', ' <i>šá</i> <i>i-di</i> é [ o o o ] <i>a-na</i> é-babbar-ra', ' <i>it-ta-din</i> 5 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup><i>ka-ṣir</i>', ' a-<i>šú šá</i> <sup>Id</sup>en-mu <i>a-na</i> 7 gín 4-<i>tú</i>', ' kù-babbar <i>šá</i> <i>muh-hi</i> <i>dul-lu</i> <sup>I</sup>mu-mu', ' <i>ú-šá-hi-su a-na</i> <i>lìb-bi</i> sì-<i>na</i>', ' 1 udu-nita<sub>2</sub> <i>a-na</i> 1 gín 4-<i>tú </i>kù-babbar', ' <i>ina</i> šu<sup>II</sup> <sup>Id</sup>utu-ba-<i>šá</i><sup>!</sup> [', ' 1 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup>DU-[', ' <i>a-na</i> 1<sup>?</sup> gín [', ' pap [13 udu-nita<sub>2</sub>-meš', ' iti du<sub>6</sub> u<sub>4</sub> [o-kam] mu sag nam-lugal-la', ' <sup>I</sup><i>ku-ra-áš</i> lugal tin-tir<sup>ki</sup> <i>u</i> kur-kur ']
= 1
line_count for line in akk2_lines:
print("line:", line_count)
= line.split()
words = 1
word_count
for word in words :
print(word_count, end=" ")
if word.startswith("<i>") and word.endswith("</i>") :
#print(word)
= word.split("-")
signList for sign in signList :
#print(sign)
= sign.replace("<i>", "")
sign = sign.replace("</i>", "")
sign = [sign, {"sign_function": "phon" }]
sign print(sign)
elif "-<i>" in word :
= word.split("-")
signList for sign in signList :
#print(sign)
if sign.startswith("<i>") :
= sign.replace("<i>", "")
sign = [sign, {"sign_function": "phon" }]
sign print(sign)
else:
print([sign, {"sign_function": "log" }])
elif word.endswith("</i>") :
= word.replace("</i>", "")
word = [word, {"sign_function": "phon" }]
word print(word)
#elif :
# print(word)
elif "</sup" in word :
#print(word.split("-"), "logogram")
= word.split("-")
signCluster #print(signCluster)
for elem in signCluster :
if "</sup>" in elem :
#print("yes")
= elem.replace("</sup>", "</class>-")
elem = elem.replace("<sup>", "-<class>")
elem = elem.split("-")
signs for sign in signs :
if sign.startswith("<class>") and sign.endswith("</class>"):
print([sign[7:-8], {"sign_function": "class" }])
else:
print([sign, {"sign_function": "log" }])
else:
print([elem, {"sign_function": "log" }])
else:
if "-" in word :
= word.split("-")
signs for sign in signs :
print([sign, {"sign_function": "log" }])
else:
print([word, {"sign_function": "log" }])
+=1
word_count
+= 1
line_count print("------------------")
line: 1
1 ['6', {'sign_function': 'log'}]
2 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
3 ['ina', {'sign_function': 'phon'}]
4 ['šu', {'sign_function': 'log'}]
['II', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
5 ['', {'sign_function': 'log'}]
['Id', {'sign_function': 'class'}]
['en', {'sign_function': 'log'}]
['gi', {'sign_function': 'log'}]
6 ['a', {'sign_function': 'log'}]
['šú', {'sign_function': 'phon'}]
7 ['šá', {'sign_function': 'phon'}]
8 ['', {'sign_function': 'log'}]
['Id', {'sign_function': 'class'}]
['[', {'sign_function': 'log'}]
------------------
line: 2
1 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
2 ['8', {'sign_function': 'log'}]
3 ['gín', {'sign_function': 'log'}]
4 ['4', {'sign_function': 'log'}]
['tú</i>', {'sign_function': 'phon'}]
5 ['kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
6 ['i', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
7 ['kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
------------------
line: 3
1 ['šá', {'sign_function': 'phon'}]
2 ['i', {'sign_function': 'phon'}]
['di', {'sign_function': 'phon'}]
3 ['é', {'sign_function': 'log'}]
4 ['[', {'sign_function': 'log'}]
5 ['o', {'sign_function': 'log'}]
6 ['o', {'sign_function': 'log'}]
7 ['o', {'sign_function': 'log'}]
8 [']', {'sign_function': 'log'}]
9 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
10 ['é', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
['ra', {'sign_function': 'log'}]
------------------
line: 4
1 ['it', {'sign_function': 'phon'}]
['ta', {'sign_function': 'phon'}]
['din', {'sign_function': 'phon'}]
2 ['5', {'sign_function': 'log'}]
3 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
4 ['šá', {'sign_function': 'phon'}]
5 ['<sup>I</sup><i>ka-ṣir', {'sign_function': 'phon'}]
------------------
line: 5
1 ['a', {'sign_function': 'log'}]
['šú', {'sign_function': 'phon'}]
2 ['šá', {'sign_function': 'phon'}]
3 ['', {'sign_function': 'log'}]
['Id', {'sign_function': 'class'}]
['en', {'sign_function': 'log'}]
['mu', {'sign_function': 'log'}]
4 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
5 ['7', {'sign_function': 'log'}]
6 ['gín', {'sign_function': 'log'}]
7 ['4', {'sign_function': 'log'}]
['tú</i>', {'sign_function': 'phon'}]
------------------
line: 6
1 ['kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
2 ['šá', {'sign_function': 'phon'}]
3 ['muh', {'sign_function': 'phon'}]
['hi', {'sign_function': 'phon'}]
4 ['dul', {'sign_function': 'phon'}]
['lu', {'sign_function': 'phon'}]
5 ['', {'sign_function': 'log'}]
['I', {'sign_function': 'class'}]
['mu', {'sign_function': 'log'}]
['mu', {'sign_function': 'log'}]
------------------
line: 7
1 ['<i>ú', {'sign_function': 'log'}]
['šá', {'sign_function': 'log'}]
['hi', {'sign_function': 'log'}]
['su', {'sign_function': 'log'}]
2 ['a-na', {'sign_function': 'phon'}]
3 ['lìb', {'sign_function': 'phon'}]
['bi', {'sign_function': 'phon'}]
4 ['sì', {'sign_function': 'log'}]
['na</i>', {'sign_function': 'phon'}]
------------------
line: 8
1 ['1', {'sign_function': 'log'}]
2 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
3 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
4 ['1', {'sign_function': 'log'}]
5 ['gín', {'sign_function': 'log'}]
6 ['4', {'sign_function': 'log'}]
['tú', {'sign_function': 'phon'}]
7 ['</i>kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
------------------
line: 9
1 ['ina', {'sign_function': 'phon'}]
2 ['šu', {'sign_function': 'log'}]
['II', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
3 ['<sup>Id</sup>utu', {'sign_function': 'log'}]
['ba', {'sign_function': 'log'}]
['šá</i><sup>!</sup>', {'sign_function': 'phon'}]
4 ['[', {'sign_function': 'log'}]
------------------
line: 10
1 ['1', {'sign_function': 'log'}]
2 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
3 ['šá', {'sign_function': 'phon'}]
4 ['', {'sign_function': 'log'}]
['I', {'sign_function': 'class'}]
['DU', {'sign_function': 'log'}]
['[', {'sign_function': 'log'}]
------------------
line: 11
1 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
2 ['1', {'sign_function': 'log'}]
['?', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
3 ['gín', {'sign_function': 'log'}]
4 ['[', {'sign_function': 'log'}]
------------------
line: 12
1 ['pap', {'sign_function': 'log'}]
2 ['[13', {'sign_function': 'log'}]
3 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
['meš', {'sign_function': 'log'}]
------------------
line: 13
1 ['iti', {'sign_function': 'log'}]
2 ['du<sub>6</sub>', {'sign_function': 'log'}]
3 ['u<sub>4</sub>', {'sign_function': 'log'}]
4 ['[o', {'sign_function': 'log'}]
['kam]', {'sign_function': 'log'}]
5 ['mu', {'sign_function': 'log'}]
6 ['sag', {'sign_function': 'log'}]
7 ['nam', {'sign_function': 'log'}]
['lugal', {'sign_function': 'log'}]
['la', {'sign_function': 'log'}]
------------------
line: 14
1 ['<sup>I</sup><i>ku-ra-áš', {'sign_function': 'phon'}]
2 ['lugal', {'sign_function': 'log'}]
3 ['tin', {'sign_function': 'log'}]
['tir', {'sign_function': 'log'}]
['ki', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
4 ['u', {'sign_function': 'phon'}]
5 ['kur', {'sign_function': 'log'}]
['kur', {'sign_function': 'log'}]
------------------
This notebook was created by Eliese-Sophia Lincke and Shai Gordin in Fall 2024 for the course Ancient Language Processing, incorporating materials from Avital Romach. Code can be reused under a CC-BY 4.0