This is an exploratory analysis of demographic data from Vall de Almonacid (https://commons.wikimedia.org/wiki/File:VGeneral_VallAlmonacid.jpg). Vall de Almonacid (Spain) is a small village on the Valencian region. Of Muslim origin, it was conquered, refounded and repopulated with Catholics from the Aragon Kingdom in 1238.
In the church there are several record books where births, marriages and funerals were recorded by the church prist. In the 1980s these records were taken by Josep Maria Perez Rodriguez who copied them by hand with the help of Jose Maria Perez Rodriguez, Maria Rovira Barbera, Pilar Perez Rodriguez, and some others.
The hand-copied files were later digitalised (by hand, again) and a copy of these files is the starting point of this analysis. Not all data was digitalised and many typos and errors are present in the dataset.
For this analysis, I opted for Python (v.3.5.2+) as a flexible yet user-friendly tool to explore data.
I have previosuly done some visualisation experiments with this data using d3.js that you can find at https://fuyas.github.io/nacimientos/ (in Spanish).
In addition, there is a small report study authored by Josep Maria Perez Rodriguez about this demographic data (http://perezrovira.net/adria/pdfs/icap_demografia_JM.pdf, in Spanish).
All modules are commonly used modules
# plot figures inline in the IPython notebook %matplotlib inline import csv import numpy as np import datetime import matplotlib.pyplot as plt from collections import Counter # set width of figures by default plt.rcParams['figure.figsize'] = (14.0, 6.0)
To import the data I created a small function that reads the CSV files (there are two, one for baptisms and one for funerals). The function returns the header (name of the fields stored on the first line of the CSV file), a numpy 2D array with all fields as strings, and a second copy with all fields capitalised. The second copy is used to compare strings, where capitalisation might differ due to the poor quality of data. In addition to read the data the function prints a list of all the fields, detailing the number of entries for each field and the value and frequency of the most common entry.
def readDB( str_file ): file = open(str_file, "r") reader = csv.reader(file) rownum = 0 allData =  ALLDATA =  for iRow, row in enumerate(reader): # Header row. if iRow == 0: header = row else: allData.append( row ) ALLDATA.append( [x.upper() for x in row] ) file.close() # put all data in numpy npData = np.array(allData) NPDATA = np.array(ALLDATA) print ( "\nNumber of Entries: %4d" % npData.shape ) print ( "Number of Columns: %4d\n" % npData.shape ) # Check number of entries per column and the most common value print ( " # - Unique - Most common" ) for idx in range(len(header)): vec = NPDATA[:,idx] vec2 = vec [ NPDATA[:,idx] != '' ] if len(vec2) > 0: count = Counter(vec2) print ( "[%2s] %-15s: %4d - %6d - ( %4d | %s )" % (idx, header[idx], vec2.size, len(count), count.most_common(1), count.most_common(1)) ) else: print ( "[%2s] %-15s: %4d" % (idx, header[idx], vec2.size) ) return [header, npData, NPDATA]
[headerBirth, npBirthData, NPBIRTHDATA] = readDB ('data/allNacimientos.csv')
Number of Entries: 5237 Number of Columns: 42 # - Unique - Most common [ 0] NOMBRE : 5237 - 2442 - ( 241 | MANUEL ) [ 1] APELL1 : 5221 - 461 - ( 381 | RODRIGUEZ ) [ 2] APELL2 : 5217 - 569 - ( 305 | RODRIGUEZ ) [ 3] SEXO : 5237 - 2 - ( 2722 | HOMBRE ) [ 4] NACIMIENTO : 5237 - 4967 - ( 4 | 03/07/1839 ) [ 5] DIA : 5237 - 366 - ( 26 | 83 ) [ 6] YEAR : 5237 - 216 - ( 53 | 1831 ) [ 7] YEAR_ID : 5237 - 53 - ( 216 | 6 ) [ 8] DEFUNCION : 0 [ 9] BAUTISMO : 1117 - 1092 - ( 3 | 04/07/39 )  ENTERRAMIE : 0  TESTIGDEF : 0  ALBACEDEF : 0  ORIGEN : 8 - 6 - ( 3 | PARVULO )  VECINDAD : 0  ESTADCIVIL : 0  LIBRAS : 0  DINEROS : 0  NOMPADRE : 5224 - 237 - ( 702 | MANUEL )  APELL2PAD : 5222 - 1133 - ( 167 | RODRIGUEZ )  NOMMADRE : 5210 - 282 - ( 986 | MARIA )  APELL2MAD : 5217 - 1265 - ( 96 | SANCHO )  VECINDPAD : 2 - 2 - ( 1 | PASAGEROS POBRES )  ORIGPAD : 25 - 18 - ( 4 | (1) )  ORIGMAD : 55 - 37 - ( 4 | SEGORBE )  OFICIOPAD : 533 - 65 - ( 381 | LABRADOR )  OFICIOMAD : 0  NPADRINO : 3720 - 248 - ( 437 | MANUEL )  NMADRINA : 3794 - 309 - ( 454 | MARIA )  OFPADRINO : 331 - 109 - ( 51 | CIRUJANO )  OFMADRINA : 8 - 7 - ( 2 | CRIADA DEL GOVERNADOR )  VECPADRINO : 111 - 55 - ( 18 | ALXIMIA )  VECMADRINA : 109 - 47 - ( 20 | ALXIMIA )  PARENTESCP : 898 - 78 - ( 224 | M )  PARENTESCM : 868 - 234 - ( 227 | D )  CONYGE1 : 0  FEMATRI1 : 0  ECIVCONY1 : 0  NOTARIO1 : 0  OBSERVACN : 1798 - 1425 - ( 27 | (1) ABUELOS PATERNOS SE IGNORAN )  APELLMADRINA : 3795 - 477 - ( 295 | RODRIGUEZ )  APELLPADRINO : 3716 - 482 - ( 424 | RODRIGUEZ )
The births CSV file contains 5237 entries with 42 columns. The number of entries for each column varies widely and 12 contain no data at all. We can already see that Rodriguez is the most common surname and Manuel the most common name. Looking at " NOMMADRE" we can see that the most common name for mothers, hence women, is Maria.
Other interesting data, albeit expected, is that the most common job " OFICIOPAD" is "LABRADOR" (english: farmer) while the most common job for the padrino (Godfather) is CIRUJANO (surgeon). From this it is possible to infer that the Godfather role was most commonly taken by a higher social class person.
In addition there are 1798 comments "observaciones" with rich, yet unstructured, information, for example:
print( npBirthData[427,39] )
bautizé sup condicione un niño enmcontrado cuyos padres se ignoran el qual fue allado a la puerta de Felipe TORRES a media noche
English: I baptised sup condicione a boy with unknown parents who was found at the door of Felipe Torres at midnight.
print( npBirthData[4013,39] )
Fue bautizada en el acto de la nacencia por su abuela Theodora Pitarque y murio immediatamente. Examine la forma e intencion del bautismo y allí que estubo bien administrado
English: Was baptised at birth by her grandmother Theodora Pitarque and died immediately. I examined the intention and correctness of the baptism and found that it was well performed.
[headerDeath, npDeathData, NPDEATHDATA] = readDB ('data/allDefunciones.csv')
Number of Entries: 4330 Number of Columns: 28 # - Unique - Most common [ 0] ALBACEDEF : 471 - 401 - ( 18 | EL RETOR ) [ 1] APELL1 : 3115 - 440 - ( 244 | RODRIGUEZ ) [ 2] APELL2 : 1341 - 212 - ( 128 | RODRIGUEZ ) [ 3] APELL2MAD : 2246 - 316 - ( 186 | RODRIGUEZ ) [ 4] APELL2PAD : 2483 - 282 - ( 222 | RODRIGUEZ ) [ 5] CONYGE1 : 1275 - 898 - ( 17 | D ) [ 6] DEFUNCION : 3704 - 3466 - ( 4 | 5 6 1748 ) [ 7] ECIVCONY1 : 1193 - 13 - ( 787 | ESP ) [ 8] DEFUNCION2 : 3703 - 3458 - ( 4 | 25/3/1826 ) [ 9] edatTEXT : 2649 - 176 - ( 1523 | P )  ENTERRAMIE : 3155 - 2856 - ( 11 | 25 )  ENTERRAMIENTO2 : 3155 - 2856 - ( 11 | 25 )  LIBRAS : 31 - 9 - ( 7 | 20 )  NOMBRE : 3978 - 742 - ( 273 | MARIA )  NOMMADRE : 2429 - 246 - ( 442 | MARIA )  NOMPADRE : 2673 - 205 - ( 368 | MANUEL )  NOTARIO1 : 40 - 29 - ( 9 | LABRADOR )  OBSERVACN : 2407 - 1548 - ( 131 | IG )  OFICIO : 334 - 78 - ( 84 | LABRADOR )  OFICIOMAD : 3 - 3 - ( 1 | AMA )  OFICIOPAD : 235 - 36 - ( 109 | LABRADORES )  ORIGPAD : 9 - 6 - ( 3 | ALGIMIA )  SEXO : 3923 - 4 - ( 2034 | HOMBRE )  SUELDOS : 4 - 3 - ( 2 | 0 )  TESTIGDEF : 1312 - 439 - ( 545 | EL PARROCO )  TIPO : 340 - 48 - ( 262 | POBRE )  VECINDAD : 750 - 58 - ( 621 | Ç )  VECINDPAD : 18 - 6 - ( 11 | Ç )
Again, the most common surname is Rodriguez, given names are Manuel and Maria, and the most common job is "labrador". At this point we can see that the data is very incomplete with only 334 job entries out the 4330 subjects. However names of parents are plentiful (~2500, so we can hope to reconstruct some genealogy tree later on) and there are also observaciones (comments) in 2407 entries, which are quite interesting:
print( npDeathData[1051,17] )
IG HF No sacramentos: murió de repente sin poderlo remediar pues cayó en un lagar lleno de vino
English: IG, HF (Hijo de Familia, denotes a teenager), No Sacrament: Died suddenly without remedy because he fell in a wine press full of wine.
print( npDeathData[3794,17] )
Se encontro asesinado el dia 14 en la partida llamada mojonada de este termino natural y vecino de Segorbe
English: Was found murdered the on 14th in the "Mojonada" in this municipality. He was from Segore (Segorbe is the largest town in the region, 8Km away from Vall de Almonacid)
print( npDeathData[2825,17] )
De ALTURA que estaba en este pueblo por huir de los Franceses Se le hizo entierro ordinario que pagaron sus hijos
English: From Altura (another village, 10 km away) who was in this village fleeing from the French. A regular funeral was performed paid by his children.
An yes, that was during the Spanish-French war (https://en.wikipedia.org/wiki/Peninsular_War):
print( npDeathData[2825,8] )
print( npDeathData[524,17] )
“aviendo enfermado de enfermedad mortal, dicho dia al amanecer fue allada aogada en la balsa de la huerta mayor, y dudando el pueblo de la Vall si se le avia de dar sepultura o, no, cosa que al Pueblo no pertenecia sino al señor obispo de Segorbe y su vicario General, el señor don Chisostomo ROYO DE CASTELLVI, o RENS de la Sta Iglesia de Segorbe, su oficial general dicho dia mandaron que asta que se averiguare con testigos e informasemos si se avia desesperado, o, no, que la metiesen en una caxa hecha de tablas y fuese llevado el cadaver, asta que le recibieren los testº y se hicieron las debidas deligencias todo lño qual se hizo obedeciuendo al dicho Rereen dissimo Sr, fue llevado el cadaver a la partida llamada rincon del pico de espadan cerca de la raya del termino distanmcia un tiro de piedra en donde ay un secano cuyo dueño vivae en la alcudia. Despues se recibieron ofrendas de personas fidelkes y testificaron que el dia de S Pedro la avian visto confesar con el cura por espoacio de dos horas y el mismo cura certifica que se comparo generalmente de toda su vida y aviendo tenido tar juntos de Teologos en diferentes dias se informaron del medico que la avia visitado en una enfermedad y dicho y testifico que su enfermedad le hallo en punto, que iba dedicando a delirio ultimamente dichos informaciones determino y mando a san Vicario general que se tocasen a muerto y se le dijera misa cantada de cuerpo presente y que el curo fuese personalmente al pico de Espedan y que se le vendijese la sepultaura solemnemente y que estubiese el cadaver depritado asta tanto que qede convertido y despues que rezen tratidos los mesos la misma caxa el cementerio a Campo Santo del dixo lugar en donde esta enterrado los demas fieles Serian diez libras para que se celebre bien por su alma y que se le diga misas rezadas.... Esta enterrado el cadaver en el cementerio al pie de la cruz en 09.08.1682
English: Being mortally ill she was found drowned on the fields watering pond. The village doubted if she was to be buried on the graveyard or not (suicides would prevent people to get buried there) but the decision was to be taken by the bishop and the church. The general official of the bishop then instructed to put the body in a plank box and leave it to the "Rincon del pico espadan", at the edge of the municipality, until the witnesses could testify and the investigation was concluded. Witnesses testified that they had seen her confessing with the priest for 2 hours and the priest detailed that she had confesses about her entire life. Theologians consulted the doctor that had attended her and he explained that her illness induced her delirium. With this information the general vicar instructed to toll for the dead and that she was to have sang mass with the deceased's body present. In addition, once the body had been converted, the priest was to go personally to the "Rincon del pico espadan" and perform the funeral duties solemnly, accompanying the body until she was buried in the graveyard with all other Catholics.
Now that we have an idea of what is in the data, let's see what can we learn from it.
To check the most common entries of each field I wrote the following function, which given a list of entries relies on the Counter structure to sort the items in the list and print them in order, with the number of occurrences in brackets:
def printMostCommon( vec, n=float("inf") ): vec2 = vec [ vec != '' ] count = Counter(vec2) print ("Most common entries:") for idx in range(len(count)): if idx > n-1: break print ( "%4d : %s" % (count.most_common()[idx], count.most_common()[idx]) )
A quick look to jobs show that something is amiss.
printMostCommon( NPDEATHDATA[:,18], 15 )
Most common entries: 84 : LABRADOR 37 : LABRADORA 36 : HOSPITAL GENERAL 21 : STO HOSP 18 : HOSP GRAL VCIA 16 : AMA 10 : STO HOSPITAL 6 : PASTOR 6 : LABRADOR POBRE 5 : LABRADORA POBRE 4 : TEXEDOR 4 : LABRADORES 4 : HOSPITAL GENERAL DE VALENCIA 3 : CIRUJANO 3 : SASTRE
So followed by farmer, the most common job was... farmer (but the female version of it)!
Nevertheless something is amiss here, as the 3rd, 4th, and 5th most common jobs are a hospital. The most plausible explanation though is that these were kids that had been taken from a hospital (couples who couldn't have kids, or only had girls, usually did that back in the day) and died young, and the priest (who wrote these records) opted to mention that in the job field.
Notice that beside farmer, another "popular" job was "LABRADOR POBRE" (Poor farmer).
printMostCommon( NPDEATHDATA[:,13], 15 )
Most common entries: 273 : MARIA 271 : MANUEL 183 : FRANCISCO 148 : JOSEPH 148 : MIGUEL 127 : MANUELA 124 : JOSEF 120 : VICENTE 111 : ROSA 80 : JUAN 79 : FRANCISCA 78 : VICENTA 67 : TERESA 64 : MARIA ROSA 51 : ANTONIO
Looking at the 15 most common names we can see Joseph and Josef add to 272 occurrences, overtaking Manuel as the most common name. And that is without counting all the Jose and compound names so popular in Spain.
It is common in Spain to bury the deceased in the few immediate days after death. That contrasts with other Catholic countries, like Ireland, where burials can happen up to a week later than the death.
We can check the distance between death and burial, as we have these two dates in NPDEATHDATA[:,8] and NPDEATHDATA[:,11]. For this we have to use the try instruction to avoid getting errors when there is no date string or this is uncomplete.
diffBurial =  for ii, defuncion in enumerate(NPDEATHDATA[:,8]): if (defuncion != '') and (NPDEATHDATA[ii,11] != ''): try: deathDate = datetime.datetime.strptime(defuncion, '%d/%m/%Y').date() burialDate = datetime.datetime.strptime(NPDEATHDATA[ii,11], '%d/%m/%Y').date() diffBurial.append( (burialDate - deathDate).days ) except: pass delay = Counter(diffBurial) print ( "\nDelay between death and funeral for %d funerals:" % len(diffBurial)) for idx in range(len(delay)): print ( "%4d : %d dias" % (delay.most_common()[idx], delay.most_common()[idx]) )
Delay between death and funeral for 2349 funerals: 2283 : 1 dias 25 : 0 dias 13 : 2 dias 4 : -1 dias 3 : -30 dias 2 : 367 dias 1 : 3 dias 1 : 4 dias 1 : 5 dias 1 : 7 dias 1 : -89848 dias 1 : 10 dias 1 : -1460 dias 1 : 335 dias 1 : -364 dias 1 : 21 dias 1 : -96423 dias 1 : 90 dias 1 : 93 dias 1 : 29 dias 1 : 366 dias 1 : 8 dias 1 : -81479 dias 1 : 381 dias 1 : -2 dias
From 2349 funerals, 2283 (97.2%) happened the day immediately after the death, 25 the same day, 13 after 2 days, and a 7 between 3 and 21 days.
Keep in mind, however, that for the 4330 entreis, there is no recording of the death date in around 2000. Probably because dead and funeral happened at the same time and the priests opted for only filling one field in the book.
In addition it can clearly be seen that there are numerous typos on the data, as 12 entries report a funeral before the death. In 12 other cases it looks like there is shift likely to be due to a typo (1 month, 3 months, 1 year, etc).
Next step is to visualise the distribution of births and deaths by year, and also by month. For that I wrote these 3 functions:
# Hide the right and top axes def remove2axis(ax): ax.spines['right'].set_visible(False) ax.spines['top'].set_visible(False) # Only show ticks on the left and bottom spines ax.yaxis.set_ticks_position('left') ax.xaxis.set_ticks_position('bottom') # Modified z-score def MADscore(vec): median = np.median(vec) median_absolute_deviation = np.median([np.abs(v - median) for v in vec]) modified_z_scores = [0.6745 * (v - median) / median_absolute_deviation for v in vec] return modified_z_scores # Print and plot distributions def plotDateDistribution ( vec, str_ylabel, showPlot=True, showByYear=True, showByDay=False ): # str_day = ['Domingo', 'Lunes', 'Martes', 'Miercoles', 'Jueves', 'Viernes', 'Sabado'] # str_months = ['Enero', 'Febrero', 'Marzo', 'Abril', 'Mayo', 'Junio', # 'Julio', 'Agosto', 'Setiembre', 'Octubre', 'Noviembre', 'Diciembre'] str_day = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'] str_months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] distributionDay =  * 7 distributionMonth =  * 12 distributionYear =  * 2000 # put entries in bin for each occurance for d in vec: try: deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date() distributionDay[deathDate.weekday()] += 1 distributionMonth[deathDate.month-1] += 1 distributionYear[deathDate.year] += 1 except: pass # compute z-scores to detect outliers zScoreDay = MADscore(distributionDay) zScoreMonth = MADscore(distributionMonth) zScoreYear = MADscore(distributionYear) # find first and last year with data min_year = next(i for i, v in enumerate(distributionYear) if v > 0) max_year_tmp = next(i for i, v in enumerate(reversed(distributionYear)) if v > 0) max_year = len(distributionYear) - 1 - max_year_tmp yearRange = np.arange(min_year, max_year+1) # Print distributions on console if showByDay: print( "\nBy day: ") for i, v in enumerate(distributionDay): print ( "%12s : %4d ( %+.2f )" % (str_day[i], v, zScoreDay[i]) ) print( "\nBy month: ") for i, v in enumerate(distributionMonth): print ( "%12s : %4d ( %+.2f )" % (str_months[i], v, zScoreMonth[i]) ) if showPlot: fig = plt.figure() fig.suptitle("By Month") ax1 = plt.gca() ax1.bar(range(12), distributionMonth, align='center', linewidth = '0', alpha=0.8, color='darkorange') ax1.set_xticks( range(12) ) ax1.set_xticklabels( [v for v in str_months] ) ax1.set_ylabel( str_ylabel ) plt.margins(0.05, 0) x1,x2,y1,y2 = ax1.axis() ax1.axis((x1,x2,y1,y2 + 5)) remove2axis(ax1) if showByDay: fig = plt.figure() fig.suptitle("By Day") ax2 = plt.gca() ax2.bar(range(7), distributionDay, align='center', linewidth = '0', alpha=0.8, color='darkorange') ax2.set_xticks( range(7) ) ax2.set_xticklabels( [v for v in str_day] ) plt.ylabel( str_ylabel ) plt.margins(0.05, 0) x1,x2,y1,y2 = ax2.axis() ax2.axis((x1,x2,y1,y2 + 5)) remove2axis(ax2) if showByYear: values = distributionYear[min_year:max_year+1] fig = plt.figure() fig.suptitle("By Year") ax3 = plt.gca() ax3.bar( yearRange, values, align='center', width=1, linewidth = '0', alpha=0.8, color='darkorange') ax3.set_xticks( [v for v in yearRange if v % 10 == 0] ) plt.ylabel( str_ylabel ) plt.margins(0.03, 0) x1,x2,y1,y2 = ax3.axis() ax3.axis((x1,x2,y1,y2 + 5)) remove2axis(ax3) # fig.subplots_adjust(left=0.06, right=0.99, top= 0.99, bottom=0.05) plt.show() else: fig = None return [fig, distributionDay, distributionMonth, distributionYear]
Notice that this function also allows to explore the distributions of births and deaths across the days of the week. However I removed that part of the analysis as nothing interesting came out of it (all distributions were basically uniform)
[figBirth, birthByDay, birthByMonth, birthByYear] = plotDateDistribution ( npBirthData[:,4], '# Births', True )
By month: January : 447 ( +0.09 ) February : 475 ( +0.65 ) March : 570 ( +2.57 ) April : 438 ( -0.09 ) May : 513 ( +1.42 ) June : 452 ( +0.19 ) July : 462 ( +0.39 ) August : 354 ( -1.78 ) September : 336 ( -2.14 ) October : 432 ( -0.21 ) November : 408 ( -0.69 ) December : 350 ( -1.86 )
The distribution by month shows a higher rate of births in the spring. March, the month with highest number of births has a modified z-score of 2.57. So a bit of an outlier but nothing exceptional (usually an outlier is considered as such when the z-score is over 3.5).
Despite not statistically significant, December shows a reducition in births, probably due to the reduced conception during Lent.
On the distribution by year we can see a steady increase in births along time, with a marked boom in the 1750 to 1770.
[figDeath, deathByDay, deathByMonth, deathByYear] = plotDateDistribution ( npDeathData[:,8], '# Funerals', True )
By month: January : 322 ( +0.78 ) February : 236 ( -1.25 ) March : 272 ( -0.40 ) April : 293 ( +0.09 ) May : 277 ( -0.28 ) June : 259 ( -0.71 ) July : 364 ( +1.77 ) August : 486 ( +4.66 ) September : 325 ( +0.85 ) October : 285 ( -0.09 ) November : 267 ( -0.52 ) December : 316 ( +0.64 )
From the month distribution we see that August (A) has an abnormal spike of deaths with a modified z-score of +4.66 (Usually anything above 3.5 is considered outlier). In addition July and September are the 2nd and 3rd months with highest number of funerals.
At first we could think that given the high child mortality back then, a spike in deaths could be explained by a spike in births. However looking at the distribution of births previously plotted we see than that is not case, with August and Septembre being the months with lowest number of births. Could it be that summer and the spanish heat is worse for the population? Maybe it was due to accidents in the fields, wars being more common in the summer? We don't know, but we will come back to analysis in Section 3.4.
Distribution by year show no births from 1637 until 1663. This, I have been told, was due not because of a lack of births, but because the sheets from the records book of that period were tore. Probably to light a fire...
Similarly, in 1780 there are no deaths reported, but this is also probably due to missing records, as birth records exist therefore it is likely that the priest would have recorded funerals as well.
Another interesting fact is 1803, with that huge spike in deaths: Could it be wars, the plague, a bad crop? We don't know but we will take a look to that later on in Section 3.5, to see what the data says.
As seen in the distribution of deaths by month of the year, the summer months showed a higher mortality. This was a bit of a surprise for me, as I expected a nearly uniform distribution or an increase during winter, when the conditions are harsh (subzero temperatures are not rare during the winter months in this region).
In order to investigate any possible underlying cause that explains the super mortality spike, we first will separate adults from children:
In the field NPDEATHDATA[:,9] there is written an abbreviation of the age of people at their funerals. 'ALB' or 'A' (Albado/a), 'P' (Parvulo), 'N' (Niño/a), 'NIÑO' and 'NIÑA' denote children, while an empty field '' denote adults. HF (Hijo de familia) and the remaining indicators are excluded from this analysis.
printMostCommon( NPDEATHDATA[:,9], 15 )
Most common entries: 1523 : P 391 : A 158 : N 85 : ALB 57 : HF 25 : RN 23 : 2 19 : 1 16 : D 11 : MY 9 : NIÑA 8 : 4 7 : 18M 7 : 73 7 : NIÑO
# Separate adults and children adults =  children =  for i, v in enumerate(NPDEATHDATA[:,9]): if v == '': adults.append(NPDEATHDATA[i,8]) if v in ['P', 'ALB', 'A', 'N', 'NIÑA', 'ŃIÑO']: children.append(NPDEATHDATA[i,8]) print ("Total number of adult deaths: %d" % len(adults)) [figDeathAdults, deathByDayAdults, deathByMonthAdults, deathByYearAdults] = \ plotDateDistribution ( adults, '# Funerals Adults', True )
Total number of adult deaths: 1681 By month: January : 138 ( +1.00 ) February : 121 ( -0.10 ) March : 120 ( -0.16 ) April : 141 ( +1.19 ) May : 122 ( -0.03 ) June : 111 ( -0.74 ) July : 123 ( +0.03 ) August : 138 ( +1.00 ) September : 113 ( -0.61 ) October : 142 ( +1.25 ) November : 118 ( -0.29 ) December : 135 ( +0.80 )
When only looking at adult deaths, the distribution per months seems pretty niform, without a summer spike as before.
print ("Total number of children deaths: %d" % len(children)) [figDethChildren, deathByDayChildren, deathByMonthChildren, deathByYearChildren] = \ plotDateDistribution ( children, '# Funerals Children', True )
Total number of children deaths: 2166 By month: January : 154 ( +1.10 ) February : 91 ( -1.56 ) March : 130 ( +0.08 ) April : 118 ( -0.42 ) May : 120 ( -0.34 ) June : 126 ( -0.08 ) July : 211 ( +3.50 ) August : 316 ( +7.93 ) September : 181 ( +2.23 ) October : 121 ( -0.30 ) November : 124 ( -0.17 ) December : 150 ( +0.93 )
When we focus on children, the summer spike becomes increasingly pronounced (z-score +7.93). This should then exclude the explanation that summer deaths were related to an increase in labour activity or warfare, a task mainly performed by adults, or older children.
To exclude violent deaths as the cause of the summer spike in mortality, we can look at deaths separated by gender. A high rate of male deaths could indicate a spike in violent deaths, as men were more involved in warfare and farming labour:
# Spearate adults deaths by gender male =  female =  for i, v in enumerate(NPDEATHDATA[:,9]): if v == '': if NPDEATHDATA[i,22] == 'HOMBRE': male.append(NPDEATHDATA[i,8]) else: female.append(NPDEATHDATA[i,8]) print ("Total number of male adult deaths: %d" % len(male)) print ("Total number of female adult deaths: %d" % len(female))
Total number of male adult deaths: 796 Total number of female adult deaths: 885
And the ratio by month:
deathByMonthMaleAdults =  * 12 deathByMonthFemaleAdults =  * 12 for m in male: try: deathDate = datetime.datetime.strptime(m, '%d/%m/%Y').date() deathByMonthMaleAdults[deathDate.month-1] += 1 except: pass for f in female: try: deathDate = datetime.datetime.strptime(f, '%d/%m/%Y').date() deathByMonthFemaleAdults[deathDate.month-1] += 1 except: pass MFratio = np.array(deathByMonthMaleAdults) / (np.array(deathByMonthFemaleAdults) + np.array(deathByMonthMaleAdults)) * 100 MFscore = MADscore(MFratio) print( "\nBy month: ") for i, v in enumerate(MFratio): print ( "%12s : %4.1f ( %+.2f )" % (i, v, MFscore[i]) )
By month: 0 : 53.6 ( +1.84 ) 1 : 45.5 ( -0.35 ) 2 : 47.5 ( +0.20 ) 3 : 48.9 ( +0.58 ) 4 : 41.0 ( -1.55 ) 5 : 45.9 ( -0.22 ) 6 : 43.9 ( -0.77 ) 7 : 52.2 ( +1.45 ) 8 : 46.0 ( -0.20 ) 9 : 50.0 ( +0.87 ) 10 : 44.9 ( -0.49 ) 11 : 57.0 ( +2.75 )
fig = plt.figure() plt.plot(range(12), MFratio, alpha=0.8, linewidth=4, color='darkorange') plt.plot([-0, 11], [50, 50], linewidth=2, alpha=0.3, color='gray') plt.ylabel( 'Percentage of deaths that were male' ) plt.xticks( range(12), ['J', 'F', 'M', 'A', 'M', 'J', 'J', 'A', 'S', 'O', 'N', 'D'] ) plt.margins(0.05) remove2axis(plt.gca())
The rate of (adult) male deaths remains between 41% (May) to 57% (December), with an august rate of 52.2% in August. This constant ratio suggests that the increase in summer deaths were not violent in nature.
So while it is hard to extract conclusions from this limited dataset, the data shows that the summers had an increased mortality among children that was not related to violent deaths.
When we compare this data with other demographic studies we see that in "Evolucion Demografica de Cortes de Arenoso desde 1560 a 1660" by Antonio Poveda Ayora, MSc Thesis, Universidad de Valencia, 1982, a similar spike is reported in the month of August (with a 38% increase over the mean). In page 282 he reports:
"El máximo de defunciones que se registran en Cortes de Arenoso en los meses de agosto y septiembre puede deberse a la incidencia que, sobre estos meses, tendría la mortalidad de niños, pues las causas de la mortalidad infantil eran, fundamentalmente, las diarreas estivales antes de las lluvias de otoño, debidas a la sequedad, altas temperaturas, empleo de aguas contaminadas, etc."
English: "The maximum of deaths registered in Cortes de Arenoso in August and September can be due the high children's mortality, as main children mortality causes were summer diarrhea before autumn rains as consequence of dry weather, high temperatures, contaminated waters, etc."
We have seen that the period between 1750 and 1780 showed an increase in births that might be attributed to a higher standard of living. To confirm that we can take a look at child mortality, as there is an association between higher child mortality and poorer living conditions.
# we already have computed deathByYearChildren and deathByYearAdult #ChildrenMortalityRatio = np.array(deathByYearChildren) / (np.array(deathByYearAdults) + np.array(deathByYearChildren)) * 100 #CRscore = MADscore(ChildrenMortalityRatio) childrenMortalityRatio = [np.nan] * 2000 for i, v in enumerate(deathByYearChildren): if v > 0 or deathByYearAdults[i] > 0: childrenMortalityRatio[i] = deathByYearChildren[i] / (deathByYearChildren[i] + deathByYearAdults[i]) * 100 #print( childrenMortalityRatio ) # find first and last year with data min_year = next(i for i, v in enumerate(childrenMortalityRatio) if v != np.nan) max_year_tmp = next(i for i, v in enumerate(reversed(childrenMortalityRatio)) if v != np.nan) max_year = len(childrenMortalityRatio) - 1 - max_year_tmp yearRange = np.arange(min_year, max_year+1) fig = plt.figure() fig.suptitle("Child mortality by year") plt.bar(yearRange, childrenMortalityRatio[min_year:max_year+1], align='center', linewidth=0, width=1, color='darkorange', alpha=0.8) #plt.plot([-0, 11], [50, 50]) plt.ylabel( 'Children mortality ratio' ) plt.xticks( [v for v in yearRange if v % 10 == 0] ) plt.margins(0.03, 0) remove2axis(plt.gca())
It is clear from that graph that the expected decline in child mortality is not present in the period between 1750 to 1770. In addition we can see that there is virtually no child mortality between 1705 and 1750. This was clear not the case and the likely reason for that gap is that child mortality was not reported during this period.
Similarly, the 100% child mortality reported from 1840 to 1850 is probably a consequence of bad encoding of deaths of adults, or bad parsing of the data.
As seen in the plot of all deaths by year,a huge spike in mortality occurs in 1803. To study the possible explanation of that spike we first look at the distributions by month on that year:
deaths1803 =  deaths1803Adult =  deaths1803Child =  for i, d in enumerate(npDeathData[:,8]): try: deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date() if deathDate.year == 1803: deaths1803.append(d) if NPDEATHDATA[i,9] == '': deaths1803Adult.append(d) if NPDEATHDATA[i,9] in ['P', 'ALB', 'A', 'N', 'NIÑA', 'ŃIÑO']: deaths1803Child.append(d) except: pass [figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803, '# Funerals in 1803', True, False )
By month: January : 33 ( +9.44 ) February : 10 ( +1.69 ) March : 6 ( +0.34 ) April : 5 ( +0.00 ) May : 5 ( +0.00 ) June : 1 ( -1.35 ) July : 3 ( -0.67 ) August : 7 ( +0.67 ) September : 4 ( -0.34 ) October : 5 ( +0.00 ) November : 0 ( -1.69 ) December : 3 ( -0.67 )
If we lat at the distribution independently for children and adults:
[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803Adult, '# Adult funerals in 1803', True, False ) [figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803Child, '# Children funerals in 1803', True, False )
By month: January : 1 ( +0.67 ) February : 0 ( -0.67 ) March : 4 ( +4.72 ) April : 1 ( +0.67 ) May : 0 ( -0.67 ) June : 0 ( -0.67 ) July : 0 ( -0.67 ) August : 1 ( +0.67 ) September : 0 ( -0.67 ) October : 1 ( +0.67 ) November : 0 ( -0.67 ) December : 1 ( +0.67 )
By month: January : 32 ( +9.44 ) February : 10 ( +2.02 ) March : 2 ( -0.67 ) April : 4 ( +0.00 ) May : 5 ( +0.34 ) June : 1 ( -1.01 ) July : 3 ( -0.34 ) August : 6 ( +0.67 ) September : 4 ( +0.00 ) October : 4 ( +0.00 ) November : 0 ( -1.35 ) December : 2 ( -0.67 )
We can see that the exceptional number of 33 deaths (of which 32 where children) occured in the single month of January 1803. Was that a trend coming from 1802?
deaths1802 =  for d in npDeathData[:,8]: try: deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date() if deathDate.year == 1802: deaths1802.append(d) except: pass [figDeath1802, deathByDay1802, deathByMonth1802, deathByYear1802] = plotDateDistribution ( deaths1802, '# Funerals 1802', True, False )
By month: January : 4 ( +0.45 ) February : 0 ( -1.35 ) March : 5 ( +0.90 ) April : 3 ( +0.00 ) May : 5 ( +0.90 ) June : 1 ( -0.90 ) July : 1 ( -0.90 ) August : 2 ( -0.45 ) September : 2 ( -0.45 ) October : 3 ( +0.00 ) November : 3 ( +0.00 ) December : 5 ( +0.90 )
No, January 1803 was an isolated case of high mortality. Maybe a bad crop, a strong winter, or a sever flu? We don't know.
Ot the spike is because somehow a bunnch of births were recorded with the same date repeated? Maybe the 1st of January of 1803?
# to dates type dates1803 = [datetime.datetime.strptime(strDate, '%d/%m/%Y').date() for strDate in deaths1803] dayOfMonth = [d.day for d in dates1803 if d.month == 1] # only for january print(dayOfMonth) # because who needs np.histogram()? distDay =  * 32 for d in dayOfMonth: distDay[d] += 1 fig = plt.figure() plt.bar(range(len(distDay)), distDay, alpha=0.8, linewidth=0, color='darkorange') plt.ylabel( '# Funerals January 1803, per day' ) remove2axis(plt.gca())
[3, 4, 4, 4, 5, 5, 5, 8, 8, 10, 10, 11, 11, 13, 14, 14, 15, 15, 16, 16, 17, 19, 19, 23, 25, 25, 25, 25, 27, 28, 29, 31, 31]
No, the deaths are spread among the days of the month. January 1803 was simply a tough month.
While the data contains numerous inaccuracies and typographical errors, we can already see certain trends.
The first interesting result that arises from this study is that summers, specially August, showed a higher children mortality rate than the rest of the year. This trend is seen in other demographic studies of the region where it is attributed to the hotter conditions before autunm rains. In this study we showed that the reasons for such increase in mortality were probably not related to violent deaths, as male mortality (associated with warfare and labour accidents) remained under 50% during this period.
Furthermore, the months of Decembre and January showed a second spike in mortality, more pronounced among children andmales. This is also consistent with other studies of the region.
A second result of interest is that January of 1803 was a month with very high mortality among children, albeit we don't know the reason for that.
In addition, we can see an increase in number of births between 1750 and 1780. However we do not see a decrease in child mortality during the same period that would suggest an increase in standards of living during this period.
The data is available to be download as CSV files at:
Copyright 2017 Adria Perez-Rovira (MIT License)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.